Hello everyone, and welcome to Big Data and Language. Today let's talk about corpus linguistics. You may have heard about corpus linguistics before. It is the study of language based on data, specifically text data. You might be interested in corpus linguistics, or you may need to learn about it, so if you are ready, let's get started.
Okay, so what is a corpus? The word corpus comes from Latin, where it means "body", and the plural is corpora. If you have one group of texts, we call it a corpus, but if you have more than one group, such as two groups of text data, then you say, "I have two corpora." A corpus is a body of naturally occurring language, but it is rarely a random collection of texts, because depending on your research question you may want to target different text data. Corpora "are generally assembled with particular purposes in mind, and are often assembled to be (informally speaking) representative of some language or text type." That's what Leech said in 1992. So, what is a corpus? A corpus is a collection of machine-readable, authentic texts, sampled to be representative of a particular language or language variety.
What is a corpus for? A corpus is made for the study of language in a broad sense: first, to test existing linguistic theories and hypotheses; second, to generate and verify new linguistic hypotheses; and third, beyond linguistics, to provide textual evidence in text-based humanities and social science subjects. So the purpose is reflected in a well-designed corpus.
In the previous lectures, we talked about intuitions, and about how intuitions and data are connected and help each other. Let's talk about corpora versus intuitions. They are not necessarily antagonistic, but rather corroborate each other. So the key to using corpus data is to find the balance between the use of corpus data and the use of one's intuitions.
In the same way, we need to use the corpus and our intuitions in balance.
Let's talk about corpus methodology. It is debatable whether corpus linguistics is a methodology or a branch of linguistics. One view is that corpus linguistics goes well beyond this methodological role and has become an independent discipline. The other view is that, in spite of its name, corpus linguistics is indeed a methodology rather than an independent branch of linguistics in the same sense as phonetics, syntax, semantics, or pragmatics. So it's very debatable, but corpus linguistics is becoming a more independent field.
The areas that have used corpora include contrastive analysis, grammatical studies, lexicography, lexical studies, language variation, translation studies, language change, language teaching, register and genre analysis, computational linguistics, discourse analysis, forensic linguistics, pragmatics, literary studies, semantics, sociolinguistics, and stylistics. So corpora can be used in various areas, but corpus linguistics can also be an independent area. If you take a look at, for example, the AAAL conference, you will see that corpus linguistics is a separate category. AAAL stands for the American Association for Applied Linguistics. In the corpus linguistics sections of AAAL, you will see researchers doing research in corpus linguistics based on data, that is, text data.
According to Biber et al. (1998), the corpus-based approach is empirical, analyzing the actual patterns of use in natural texts. It utilizes a large and principled collection of natural texts as the basis for analysis. It makes extensive use of computers for analysis, using both automatic and interactive techniques. And it integrates both quantitative and qualitative analytical techniques. So these are the characteristics of the corpus-based approach.
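To make the idea of combining quantitative and qualitative techniques a bit more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the three sentences are invented for the example and are not real corpus data, and the function names (tokenize, word_frequencies, concordance) are hypothetical helpers, not part of any corpus tool. The frequency count is the quantitative side, and the keyword-in-context lines are what a researcher would then read qualitatively.

```python
import re
from collections import Counter

# Toy "corpus": invented sentences for illustration only, not real corpus data.
TEXTS = [
    "We need to talk about the results.",
    "She gave a talk on corpus design.",
    "They talk to each other every day.",
]


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(texts):
    """Quantitative step: count how often each token occurs in all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


def concordance(texts, keyword, window=3):
    """Qualitative step: list each hit of keyword with up to `window`
    tokens of context on each side, so a person can read the examples."""
    lines = []
    for text in texts:
        tokens = tokenize(text)
        for i, tok in enumerate(tokens):
            if tok == keyword:
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                lines.append(f"{left} [{tok}] {right}".strip())
    return lines


if __name__ == "__main__":
    freqs = word_frequencies(TEXTS)
    print("Frequency of 'talk':", freqs["talk"])
    for line in concordance(TEXTS, "talk"):
        print(line)
```

Note that the count alone does not say whether "talk" is a noun or a verb in each hit. Reading the context lines, or adding part-of-speech tags with a tagger, is what answers that question.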
Because corpus data is machine-readable, the easiest way to analyze it is with a computer. So why do we use computers? First, the development of computer technology has revived corpus linguistics, and machine readability is the de facto attribute of modern corpora. Electronic corpora have advantages unavailable to their "shoebox" ancestors. For example, if students have written their papers by hand, it is hard to analyze them quickly. However, if the students type their writing on a computer, you have machine-readable data, and you can use tools such as TagAnt or AntConc to analyze particular linguistic features easily. Computerized corpora can be processed and manipulated rapidly at minimal cost. Searching, selecting, sorting, and formatting are much easier than without a computer. Computers can process machine-readable data accurately and consistently. Computers can also avoid human bias in an analysis, making the results more reliable. I'm going to explain how you can use other tools on the computer later, in weeks six and seven. If you are already familiar with tools such as AntConc or TagAnt, you know that once the text data is stored, we can keep it almost permanently. We already talked about that in the previous lectures. Machine readability allows further automatic processing to be performed on the corpus, so that the corpus texts can be enriched with various metadata and linguistic analyses. That's why corpora and corpus linguistics are related to big data and language.
But a corpus still cannot do everything. Corpora do not provide negative evidence. A corpus cannot tell us what is possible or not possible in a language, because it shows only what actually occurs. It can show what is central and typical in English. Also, corpora can yield findings but rarely provide explanations for what is observed.
So you need to use your intuitions and your knowledge to explain the observed findings. The use of corpora as a methodology also defines the boundaries of any given study. As I mentioned before, it is very important to define a very clear research question before you analyze your corpus or corpora. Lastly, findings based on a particular corpus only tell us what is true in that corpus, so it is hard to generalize. The findings hold within that corpus, but we cannot assume that we will get exactly the same findings in another corpus. So you need to ask the corpora the right questions, and that's why designing and clarifying your research question is very important. I'm going to explain how to formulate a clear research question in later lectures, when you design your final project.
Okay, so now let's test your intuitions with the BNC. I will explain more about the BNC later, but for now let's go to corpus.byu.edu/bnc/. Take, for example, the word talk, which can be used as a noun or as a verb. In which register is the noun talk used more? The BNC shows that the noun "talk" is used a lot in newspapers and in other registers. However, the verb "talk" is used a lot in spoken data compared to other registers such as fiction, newspapers, and academic texts. We can also check whether "data" is used as a singular or a plural. Based on the BNC, we see that the singular is used in academic texts 21 times. However, the plural is used more, 135 times. In academic texts, we can say that is 42 times five per million words.
As for where to find what, I have listed the corpora, so depending on your research question, you may want to use an existing corpus.
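A note on the "per million words" figures in the BNC example: registers differ in size, so raw counts are usually normalized before they are compared. Here is a minimal sketch in Python of that arithmetic. Everything in it is an assumption made for illustration: the register sizes and hit counts are invented and are not the real BNC figures, and per_million is a hypothetical helper name.

```python
def per_million(raw_count, corpus_size):
    """Normalize a raw hit count to a frequency per million words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be a positive number of words")
    return raw_count * 1_000_000 / corpus_size


# Hypothetical registers: (raw hits for some word, total words in register).
# These numbers are invented for the example, not taken from the BNC.
REGISTERS = {
    "spoken": (500, 10_000_000),
    "academic": (600, 15_000_000),
}

if __name__ == "__main__":
    for name, (hits, size) in REGISTERS.items():
        rate = per_million(hits, size)
        print(f"{name}: {hits} hits, {rate:.1f} per million words")
```

In this made-up case the academic register has more raw hits, but the spoken register has the higher rate per million words, which is why the normalized figure is the one to compare across registers.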
So today I've talked about corpus linguistics, and next I will talk about register. Thank you for your attention.