[MUSIC] Hello everyone. Welcome to Big Data and Language. Today, let's learn how to use the BNC, which stands for British National Corpus. So, are you ready? Let's get started. First of all, let me give you the link to the BNC. You can go to Google and type "BNC corpus", or you can go directly to corpus.byu.edu/bnc/, okay? Once you find the BNC website, I recommend that you create an account, so register as a student.
Now let's talk about the features of the British National Corpus one by one. First, the BNC is a monolingual corpus. You are now familiar with the term corpus, right? A corpus is a set of text data. So what does "mono" mean? "Mono" means one, and "lingual" means language, so monolingual means one language. The British National Corpus consists of English only. Okay? Second, the BNC is a general corpus, which means a corpus that does not belong to a single text type, subject field, or register. So when you go to the BNC, you will notice that there are many sections, such as spoken, newspaper, academic, fiction, and so on. Okay? Third, the BNC is a synchronic corpus, which means it represents a single point in time, disregarding historical developments. So you will notice that the BNC covers only a certain period of time. Okay? Fourth, the BNC is a sample corpus: it is made up of samples of no longer than 45,000 words.
Okay, let's look at the domains of the BNC. About 20 to 30% are imaginative texts, and about 70 to 80% are informative texts. The informative texts cover different disciplines: natural and pure science 5%, applied science 5%, social and community 15%, world and current affairs 15%, commerce and finance 10%, arts 10%, belief and thought 5%, and leisure 10%. So we can see that there are many disciplines, with different percentages. As for time, the texts were collected from a certain period, between 1975 and 1994. The imaginative texts in particular, however, were written between 1960 and 1975. Okay?
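To make the domain breakdown concrete, here is a small Python sketch that adds up the percentages just mentioned. This is only an illustration: the names `INFORMATIVE_DOMAINS`, `informative_share`, and `imaginative_share` are made up for this example and are not part of any BNC tool, and the numbers are simply the ones quoted in this lecture.

```python
# Informative-text domains and their shares (in percent), as quoted in the lecture.
# All names here are illustrative assumptions, not part of any BNC interface.
INFORMATIVE_DOMAINS = {
    "natural and pure science": 5,
    "applied science": 5,
    "social and community": 15,
    "world and current affairs": 15,
    "commerce and finance": 10,
    "arts": 10,
    "belief and thought": 5,
    "leisure": 10,
}


def informative_share(domains):
    """Total percentage covered by the informative domains."""
    return sum(domains.values())


def imaginative_share(domains):
    """Percentage left over for imaginative texts, assuming the two add up to 100."""
    return 100 - informative_share(domains)


if __name__ == "__main__":
    # List the domains from largest to smallest share.
    for name, pct in sorted(INFORMATIVE_DOMAINS.items(), key=lambda kv: -kv[1]):
        print(f"{name:<28}{pct:>3}%")
    print(f"informative total: {informative_share(INFORMATIVE_DOMAINS)}%")
    print(f"imaginative (remainder): {imaginative_share(INFORMATIVE_DOMAINS)}%")
```

The listed domains add up to 75%, which leaves 25% for imaginative texts. That fits the rough ranges given above of 70 to 80% informative and 20 to 30% imaginative.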
As for the medium, books make up about 60%, periodicals such as newspapers and magazines about 30%, and other miscellaneous texts 13%. Okay? The texts are also divided into three levels: high, middle, and low. The classification features also depend on author, target, place of publication, and sample. Author covers authorship, gender, age group, and domicile. Target covers audience age, audience gender, and audience level. Place of publication records whether the text was published in the UK, the US, or anywhere else. Because this is the British National Corpus, the texts were of course published in the UK, but in which part? And sample records whether it is the whole text, or a beginning sample, middle sample, or end sample. So if, for example, a novel is really long, right, the sample may be taken from the beginning of the novel, and that information is recorded. Okay?
Now let's talk about the spoken corpus design. In the BNC, we have written texts and also spoken texts. The spoken texts are based on demographic sampling. The sampling is definitely within the United Kingdom, because this is the BNC, the British National Corpus, so the sample data represent speech produced in the UK. For age, there are six groups. The groups have equal age width, except 15 to 24, and the word-units are equally distributed, except for 0 to 14. I will give you the table, so please take a look. For gender, there is an equal number of spoken texts from males and females. However, the word-unit value for females is higher than that for males, which means that the spoken data from females is longer than that from males. Again, I will give you more specific data in the table. Let's look at the regions. The researchers divided the UK into three regions, and the North region accounts for the largest number of spoken texts.
And the word-unit value of the South region is the highest, which means that the spoken data from the South region is longer than that from the other regions. Social groups were also considered. The researchers divided the social classes based on the NRS social grade. Most spoken texts are from the AB class, and the word-unit values follow the same tendency as the number of texts. So, AB means much richer than any other group. Okay?
Now let's talk about context-governed sampling. 60% is based on dialogue and about 40% on monologue. Monologue means things like lecturing or news reading, where just one person speaks, right? Like what I'm doing now, giving a lecture. For classification features, you can see date, place, time, setting, size of the audience, and also the spontaneity of the speech.
Okay, so these are all the BNC features. From now on, you will have a chance to use the BNC tools, so please keep watching the videos. Also, please complete the task that I will give you. I will give you all the explanations and also examples that my students have done, so please feel free to take a look. Good luck with your task.