In this video we are going to look at one specific application of text processing: the spelling correction task. Spelling correction asks, when someone types a misspelled word, how do you find the correct spelling for that word and replace what they typed? A common way to check for misspelled words and correct them is to find valid dictionary words that share a similar spelling. To do that, we have two requirements.

The first requirement is a dictionary of valid words, so we know what a misspelled word can be replaced with. Here, NLTK comes to our rescue again. There is a corpus called words, a list of English-language words, that is already there in nltk.corpus. So if you import this words package from nltk.corpus, you get a dictionary of English words.

The second requirement is some way to measure similarity between words. How do you measure spelling similarity? This is where it becomes a little more complicated. One option is the edit distance between two strings: you treat the two words as strings and count the number of changes needed to go from string A to string B. In this case, string A would be the misspelled word and string B would be a candidate replacement. The fewer the changes, the closer the two words are. One specific algorithm for this is the Levenshtein distance, which allows three operations: you can insert, delete, or substitute a character.

For example, suppose you want to go from the word water to the word cooler. I'm not saying that water is a misspelled word and cooler is the right spelling; edit distance works for any two strings. One way would be to lay the two words on top of one another and see how many changes we need to make. Aligned that way, we could substitute w with c, the next a with o, the next t with o, and so on, giving five substitutions; then, because cooler is one character longer, we insert the final r, for a total of 6 edits: five substitutions and one insertion.

But this is not the best way to convert water to cooler, and one of the advantages of the Levenshtein algorithm is that it finds the combination of operations that makes the edit distance as small as possible. If we realize that the suffix "er" is common to both words, we don't have to change it: we keep the "er" at the end the same, and only substitute w with c, a with o, t with o, and then insert l. Now we only need four operations and everything aligns, so the Levenshtein edit distance from water to cooler is 4. There are many other ways to compute edit distance, but this is the one we will go with.
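As a rough sketch of these two requirements in code, assuming NLTK is installed and the words corpus has been downloaded, something like the following could be used; nltk's edit_distance function implements the Levenshtein distance we just walked through, and the variable names here are only illustrative:

```python
import nltk
from nltk.corpus import words
from nltk.metrics.distance import edit_distance

# nltk.download('words')  # fetch the dictionary corpus once, if not already present

# Requirement 1: a dictionary of valid English words
correct_spellings = words.words()   # a plain Python list of valid words

# Requirement 2: a way to measure spelling similarity.
# edit_distance computes the Levenshtein distance: the minimum number of
# insertions, deletions, and substitutions needed to turn one string into another.
print(edit_distance('water', 'cooler'))   # 4, matching the alignment above
```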
Another way to measure spelling similarity is to use n-grams. What are n-grams? N-grams, in this context, are character sequences of size n in a word: you fix the length n and find all character sequences of that length in the word. N-grams are also used in the context of word sequences; when you're talking about multiple words that occur one after the other, that is an n-gram at the word level. For example, take the sentence "this is a girl". You can form bigrams, that is, two-grams: "this is" is one bigram, "is a" is another, and "a girl" is a third. In the context of characters, we can do the same thing.

How can we use these n-grams for spell checking? By realizing that when two words have very similar spellings, they share many n-grams. Here is an example. For a word like pierce, you have five character bigrams, that is, two-character substrings: pi, ie, er, rc, and ce. For a misspelled word where somebody was trying to type pierce and mistyped it as P-I-E-R-S-E, the same procedure gives the bigrams pi, ie, er, rs, and se. Looking at these two sets of bigrams, we can compute what is called the Jaccard similarity. The Jaccard coefficient measures the similarity of sets: it is the size of the intersection of two sets A and B divided by the size of their union. In this example, we have five bigrams from pierce and five from pierse, and three of them are common. So the intersection of the two sets has size 3, the union has size 7, and the Jaccard similarity between these two words is 3/7. We can now use this as a measure to find the valid word that has the smallest distance, or the largest similarity, to the misspelled word and use it as the suggested spelling. These are the two things you are going to explore in the assignment this week.

To summarize: NLTK and simple text processing can be used to build a very simple, naive spelling checker, as in the sketches below. N-grams can also be used at the word level to capture phrases and find near-duplicate passages: if you look at sequences of words rather than characters, the same idea of bigrams and trigrams lets you find almost-identical passages. In fact, n-grams are fundamental to many NLP pre-processing tasks, like identifying suffixes and prefixes, doing suffix matching, and building the kind of character-level embeddings used in advanced machine learning and deep learning models. So this is a very interesting concept to learn about and understand.
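Here is a small sketch of the bigram and Jaccard computation for pierce versus the mistyped pierse, again assuming NLTK is available; nltk.util.ngrams yields character n-grams when you pass it a string, and NLTK's jaccard_distance returns one minus the similarity:

```python
from nltk.util import ngrams
from nltk.metrics.distance import jaccard_distance

def char_bigrams(word):
    """Set of character bigrams (2-grams) in a word."""
    return set(ngrams(word, 2))

a = char_bigrams('pierce')   # {('p','i'), ('i','e'), ('e','r'), ('r','c'), ('c','e')}
b = char_bigrams('pierse')   # {('p','i'), ('i','e'), ('e','r'), ('r','s'), ('s','e')}

# Jaccard similarity = |A intersection B| / |A union B| = 3 / 7
print(len(a & b) / len(a | b))       # 0.428..., i.e. 3/7
print(1 - jaccard_distance(a, b))    # same value via NLTK's jaccard_distance
```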
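And here is one possible way to put the pieces together into the kind of naive spelling checker mentioned in the summary. The function name suggest_spelling and the shortcut of only comparing against dictionary words that start with the same letter are my own assumptions to keep the sketch fast, not a prescribed interface:

```python
import nltk
from nltk.corpus import words
from nltk.metrics.distance import edit_distance

# nltk.download('words')  # once, if the corpus is not already downloaded
correct_spellings = words.words()

def suggest_spelling(misspelled):
    # Assumption: only scan dictionary words sharing the first letter,
    # purely to keep the pass over the corpus quick.
    candidates = (w for w in correct_spellings if w.startswith(misspelled[0]))
    # Return the valid word with the smallest Levenshtein distance.
    return min(candidates, key=lambda w: edit_distance(misspelled, w))

print(suggest_spelling('pierse'))  # likely 'pierce', or another word at distance 1
```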