So the algorithm never going to look at for solving the sub string search problem is called the Knuth-Morris-Pratt Algorithm. It's got an interesting history that we'll get to at the end. But I just want to start by saying this is one of the coolest algorithms that we'll cover in this course. And it's not an algorithm that anyone would come up with without a lot of hard work, but understanding this algorithm really gives somebody an appreciation for what's possible with careful algorithmic thinking, even for such a simple problem as this. It's a quite ingenious method. So here's the intuition behind the algorithm. Say we're looking for the pattern B followed by a bunch of As in text. So we're going along and so, well, we found a mismatch right away. So we'll move over first position and [COUGH] we'll go BAAAA and we try to mismatch at the end in this case. So now what the brute force method is going to do is back up to move over one position at B doesn't match the A. Back up again at B doesn't match the A. And keep going, matching the first B against all these As. [COUGH] Before getting to the next character in the text. So we've matched five characters and we mismatched on the sixth. But the key point is that when we got that mismatch, if all we have is As and Bs, we already matched BAAA four As and we got a mismatch so it's a B. So at that point, we know what the previous characters in the text are, and we know that if we move over one position we're going to be trying to match B against A. So we should be able to take advantage of that knowledge that's the intuition of the method of the Knuth-Morris-Pratt. We going to need to back up the text pointer in this case. When we get the mismatch, we can just keep going. [COUGH] We don't even have to match the first B, we know it's there. So we can just keep going in the text pointer and start on the second pattern pointer. The Knuth-Morris-Pratt algorithm is a very clever method that manages to always avoid backup, no matter what the pattern is. So let's take a look at what's involved. [COUGH] It's based on the idea of building a deterministic finite state machine for string searching. Now, a deterministic finite state machine, if you're not familiar or don't remember, is a very simple thing. It's got a finite number of states. It's always in one of the states. There's a start state and a halt state. From every state, there's exactly one transition for each possible character in the alphabet. So if we're in a state and we're reading a particular character, we [COUGH] move to the state that is indicated by that character. So if we're in state 2 and we're reading an A, we move to state 3. That's what this line in the table says. If we're in state 2 and we read A, that happens to be the pattern character is A, if we see an A we move to state 3. If we see a B or a C, we go back to state 0. That's what this machine says. It's a tabular representation and a graphical representation. If we get to stage 6, then we just say accept, we found the pattern. And you can see how that is, the pattern, if we match a pattern character we move right through this machine until we get to the halt state. So, let's first take a look at exactly how the machine works to make sure everybody follows that. And then we'll look at how do we rebuild this machine. So, that's our machine and that's our texts off at the top. So we start at state 0. We're reading an A. What are we supposed to do if we're in stage 0 and we read an A? This entry of the table says we're supposed to go in state 1. So we go to state 1. At every step we consume a text character, just read a text character. Now we're in state 1 and reading an A. And in the graphical representation it says stay in state 1 or in this representation it says if you're in state 1 and you see an A stay in state 1. So that's what we'll do, we'll read the A and stay in state 1. Now we're in state 1 and so you can say that in state 1, it's reading As and we're looking for something that's not A. No matter how many As you have, you'll read them right here in state 1. So now we're in state 1 and reading a B and it says go to state 2 in that case so, that's where we'll go and read the B. Now we're in state 2 and reading an A, so we go to state 3 and read the A. Now we're in state 2 and we see a C. In state 3 when we see a C, we're supposed to go back to state 0. That's a mismatch, our pattern C is not until the end. So now we have to start the search all over again. Although in the finite state machine doesn't know that. Its just plotting along according to the state transition suppose to do. So we read the C now we're back in state 0. So now we're in state 0 looking at an A. And we move to one, and read the A. And then we see another A, so we stay and want to read the A. Now we see a B, and we go to state 2 and read the B, then we go to state 3 and read the A, state 3 and read the A, state 4 and read the A, state 5 and read the C, and now we're in state 6. That means we found the pattern. So that's the simulation of what the deterministic finite state of power ruckus corresponding to this pattern while we dig. Now, if we'll see the code for implementing this simulation is quite simple, that's a demo of DFA simulation. So let's just take a quick look at what the interpretation of this DFA is. So what does it mean after we just read the ith character of the text? Well the number of the state is the number of characters in a pattern that we've matched up to the current point. So that is at state three, we've matched three characters in the pattern. And what's happened is that we've matched a prefix of a pattern, the first three characters of a pattern. And actually, it's the longest prefix of the pattern. That's also a suffix of the first i plus 1 characters of the text from 0 to i. That's the interpretation of what it is. So the DFA is in state 3 after the first six characters of the text. It's got ABA is the longest prefix of a pattern that's also a suffix of the first six characters of the text. So another prefix to the pattern that looks like it might work is ABABA. But that's not a suffix of the text. Ending at character 6. That interpretation is useful to understand what's going on. Given that we've constructed this dfa, it's just the transition table for the dfa. It's simply a two dimensional array that's indexed by the pattern character. So, if you're in a certain, if you're reading a certain pattern character. Sorry, if you're reading a certain text character, then for each [COUGH]. We reset the pattern pointer to be the entry given in the table that corresponds to the current one. So let's go back to the table to be sure about that. So when we're in a particular state like 2, and we see an A, so this would be j. We refer to that column and whatever text character we happen to see, that's how we update j. So that's what this code does, and then if we ever get to the transition state in the final state, we just return i- M. Now this is just like that alternative implementation of the brute force method that we gave, but the key idea is that, notice that i never gets decremented. All we're doing is incrementing i and changing j, either always just referring to the table. When we have a match, the table automatically increments j. We have a mismatch, it figures out the right state to go to. So, that's a very simple and economical and fast implementation of subscreen search. So, the running time is clearly, all it does is look up an entry in the table for every text character, so that's linear time. Now the key is how to rebuild that table that drives the algorithm and that's the tricky algorithm. So a warning, this is definitely the trickiest algorithm we've looked at so far. And the importance, again, of no backup is that, since the textpointer never decrements, you could use an input stream. And just replace text divided by, just read the next character in the input stream. And this algorithm is going to work fine, no matter how big the input stream is. It'll just go right through, it's got no memory of what the text is, but it's got some memory of what the pattern is that's built into the DFA. So this is the key that you have a pattern. You can spend some time building this DFA table if you're in preprocessing. But then when you get the text, just index it into this table for every text character, and you're doing your substream search. So you can build a machine that does this. Okay, so let's take a look at what it means to construct this DFA. So it depends on the pattern, you start out with one character, one state for each character in the pattern, plus an extra accept state. Let's look at what it means to build the thing. So the first thing that we do, one state for each character in the pattern, the first thing that we do is deal with the match transitions. So that's when, you're in state j, that means you already matched j characters in the pattern. Then if the next character matches, so the character that you have is the character that's supposed to match. So that's path of j that's the j plus first character, then you've know that you've matched j + 1 characters. So all that means is, we can put in the match transitions. So we have A B A B A C and these guys go A B A B A C. If you're in state 0 and you see an A, you go to state 1. You see 1 you see B, that's state 2, and so forth. That's how we get the match transitions to get us all the way through the pattern to the accept state, so that's the easy part. And then the hard part is the mismatch transitions, what are you supposed to do if you come against a text character that does not match the current text character? So for example, if you're in state 0, the pattern starts with an A. If you don't see an A as the first character in the text, if you see a B or a C, then obviously you want to stay in state 0. So you can think of state 0 as scanning through the text looking for an A. It's going to stay in state 0 as long as it sees a B or C. As soon as it sees an A, it'll go to state 1, but that completes state 0. You know what to do if you have an A, you know what to do if you have a B or C. What about state 1? So we know that if you're in state 1, you saw an A in the text. And if you see a B in the text, what are you supposed to do? Well, there's two different cases, if you see a B, you go into state 2. If you see a C, then that's not going to match the first character A in the pattern, so you go back to state 0. But if you see an A, it's just as if you saw an A at state 0, so you might as well stay in state 1. So, that's how we fill in the BFA for state 1. if you see a B, you go on to state 2, you matched. If you see an A, well, that matches the first character, so stay in state 1. And if you see a C, clearly you have to go back to state 0. Okay, what about the next one? So if we're in state 2 and we see an A, we know that we go on to state 3. And what about the mismatch case? Well, in that case, it's a B or a C and again, if you're sitting on a B or a C, you'd better go back to state 0 to keep looking for an A. State 3, well, now it gets to be a little more complicated. State 3, see a B, you succeeded, if you see a C, you go back to state 0. It's not so bad, if you see an A, it's just like before, that's going to be like the first one, so it seems like we're going along pretty fine. And again if you're in state 4 and see an A, you go to 5, see a B or C then you better go back and look for an A again. This one is the one that's a little more complicated. If you're in state 5 and you see a B, you go back to state 4. And that's a little more work to figure out why that's the case. And that last case is kind of the essence of the algorithm. So we'll look at a systematic way To be able to figure out what you do on a mismatch in each case. In this case, you only needed that for that one stat there, otherwise it was elementary reasoning. So that's the full DFA for Knuth-Morris-Pratt demo of, at least thinking about how it's going to be constructed. Okay, let's look a little more carefully and systematically at the construction process for the Knuth-Morris-Pratt DFA. So the start is clear, we're going to go through the pattern, and for systematically filling the match transitions. If we're in state 0, and we see an a we want to go to state 1. If we're on stat 2 and we see a B, we're going to want to go state 2. So we look up the pattern character and then whatever that one is, we want to go to the next state. So we can fill in at least that much automatically. Now, the real key is the mismatch transitions. So here's the idea of the mismatch transition. So if you're in state j, and you get a mismatch the next character in the text does not match the j-th character in the pattern. So as pointed out at the beginning, as motivation for Knuth-Morris-Pratt, you know a lot about the text at that point. You know that the j 3 minus 1 characters of the text are, you lop off the first character of the pattern, it's from 1 to j minus 1. And then it's followed by the text character that you're looking at. So what you want to do to know that, and what you could do, is simulate the DFA that you have built on that part of the pattern and then take the transition for the character that you just find. So let's look at this. So let's run this machine, if we had seen it on the text, the BA BA, that's what we want to do, we want to put the machine in the state as if we had backed up, but we don't want to backup. So if we see a B, we stay in 0, if we see an A, we go to 1, then if we see a B we go to 2, and if we see an A, we go to 3. So [COUGH] now we're in state 3, and [COUGH] if we had a mismatch, then for the fifth character, what we'd do on a mismatch here, is we have to look at what happens if we get an A or if we get a B. If we run it on BA BA and we got an A, then we should go back to one. So what that says is if we had a mismatch, and we saw an A in 5 we would need to be in state 1. Because if we had had run the thing on the characters that we know, BA BA A, we would wind up in state 1. And similarly, [COUGH] if we were [COUGH] if we got the mismatch on a B, again, if we did BA BA would be in state 3, and we're in state 3 and we saw B, we'd go to 4. So and again to summarize, if we we're in state 5 and we see a C, we know that's a match, we go to 6. If we see an A, we know that the previous five characters in the text were BA BA A, so we can just simulate the machine BA BA A, we're in state 1. And if we got a mismatch that's a B, then that's dfa['B'][5], then we know that the previous five characters in the text were BA BA B, and that would put us in state 4. So that's the simulation of BA BA puts us in state 3, if we get an A we go to 1, so that means that from 5 we're back to 1, and if we get a B, we go to 4. So that's how, one way to calculate the mismatch transitions. And, as we noted in the simulation, this is the only non-trivial one for this example. Now there's a little problem with that, is that it seems to require j steps to do the simulation. In order to figure out these mismatched transitions, I had to go all the way through the pattern shifted over one, to figure out this state 3. So that seems to be a bit of a problem. But actually it's no problem at all, because we can run this simulation one character at a time, as we're building the machine. All we need to do is keep track of this state that we would be at, if we had run the DFA on the pattern starting at position one. Once we get going, it's pretty easy. But let's illustrate it by saying okay, we maintained the state x which is where we would be if we had run the machine and the pattern shifted over one. Now, when we come to do our mismatches to figure out where the mismatch transitions from five are, all we do is look at, if we were to get an A, would be as if we're in state x and we got an A, so that's 1. And if we were to get a B, it would be if as we were in state x and got a B and that's state 4. So what we need to do is to compute the mismatched transitions, is keep track of state x. And that is where would the thing be, if we had run it starting at the pattern one position shifted over. And we want to update that, so when we're moving to the next state, if we had a match on C, state x gets updated to where it would have gone if it got a C. Because for the next character, when we move j over for the match on C, we'll have x updated. So that's the key, is keeping track of the state where the machine would be if we had backed up, or if we had run it on the pattern shifted over one. So let's take a look at a demo that does the full construction for the k and p DFA. So here it goes. Again, one state for every character plus an accept. Match transitions are easy. We build those and we're going to start at position 0. And the mismatched transitions are easy. So now when we move over, and we're in position 1 of the pattern, x is where we would be, if we started out without that character, which would be the empty string. So we start out with just filling in 0s. The first state 0 and we can do that without any further reasoning, but now we've got x initialized. So now what we need to do is fill in the mismatched transitions for state 1. So what are they? They're what would happen if we found those characters. And we're in state x. If we found an A, we'd go to state 1, if we found a C, we'd go to state 0. And maybe you noticed, that's just taking the entries corresponding to the entries we need from xs column and putting it in js column. That's all it is. And then we need to update x which is where it would be if we matched the B. So we'll stay in, x will stay in state zero. So now let's look at state 2. So we need to fill in what would happen if we got a B and what would happen if we got a C. Well, it's what would happen if we were in state x and we got a B or a C. So all we do is move those two zeroes over to column 2. And then don't forget we have to update x and x goes where the machine would go if we saw an A, that transition from two to three. So we just move x to state 1. So now we have x in state 1 and we're doing position 3. And now you can see how really simple the algorithm is once it gets going. So now the mismatch transitions are A and C, and that's what we have to fill in for column 3. But those mismatch transitions were already computed, that's where x would go if we happened to be state 1. So we move the 1 and 0 from column 1 over to column 3. And, again, we update x and when we see a B, x goes to state 2. [COUGH] And [COUGH] And again you can check what is x supposed to be. It's supposed to be where you would be if you started the machine on the pattern with the first letter cut off. So it's supposed to be where you'd wind if you got BAB. BAB, and that's a check. All right, so now it's straightforward. Straightforward we have to fill in B and C. We go to xs column and copy over B and C from xs column. Then we update x, to c and a, now we're ready for state five. We've got x all computed. And we need to do A and B and we get those from x. If it's an A, it's a 1, and if it's a B, it's a 4. Let's just moved them over. And then when we do the C and get the accept, there's no mismatch. That's where you do update x to get ready. But when we get the state 6 we're done. And again it's just as a double check axis where your B, if you see a BABAC. That's the construction of the Knuth-Morris-Pratt DFA. An efficient algorithm, really ingenious. And it turns out, simple to implement. Implementation for the DFA construction for Knuth-Morris-Pratt requires remarkably little code. So it just goes through what we did in the demo. So we've got x and for every entry of the DFA in xs column, that corresponds to everyone except for the match. We just copy xs column to js column backed to we just copy them all. And then overwrite the one corresponding to the match case, not to j + 1. That's all, and entering the column it corresponds to the pattern character that's where do we go from match that's j+ 1, all the rest of them are copied from x. One for loop to copy from x, then set the match case. And then the only other thing to do is to update x. And how do you update x? You take the machine to where it would go, if it weren't state x, and cut the charAt, X charAt, the pat.charAt. So that's the DFA construction, amazingly little code. So the only sort of flaw in this code is that it does take time proportional to r times m, where R is the radix and M is the length of the pattern. And as we'll see, there's ways to get around this. But for relatively small alphabets, this is no problem, because the search is so efficient this way. And [COUGH] construction on it as well. But if you're doing this for unicode for a long pattern maybe that's too much memory to devote to the DFA representation. So the bottom line is, and this follows immediately from examining the code is that the KMP substring search is linear time. You only access at most, each character in the pattern, each character in the text once. Quite remarkable. Each pattern character you access once when you're making the DFA. And each text character is accessed once when you're simulating a DFA, and that's in the worst case. Now the space, again, is proportional to RM because we have all those mismatches. It is possible to develop a version of Knuth–Morris–Pratt that constructs the automaton in time and space proportional to M. It's actually a non determinant of automaton because it's either a match were all mismatches, and a mismatch might involve multiple hops. So if you're interested, you can read about that version of KMP. But it's sufficiently more complicated that you should be prepared to study it carefully to really understand it. This algorithm is really interesting because of its history. It was actually independently discovered by two theoreticians and a hacker. So there's a Knuth whose one of the fathers of computer science, who read a paper that was a very asset by Steve Cook. A very esoteric theoretical result that he realized implied, that it should be possible to solve the substring search problem in linear times. The theorem really didn't give any way to solve it. But it indicated that is should be possible to solve it in linear time. So Knuth worked on it and figured out a way. And then Pratt, who was a student of Knuth at Stanford at the time, figured out a way to take care of this mismatch independent of the alphabet size. And in the meantime, across the bay at Berkeley, Jim Morris was busy writing a text editor. In those days, people were using typewriters and other people were realizing that computers would be really good at editing text. And so many people would work on text editing and formatting systems. It was kind of a badge of honor in those times. Morris actually worked for the computer center at Berkeley, and I wanted to have a really bulletproof text editor that everyone could use. And one of the things he wanted to do was avoid backup, because backup was just really inconvenient. Involved a lot of code, and just something that he didn't want to have. And he basically came up with the same algorithm. And the community was kind of small at that time, and theory needs practice, and that's where this paper came from, 1977. Morris then went on to get his PhD, and to work at Xerox Park, and unfortunately later on, another systems programmer came and took a look at Morris's code and couldn't understand it and put the brute force out algorithm back in, but that part of the sad story is maybe a less successful example of theory meeting practice. That's the Knuth-Morris-Pratt algorithm, one of the most ingenious algorithms that we'll see.