Today we're going to talk about symbol tables. A classic data type with numerous applications, where classic data structure and various data structure allow us to achieve performance gains that are extremely important in the performance of our computational infrastructure. We're going to follow the same basic structure as the last lecture where we give the APIs and present a challenge of how to get performance, implementing those APIs and then look at a classic data structures that help us address the problem. Let's go back to Bob and Alice and the idea of using a whitelist filter with mergesort and binary search, to say maintain customer accounts or client accounts. And Alice says, "Well, it's kind of a pain sometimes really those companies doing really well". Bob maybe not so well, he hasn't noticed but then when Alice says, "Well we have to resort the whole list whenever we add new customers," and the other thing is, we want to process transactions and associate all sorts of information with our customers." The bottom line is that, that API we just merged certain binary search is not quite enough. We need a much more flexible API. And that's what symbol tables are all about. Think about it. So why are telephone books obsolete now? Well, they don't support some operations, so they don't support is change the number associated with the given name. It's already printed. But another one even more important is, what if you need to add a new name associated with a given number? You have to wait for next year's telephone book. Or what if you want to remove a name and associated number? Again with modern systems, we just do that in our digital devices and it's all fine. But the solution we presented with Mergesort and binary search in the sorting and search and lecture, has that same problem. We don't support the operation of adding a name or removing a name. And so with symbol tables, we get that flexibility. In symbol tables by the way, are the basis of the systems used in your digital devices to maintain phone numbers. The idea is called an associative array abstraction. It's really a very old idea. So, what you want to do is imagine using arrays but rather than forcing the indices to be between zero or any size minus one. Let's imagine that we can use string values. So, then a phone book would be something like this. We'd have an array called, phone numbers and then within the brackets we have a string instead of an end that's restricted to a certain range and then we can assign other strings to the array kind of in that way. Actually or so analysis case maybe there's transactions. There is a string that includes a date and then an amount and vendor and so forth. It's actually the case in some programming languages, not Java that you can write code like this, and underlying this code is the kind of implementation that we're going to talk about today, Symbol table. Here's another example that is common in the Web today. You want to associate a string with a URL and vice versa. So, it's a very fundamental abstraction. We talk about is using keys to access associated values and the keys and values could be any type of data but we have the simplest client code that you could imagine. Just put the key inside brackets and then assign values. That's the associative array abstraction. There's the vice versa where we associate IP addresses with string keys, very flexible abstraction. So the question is, how do we actually implement that as a data type in Java? And that's what we're going to consider today. So, let's lay out the abstract data type in the same way that we did for stacks and queues in the last lecture. So, a symbol table is an abstract data type whose values are sets of key-value pairs and we're going to assume that the keys are all different. That's not really a restrictive assumption as we'll see in a minute. So, the basic operations are, we want to associate a given key with a given value and that might involve two different things. If the key is not in the table, if there's no value associated that key anywhere in the table, then we'll add the key to the table. The key-value pair to the table. If the key is in the table, will just change its value. Just like you would do with with an array. Now, the other thing that we want to do is return the value associated with the given key that's using associative rate and expression or the right hand side of an assignment statement. We might want to test if a given key is in the table and then we might want to iterate through the keys in the table and get the key value pairs out. Those are the basic operations where we want to be able to perform. Now, sometimes it's useful to make additional assumptions. So, one thing that we'll use in this lecture and it's very common and useful in applications is that the keys are comparable. So that is, they are in a total order if we can compare one to the other we can get a result of less, equal or greater. And that's true of many common keys like the strings that we used in all examples are integers or doubles and so forth. And then in that case, when you iterate through the keys, you get sorting for free. The iteration comes in order and that's very useful. As with stacks and queues, we don't want to have within client code any limit on the number of key-value pairs. That's a fundamental restriction or assumption that it's going to make the click code much more useful just as with stacks and queues. And just to simplify the code and several implementations, we're going to assume that if a key is not in the table it by default associates with special value null. And again, this is just a statement in modern terms of a really old idea. So, used to have not just telephone books but dictionaries where the keys are word and the values are definition. And again in telephone book key might be the name and the value the phone number used to be and people still do buy paper documents where the keys is the time in the channel and the values are the TV show, and then encyclopedias where the key is a term, and the value is an article. Nowadays, you think of the key as what you put in a search and the value is what you get. So, that's a symbol table abstract data type. And one more point, before slide rules even or at the time of slide rules, it used to be that people evaluated functions by looking up a number in a book and getting the function value like the sine function or cosine function. Now, all of these have been made obsolete by symbol tables. Okay, so let's have a benchmark example that we're going to test our implementations against. And here's the one that we're going to use. So we have a sequence of strings on standard input. It could be a very long sequence of billions or longer. And what we want to do is count the frequency of occurrence of the strings in standard input, and there's lots of reasons to want to do that. So our keys are strings and our values are integers. So for each string we're going to keep an integer which is the number of times that we've seen that string. So here's just a simple example of where we'll take some keys. And so if we get it we saw it once and so we associate the value one with the key it and same with words and now we've got four things, and now we've got six things there. Now we get it and that's the second time we've seen it. So now we change the value associated with it to be two instead of one, and so forth. Now we just go through the strings that we get in standard input. I look up in the symbol table whether we've seen that before. If we have, we increment the frequency of occurrence we just saw it another time, if we haven't we put it in with a value of one. That's going to be our benchmark example when we look at client code that does that and then we'll see if we can support that client code in real applications. So either we change the value or we add no key with value one. Now, before doing the code, we have to again look at the parameterized API. We're going to use generics the same way as for stacks and queues. It's a little more complicated because we have two things that we want to use placeholder type names both keys and values. And again we'll use those within angle brackets and then substitute real types in client code. So, now inside the angle brackets we use the generic terms key and value, and there's an additional complication that we want for the implementations we're going to consider. We're going to assume that the keys are comparable. That means the data type that whatever type key is, implements the compared two methods. So that means that we can sort the symbol table by keys and we can do other things that depend on the keys being in order. And that allows us to support a much broader class of implementations and also guarantee a good performance as we'll see. Okay. So, our constructor creates a symbol table. Again, no reference to the size or the maximum number of the capacity or anything like that. The simple table should be able to hold any number of values and the client code shouldn't be restricted. So what operations do we do? Well the first is called put, where we associate key in value. And if there's no value associated with that key in the symbol table, we make a new entry with that key value pair. Otherwise, we change the value associated with that key to the new value. And then there's get which takes a key and returns a value, and that's the value associated with the key. If the key is not there we just return null. Contains. Is there a value associated with the key with our convention, that's very easy, we get it and check if it's null. And then there is the idea of iterating through the keys and the table and we'll talk about that in a minute. So that's the API. Let's talk about iteration just for a second. This is true for any collection, even a stack and a queue. And we can do it lots of ways but Java has language support for going through collections called the for each construct. So if you have a stack, you can just say for string s:stack, that's a special syntax for the for loop that goes through every item on the stack and prints it. The way that it goes through the stack is code in the implementation. But we're interested in it now because of how it simplifies client code. So it's useful for any kind of collection of data to be able to iterate through each item in the collection. The implementation determines which order... We like it because it has a substantially simpler client code and so that's what the implements iterable thing is about in the API. So actually, our implementation of push down stacks in the booking on the book side implements iterable. And what that means is that the client code can use for each construct for that data type. And we want to have a performance spec where it takes constant time per item like all the other operations for stacks. So the question is, what code do you need within the stack implementation to implement this iterable idea? Well, this is about on the border of the kind of code that would expect you to write, but for now the answer is we did it for stack and queue, so you don't have to do it. You can read in the text about how to implement an iterator that's associated with a collection type. And our implementation meets that performance spec of constant time per entry. And for now what we can do is make use of that one. But really what I want to emphasize now is in client code, if you're using collections like stack, queue or symbol tables, use iteration. It makes the client code much easier to understand. Let's take a moment to consider the question of why we use ordered keys. It seems to add complication. Well, one reason is it's very natural for many applications. The keys that client programs tend to use are things like numbers or strings or date and time. Even when you have color and length, there's a natural order in the keys. And given that natural order, that enables extensions to the API, we can provide operations that clients might expect. Like give me the keys in sorted order or even more important something like find the kth smallest key. There's plenty of applications. We're having operations like that in the API are important. So it makes sense to consider implementations that take advantage of order in the keys. In the present context, what we're interested in order for is that it enables us to develop implementations with guaranteed efficiency. Mergesort and Binary search are examples of that. And in this lecture, we're going to look at a data structure that takes advantage of ordering in the keys to guarantee efficient performance. All right. So let's look at a client example now. So what we want to do is take the lines on standard input the strings and put them out in sorted order with any duplicates removed. So the key type is going to be a string. We'll just use a line on standard input as the strings. In this case, we're going to ignore the values. We don't need the values. We're just sorting the keys. And this is a really simple symbol table client. We call the symbol table BST, in homage to the data structure that we're going to use as you'll see. So we build a new symbol table. It's keys are strings and its values are integers, and we're not going to use the value. While there's another line on standard input, all we do is put that line into the symbol table and give it a value of zero because we have to give it some value. Now we use, for each construct to go through, the keys and the symbol table and just print each one out. And that's a very simple client code to get this job done. So if we have A Tale of Two Cities text, first 10 lines or so, if we pipe that, take that as input to this program, we get those lines sorted. And sorting doesn't happen until the fourth word, where we have age and best and IPAC and so forth. So very simple client that gets sorting done. Don't need a specialized sort. And that's a warm up. Here's a more interesting example, which is going to be our benchmark, that's the Frequency Counter example. So now, we want to compute the frequencies of occurrence of the words on standard input. So our key type is a string again, and now, our value type is the integer which is the number of times we've seen that work. This is the client that I gave an example of at the beginning. So now, we build again a symbol table with string keys and integer values. Now, while standard in is not empty, we just read a string. We check if that one is in the symbol table with st.contains. If it is, then we put into the symbol table with key, with a new value, which is the old value plus one. So what that does is change the value associated with the key by incrementing it by one. That's if there was a value associated with that key in the symbol table. We've seen that string before. Otherwise, we put a new entry in the symbol table associating that key with one saying, "We've seen that the first time." And then, at the end, we can print out the keys in sorted order with the frequency of occurrence associated with them. Again, that's a very simple client code that's useful in lots of applications, and we'll look at some later on. So in this example, again, we use this code and as standard input, we provide that small test file. And you see a lot of words just to appear once, but "it", "of", "the" and "was" appeared 10 times in that example. So in this case then, we pipe it through the other program in order to get things in order of frequency of occurrence, which is maybe what we want in an application. That's frequency counter; that's the second example. Here's the third example that's more complicated but also not much more complicated but useful. So one thing that you want to have is maybe an index to the words on standard input. So what's that mean? Well our key is a string. It's a word that appears on standard input. But maybe for the value, we want to know all the places where that word occurs. So maybe in a book, you want to know what pages it appears on or in a string. In a string from a web page, you want to know the places where the word occurs. That's an indexing example. And this is not much more complicated than the frequency counter given the data structures that we've developed. It's a fine example of a utility of defining appropriate abstractions. So now, our symbol table is going to associate a string with a queue of integers. So we'll make a new symbol table that associate string keys with queues of integers. Again, we're going to go through every string in standard input, but now, we'll keep an index I, which is a place that we are in the string that when we're reading Ith key, then we have the value I. So we'll read a string. Again, if the symbol table doesn't contain the key, then we'll put it there with a new queue. Just make a new queue. And then, after we've done that, and if the key was already there, what we do is we're going to get the queue that's associated with the key. Now, if it wasn't there, we made a queue. If it was there, there is a queue. So we're going to get that queue and then we're going to call the enqueue method and put I at the end of that queue. That's it. In this case, we change the entry in the symbol table by getting the queue out and calling the enqueue method. Now, when we go through, we can print out each key and then we print out the queue that's associated with that key in the symbol table. And that's a queue and so there's a two string method that prints out the whole thing. So now, for our sample test file tail.tech, we get an index which says, "For every word, the place is where that word occurs." And you can see the utility of that in all kinds of applications. And again, our emphasis is that we have, with this API, a very flexible data type that gives a simple client code for doing complex tasks. You might have imagined that it would take quite a bit of code to do something like solve this indexing problem. So the bottom line is that symbol tables are everywhere in the computers and devices that you use, from credit card account numbers, to web search, to cloud storage. In all kinds of situations, you have a key that indicates what you want. And you have a value which is the thing that you want. And these things have to perform well. They're the basis of everything that goes on. The router and the Internet, when it has to route something somewhere, it has to know where and has to look up that value in a table. In many, many other situations in the computational infrastructure involve simple tables. It's a very fundamental data type. So Bob and Alice or anybody else that wants to make effective use of computers, you're going to need a good symbol table implementation. And that's what we're going to consider next.