Hello and welcome to this lecture, where we are going to be talking about a very interesting idea called open-address hashing, okay? And this is in contrast to what we have learned so far, which is the chaining-based hash table. We will recall what chaining is in a second, but open addressing is a very important implementation choice for hash tables. We'll talk about why, and we also note that Python dictionaries, specifically in the reference implementation of Python called CPython, use open addressing. So you can actually find some very interesting, very highly performant implementations of open-address hashing. What is open-address hashing? Well, we already saw that the main thing about hash table design, other than hash functions, was resolving collisions. And we already saw chaining as a solution to this problem. Chaining solves it by keeping a list, rather than a single value, in each slot of the hash table. So each slot holds a list of key-value pairs, and you grow and shrink this list as items are inserted into and deleted from the hash table. And we saw how this works. So then the question is, why do we need another solution? Let's talk about that. The reason we need another solution is, first of all, that lists are memory intensive, and they are a comparatively complicated data structure, okay? That's the first thing. The second thing is that when you use chaining, you will find that some of these lists grow very large, whereas some of the other slots have nothing in them, so you are wasting a lot of space in this kind of hash table. So the question is, could we utilize all this wasted space? There are lots of slots with nothing in them, and the question is, could we use these?
And open addressing resolves this, and it makes the design of the table data structure itself simpler. So what is the idea here? We go back to where we started, where each slot in the hash table simply has space for a single key and a single value, okay? You have space for a single key and a single value, and that's going to be important. You store a single key and a value, not a list anymore, all right? The one remaining question is, how do we resolve collisions? And what is a collision? Remember, suppose I want to insert a new key into this hash table, and the hash function says that the hash value of this new key is j, but slot number j is already taken, okay? That's where a collision happens, and we need to handle it. How we do that is the main subject of open addressing. It gives you a different solution; rather than chaining, you get a different idea. So what do we do? The idea here is: if the original slot is occupied, then place the item in an alternative location. So we will define an original location for each item. Say that when you hash a key for the very first time you get slot j, and as I said before, j is already occupied. What do you do? Well, we can fix in advance, by convention, a sequence of alternative locations where we will try to insert this new key. What's an example? This is called a probe sequence; in the literature on open-address hashing we call it a probe sequence. One simple idea for the alternative locations is: if slot j is occupied, just look at slot j + 1. If that's occupied, look at j + 2; if that's occupied, look at j + 3, and so on, okay? And if in this process you go past the end of the table at slot m − 1, you wrap around to the beginning, okay?
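The insertion scheme just described can be sketched in Python. This is a hypothetical minimal table for illustration, not CPython's actual implementation; the class and method names are made up:

```python
class OpenAddressTable:
    """A minimal open-address hash table: insertion with linear probing."""

    def __init__(self, m=8):
        self.m = m                    # number of slots
        self.slots = [None] * m       # each slot: None or a (key, value) pair
        self.count = 0                # number of occupied slots

    def insert(self, key, value):
        # Rehash into a bigger table once the load factor gets high,
        # so the probe loop below is guaranteed to find a free slot.
        if (self.count + 1) > (2 * self.m) // 3:
            self._rehash()
        j = hash(key) % self.m
        # Probe j, j+1, j+2, ... wrapping around modulo m.
        for i in range(self.m):
            s = (j + i) % self.m
            if self.slots[s] is None:
                self.slots[s] = (key, value)
                self.count += 1
                return
            if self.slots[s][0] == key:      # key already present: update it
                self.slots[s] = (key, value)
                return

    def _rehash(self):
        # Double the table and re-insert every existing pair.
        old, self.m, self.count = self.slots, 2 * self.m, 0
        self.slots = [None] * self.m
        for entry in old:
            if entry is not None:
                self.insert(*entry)
```

Because the load factor is kept below two thirds, the probe loop always reaches an empty slot before running off the end of the table.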
So if you wrap around past slot m − 1 and find that everything is occupied, well, then you rehash, okay? We've already been through that: double the size of the hash table, rehash everything, and try to insert your key again, okay? So that's good, but the point here is that we use this probe sequence to figure out where the item can go. And that's going to be important, all right? This probe sequence gives us not a single alternative location but a sequence of alternative locations, where you can go from one location to another. Remember, a single alternative location wouldn't quite solve the problem, because that alternative location could be occupied in turn, and then you would need yet another alternative. That's what we are really doing: coming up with a sequence of alternative locations where we could place this key. Rather than chaining, we just use the space in the hash table itself to place these keys; that's the key idea behind open addressing. But because of this idea, we need to change the way we do insertions and the way we do deletions. So let's talk about that, all right? What does an open-address hash table do, then? Let's talk about lookup and insertion in an open-address hash table, all right? Suppose I want to look up whether a key is in the hash table; I want to do a find operation or a lookup operation, all right? How do I do that? Well, originally, what we would have done is hash the key and go to the slot where the key should be found, which is slot j. But suppose in slot j we do not find our key; we find some other key, call it key_j. And key_j, let's say, is not our key, so should we now stop looking? That's the key question. Think about this for a second. Should we stop looking?
Once you've reached slot j, which is where the hash function tells you the key should be found, and you do not find the key there, should you stop looking? And the answer, if you've been thinking about this, and hopefully you arrived at the same answer as we do, is no, you cannot stop looking, because you might have to look at these alternative locations and see if the key is to be found in any of them, right? You can't just stop looking; you have to go look in the alternative locations. So what you do is follow the probing sequence, all right? Going back to this example, suppose I'm looking for a key and it's not found on the very first lookup, the one driven by the hash function. The hash function tells me where to do the very first lookup; let's say I don't find the key there, so now I follow the probe sequence. I have already looked at j, so let me look at j + 1. Let's say the alternative locations are j + 1, j + 2, and so on; we will see that there are different possibilities for this. So let's look at j + 1; there is a chance we could find the key there, okay? Otherwise, if j + 1 is empty, then we can conclude that this key does not exist in the hash table. The reason is that if it did exist, it would have been placed in slot j, or, since j was taken, in slot j + 1; and if j + 1 had been filled with something else at insertion time, it would have moved on to j + 2, and so on. So the point is: if j + 1 is empty, we can immediately conclude the key doesn't exist; if j + 1 has the key, you return it; and if j + 1 is filled with something else, you keep going. You look at j + 2, j + 3, and so on, until you find the key or you reach an empty slot, which means you can stop. And these are called probing sequences.
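In code, the lookup rule just described, probe until you find the key or hit an empty slot, might look like the following sketch, assuming the table is simply a list in which each slot is either `None` or a `(key, value)` pair:

```python
def lookup(slots, key):
    """Linear-probing lookup; returns the value, or None if the key is absent."""
    m = len(slots)
    j = hash(key) % m
    for i in range(m):
        entry = slots[(j + i) % m]
        if entry is None:
            return None          # empty slot: the key cannot be further along
        if entry[0] == key:
            return entry[1]      # found at the i-th probe
    return None                  # probed every slot without success
```

Note that the early return on an empty slot is exactly the argument from above: had the key been inserted, it could not have skipped past an empty slot.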
The sequence of locations j, j + 1, j + 2, and so on is called a probing sequence. Now, you could do something more general: instead of 1, you could use a fixed constant a and look at j + a, j + 2a, and so on. And these are all modulo m, which means that if you go past the end of the hash table, you wrap around using the modulo operator, okay? That's called linear probing, okay? It's like looking at a linear sequence, an arithmetic progression of locations modulo m, okay? The alternative is something called quadratic probing. In quadratic probing, the offsets grow quadratically: you look at j, j + 1, j + 4, j + 9, and so on, that is, j + i² at the i-th probe, again mod m (more generally, j + c₁i + c₂i² for fixed constants c₁ and c₂). This gives you a more random-looking sequence of locations, okay? And one of the issues with linear probing is something called the clustering problem. It can be very severe, especially if you probe to the immediately next location and so on, because what's going to happen is that your hash table fills up in a patchy manner: there's going to be some location that has a value, then a run of consecutive locations with values, and then lots of empty slots after that. That's called the clustering problem. To avoid the clustering problem, you would use something like quadratic probing, which produces a more random-looking sequence of places to look at and therefore spreads the values throughout the hash table, all right? So, another thing you could do, instead of a sequence driven by a linear probe, is use double hashing. You could have two hash functions: hash1, which tells you the first place to look, and hash2, which gives you the stride of the arithmetic progression, the amount by which you step.
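The two probe sequences can be written as small generators of slot indices. This is just a sketch; a real table folds the probing directly into its lookup and insert loops:

```python
def linear_probe(j, m, a=1):
    """Linear probing: j, j + a, j + 2a, ... all modulo m."""
    for i in range(m):
        yield (j + i * a) % m

def quadratic_probe(j, m):
    """Quadratic probing: j, j + 1, j + 4, j + 9, ... modulo m."""
    for i in range(m):
        yield (j + i * i) % m
```

For example, starting at j = 6 in a table of size m = 8, linear probing with a = 1 visits 6, 7, 0, 1, ..., while quadratic probing jumps around as 6, 7, 2, 7, ... Note that quadratic probing is not guaranteed to visit every slot unless m and the constants are chosen carefully.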
So the probing sequence becomes h1(key), which is the first place to look, then h1(key) + h2(key), then h1(key) + 2·h2(key), and so on, all modulo m. Here h2(key) serves the role of the stride a in the arithmetic progression we saw on the previous slide; it serves the same role, except that it's driven by a second hash function, okay? And this is another way of doing the same thing, using two hash functions. All right, so the thing to note so far is how a lookup proceeds in sequence. Suppose you're looking for a key: you look where the hash function tells you it is; then, let's say you're doing linear probing with stride a = 1, you look in the next location, then the next location, and so on until you get to an empty slot. When you get to an empty slot, you know that what you're looking for cannot exist in the hash table. As an example, let us look up a key, call it key_{j+2}, which sits at slot j + 2, but whose hash value is actually j, even though it sits at j + 2. So what you would do is look at slot j first: is this the key I'm looking for? No. You look in the first alternative location, j + 1: is this what I'm looking for? No, but since it's not empty, keep going. You look in the second alternative location, j + 2: is it what I'm looking for? Yes, bingo, I found it. Now let's say I later delete key_{j+2}. Then it's no longer in the hash table, so let me just strike it off; pretend it's no longer there, and slot j + 2 is an empty slot, all right? Just think of it as empty. So I deleted key_{j+2}. Now let's say I look up another key, key_{j+3}, which sits at slot j + 3, but whose hash value, unfortunately, is actually j + 1.
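Double hashing can be sketched the same way. The second hash function below is a made-up stand-in, derived from the same hash value and forced to be nonzero; with a prime m and a nonzero stride, the sequence visits every slot:

```python
def double_hash_probe(key, m):
    """Double hashing: probe h1, h1 + h2, h1 + 2*h2, ... modulo m."""
    h1 = hash(key) % m
    # Hypothetical second hash: squeezed into the range 1..m-1 so the
    # stride can never be zero (a zero stride would probe one slot forever).
    h2 = 1 + (hash(key) // m) % (m - 1)
    for i in range(m):
        yield (h1 + i * h2) % m
```

Because the stride itself depends on the key, two keys that collide at h1 will usually follow different probe sequences afterwards, which further reduces clustering.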
All right, so, again, to look for key_{j+3}, I go first to location j + 1, which is where the hash function tells me to look. But that's not the key I'm looking for, so I go to the next alternative location. I looked at j + 1 and it's not there; I go to the next one, which is j + 2, but at j + 2 I have an empty slot. So I conclude that key_{j+3} does not exist in the hash table, which is bad, because it does exist in the hash table, at slot j + 3. The reason we lost it, so to speak, is that we deleted something in the middle, and the probe sequence stops the moment it finds an empty cell. So there is a problem with deletion in an open-address hash table. Now, the fix is very, very simple. When we delete a key in an open-address hash table, instead of marking the slot empty, we put a special marker there, a deletion marker, okay? So the cell is empty but specially marked; you can have a single flag bit, and we set that bit on deletion. When a probe sequence comes across a cell that carries this marker, it should not stop at that cell; it should just keep moving. And that fixes the problem, okay? A lot of this is a very common-sense solution, so I hope it makes sense to you that we do things this way. If you think about it, it makes perfect sense, and with this little trick the whole scheme, insertion, deletion, and lookup, works just fine, okay? So let's stop here and wrap up with a basic comparison between chaining and open addressing, okay? The interesting thing about a chaining-based hash table is that it never needs to fill up.
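Putting the trick together, here is a minimal sketch of deletion with a tombstone marker, plus a lookup that skips over tombstones instead of stopping. The names are illustrative; a production table, like CPython's dict, also reuses tombstone slots on insertion after making sure the key isn't stored further along the probe chain:

```python
TOMBSTONE = object()   # the special deletion marker

class TombstoneTable:
    def __init__(self, m=8):
        self.m = m
        self.slots = [None] * m   # None = never used, TOMBSTONE = deleted

    def _probe(self, key):
        j = hash(key) % self.m
        for i in range(self.m):
            yield (j + i) % self.m

    def insert(self, key, value):
        for s in self._probe(key):
            e = self.slots[s]
            # A never-used slot, a tombstone, or the same key can take the pair.
            if e is None or e is TOMBSTONE or e[0] == key:
                self.slots[s] = (key, value)
                return
        raise RuntimeError("table full; a real table would rehash here")

    def lookup(self, key):
        for s in self._probe(key):
            e = self.slots[s]
            if e is None:
                return None               # truly empty: stop the probe
            if e is TOMBSTONE:
                continue                  # deleted slot: keep probing past it
            if e[0] == key:
                return e[1]
        return None

    def delete(self, key):
        for s in self._probe(key):
            e = self.slots[s]
            if e is None:
                return                    # key was never here
            if e is not TOMBSTONE and e[0] == key:
                self.slots[s] = TOMBSTONE  # mark the cell, don't truly empty it
                return
```

Replaying the lecture's scenario with concrete numbers: keys 8, 16, and 24 all hash to slot 0 in a table of size 8, so they land in slots 0, 1, and 2. Deleting 16 leaves a tombstone in slot 1, and a lookup of 24 probes right past it and still succeeds.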
Fill up, quote unquote, because your chains are basically linked lists that can keep stretching and expanding as you insert more and more elements. So that's a good thing, right? In an open-addressing hash table, unfortunately, there is the possibility that you do fill up. And when you do, or when you're close to filling up, when your load factor exceeds some fraction, you do need to rehash, like we mentioned in the original hash table lecture, which means you come up with a bigger hash table with a different hash function, take all the keys in the original hash table, and put them in the new hash table. Now, the other interesting thing with chaining is that there's going to be a lot of wasted space, all right? Because of the nature of chaining, a lot of space is used up by the linked lists that are growing and shrinking, while the hash table itself might have a lot of wasted space. Open addressing gives a better utilization of space. Another problem with chaining is that there's the extra overhead of maintaining a dynamic array, or a list, that's growing and shrinking for each slot. But an open-address hash table has a different problem, the clustering problem, which is that your hash table is going to get filled up in clusters around a single element, for example if you're using a linear probe sequence. Last but not least, open-address hash tables have a very interesting property: they have something called cache locality. Since you fill up slots linearly, you may get the advantage of your computer preloading parts of your memory into a very fast memory called the cache, and therefore you might enjoy faster accesses, especially when you're doing things like linear probing. So linear probing is bad because it has the clustering problem.
But it's actually also good because it can give you cache locality. And it all depends on how you design the hash function and what kind of applications you have for the hash table, all right? With that, I will stop. Hopefully this lecture gave you a good sense of how to design and implement an open-address hash table. Thank you, and I will see you in the next lecture.