Hi everyone. Today we are going to learn garbage collection. We're going to learn three algorithms over two lectures, and the textbook explains the concepts really well, so you'd better read Chapter 13, Garbage Collection, in the textbook. Each programming language has its own implementation of allocation. When we use some data, the language run-time system and the compiler work together on memory allocation. How about deallocation? We'd like to use some memory, but at the same time, because our computer has a finite amount of memory, we'd like to use it efficiently. If we want to use memory efficiently, we may want to free some objects. Then, when can we free an object? We can free an object when we can guarantee that it will not be used again in the computation. When it is not reachable, we're not going to use it. When an object in memory is not reachable from us, we cannot use it anyway, so we can get rid of it. That's called garbage collection. Garbage collection is a way to know whether a given record in memory is live, whether that place is accessible. Accessible, live, reachable: they all mean the same thing. As a programmer, as a user, I can get there. That's reachable, live. The other side is what? Not live but dead, not accessible but inaccessible, not reachable but unreachable. They are garbage, and we'd like to reclaim the garbage and reuse that memory. That's the main idea behind garbage collection. We start from the definitely live cells in a computer, and those are the values on the stack and the values in registers. The stack and registers are not garbage by definition. We start from there when we want to find accessible records, and we call those the roots: the beginning, the initial set of the search. Values on the stack and values in registers are live by definition, and they are called roots. A record referenced by a live record is also live, because we can get there. Our program can only possibly use live records, because there's no way to get to any other records.
From the roots, the values on the stack and in registers, we follow their links and collect them all; those are live, and all the others are garbage. That's the main idea behind garbage collection. A garbage collector frees all records that are not live, because we cannot use them anyway. We allocate until we run out of memory, and then run the garbage collector to get more space. In a language where programmers can allocate and deallocate memory cells themselves, like C or C++, it's easier for programmers to use memory efficiently. But at the same time, it is really dangerous if programmers do not know what they are doing. Low-level languages like C and C++, which are very close to the hardware, allow programmers to allocate and reclaim, or deallocate, memory. Programmers can do lots of things, but they should be aware of what they're doing. Their programs crash if they make a mistake. Memory management is really important, and it's not that easy. Languages with automatic memory management, like Java and Scala, do the job for the programmers. Programmers do not need to worry about memory allocation and deallocation; the garbage collector in the run-time system will do the job for you. That's the main idea. Then, as I said, we are going to learn three garbage collection algorithms. There are a lot of garbage collection algorithms, but these three are very important. In this lecture, we're going to learn two algorithms, and in the next lecture, we're going to learn the third. The first one is called reference counting. Reference counting is the name of the algorithm. As the name says, it counts the references to each record: a way to know whether a record has any users. In the beginning, we attach a count to every record, starting at zero. Because a new record doesn't have any users yet, the count of links to the record is zero. When installing a pointer to a record, when we start using the record in a program, we increment its count. When replacing a pointer to a record, when we no longer use that record from that link because we moved the link to some other record, we decrement its count. When it gains a user, increment; when it loses a user, decrement. When a count is decremented back to zero, it means no one is using that record, so we decrement the counts of the other records referenced by this record and free it. One more time: for each record, we keep a count of the links that use it. If we gain one more user, increment the counter; if we lose one user, decrement the counter. When the counter becomes zero, no one is using the record, and we'd like to get rid of it, so we follow the links from this record, decrement each target's count by one, and remove this record. This is how the reference counting algorithm works. Let's see what I just said in pictures.
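The steps above can be sketched in a few lines of code. This is a toy model, not a real collector: I'm assuming records are objects holding a list of outgoing pointers, and the names `Record`, `inc_ref`, and `dec_ref` are invented for illustration.

```python
class Record:
    def __init__(self):
        self.count = 0       # how many links point at this record; starts at zero
        self.fields = []     # outgoing pointers to other records
        self.freed = False   # stands in for returning the memory to a free list

def inc_ref(r):
    """Installing a pointer to r: increment its count."""
    r.count += 1

def dec_ref(r):
    """Removing a pointer to r: decrement; free the record at zero."""
    r.count -= 1
    if r.count == 0:
        # This record dies, so its outgoing pointers disappear too:
        # decrement the counts of everything it references.
        for child in r.fields:
            dec_ref(child)
        r.freed = True

# Usage: a root points to a, and a points to b.
a, b = Record(), Record()
inc_ref(a)                  # root -> a
a.fields.append(b)
inc_ref(b)                  # a -> b
dec_ref(a)                  # drop the root pointer: a's count hits zero,
                            # which frees a and then cascades to free b
print(a.freed, b.freed)     # True True
```

Dropping the single root pointer cascades through the chain, which is exactly the "decrement counts for other records referenced by this record and free it" rule.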
We're going to use several pictures to see how it works. This is our pictorial representation of the memory and roots in your computer. The top two boxes are the roots; they may be the stack or registers. The blue area is the memory, and the boxes in it are allocated records. Those allocated records have counters: how many links point to each record? This one has only one link, this one has one, and this one has two, because one comes from here and one from there. You can see that. So far so good. You'd better get familiar with this notation, because we're going to see it a lot. If we move a link from the top record to the bottom one, then the record that used to have a count of one now has zero, and the other one now has one more link, so its counter is three. We adjust a count whenever we change a pointer, and because that count is now zero, we follow the link out of that record and decrement its target to one, because we're going to remove the record, like this. Free a record if its count goes to zero. It's the same for links from the roots. On the top, we have two roots, and if we remove a pointer from a root, we decrement the counter, and one more time, that drops a count to zero, which frees the record and triggers more frees. Now you can see that. That's how it works. That's all: now you understand how the reference counting algorithm works. It's that simple. But there's one big problem with this algorithm. The algorithm is so simple, it's easy to understand, and now you know how it works. But what's the main problem of this algorithm? It does not work with cycles. What does that mean? Let's see. When we have some assignment between records, it can make a cycle. On the left we have a link from the root to this record, and this one has a link to that one. If that one points back, so the one on top uses the bottom one and the bottom one uses the top one, then what's going to happen? We're going to have one more link to the top one.
Now, this one has two users, and this one has one user, so adding a reference increments the counter. Then what's next? Let's get rid of this link. What's going to happen? Think about it. If we remove this link, the count decrements to one, and it's going to look like this. The lower-left two records are inaccessible. Previously, we had a link from the roots, so we could reach these two records. But now that we removed this link, there is no way for us to use those two records. They are inaccessible, but we cannot deallocate them, because they still have nonzero counters. Really? Really, because this one is used by this one, so we cannot decrement its count to zero and remove it, and the other one is used by its partner in the same way. They are a loving couple: they use each other, but no one else uses them. That's garbage by definition, because we cannot get there, but we cannot free them either. That's the problem. Cycles break reference counting, meaning that with the reference counting algorithm, we cannot free such inaccessible cycles, so there is memory that should be free but we cannot reuse it. While you're running some app, say you're playing a game on your mobile phone, and the program has this issue, then as you keep playing the game, the phone is going to run out of memory and the app is going to die, even though part of its memory is not actually being used. What a pity! Each one points to the other in memory. We cannot use them, but they keep accumulating in memory. That's a memory leak. We call that a memory leak: memory is leaking away, and we cannot reuse it. That's the biggest problem of this algorithm. Besides that, the reference counting algorithm has three more problems. The biggest one is still cycles. The second one is that, whenever we install or remove a pointer, we need to increment or decrement a counter, and that takes time: maintaining counts costs time. Another problem is that when we follow links, the records may be here and there, everywhere. Memory is used sparsely, with related records possibly far apart. That's called bad locality. I'm not going to explain locality here, because it's explained well in other courses like computer systems, computer architecture, or computer organization, whichever class you take, and our textbook explains this problem really well. For now, you can just think of it this way: the reference counting algorithm does not use memory contiguously, so it has this bad locality problem. Finally, because records are freed one at a time, we need to keep track of the available locations. That's called a free list. We usually maintain a list of free records somewhere else. Whenever we are running out of memory, we may ask the free list: give me several locations to use. It's not us, but the run-time system of the language, that does this. It needs a free list to track available locations, and that takes time and memory as well. These are the problems of the reference counting algorithm. You know what? Apple has used this reference counting garbage collection algorithm for their language called Swift. There's a really interesting story about that. I'm sad that we do not have enough time for the whole story, but if you visit this website, it will tell you what happened. The Swift language does not have the problem now, but there was a report dated March 27, 2016, saying that because of this reference counting algorithm, Swift had some issues.
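The loving-couple situation is easy to reproduce in the same toy model as before (again, `Record`, `inc_ref`, and `dec_ref` are invented names, not a real collector): after the root pointer is dropped, both records are unreachable, yet each still holds a reference to the other, so neither count ever reaches zero.

```python
class Record:
    def __init__(self):
        self.count = 0       # number of incoming links
        self.fields = []     # outgoing pointers
        self.freed = False

def inc_ref(r):
    r.count += 1

def dec_ref(r):
    r.count -= 1
    if r.count == 0:
        for child in r.fields:
            dec_ref(child)
        r.freed = True

top, bottom = Record(), Record()
inc_ref(top)                                  # root -> top
top.fields.append(bottom); inc_ref(bottom)    # top -> bottom
bottom.fields.append(top); inc_ref(top)       # bottom -> top: the cycle

dec_ref(top)   # remove the root pointer: top's count goes 2 -> 1, not 0

# Both records are now unreachable, but their counts are stuck at 1,
# so reference counting never frees them. This is the leak.
print(top.count, bottom.count, top.freed, bottom.freed)  # 1 1 False False
```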
It's that important: it is really important to have a correct and efficient garbage collection algorithm. The second algorithm is called mark and sweep garbage collection. It's a bit similar to reference counting, but a bit different. Let's see how they are similar and different. Previously, we had a reference counter in each record and started them all at zero. Similarly, in the beginning here, we make all the records white. In this new algorithm, we are going to color records white, gray, and black. Everything starts white, and we color the records referenced by roots gray. Do you see that? We start from the roots, because they are all live by definition; they are not garbage at all. We start from there, follow the links, and color those records gray. Gray means we don't know what they are yet. Then repeat until there are no gray records. Do the following multiple times until there are no gray records left. What are we going to do? Pick one among those gray records; let's call it r. For each white record that r points to, make it gray, and then color r black. Are you with me? One more time: in the beginning, we color all the records white. Starting from the roots, we follow the links and make those records gray. Pick one gray record, follow the links from it, and color the targets gray if they are white. Then color the picked one black. Black means it's live: we cannot free this record. Now we have a possibly enlarged set of gray records. Do that again: pick a gray record, follow its links, color white targets gray, and make it black, until there are no gray records left, meaning everything is either black or white. Black means reachable from the roots: live records. White means not accessible from the roots: garbage. Starting from the roots, we follow the links, eventually coloring everything reachable black, and the remaining records are garbage. Deallocate all white records, and they become available memory. That simple. Let's see how that works with pictures again.
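The three-color loop just described can be sketched directly. This is a minimal model, assuming the heap is a list of records that each know their outgoing pointers; `mark_and_sweep` and the color names are invented for illustration, and a real collector would of course work on raw memory and actually reclaim the white records.

```python
WHITE, GRAY, BLACK = "white", "gray", "black"

class Record:
    def __init__(self):
        self.color = WHITE
        self.fields = []    # outgoing pointers to other records

def mark_and_sweep(heap, roots):
    # 1. In the beginning, every record is white.
    for r in heap:
        r.color = WHITE
    # 2. Records referenced by roots become gray.
    gray = []
    for r in roots:
        r.color = GRAY
        gray.append(r)
    # 3. Repeat until there are no gray records: pick one (call it r),
    #    gray each white record r points to, then color r black.
    while gray:
        r = gray.pop()
        for child in r.fields:
            if child.color == WHITE:
                child.color = GRAY
                gray.append(child)
        r.color = BLACK
    # 4. Sweep: everything still white is garbage; black records survive.
    return [r for r in heap if r.color == BLACK]

# Usage: a and b form a cycle reachable from a root; c is unreachable.
a, b, c = Record(), Record(), Record()
a.fields.append(b)
b.fields.append(a)          # the cycle a <-> b
live = mark_and_sweep([a, b, c], roots=[a])
print(len(live))            # 2: the cycle survives, c is swept
```

Note that the cycle between `a` and `b` is no obstacle: when the trace reaches an already-black record, it simply skips it, so the loop terminates, and only the truly unreachable `c` stays white.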
All the records are marked white in the beginning. Start from the roots, follow the links, and mark the records referenced by roots as gray. Pick one gray record; there's one, see. Let's pick it and follow the links from it, making the targets gray, like this. We need to pick a gray record; the red arrow indicates the chosen one, and we follow its links and make the targets gray. So far so good. Now that we have made them gray, we're done with the first picked gray record, so make it black. Yes. Now, one more time: pick a gray record. Pick it and follow its links. There's no link, so we're done with it: make it black. There's another gray one; pick it and follow its link. It has a link here, so we make the target gray and this one black. So far so good. We have one more, but when we follow its link, the target is black. Are we supposed to make it gray? No, then this would go on forever. It's already black; we're done. We already visited it. That's good; we already did our homework there. Pick the last gray one, follow its link: already done, it's black. We're done. No more gray records, so deallocate the white records, and they all become available memory. Even though we have a cycle here, because we follow the links from the roots and collect them all, that cycle causes no problem for this algorithm. Cycles do not break this garbage collection, meaning that mark and sweep can reclaim inaccessible cycles. Happy! We can use the memory efficiently. Isn't it good? But it still has some problems. Obviously, it resolved the cycle problem, but it has another problem. What? The cost of a collection is proportional to the size of the entire heap, meaning that to collect the white records, we need to sweep the memory: we need to scan the heap from top to bottom to find them. That takes time. It also has the bad locality problem, and it also needs a free list to track available memory.
Again, for the detailed explanation of bad locality and free lists, I will ask you to read the relevant chapter in the textbook. But the main difference between the reference counting algorithm and the mark and sweep algorithm is this: in reference counting, we assign a counter to each record, increment or decrement it whenever we install or remove a pointer, and free a record when its counter reaches zero. The work is spread out over the normal execution of the program, so we never need to scan the entire heap, and collection itself does not take much time. But it has the big problem of memory leaks due to cycles. The mark and sweep algorithm does not have that cycle problem. That's good. But in order to collect the garbage, we need to scan the memory to collect all those white records. While a program is running, when it runs out of memory, it stops running and collects the memory. If you are playing some computer game on your desktop and you're in the middle of a fight, at some point the game may freeze instead of responding. Why? Because the garbage collector runs behind your program: it needs to stop the work and collect the memory, scanning the entire heap. That's why it takes time. That's why Java and Scala, those languages with automatic memory management, have some performance issues because of garbage collection time. We will discuss more about that in the next lecture. Here's the quiz for this lecture: which of the following is not a disadvantage of reference counting? One, it cannot handle cyclic structures. Two, maintaining reference counts takes cost. Three, it suffers from external fragmentation. Four, execution stops during GC (garbage collection). What do you think? We just compared those two algorithms, reference counting and mark and sweep. Execution does not stop for reference counting; execution stops for mark and sweep. Thank you.