Now that we understand the intuition behind autoencoders, let's introduce a few different variants: the sparse autoencoder, where the hidden variable h is encouraged to be sparse; the denoising autoencoder, which makes the model more robust against input noise; and finally the stacked autoencoder, where you stack multiple autoencoders together to get a deep neural network.

Let's start with the sparse autoencoder. The objective is to have the autoencoder generate a sparse latent variable h. Why do we want a sparse latent variable? For one thing, it is more interpretable: the dimensions that are non-zero all of a sudden become an important representation of the input. Different inputs may trigger different parts of this code to be one while the rest stays at zero, so it is more interpretable. How do we achieve that? If you don't do anything, h will most likely be a dense vector: no matter what input you give the autoencoder, it will map it to some dense hidden vector h, and that is less interpretable.

The first step, the easy step, is to make sure h can be sparse at all. For example, if you use the sigmoid activation function, you know h will have values between zero and one; when a value is close to zero, you can actually treat it as zero. The second step is to define what we mean by sparsity. In this particular setting with the sigmoid activation function, sparsity at each position of h is just the corresponding average activation value. Let me explain this with the precise formulation. Look at the j-th element h_j of this vector, for example j equal to one for the first element. At that position, we want to know what h_j looks like when you pass in different input vectors; say we have n input vectors. Then you take the average of those values to produce the average activation at the j-th position, which is defined as rho_j hat: rho_j hat = (1/n) times the sum over the n inputs of h_j. This is the sparsity level, or the average activation, at the j-th position of the hidden vector h.

Now that the sparsity level is defined, the next thing we want to do is set a target sparsity level rho. That means we want this target to be low so that the hidden vector can be sparse: for example, we set it to 0.05, an average activation of five percent. But if we don't do anything, a standard autoencoder may not achieve this target sparsity; it may end up with something closer to 0.5. The way to achieve the target sparsity level is to change the loss function. Instead of using only the original loss, which is the reconstruction loss between the input x_i and the reconstruction r_i, we add a regularization term that measures the difference between the target sparsity level rho and the actual sparsity level rho_j hat at each position of the h vector. The h vector is of size k, so we measure the KL divergence between the target activation level and the actual activation level at these k positions, and we want to minimize that KL divergence. That is how we achieve sparsity in h: by introducing this regularization term. Otherwise this is the same as a standard autoencoder, but just by adding this regularization term we are able to obtain a sparse hidden vector h. That's the sparse autoencoder.
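To make this concrete, here is a minimal sketch of one training step of a sparse autoencoder. This is not from the lecture itself: it assumes PyTorch, a single sigmoid encoder/decoder pair, mean-squared-error reconstruction loss, and made-up sizes (784 inputs, 64 hidden units, a batch of 32) with a penalty weight of 0.1. The kl_sparsity_penalty function implements the KL term between the target rho and the batch-average activations rho_j hat described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=64):   # sizes are arbitrary
        super().__init__()
        self.encoder = nn.Linear(input_dim, hidden_dim)
        self.decoder = nn.Linear(hidden_dim, input_dim)

    def forward(self, x):
        h = torch.sigmoid(self.encoder(x))   # sigmoid keeps activations in (0, 1)
        r = torch.sigmoid(self.decoder(h))   # reconstruction of the input
        return h, r

def kl_sparsity_penalty(h, rho=0.05, eps=1e-8):
    # rho_hat_j: average activation of hidden unit j over the batch
    rho_hat = h.mean(dim=0)
    # KL(rho || rho_hat_j), summed over the k hidden units
    kl = rho * torch.log(rho / (rho_hat + eps)) + \
         (1 - rho) * torch.log((1 - rho) / (1 - rho_hat + eps))
    return kl.sum()

# one training step on a stand-in batch
model = SparseAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(32, 784)
h, r = model(x)
loss = F.mse_loss(r, x) + 0.1 * kl_sparsity_penalty(h, rho=0.05)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```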
Another very popular variant of the autoencoder is the denoising autoencoder. The idea here is that we want the autoencoder to learn a representation that is robust against noise. How do we achieve that? One simple way is to introduce some noise into the input and hope the autoencoder can overcome that noise and still reconstruct something close to the original input without the noise. The steps are as follows (and by the way, this model was described in a paper published at ICML 2008). First there is a corruption process: we corrupt the input at some random positions of the input vector to obtain x tilde, the corrupted input vector. That is used as the input to the autoencoder. Then it goes through the usual autoencoder process, and finally we measure the loss between the reconstructed vector and the original, uncorrupted input x. We want to minimize this reconstruction error, not between the corrupted input and the reconstruction, but between the uncorrupted input and the reconstruction. That's the denoising autoencoder; it has shown great robustness against noise, so it is another popular variant. A small code sketch of this corrupt-then-reconstruct loop appears at the end of this section.

Last but not least, we can stack multiple autoencoders together to get a stacked autoencoder. That is very much like a deep feedforward neural network: you can imagine a sequence of encoding steps followed by a sequence of decoding steps that produce a final reconstruction close to the original input. If you look at it closely, this stacked autoencoder is built by separating out all the encoders and applying them in sequence first, then going through the decoders in the reverse order. In this particular case, you first apply the encoder of the first autoencoder, then the second and the third, until the k-th autoencoder at the end of the encoding process. Then when you decode, you go in reverse order: you first apply the decoder of the last autoencoder, then the second to last, and gradually work back to the first. If you follow this process, that is how the stacked autoencoder works. Of course, you can train this with backpropagation end to end, just like a standard feedforward neural network.

Another very effective trick is layer-wise pretraining, so that you don't have to start backpropagation from a random initialization for a very deep neural network. If you have k autoencoders stacked together, you can train them one at a time, k times in total. The way to do that is: you first train the first autoencoder to generate the hidden variable h_1 and the reconstruction r_1. Then you take h_1 as your new input and train the second autoencoder on it to construct the hidden variable h_2; correspondingly, that second autoencoder will produce the reconstruction r_2. You do this in sequence, and finally, in the very middle, the last autoencoder you train, the k-th one, takes the hidden variables from the previous autoencoder, h_(k-1), as input and reconstructs them by figuring out what the hidden variable h_k looks like. So you can train the stacked autoencoder just like training k standard autoencoders separately. That is a very neat trick for dealing with a stacked autoencoder: instead of training a very deep feedforward neural network directly, which is very hard, you just train k autoencoders separately; a sketch of this greedy layer-wise procedure is also included below. Those are the variants of the autoencoder.
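As a rough illustration of the denoising setup, here is a minimal sketch, again assuming PyTorch and made-up dimensions. The corrupt function here uses masking noise (randomly zeroing a fraction of the input positions; the 30 percent rate is an arbitrary choice, not from the lecture), and the loss compares the reconstruction to the clean input x, not the corrupted x tilde.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenoisingAutoencoder(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=128):   # sizes are arbitrary
        super().__init__()
        self.encoder = nn.Linear(input_dim, hidden_dim)
        self.decoder = nn.Linear(hidden_dim, input_dim)

    def forward(self, x_tilde):
        h = torch.sigmoid(self.encoder(x_tilde))
        return torch.sigmoid(self.decoder(h))

def corrupt(x, drop_prob=0.3):
    # masking corruption: randomly zero out a fraction of the input positions
    mask = (torch.rand_like(x) > drop_prob).float()
    return x * mask

model = DenoisingAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(32, 784)          # clean batch (stand-in data)
x_tilde = corrupt(x)             # corrupted input fed to the autoencoder
r = model(x_tilde)
loss = F.mse_loss(r, x)          # compare reconstruction to the CLEAN input
optimizer.zero_grad()
loss.backward()
optimizer.step()
```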
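And here is a minimal sketch of the greedy layer-wise pretraining idea for a stacked autoencoder, under the same assumptions (PyTorch, sigmoid layers, MSE loss, invented layer sizes and epoch counts). The pretrain_layer helper is hypothetical, not from the lecture; each call trains one autoencoder on the codes produced by the previous one, and the resulting encoders could then initialize a deep network for end-to-end fine-tuning.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def pretrain_layer(data, hidden_dim, epochs=5, lr=1e-3):
    """Train one autoencoder on `data`; return its encoder and the codes it produces."""
    input_dim = data.shape[1]
    encoder = nn.Linear(input_dim, hidden_dim)
    decoder = nn.Linear(hidden_dim, input_dim)
    params = list(encoder.parameters()) + list(decoder.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        h = torch.sigmoid(encoder(data))        # hidden code for this layer
        r = torch.sigmoid(decoder(h))           # reconstruction of this layer's input
        loss = F.mse_loss(r, data)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        codes = torch.sigmoid(encoder(data))    # h becomes the next layer's input
    return encoder, codes

# greedy layer-wise pretraining: h_1 from x, h_2 from h_1, and so on
x = torch.rand(256, 784)                        # stand-in dataset
layer_sizes = [256, 64, 16]                     # hidden sizes of the k stacked autoencoders
encoders, data = [], x
for size in layer_sizes:
    enc, data = pretrain_layer(data, size)      # train one autoencoder at a time
    encoders.append(enc)
```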