Monday, June 5, 2017

Building Variational Autoencoders from Plain Autoencoders

What this tutorial is not about: This tutorial is not about the implementation of variational autoencoders (VAEs), nor is it about the mathematics underlying VAEs. The main objective of this tutorial is to start with what we know (plain autoencoders), and add some intuition there to understand what we do not know (VAEs). The right amount of math will be provided.

Variational autoencoders (VAEs) have been one of the major breakthroughs in machine learning, so it is no wonder that there are so many awesome tutorials out there. Despite all these tutorials, it took me more than a few days (reading all of them along with the original paper) to really understand them, and this was after already knowing the concepts of variational inference. So in this post, I'll try to explain VAEs once again in as simple a way as possible. I'll provide links to some of the other tutorials on VAEs towards the end of this tutorial.

Before we talk about "variational" autoencoders, I think it makes sense to understand autoencoders first. At a very high level, autoencoders are what their name suggests: they attempt to encode their own input. In the language of neural networks, the input and the output of the network are the same, and the main objective of autoencoders is to encode the input in a hidden layer so that the output can be reconstructed from this hidden layer. Since there is no notion of supervision here, autoencoders are unsupervised methods. One might ask: what is the use of all this learning when the output is the same as the input? Well! The answer lies in the picture below.

Autoencoders (Image source: here)

As you can see from this picture, the output does not come directly from the input - there is one hidden layer in between, and the whole crux of autoencoders lies in this hidden layer. Autoencoders attempt to find a representation of the input in some lower-dimensional space from which the output can be reconstructed. Because they reduce the dimension of the data, they are typically used for dimensionality reduction.
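To make the picture concrete, here is a minimal sketch of a plain autoencoder in PyTorch (the framework, layer sizes, and names are my own illustrative choices, not anything prescribed in this post):

import torch
import torch.nn as nn
import torch.nn.functional as F

class Autoencoder(nn.Module):
    """Compress the input to a small hidden code, then reconstruct the input from it."""
    def __init__(self, input_dim=784, hidden_dim=32):
        super().__init__()
        # Encoder: maps the input to a lower-dimensional representation.
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU(),
                                     nn.Linear(256, hidden_dim))
        # Decoder: maps the representation back to the input space.
        self.decoder = nn.Sequential(nn.Linear(hidden_dim, 256), nn.ReLU(),
                                     nn.Linear(256, input_dim), nn.Sigmoid())

    def forward(self, x):
        code = self.encoder(x)      # the hidden layer everything hinges on
        return self.decoder(code)   # reconstruction of the input

# Training simply minimizes the reconstruction error between output and input.
model = Autoencoder()
x = torch.rand(16, 784)             # a dummy batch standing in for real data
loss = F.mse_loss(model(x), x)
loss.backward()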

Variational Autoencoders

Now coming to "variational" autoencoders: at a very high level, they are the same as autoencoders. Like autoencoders, VAEs also have two main units, one that tries to encode the data while the other tries to decode it - but all the important concepts lie in between these two units. VAEs are primarily used for generating data, but how can you generate data from autoencoders, which have no notion of a probability distribution in them? In order to generate data, you either need the probability distribution of the actual data from which you can sample (if we had that, wouldn't life be awesome), or you need a "simple" probability distribution from which you can sample, whose samples can then be modified to look like actual data.

Now we understand that VAEs have to have a notion of a probability distribution in them, but the question is: how do we incorporate that notion? Let's take a very naive approach to doing that, and see where we get from there, or whether it even makes sense mathematically. While autoencoders try to find a "deterministic" latent representation $z$ of the input data, in VAEs we make this representation probabilistic. So if the output of the encoder is $f(x)$, then $z$ is sampled from some distribution parametrized by $f(x)$, i.e. $z \sim p(z|f(x))$. In its simplest form, this parametric distribution can be Gaussian. (In the real framework you will see that there is a prior on the latent space $z$, denoted by $p(z)$. I have deliberately skipped it to make life easier. One can always neutralize this prior by making it a uniform prior.)

All of the above is possible at a conceptual level. We can sample from a distribution parametrized by the output of the encoder (a neural network). If we are able to sample $z$, we can send it to the decoder, which reconstructs the output. Yay! We have just created a variational autoencoder from an autoencoder without caring about the "variational" part of it or the mathematics underlying it.
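As a sketch of this conceptual construction (again in PyTorch, with illustrative names and sizes of my own choosing), the encoder now outputs the parameters of a Gaussian, and $z$ is sampled from that Gaussian before being handed to the decoder:

import torch
import torch.nn as nn

class NaiveVAE(nn.Module):
    """Conceptual VAE: the encoder outputs Gaussian parameters, z is sampled, the decoder reconstructs."""
    def __init__(self, input_dim=784, latent_dim=20):
        super().__init__()
        self.encoder = nn.Linear(input_dim, 2 * latent_dim)   # produces mu and log-variance
        self.decoder = nn.Sequential(nn.Linear(latent_dim, input_dim), nn.Sigmoid())

    def forward(self, x):
        mu, log_var = self.encoder(x).chunk(2, dim=-1)
        std = torch.exp(0.5 * log_var)
        # Sample z from the distribution parametrized by the encoder output.
        # .sample() draws a value but does not let gradients flow back into
        # mu and std -- which is exactly the training problem discussed next.
        z = torch.distributions.Normal(mu, std).sample()
        return self.decoder(z)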

Now, in order for these variational autoencoders to be useful, we need to ask ourselves two questions. (1) Does it make sense to do this mathematically? (2) Does it make sense to do this algorithmically? I will try to answer the second question first because it is easier. The answer is no. In its current form, it does not make sense because we cannot train such a network end-to-end using backpropagation, the sampling step being non-differentiable. But there is a simple solution to this, at least for some cases. Let's say that you have a prior on the latent space, $z \sim \mathcal{N}(0,1)$, and the distribution parameterized by the encoder output is also Gaussian. Now life is simpler. You can sample $\epsilon \sim \mathcal{N}(0,1)$ and shift and scale it using the output of the encoder. If the output of the encoder is $f(x)=\{\mu(x), \sigma(x)\}$, then
$$z = \mu(x) + \sigma(x)*\epsilon$$
Now, since the sampling step $\epsilon \sim \mathcal{N}(0,1)$ does not have any learnable parameters, it can be kept aside, and the model can be learned using backpropagation. (This is what the VAE literature calls the reparameterization trick.)
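In code, the fix is a one-line change to the sampling step above (same assumed Gaussian setup; the function name is mine):

import torch

def reparameterize(mu, log_var):
    """z = mu + sigma * epsilon, with epsilon ~ N(0, 1).

    The randomness lives entirely in epsilon, which has no learnable
    parameters, so gradients flow through mu and sigma and the model
    can be trained end-to-end with backpropagation.
    """
    std = torch.exp(0.5 * log_var)   # sigma(x)
    eps = torch.randn_like(std)      # epsilon ~ N(0, 1), kept aside from learning
    return mu + std * eps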

Now, to answer the first question: yes, this makes sense mathematically, and for this, we will have to talk a bit of math. In an inference problem involving a latent variable, the goal is to infer the posterior distribution of the latent variable given the data, i.e. $p(z|x)$, and since, in even a moderately complex model, it is difficult to compute this posterior exactly, we resort to approximating it with a simpler distribution $q(z|x)$. This problem of approximating the original posterior distribution with a simpler distribution $q(z|x)$ is known as variational inference, and is cast as an optimization problem using KL divergence. In this optimization problem, our goal is to find $q(z|x)$ such that $KL(q(z|x)||p(z|x))$ is minimized. Now, with some simple math (explained here), you can rewrite this KL term as follows:
$$ \log p(x) - KL(q(z|x)||p(z|x)) = \underbrace{E_{q(z|x)}(\log p(x|z))}_{\text{Reconstruction error}} - \underbrace{KL(q(z|x)||p(z))}_{\text{Regularization term}}$$
The above equation has two parts on the right-hand side. The first part is the reconstruction error, and it does exactly what we explained in the section where we added some intuition to autoencoders to build VAEs. This term involves computing an expectation, and in order to compute it, we sample $z$ from $q(z|x)$, let it go through the decoder, and compute the log-probability of the input being generated from it. For some distributions, the log-probability is nothing but a negative reconstruction loss (e.g. a Gaussian likelihood corresponds to a squared-error loss, a Bernoulli likelihood to a cross-entropy loss), so maximizing this term minimizes that loss. The second term is the regularization term, which attempts to make $q(z|x)$ as close as possible to the prior distribution on the latent variable.
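Putting the two terms together gives the training loss (the negative of the right-hand side above). For a standard normal prior and a Gaussian $q(z|x)$, the KL term has a closed form; the sketch below additionally assumes a Bernoulli decoder, so the reconstruction term is a binary cross-entropy (the function and argument names are mine):

import torch
import torch.nn.functional as F

def vae_loss(x, x_reconstructed, mu, log_var):
    """Negative ELBO: reconstruction error + KL(q(z|x) || N(0, I))."""
    # E_{q(z|x)}[log p(x|z)] for a Bernoulli decoder is the negative binary
    # cross-entropy, estimated with the single z used to produce x_reconstructed.
    reconstruction = F.binary_cross_entropy(x_reconstructed, x, reduction='sum')
    # Closed-form KL between N(mu, sigma^2) and N(0, 1):
    # -0.5 * sum(1 + log sigma^2 - mu^2 - sigma^2)
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return reconstruction + kl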

With the answers to both of these questions, we have built variational autoencoders from plain autoencoders by adding some intuition which makes sense mathematically and is feasible algorithmically. In the above version of a VAE, we used a simple prior on the latent variable space, i.e. a Gaussian. Once you are in the land of probability, there are all sorts of fancy things you can do. For example, you can put complex structure on the latent space, e.g. a GMM, or a GMM with auxiliary variables; you can add supervision to the model, making supervised or semi-supervised models; and you can make the latent space fancier, where you generate some parts while controlling others, giving you the power to generate controllable output.


Additional Resources:



