Monday, June 26, 2017

Generative Adversarial Networks for Text Data

This is my second post in the series of generative models using Deep Learning. My earlier post talked about other form of generative models i.e., variational autoencoders. I specifically explained how to build variational autoencoders from plain autoencoders. Just like previous post, this post is also not about implementation of Generative Adversarial Networks (GANs), neither it is about the mathematics behind them.  I will start with something we know i.e., GANs (an introduction would be provided), and build GANs for text from there. 

Generative Adversarial Networks (GANs) have been called one of the major break through in machine learning. In fact they have been called "the most interesting idea in machine learning in last 10 years". There are many good tutorials out there (links to which will be provided at the bottom of this post) which talk about general GANs, however most of them discuss GANs from computer vision perspective (for image data), and are not directly applicable for text data.

Now before we discuss GANs for text data, lets revise general GANs and how they are applicable to image data. The basic idea in GANs is pretty simple. GANs consists of mainly two components, one is generator and another is discriminator. The job of the generator is to generate some random but real looking data, while the job of the discriminator is to discriminate between real data and real-looking random data. The training happens in an adversarial combat mode. If the discriminator is able to tell the difference between real and real-looking data, it means that generator has not done its job well at generating real-looking data, and generator's weights are updated so that it can generate better real-looking data. And if the discriminator is not able to tell the difference between real data and real-looking data, it means that the discriminator has not trained well, and its weights are updated. Now the training continues until we get a good generator and good discriminator.

In their original paper, Goodfellow et al. propose to minimize the following objective function and we will discuss GANs first in the context of this objective function:
$$ \min_G \max_D  \mathbb{E}_{x\sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z\sim p_{z}(z)}[\log 1-D(G(z))]$$
Here $G$ is the generative model, $D$ is the discriminative model, $x$ is real data, $z$ is the random noise (not real-looking data).

Although nasty looking, the above equation is pretty simple. It is basically trying to maximize the probability of discriminator's ability to classify the real data, and generator's ability to generate fake data. In other words, if you think of discriminator as classifier (for example multi-layer perceptron), then the objective function is nothing but cross entropy i.e. $\log p(y|x) + \log(1-p(y|z))$. Both the terms in the objective function denotes the probability of generating real data so it has to be maximized with respect to discriminator while minimize with respect to generator. Hopefully the pictures below clarifies things further.

Generative Adversarial Network for Image Data



GANs should now be clear from the above picture. As you can see we have two kind of data i.e., real data and real-looking data, both of which are fed to discriminator, which maximizes the probability of generating real data.

As we all know GANs were originally designed for image data so they are naturally applicable to images, more so because of their real-valued nature. Each image pixel is nothing but some real value denoting the color. The application of GANs to text data is a bit tricky. This is primarily because of the following reasons.

  • Generation of text in a sequential manner: In image data, one can think of each image as a real-valued single vector, and you can generate the entire vector at a time, however the story in the text is different. Here you need to generate one word at a time. 
  • Real vs discrete: Text data is discrete while image data is real. How do we generate discrete noise that would ultimately be transformed in some real looking text?
  • Discriminator for text data: Even if we are able to generate random text, how do we design a discriminator which is able to differentiate between real text and fake text?

Despite these challenges, GANs have been used for text data, and in this post I will try to make it simpler. The original paper can called these networks as GAN-RNN.

Now assuming that we know GANs for image data, we will make modifications in them to make them applicable for text data. The idea in GANs is to learn a generator that is able to generate real looking data. In GAN-RNN which is GAN for text data, the very first modification we make is replace the generator with a recurrent neural network (RNN). We can also call it encoder as it is usually called in sequence-to-sequence model. The job of this RNN is to generate random text. Note that in GANs we first generate random noise and then transforms this random noise into some real-looking data through generator, here we perform both of these tasks by RNN. To begin with, the input to RNN's first cell is a start-of-sentence token,  and from there on, subsequent cells generate rest of the words automatically. Note that when comparing GAN-RNN with plain GANs, we have replaced both random noise generator and generator in GANs with RNN. It performs the job of both. In the beginning when we have not really learned the weights of RNN, it will be as good as random noise i.e. meaningless text; however as the training continues, it will learn to generate meaningful text. With this simple replacement, we have solved two of the three problems mentioned above i.e. how to generate text data sequentially, and problem with discrete vs. real valued.

Generative Adversarial Network for Text Data: the generator has been replaced with an RNN encoder. We still do not know what the discriminatory looks like (see the following picture). 


Now since we know how to generate random text data, we just need to make sure this random generated data is real looking. We do this by comparing random data with the real data. The comparison takes place in the discriminator. The discriminator takes the random data and real data as input and produces some score denoting how far the random data is from real data. Now the question is - how do we design a discriminator that is suitable for text data. For this we will make another modification to GANs but before that, it is important to note that the output of the generator is the text data which is not really differentiable, causing problems with the learning. So we have two problems (1) design a discriminator (2) make the whole pipeline differentiable. And to your surprise, both of these problems can be solved by a simple modification i.e. by introducing a new function called summarization function. This summarization function operates on output of the hidden states of the RNN (so the output of the RNN is not really discrete, but continuous). Now one has flexibility is designing this summarization fuction in such a way that the output of this summarization can be used into a discriminator. The summarization function can be anything e.g. last hidden state of RNN. Once we have such a summarization function, a discriminator can be designed in the same way as for the image data. It need to maximize the probability of generating real data. The whole process is explained in the image below.


Generative Adversarial Network for Text Data: in order to make the discriminator work, we have introduced a new function i.e. summarization function. Now the real input also passes through the RNN encoder. 

Resources:


Monday, June 5, 2017

Building Variational Autoencoders from Plain Autoencoders

What this tutorial is not about: This tutorial is not about implementation of variational autoencoders (VAEs), neither it is about the mathematics underlying VAEs. The main objective of this tutorial is to start with what we know (plain autoencoder), and add some intuition there to understand what we do not know (VAEs). Right amount of math will be provided.

Variational autoencoders (VAEs) have been one of the major breakthrough in machine learning, and that's why it's no wonder that there are so many awesome tutorials out there. Despite all these tutorials, it took me more than a few days (with reading all tutorials along with the original paper) to be able to really understand them, and this is after having known the concepts of variation inference already. So in this post, I'd try to explain VAEs once again in as simple way as possible. I'll provide links to some of the other tutorials on VAEs towards the end of this tutorial.

Before we talk about "variational" autoencoders, I think it makes sense to understand autoencoders first. At a very high level, autoencoders are what their name suggest. They attempt to encode themselves. In the language of neural network, the input and output to the network are same, and the main objective of autoencoders is to encode the input in a hidden layer, so that output can be reconstructed from this hidden layer. Since there is no notion of supervision here, autoencoders are unsupervised methods. One might think: what is the use of all this learning when output is same as input. Well! The answer lies in the picture below.

Autoencoders (Image source: here)

As you can see from this picture, the output does not come directly from the input - there is one hidden layer in between, and all crux of autoencoders lies in this hidden layer. The autoencoders attempt to find representation of the input in some lower dimensional space from which one can reconstruct the output. Due to their nature of reducing the dimension of the data, they are typically used for dimensionality reduction.

Variational Autoencoders

Now coming to "variational" autoencoders, at a very high level, they are same as autoencoders. Like autoencoders, VAEs also have two main units, one tries to encode the data while other tries to decode it - but all the important concepts lie in between these two units. VAEs are primarily used for generating data but how can you generate data from autoencoders which do not have any notion of probability distribution in them. In order to generate data, you either need to have a probability distribution of the actual data from which you can sample (if we had that, wouldn't the life be awesome), or have a "simple" probability distribution from which you can sample, and that sample can then be modified to look like actual data.

Now we understand that VAEs have to have a notion of probability distribution in them, but the question is: how do we incorporate that notion? Lets take a very dummy approach to doing that, and see where we get from there, or if that even makes sense mathematically. Similar to autoencoders which try to find "deterministic" latent representation $z$ of the input data, in VAEs, instead of deterministic representation $z$, we make it probabilistic. So if output of the encoder is $f(x)$ then $z$ is sampled from some distribution parametrized by $f(x)$, i.e. $z \sim p(z|f(x))$ In its simplest form, the parametric distribution can be Gaussian. (In the real framework you will see that there is a prior on the latent space $z$ which is denoted by $p(z)$. I have deliberately skipped it to make life easier. One can always skip this prior by making it uniform prior.)

All of above things are possible at conceptual level. We can sample from a distribution parametrized by the output of the encoder (a neural network). If we are able to sample $z$, we can send it to decoder which would reconstruct the output. Yay! we have just created variational autoencoders from autoencoders without caring about the variational part of it or mathematics underlying it.

Now, in order for these variational autoencoders to be useful, we need to ask two questions to ourselves. (1) Does it makes sense to do this mathematically? (2) Does it makes sense to do this algorithmically? I will try to answer the second question first because it is easier. The answer is No. In its current form, it does not make sense because we cannot train such a network end-to-end using backpropogation due to the sampling step being non-differentiable. But there is a simple solution to this, at least for some cases. Lets say that you have prior on the latent space $z \sim \mathcal{N}(0,1)$, and your distribution which is parameterized by the encoder output is also Gaussian. Now the life is simpler. You can sample $\epsilon \sim \mathcal{N}(0,1)$ and scale it up with respect to the output of the encoder. If output of the encoder is $f(x)=\{\mu(x), \sigma(x)\}$, then
$$z = \mu(x) + \sigma(x)*\epsilon$$
Now since the sampling step $\epsilon \sim \mathcal{N}(0,1)$ does not have any learnable parameters, it can be kept aside, and the model can be learned using backpropogation.

Now to answer the first question, yes, this makes sense mathematically, and for this, we will have to talk math a bit.  In an inference problem involving latent variable, the goal is to infer the posterior distribution of the latent variable given data i.e. $p(z|x)$, and since, even in a moderately complex model, it is difficult to compute this posterior probability exactly, we resort to approximation i.e. approximate it using a simpler distribution $q(z|x)$. This problem of approximating the distribution $q(z|x)$ from original posterior distribution is known as variational inference, and is cast as an optimization problem using KL divergence. In this optimization problem, our goal is to find $q(z|x)$ such that $KL(q(z|x)||p(z|x))$ is minimized. Now with some simple math (explained here), you can write this KL term as following:
$$ \log p(x) - KL(q(z|x)||p(z|x)) = \underbrace{E_{q(z|x)}(\log p(x|z))}_{\text{Reconstruction error}} - \underbrace{KL(q(z|x)||p(z))}_{\text{Regularization term}}$$
The above equation has two parts on the right hand side, The first part is the reconstruction error, and is doing exactly what we explained in the section where we added some intuition to autoencoders to build VAEs. This step involves computing expectation, and in order to compute this expectation, we sample $z$ from $q(z|x)$,  let it go through the decoder, and compute the probability of it being generated. For some distributions, log probabilities are nothing but negative loss, meaning that loss is minimized. The second term is the regularization term which attempts to make $q(z|x)$ as close as possible to the prior distribution on the latent variable.

With the answer to both of these questions, we have built variational autoencoders from plain autoencoders by adding some intuition which makes sense mathematically and feasible algorithmically. In the above version of VAE, we used a simple prior on the latent variable space i.e. Gaussian. Once you are in the land of probability space, there are all sort of fancy things you can do. For example, you can put complex structure on the latent space e.g. GMM, GMM with auxiliary variables etc., you can add supervision to the model making supervised, semi-supervised models. You can make the latent space more fancy where some parts you generate while other parts you control, giving you the power of generating controllable output.


Additional Resources:




Saturday, January 7, 2017

20 Basic Interview Questions in Machine Learning and Data Mining

I am sure most people would agree with me that data science is one of the most sought after fields in the job market today, and it is only going to increase with the time. Over the last few years I have interviewed dozens of candidates for data science jobs at various levels - undergraduate, graduate, post-graduate, both full time and part time; and I am often surprised to see how even the most qualified candidates fail to answer even very basic questions in Machine Learning and Data Mining. So I decided to put a list of basic interview questions that might be of help to some of you. In my experience, more than 95% of the candidates failed to answer three out of five questions from the following list, which I think are simply from ML 101.

20 Basic Interview Questions in Machine Learning and Data Mining

  1. What is the difference between inductive learning and transductive learning?
  2. What is the difference between discriminative models vs generative models?
  3. What is overfitting? How can you avoid it?
  4. What is regularization? Why do we need it?
  5. What is cross validation? Why do we need it?
  6. What is the difference between Logistic Regression and Support Vector Machines?
  7. Tell me the convergence properties of K-means Clustering algorithm? Does it converge at all?
  8. What is the difference between Naive Bayes and Logistic Regression?
  9. What is the difference between L2 and L1 regularization?
  10. Neural Network is not something that we discovered in this millennium. They have been around for decades. Then why have they suddenly become so popular now?
  11. What is VC dimension? Why do we need this?
  12. Is Logistic Regression a linear classifier? It is made of Softmax function which does not look like linear?
  13. Heard of RKHS? Explain it? What is the meaning of each character i.e. R,K,H,S in it?
  14. Heard of Representer Theorem? Explain it?
  15. Explain the difference between Bagging and Boosting?
  16. Generalized error consists of two parts i.e. Bais and Variance? What is the effect of Bagging and Boosting on Bias and Variance?
  17. What is class imbalance? How can you take care of it?
  18. What would be the right metric for imbalanced classification problem?
  19. How the depth of tree in Decision tree related to the bias and variance?
  20. High dimensions, is it good or bad, and why?
  21. More to come...


Answers to come in the next post based on the interests.

Generative Adversarial Networks for Text Data

This is my second post in the series of generative models using Deep Learning. My earlier post talked about other form of generative models...