Advanced Machine Learning, Data Mining, and Online Advertising Services
In this post we explain the mathematics behind Latent Dirichlet Allocation (LDA) model using collapsed Gibbs sampling technique. Also we present a Java implementation of the LDA model.
One essential problem in NLP is to model the language behind a collection of documents. In particular, one interesting NLP problem is to extract underlying topics from a collection of documents. Here, we model a document as a bag of words in which words ('s) are the observed variables while topics are latent variables. Blei et al. proposed LDA model as a generative model to study such NLP problems .
One way to approach the problem of topics modeling for a collection of documents is to employ a generative framework and that's extactly the way LDA solves the problem. To find the intuitions behind the LDA model let's look at a science paper published by J. Kleinberg in the Nature: Navigation in a Small World.
By a quick glance at the paper, the reader realizes that the paper can be expressed as a collection of words. Also, the paper talks about a few topics such as: "social networks", "graphs structure", "search", "routing" and so on. In real world, we expect that most of documents also talk about a few topics. One can argue that brain also expresses a topic by a probability distribution over vocabulary of words .
Fig 1 is the graphical representation of the LDA model . One can generate a document by first sampling from a multinomial topic distribution for that document to find the topic for word in document . Next, one can sample word from which is the probability distribution of topic over vocabulary words. Thus, one can use LDA model to generate documents given documents distributions over topics as well as topics distributions over vocabulary of words. This description has been captured in Fig 1.
The challenge is how one can compute LDA model in practice. Here the only observed variables are words while (topics) are latent variables. Also, both (documents distributions over topics) and (topics distributions over ) distributions are unknown.
Fig 2 shows the LDA algorithm for sampling hidden topic of word i in document j using conditional probability of . Fig 2 shows all the steps required to scan words in document when first we compute the probability for each topic using the current state of as shown in conditional probability. Next, we use the probability weights in order to determine the current topic of the given word with index in the document under process.
Running one iteration of Gibbs sampling takes steps where is the number of words in a document and is the total number of topics. So, it takes steps to scan all documents where is the number of documents in our collection. We need to run Gibbs sampler for enough number of iterations in order to guarantee the convergence of probability distributions. This makes the overall running time to be where is the number of iterations.
We have implemented LDA model using Gibbs sampling technique in Java. You can pull the code from its github repo: LDA implementation in Java. Currently there is not any unit test in place but we have plan to add unit tests to the code in the future.
In the next post we will discuss different applications of the LDA model such as clustering text documents and information retrieval. Also we will explain how one can pick parameter as the total number of topics.
Please contact us if you have any questions for the LDA implementation.