# A.I. & Optimization

## Extracting Topics from Texts using LDA Model

In this post, we explain the mathematics behind the Latent Dirichlet Allocation (LDA) model and the collapsed Gibbs sampling technique used to train it. We also present a Java implementation of the LDA model.

### LDA Overview

One essential problem in NLP is to model the language behind a collection of documents and, specifically, to extract the underlying topics from that collection. Here, we model a document as a bag of words in which the words ($w$'s) are the observed variables while the topics ($z$'s) are latent variables. Blei et al. proposed LDA as a generative model for solving such NLP problems.

### LDA: a Generative Model

One way to approach the problem of topic modeling for a collection of documents is to employ a generative framework, and that is how LDA solves the problem. To understand the intuition behind the LDA model, let's look at a science paper published by J. Kleinberg in Nature: Navigation in a Small World.

With a quick glance, you can see that the paper talks about a few topics: "social networks", "graphs", "search", and "routing". In the real world, we expect most documents to talk about only a few topics. One can argue that our mind expresses a topic, roughly, as a distribution of probabilities over the words of a vocabulary.

Fig 1 shows the graphical representation of the LDA model. We can generate a document $j$ by first sampling a topic $z_{i,j}$ from $\theta_j$, the multinomial topic distribution of that document; this gives the topic of word $i$ in document $j$. Next, we sample the word $w_{i,j}$ from $\phi_{z_{i,j}}$, which is the probability distribution of topic $z_{i,j}$ over the vocabulary words. Thus, we can use LDA to generate documents given the documents' distributions over topics ($\theta$) along with the topics' distributions over the vocabulary ($\phi$). This description has been captured in Fig 1 above.
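The generative process just described can be sketched in a few lines of Java. This is a minimal illustration under the assumption that $\theta_j$ and $\phi_k$ are already known; the class and method names are ours, not those of the post's repository.

```java
import java.util.Random;

// A minimal sketch of the LDA generative process: for each word position,
// draw a topic z from theta_j, then draw a word w from phi_z.
public class LdaGenerator {
    private final Random rng = new Random(42);

    // Draw an index from a discrete probability distribution.
    int sampleDiscrete(double[] probs) {
        double u = rng.nextDouble();
        double cumulative = 0.0;
        for (int i = 0; i < probs.length; i++) {
            cumulative += probs[i];
            if (u < cumulative) return i;
        }
        return probs.length - 1; // guard against floating-point rounding
    }

    // Generate a document of `length` word ids, given theta_j and phi.
    int[] generateDocument(double[] theta, double[][] phi, int length) {
        int[] words = new int[length];
        for (int i = 0; i < length; i++) {
            int topic = sampleDiscrete(theta);     // z_{i,j} ~ Multinomial(theta_j)
            words[i] = sampleDiscrete(phi[topic]); // w_{i,j} ~ Multinomial(phi_{z_{i,j}})
        }
        return words;
    }
}
```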

### LDA: from theory to implementation

The next key question is how one can train the LDA model. Here, the only observed variables are the words $w$, while the topics $z$ are latent variables. Also, both $\theta$ (the documents' distributions over topics) and $\phi$ (the topics' distributions over vocabulary words) are unknown and should be computed.

Let's define $n_{j,k}^{w}$ to be the number of times word $w$ in document $j$ is assigned to topic $k$. So, we can compute the following counts:

$$n_{j,k} = \sum_{w} n_{j,k}^{w}, \qquad n_{k,w} = \sum_{j} n_{j,k}^{w}$$

Here, we've marginalized out $w$ and $j$, respectively. In words, $n_{j,k}$ is the number of words in document $j$ with topic $k$, and $n_{k,w}$ is the number of times word $w$ has been assigned to topic $k$.
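These counts are exactly what a Gibbs sampler has to maintain incrementally. A hypothetical sketch of the bookkeeping, assuming $D$ documents, $K$ topics, and a vocabulary of size $V$ (the array names are ours, not the repository's):

```java
// Count matrices behind the formulas above. Each topic assignment
// touches all four counts, so adding and removing an assignment is O(1).
public class LdaCounts {
    final int[][] nDocTopic;  // n_{j,k}: words in document j assigned to topic k
    final int[][] nTopicWord; // n_{k,w}: times word w assigned to topic k
    final int[] nDoc;         // n_j: total words in document j
    final int[] nTopic;       // n_k: total words assigned to topic k

    LdaCounts(int D, int K, int V) {
        nDocTopic = new int[D][K];
        nTopicWord = new int[K][V];
        nDoc = new int[D];
        nTopic = new int[K];
    }

    // Add (delta = +1) or remove (delta = -1) the assignment of topic k
    // to an occurrence of word w in document j.
    void update(int j, int k, int w, int delta) {
        nDocTopic[j][k] += delta;
        nTopicWord[k][w] += delta;
        nDoc[j] += delta;
        nTopic[k] += delta;
    }
}
```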

Given the current state of all topic assignments except $z_{i,j}$ (the topic assigned to word $i$ in document $j$), the conditional probability of $z_{i,j}$ is computed as below:

$$p(z_{i,j} = k \mid z_{-(i,j)}, \mathbf{w}) = \frac{1}{Z}\,\hat{\theta}_{j,k}\,\hat{\phi}_{k,w_{i,j}}$$
where $\hat{\theta}_{j,k}$ and $\hat{\phi}_{k,w}$ are computed as follows:

$$\hat{\theta}_{j,k} = \frac{n_{j,k}^{-(i,j)} + \alpha}{n_{j}^{-(i,j)} + K\alpha}, \qquad \hat{\phi}_{k,w} = \frac{n_{k,w}^{-(i,j)} + \beta}{n_{k}^{-(i,j)} + V\beta}$$

with $n_j = \sum_k n_{j,k}$ and $n_k = \sum_w n_{k,w}$, where $K$ is the number of topics, $V$ is the vocabulary size, and $\alpha$, $\beta$ are the Dirichlet prior hyperparameters.
$Z$ is the normalization constant:

$$Z = \sum_{k=1}^{K} \hat{\theta}_{j,k}\,\hat{\phi}_{k,w_{i,j}}$$
Here, the superscript $-(i,j)$ means that the corresponding datum (word $i$ of document $j$) has been excluded from the count summations.
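Evaluating this conditional is the core of the sampler. A sketch, assuming the counts passed in already exclude the current assignment; note that the denominator of $\hat{\theta}_{j,k}$ is the same for every $k$, so it cancels in the normalization and can be dropped. The names here are illustrative:

```java
// Computes the normalized conditional p(z_{i,j} = k | z_{-(i,j)}, w)
// for word w in document j, for every topic k.
public class LdaConditional {
    static double[] conditional(int j, int w, int[][] nDocTopic,
                                int[][] nTopicWord, int[] nTopic,
                                double alpha, double beta, int V) {
        int K = nTopic.length;
        double[] p = new double[K];
        double z = 0.0; // normalization constant Z
        for (int k = 0; k < K; k++) {
            double thetaHat = nDocTopic[j][k] + alpha; // (n_j + K*alpha) cancels
            double phiHat = (nTopicWord[k][w] + beta) / (nTopic[k] + V * beta);
            p[k] = thetaHat * phiHat;
            z += p[k];
        }
        for (int k = 0; k < K; k++) p[k] /= z; // divide by Z
        return p;
    }
}
```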

### LDA Gibbs Sampling Algorithm

Fig 2 shows the LDA algorithm for sampling the hidden topic $z_{i,j}$ of word $i$ in document $j$ using the conditional probability $p(z_{i,j} = k \mid z_{-(i,j)}, \mathbf{w})$. Fig 2 shows all the steps required to scan the words of document $j$: first, we compute the probability of each topic $k$ using the current state of the counts, as given by the conditional probability above. Next, we use these probability weights to sample the new topic of the word with index $i$ in the document under process.
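The steps above can be sketched as one Gibbs sweep over the corpus: remove the word's current assignment from the counts, compute the conditional weights, sample a new topic, and restore the counts. This is a self-contained sketch whose names and structure may differ from the post's repository:

```java
import java.util.Random;

// One collapsed Gibbs sweep for LDA: docs[j][i] is the word id of
// word i in document j; z[j][i] is its current topic assignment.
public class LdaGibbs {
    final int K, V;
    final double alpha, beta;
    final int[][] docs;
    final int[][] z;
    final int[][] nDocTopic;  // n_{j,k}
    final int[][] nTopicWord; // n_{k,w}
    final int[] nTopic;       // n_k
    final Random rng = new Random(0);

    LdaGibbs(int[][] docs, int K, int V, double alpha, double beta) {
        this.docs = docs; this.K = K; this.V = V;
        this.alpha = alpha; this.beta = beta;
        int D = docs.length;
        z = new int[D][];
        nDocTopic = new int[D][K];
        nTopicWord = new int[K][V];
        nTopic = new int[K];
        for (int j = 0; j < D; j++) { // random initialization of topics
            z[j] = new int[docs[j].length];
            for (int i = 0; i < docs[j].length; i++) {
                int k = rng.nextInt(K);
                z[j][i] = k;
                nDocTopic[j][k]++; nTopicWord[k][docs[j][i]]++; nTopic[k]++;
            }
        }
    }

    void sweep() {
        for (int j = 0; j < docs.length; j++) {
            for (int i = 0; i < docs[j].length; i++) {
                int w = docs[j][i], k = z[j][i];
                // exclude the current assignment from the counts
                nDocTopic[j][k]--; nTopicWord[k][w]--; nTopic[k]--;
                // unnormalized conditional weight for each topic
                double[] p = new double[K];
                double total = 0.0;
                for (int t = 0; t < K; t++) {
                    p[t] = (nDocTopic[j][t] + alpha)
                         * (nTopicWord[t][w] + beta) / (nTopic[t] + V * beta);
                    total += p[t];
                }
                // sample the new topic from the weights
                double u = rng.nextDouble() * total;
                int t = 0;
                for (double c = p[0]; c < u && t < K - 1; c += p[++t]) { }
                z[j][i] = t;
                nDocTopic[j][t]++; nTopicWord[t][w]++; nTopic[t]++;
            }
        }
    }
}
```

After enough sweeps, $\theta$ and $\phi$ can be estimated from the final counts using the $\hat{\theta}$ and $\hat{\phi}$ formulas above.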

### LDA Running Time

Running one iteration of Gibbs sampling over a document takes $O(nK)$ steps, where $n$ is the number of words in the document and $K$ is the total number of topics. So, it takes $O(DnK)$ steps to scan all documents, where $D$ is the number of documents in our corpus and $n$ is now the average document length. We need to run the Gibbs sampler for enough iterations to allow the probability distributions to converge. This makes the overall running time $O(IDnK)$, where $I$ is the number of iterations.

### LDA: Java implementation

We have implemented the LDA model using the Gibbs sampling technique in Java. You can pull the code from its GitHub repo: LDA implementation in Java. Currently, there are no unit tests in place, but we plan to add them in the future.

### Extra Notes

In the next post, we will discuss different applications of the LDA model, such as clustering documents and information retrieval. We will also explain how one can pick the parameter $K$, the total number of topics.