# A.I. & Optimization

## Extracting Topics from Texts using LDA Model

In this post, we explain the mathematics behind the Latent Dirichlet Allocation (LDA) model trained with the collapsed Gibbs sampling technique. We also present a Java implementation of the model.

### LDA Overview

One essential problem in NLP is modeling the language behind a collection of documents, and specifically extracting the underlying topics from that collection. Here, we model a document as a bag of words in which the words ($w$'s) are observed variables while the topics are latent variables. Blei et al. proposed LDA as a generative model for solving such NLP problems [1].

### LDA: a Generative Model

One way to approach the problem of topic modeling for a collection of documents is to employ a generative framework, and that is how LDA solves the problem. To understand the intuitions behind the LDA model, let's look at a science paper published by J. Kleinberg in Nature: Navigation in a Small World.

With a quick glance, you will see that the paper talks about a few topics, including "social networks", "graph structure", "search", and "routing". In the real world, we expect most documents to talk about a few topics. One can argue that our mind expresses a topic as a list of probabilities over the word vocabulary $V$.

Fig 1 shows the graphical representation of the LDA model [1]. We can generate a document $j$ by first sampling from that document's multinomial topic distribution $\theta_{j}$ to find the topic $z_{ij}$ for word $i$ in document $j$. Next, we sample word $x_{ij}$ from $\phi_{z_{ij}}$, the probability distribution of topic $z_{ij}$ over the vocabulary words. Thus, given the documents' distributions over topics and the topics' distributions over the vocabulary, we can use LDA to generate documents. This description is captured in Fig 1 above.
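The generative story above can be sketched in a few lines of Java. This is a minimal illustration, not the post's actual implementation; the inputs `theta` (a document's distribution over topics) and `phi` (each topic's distribution over the vocabulary) are assumed to be given, and all names are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

public class LdaGenerate {
    // Draw one index from a discrete probability distribution p.
    static int sampleFrom(double[] p, Random rng) {
        double u = rng.nextDouble(), cum = 0.0;
        for (int i = 0; i < p.length; i++) {
            cum += p[i];
            if (u < cum) return i;
        }
        return p.length - 1; // guard against floating-point rounding
    }

    // Generate nWords word indices for one document:
    // z_ij ~ Multinomial(theta_j), then x_ij ~ Multinomial(phi_{z_ij}).
    static int[] generateDocument(double[] theta, double[][] phi, int nWords, Random rng) {
        int[] words = new int[nWords];
        for (int i = 0; i < nWords; i++) {
            int z = sampleFrom(theta, rng);      // sample topic for position i
            words[i] = sampleFrom(phi[z], rng);  // sample word from that topic
        }
        return words;
    }

    public static void main(String[] args) {
        double[] theta = {0.7, 0.3};                          // 2 topics (toy values)
        double[][] phi = {{0.5, 0.5, 0.0}, {0.0, 0.1, 0.9}};  // vocabulary of 3 words
        int[] doc = generateDocument(theta, phi, 10, new Random(42));
        System.out.println(Arrays.toString(doc));
    }
}
```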

### LDA: from theory to implementation

The challenge is how to train the LDA model. Here, the only observed variables are the words $x=\{x_{ij}\}$, while the topics $z$ are latent variables. Also, both $\theta$ (documents' distributions over topics) and $\phi$ (topics' distributions over $V$) are unknown and must be computed.

Let's define $N_{wkj}=\#\{i:x_{ij}=w,z_{ij}=k\}$ to be the number of times word $w$ appears in document $j$ with topic $k$. From it, we can compute the following counts:


$N_{kj}=\sum_{w}{N_{wkj}}$
$N_{wk}=\sum_{j}{N_{wkj}}$


Here, we've marginalized out $\theta$ and $\phi$. In words, $N_{kj}$ is the number of words in document $j$ with topic $k$ and $N_{wk}$ is the number of times word $w$ has been assigned to topic $k$ [2].
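These count tables are what a collapsed Gibbs sampler keeps in memory. A minimal sketch of building them from topic assignments, assuming hypothetical arrays `x[j][i]` (word ids) and `z[j][i]` (current topic assignments) rather than the repo's actual data structures:

```java
public class LdaCounts {
    final int[][] nWK; // nWK[w][k]: N_{wk}, times word w is assigned to topic k
    final int[][] nKJ; // nKJ[k][j]: N_{kj}, words in document j with topic k
    final int[] nK;    // nK[k]:     N_k = sum_w N_{wk}, total words with topic k

    LdaCounts(int[][] x, int[][] z, int W, int K) {
        nWK = new int[W][K];
        nKJ = new int[K][x.length];
        nK = new int[K];
        for (int j = 0; j < x.length; j++) {
            for (int i = 0; i < x[j].length; i++) {
                int w = x[j][i], k = z[j][i];
                nWK[w][k]++; // increment N_{wk}
                nKJ[k][j]++; // increment N_{kj}
                nK[k]++;     // increment N_k
            }
        }
    }

    public static void main(String[] args) {
        // Toy corpus: 2 documents over a 3-word vocabulary, 2 topics.
        int[][] x = {{0, 1}, {1, 2}};
        int[][] z = {{0, 0}, {1, 0}};
        LdaCounts c = new LdaCounts(x, z, 3, 2);
        System.out.println(c.nKJ[0][0]); // prints 2
    }
}
```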

Given the current state of all variables but $z_{ij}$ (the topic assigned to word $i$ in document $j$), the conditional probability of $z_{ij}$ is computed as below:


$p(z_{ij}=k|z^{\neg ij},x,\alpha,\beta)=\frac{1}{Z}a_{kj}b_{wk}$


where $a_{kj}$ and $b_{wk}$ are computed as follows:


$a_{kj}=N_{kj}^{\neg ij}+\alpha$

$b_{wk}=\frac{N_{wk}^{\neg ij}+\beta}{N_{k}^{\neg ij}+W\beta}$


$Z$ is the normalization constant:


$Z=\sum_{k}{a_{kj}b_{wk}}$


Here, the superscript $\neg ij$ means that the corresponding datum has been excluded from the count summations over $N_{wkj}$.
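As a sketch, the conditional above can be evaluated directly from the count tables. This assumes hypothetical arrays `nKJ`, `nWK`, `nK` from which the current word's contribution has already been subtracted (the $\neg ij$ exclusion), with the usual symmetric Dirichlet smoothing by the hyperparameters `alpha` and `beta`:

```java
public class LdaConditional {
    // Returns p(z_ij = k | everything else) for k = 0..K-1, normalized by Z.
    static double[] topicPosterior(int w, int j, int[][] nKJ, int[][] nWK,
                                   int[] nK, double alpha, double beta, int W) {
        int K = nK.length;
        double[] p = new double[K];
        double Z = 0.0;
        for (int k = 0; k < K; k++) {
            double a = nKJ[k][j] + alpha;                       // a_{kj}
            double b = (nWK[w][k] + beta) / (nK[k] + W * beta); // b_{wk}
            p[k] = a * b;
            Z += p[k];
        }
        for (int k = 0; k < K; k++) p[k] /= Z; // normalize: Z = sum_k a_{kj} b_{wk}
        return p;
    }

    public static void main(String[] args) {
        // Toy state: W = 2 words, K = 2 topics, one document.
        double[] p = topicPosterior(0, 0, new int[][]{{1}, {1}},
                new int[][]{{1, 0}, {0, 1}}, new int[]{1, 1}, 0.5, 0.5, 2);
        System.out.printf("p = [%.3f, %.3f]%n", p[0], p[1]); // prints p = [0.750, 0.250]
    }
}
```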

### LDA Gibbs Sampling Algorithm

Fig 2 shows the LDA algorithm for sampling the hidden topic of word $i$ in document $j$ using the conditional probability of $z_{ij}$. The algorithm scans the $N$ words of document $j$: first, we compute the probability of each topic using the current state of the counts $N_{wkj}$, as given by the conditional probability; next, we use these probability weights to sample the new topic of the word with index $i$ in the document $j$ under process.
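One sweep over a document can be sketched as follows: remove the word's current assignment from the counts, compute the unnormalized topic weights, sample a new topic, and restore the counts. Again, the array names (`x`, `z`, `nKJ`, `nWK`, `nK`) are illustrative assumptions, not the repo's actual API:

```java
import java.util.Random;

public class LdaGibbsSweep {
    static void sweepDocument(int j, int[][] x, int[][] z, int[][] nKJ,
                              int[][] nWK, int[] nK, double alpha, double beta,
                              int W, Random rng) {
        int K = nK.length;
        double[] p = new double[K];
        for (int i = 0; i < x[j].length; i++) {
            int w = x[j][i], old = z[j][i];
            nKJ[old][j]--; nWK[w][old]--; nK[old]--;  // exclude current datum (the \neg ij counts)
            double zSum = 0.0;
            for (int k = 0; k < K; k++) {             // unnormalized a_{kj} * b_{wk}
                p[k] = (nKJ[k][j] + alpha) * (nWK[w][k] + beta) / (nK[k] + W * beta);
                zSum += p[k];
            }
            double u = rng.nextDouble() * zSum, cum = 0.0;
            int k = 0;                                // sample new topic from the weights
            for (; k < K - 1; k++) { cum += p[k]; if (u < cum) break; }
            z[j][i] = k;
            nKJ[k][j]++; nWK[w][k]++; nK[k]++;        // restore counts with the new topic
        }
    }

    public static void main(String[] args) {
        // Toy corpus: 2 documents, 3-word vocabulary, 2 topics.
        int[][] x = {{0, 1, 2}, {1, 2}};
        int[][] z = {{0, 1, 0}, {1, 0}};
        int W = 3, K = 2;
        int[][] nKJ = new int[K][x.length];
        int[][] nWK = new int[W][K];
        int[] nK = new int[K];
        for (int j = 0; j < x.length; j++)
            for (int i = 0; i < x[j].length; i++) {
                nKJ[z[j][i]][j]++; nWK[x[j][i]][z[j][i]]++; nK[z[j][i]]++;
            }
        Random rng = new Random(7);
        for (int j = 0; j < x.length; j++)
            sweepDocument(j, x, z, nKJ, nWK, nK, 0.5, 0.5, W, rng);
        System.out.println(java.util.Arrays.deepToString(z));
    }
}
```

Note that the counts are updated in place after each word, so later words in the same sweep already see the new assignment, which is what makes the sampler "collapsed" over $\theta$ and $\phi$.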

### LDA Running Time

Running one iteration of Gibbs sampling takes $O(NK)$ steps, where $N$ is the number of words in a document and $K$ is the total number of topics. So, it takes $O(NKM)$ steps to scan all documents, where $M$ is the number of documents in our corpus. We need to run the Gibbs sampler for a sufficient number of iterations to guarantee the convergence of the probability distributions. This makes the overall running time $O(NKMI)$, where $I$ is the number of iterations.

### LDA: Java implementation

We have implemented the LDA model with the Gibbs sampling technique in Java. You can pull the code from its GitHub repo: LDA implementation in Java. Currently there are no unit tests in place, but we plan to add them in the future.

### Extra Notes

In the next post, we will discuss different applications of the LDA model, such as clustering documents and information retrieval. We will also explain how one can pick the parameter $K$, the total number of topics.