# A.I. & Optimization

## Extracting Hidden Topics from Texts using LDA Model

In this post we explain the mathematics behind the Latent Dirichlet Allocation (LDA) model using the collapsed Gibbs sampling technique. We also present a Java implementation of the LDA model.

### LDA Overview

One essential problem in NLP is to model the language behind a collection of documents. In particular, an interesting problem is to extract the underlying topics from a collection of documents. Here, we model a document as a bag of words in which the words ($w$'s) are observed variables while the topics are latent variables. Blei et al. proposed the LDA model as a generative model to study such problems [1].

### LDA: a Generative Model

One way to approach the problem of topic modeling for a collection of documents is to employ a generative framework, and that is exactly how LDA solves the problem. To build intuition for the LDA model, let's look at a paper published by J. Kleinberg in Nature: Navigation in a Small World.

A quick glance at the paper shows that it can be expressed as a collection of words. Moreover, the paper talks about a few topics such as "social networks", "graph structure", "search", "routing" and so on. In the real world, we expect most documents to likewise talk about only a few topics. One can also argue that the brain expresses a topic as a probability distribution over the vocabulary of words $V$.

Fig 1 is the graphical representation of the LDA model [1]. One can generate a document $j$ by first sampling from that document's multinomial topic distribution $\theta_{j}$ to obtain the topic $z_{ij}$ of word $i$ in document $j$. Next, one samples the word $x_{ij}$ from $\phi_{z_{ij}}$, the probability distribution of topic $z_{ij}$ over the vocabulary words. Thus, the LDA model can generate documents given the documents' distributions over topics as well as the topics' distributions over the vocabulary of words. This description is captured in Fig 1.
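To make the generative story concrete, here is a minimal Java sketch of that process. It assumes the distributions $\theta$ (per-document topic proportions) and $\phi$ (per-topic word distributions) are already given; the class name, method names, and the `sampleFrom` helper are ours for illustration and are not part of the referenced implementation.

```java
import java.util.Random;

public class LdaGenerativeSketch {

    // Draw one index from a (possibly unnormalized) discrete distribution.
    static int sampleFrom(double[] weights, Random rng) {
        double total = 0.0;
        for (double v : weights) total += v;
        double u = rng.nextDouble() * total;
        double cumulative = 0.0;
        for (int k = 0; k < weights.length; k++) {
            cumulative += weights[k];
            if (u < cumulative) return k;
        }
        return weights.length - 1; // guard against floating-point rounding
    }

    // Generate nWords word indices for document j, where
    // theta[j][k] = P(topic k | document j) and phi[k][w] = P(word w | topic k).
    static int[] generateDocument(double[][] theta, double[][] phi,
                                  int j, int nWords, Random rng) {
        int[] words = new int[nWords];
        for (int i = 0; i < nWords; i++) {
            int z = sampleFrom(theta[j], rng);  // sample topic z_ij from theta_j
            words[i] = sampleFrom(phi[z], rng); // sample word x_ij from phi_{z_ij}
        }
        return words;
    }
}
```

Of course, in practice $\theta$ and $\phi$ are exactly what we do not know; the rest of the post deals with inferring the hidden topic assignments from the observed words.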

### LDA: from theory to implementation

The challenge is how one can compute the LDA model in practice. Here the only observed variables are the words $x=\{x_{ij}\}$ while $z$ (topics) are latent variables. Moreover, both the $\theta$ (document distributions over topics) and $\phi$ (topic distributions over $V$) distributions are unknown.

Let's define $N_{wkj}=\#\{i:x_{ij}=w,z_{ij}=k\}$ to be the number of times word $w$ appears in document $j$ with topic $k$. So, we can compute the following counts:


$N_{kj}=\sum_{w}{N_{wkj}}$
$N_{wk}=\sum_{j}{N_{wkj}}$
$N_{k}=\sum_{w}{N_{wk}}$


Here, we've marginalized out $\theta$ and $\phi$. In words, $N_{kj}$ is the number of words in document $j$ assigned to topic $k$, $N_{wk}$ is the number of times word $w$ has been assigned to topic $k$, and $N_{k}$ is the total number of words assigned to topic $k$ [2].
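In an implementation these counts are simply integer arrays kept consistent with the current topic assignments $z$. Below is a minimal sketch in Java, assuming `docs[j][i]` holds the vocabulary index of word $i$ in document $j$ and `z[j][i]` its current topic; the variable and method names are ours, not necessarily those of the repo.

```java
/**
 * Accumulate the LDA count arrays from the current topic assignments.
 * docs[j][i] = vocabulary index of word i in document j,
 * z[j][i]    = topic currently assigned to that word.
 * The arrays nWK (W x K), nKJ (K x M) and nK (K) must be zero-initialized.
 */
static void buildCounts(int[][] docs, int[][] z,
                        int[][] nWK, int[][] nKJ, int[] nK) {
    for (int j = 0; j < docs.length; j++) {          // loop over M documents
        for (int i = 0; i < docs[j].length; i++) {
            int w = docs[j][i];
            int k = z[j][i];
            nWK[w][k]++;   // N_wk: word w assigned to topic k
            nKJ[k][j]++;   // N_kj: words in document j with topic k
            nK[k]++;       // N_k = sum over w of N_wk
        }
    }
}
```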

Given the current state of all variables except $z_{ij}$ (the topic assigned to word $i$ in document $j$), the conditional probability of $z_{ij}$ is computed as follows:


$p(z_{ij}=k \mid z^{\neg ij},x,\alpha,\beta)=\frac{1}{Z}a_{kj}b_{wk}$


where $a_{kj}$ and $b_{wk}$ are computed as follows ($W$ denotes the size of the vocabulary):


$a_{kj}=N_{kj}^{\neg ij}+\alpha$

$b_{wk}=\frac{N_{wk}^{\neg ij}+\beta}{N_{k}^{\neg ij}+W\beta}$


$Z$ is the normalization constant:


$Z=\sum_{k}{a_{kj}b_{wk}}$


Here, the superscript $\neg ij$ means that the corresponding datum has been excluded from the count summations $N_{wkj}$.
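Translating the formula directly into code, the unnormalized weight of each topic for the word currently being resampled can be computed from the count arrays above, assuming that word has already been removed from the counts (the $\neg ij$ part). This is a sketch under those assumptions; `alpha` and `beta` are the symmetric Dirichlet hyperparameters, and the method name is ours.

```java
/** Unnormalized p(z_ij = k | rest) for every topic, i.e. a_kj * b_wk. */
static double[] topicWeights(int w, int j,
                             int[][] nWK, int[][] nKJ, int[] nK,
                             double alpha, double beta, int W, int K) {
    double[] p = new double[K];
    for (int k = 0; k < K; k++) {
        double a = nKJ[k][j] + alpha;                        // a_kj = N_kj^{neg ij} + alpha
        double b = (nWK[w][k] + beta) / (nK[k] + W * beta);  // b_wk
        p[k] = a * b;                                        // 1/Z is applied when sampling
    }
    return p;
}
```

Dividing by $Z$ is unnecessary if the sampler draws directly from the unnormalized weights, as the sketch in the next section does.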

### LDA Gibbs Sampling

Fig 2 shows the LDA algorithm for sampling the hidden topic of word $i$ in document $j$ using the conditional probability of $z_{ij}$. It lists all the steps required to scan the $N$ words of document $j$: first we compute the probability of each topic using the current state of the counts $N_{wkj}$, as given by the conditional probability above; next, we use these probability weights to sample the new topic of the word with index $i$ in the document $j$ under process.
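Putting the pieces together, one sweep of the collapsed Gibbs sampler removes a word's current assignment from the counts, computes the topic weights with the formula above, samples a new topic, and adds the assignment back. The sketch below reuses the hypothetical helpers from the earlier snippets and is not the exact code of the referenced repo.

```java
/** One Gibbs-sampling sweep over every word of every document. */
static void gibbsSweep(int[][] docs, int[][] z,
                       int[][] nWK, int[][] nKJ, int[] nK,
                       double alpha, double beta, int W, int K,
                       java.util.Random rng) {
    for (int j = 0; j < docs.length; j++) {
        for (int i = 0; i < docs[j].length; i++) {
            int w = docs[j][i];
            int k = z[j][i];

            // Exclude the current assignment: this realizes the "neg ij" counts.
            nWK[w][k]--;  nKJ[k][j]--;  nK[k]--;

            // Weights proportional to a_kj * b_wk for every topic.
            double[] p = topicWeights(w, j, nWK, nKJ, nK, alpha, beta, W, K);

            // sampleFrom handles unnormalized weights, so 1/Z is implicit.
            int newK = sampleFrom(p, rng);

            // Record the new assignment and restore the counts.
            z[j][i] = newK;
            nWK[w][newK]++;  nKJ[newK][j]++;  nK[newK]++;
        }
    }
}
```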

### LDA: time complexity

Running one iteration of Gibbs sampling over a single document takes $O(NK)$ steps, where $N$ is the number of words in the document and $K$ is the total number of topics. So, it takes $O(NKM)$ steps to scan all documents, where $M$ is the number of documents in our collection. We need to run the Gibbs sampler for a sufficient number of iterations to allow the probability distributions to converge. This makes the overall running time $O(NKMI)$, where $I$ is the number of iterations.

### LDA: Java implementation

We have implemented the LDA model using the Gibbs sampling technique in Java. You can pull the code from its GitHub repo: LDA implementation in Java. Currently there are no unit tests in place, but we plan to add them in the future.

### Extra Notes

In the next post we will discuss different applications of the LDA model, such as clustering text documents and information retrieval. We will also explain how one can pick the parameter $K$, the total number of topics.