Advanced Machine Learning, Data Mining, and Online Advertising Services
In the last eight years, Kazem Jahanbakhsh has been involved with several Machine Learning and Natural Language Processing problems, including predicting people's interactions at conferences, computational advertising, fraud detection, and predicting election results by mining tweets. The common challenges seen across these problems are three-fold:
In this post, we discuss some of the challenges one may face during the feature engineering and model design steps by walking through a few examples. We describe algorithms such as Neural Networks and Deep Learning, by which one can handle the feature engineering step in a more automated manner. We also dig into some of the mathematics behind these algorithms, including their architecture, activation functions, and training phase.
This post is relevant to anyone interested in designing machine learning algorithms for real-life problems.
In the last eight years, Kazem has worked on a few academic and industrial machine learning problems, listed below:
You can read more about these problems here: list of ml problems
There are a few inter-related challenges when you are building an ML model for a prediction task:
Preparing and cleaning training data is usually a time-consuming task, but it plays a crucial role in the performance of your predictive algorithm. Feature engineering is another time-consuming phase, in which data scientists need to spend enough time going through their training data, using ranking and statistical techniques to extract and rank informative features from the raw data and build their predictive models around them. This is the focus of the paper An Introduction to Variable and Feature Selection. Finally, building the ML model, where we feed in our extracted features to predict the desired output, is the last step.
An example of a high-dimensional feature space representing a Facebook account is shown below. The goal is to build an ML model that predicts whether an account is real or fake.
Another example comes from Twitter, where we have high-dimensional data for Twitter accounts. Again, the goal is to determine whether an account is a bot or not.
The last example is a hand-written digit image, where we are interested in tagging the image with its corresponding number, for example 4 here.
In the above examples, we could start with a rule-based classifier, in which we come up with a list of rules determining whether an account is fake or not. But there are a few issues with that approach: (I) how do we generate rules that are exhaustive and mutually exclusive, and (II) we might end up writing lots of rules with hand-tuned thresholds, which makes the whole approach cumbersome and not scalable.
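To make the issue concrete, here is a minimal sketch of what such a rule-based fake-account detector might look like. The feature names and thresholds below are entirely hypothetical, chosen only to show how quickly hand-tuned rules pile up:

```python
def is_fake_account(account):
    """Hypothetical rule-based classifier for fake-account detection.

    Every threshold below is hand-tuned. Each new fraud pattern forces
    yet another rule, which is why this approach does not scale.
    """
    if account["friends_count"] == 0 and account["posts_per_day"] > 50:
        return True
    if account["account_age_days"] < 2 and account["links_per_post"] > 0.9:
        return True
    if not account["profile_has_photo"] and account["followers"] > 10_000:
        return True
    return False

suspicious = {"friends_count": 0, "posts_per_day": 120,
              "account_age_days": 400, "links_per_post": 0.1,
              "profile_has_photo": True, "followers": 50}
print(is_fake_account(suspicious))  # -> True (trips the first rule)
```

Every rule here encodes one narrow intuition; fraudsters who change behavior slightly slip past all of them, and the thresholds must be re-tuned by hand.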
The same reasoning applies to the image recognition task described earlier. As we see, the main problem is to find and extract a list of representative features from the raw data and feed those features to the designed model for the final prediction. Leaving this task to a data scientist is time-consuming, and it has a big impact on the predictor's performance.
So, instead of relying on human insight to extract representative features from raw data, it would be much more efficient to develop a mathematical framework that performs the feature engineering and learning tasks in an automated manner. Read more on automating data science problems in Ref 3.
One possible mathematical framework is Neural Networks, which consist of several inter-connected layers, similar to the brain. The underlying components of neural nets are neurons: one feeds a number of inputs, along with their weights, into the neuron's activation function. Below you can see an example of a linear activation function:
y = W.X = w1*x1 + w2*x2 + ... + wn*xn
As we see, the dot product of the input vector X with the weight vector W is linearly mapped to the neuron's output y.
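In code, this linear activation is just a dot product. A minimal sketch with made-up inputs and weights:

```python
import numpy as np

# A single neuron with a linear activation: y = W . X
X = np.array([0.5, -1.0, 2.0])   # input features x1, x2, x3
W = np.array([0.1,  0.4, 0.3])   # the neuron's weights w1, w2, w3

y = np.dot(W, X)                 # w1*x1 + w2*x2 + w3*x3
print(y)                         # 0.05 - 0.4 + 0.6, i.e. approximately 0.25
```

Learning, covered below, amounts to finding the weight values W that make outputs like y match the desired targets.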
One can stack layers of neurons on top of each other to learn higher-level features and build the final model. Multi-layer neural nets become especially handy when our data has a hierarchical structure (e.g. human faces). In these situations, adding multiple layers enables us to model the hierarchical relationships in the training data. See Refs 4, 5, and 12 for details.
One important feature of deep learning is that its framework enables us to automate the feature engineering phase, which in turn can eliminate or minimize the need to work with domain experts to build predictive models. See Ref 1 for more details.
It's insightful to know that a Neural Network with at least one hidden layer can approximate any continuous function f(x). There is an intuitive visual explanation for this here: A visual proof that neural nets can compute any function.
One interesting question in designing a neural network is how to pick the activation function f(.). There are some popular choices, such as the sigmoid function, the hyperbolic tangent tanh(), and the Rectified Linear Unit (ReLU) (see Ref 12). It's important to think about the mathematical properties of these activation functions and their impact on the trained network.
For instance, although the sigmoid function is nicely differentiable, as x grows it quickly saturates and its gradient approaches zero. This property slows down training, which becomes an issue as your network size grows. So researchers such as Krizhevsky et al. found that a non-saturating linear rectifier can be trained about six times faster on image recognition problems. Check the original paper here: ImageNet Classification with Deep Convolutional Neural Networks
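The saturation effect is easy to see numerically. The short sketch below compares the gradient of the sigmoid with the gradient of ReLU as the input grows:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # peaks at 0.25 for x = 0, vanishes for large |x|

def relu_grad(x):
    return 1.0 if x > 0 else 0.0  # constant gradient everywhere on the active side

for x in [0.5, 2.0, 10.0]:
    print(f"x={x:5.1f}  sigmoid'={sigmoid_grad(x):.6f}  relu'={relu_grad(x):.1f}")
```

Already at x = 10 the sigmoid gradient is below 1e-4, so weight updates driven by it nearly stop, while the ReLU gradient stays at 1 for any positive input.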
Computing the weights of Neural Nets and Deep Learning models is the core of the training process. Here, we need to formulate a cost function that calculates the overall error of our predictions with respect to the true labels (e.g. RMSE). So the problem becomes an optimization problem in which we want to compute the parameters (i.e. the w's) that minimize the cost function.
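As a small illustration, RMSE over a vector of predictions can be computed as follows (the sample values are made up):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error: one choice of cost function to minimize."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Two of three predictions are exact, one is off by 2.
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))  # sqrt((0 + 0 + 4) / 3) ~ 1.1547
```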
There are known techniques, such as gradient descent, by which one can find a minimum of a cost function. In practice, training neural nets involves a modified version of gradient descent called stochastic gradient descent, which uses randomization and noise to navigate a complex error surface and reach a good minimum with high probability. See Ref 6 for a list of optimization techniques used in the Machine Learning area.
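Gradient descent itself fits in a few lines. The toy sketch below minimizes the quadratic cost J(w) = (w - 3)^2, whose gradient is 2(w - 3), by repeatedly stepping against the gradient:

```python
# Gradient descent on a toy cost J(w) = (w - 3)^2, minimized at w = 3.
w = 0.0
learning_rate = 0.1

for step in range(100):
    grad = 2.0 * (w - 3.0)     # dJ/dw
    w -= learning_rate * grad  # step downhill, against the gradient

print(round(w, 4))  # converges to 3.0
```

Stochastic gradient descent follows the same update rule, but estimates the gradient from one (or a few) randomly chosen training samples per step instead of the full dataset.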
If we want to train a multi-layer neural net, we need techniques such as Backpropagation in conjunction with an optimization method like stochastic gradient descent.
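A minimal sketch of backpropagation with stochastic gradient descent, using plain NumPy rather than Theano, is shown below. It trains a tiny one-hidden-layer sigmoid network on the XOR function; the layer sizes, learning rate, and step count are illustrative choices, not tuned values:

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR: a classic task that a single linear neuron cannot solve.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One hidden layer with 4 sigmoid units, one sigmoid output unit.
W1 = rng.normal(0.0, 1.0, (2, 4)); b1 = np.zeros(4)
W2 = rng.normal(0.0, 1.0, (4, 1)); b2 = np.zeros(1)

def forward(X):
    h = sigmoid(X @ W1 + b1)
    return h, sigmoid(h @ W2 + b2)

initial_loss = np.mean((forward(X)[1] - Y) ** 2)

lr = 0.5
for step in range(20000):
    i = rng.integers(0, len(X))            # stochastic: one random sample per update
    x, y = X[i:i+1], Y[i:i+1]
    h, out = forward(x)
    # Backpropagation: push the error gradient back through each layer.
    d_out = (out - y) * out * (1 - out)    # gradient at the output pre-activation
    d_h = (d_out @ W2.T) * h * (1 - h)     # gradient at the hidden pre-activation
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out[0]
    W1 -= lr * x.T @ d_h;   b1 -= lr * d_h[0]

final_loss = np.mean((forward(X)[1] - Y) ** 2)
print(final_loss < initial_loss)  # the error drops as training proceeds
```

Each update applies the chain rule once per layer; frameworks like Theano derive these gradient expressions automatically, but the arithmetic underneath is exactly this.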
As we add more layers to our net, we can end up with many more parameters than training samples. This problem is known as overfitting in the machine learning space. To address it, we need to take proper measures, such as adding dropout layers.
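Dropout itself is a simple operation. The sketch below implements the common "inverted dropout" variant, where surviving activations are rescaled during training so nothing needs to change at test time:

```python
import numpy as np

rng = np.random.default_rng(42)

def dropout(activations, p_drop, training=True):
    """Inverted dropout: randomly zero a fraction p_drop of activations
    during training and rescale the survivors by 1 / (1 - p_drop)."""
    if not training or p_drop == 0.0:
        return activations
    keep_mask = rng.random(activations.shape) >= p_drop
    return activations * keep_mask / (1.0 - p_drop)

h = np.ones(10)                       # a layer's activations, all 1.0 for clarity
print(dropout(h, p_drop=0.5))         # roughly half zeroed, the rest scaled to 2.0
print(dropout(h, p_drop=0.5, training=False))  # unchanged at test time
```

Because each training step sees a different random subnetwork, no single neuron can rely on any other, which discourages the co-adapted features that drive overfitting.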
We will demo a few examples of machine learning models for solving image recognition problems. These examples are written in Python and use the Theano library. Check this link for some code examples: Theano Tutorial.
Theano comes with both CPU and GPU implementations for deep learning. In one of our tests, we ran two versions of a deep learning job: one on a Mac Pro (2.9 GHz Intel Core i7 with 8GB of memory) and the other on an EC2 GPU instance (g2.2xlarge instance type: an NVIDIA GPU with 1,536 CUDA cores and 4GB of video memory). We filed a spot request for the EC2 instance, then followed the steps specified in Ref 7 to install Theano.
On the Amazon G2 instance, running the deep learning code (i.e. 4_modern_net.py) took 5 minutes, while on the Mac node it took almost 37 minutes to run the same code. So the G2 instance gives a 7.4x speedup, which is a significant boost from a single multi-core machine. Netflix's lab has also done some work around deep learning: in 2014 they reported building their ANN (artificial neural network) models on a single machine. However, they use multiple machines to build separate models for each of the 41 countries they do business in. See Ref 8 for details on Netflix's effort to optimize running Deep Learning models.
The most time-consuming part of a deep learning model is the Stochastic Gradient Descent phase, in which we compute the weights of the network edges (the wi's). The SGD phase has heavy data dependencies, which makes distributing it across multiple machines non-trivial; it can even result in worse performance if careful steps are not taken during the parallelization design phase.
In 2012, Google took steps to build and run distributed deep networks, spreading some of the computations across several machines. In Google's case, their deep network had billions of parameters; in our case, the number of parameters is far smaller. See Ref 9 for more details on Google's deep learning project.