A.I. & Optimization

Advanced Machine Learning, Data Mining, and Online Advertising Services

Who's a Data Scientist?

I decided to write this article to address two questions: "Who is a data scientist?" and "What skills and background does a data scientist need?" It is based entirely on my experience over the last seven years of applying machine learning, data mining, and statistical analysis techniques to mine and analyze big, noisy data.


PhD Work:

A- Predicting Human Contacts in Real Social Settings:

I started my PhD in computer science in 2007. The first challenge I faced was finding an interesting, challenging, and novel problem to work on. It took me almost two years to find my problem! In 2008 I was exposed to the area of social network analysis, which I found very intriguing. A large portion of the work done by computer scientists in this area was to first collect interesting real data related to human activities and behaviour, and then mine and analyze the collected data to find interesting, meaningful, and statistically significant patterns. This kind of activity can be summed up as "searching for a needle (a meaningful pattern or signal) in a haystack (noisy data)". So I started reading papers and books on social networks.

I formulated a new problem for my PhD: "Predicting How People Contact Each Other in Different Social Settings". I started analyzing real data about human mobility in social settings such as conferences and outdoor events. To predict how people move and contact each other at a conference, I formalized different hypotheses based on my intuitions, insights from the data, and theories from the social networking area. Next, I tested my hypotheses by analyzing the real mobility data. After mining patterns in the data about how people contact each other, I started building predictive models (prediction algorithms) to forecast who the next person Alice meets will be when she attends a conference like KDD. The accuracy of each predictor was computed by running the algorithm on real data.
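As a rough illustration of what a contact predictor and its accuracy measurement can look like, here is a minimal Python sketch. It is a simple frequency baseline (predict the person met most often so far), not one of the algorithms from the papers, and the `(person_a, person_b)` record format and all function names are assumptions made for this example.

```python
from collections import Counter, defaultdict


def train_contact_counts(contacts):
    """Count how often each pair of people met in the historical records."""
    counts = defaultdict(Counter)
    for a, b in contacts:
        counts[a][b] += 1
        counts[b][a] += 1
    return counts


def predict_next_contact(counts, person):
    """Return the person met most often so far, or None if `person` is unseen."""
    if not counts.get(person):
        return None
    return counts[person].most_common(1)[0][0]


def accuracy(counts, test_contacts):
    """Fraction of held-out contacts (a, b) where the prediction for a is b."""
    if not test_contacts:
        return 0.0
    hits = sum(predict_next_contact(counts, a) == b for a, b in test_contacts)
    return hits / len(test_contacts)


# Hypothetical toy history: Alice met Bob twice and Carol once.
history = [("alice", "bob"), ("alice", "bob"), ("alice", "carol")]
model = train_contact_counts(history)
```

Evaluating on contacts that were held out from training, as `accuracy` does, mirrors the idea of measuring a predictor by running it on real data rather than on the data it was fitted to.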

As expected, predicting human contacts, especially in a conference environment, is a hard problem because human behavior is partly random. Ultimately I developed several unsupervised and supervised learning algorithms for contact prediction. I published my results on human contact prediction at different conferences in the social computing, wireless networks, and complex networks areas. You can find the papers in the publications section of my homepage.

B- Fast Information Spreading Algorithms in Social Networks:

In the second part of my PhD, I worked on the structure of social graphs (e.g. Facebook) to find out how one can speed up the spread of information in social networks. Suppose you are carrying your cellphone and you have an interesting piece of news or a rumor that you'd like to share with your friends. However, you don't want to send the message to everyone because of the costs (e.g. cost per message and battery). The question is: to whom should you send the message to make sure your information spreads through your network as fast and as far as possible?

This project required both devising efficient algorithms for information spreading and developing mathematical models that explain the structure of the Facebook social graph. In the end, I found an onion-like core-periphery structure in the Facebook graph, which in turn led us to fast strategies for information spreading. Please see my PhD thesis for more details. This problem has many interesting applications, especially in the area of social advertising. Based on my PhD results, I provided some consulting for a company that was active in social advertising.
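One standard way to expose a layered, onion-like structure in a graph is k-core decomposition, which peels off the lowest-degree nodes layer by layer. The sketch below is only an illustration under that assumption; it is not taken from the thesis, and the `pick_targets` heuristic (forward to the neighbours closest to the core) and the toy graph are hypothetical.

```python
def core_numbers(adj):
    """Return node -> core number for an undirected graph.

    `adj` maps each node to the set of its neighbours. Nodes are peeled
    in order of smallest remaining degree, as in k-core decomposition.
    """
    degree = {node: len(neighbours) for node, neighbours in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        node = min(remaining, key=lambda n: degree[n])
        k = max(k, degree[node])  # core numbers never decrease while peeling
        core[node] = k
        remaining.remove(node)
        for neighbour in adj[node]:
            if neighbour in remaining:
                degree[neighbour] -= 1
    return core


def pick_targets(adj, core, sender, budget):
    """Hypothetical heuristic: message the `budget` neighbours nearest the core."""
    ranked = sorted(adj[sender], key=lambda n: (-core[n], n))
    return ranked[:budget]


# Toy graph: a triangle (a, b, c) with one peripheral node d attached to a.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
cores = core_numbers(graph)
```

In this toy graph the triangle forms the inner layer (core number 2) and `d` sits on the periphery (core number 1), so a sender with a budget of one message would forward to a triangle member rather than to `d`.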

Industry Projects:

A- Machine Learning for Computational Advertising:

Since September 2012 I've been consulting for a few ad tech companies in the area of online display advertising. One company's business model was to select the best-converting ads and show them to people browsing the web in order to generate revenue. My task was to design and tune their advertising algorithms and improve the accuracy of choosing the best ads from a large set and showing them to users. This was formulated as a machine learning/optimization problem. The technical part of my work was to analyze and mine a large volume of historical ad impression data in order to test different hypotheses and ideas, which allowed us to find predictive ad features as well as tune the ad selection strategy. The core of the algorithm was based on the multi-armed bandit problem, a classical machine learning problem.
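To make the multi-armed bandit idea concrete, here is a minimal epsilon-greedy sketch in Python. Epsilon-greedy is just one textbook bandit strategy and is not necessarily the one used in production; the class name, the `simulate` helper, and the conversion rates in the usage comment are all assumptions for illustration.

```python
import random


class EpsilonGreedyAdSelector:
    """Mostly show the best-converting ad, but explore a random ad
    with probability `epsilon`."""

    def __init__(self, n_ads, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.shows = [0] * n_ads
        self.conversions = [0] * n_ads
        self.rng = random.Random(seed)

    def rate(self, ad):
        """Observed conversion rate of an ad (0.0 if never shown)."""
        return self.conversions[ad] / self.shows[ad] if self.shows[ad] else 0.0

    def select(self):
        """Choose the index of the next ad to show."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.shows))  # explore
        return max(range(len(self.shows)), key=self.rate)  # exploit

    def update(self, ad, converted):
        """Record one impression and whether it converted."""
        self.shows[ad] += 1
        self.conversions[ad] += int(converted)


def simulate(selector, true_rates, rounds, seed=1):
    """Feed the selector synthetic conversions drawn from assumed true rates."""
    rng = random.Random(seed)
    for _ in range(rounds):
        ad = selector.select()
        selector.update(ad, rng.random() < true_rates[ad])
    return selector.shows


# Example: simulate(EpsilonGreedyAdSelector(3), [0.02, 0.05, 0.10], 1000)
```

The trade-off being tuned is exploration versus exploitation: a larger `epsilon` learns about under-shown ads faster, while a smaller one spends more impressions on the ad that currently looks best.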

B- Machine Learning in Natural Language Processing Domain:

In January 2013, I joined a company that was active in the area of machine learning and natural language processing. I was involved in several natural language processing projects in which we implemented and tuned machine learning algorithms for running different types of analysis on text. The first project was to implement a revised version of the Latent Dirichlet Allocation (LDA) algorithm, an unsupervised Bayesian approach for analyzing a collection of texts and extracting topics from them.

The second project was to design and implement a medical entity extractor that processes medical texts and extracts different entities such as "DISEASE" and "TREATMENT". We implemented revised versions of Hidden Markov Models (HMMs) and Conditional Random Fields (CRFs) to extract medical entities. The most interesting part of our work was that we modified the HMM in order to deal with the problem of rarity, which is a well-known problem in the machine learning and data mining domains.
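For readers unfamiliar with HMM-based tagging, the sketch below shows a textbook HMM tagger with Viterbi decoding in Python. It is not the modified model described above: the add-alpha smoothing used here is just one common way to cope with rare or unseen words, and the tiny training set, tag names, and function names are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


def train_hmm(tagged_sentences, alpha=1.0):
    """Estimate add-alpha smoothed log-probabilities from (word, tag) sentences."""
    trans = defaultdict(Counter)  # previous tag -> counts of next tag
    emit = defaultdict(Counter)   # tag -> counts of emitted word
    vocab = set()
    for sentence in tagged_sentences:
        prev = "<s>"  # sentence-start marker
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    tags = sorted(emit)

    def log_trans(prev, tag):
        total = sum(trans[prev].values())
        return math.log((trans[prev][tag] + alpha) / (total + alpha * len(tags)))

    def log_emit(tag, word):
        total = sum(emit[tag].values())
        # The extra +1 reserves probability mass for words unseen in training.
        return math.log((emit[tag][word] + alpha) / (total + alpha * (len(vocab) + 1)))

    return tags, log_trans, log_emit


def viterbi(words, tags, log_trans, log_emit):
    """Return the most likely tag sequence for `words`."""
    if not words:
        return []
    best = {t: log_trans("<s>", t) + log_emit(t, words[0]) for t in tags}
    back = []  # one dict of back-pointers per word after the first
    for word in words[1:]:
        new_best, pointers = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: best[p] + log_trans(p, t))
            new_best[t] = best[prev] + log_trans(prev, t) + log_emit(t, word)
            pointers[t] = prev
        best = new_best
        back.append(pointers)
    tag = max(tags, key=lambda t: best[t])
    path = [tag]
    for pointers in reversed(back):
        tag = pointers[tag]
        path.append(tag)
    return path[::-1]


# Hypothetical toy training data; "O" marks words outside any entity.
training = [
    [("patient", "O"), ("has", "O"), ("diabetes", "DISEASE")],
    [("patient", "O"), ("takes", "O"), ("insulin", "TREATMENT")],
    [("asthma", "DISEASE"), ("needs", "O"), ("inhaler", "TREATMENT")],
]
tags, log_trans, log_emit = train_hmm(training)
predicted = viterbi(["patient", "has", "diabetes"], tags, log_trans, log_emit)
```

Working in log-probabilities avoids numerical underflow on long sentences, and the smoothing keeps the tagger from assigning zero probability to a word it has never seen with a given tag.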

C- Machine Learning in Fraud Detection:

In September 2013, I joined a startup active in the online identity verification domain. My role there was to design and implement predictive models that take people's social data from different online sources, such as Facebook, Twitter, Google+, LinkedIn, and email, as input and generate an authenticity score reflecting how authentic that digital identity is. Using these predictive models we could detect fake or fraudulent online identities, which can cost companies millions of dollars. During my time there we scored over 1.3 billion Facebook profiles, 120M Twitter profiles, and over 200M emails and Google+ profiles!

Side Research Projects:

Predicting Election Results and Flu:

As side projects, I was also involved in a few interesting problems that required collecting a large number of tweets, cleaning them to separate the signal from the noise, and running different machine learning and data mining algorithms to test whether we could extract interesting, meaningful patterns from them. In particular, I worked on two projects: (I) mining tweets to predict the 2012 US presidential election results and (II) analyzing tweets to predict flu in Spring 2013. Please check the following links for more details:

Last Words

In summary, a data scientist needs to start with some hypothesis in mind, based on theory or intuition. Next, they should analyze data to validate that hypothesis. If they find interesting patterns in the data, they can turn those observations into a prediction algorithm or incorporate them into existing algorithms to improve prediction results or the quality of the service provided to customers. This process is iterative and should be repeated to improve the predictive models over time. It also requires the following skills:

  • Background in probability and statistics
  • Experience with machine learning and data mining
  • Data structures and algorithms
  • Proficiency in programming languages such as Python, C++, and Java
  • Knowledge of MySQL and other databases
  • Familiarity with Hadoop, Mahout, and Spark, which can come in handy for analyzing big data
  • Knowing how to visualize data and gain insight through visualization

Please share your experience and thoughts on the data scientist role with us in the comment section. We would like to hear from you!