Advanced Machine Learning, Data Mining, and Online Advertising Algorithms
In this article we describe a real-time application that measures how healthy your neighborhood is by analyzing the content of tweets.
In January 2013, we attended a Mozilla hackathon in Vancouver. At this event, Mozilla demoed its mobile operating system, Firefox OS, and the development framework for building mobile apps for it. After the event, Mozilla released Firefox phones that shipped with this OS.
In Fall 2012, there was a flu outbreak in the US and Canada, and a few people died from it. We thought it would be interesting to give people a mobile/web application that tells them the likelihood of catching the flu when they travel or visit different places in a city. We can model the spread of the flu as a random process in which a person has an unknown probability p of catching the flu when they come into close proximity with an infected person. Thus, the geographical distribution of sick people around you plays an important role in the probability that you catch the flu. The interesting question, then, is where to collect data about people who have the flu. We found a few research papers in which researchers used Twitter data for this purpose. In fact, we noticed that some people tweet when they catch the flu! Here are a few examples of tweets that people posted on Twitter:
|Tweet no||Tweet content||Probability|
|Tweet#1||"If I'm coming down with the flu, lord jesus have mercy on my soul and give that sickness to someone else"||p=0.70|
|Tweet#2||"Definitely coming down with the flu everyone's had :("||p=0.9|
We started collecting tweets in which people mentioned flu-related words. The first technical task was to use ML/NLP models to analyze the tweet content and distinguish people who really have the flu from those who merely tweet news about it. We trained a Naive Bayes classifier and used it to predict the chance that a tweet was posted by a person who really has the flu. For Tweet#1 and Tweet#2 in the previous table, the classifier computed 70% and 90%, respectively, as the chance that they were posted by sick people. This analysis is not trivial, however, because people can freely tweet in various ways. A few interesting examples that highlight the point are shown below:
|Tweet no||Tweet content||Probability|
|Tweet#3||"Seeing the Patriots score makes me sick and want to vomit."||p=0.03|
|Tweet#4||"The great Boston influenza scare of 2012... #really? http://t.co/tjWfQ4SA"||p=0.82|
|Tweet#5||"I'm praying that he don't have the flu"||p=0.97|
|Tweet#6||"@pegs_hanson @scascum yeh mate got back yest, we had a sick trip bitches!!! Yeh I'm keen for dinner during the week. Wednesday?............."||p=0.97|
As we see, Naive Bayes correctly classifies Tweet#3 as negative (it assigns a low probability of 3%). However, the classifier misclassifies the other tweets. Although Tweet#4 contains the word "influenza", the tweet is just a news link shared by the author and doesn't imply that the author is sick. Tweet#5 also mentions the word "flu", but the author is concerned about someone else who might have the flu, not about himself/herself. Finally, Tweet#6 contains the word "sick", but the author uses it with a completely different meaning (a positive sentiment here).
The examples above show that the Naive Bayes classifier performs moderately well after being trained on manually labelled tweets. However, it fails on more intricate examples because it doesn't take into account the dependency between words inside a tweet (i.e. the context). In particular, to improve its performance, we need to augment the ML classifier's results with the part of speech (POS) information returned by a natural language parser. With POS information, one can correct the ML mistake for Tweet#4 because "influenza" is linked with the city of Boston and doesn't relate to the tweet's author. The same holds for Tweet#5, where "flu" relates to "he", not the author. Finally, in Tweet#6 the POS of the word "sick" is adjective, and it qualifies the word "trip". Therefore, assuming that we have a good parser that can generate POS tags for tweets, we would be able to fix those "False Positive" examples!
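The context problem described above can be reproduced with a small bag-of-words Naive Bayes sketch. This is not the classifier we trained for the app: the class name `NaiveBayes`, the `tokenize` helper, the labels, and the four training tweets are all made up for illustration. It is a minimal multinomial model with add-one smoothing, and it shows that the score depends only on which words appear, not on how they relate to each other.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a tweet and split it into word tokens (bag of words)."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of word frequencies
        self.doc_counts = Counter()  # label -> number of training tweets
        self.vocab = set()

    def train(self, labelled_tweets):
        for text, label in labelled_tweets:
            self.doc_counts[label] += 1
            counts = self.word_counts.setdefault(label, Counter())
            for word in tokenize(text):
                counts[word] += 1
                self.vocab.add(word)

    def predict_proba(self, text, label):
        """Return P(label | words), treating words as independent given the label."""
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for lab, counts in self.word_counts.items():
            score = math.log(self.doc_counts[lab] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            log_scores[lab] = score
        # Normalise in log space to avoid numerical underflow.
        top = max(log_scores.values())
        exp_scores = {lab: math.exp(s - top) for lab, s in log_scores.items()}
        return exp_scores[label] / sum(exp_scores.values())


# Made-up training tweets, for illustration only.
TRAINING = [
    ("i have the flu and feel sick", "sick"),
    ("coming down with the flu", "sick"),
    ("flu season news report from the city", "not_sick"),
    ("that game was great", "not_sick"),
]

model = NaiveBayes()
model.train(TRAINING)
```

Calling `model.predict_proba("he has the flu not me", "sick")` and `model.predict_proba("me not flu the has he", "sick")` gives the same score up to floating-point rounding, because word order and the relations between words never enter the computation. That is exactly why tweets such as Tweet#5 end up as false positives.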
After classifying tweets in real time, we built a web application that consumes the data classified by Naive Bayes and computes the probability that a person will catch the flu in the near future, taking into account the geographical distribution of sick people around him/her. We built this app during the Mozilla event, and we won three Firefox phones! You can check out our web app at the following link: Starling Flu Predictor Web App. You can download the Starling flu predictor source code from its GitHub repository: Starling flu predictor.
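One simple way to turn the random-process model into a number is sketched below. It is not the code in the Starling repository, and it rests on assumptions that the article does not state: each nearby tweeter transmits the flu independently, the transmission probability `p` and the search radius `radius_km` are hypothetical placeholder values, and each tweet is weighted by its classifier probability `q`. Under those assumptions, the chance of catching the flu is one minus the chance of escaping every nearby contact.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def flu_risk(user, tweets, p=0.05, radius_km=2.0):
    """Probability of at least one transmission from nearby tweeters.

    user:   (lat, lon) of the person we compute the risk for
    tweets: iterable of (lat, lon, q), where q is the classifier's
            probability that the tweet's author is sick
    p, radius_km: hypothetical values, not taken from the original app
    """
    stay_healthy = 1.0
    for lat, lon, q in tweets:
        if haversine_km(user[0], user[1], lat, lon) <= radius_km:
            # Escape this contact with probability 1 - p * q.
            stay_healthy *= 1.0 - p * q
    return 1.0 - stay_healthy
```

With no classified tweets inside the radius the risk is zero, and each additional nearby tweet with a high classifier probability pushes it up, which matches the intuition that the geographical distribution of sick people drives the result.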
Google also has a web app for predicting the flu. It is based on the terms people search for (e.g. flu-related words) using the Google search engine, and it uses those search-term signals to predict a flu outbreak: Google Flu Trends App. However, recent work has pointed out a flaw in the Google Flu Trends model: Disruptions: Data Without Context Tells a Misleading Story
One of the challenges with Google Flu Trends, as the article above points out, is that it doesn't consider the context of a search. People may search for the flu simply because everybody is talking about it, and that doesn't necessarily mean that they have the flu! We think Twitter doesn't have the same problem, because people explicitly tweet about their lives and experiences. In particular, they may tweet when they have the flu. However, Twitter users might not be a representative sample of the whole population, because Twitter is mainly used by younger and more tech-savvy people.
Another interesting and related problem is to collect Twitter data and test if we can detect social cascade behaviors early enough (social signals). Assuming that innovators and early adopters tweet about new ideas/events/products, one might be able to collect indicator signals before those behaviors, opinions, and technologies become epidemic.
The true positive signals about an epidemic behavior can clearly be used to give recommendations to marketers on eBay! Through a colleague who works for Terapeak, we got access to eBay sales data for 2012/2013. With this data, we tested the correlation between the eBay sales rate of flu-related medicines and the flu rate computed from our Twitter data. We wrote an article about tracking Twitter trends to find out what to sell on eBay, which was published on Terapeak's blog. You can find the article at the following link: Tracking Social Media Trends and Their Influence on E-Commerce Markets.
Twitter is a rich, real-time communication channel that can provide insights if one taps into its tweets. One can think of many web/mobile applications that extract interesting social signals from Twitter and predict the near future! Please contact us if you have any questions or comments. We would love to hear from you!
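A correlation test of this kind can be sketched with the Pearson coefficient. The article does not say which correlation measure was actually used, so the choice of Pearson and the function name `pearson` are assumptions. The inputs would be two equally long series, for example weekly sales counts of flu-related medicines and weekly flu rates computed from tweets.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of the same length (at least 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)
```

A value close to 1 would mean the two series rise and fall together, while a value near 0 would mean no linear relationship. Shifting one series by a week or two before calling `pearson` is one way to check whether the Twitter signal leads the sales signal.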