Advanced Machine Learning, Data Mining, and Online Advertising Services

In the last few years, I have been doing an extensive industrial research in the area of computational advertising for a few advertising companies. My focus is on designing/developing algorithms to find & show ads with the highest conversion rates to online web users. I formulated ad-selection problem as a **Multi-Armed Bandit problem** which is a classical paradigm in Machine Learning. I've been applying machine learning, data mining, probability and statistics in order to analyze big data in the ad-tech space and devise efficient ad selection strategies. This article highlights some of my findings in the area of computational advertising.

In our model we assume that there is a finite set of offers where denotes offer with index . We model each offer's payoff with an unknown distribution with an unknown expected value . For our first set of analysis we assume that each offer's payoff has a **Bernoulli distribution**. We also assume that the advertiser has allocated a budget for her advertising campaign. In words, the advertiser can buy at most number of impressions.

The goal of an advertising campaign is to maximize the revenue from displaying ads. Here, we assume that the advertiser makes $1 revenue every time an offer is clicked by a web user. Otherwise, there is no revenue! Solving this problem is closely related to the well-known Multi-Armed Bandit problem. In both settings, the player/advertiser doesn't have any prior knowledge about how much revenue she can make by displaying each ad. Moreover, in both problems there is a trade-off between exploring phase where the goal is to collect statistical information from ads performance and exploitation phase in which one sticks to the best known ad so far!

There are a few algorithms to solve the described optimization problem. **Epsilon-Greedy** is one of the basic strategies for optimizing ads in which with probability one chooses a random ad from a set of possible ads and with probability of one displays the ad with the best expected revenue. It's important to choose such that it maximizes our campaign revenue. There is an important performance factor called **average regret** computed as follows:

, where denotes the offer with the highest acceptance probability and denotes our collected revenue in trial . The expected regret value shows the performance difference between our employed strategy and the optimal strategy. Thus, a low expected regret value is desirable. Here, we are interested in the **average regret per offer** and benchmark different algorithms by comparing their average regret per offer over time.

One of the main challenges in ad optimization is to find the balance between **exploring** the space (e.g. existing arm space) to find the best arm and **exploiting** the current knowledge by playing the best arm found so far as often as possible. The *total regret* is a measure to compare the performance of different algorithms. Lai and Robbins are the first researchers who showed that the total regret for MAB problem has to grow at least *logarithmically* in the number of plays [1]. They introduced a class of policies in which they computed an *Upper Confidence Index* for each arm based on the previous reward observations for that arm. The computed index estimates the true expectation reward for that arm. Later, Agrawl introduced a set of policies where the index function can be expressed in the form of simple functions and computed more easily than Lai and Robbins's [2]. The regret for classes of policies introduced by Agrawal attains the optimal logarithmic behavior.

In Auer's paper, he formulated a basic version of -armed bandit problem in which successive plays of arm generates rewards , , ... [3]. The generated rewards for each arm (i.e. ) are assumed to be independent and identically distributed according to an unknown law with an unknown expectation . He also assumed that independence holds between different arms. In other words, (i.e. reward attained from playing arm in round ) is independent from while they are identically distributed where and . Auer showed that a simple algorithm called **UCB1** achieves a logarithmic regret uniformly over and without any preliminary knowledge about the reward distributions. For initialization step, *UCB1* algorithm plays each machine once. Then, in subsequent rounds *UCB1* plays the machine which maximizes its index function computed as follows:

, where is the average reward obtained from machine , is number of times machine has been played so far, and is the total number of plays so far. As above equation shows, the index of *UCB1* policy has two terms. The first term is simply the current average reward. The second term of the index is related to the size of (according to *Chernoff-Hoeffding bounds*) the one sided confidence interval for the average reward within which the true expected reward falls with high probability. Aurer et al. showed that if one runs *UCB1* with machines with arbitrary reward distributions with support in range , the total expected regret after number of plays is at most [3].

In the third approach, we use a simple Bayesian framework. We model number of successes in showing an ad times with the number of successes in **Bernoulli** trials with an unknown conversion probability . In Bayesian inference, the beta distribution is the conjugate prior probability distribution for the Bernoulli. We use Beta function in order to encode our observations and model probability distribution of q as below:

So, we model the probability distribution for each ad's payoff with a *Beta* distribution. For offer , at time step we use number of plays and payouts for the offer to model the uncertainty about that offer's payoff and compute Beta parameters as follows:

At each time step, we display the ad with the *maximum expected likelihood* by sampling from above distributions for each ad!

For our analysis, we simulate an advertising campaign with ads where the value of each ad is $1 if clicked. For modeling ads, we use two simple distributions. First, we model ads' payoffs with a **uniform** distribution. We choose for number of ads and run our ads times where . The goal here is to see how different algorithms perform by comparing their average regret per offer. The following figure shows the results when ads payoffs modelled with the uniform distribution.

In reality, we expect to have only a few ads with high payoff while most of ads have low payoffs. For modeling such settings, we use **beta** distribution where and . The following figure shows the results for beta distribution while other setting parameters are the same as before.

Our results show that in both secenarios **UCB1** beats **Epsilon-Greedy** while **Bayesian MAB** outperforms the other two.

- T. Lai, and H. Robbins, Asymptotically efficient adaptive allocation rules, Advances Applied Mathematics, 1985, pp. 4--22
- R. Agrawal, Sample mean based index policies with O(log n) regret for the multi-armed bandit problem, Advances in Applied Probability, 27, 1995, pp. 1054--1078.
- PETER AUER, NICOLO CESA-BIANCHI, and PAUL FISCHER, Finite-time Analysis of the Multiarmed Bandit Problem, Machine Learning, 47, 2002, pp. 235--256.