A.I. & Optimization

Advanced Machine Learning, Data Mining, and Online Advertising Services

Using Online Learning Algorithms for Computational Advertising


In the last few years, I have been doing an extensive industrial research in the area of computational advertising for a few advertising companies. My focus is on designing/developing algorithms to find & show ads with the highest conversion rates to online web users. I formulated ad-selection problem as a Multi-Armed Bandit problem which is a classical paradigm in Machine Learning. I've been applying machine learning, data mining, probability and statistics in order to analyze big data in the ad-tech space and devise efficient ad selection strategies. This article highlights some of my findings in the area of computational advertising.

Advertising Model

In our model we assume that there is a finite set of offers where denotes offer with index . We model each offer's payoff with an unknown distribution with an unknown expected value . For our first set of analysis we assume that each offer's payoff has a Bernoulli distribution. We also assume that the advertiser has allocated a budget for her advertising campaign. In words, the advertiser can buy at most number of impressions.

The goal of an advertising campaign is to maximize the revenue from displaying ads. Here, we assume that the advertiser makes $1 revenue every time an offer is clicked by a web user. Otherwise, there is no revenue! Solving this problem is closely related to the well-known Multi-Armed Bandit problem. In both settings, the player/advertiser doesn't have any prior knowledge about how much revenue she can make by displaying each ad. Moreover, in both problems there is a trade-off between exploring phase where the goal is to collect statistical information from ads performance and exploitation phase in which one sticks to the best known ad so far!

Las Vegas slot machines
Fig 1. - slot machines (source: wikipedia.org)

Epsilon-Greedy Algorithm

There are a few algorithms to solve the described optimization problem. Epsilon-Greedy is one of the basic strategies for optimizing ads in which with probability one chooses a random ad from a set of possible ads and with probability of one displays the ad with the best expected revenue. It's important to choose such that it maximizes our campaign revenue. There is an important performance factor called average regret computed as follows:

, where denotes the offer with the highest acceptance probability and denotes our collected revenue in trial . The expected regret value shows the performance difference between our employed strategy and the optimal strategy. Thus, a low expected regret value is desirable. Here, we are interested in the average regret per offer and analyze different algorithms by comparing their average regret per offer over time.

UCB1 Algorithm: Index-based Policies

One of the main challenges in ad optimization is to find the balance between exploring the space (e.g. existing arm space) to find the best arm and exploiting the current knowledge by playing the best arm found so far as often as possible. The total regret is a measure to compare the performance of different algorithms. Lai and Robbins are the first researchers who showed that the total regret for MAB problem has to grow at least logarithmically in the number of plays [1]. They introduced a class of policies in which they computed an Upper Confidence Index for each arm based on the previous reward observations for that arm. The computed index estimates the true expectation reward for that arm. Later, Agrawl introduced a set of policies where the index function can be expressed in the form of simple functions and computed more easily than Lai and Robbins's [2]. The regret for classes of policies introduced by Agrawal attains the optimal logarithmic behavior.

In Auer's paper, he formulated a basic version of -armed bandit problem in which successive plays of arm generates rewards , , ... [3]. The generated rewards for each arm (i.e. ) are assumed to be independent and identically distributed according to an unknown law with an unknown expectation . He also assumed that independence holds between different arms. In other words, (i.e. reward attained from playing arm for time) is independent from while they are identically distributed where and . Auer showed that a simple algorithm called UCB1 achieves a logarithmic regret uniformly over and without any preliminary knowledge about the reward distributions. For initialization step, UCB1 algorithm plays each machine once. Then, in subsequent rounds UCB1 plays the machine which maximizes its index function computed as follows:

, where is the average reward obtained from machine , is number of times machine has been played so far, and is the total number of plays so far. As above equation shows, the index of UCB1 policy has two terms. The first term is simply the current average reward. The second term of the index is related to the size of (according to Chernoff-Hoeffding bounds) the one sided confidence interval for the average reward within which the true expected reward falls with high probability. Aurer et al. showed that if one runs UCB1 with machines with arbitrary reward distributions with support in range , the total expected regret after number of plays is at most [3].

Bayesian MAB Algorithm

In the third approach, we use a simple Bayesian framework. We model number of successes in showing an ad times with the number of successes in Bernoulli trials with an unknown conversion probability . In Bayesian inference, the beta distribution is the conjugate prior probability distribution for the Bernoulli. We use Beta function in order to encode our observations and model probability distribution of q as below:

So, we model the probability distribution for each ad's payoff with a Beta distribution. For offer , at time step we use number of plays and payouts for the offer to model the uncertainty about that offer's payoff and compute Beta parameters as follows:

At each time step, we display the ad with the maximum expected likelihood by sampling from above distributions for each ad!

Performance Analysis

For our analysis, we simulate an advertising campaign with ads where the value of each ad is $1 if clicked. For modeling ads, we use two simple distributions. First, we model ads' payoffs with a uniform distribution. We choose for number of ads and run our ads times where . The goal here is to see how different algorithms perform by comparing their average regret per offer. The following figure shows the results when ads payoffs modelled with the uniform distribution.

Uniform probability distribution

In reality, we expect to have only a few ads with high payoff while most of ads have low payoffs. For modeling such settings, we use beta distribution where and . The following figure shows the results for beta distribution while other setting parameters are the same as before.

Beta probability distribution


Our results show that in both secenarios UCB1 beats Epsilon-Greedy while Bayesian MAB outperforms others.


  1. T. Lai, and H. Robbins, Asymptotically efficient adaptive allocation rules, Advances Applied Mathematics, 1985, pp. 4--22
  2. R. Agrawal, Sample mean based index policies with O(log n) regret for the multi-armed bandit problem, Advances in Applied Probability, 27, 1995, pp. 1054--1078.
  3. PETER AUER, NICOLO CESA-BIANCHI, and PAUL FISCHER, Finite-time Analysis of the Multiarmed Bandit Problem, Machine Learning, 47, 2002, pp. 235--256.