{"id":3363,"date":"2021-02-28T16:00:52","date_gmt":"2021-02-28T16:00:52","guid":{"rendered":"https:\/\/thenextweb.com\/?p=1340644"},"modified":"2021-02-28T16:00:52","modified_gmt":"2021-02-28T16:00:52","slug":"most-ads-you-see-are-chosen-by-a-reinforcement-learning-model-heres-how-it-works","status":"publish","type":"post","link":"https:\/\/www.londonchiropracter.com\/?p=3363","title":{"rendered":"Most ads you see are chosen by a reinforcement learning model \u2014 here\u2019s how it works"},"content":{"rendered":"\n<p>Every day, digital advertisement agencies serve billions of ads on news websites, search engines, social media networks, video streaming websites, and other platforms. And they all want to answer the same question: Which of the many ads they have in their catalog is more likely to appeal to a certain viewer? Finding the right answer to this question can have a huge impact on revenue when you are dealing with hundreds of websites, thousands of ads, and millions of visitors.<\/p>\n<p>Fortunately (for the ad agencies, at least),<span>&nbsp;<\/span><a href=\"https:\/\/bdtechtalks.com\/2019\/05\/28\/what-is-reinforcement-learning\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">reinforcement learning<\/a>, the branch of artificial intelligence that has become renowned for<span>&nbsp;<\/span><a href=\"https:\/\/bdtechtalks.com\/2018\/07\/02\/ai-plays-chess-go-poker-video-games\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">mastering board and video games<\/a>, provides a solution. Reinforcement learning models seek to maximize rewards. 
In the case of online ads, the RL model will try to find the ad that users are most likely to click on.<\/p>\n<p>The digital ad industry generates hundreds of billions of dollars every year and provides an interesting case study of the powers of reinforcement learning.<\/p>\n<h2>Na\u00efve A\/B\/n testing<\/h2>\n<p>To better understand how reinforcement learning optimizes ads, consider a very simple scenario: You\u2019re the owner of a news website. To pay for the costs of hosting and staff, you have entered a contract with a company to run their ads on your website. The company has provided you with five different ads and will pay you one dollar every time a visitor clicks on one of the ads.<\/p>\n<p>Your first goal is to find the ad that generates the most clicks. In advertising lingo, you will want to maximize your click-through&nbsp;rate (CTR). The CTR is the ratio of clicks to the number of times ads are displayed, also called impressions. For instance, if 1,000 ad impressions earn you three clicks, your CTR will be 3 \/ 1,000 = 0.003, or 0.3%.<\/p>\n<p>Before we solve the problem with reinforcement learning, let\u2019s discuss A\/B testing, the standard technique for comparing the performance of two competing solutions (A and B) such as different webpage layouts, product recommendations, or ads. When you\u2019re dealing with more than two alternatives, it is called A\/B\/n testing.<\/p>\n<p>In A\/B\/n testing, the experiment\u2019s subjects are randomly divided into separate groups and each is provided with one of the available solutions. 
In our case, this means that we will randomly show one of the five ads to each new visitor of our website and evaluate the results.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img src=\"https:\/\/i0.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2021\/02\/normal-distribution.jpg?resize=696%2C392&amp;ssl=1\" alt=\"normal distribution\" width=\"696\" height=\"392\"><\/figure>\n<\/div>\n<p>Say we run our A\/B\/n test for 100,000 iterations, roughly 20,000 impressions per ad. Here are the clicks-over-impressions ratios of our ads:<\/p>\n<p>Ad 1: 80\/20,000 = 0.40% CTR<\/p>\n<p>Ad 2: 70\/20,000 = 0.35% CTR<\/p>\n<p>Ad 3: 90\/20,000 = 0.45% CTR<\/p>\n<p>Ad 4: 62\/20,000 = 0.31% CTR<\/p>\n<p>Ad 5: 50\/20,000 = 0.25% CTR<\/p>\n<p>Our 100,000 ad impressions generated $352 in revenue with an average CTR of 0.35%. More importantly, we found out that ad number 3 performs better than the others, and we will continue to use that one for the rest of our viewers. With the worst performing ad (ad number 5), our revenue would have been $250. 
With the best performing ad (ad number 3), our revenue would have been $450. So our A\/B\/n test earned us roughly the average of the minimum and maximum possible revenue, but it also yielded the very valuable knowledge of each ad\u2019s CTR.<\/p>\n<p>Digital ads have very low click-through rates. In our example, there\u2019s a subtle 0.2% difference between our best- and worst-performing ads. But this difference has a significant impact at scale. At 1,000 impressions, ad number 3 will generate an extra $2 in comparison to ad number 5. At a million impressions, this difference grows to $2,000. When you\u2019re running billions of ads, a subtle 0.2% can have a huge impact on revenue.<\/p>\n<p>Therefore, finding these subtle differences is very important in ad optimization. The problem with A\/B\/n testing is that it is not very efficient at finding them. It treats all ads equally, and you need to run each ad tens of thousands of times before you can measure their differences at a reliable confidence level. This can result in lost revenue, especially when you have a large catalog of ads.<\/p>\n<p>Another problem with classic A\/B\/n testing is that it is static. Once you find the optimal ad, you have to stick to it. If the environment changes due to a new factor (seasonality, news trends, etc.) and causes one of the other ads to develop a higher CTR, you won\u2019t find out unless you run the A\/B\/n test all over again.<\/p>\n<p>What if we could change A\/B\/n testing to make it more efficient and dynamic?<\/p>\n<p>This is where reinforcement learning comes into play. A reinforcement learning agent starts by knowing nothing about its environment\u2019s actions, rewards, and penalties, and must find a way to maximize its rewards.<\/p>\n<p>In our case, the RL agent\u2019s action is choosing one of the five ads to display. The RL agent will receive a reward point every time a user clicks on an ad. 
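The A\/B\/n arithmetic above is easy to verify with a short script. This is a minimal sketch using the hypothetical click and impression figures from the example (the dictionary and helper names are illustrative, not from any ad-serving library):

```python
# Hypothetical A/B/n results from the example above: ad number -> (clicks, impressions)
results = {
    1: (80, 20_000),
    2: (70, 20_000),
    3: (90, 20_000),
    4: (62, 20_000),
    5: (50, 20_000),
}

def ctr(clicks, impressions):
    """Click-through rate: clicks divided by impressions."""
    return clicks / impressions

# The ad with the highest observed CTR wins the A/B/n test.
best_ad = max(results, key=lambda ad: ctr(*results[ad]))

# At one dollar per click, revenue equals total clicks.
revenue = sum(clicks for clicks, _ in results.values())

print(best_ad)  # 3
print(revenue)  # 352
```

Running it confirms the numbers in the text: ad number 3 wins with a 0.45% CTR, and the 100,000 impressions earn $352.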
It must find a way to maximize ad clicks.<\/p>\n<h2>The multi-armed bandit<\/h2>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img src=\"https:\/\/i2.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2021\/02\/multi-armed-bandit.jpg?resize=696%2C392&amp;ssl=1\" alt=\"multi-armed bandit\" width=\"696\" height=\"392\"><figcaption>The multi-armed bandit must find ways to discover one of several solutions through trial and error<\/figcaption><\/figure>\n<\/div>\n<p>In some reinforcement learning environments, actions are evaluated in sequences. For instance, in video games, you must perform a series of actions to reach the reward, which is finishing a level or winning a match. But when serving ads, the outcome of every ad impression is evaluated independently; it is a single-step environment.<\/p>\n<p>To solve the ad optimization problem, we\u2019ll use a \u201cmulti-armed bandit\u201d (MAB), a reinforcement learning algorithm that is suited to single-step problems. The name of the multi-armed bandit comes from an imaginary scenario in which a gambler is standing at a row of slot machines. The gambler knows that the machines have different win rates, but he doesn\u2019t know which one provides the highest reward.<\/p>\n<p>If he sticks to one machine, he might miss the chance of selecting the machine with the highest win rate. Therefore, the gambler must find an efficient way to discover the machine with the highest reward without using up too many of his tokens.<\/p>\n<p>Ad optimization is a typical example of a multi-armed bandit problem. In this case, the reinforcement learning agent must find a way to discover the ad with the highest CTR without wasting too many valuable ad impressions on inefficient ads.<\/p>\n<h2>Exploration vs exploitation<\/h2>\n<p>One of the problems every reinforcement learning model faces is the \u201cexploration vs exploitation\u201d challenge. Exploitation means sticking to the best solution the RL agent has so far found. 
Exploration means trying other solutions in hopes of landing on one that is better than the current optimal solution.<\/p>\n<p>In the context of ad selection, the reinforcement learning agent must decide between choosing the best-performing ad and exploring other options.<\/p>\n<p>One solution to the exploitation-exploration problem is the \u201cepsilon-greedy\u201d (\u03b5-greedy) algorithm. In this case, the reinforcement learning model will choose the best solution most of the time, and in a specified percent of cases (the epsilon factor) it will choose one of the ads at random.<\/p>\n<figure class=\"wp-block-image size-large\"><img src=\"https:\/\/i2.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2021\/02\/exploration-vs-exploitation-1024x576.jpg?resize=696%2C392&amp;ssl=1\" alt=\"exploration vs exploitation\" width=\"696\" height=\"392\"><figcaption>Every reinforcement learning algorithm must find the right balance between exploiting optimal solutions and exploring new options<\/figcaption><\/figure>\n<p>Here\u2019s how it works in practice. Say we have an epsilon-greedy MAB agent with the \u03b5 factor set to 0.2. 
This means that the agent chooses the best-performing ad 80% of the time and explores other options 20% of the time.<\/p>\n<p>The reinforcement learning model starts without knowing which of the ads performs better, so it assigns each of them an equal value. When all ads are equal, it will choose one of them at random each time it wants to serve an ad.<\/p>\n<p>After serving 200 ads (40 impressions per ad), a user clicks on ad number 4. The agent adjusts the CTRs of the ads as follows:<\/p>\n<p>Ad 1: 0\/40 = 0.0%<\/p>\n<p>Ad 2: 0\/40 = 0.0%<\/p>\n<p>Ad 3: 0\/40 = 0.0%<\/p>\n<p>Ad 4: 1\/40 = 2.5%<\/p>\n<p>Ad 5: 0\/40 = 0.0%<\/p>\n<p>Now, the agent thinks that ad number 4 is the top-performing ad. For every new ad impression, it will pick a random number between 0 and 1. If the number is above 0.2 (the \u03b5 factor), it will choose ad number 4. If it\u2019s below 0.2, it will choose one of the other ads at random.<\/p>\n<p>Our agent then runs another 200 ad impressions before a second user clicks on an ad, this time on ad number 3. Note that of these 200 impressions, 160 belong to ad number 4, because it was the optimal ad. The rest are equally divided between the other ads, at 10 apiece. Our new CTR values are as follows:<\/p>\n<p>Ad 1: 0\/50 = 0.0%<\/p>\n<p>Ad 2: 0\/50 = 0.0%<\/p>\n<p>Ad 3: 1\/50 = 2.0%<\/p>\n<p>Ad 4: 1\/200 = 0.5%<\/p>\n<p>Ad 5: 0\/50 = 0.0%<\/p>\n<p>Now the optimal ad becomes ad number 3, and it will get 80% of the ad impressions. Let\u2019s say that after another 100 impressions (80 for ad number 3, five for each of the other ads), someone clicks on ad number 2. Here\u2019s what the new CTR distribution looks like:<\/p>\n<p>Ad 1: 0\/55 = 0.0%<\/p>\n<p>Ad 2: 1\/55 = 1.8%<\/p>\n<p>Ad 3: 1\/130 = 0.77%<\/p>\n<p>Ad 4: 1\/205 = 0.49%<\/p>\n<p>Ad 5: 0\/55 = 0.0%<\/p>\n<p>Now, ad number 2 is the optimal solution. As we serve more ads, the CTRs will reflect the real value of each ad. 
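The loop described above can be sketched as a small \u03b5-greedy bandit. This is a minimal illustration, not production ad-serving code; the class name and the hidden "true" CTRs are assumptions for the simulation, and for simplicity the exploration step may also pick the current best ad (a common variant of the scheme described here), with ties broken toward the first ad:

```python
import random

class EpsilonGreedyBandit:
    """Minimal epsilon-greedy multi-armed bandit for ad selection."""

    def __init__(self, n_ads, epsilon=0.2, seed=None):
        self.epsilon = epsilon
        self.clicks = [0] * n_ads
        self.impressions = [0] * n_ads
        self.rng = random.Random(seed)

    def select_ad(self):
        # With probability epsilon, explore a random ad;
        # otherwise exploit the ad with the best CTR observed so far.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.clicks))
        ctrs = [c / i if i else 0.0 for c, i in zip(self.clicks, self.impressions)]
        return ctrs.index(max(ctrs))

    def update(self, ad, clicked):
        # Record the impression and whether the viewer clicked.
        self.impressions[ad] += 1
        self.clicks[ad] += int(clicked)

# Simulate 100,000 impressions against hidden "true" CTRs
# matching the five ads in this example.
true_ctrs = [0.0040, 0.0035, 0.0045, 0.0031, 0.0025]
bandit = EpsilonGreedyBandit(n_ads=5, epsilon=0.2, seed=42)
for _ in range(100_000):
    ad = bandit.select_ad()
    clicked = bandit.rng.random() < true_ctrs[ad]
    bandit.update(ad, clicked)

print(sum(bandit.clicks))   # total revenue in dollars, at $1 per click
print(bandit.impressions)   # the exploited ad dominates the impression counts
```

Because the agent keeps exploring with probability \u03b5, every ad keeps accumulating impressions, so the estimated CTRs slowly converge toward the true ones while most traffic goes to the current best guess.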
The best ad will get the lion\u2019s share of the impressions, but the agent will continue to explore other options. Therefore, if the environment changes and users start to show more positive reactions to a certain ad, the RL agent can discover it.<\/p>\n<p>After running 100,000 ads, our distribution can look something like the following:<\/p>\n<p>Ad 1: 80\/20,000 = 0.40% CTR<\/p>\n<p>Ad 2: 60\/17,000 = 0.35% CTR<\/p>\n<p>Ad 3: 216\/48,000 = 0.45% CTR<\/p>\n<p>Ad 4: 31\/10,000 = 0.31% CTR<\/p>\n<p>Ad 5: 13\/5,000 = 0.26% CTR<\/p>\n<p>With the \u03b5-greedy algorithm, we were able to increase our revenue from $352 to $400 over the same 100,000 ad impressions, for an average CTR of 0.40%. This is a great improvement over the classic A\/B\/n testing model.<\/p>\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio\">\n<p><iframe loading=\"lazy\" src=\"https:\/\/www.youtube.com\/embed\/bkw6hWvh_3k\" width=\"560\" height=\"315\" frameborder=\"0\" allowfullscreen=\"allowfullscreen\">[embedded content]<\/iframe><\/p>\n<\/figure>\n<h2>Improving the \u03b5-greedy algorithm<\/h2>\n<p>The key to the \u03b5-greedy reinforcement learning algorithm is adjusting the epsilon factor. If you set it too low, the agent will exploit the ad it thinks is optimal at the risk of never finding a better one. For instance, in the example we explored above, ad number 4 happens to generate the first click, but in the long run, it doesn\u2019t have the highest CTR. Small sample sizes do not necessarily represent true distributions.<\/p>\n<p>On the other hand, if you set the epsilon factor too high, your RL agent will waste too many resources exploring non-optimal solutions.<\/p>\n<p>One way you can improve the epsilon-greedy algorithm is by defining a dynamic policy. When the MAB model is fresh, you can start with a high epsilon value to do more exploration and less exploitation. 
As your model serves more ads and gets a better estimate of the value of each solution, it can gradually reduce the epsilon value until it reaches a threshold.<\/p>\n<p>In the context of our ad-optimization problem, we can start with an epsilon value of 0.5 and reduce it by 0.01 after every 1,000 ad impressions until it reaches 0.1.<\/p>\n<p>Another way to improve our multi-armed bandit is to put more weight on new observations and gradually reduce the value of older ones. This is especially useful in dynamic environments such as digital ads and product recommendations, where the value of solutions can change over time.<\/p>\n<p>Here\u2019s a very simple way you can do this. The classic way to update the CTR after serving an ad is as follows:<\/p>\n<p><em>(result + past_results) \/ impressions<\/em><\/p>\n<p>Here, <em>result<\/em> is the outcome of the ad displayed (1 if clicked, 0 if not clicked), <em>past_results<\/em> is the cumulative number of clicks the ad has garnered so far, and <em>impressions<\/em> is the total number of times the ad has been served.<\/p>\n<p>To gradually fade old results, we add a new <em>alpha<\/em> factor (between 0 and 1) and make the following change:<\/p>\n<p><em>(result + past_results * alpha) \/ impressions<\/em><\/p>\n<p>This small change gives more weight to new observations. Therefore, if you have two competing ads with an equal number of clicks and impressions, the one whose clicks are more recent will be favored by your reinforcement learning model. 
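Both refinements can be sketched in a few lines. These helpers are illustrative (the function names and default values are assumptions); the schedule mirrors the 0.5-to-0.1 decay described above, and the update mirrors the alpha formula:

```python
def decayed_epsilon(impressions_served, start=0.5, step=0.01, interval=1_000, floor=0.1):
    # Reduce epsilon by `step` after every `interval` impressions,
    # never dropping below `floor`.
    return max(floor, start - step * (impressions_served // interval))

def updated_ctr(result, past_results, impressions, alpha=1.0):
    # result: 1 if the last impression was clicked, else 0.
    # past_results: cumulative (possibly discounted) clicks so far.
    # alpha < 1 fades old clicks so recent observations weigh more;
    # alpha = 1 recovers the classic undiscounted update.
    new_past = result + past_results * alpha
    return new_past, new_past / impressions

print(decayed_epsilon(0))        # 0.5
print(decayed_epsilon(100_000))  # 0.1
```

With alpha set below 1, an ad whose clicks all happened long ago sees its stored click total shrink with every update, so its estimated CTR decays unless fresh clicks keep arriving.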
Also, if an ad had a very high CTR in the past but has become unresponsive in recent times, its value will decline faster in this model, forcing the RL model to move to other alternatives earlier and waste fewer resources on the inefficient ad.<\/p>\n<h2>Adding context to the reinforcement learning model<\/h2>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img src=\"https:\/\/i2.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2021\/02\/contextual-bandit.jpg?resize=696%2C387&amp;ssl=1\" alt=\"contextual bandit\" width=\"696\" height=\"387\"><figcaption>Contextual bandits use function approximation to factor in the individual characteristics of ad viewers<\/figcaption><\/figure>\n<\/div>\n<p>In the age of the internet, websites, social media, and mobile apps have <a href=\"https:\/\/bdtechtalks.com\/2020\/04\/17\/what-is-browser-fingerprinting\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">plenty of information on every single user<\/a>, such as their geographic location, device type, and the exact time of day they\u2019re viewing the ad. Social media companies have even more information about their users, including age and gender, friends and family, the type of content they have shared in the past, the type of posts they liked or clicked on in the past, and more.<\/p>\n<p>This rich information gives these companies the opportunity to personalize ads for each viewer. But the multi-armed bandit model we created in the previous section shows the same ad to everyone and doesn\u2019t take the specific characteristics of each viewer into account. What if we wanted to add context to our multi-armed bandit?<\/p>\n<p>One solution is to create several multi-armed bandits, each for a specific subset of users. For instance, we can create separate RL models for users in North America, Europe, the Middle East, Asia, Africa, and so on. What if we wanted to also factor in gender? Then we would have one reinforcement learning model for female users in North America, one for male users in North America, one for female users in Europe, one for male users in Europe, etc. 
Now, add age ranges and device types, and you can see that the number of multi-armed bandits quickly explodes, making them hard to train and maintain.<\/p>\n<p>An alternative solution is to use a \u201ccontextual bandit,\u201d an upgraded version of the multi-armed bandit that takes contextual information into account. Instead of creating a separate MAB for each combination of characteristics, the contextual bandit uses \u201c<a href=\"https:\/\/en.wikipedia.org\/wiki\/Function_approximation\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">function approximation<\/a>,\u201d which tries to model the performance of each solution based on a set of input factors.<\/p>\n<p>Without going too much into the details (that could be the subject of another post), our contextual bandit uses <a href=\"https:\/\/bdtechtalks.com\/2020\/02\/10\/unsupervised-learning-vs-supervised-learning\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">supervised machine learning<\/a> to predict the performance of each ad based on location, device type, gender, age, etc. The benefit of the contextual bandit is that it uses one machine learning model per ad instead of creating a MAB per combination of characteristics.<\/p>\n<p>This wraps up our discussion of ad optimization with reinforcement learning. 
The same reinforcement learning techniques can be used to solve many other problems, such as content and product recommendation or dynamic pricing, and are used in other domains such as health care, investment, and network management.<\/p>\n<p><i><span>This article was originally published by&nbsp;<a class=\"author url fn\" title=\"Posts by Ben Dickson\" href=\"https:\/\/bdtechtalks.com\/author\/bendee983\/\" rel=\"nofollow noopener noreferrer\" target=\"_blank\">Ben Dickson<\/a> on <\/span><\/i><a href=\"https:\/\/bdtechtalks.com\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\"><i><span>TechTalks<\/span><\/i><\/a><i><span>, a publication that examines trends in technology, how they affect the way we live and do business, and the problems they solve. But we also discuss the evil side of technology, the darker implications of new tech and what we need to look out for. You can read the original article <a href=\"https:\/\/bdtechtalks.com\/2021\/02\/22\/reinforcement-learning-ad-optimization\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">here<\/a>.<\/span><\/i><\/p>\n<p class=\"c-post-pubDate\"> Published February 28, 2021 \u2014 16:00 UTC <\/p>\n<p> <a href=\"https:\/\/thenextweb.com\/neural\/2021\/02\/28\/how-ads-are-chosen-by-reinforcement-learning-model-syndication\/\">Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Every day, digital advertisement agencies serve billions of ads on news websites, search engines, social media networks, video streaming websites, and other platforms. 
And they all want to answer the same question:&#8230;<\/p>\n","protected":false},"author":1,"featured_media":3364,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[1],"tags":[],"_links":{"self":[{"href":"https:\/\/www.londonchiropracter.com\/index.php?rest_route=\/wp\/v2\/posts\/3363"}],"collection":[{"href":"https:\/\/www.londonchiropracter.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.londonchiropracter.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.londonchiropracter.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.londonchiropracter.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=3363"}],"version-history":[{"count":0,"href":"https:\/\/www.londonchiropracter.com\/index.php?rest_route=\/wp\/v2\/posts\/3363\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.londonchiropracter.com\/index.php?rest_route=\/wp\/v2\/media\/3364"}],"wp:attachment":[{"href":"https:\/\/www.londonchiropracter.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=3363"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.londonchiropracter.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=3363"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.londonchiropracter.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=3363"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}