Why even ask the question? Because ad serving looks like cost-sensitive multiclass classification, and reducing cost-sensitive multiclass classification to regression leads to regret bounds that are worse than reduction to binary classification.
So here's a laundry list of issues I've encountered in the past, how a regression reduction deals with them, and how reduction to binary classification might deal with them.
The Set of Actions is Changing
First, let me say that I've used regression even in cases where the set of actions wasn't really changing that quickly. For instance, I was involved with a domain monetization product where the actions were a list of bidded keywords phrases (monetization was via driving to a search engine results page). Such a list changes infrequently (e.g., monthly) and modestly (not too many ``Lady Gaga''s are made per unit time). So really, I had no excuse there.In the case where the set of actions really does change significantly over time (e.g., contextual targeting of sponsored search advertisements, where new ads appear frequently), it is tempting to think that a regressor trained on previous data would generalize reasonably to a novel instance, after all the new instance will share lots of features with existing instances (e.g., words and word phrases) and will be shown in similar contexts (e.g., web pages). This is tempting, but dangerous. I learned the hard way that one has to be very careful about graduating an action from exploration to exploitation traffic. (``Learning the hard way'' is a nice euphemism for ``I built something that didn't work''). Nonetheless, even acknowledging the care required to move from exploration policy to exploitation policy, it is fair to say that regression makes it easy to ``mix a new action in''.
Given that transition from exploration to exploitation is a controlled process, how might it work in a reduction to binary classification? Some of these reductions are structured as tournaments organized as a binary tree. Consider adding a single action. In that case, one can create a new root node whose children are the old root node and the new action. This new root node essentially has to learn, ``Under what situations should I take the new action, versus doing whatever I would have done before when the new action was not available?'' Building out the tree in this fashion would result in a very unbalanced tree. Adding many actions in one sweep would mitigate the problem a bit, since an entire tree can be stitched under the new root node, but unless the number of actions is doubling this will still lead to lack of balance. However, it could be a good choice as an incremental operation, with |A_{new}| + 1 novel binary classification subproblems to train comprising \log_2 (|A_{new}|) sequential steps.
Another strategy is to add new actions (or delete old actions) at the leaf level. Converting an existing leaf to an internal node with children being the new action and the action at the former leaf would require 1 + \log_2 (|A_{old}|) novel binary classification subproblems to train, since the entire path to the root must be relearned. Conservatively if this done for a set of new actions the total number of retrains is scaled by |A_{new}|, but in fact many paths to the root will be shared if the replacements are located near each other in the tree. I suspect the actual cost is something like |A_{new}| + \log_2 (|A_{old}|/|A_{new}|), i.e., a complete tree of |A_{new}| classifiers plus one shared path of length \log_2 (|A_{old}|/|A_{new}|) to the root. I also suspect these retrains can be done in \log_2 (|A_{old}|) sequential steps.
In some cases it is not unreasonable to simply consider retraining the entire tree; each level can be trained in parallel so the number of sequential steps is \log_2 (|A|), with a total number of retrains |A|. Given nonstationarity, feature innovation, etc. a complete retrain has to occur periodically anyway.
Intra-Query Constraints
This is similar to the set of actions changing, but while the above section was about how the universe of possible actions can change, this section is about how on an individual instance certain actions might not be allowed.There are two different situations that I've identified. The first, which I call ``average constrained CSMC'', involves constraints that change very slowly if at all, such that they can be modeled as part of the problem instance with training and testing data drawn IID. These are things like ``this advertisement is not allowed on web pages with pornographic content,'' which almost never changes over the lifetime of an advertisement (perhaps at the very beginning due to a error in specification of a campaign).
The second, which I call ``minimax constrained CSMC'', involves constraints that change rapidly, such that the distribution of the constraints on the training set bears no relationship to the distribution of constraints on the test set. These are things like ``this advertiser has exhausted their budget,'' which given how advertisers experiment with budgets can be quite volatile. Constraints here are modeled as imposed adversarially, and a solution is required to get good regret over all possible settings of constraints.
An interesting result is that argmax regression reduction has the same regret bound for unconstrained, average constrained, and minimax constrained CSMC. This is achieved by simply argmax on the regression score over the set of actions that are allowed on this instance.
In the average constrained case, tree based reductions can be modified such that disallowed actions forfeit their tournaments, and an analogous regret bound to the unconstrained case can be derived. I don't have any results for the minimax constrained case for tree based reductions yet, although I have a small problem example which indicates that forfeiting alone does not achieve good results.
I strongly suspect that minimax constrained CSMC has to be well understood for regression to be dethroned from advertising.
Inter-Query Constraints
This refers to properties that need to be enforced across a set of queries. Budget constraints are the canonical example, where greedy delivery is known to have a worst-case competitive ratio of \frac{1}{2}. Again with no excuse (other than lack of knowledge), I've used regression even in the case where there were no inter-query constraints: a system for contextually targeting eBay affiliate ads. Affiliate programs only pay you when they get paid so essentially they have infinite budget.However often such constraints must be addressed. OR has been dealing with such constraints for decades, and OR pervasively reduces to regression. If budgets are specified in dollars, and regression estimates purport to be of expected revenue, then some ad serving problems with budget constraints can be attacked using network flow algorithms. Such algorithms are fast enough to re-run periodically as actuals flow in to overcome the inevitably large errors in traffic volume estimates. (The size of an ad network that can leverage this approach goes up as CPU and memory get cheaper).
It seems plausible to dethrone regression here, by reducing ad serving to cost-sensitive multiclass classification leveraging approaches like Policy Learning by Dynamic Programming. It might make a nice PhD thesis for somebody (it is somewhat practical, so perhaps lacks panache). In the meantime I will plod along: I've improved my intuition around stochastic shortest path and eventually hope to play around with reducing flow to CSMC.
I also wonder if approximate online methods for optimizing with budget constraints, which involve argmax on adjusted regression estimates, might also be applicable to other CSMC reductions. For example with Mehta et. al.'s \psi (x) = 1 - e^{x-1} remaining budget discounting function, a tree based reduction could be trained using the remaining budget discounted observed reward rather than the actual observed reward. Whether this makes sense requires further thought: my understanding of the analysis of such algorithms is they assume the regression is perfect, and the performance bound is due to the online nature of the query sequence. It would be interesting to augment the analysis with additional terms for regret in the regression, such that a tree based approach could be said to do better.
Selecting a Set
CSMC reductions choose a single action from a set of actions, but often in ad serving multiple ads are selected at once. Not always, however: display advertising is often a single ad display, and mobile screen real estate can be scarce. For sponsored search (or contextual ad serving of sponsored search advertisements) populating multiple positions is the norm.If the reward associated with a set is the sum of the individual action rewards, then regression handles set selection quite naturally: merely select the top m actions by estimated value, rather than only the first. The regret bound is almost identical to the single action case, with an extra factor of \sqrt{\min \{m,|A|-m\}}. The (undesirable) square root dependence on the regressor regret is preserved. Fortunately, this problem can also be reduced to average constrained CSMC. The basic strategy is ``choose the best action, then the next best action, etc.'' The regret has an extra factor of m (worse) but preserves the linear dependence on CSMC regret (better).
For ad serving, however, the assumption of linear rewards is too strong is practice, as there are usually significant positional effects. Fortunately, if the reward dependence upon position obeys swap supremacy and preservation of relative order (as is implied by a monotonic action-independent multiplicative positional modulation), then a similar technique can be used to solve the problem of selecting the best of actions when the reward associated with a set is the sum of individual action-position rewards via reduction to average constrained CSMC.
If the reward of a set of actions is not the sum of individual action rewards, one option is to treat entire sets as actions. In ad serving this is generally infeasible but in content optimization (e.g., adaptive UI) this can be viable. If externalities between actions only flow forward by position (e.g., a serial scan model in a vertical presentation), it feels intuitively like a stochastic shortest path problem but I haven't verified this.
In every ad server I've ever worked on, the reward of a set of actions was assumed linear in the individual action rewards, possibly with a positional correction. Thus, there really is no excuse for using regression merely because the problem involves selecting sets.
Summary
Overall, the two big issues that I feel are preventing the dethroning of regression from ad serving are 1) adversarially imposed intra-query constraints and 2) inter-query constraints. Any ad serving problem that does not exhibit these properties should be a slam dunk for more advanced CSMC reductions. For instance, any ad serving problem which monetizes via search engine landing pages (e.g., actions are bidded phrases) does not exhibit these properties; neither do meta-monetization problems (e.g., dynamically selecting between several ad networks).I'll be noodling on intra-query and inter-query constraints for CSMC in my spare time.