So the empirical policy estimators I discussed in my previous post provide a way of addressing some of the quandaries that arise when I get asked ``what is the business impact of a proposed change to a decision system?'' There are a few issues and provisos but no show stoppers.
The first issue is that the main system I'm asked to prognosticate about does not operate by frequently making a decision involving a few actions (e.g., like an ad server does); instead it infrequently makes a decision involving a large number of actions (e.g., like an airline might do when planning its flight schedule). Fortunately it has done this enough times that I can consider myself possessing a sample of data $H$ of the form $\{ (x, \mathcal{A}, \{ r (a) | a \in \mathcal{A} \}) \}$. That suggests something like the empirical value estimator for set revelation, \[ \sum_{(x, \mathcal{A}, \{ r (a) | a \in \mathcal{A} \}, p) \in H} r (\pi (x)) \frac{1_{\pi (x) \in \mathcal{A}}}{p (\pi (x) \in \mathcal{A} | x)}. \] Going forward, I'll still be making decisions in large sets rather than individually, but I'm assuming that the reward for a set is the sum of the rewards for the actions, so this should work $\ldots$
Except for the second issue, which is that the historical policy $p (a \in \mathcal{A} | x)$ is unknown. This is because the historical policy is actually a deterministic global optimization routine. Here I can hope to use ideas from Strehl et. al. to consider the historical data as implicitly exploratory, estimate $\hat p (a \in \mathcal{A} | x)$, and use \[ \sum_{(x, \mathcal{A}, \{ r (a) | a \in \mathcal{A} \}) \in H} r (\pi (x)) \frac{1_{\pi (x) \in \mathcal{A}}}{\max \{ \tau, \hat p (\pi (x) \in \mathcal{A} | x)\} }. \] I'll need to verify that the analysis in Strehl et. al., which was for single actions, holds when sets are chosen and revealed (presumably yes). I'll also need to arrange for new policy to have sufficient historical support, i.e., I can't allow actions to be chosen which have a $\hat p$ which is too small (actual code must be written to enforce this). Therefore, since I want to have the possibility of breaking out of the chains of history, I'll have to include some exploration decisions into the decision making process (currently, there are none).
Finally, the proviso: this technique will only work for predicting rewards that are unambiguously associated with a single action. So I'll need to set expectations here. For instance, ``how will users spend over the next year change as a result of this decision system tweak?'' is not a fair question (for the above technique). However a related and fair question is ``how will users immediate spend in response to actions change as a result of this decision system tweak?'', and hopefully some other hand-waving can be employed to project longitudinal spend based upon short-term spend.
No comments:
Post a Comment