This slide is from Leon Bottou's amazing ICML 2015 keynote. You should contemplate this image on a regular basis.
Years later, I now realize the best theory doesn't predict experiments; it suggests them. In other words, the best theory prompts ideas that otherwise would not be contemplated. (Of course, there's still plenty of theory that attempts to explain experiments that have already been run, and that's fine: the left column (“Exact Sciences”) of the above image needs to digest the output of the other two columns. This is necessary activity, but the real payoff comes when the left-hand column suggests something that would have taken a very long time for the other two columns to consider.)
The Decision-Estimation Coefficient (DEC) is amazingly good theory, because it suggests algorithm designs that otherwise would not be apparent: already I've published results on infinite-action contextual bandits with arbitrary function classes, infinite-action neural-bilinear contextual bandits, and risk-averse contextual bandits. The last two are already implemented in Vowpal Wabbit and commercially available in Azure Personalizer Service. In other words, these are not “COLT algorithms”; these are practical algorithms you can run on real problems. (The first one works, but as stated in the original paper it doesn't generate causal data exhaust suitable for subsequent offline analysis. We have a fix for this which we will publish, and then that version will go into the product.) For risk-averse contextual bandits, I was confused for 18 months about how to achieve an online result (the exploration distribution affects the risk measure in a subtle way); then Dylan Foster explained the DEC to me, and clarity soon followed.
Now I will pay it forward by explaining the DEC to you.
A Tale of Minnie and Max
The DEC is part of the rich tradition of minimax approaches to learning theory: excellent introductions are available from Lafferty et al., Rakhlin and Sridharan, and Krishnamurthy. For a more modern example, Duchi et al. provide a minimax characterization of local differential privacy. In classic minimax theory, one bounds the worst-case value of a risk functional (i.e., something that says how good your decision rule is) subject to some constraints (e.g., local differential privacy). The DEC is a bit different and fun because it operates as a machine learning reduction to regression, and therefore it bounds the worst-case difference between performance on the original problem (“decision”) and performance on the reduced problem (“estimation”).
Basic Formulation
Here's one version suitable for contextual bandit algorithm design: $$
\text{DEC} = \min_P \max_{Q,f}\ \underbrace{\mathbb{E}_{a \sim Q}\left[f(a)\right] - \mathbb{E}_{a \sim P}\left[f(a)\right]}_{\text{CB regret}} - \gamma\ \underbrace{\mathbb{E}_{a \sim P}\left[ \left(f(a) - \hat{f}(a)\right)^2 \right]}_{\text{Estimation error}} $$ where
- $P$ is Player's distribution over actions. In the basic version, Player can pick any distribution over actions, but lately I've been having fun restricting the Player in various ways.
- $Q$ is Nature's distribution over actions. In the basic version, Nature can pick any distribution over actions, but constraining Nature is the basis for the smoothed infinite action result.
- $f$ is Nature's choice of (expected) reward function. By constraining Nature to pick from a limited complexity function class (e.g., neural-bilinear) you can get nontrivial bounds in infinite action spaces.
- $\hat{f}$ is the output of the regression oracle, i.e., our guess about what the true reward function is.
- Estimation error here is measured in squared loss, but using different divergences gives different results, e.g., triangular discrimination yields an $L^*$ result for contextual bandits, and expectile loss yields an online risk-averse contextual bandit.
- $\gamma$ can be interpreted as a learning rate which grows as the decision horizon (total number of rounds of decision making) grows. It can also be interpreted as a Lagrange multiplier: in more recent (currently unpublished) formulations of the DEC the estimation error is constrained rather than subtracted.
- You can add a context $x$ to all the components of the DEC (e.g., $\hat{f}(x, a)$), but it doesn't change anything since we evaluate the DEC example-by-example, so we elide it.
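To make the game concrete, here's a minimal numerical sketch (mine, not from any of the papers) for a finite action set. Once Nature's comparator action is fixed, $Q$ concentrates there, and if we let rewards range over all of $\mathbb{R}$ the maximization over $f$ is a separable concave quadratic with a closed form. The hypothetical helper `natures_best_response` below evaluates that inner supremum for a candidate Player distribution; the values of $\hat{f}$, $\gamma$, and the candidate distributions are made up for illustration.

```python
import numpy as np

def natures_best_response(P, fhat, gamma):
    """Worst case over (Q, f) of the DEC objective for a fixed Player distribution P.

    Sketch assumptions: finite actions, unconstrained real-valued rewards (so the
    maximization over f is a separable concave quadratic with a closed form), and
    P > 0 everywhere (otherwise the supremum is infinite).
    """
    P = np.asarray(P, dtype=float)
    fhat = np.asarray(fhat, dtype=float)
    # Fix a comparator action a*.  For a != a*, the maximizing reward is
    # fhat(a) - 1/(2 gamma), contributing -P(a) fhat(a) + P(a)/(4 gamma);
    # for a == a*, it is fhat(a*) + (1 - P(a*))/(2 gamma P(a*)), contributing
    # (1 - P(a*)) fhat(a*) + (1 - P(a*))^2 / (4 gamma P(a*)).  Summing gives:
    values = (fhat - P @ fhat
              + (1.0 - P) ** 2 / (4.0 * gamma * P)
              + (1.0 - P) / (4.0 * gamma))
    astar = int(np.argmax(values))          # Nature's best comparator action
    return values[astar], astar

fhat = np.array([0.9, 0.6, 0.5, 0.1])       # made-up regression estimate, K = 4
gamma = 20.0

candidates = {
    "uniform": np.full(4, 0.25),                       # explores, pays regret
    "greedy-ish": np.array([0.97, 0.01, 0.01, 0.01]),  # commits, pays estimation blindness
}
for name, P in candidates.items():
    val, astar = natures_best_response(P, fhat, gamma)
    print(f"{name:10s} worst-case objective {val:.3f} (comparator action {astar})")
```

In this toy example neither pure exploration nor near-greedy commitment does well; the inverse-gap-weighting distribution sketched in the SquareCB discussion below balances the two terms.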
Closing the Loop
Let's understand why a DEC bound implies an effective reduction. Informally, a bound on the DEC ensures that one of two things happens: 1) you do well on the original problem you care about or 2) you discover a mistake in your estimator. Because we reduce to online no-regret learning and assume realizability, #2 eventually runs out of steam.
More precisely, suppose we have a bound $\text{DEC} \leq C \gamma^{-\alpha}$. After $T$ rounds we have $$\begin{aligned}\text{CB regret} &\leq T C \gamma^{-\alpha} + \gamma \sum_{t=1}^T \mathbb{E}_{a \sim P_t}\left[\left(f_t(a) - \hat{f}_t(a)\right)^2\right] \\
&\leq T C \gamma^{-\alpha} + \gamma\, \text{Reg}(T) \\ &= C^{\frac{1}{1 + \alpha}} \alpha^{-\frac{\alpha}{1 + \alpha}} (1 + \alpha)\, T^{\frac{1}{1 + \alpha}}\, \text{Reg}(T)^{\frac{\alpha}{1 + \alpha}} \biggr|_{\gamma =\left(\frac{C T \alpha}{\text{Reg}(T)}\right)^{\frac{1}{1 + \alpha}}} \end{aligned}$$ where $\text{Reg}(T)$ is the regret of the online regression oracle (which, under realizability, upper bounds the summed squared errors) and the final line plugs in the $\gamma$ that minimizes the previous one (sanity-checked numerically after the list below). In particular
- (strongly observable) $\alpha = 1$ implies $\text{CB Regret} = O\left(\sqrt{T \text{Reg}(T)}\right)$ and $\gamma = O\left(\sqrt{\frac{T}{\text{Reg}(T)}}\right)$,
- (weakly observable) $\alpha = \frac{1}{2}$ implies $\text{CB Regret} = O(\text{Reg}(T)^{1/3} T^{2/3})$ with $\gamma = O\left(\left(\frac{T}{\text{Reg}(T)}\right)^{2/3}\right)$,
- (unobservable) $\alpha \to 0$ implies the DEC does not decay with $\gamma$, and $\text{CB Regret} = O(T)$.
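The tuned $\gamma$ in the display above is just the unconstrained minimizer of $T C \gamma^{-\alpha} + \gamma\,\text{Reg}(T)$; here's a throwaway check, with made-up values of $C$, $T$, and $\text{Reg}(T)$ and a hypothetical helper `tuned_bound`, that the closed form matches a grid search over $\gamma$.

```python
import numpy as np

def tuned_bound(C, T, alpha, reg):
    """Closed-form minimum of T*C*gamma**(-alpha) + gamma*reg and its minimizer."""
    gamma_star = (C * T * alpha / reg) ** (1.0 / (1.0 + alpha))
    value = (C ** (1.0 / (1.0 + alpha))
             * alpha ** (-alpha / (1.0 + alpha))
             * (1.0 + alpha)
             * T ** (1.0 / (1.0 + alpha))
             * reg ** (alpha / (1.0 + alpha)))
    return gamma_star, value

C, T, reg = 2.0, 10_000, 25.0               # made-up constants
for alpha in (1.0, 0.5):
    gamma_star, closed_form = tuned_bound(C, T, alpha, reg)
    grid = np.logspace(-2, 6, 400_001)      # brute-force search over gamma
    brute = np.min(T * C * grid ** (-alpha) + grid * reg)
    print(f"alpha={alpha}: gamma*={gamma_star:.1f}, "
          f"closed form={closed_form:.1f}, grid={brute:.1f}")
```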
Of course, if $\text{Reg}(T) = T$ then you are hosed. Fortunately for the version of the DEC presented here, with an arbitrary finite function class $\mathcal{F}$, we can use Vovk's aggregation algorithm to achieve $\text{Reg}(T) = O\left(\log \left| \mathcal{F} \right|\right)$, which yields the SquareCB result. In general $\text{Reg}(T)$ will have $T$ dependence which degrades the overall CB regret bound.
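For concreteness, here's a sketch of the inverse-gap-weighting exploration distribution that SquareCB plays; putting $K$ in the denominator is one common parametrization, and the published analyses tune the constants differently, so treat the details below as illustrative rather than canonical.

```python
import numpy as np

def inverse_gap_weighting(fhat, gamma):
    """Inverse-gap-weighting distribution: actions with a larger estimated gap to
    the greedy action get proportionally less probability, and the greedy action
    absorbs the leftover mass.  One common parametrization uses K additively."""
    fhat = np.asarray(fhat, dtype=float)
    K = len(fhat)
    best = int(np.argmax(fhat))
    p = 1.0 / (K + gamma * (fhat[best] - fhat))
    p[best] = 0.0
    p[best] = 1.0 - p.sum()                 # greedy action gets the remainder
    return p

fhat = np.array([0.9, 0.6, 0.5, 0.1])       # same made-up estimate as above
for gamma in (10.0, 100.0, 1000.0):
    print(gamma, inverse_gap_weighting(fhat, gamma).round(4))
```

Plugging this distribution into the `natures_best_response` sketch above, the worst-case objective decays roughly like $K / \gamma$ as $\gamma$ grows, i.e., the $\alpha = 1$ regime that feeds the $\sqrt{T\,\text{Reg}(T)}$ bound.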
Calling All Constructivists
Here's the important part: Foster et al. show the DEC characterizes the statistical complexity of interactive decision making, but except for a few special cases, the reasoning is nonconstructive. More precisely, they swap the min and max to get an upper bound on the DEC objective value (aka Bayesian reasoning), which does not yield a constructive algorithm. That means if you produce a constructive solution to the DEC for a particular scenario, you will have a novel algorithm which exhibits (near-)optimal minimax performance. (Instance-dependent results are trickier, but possible.)
Consequently, I've been spending lots of time trying to find as many constructive solutions to the DEC as I can. Now you know the secret so you can do the same.