Machined Learnings: The Minimax Forecaster and Transductive Active Learning

I've been making my way down the NIPS 2011 paper list, and found this nice paper Efficient Online Learning via Randomized Rounding by Cesa-Bianchi and Shamir. This paper is about improving and extending the Minimax Forecaster which is described in Prediction, Learning, and Games. (I own a copy but I confess to not having made it very far into that book.) The Minimax Forecaster uses a different strategy for online learning than mirror descent which is essentially what I (and everybody else?) use everyday. This different setting provides an opportunity to think about adversarial active learning.

Here's the basic setup. There is a game with $T$ rounds in it. There is a set $\mathcal{F}$ of experts. On each round, each expert $f$ produces a prediction $f_t$, player produces a prediction $p_t$, adversary simultaneously produces an outcome $y_t$, and player suffers an instantaneous loss $l (p_t, y_t)$. The experts are static (their predictions to do not depend upon previously observed outcomes), so essentially each expert is an sequence $f_{1:T}$. The player wants to generate a sequence of predictions $p_{1:T}$ which minimizes worst-case regret \[
\sup_{y_{1:T} \in \mathcal{Y}^T} \biggl( L (p_{1:T}, y_{1:T}) - \inf_{f \in \mathcal{F}} L (f_{1:T}, y_{1:T}) \biggr),
\] where $L (p_{1:T}, y_{1:T}) = \sum_s l (p_s, y_s)$ is the total loss. When the observations are binary $\mathcal{Y} = \{ 0, 1 \}$, then an instantaneous loss of $|p_t - y_t|$ corresponds to expected 0-1 loss when the player is randomizing decisions. Amazingly this case yields a closed-form expression for the optimal prediction \[
\begin{aligned}
p^*_t &= \frac{1}{2} \biggl( 1 + R^* (\mathcal{F}, y_{1:t-1}1) - R^* (\mathcal{F}, y_{1:t-1}0) \biggr) \\
&= \frac{1}{2} \biggl( 1 + \mathbb{E}_{\sigma_{t+1:T}} \left[ \inf_{f \in \mathcal{F}} L (f_{1:T}, y_{1:t-1} 0 \sigma_{t+1:T}) - \inf_{f \in \mathcal{F}} L (f_{1:T}, y_{1:t-1} 1 \sigma_{t+1:T}) \right] \biggr),
\end{aligned}
\] where temporal sequence concatenation is denoted lexically, $\sigma_t$ is $\mathrm{Bournelli}(1/2)$ distributed a la Rademacher averages, and $R^* (\mathcal{F}, y_{1:t})$ is the residual game value after some rounds of play, \[
\begin{aligned}
R^* (\mathcal{F}, y_{1:t}) &= \frac{1}{2} \biggl( 1 + R^* (\mathcal{F}, y_{1:t-1}0) + R^* (\mathcal{F}, y_{1:t-1}1) \biggr) \\
&= \frac{T - t}{2} - \mathbb{E}_{\sigma_{t+1:T}}\left[ \inf_{f \in \mathcal{F}} L (f_{1:T}, y_{1:t} \sigma_{t+1:T}) \right].
\end{aligned}
\] Essentially what's happening here is that player is able to make adversary indifferent to playing either option on each round by playing a constant plus the difference between the residual game values associated with playing each option; this causes the residual game value to be a constant plus the average value of continuing after playing each option. Unwrapping the game value recursively leads to the Rademacher style averages. One observation of the paper is that such expectations can be approximated by sampling to achieve a high probability regret bound, aka random playout.

In practice even to do random playout you need to know $f_{1:T}$ for the various experts. When mapping this to a contextual prediction setting, this corresponds to knowing the sequence of features in advance (but not the labels). Thus this is essentially a transductive technique. Some recommendation problems are naturally transductive, and the paper discusses an application to collaborative filtering.

Active Learning?

In principle the setup can be modified to consider active learning. Each round, in addition to generating a prediction, player must make a decision $z_t \in \{ 0, 1 \}$ whether or not to observe $y_t$. If $z_t = 0$, player cannot use the value of $y_t$ in subsequent predictions. Since it is always better for the player to observe $y_t$, there has to be some penalty for doing so, thus consider a constant penalty $\alpha$ per observation. The player wants to generate a sequence of predictions $p_{1:T}$ and queries $z_{1:T}$ which minimizes worst-case regret \[\sup_{y_{1:T} \in \mathcal{Y}^T} \biggl( \sum_s \alpha z_s + L (p_{1:T}, y_{1:T}) - \inf_{f \in \mathcal{F}} L (f_{1:T}, y_{1:T}) \biggr).
\] Concise general closed-form expressions have eluded me thus far, but there is a non-trivial case which yields nice answers: the two-round game.

It never makes sense to observe the final outcome $y_T$, so $z_T = 0$. In the two-round game, then, the question is whether to observe $y_1$. If $y_1$ is not observed (i.e., $z_1 = 0$), player must ballistically plan both predictions without intermediate feedback, \[
\begin{aligned}
(p_1^*, p_2^*) &= \mathop{\operatorname{arg\,inf}}\limits_{p_{1:2} \in \mathcal{P}^2} \sup_{y_{1:2} \in \mathcal{Y}^2} \left( |p_1 - y_1| + |p_2 - y_2| - \inf_{f \in \mathcal{F}} L (f_{1:2}, y_{1:2}) \right).
\end{aligned}
\] This can be solved with Mathematica: here's the incantation.

Minimize[{ z, 
           p1 + p2 - inf00 <= z, 
           p1 + (1 - p2) - inf01 <= z, 
           (1 - p1) + p2 - inf10 <= z, 
           (1 - p1) + (1 - p2) - inf11 <= z }, 
           { p1, p2, z }] // Simplify

This has solution \[
\begin{aligned}
p_1^* &= \frac{1}{2} \left( 1 + \frac{1}{2} \sum_{y_2=0}^1 \left( \inf_{f \in \mathcal{F}} L (f_{1:2}, 0y_2) - \inf_{f \in \mathcal{F}} L (f_{1:2}, 1y_2) \right) \right) & (z_1 = 0), \\
p_2^* &= \frac{1}{2} \left(1 + \frac{1}{2} \sum_{y_1=0}^1 \left( \inf_{f \in \mathcal{F}} L (f_{1:2}, y_10) - \inf_{f \in \mathcal{F}} L (f_{1:2}, y_11) \right) \right) & (z_1 = 0),
\end{aligned}
\] with game value \[
\begin{aligned}
&R (\mathcal{F}, \emptyset | z_1 = 0) \\
&= 1 - \frac{1}{2} \min\left\{ \inf_{f \in \mathcal{F}} L (f_{1:2}, 00) + \inf_{f \in \mathcal{F}} L (f_{1:2}, 11), \inf_{f \in \mathcal{F}} L (f_{1:2}, 01) + \inf_{f \in \mathcal{F}} L (f_{1:2}, 10) \right\}. \\
\end{aligned}
\] Now compare this to the case of $z_1 = 1$, which is the same as the fully observed Minimax Forecaster. \[
\begin{aligned}
p_1^* &= \frac{1}{2} \left( 1 + \frac{1}{2} \sum_{y_2=0}^1 \left( \inf_{f \in \mathcal{F}} L (f_{1:2}, 0y_2) - \inf_{f \in \mathcal{F}} L (f_{1:2}, 1y_2) \right) \right) & (z_1 = 1), \\
p_2^* &= \frac{1}{2} \left(1 + \inf_{f \in \mathcal{F}} L (f_{1:2}, y_10) - \inf_{f \in \mathcal{F}} L (f_{1:2}, y_11) \right) & (z_1 = 1).
\end{aligned}
\] The first round prediction $p_1^*$ is the same whether or not $z_1 = 0$ or $z_1 = 1$, but the second round prediction $p_2^*$ is different. If $z_1 = 0$, then $p_2^*$ is computed by averaging over possible histories; whereas if $z_1 = 1$, then $p_2^*$ is computing using the actual observed history. (Aside: perhaps constant-time Radacher averages will be quantum computing's killer app.)

To decide whether to observe $y_1$ or not, we need to know how much better it is to do so, i.e., the difference in game values. When $z_1=1$ this is the same as the fully observed Minimax Forecaster, \[
\begin{aligned}
&R (\mathcal{F}, \emptyset | z_1 = 1) \\
&= 1 - \frac{1}{4} \left( \inf_{f \in \mathcal{F}} L (f_{1:T}, 00) + \inf_{f \in \mathcal{F}} L (f_{1:T}, 01) + \inf_{f \in \mathcal{F}} L (f_{1:T}, 10) + \inf_{f \in \mathcal{F}} L (f_{1:T}, 11) \right),
\end{aligned}
\] therefore the difference in game value is \[
\begin{aligned}
&R^* (\mathcal{F}, \emptyset | z_t = 0) - R^* (\mathcal{F}, \emptyset | z_t = 1) \\
&= \frac{1}{4} \left| \inf_{f \in \mathcal{F}} L (f_{1:2}, 00) - \inf_{f \in \mathcal{F}} L (f_{1:2}, 01) - \left( \inf_{f \in \mathcal{F}} L (f_{1:2}, 10) - \inf_{f \in \mathcal{F}} L (f_{1:2}, 11) \right) \right|. \\
\end{aligned}
\] This looks like a difference of future differences. If the game value difference exceeds $\alpha$, then we should decide $z_1 = 1$, otherwise not. So, for instance, if every expert predicts the same value on the first round, then the difference of future differences will be zero and we should not observe $y_1$. That certainly sounds like active learning.

So what should a general case $T$ round solution look like? Intuitively, one would hope that if all the experts that have done well in the past predict the same thing on the current instance, that the value of observing $y_t$ for that instance would go down. That is roughly what agnostic active learning does in the IID setting. Here the future is also important, but analogously if all the experts that are in the running for the infimum at the end of the horizon agree on a value, it should be that observing $y_t$ has less value. As we near to the end of the planning horizon, that will be driven mostly by having done well in the past.

Machined Learnings

Sunday, May 6, 2012

The Minimax Forecaster and Transductive Active Learning

Active Learning?

No comments:

Post a Comment