The offset tree is a reduction from cost-sensitive multiclass classification (CSMC) with partial feedback to binary classification. It is designed for settings with a single decision amongst a fixed set of classes (like vanilla CSMC) and historical data in which only the value of the decision actually taken has been revealed (unlike vanilla CSMC). As a side note, this difference is fundamental: there are reductions from full-information CSMC to binary classification whose regret is independent of the number of classes, whereas in the partial-feedback setting the offset tree's regret dependence (proportional to $|A| - 1$) matches a provable lower bound.
So a natural question to ask is: if I can reduce a full-information version of a problem (e.g., SSP without recourse) to a set of CSMC subproblems, can I reduce a partial-information version of the problem to a set of CSMC-with-partial-feedback subproblems (and thereby leverage the offset tree)? I don't know for sure, but I suspect so.
In order to build intuition, however, I'm going to start with something easier. Since I actually want to reduce problems to constrained CSMC, I need something that can handle constrained CSMC with partial feedback. That suggests defining the forfeit offset tree analogous to the forfeit filter tree.
The setup is as follows. There is a distribution $D = D_x \times D_{\omega|x} \times D_{r|\omega,x}$ where $r: A \to [0, 1] \cup \{ -\infty \}$ takes values on the unit interval augmented with $-\infty$, and the components of $r$ that are $-\infty$ valued for a particular instance are revealed as part of the problem instance via $\omega \in \mathcal{P} (A)$ (i.e., $\omega$ is a subset of $A$). The regret of a particular deterministic policy $h: X \times \mathcal{P} (A) \to A$ is \[ v (h) = E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ \max_{k \in A}\; E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (h (x, \omega)) \right] \right]. \] So far this is identical to CSMC with cosmetic changes to align with the reinforcement learning literature: costs (now called ``rewards'') have been negated and compressed into the unit interval; and classes are now called ``actions.'' This is because the only real difference is in the partial nature of the training data, which manifests itself only via the induced distribution for subproblems. To define the induced distribution I'll assume that the historical policy is using a known conditional distribution over actions given an instance $p (a | x, \omega)$; in practice there are ways to relax this assumption.
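To make the training data concrete, here is a minimal Python sketch (with illustrative names of my own choosing, not anything canonical) of the logged record that this reduction consumes: the features, the revealed infeasible set $\omega$, the single action taken, its observed reward, and the known historical action probabilities.

    from dataclasses import dataclass
    from typing import Any, Dict, Set

    @dataclass
    class LoggedExample:
        x: Any                 # instance features
        omega: Set[int]        # actions revealed to be infeasible (reward -infinity)
        a: int                 # action chosen by the historical policy
        r_a: float             # observed reward in [0, 1] for action a only
        p: Dict[int, float]    # known historical policy p(. | x, omega)

    # e.g., three actions {0, 1, 2}, action 2 infeasible, action 0 was played:
    example = LoggedExample(x={"user": 7}, omega={2}, a=0, r_a=0.8,
                            p={0: 0.5, 1: 0.5, 2: 0.0})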
Algorithm:Forfeit Offset Tree Train
Data: Partially labelled constrained CSMC training data set $S$.
Input: Importance-weighted binary classification routine $\mbox{Learn}$.
Input: A binary tree $T$ over the labels with internal nodes $\Lambda (T)$.
Result: Trained classifiers $\{\Psi_n | n \in \Lambda (T) \}$.
- For each $n \in \Lambda (T)$ from leaves to root:
- $S_n = \emptyset$.
- For each example $(x, \omega, a, r (a), p (\cdot | x, \omega)) \in S$:
- Let $\lambda$ and $\phi$ be the two classes input to $n$ (the predictions of the left and right subtrees on input $(x, \omega)$ respectively).
- If $\lambda \in \omega$, predict $\phi$ for the purposes of constructing training input for parent node (``$\lambda$ forfeits'');
- else if $\phi \in \omega$, predict $\lambda$ for the purposes of constructing training input for parent node (``$\phi$ forfeits'');
- else (when $\lambda \not \in \omega$ and $\phi \not \in \omega$) if $a = \lambda$ or $a = \phi$:
- If $a = \lambda$ then $a^\prime = \phi$; else $a^\prime = \lambda$.
- Let $y = 1_{a^\prime = \lambda}$, i.e., $y = 1$ indicates that $a^\prime$ is the prediction of the left subtree of $n$.
- If $r (a) < \frac{1}{2}$, $S_n \leftarrow S_n \cup \left\{ \left( x, y, \frac{p (a | x, \omega) + p (a^\prime | x, \omega)}{p (a | x, \omega)} \left(\frac{1}{2} - r (a)\right) \right) \right\}$;
- else $S_n \leftarrow S_n \cup \left\{ \left( x, 1 - y, \frac{p (a | x, \omega) + p (a^\prime | x, \omega)}{p (a | x, \omega)} \left(r (a) - \frac{1}{2}\right) \right) \right\}$.
- Let $\Psi_n = \mbox{Learn} (S_n)$.
- Return $\{\Psi_n | n \in \Lambda (T) \}$.
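Here is a rough Python sketch of the training procedure above. It assumes a tree object exposing internal_nodes_bottom_up() and a predict_subtree(child, x, omega, psi) helper that applies the already-trained lower classifiers with the same forfeit routing as the test algorithm below, plus a caller-supplied importance-weighted binary learner learn; all of these names are hypothetical.

    def train_forfeit_offset_tree(tree, S, learn):
        """tree: binary tree over actions; S: logged examples; learn: IW binary learner."""
        psi = {}
        for n in tree.internal_nodes_bottom_up():               # leaves-to-root order
            S_n = []
            for (x, omega, a, r_a, p) in S:
                lam = tree.predict_subtree(n.left, x, omega, psi)   # left input class
                phi = tree.predict_subtree(n.right, x, omega, psi)  # right input class
                if lam in omega or phi in omega:
                    continue    # a forfeiture: no example is generated at this node
                if a not in (lam, phi):
                    continue    # the logged action does not reach this node
                a_prime = phi if a == lam else lam
                y = 1 if a_prime == lam else 0                  # 1 means "left subtree"
                w = (p[a] + p[a_prime]) / p[a]                  # importance correction
                if r_a < 0.5:
                    S_n.append((x, y, w * (0.5 - r_a)))         # vote for the other action
                else:
                    S_n.append((x, 1 - y, w * (r_a - 0.5)))     # vote for the logged action
            psi[n] = learn(S_n)
        return psi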
Algorithm:Forfeit Offset Tree Test
Input: A binary tree $T$ over the labels with internal nodes $\Lambda (T)$.
Input: Trained classifiers $\{\Psi_n | n \in \Lambda (T) \}$.
Input: Instance realization $(x, \omega)$.
Result: Predicted label $k$.
- Let $n$ be the root node.
- Repeat until $n$ is a leaf node:
- If all the labels of the leaves in the left-subtree of $n$ are in $\omega$, traverse to the right child;
- else if all the labels of the leaves in the right-subtree of $n$ are in $\omega$, traverse to the left child;
- else if $\Psi_n (x) = 1$, traverse to the left child;
- else (when $\Psi_n (x) = 0$ and at least one label in each subtree is not in $\omega$), traverse to the right child.
- Return leaf label $k$.
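A matching sketch of the test-time traversal, under the same illustrative tree interface (a labels_under(node) helper returning the set of leaf labels beneath a node is assumed):

    def predict_forfeit_offset_tree(tree, psi, x, omega):
        n = tree.root
        while not n.is_leaf:
            if tree.labels_under(n.left) <= omega:      # entire left subtree forfeits
                n = n.right
            elif tree.labels_under(n.right) <= omega:   # entire right subtree forfeits
                n = n.left
            elif psi[n].predict(x) == 1:                # classifier prefers the left child
                n = n.left
            else:
                n = n.right
        return n.label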
Regret Analysis
The regret analysis for the forfeit offset tree is very similar to the regret analysis for the offset tree, with additional arguments for the forfeiture cases. Let $\Psi = (T, \{\Psi_n | n \in \Lambda (T) \})$ denote a particular forfeit offset tree (i.e., a choice of a binary tree and a particular set of node classifiers), and let $h^\Psi$ denote the policy that results from the forfeit offset tree. The regret analysis leverages an induced importance-weighted binary distribution $D^\prime (\Psi)$ over triples $(x^\prime, y, w)$ defined as follows:
- Draw $(x, \omega, r)$ from $D$.
- Draw $n$ uniformly at random from the internal nodes $\Lambda (T)$ of the binary tree.
- Let $x^\prime = (x, n)$.
- Let $\lambda$ and $\phi$ be the two classes input to $n$ (the predictions of the left and right subtrees on input $(x, \omega)$ respectively).
- If $\lambda \in \omega$, create importance-weighted binary example $(x^\prime, 0, 0)$;
- else if $\phi \in \omega$, create importance-weighted binary example $(x^\prime, 1, 0)$;
- else (when $\lambda \not \in \omega$ and $\phi \not \in \omega$):
- Draw $a$ from $p (a | x, \omega)$.
- If $a \neq \lambda$ and $a \neq \phi$, reject sample;
- else (when $a = \lambda$ or $a = \phi$):
- If $a = \lambda$, let $a^\prime = \phi$; else ($a = \phi$), let $a^\prime = \lambda$.
- Let $y = 1_{a^\prime = \lambda}$, i.e., $y = 1$ indicates that $a^\prime$ is the prediction of the left subtree of $n$.
- If $r (a) < \frac{1}{2}$, create importance-weighted binary example \[\left( x^\prime, y, \frac{p (a | x, \omega) + p (a^\prime | x, \omega)}{p (a | x, \omega)} \left(\frac{1}{2} - r (a)\right) \right);\]
- else, create importance-weighted binary example \[ \left( x^\prime, 1 - y, \frac{p (a | x, \omega) + p (a^\prime | x, \omega)}{p (a | x, \omega)} \left(r (a) - \frac{1}{2}\right) \right). \]
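As a sanity check on the correspondence between training and this induced distribution, here is a hypothetical Python sampler for a single draw from $D^\prime (\Psi)$, under the same illustrative tree interface as before and with draw_from_D, draw_action, and p supplied by the caller. Note that the forfeiture branches emit zero-importance examples, which is why the training routine can simply skip them, and the rejection branch discards draws whose logged action does not reach node $n$.

    import random

    def draw_induced_example(tree, psi, draw_from_D, draw_action, p):
        """draw_from_D() ~ D; draw_action(x, omega) ~ p(.|x, omega); p(a, x, omega) known."""
        x, omega, r = draw_from_D()
        n = random.choice(list(tree.internal_nodes()))      # uniform over internal nodes
        x_prime = (x, n)
        lam = tree.predict_subtree(n.left, x, omega, psi)
        phi = tree.predict_subtree(n.right, x, omega, psi)
        if lam in omega:
            return (x_prime, 0, 0.0)                        # lambda forfeits: zero weight
        if phi in omega:
            return (x_prime, 1, 0.0)                        # phi forfeits: zero weight
        a = draw_action(x, omega)
        if a not in (lam, phi):
            return None                                     # rejected sample
        a_prime = phi if a == lam else lam
        y = 1 if a_prime == lam else 0
        w = (p(a, x, omega) + p(a_prime, x, omega)) / p(a, x, omega)
        if r[a] < 0.5:
            return (x_prime, y, w * (0.5 - r[a]))
        return (x_prime, 1 - y, w * (r[a] - 0.5))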
Theorem:Forfeit Offset Tree Regret Bound
For all partially labelled CSMC distributions $D$; all historical policies $p$ such that $p (a | x, \omega) > 0$ whenever $a \not \in \omega$; and all forfeit offset trees $\Psi$, \[ v (h^\Psi) \leq (|A| - 1) q (\Psi) \] where $q (\Psi)$ is the importance-weighted binary regret on the induced subproblem.
Proof: See Appendix.
A natural next step is to look at cost-sensitive best m with partial feedback; perhaps the forfeit offset tree will be useful in that context.
Appendix
This is the proof of the regret bound theorem. Consider a fixed $(x, \omega)$. It is useful to talk about the conditional policy regret experienced at an internal node $n$, \[ v (h^\Psi | x, \omega, n) = \max_{k \in \Gamma (n)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (h^\Psi_n (x, \omega)) \right], \] where $\Gamma (n)$ denotes the set of labels (leaves) in the subtree rooted at $n$ and $h_n^\Psi$ is the prediction at internal node $n$. When $n$ is the root of the tree, $v (h^\Psi | x, \omega, n)$ is the forfeit offset tree policy regret conditional on $(x, \omega)$.
The proof strategy is to bound $v (h^\Psi | x, \omega, n) \leq \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega)$ via induction, where $\Lambda (n)$ denotes the internal nodes of the subtree rooted at $n$ and $q_m (\Psi | x, \omega)$ is the importance-weighted binary regret of $\Psi_m$ conditional on $(x, \omega)$. The base case is trivially satisfied for subtrees with only one leaf (no internal nodes), since it reduces to $0 \leq 0$. To show the recursion at a particular internal node $n$, let $\lambda$ and $\phi$ be the predictions of the left subtree ($n_\lambda$) and right subtree ($n_\phi$) respectively.
Case 1: $\Gamma (n_\lambda) \setminus \omega = \emptyset$. In this case $\lambda \in \omega$ and forfeits, so $\phi$ is chosen. There must be a maximizer in the right subtree, since all values in the left subtree are $-\infty$. Furthermore $q_m (\Psi | x, \omega) = 0$ for both $m = n$ and for $m \in \Lambda (n_\lambda)$ by definition. Therefore \[ \begin{aligned} v (h^\Psi | x, \omega, n) &=
\max_{k \in \Gamma (n)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi) \right] \\ &= \max_{k \in \Gamma (n_\phi)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi) \right] \\ &= v (h^\Psi | x, \omega, n_\phi) \\ &\leq \sum_{m \in \Lambda (n_\phi)} q_m (\Psi | x, \omega) \\ &= \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned} \]
Case 2: $\Gamma (n_\lambda) \setminus \omega \neq \emptyset$ and $\Gamma (n_\phi) \setminus \omega = \emptyset$. In this case $\phi \in \omega$ and $\lambda \not \in \omega$, so $\phi$ forfeits and $\lambda$ is chosen. There must be a maximizer in the left subtree, since all values in the right subtree are $-\infty$. Furthermore $q_m (\Psi | x, \omega) = 0$ for both $m = n$ and for $m \in \Lambda (n_\phi)$ by definition. Therefore \[ \begin{aligned} v (h^\Psi | x, \omega, n) &=
\max_{k \in \Gamma (n)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\lambda) \right] \\ &= \max_{k \in \Gamma (n_\lambda)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\lambda) \right] \\ &= v (h^\Psi | x, \omega, n_\lambda) \\ &\leq \sum_{m \in \Lambda (n_\lambda)} q_m (\Psi | x, \omega) \\ &= \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned} \]
Case 3: $\Gamma (n_\lambda) \setminus \omega \neq \emptyset$ and $\Gamma (n_\phi) \setminus \omega \neq \emptyset$. This is the ``normal'' offset tree case, where both $\lambda \not \in \omega$ and $\phi \not \in \omega$ so no forfeiture happens. The expected importance weights conditioned on $(x, \omega, r)$ and $\lambda \not \in \omega$ and $\phi \not \in \omega$ are \[ \begin{aligned} w_{\lambda|r} &= E_{a \sim p} \biggl[ 1_{a = \lambda} 1_{r (a) \geq \frac{1}{2}}\frac{p (\lambda | x, \omega) + p (\phi | x, \omega)}{p (\lambda | x, \omega)} \left(r (\lambda) - \frac{1}{2}\right) \\ &\quad \quad \quad \quad + 1_{a = \phi} 1_{r (a) < \frac{1}{2}} \frac{p (\lambda | x, \omega) + p (\phi | x, \omega)}{p (\phi | x, \omega)} \left(\frac{1}{2} - r (\phi)\right) \biggr] \biggl/ \\ &\quad \quad E_{a \sim p} \left[ 1_{a = \lambda} + 1_{a = \phi} \right] \\ &= \left( r (\lambda) - \frac{1}{2} \right)_+ + \left( \frac{1}{2} - r (\phi) \right)_+, \\ w_{\phi|r} &= \left( r (\phi) - \frac{1}{2} \right)_+ + \left( \frac{1}{2} - r (\lambda) \right)_+, \\ \end{aligned} \] where $\left( x \right)_+ = \max (x, 0)$. Thus \[ | w_\lambda - w_\phi | = \left| E_{r \sim D_{r|\omega,x}} \left[ w_{\lambda|r} - w_{\phi|r} \right] \right| = \left| E_{r \sim D_{r|\omega,x}} [r (\lambda) - r (\phi)] \right|, \] i.e., the importance-weighted regret at an internal node is equal to the policy regret with respect to the two actions input to that node.
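As a quick numeric check of the identity just derived (purely illustrative), the difference of the expected importance weights equals the reward difference pointwise in $r$:

    def positive_part(z):
        return max(z, 0.0)

    def w_lambda_given_r(r_lam, r_phi):
        return positive_part(r_lam - 0.5) + positive_part(0.5 - r_phi)

    def w_phi_given_r(r_lam, r_phi):
        return positive_part(r_phi - 0.5) + positive_part(0.5 - r_lam)

    for r_lam, r_phi in [(0.9, 0.2), (0.3, 0.7), (0.6, 0.6), (0.1, 0.05)]:
        diff = w_lambda_given_r(r_lam, r_phi) - w_phi_given_r(r_lam, r_phi)
        assert abs(diff - (r_lam - r_phi)) < 1e-12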
Assume without loss of generality that the classifier chooses $\phi$. If the maximizer comes from the right subtree, then \[ \begin{aligned} v (h^\Psi | x, \omega, n) &= \max_{k \in \Gamma (n_\phi)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi) \right] \\ &= v (h^\Psi | x, \omega, n_\phi) \\ &\leq \sum_{m \in \Lambda (n_\phi)} q_m (\Psi | x, \omega) \\ &\leq \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned} \] If the maximizer comes from the left subtree, then \[ \begin{aligned} v (h^\Psi | x, \omega, n) &= \max_{k \in \Gamma (n_\lambda)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi) \right] \\ &= E_{r \sim D_{r|\omega,x}} \left[ r (\lambda) - r (\phi) \right] + v (h^\Psi | x, \omega, n_\lambda) \\ &\leq q_n (\Psi | x, \omega) + v (h^\Psi | x, \omega, n_\lambda) \\ &\leq q_n (\Psi | x, \omega) + \sum_{m \in \Lambda (n_\lambda)} q_m (\Psi | x, \omega) \\ &\leq \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega), \end{aligned} \] where the first inequality holds because the classifier chose $\phi$, so $q_n (\Psi | x, \omega) = (w_\lambda - w_\phi)_+ = \left( E_{r \sim D_{r|\omega,x}} \left[ r (\lambda) - r (\phi) \right] \right)_+$. Terminating the induction at the root yields \[ v (h^\Psi | x, \omega) \leq \sum_{n \in \Lambda (T)} q_n (\Psi | x, \omega) = |\Lambda (T)| q (\Psi | x, \omega). \] Taking the expectation of both sides with respect to $D_x \times D_{\omega|x}$ and noting $|\Lambda (T)| = |A| - 1$ completes the proof.