The first thing I ran into when trying to reduce average constrained cost-sensitive best m with partial feedback to average constrained cost-sensitive multiclass classification with partial feedback (CSMC-PF) is that, given the way I'm setting up the subproblems, generally more than one element of the reward vector is revealed per historical instance. What's needed is a mash-up of the forfeit filter tree and the forfeit offset tree: use the forfeit filter tree update when both inputs to an internal node have revealed values, and fall back to the forfeit offset tree update when only one input to an internal node has been revealed.
The average constrained CSMC-PF setup is as follows. There is a distribution $D = D_x \times D_{\omega|x} \times D_{r|\omega,x}$ where $r: A \to [0, 1] \cup \{ -\infty \}$ takes values on the unit interval augmented with $-\infty$, and the components of $r$ that are $-\infty$ valued for a particular instance are revealed as part of the problem instance via $\omega \in \mathcal{P} (A)$ (i.e., $\omega$ is a subset of $A$). The regret of a particular deterministic policy $h: X \times \mathcal{P} (A) \to A$ is \[ v (h) = E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ \max_{k \in A}\; E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (h (x, \omega)) \right] \right]. \] In the training data only partial feedback is available, but unlike the previous post I'll assume that potentially multiple elements of the reward vector are revealed. I'll assume that the historical policy is using a known conditional distribution over the power set of actions given an instance $p (\mathcal{A} | x, \omega)$. I'll use the shorthand $\mathcal{A}$ to refer to realizations from $\mathcal{P} (A)$.
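Throughout I'll sketch things in Python with names of my own invention. For concreteness, here is one way a historical record could be represented, pretending $p$ is available as an explicit distribution over revealed subsets (in practice it may only be available implicitly):

from dataclasses import dataclass
from typing import Dict, FrozenSet

@dataclass
class HistoricalExample:
    """One record (x, omega, A, { r(a) | a in A }, p(. | x, omega)).
    All names here are illustrative, not from any particular library."""
    x: object                        # instance features
    omega: FrozenSet[int]            # actions whose reward is -infinity (infeasible)
    revealed: FrozenSet[int]         # the realized A: actions whose rewards were observed
    r: Dict[int, float]              # observed rewards, keyed by action in `revealed`
    p: Dict[FrozenSet[int], float]   # known historical distribution over revealed subsets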
Algorithm:Forfeit Filter-Offset Tree Train
Data: Constrained CSMC-PF training data set $S$.
Input: Importance-weighted binary classification routine $\mbox{Learn}$.
Input: A binary tree $T$ over the labels with internal nodes $\Lambda (T)$.
Result: Trained classifiers $\{\Psi_n | n \in \Lambda (T) \}$.
- For each $n \in \Lambda (T)$ from leaves to root:
- $S_n = \emptyset$.
- For each example $(x, \omega, \mathcal{A}, \{ r (a) | a \in \mathcal{A} \}, p (\cdot | x, \omega)) \in S$:
- Let $\lambda$ and $\phi$ be the two classes input to $n$ (the predictions of the left and right subtrees on input $(x, \omega)$ respectively).
- If $\lambda \in \omega$, predict $\phi$ for the purposes of constructing training input for parent node (``$\lambda$ forfeits'');
- else if $\phi \in \omega$, predict $\lambda$ for the purposes of constructing training input for parent node (``$\phi$ forfeits'');
- else if $\lambda \in \mathcal{A}$ and $\phi \in \mathcal{A}$:
- $S_n \leftarrow S_n \cup \left\{ \left(x, 1_{r (\lambda) > r (\phi)}, |r (\lambda) - r (\phi)|\right) \right\}$.
- else if $\lambda \in \mathcal{A}$ and $\phi \not \in \mathcal{A}$:
- If $r (\lambda) < \frac{1}{2}$, $S_n \leftarrow S_n \cup \left\{ \left( x, 0, \frac{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]}{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} ]} \left(\frac{1}{2} - r (\lambda)\right) \right) \right\}$;
- else $S_n \leftarrow S_n \cup \left\{ \left( x, 1, \frac{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]}{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} ]} \left(r (\lambda) - \frac{1}{2}\right) \right) \right\}$.
- else if $\lambda \not \in \mathcal{A}$ and $\phi \in \mathcal{A}$:
- If $r (\phi) < \frac{1}{2}$, $S_n \leftarrow S_n \cup \left\{ \left( x, 1, \frac{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]}{E_{\mathcal{A} \sim p} [ 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]} \left(\frac{1}{2} - r (\phi) \right) \right) \right\}$;
- else $S_n \leftarrow S_n \cup \left\{ \left( x, 0, \frac{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]}{E_{\mathcal{A} \sim p} [ 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]} \left(r (\phi) - \frac{1}{2}\right) \right) \right\}$.
- Let $\Psi_n = \mbox{Learn} (S_n)$.
- Return $\{\Psi_n | n \in \Lambda (T) \}$.
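Here is a minimal sketch of that training loop against the representation above. The tree interface (`internal_nodes_bottom_up`, `subtree_predict`, `left`, `right`) is a hypothetical API of my own, and `learn` is whatever importance-weighted binary learner is on hand; the substance is just the per-node case analysis.

def pair_probs(p, lam, phi):
    """Event probabilities under p (here an explicit dict over revealed subsets):
    (P[lam in A, phi not in A], P[lam not in A, phi in A], P[lam in A, phi in A])."""
    q_lam = sum(w for A, w in p.items() if lam in A and phi not in A)
    q_phi = sum(w for A, w in p.items() if lam not in A and phi in A)
    q_both = sum(w for A, w in p.items() if lam in A and phi in A)
    return q_lam, q_phi, q_both

def train(S, learn, tree):
    classifiers = {}
    for n in tree.internal_nodes_bottom_up():   # leaves to root
        S_n = []
        for ex in S:
            # predictions of the already-trained subtrees, forfeiture included
            lam = tree.subtree_predict(n.left, classifiers, ex.x, ex.omega)
            phi = tree.subtree_predict(n.right, classifiers, ex.x, ex.omega)
            if lam in ex.omega or phi in ex.omega:
                continue                        # a forfeit: no example at this node
            lam_seen, phi_seen = lam in ex.revealed, phi in ex.revealed
            if lam_seen and phi_seen:
                # filter tree update: difference the two observed rewards
                S_n.append((ex.x, int(ex.r[lam] > ex.r[phi]),
                            abs(ex.r[lam] - ex.r[phi])))
            elif lam_seen or phi_seen:
                # offset tree update: offset by 1/2, reweight by observation odds
                # (the regret bound assumes these probabilities are positive)
                q_lam, q_phi, _ = pair_probs(ex.p, lam, phi)
                if lam_seen:
                    scale, y_obs, r_obs = (q_lam + q_phi) / q_lam, 1, ex.r[lam]
                else:
                    scale, y_obs, r_obs = (q_lam + q_phi) / q_phi, 0, ex.r[phi]
                y = y_obs if r_obs >= 0.5 else 1 - y_obs
                S_n.append((ex.x, y, scale * abs(r_obs - 0.5)))
            # if neither input was revealed, the example contributes nothing here
        classifiers[n] = learn(S_n)
    return classifiers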
Algorithm:Forfeit Filter-Offset Tree Test
Input: A binary tree $T$ over the labels with internal nodes $\Lambda (T)$.
Input: Trained classifiers $\{\Psi_n | n \in \Lambda (T) \}$.
Input: Instance realization $(x, \omega)$.
Result: Predicted label $k$.
- Let $n$ be the root node.
- Repeat until $n$ is a leaf node:
- If all the labels of the leaves in the left-subtree of $n$ are in $\omega$, traverse to the right child;
- else if all the labels of the leaves in the right-subtree of $n$ are in $\omega$, traverse to the left child;
- else if $\Psi_n (x) = 1$, traverse to the left child;
- else (when $\Psi_n (x) = 0$ and at least one label in each subtree is not in $\omega$), traverse to the right child.
- Return leaf label $k$.
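The test-time traversal, sketched with the same hypothetical node interface (`is_leaf`, `leaf_labels`, `label`); `omega` is a frozenset, so `<=` is subset testing:

def predict(tree, classifiers, x, omega):
    n = tree.root
    while not n.is_leaf:
        if n.left.leaf_labels() <= omega:       # entire left subtree infeasible:
            n = n.right                         #   left input forfeits
        elif n.right.leaf_labels() <= omega:    # entire right subtree infeasible:
            n = n.left                          #   right input forfeits
        elif classifiers[n](x) == 1:            # classifier prefers the left input
            n = n.left
        else:
            n = n.right
    return n.label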
Motivating the Update
The key to leveraging the filter tree style regret bound proof strategy is to ensure that the expected importance weight difference at an internal node is equal to the policy regret with respect to the two inputs to that node. When both reward values are known, the filter tree update gets the job done directly by differencing the inputs. When only one reward value is known, the offset tree update produces the correct result by weighting according to the relative probabilities of observation. Imagining a combination of the two, the expected importance weight of the left input conditioned on $(x, \omega, r)$ and $\lambda \not \in \omega$ and $\phi \not \in \omega$ is \[ \begin{aligned} w_{\lambda|r} &= E_{\mathcal{A}\sim p} \biggl[ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} 1_{r (\lambda) \geq \frac{1}{2}} \alpha_{\lambda, \neg \phi} \left( r (\lambda) - \frac{1}{2} \right) \\ &\quad \quad \quad \quad + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} 1_{r (\phi) < \frac{1}{2}} \alpha_{\neg \lambda, \phi} \left( \frac{1}{2} - r (\phi) \right) \\ &\quad \quad \quad \quad + 1_{\lambda \in \mathcal{A}} 1_{\phi \in \mathcal{A}} 1_{r (\lambda) > r (\phi)} \alpha_{\lambda, \phi} \bigl( r (\lambda) - r (\phi) \bigr) \biggr] \biggl/ \\ &\quad \quad E_{\mathcal{A} \sim p} \left[ 1_{\lambda \in \mathcal{A}} + 1_{\phi \in \mathcal{A}} - 1_{\lambda \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right], \end{aligned} \] where $\alpha_{\lambda,\neg \phi}$ is a (to be determined) scaling factor for when only $\lambda$ is observed and exceeds $\frac{1}{2}$ or when only $\phi$ is observed and does not exceed $\frac{1}{2}$; $\alpha_{\neg \lambda, \phi}$ is for when only $\phi$ is observed and exceeds $\frac{1}{2}$ or when only $\lambda$ is observed and does not exceed $\frac{1}{2}$; and $\alpha_{\lambda, \phi}$ is for when both $\lambda$ and $\phi$ are observed. Inspection suggests \[ \begin{aligned} \alpha_{\lambda, \neg \phi} &= (1 - \gamma) \frac{E_{\mathcal{A} \sim p} \left[ 1_{\lambda \in \mathcal{A}} + 1_{\phi \in \mathcal{A}} - 1_{\lambda \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right]}{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} ]}, \\ \alpha_{\neg \lambda, \phi} &= (1 - \gamma) \frac{E_{\mathcal{A} \sim p} \left[ 1_{\lambda \in \mathcal{A}} + 1_{\phi \in \mathcal{A}} - 1_{\lambda \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right]}{E_{\mathcal{A} \sim p} [ 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]}, \\ \alpha_{\lambda, \phi} &= \gamma \frac{E_{\mathcal{A} \sim p} \left[ 1_{\lambda \in \mathcal{A}} + 1_{\phi \in \mathcal{A}} - 1_{\lambda \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right]}{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]}, \end{aligned} \] for any $\gamma \in [0, 1]$ which would lead to \[ \begin{aligned} w_{\lambda|r} &= (1 - \gamma) \left( r (\lambda) - \frac{1}{2} \right)_+ + (1 - \gamma) \left( \frac{1}{2} - r (\phi) \right)_+ + \gamma \bigl( r(\lambda) - r (\phi) \bigr)_+, \\ w_{\phi|r} &= (1 - \gamma) \left( r (\phi) - \frac{1}{2} \right)_+ + (1 - \gamma) \left( \frac{1}{2} - r (\lambda) \right)_+ + \gamma \bigl( r(\phi) - r (\lambda) \bigr)_+, \\w_{\lambda|r} - w_{\phi|r} &= r (\lambda) - r (\phi). \end{aligned} \] What about $\gamma$? 
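(Note that the last line holds for any $\gamma \in [0, 1]$: since $(a)_+ - (-a)_+ = a$ for any real $a$, differencing the first two expressions termwise gives \[ w_{\lambda|r} - w_{\phi|r} = (1 - \gamma) \left( r (\lambda) - \frac{1}{2} \right) + (1 - \gamma) \left( \frac{1}{2} - r (\phi) \right) + \gamma \bigl( r (\lambda) - r (\phi) \bigr) = r (\lambda) - r (\phi), \] so the identity places no constraint on $\gamma$, which is exactly why its choice is open.)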
I don't have any theoretical reason for my particular choice; I just figured setting $\gamma$ to the relative probability of the filter update would give the right limiting behaviour (i.e., exactly reproduce the offset or filter tree given the corresponding $p (\mathcal{A} | x, \omega)$). That implies \[ \gamma = \frac{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]}{E_{\mathcal{A} \sim p} \left[ 1_{\lambda \in \mathcal{A}} + 1_{\phi \in \mathcal{A}} - 1_{\lambda \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right]}, \] and \[ \begin{aligned} \alpha_{\lambda, \neg \phi} &= \frac{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]}{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} ]}, \\ \alpha_{\neg \lambda, \phi} &= \frac{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}}]}{E_{\mathcal{A} \sim p} [ 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]}, \\ \alpha_{\lambda, \phi} &= 1. \end{aligned} \] Another idea is to choose $\gamma$ to minimize the expected importance weight, but I don't have any results along those lines.
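As a sanity check on the limiting behaviour, here's a minimal sketch (event probabilities named as in `pair_probs` above):

def gamma_of(q_lam_only, q_phi_only, q_both):
    """Relative probability of the filter update among informative draws."""
    return q_both / (q_lam_only + q_phi_only + q_both)

# offset-tree regime: never both revealed together => pure offset update
assert gamma_of(0.5, 0.5, 0.0) == 0.0
# filter-tree regime: always both revealed => pure filter update
assert gamma_of(0.0, 0.0, 1.0) == 1.0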
Regret Analysis
The regret analysis for the forfeit filter-offset tree is almost identical to the regret analysis for the forfeit offset tree. Let $\Psi = (T, \{\Psi_n | n \in \Lambda (T) \})$ denote a particular forfeit filter-offset tree (i.e., a choice of a binary tree and a particular set of node classifiers), and let $h^\Psi$ denote the policy that results from the forfeit filter-offset tree. The regret analysis leverages an induced importance-weighted binary distribution $D^\prime (\Psi)$ over triples $(x^\prime, y, w)$ defined as follows:
- Draw $(x, \omega, r)$ from $D$.
- Draw $n$ uniform over the internal nodes $\Lambda (T)$ of the binary tree.
- Let $x^\prime = (x, n)$.
- Let $\lambda$ and $\phi$ be the two classes input to $n$ (the predictions of the left and right subtrees on input $(x, \omega)$ respectively).
- If $\lambda \in \omega$, create importance-weighted binary example $(x^\prime, 0, 0)$;
- else if $\phi \in \omega$, create importance-weighted binary example $(x^\prime, 1, 0)$;
- else (when $\lambda \not \in \omega$ and $\phi \not \in \omega$):
- Draw $\mathcal{A}$ from $p (\mathcal{A} | x, \omega)$.
- If $\lambda \not \in \mathcal{A}$ and $\phi \not \in \mathcal{A}$, reject sample;
- else if $\lambda \in \mathcal{A}$ and $\phi \in \mathcal{A}$, create importance-weighted binary example \[\left( x^\prime, 1_{r (\lambda) > r (\phi)}, | r (\lambda) - r (\phi) | \right);\]
- else if $\lambda \in \mathcal{A}$ and $\phi \not \in \mathcal{A}$:
- If $r (\lambda) < \frac{1}{2}$, create importance-weighted binary example \[\left( x^\prime, 0, \frac{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]}{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} ]} \left(\frac{1}{2} - r (\lambda) \right) \right) ;\]
- else (when $r (\lambda) \geq \frac{1}{2}$), create importance-weighted binary example \[ \left( x^\prime, 1, \frac{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]}{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} ]} \left(r (\lambda) - \frac{1}{2}\right) \right) ;\]
- else (when $\lambda \not \in \mathcal{A}$ and $\phi \in \mathcal{A}$):
- If $r (\phi) < \frac{1}{2}$, create importance-weighted binary example \[ \left( x^\prime, 1, \frac{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]}{E_{\mathcal{A} \sim p} [ 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]} \left(\frac{1}{2} - r (\phi) \right) \right) ;\]
- else (when $r (\phi) \geq \frac{1}{2}$), create importance-weighted binary example \[ \left( x^\prime, 0, \frac{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]}{E_{\mathcal{A} \sim p} [ 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]} \left(r (\phi) - \frac{1}{2}\right) \right) .\]
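Before stating the bound, here's a small Monte Carlo sanity check of the key property this construction is designed for: conditioned on acceptance, the expected importance weight toward $\lambda$ minus that toward $\phi$ should come out to $r (\lambda) - r (\phi)$. The historical policy here (my assumption, purely for the check) reveals $\lambda$ and $\phi$ independently.

import random

def induced_weight_gap(r_lam, r_phi, a, b, trials=200_000, seed=0):
    """Monte Carlo check of the induced distribution: reveal lambda w.p. a and
    phi w.p. b, independently, then apply the example-creation rules above."""
    rng = random.Random(seed)
    q_lam, q_phi = a * (1 - b), (1 - a) * b   # exactly-one-revealed probabilities
    w_lam = w_phi = accepted = 0.0
    for _ in range(trials):
        lam_in, phi_in = rng.random() < a, rng.random() < b
        if not (lam_in or phi_in):
            continue                           # rejected: neither input revealed
        accepted += 1
        if lam_in and phi_in:                  # filter update
            if r_lam > r_phi: w_lam += r_lam - r_phi
            else:             w_phi += r_phi - r_lam
        elif lam_in:                           # offset update, only lambda observed
            scale = (q_lam + q_phi) / q_lam
            if r_lam >= 0.5: w_lam += scale * (r_lam - 0.5)
            else:            w_phi += scale * (0.5 - r_lam)
        else:                                  # offset update, only phi observed
            scale = (q_lam + q_phi) / q_phi
            if r_phi >= 0.5: w_phi += scale * (r_phi - 0.5)
            else:            w_lam += scale * (0.5 - r_phi)
    return (w_lam - w_phi) / accepted

print(induced_weight_gap(0.8, 0.5, a=0.6, b=0.3))  # should be close to 0.3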
Theorem:Regret Bound
For all partially labelled CSMC distributions $D$; all historical policies $p$ such that $E_{\mathcal{A} \sim p} [ 1_{a \in \mathcal{A}} | x, \omega ] > 0$ whenever $a \not \in \omega$; and all forfeit filter-offset trees $\Psi$, \[ v (h^\Psi) \leq (|A| - 1) q (\Psi) \] where $q (\Psi)$ is the importance-weighted binary regret on the induced subproblem.
Proof: See Appendix.
For the offset tree setting, where only one reward component is revealed per instance, the $(|A| - 1)$ dependence is tight. On the other hand, when all reward components are revealed per instance, there are reductions whose regret is independent of the number of actions. I suspect the lower bound from the offset tree paper can be generalized into a function of the distribution of $|\mathcal{A}|$. What this means in practice is that the forfeit filter-offset tree, unlike the forfeit offset tree, is not ``as good as it gets'' when more than one reward is revealed per historical instance.
Ok now I'm ready to look at cost-sensitive best m with partial feedback.
Appendix
This is the proof of the regret bound. Consider a fixed $(x, \omega)$. It is useful to talk about the conditional policy regret experienced at an internal node $n$, \[ v (h^\Psi | x, \omega, n) = \max_{k \in \Gamma (n)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (h^\Psi_n (x, \omega)) \right], \] where $\Gamma (n)$ denotes the set of labels (leaves) in the subtree rooted at $n$, and $h_n^\Psi$ is the prediction at internal node $n$. When $n$ is the root of the tree, $v (h^\Psi | x, \omega, n)$ is the forfeit filter-offset tree policy regret conditional on $(x, \omega)$.
The proof strategy is to bound $v (h^\Psi | x, \omega, n) \leq \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega)$ via induction, where $q_m (\Psi | x, \omega)$ is the importance-weighted binary regret of $\Psi_m$ on the induced subproblem at internal node $m$, conditioned on $(x, \omega)$. The base case is trivially satisfied for trees with only one leaf (no internal nodes), since it evaluates to $0 \leq 0$. To show the recursion at a particular internal node $n$, let $\lambda$ and $\phi$ be the predictions of the left subtree ($n_\lambda$) and right subtree ($n_\phi$) respectively.
Case 1: $\Gamma (n_\lambda) \setminus \omega = \emptyset$. In this case $\lambda \in \omega$ and forfeits, so $\phi$ is chosen. There must be a maximizer in the right subtree, since all values in the left subtree are $-\infty$. Furthermore $q_m (\Psi | x, \omega) = 0$ for both $m = n$ and for $m \in \Lambda (n_\lambda)$ by definition. Therefore \[ \begin{aligned} v (h^\Psi | x, \omega, n) &=
\max_{k \in \Gamma (n)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi) \right] \\ &= \max_{k \in \Gamma (n_\phi)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi) \right] \\ &= v (h^\Psi | x, \omega, n_\phi) \\ &\leq \sum_{m \in \Lambda (n_\phi)} q_m (\Psi | x, \omega) \\ &= \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned} \]
Case 2: $\Gamma (n_\lambda) \setminus \omega \neq \emptyset$ and $\Gamma (n_\phi) \setminus \omega = \emptyset$. In this case $\phi \in \omega$ and $\lambda \not \in \omega$, so $\phi$ forfeits and $\lambda$ is chosen. There must be a maximizer in the left subtree, since all values in the right subtree are $-\infty$. Furthermore $q_m (\Psi | x, \omega) = 0$ for both $m = n$ and for $m \in \Lambda (n_\phi)$ by definition. Therefore \[ \begin{aligned} v (h^\Psi | x, \omega, n) &=
\max_{k \in \Gamma (n)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\lambda) \right] \\ &= \max_{k \in \Gamma (n_\lambda)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\lambda) \right] \\ &= v (h^\Psi | x, \omega, n_\lambda) \\ &\leq \sum_{m \in \Lambda (n_\lambda)} q_m (\Psi | x, \omega) \\ &= \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned} \]
Case 3: $\Gamma (n_\lambda) \setminus \omega \neq \emptyset$ and $\Gamma (n_\phi) \setminus \omega \neq \emptyset$. This is the ``normal'' offset tree case, where both $\lambda \not \in \omega$ and $\phi \not \in \omega$ so no forfeiture happens. As shown above, the expected importance weights conditioned on $(x, \omega, r)$ and $\lambda \not \in \omega$ and $\phi \not \in \omega$ satisfy \[ | w_\lambda - w_\phi | = \left| E_{r \sim D_{r|\omega,x}} \left[ w_{\lambda|r} - w_{\phi|r} \right] \right| = \left| E_{r \sim D_{r|\omega,x}} [r (\lambda) - r (\phi)] \right|, \] i.e., the importance-weighted regret at an internal node is equal to the policy regret with respect to the two actions input to that node.
Assume without loss of generality that the classifier chooses $\phi$. If the maximizer comes from the right subtree, then \[ \begin{aligned} v (h^\Psi | x, \omega, n) &= \max_{k \in \Gamma (n_\phi)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi) \right] \\ &= v (h^\Psi | x, \omega, n_\phi) \\ &\leq \sum_{m \in \Lambda (n_\phi)} q_m (\Psi | x, \omega) \\ &\leq \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned} \] If the maximizer comes from the left subtree, then \[ \begin{aligned} v (h^\Psi | x, \omega, n) &= \max_{k \in \Gamma (n_\lambda)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi) \right] \\ &= E_{r \sim D_{r|\omega,x}} \left[ r (\lambda) - r (\phi) \right] + v (h^\Psi | x, \omega, n_\lambda) \\ &= q_n (\Psi | x, \omega) + v (h^\Psi | x, \omega, n_\lambda) \\ &\leq q_n (\Psi | x, \omega) + \sum_{m \in \Lambda (n_\lambda)} q_m (\Psi | x, \omega) \\ &\leq \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned} \] Terminating the induction at the root yields \[ v (h^\Psi | x, \omega) \leq \sum_{n \in \Lambda (T)} q_n (\Psi | x, \omega) = |\Lambda (T)| q (\Psi | x, \omega). \] Taking the expectation of both sides with respect to $D_x \times D_{\omega|x}$ and noting $|\Lambda (T)| = (|A| - 1)$ completes the proof.