So far I have:
- introduced my quest to come up with alternative decision procedures that do not involve providing estimates to standard OR algorithms;
- motivated understanding stochastic shortest path (SSP) without recourse as part of my quest and analyzed a reduction to regression;
- described an error transform of SSP to cost-sensitive multiclass classification (CSMC) based upon Searn.
- noted that a naive reduction of SSP to CSMC, composed with the filter tree reduction of CSMC to binary classification, suggested that an efficient consistent reduction of SSP without recourse to CSMC was possible.
Now I present a reduction of SSP to CSMC which is inspired by the filter tree. This reduction is consistent, i.e., achieving zero regret on the induced CSMC problem implies achieving zero regret on the original SSP problem.
The Algorithm
Unlike the Searn-style approach, here I train the classifiers ``last-to-first''. As noted in a previous post, this is only possible because, in SSP without recourse, the continuation costs do not depend upon the path prefix; in SSP with recourse, costs are revealed during traversal, so the continuation costs do depend upon the path prefix.
At each stage of training, I learn paths of a particular length from any vertex to the target vertex. Initially the length is 1, so there is only one possible path from any vertex to the target vertex, and no learning is required. To learn paths of length $k$, I use the previously learned classifiers for paths of length $k-1$ to define the continuation costs of choosing a particular first step from a particular vertex. Once I reach paths of length $n$, I am done. This formulation leverages the simplification from a previous post where every node has a self-connection of zero cost and paths must be of length $n$. As a final post-processing step, I can remove any cycles in the path to get paths of any length up to $n$.
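For example, with $n = 4$, $s = 1$, and $t = 4$, the two-edge path $1 \to 3 \to 4$ is padded to the required four-vertex path $1 \to 3 \to 4 \to 4$ using the zero-cost self-connection at $t$; removing the cycle recovers $1 \to 3 \to 4$ at the same cost.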
Algorithm: Path Grid Train
Data: Fully connected graph with $n$ vertices; target vertex $t$; source vertex $s$.
Data: Training data set $S$.
Input: Cost-sensitive classification routine $\mbox{Learn}$.
Result: Trained classifiers $\{\Psi_k | k \in [1, n-2] \}$.
- Define $\gamma_{n-1, v} = \{ \{v, t\} \}$ for $v \in [1, n]$.
- For each step $k$ in the path from $n - 2$ to 1:
- $S_k = \emptyset$.
- Let $\{ \gamma_{k+1,v} | v \in [1, n] \}$ be the predicted best paths of length $n - k - 1$ to the target vertex $t$ starting from vertex $v$.
- For each example $(x, f) \in S$:
- For each vertex $v$ from 1 to $n$; or when $k = 1$, only $v = s$:
- Define the estimated cost of continuing through vertex $w$ as $c_w (k, v) = f (e_{vw}) + \sum_{e \in \gamma_{k+1,w}} f (e)$.
- Define the lowest-cost next vertex as $w^* (k, v) = \operatorname{arg\,min\,}_w c_w (k, v)$.
- $S_k \leftarrow S_k \cup \left\{\bigl( (x, v), w^* (k, v), c (k, v) \bigr) \right\}$, where $c (k, v) = \bigl( c_1 (k, v), \ldots, c_n (k, v) \bigr)$ is the vector of continuation costs.
- Let $\Psi_k = \mbox{Learn} (S_k)$.
- For each vertex $v$ from 1 to $n$; or when $k = 1$, only $v = s$:
- Let $\gamma_{k,v} = \Psi_k (x, v) \odot \gamma_{k+1,\Psi_k (x, v)}$.
- Return $\{ \Psi_k | k \in [1, n-2] \}$.
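To make the bookkeeping concrete, here is a minimal Python sketch of Path Grid Train. It is only an illustration, with representation choices of my own: vertices are $1, \ldots, n$, paths are vertex lists, each realization supplies an edge-cost function `f(v, w)` $= f (e_{vw})$ with `f(v, v) == 0` for the self-connections, and `learn` is any cost-sensitive learner mapping a list of `((x, v), costs)` examples to a classifier callable as `psi_k(x, v)`.

```python
def path_cost(path, f):
    """Total realized cost of a path given as a list of vertices."""
    return sum(f(a, b) for a, b in zip(path, path[1:]))

def train_path_grid(n, s, t, S, learn):
    """Sketch of Path Grid Train (vertices 1..n, zero-cost self-connections)."""
    psi = {}
    # gamma(x, v): current predicted best path (list of vertices) from v to t
    gamma = lambda x, v: [v, t]                    # gamma_{n-1, v}: only one choice

    for k in range(n - 2, 0, -1):                  # steps n-2 down to 1
        S_k = []
        for (x, f) in S:
            for v in ([s] if k == 1 else range(1, n + 1)):
                # c_w(k, v): cost of stepping v -> w, then following gamma_{k+1, w}
                costs = {w: f(v, w) + path_cost(gamma(x, w), f)
                         for w in range(1, n + 1)}
                S_k.append(((x, v), costs))
        psi[k] = learn(S_k)                        # induced CSMC subproblem
        # gamma_{k, v}: prepend v, then follow gamma_{k+1, Psi_k(x, v)}
        gamma = (lambda clf, prev:
                 lambda x, v: [v] + prev(x, clf(x, v)))(psi[k], gamma)
    return psi
```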
Algorithm: Path Grid Test
Data: Fully connected graph with $n$ vertices; target vertex $t$; source vertex $s$.
Data: Graph realization $x$.
Input: Trained classifiers $\{\Psi_k | k \in [1, n-2] \}$.
Result: Path $\pi: X \to V^n$.
- $\pi_1 (x) = s$.
- $\pi_n (x) = t$.
- $\pi_{k+1} (x) = \Psi_k (x, \pi_k (x))$ for $k \in [1, n - 2]$.
- In practice, remove any cycles in $\pi$; for analysis, leave them in.
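Similarly, here is a sketch of Path Grid Test along with one possible cycle-removal post-processing step; the particular removal strategy is my own choice, and any method that cuts out the segment between repeated visits to a vertex would do.

```python
def test_path_grid(n, s, t, psi, x):
    """Sketch of Path Grid Test: pi_1 = s, pi_{k+1} = Psi_k(x, pi_k), pi_n = t."""
    pi = [s]
    for k in range(1, n - 1):               # k = 1 .. n-2
        pi.append(psi[k](x, pi[-1]))
    pi.append(t)
    return pi

def remove_cycles(pi):
    """Cut out the segment between repeated visits to the same vertex."""
    seen, out = {}, []
    for v in pi:
        if v in seen:                        # revisiting v: drop the loop
            out = out[:seen[v] + 1]
            seen = {u: i for i, u in enumerate(out)}
        else:
            out.append(v)
            seen[v] = len(out) - 1
    return out
```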
I would like to bound the regret on the original SSP problem, $r_{spath} (\Psi)$, in terms of the cost-sensitive regret on the induced subproblems. Following a trick from the filter tree derivation, I will collapse the multiple subproblems into a single subproblem via an induced distribution. Let $D = D_x \times D_{f|x}$ be the distribution of $(x, f)$. Define the induced distribution $D^\prime_\Psi$ of cost-sensitive examples $(x^\prime, c)$ as follows:
- Draw $(x, f)$ from $D$.
- Draw $k$ uniform on $[1, n - 2]$.
- If $k = 1$, $v = s$; else draw $v$ uniform on $[1, n]$.
- Let $x^\prime = (x, k, v)$.
- Let $c_w (k, v) = f (e_{vw}) + \sum_{e \in \gamma_{k+1,w}} f (e)$ for $w \in [1, n]$.
- Create cost-sensitive example $\bigl( x^\prime, c \bigr)$.
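In code, sampling from $D^\prime_\Psi$ might look like the following sketch, reusing `path_cost` from the training sketch above; `draw_from_D` is a hypothetical sampler for $D$, and `gammas[k]` is assumed to hold the step-$k$ path predictor $\gamma_{k, \cdot}$ (so the training sketch would have to retain each intermediate `gamma` rather than overwrite it).

```python
import random

def draw_induced_example(n, s, draw_from_D, gammas):
    """Draw one cost-sensitive example (x', c) from D'_Psi (sketch)."""
    x, f = draw_from_D()                           # (x, f) ~ D
    k = random.randint(1, n - 2)                   # k uniform on [1, n-2]
    v = s if k == 1 else random.randint(1, n)      # v = s when k = 1, else uniform
    x_prime = (x, k, v)
    # c_w(k, v) = f(e_{vw}) + cost of gamma_{k+1, w}, for w in [1, n]
    c = {w: f(v, w) + path_cost(gammas[k + 1](x, w), f)
         for w in range(1, n + 1)}
    return x_prime, c
```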
Lemma: Next Step Bound
For all $(k, v) \in \{ (1, s) \} \cup \{ (k^\prime, v^\prime) | k^\prime \in [2, n - 2], v^\prime \in [1, n] \}$, let $w^*$ be the first vertex after $v$ in a regret minimizing path \[\underset{p \in P_{kv}}{\operatorname{arg\,min\,}} E_{f \sim D_{f|x}} \left[ \sum_{e \in p} f (e) \right].\] Then either
- $r_{kv} (\Psi | x) = r_{k+1,w^*} (\Psi | x)$ if $\Psi_k (x, v) = w^*$; or
- $r_{kv} (\Psi | x) \leq q_{kv} (\Psi | x) + r_{k+1,w^*} (\Psi | x)$ if $\Psi_k (x, v) \neq w^*$.
Theorem: Regret Bound
For all SSP distributions $D$ and all cost-sensitive classifiers $\Psi$, \[r_{spath} (\Psi) \leq (1 + n (n - 3)) q (\Psi).\]
Proof: Consider a fixed $x$. The proof is by backwards induction on $k$, establishing that $r_{kv} (\Psi | x) \leq \sum_{(k^\prime,v^\prime) \in \Omega_{kv}} q_{k^\prime,v^\prime} (\Psi | x)$, where $\Omega_{kv} = \{ (k, v) \} \cup \{ (k^\prime, v^\prime) | k^\prime \in [k + 1, n - 2], v^\prime \in [1, n] \}$ is the set of (step, vertex) pairs that can appear in a path starting at $(k, v)$.
When $k = n - 2$, it is easily verified directly that $r_{n-2,v} (\Psi | x) = q_{n-2,v} (\Psi | x)$. For $k \in [1, n - 2)$, \[ \begin{aligned} r_{kv} (\Psi | x) &\leq q_{kv} (\Psi | x) + r_{k+1,w^*} (\Psi | x) \\ &\leq q_{kv} (\Psi | x) + \sum_{(k^\prime,v^\prime) \in \Omega_{k+1,w^*}} q_{k^\prime,v^\prime} (\Psi | x) \\ &\leq \sum_{(k^\prime,v^\prime) \in \Omega_{kv}} q_{k^\prime,v^\prime} (\Psi | x), \end{aligned} \] where the first step is from the lemma, the second step leverages the induction hypothesis, and the third step is due to the non-negativity of regret. The induction terminates at the root where \[ r_{1s} (\Psi | x) \leq \sum_{(k, v) \in \Omega_{1s}} q_{kv} (\Psi | x) = (1 + n (n - 3)) q (\Psi | x).\] Taking the expectation of both sides with respect to $D_x$ completes the proof.
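As a quick sanity check on the constant (my own check, not part of the argument), enumerating $\Omega_{1s}$ confirms that it contains exactly $1 + n (n - 3)$ pairs:

```python
def omega_1s(n, s):
    """Omega_{1s} = {(1, s)} union {(k, v) : k in [2, n-2], v in [1, n]}."""
    return [(1, s)] + [(k, v) for k in range(2, n - 1) for v in range(1, n + 1)]

for n in range(4, 10):
    assert len(omega_1s(n, s=1)) == 1 + n * (n - 3)
```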
Comparison with Regression Reduction
The analysis of the reduction of SSP without recourse to regression with $L_2$ loss yielded a $\sqrt{r}$ dependence on the regression regret with a scaling factor of $\sim n^{3/2}$. This reduction to CSMC yields a linear dependence on the CSMC regret, but with a scaling factor of $O (n^2)$. Unfortunately, the two bounds are therefore not directly comparable.
If I further reduce CSMC to regression, I end up with an $O (n^{5/2})$ scaling factor and a $\sqrt{r}$ dependence on the regression regret. Intuitively, it feels like reducing to regression via CSMC should give the same result as reducing to regression directly; that suggests an $O (n)$ scaling factor for this reduction is possible. I actually separated the lemma from the theorem because I'm hoping to use the lemma to prove a tighter bound (something involving $n$ quantities, rather than $n^2$); but that's for another day.