In previous posts I talked about stochastic shortest path (SSP) without recourse and analyzed a reduction to regression and a reduction to cost-sensitive multiclass classification (CSMC) via Searn. The latter was underwhelming because it only led to a bound on the SSP regret in terms of the error on the underlying CSMC problems, whereas the regression reduction leads to a bound on the SSP regret in terms of the regret on the underlying regression problem. (See the previous posts on the regression reduction and the Searn-style reduction for the exact variant of SSP analyzed and the notation used.)
I have strong reason to believe that a reduction to CSMC with a bound based upon the underlying CSMC regret is possible. I haven't completely worked it out, but my inspiration comes from thinking about what a naive reduction of SSP to CSMC would look like.
First, some further simplifications over previous posts. As before, I'll require all paths to be of length $n$, but now I'll allow any node to connect to itself for zero cost (not just the target node). This means I can associate each potential path with an $(n - 2)$ digit number in base $n$. There are $n^{n-2}$ such paths, and a direct reduction to CSMC would equate each path with a class. With this naive reduction, the shortest-path regret and the cost-sensitive regret are identical. Of course, I am ignoring a lot of structure here, because these classes don't have independent costs: they are related because they are different sums over the same $n^2$ edge costs.
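Concretely, a class index decodes into the sequence of intermediate nodes, and its cost is a sum of edge costs. Here is a minimal sketch of that correspondence, where `costs`, `source`, and `target` are illustrative names (a realized $n \times n$ edge-cost matrix and the fixed endpoints), not notation from the earlier posts:

```python
def decode_path(k, n):
    """The (n - 2) base-n digits of class index k, most significant first,
    read as the sequence of intermediate nodes of the path."""
    digits = []
    for _ in range(n - 2):
        digits.append(k % n)
        k //= n
    return digits[::-1]

def path_cost(k, costs, source, target):
    """Cost of the path encoded by k, given an n-by-n matrix `costs` of
    realized edge costs with costs[v][v] == 0 (any node may 'wait')."""
    n = len(costs)
    nodes = [source] + decode_path(k, n) + [target]
    return sum(costs[u][v] for u, v in zip(nodes, nodes[1:]))

# The naive CSMC reduction treats each of the n**(n-2) values of k as a class,
# whose cost on a given realization is path_cost(k, costs, source, target).
```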
Since there are so many classes in this naive reduction, I'll choose the filter tree to reduce the CSMC problem to binary classification. It has the advantage that run-time computation is logarithmic in the number of classes, namely $(n - 2) \log_2 (n)$ classifications, which actually looks reasonable. Training time is another story: naively, I'd have to train a binary classifier at each of the roughly $n^{n-2}$ internal nodes of the filter tree.
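Test time in the filter tree is just a root-to-leaf walk through the tree of classes. A rough sketch of that walk, where `predict` is a hypothetical stand-in for the trained binary classifier at each internal node (not any particular library's API):

```python
from math import log2

def filter_tree_predict(x, n, predict):
    """One root-to-leaf walk through a filter tree over the n**(n-2) path
    classes: (n - 2) * log2(n) binary classifications at test time.
    `predict(node_id, x)` stands for the trained binary classifier at the
    internal node `node_id`; it returns 0 (go left) or 1 (go right)."""
    depth = (n - 2) * int(log2(n))
    k = 0                                    # class index, built up one bit per level
    for level in range(depth):
        k = 2 * k + predict((level, k), x)   # (level, prefix) names the internal node
    return k                                 # leaf index = predicted shortest path
```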
But wait, the filter tree lets me fix any tree over the labels. Suppose that $n = 2^m$ in what follows, so I don't have to think about parents that lack two children in the tree. I can choose a tree such that the $k^\mathrm{th}$ leaf is the $(n-2)$ digit base $n$ number whose value is $k$. In that case, at the first level of internal nodes, all paths being compared will share all but their last digit, so instead of having to learn $n^{n-2} / 2$ binary classifiers, I can just learn $n^2 / 2$ binary classifiers. This is because the cost of two paths that share all but their last digit depends only upon their last two digits and not the rest of the path prefix. (This is the important part, true only because this is SSP without recourse: if this were SSP with recourse, costs would be revealed as the paths were traversed, and the conditional distributions of remaining edge costs would be a function of the path traversed so far.) On the second level of internal nodes, I need only learn $n^2 / 4$ classifiers; and for the first $\log_2 (n)$ levels of internal nodes, I only need learn $n^2$ classifiers in total.
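The first-level count is easy to sanity check under the same (illustrative) encoding as the sketch above: the binary problem at the node over leaves $2j$ and $2j+1$ is determined by the shared second-to-last digit and the pair of last digits, so only $n^2 / 2$ distinct problems appear.

```python
def first_level_key(j, n):
    """What the binary problem at the first-level node over leaves 2j and 2j+1
    can depend on: the shared second-to-last digit and the pair of last digits."""
    last_pair = (2 * j) % n               # last digits of the two leaves are last_pair, last_pair + 1
    second_to_last = (2 * j // n) % n     # shared by both leaves
    return (second_to_last, last_pair)

n = 8                                     # n = 2**m; paths have n - 2 digits
keys = {first_level_key(j, n) for j in range(n ** (n - 2) // 2)}
assert len(keys) == n * n // 2            # only n**2 / 2 distinct binary problems at level one
```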
On the $(\log_2 (n) + 1)^\mathrm{th}$ level of internal nodes, the paths being compared share all but their last two digits, and the last digit of each has already been determined by the second-to-last digit via the lower levels of internal nodes, so each comparison depends only upon the last three digits. This means that for the next $\log_2 (n)$ levels of internal nodes, I still only have to learn $n^2$ classifiers in total.
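Extending the same bookkeeping to later levels gives a rough way to check the counting: label each internal node by the quantities its binary problem can depend on (the shared predecessor digit and the group of candidate values for the digit being resolved) and count distinct labels per block of $\log_2 (n)$ levels. A small sketch, under the same encoding assumptions as above:

```python
from math import log2

def node_key(level, j, n):
    """Label for the binary problem at the j-th internal node of `level`
    (level 1 sits just above the leaves).  By the argument above, two nodes
    with the same label face the same underlying binary problem."""
    m = int(log2(n))
    block = (level - 1) // m + 1             # which digit (counting from the last) this level helps resolve
    first = j * 2 ** level                   # smallest leaf (path index) under this node
    predecessor = (first // n ** block) % n  # shared digit just before it (0 stands in for the source at the top block)
    group = (first // n ** (block - 1)) % n  # smallest candidate value of the digit being resolved
    return (level, predecessor, group)

n, m = 8, 3                                  # n = 2**m; paths have n - 2 digits
leaves = n ** (n - 2)
for block in range(1, n - 1):
    levels = range((block - 1) * m + 1, block * m + 1)
    distinct = {node_key(level, j, n)
                for level in levels
                for j in range(leaves // 2 ** level)}
    print(block, len(distinct))              # at most n**2 distinct problems per block of log2(n) levels
```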
Following this logic, I have to learn roughly $n^2$ binary classifiers for each block of $\log_2 (n)$ levels, i.e., roughly $n^2 (n - 2)$ binary classifiers in total at train time, and perform $(n - 2) \log_2 (n)$ classifications at test time. The regret bound for the filter tree is the sum of the regrets at the internal nodes, which suggests an $n^{n-2}$ leading factor, but because there are only roughly $n^2 (n - 2)$ distinct underlying binary problems, one might hope the regret would scale with the lower quantity.
Now I have to dig into the filter tree and really understand it, and translate this back into a specific reduction for SSP.