Given my current obsession, I thought it would be interesting to ask how someone trained in operations research (OR) might think about the reduction of cost-sensitive multiclass classification to regression. In particular, for fixed $x$ the problem can be viewed as the linear program
\[ \begin{aligned} \min_{d (x, k)} & \sum_k d (x, k) \bar c (x, k) \\ \mbox{s.t. } d (x, k) & \geq 0 \\ \sum_k d (x, k) &= 1, \end{aligned} \] which has solution $d (x, k) = 1_{k = \operatorname{arg\,min}_{k' \in K} \bar c (x, k')}$. However, the $\bar c (x, k)$ are not known; instead we have estimates $g (x, k)$. So what would be a standard OR way of dealing with the uncertainty?
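As a sanity check (my own sketch, with a made-up cost vector, not something from the derivation above), solving this per-$x$ LP numerically confirms that the optimum puts all the mass on the cheapest class:

```python
import numpy as np
from scipy.optimize import linprog

c_bar = np.array([0.7, 0.2, 0.9])   # hypothetical \bar c(x, k) for a fixed x

# LP: minimize sum_k d(x, k) c_bar(x, k)  s.t.  d >= 0, sum_k d(x, k) = 1
res = linprog(c=c_bar,
              A_eq=np.ones((1, len(c_bar))), b_eq=[1.0],
              bounds=[(0, None)] * len(c_bar))

print(res.x)             # -> [0. 1. 0.], all mass on the cheapest class
print(np.argmin(c_bar))  # -> 1, i.e., the same argmin decision rule
```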
Robust optimization takes the approach of optimizing for the worst case among a set of feasible scenarios. For instance, we could say that the estimates $g (x, k)$ can differ from the actual values $\bar c (x, k)$ by an amount of magnitude at most $\Delta$, and then try to minimize our classification cost under the worst-possible estimation error, \[ \begin{aligned} \min_{d (x, k)}\; \max_{\delta (x, k)} & \sum_k d (x, k) \bigl( g (x, k) + \delta (x, k) \bigr) \\ \mbox{s.t. } d (x, k) & \geq 0 \\ \sum_k d (x, k) &= 1 \\ - \Delta \leq \delta (x, k) &\leq \Delta. \end{aligned} \] Uninterestingly, this has the same solution, namely $d (x, k) = 1_{k = \operatorname{arg\,min}_{k' \in K} g (x, k')}$; this is because the inner max is always achieved by setting $\delta (x, k) = \Delta$ for every $k$, which shifts all the estimates by a constant, so the worst-case values of $\bar c (x, k)$ are ordered the same as the estimates.
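A tiny numerical illustration of that point (again my own sketch, with assumed numbers): the inner max just adds $\Delta$ to every estimate, so the worst-case objective is ordered exactly like the estimates.

```python
import numpy as np

g = np.array([0.7, 0.2, 0.9])   # hypothetical estimates g(x, k)
Delta = 0.25                    # assumed bound on the estimation error

# For any fixed d >= 0 the inner max sets delta(x, k) = +Delta everywhere,
# so the worst-case costs are just the estimates shifted by a constant.
worst_case = g + Delta
print(np.argmin(worst_case) == np.argmin(g))   # True: same decision as before
```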
If instead we try to minimize our classification regret under the worst-possible estimation error, \[ \begin{aligned} \min_{d (x, k)}\; \max_{\delta (x, k)} & \left\{ \sum_k d (x, k) \bigl( g (x, k) + \delta (x, k) \bigr) - \min_{k'}\; \bigl\{g (x, k') + \delta (x, k') \bigr\} \right\} \\ \mbox{s.t. } d (x, k) & \geq 0 \\ \sum_k d (x, k) &= 1 \\ - \Delta \leq \delta (x, k) &\leq \Delta, \end{aligned} \] then things get more interesting. Let $k^*$ and $k^{**}$ denote the indices of the lowest and second-lowest estimates, and let $\tilde g (x, k) = g (x, k) - g (x, k^*)$. Clearly $d (x, k) = 0$ if $\tilde g (x, k) \geq 2 \Delta$, since class $k$ can never be the best choice. Thus if $\tilde g (x, k^{**}) \geq 2 \Delta$, then $d (x, k^*) = 1$ is the optimum, as in the non-robust case. So suppose $\tilde g (x, k^{**}) < 2 \Delta$, and consider pure strategies first. If we choose $d (x, k^*) = 1$, the adversary does best by setting $\delta (x, k^{**}) = -\Delta$ and $\delta (x, k^*) = \Delta$, resulting in regret $2 \Delta - \tilde g (x, k^{**})$. If we choose $d (x, k) = 1$ for some $k \neq k^*$ with $\tilde g (x, k) < 2 \Delta$, the adversary does best by setting $\delta (x, k^*) = -\Delta$ and $\delta (x, k) = \Delta$, resulting in regret $2 \Delta + \tilde g (x, k)$. Indifference to the adversary's strategy indicates the best mixed strategy satisfies\[d (x, k) \bigl( 2 \Delta + \tilde g (x, k) \bigr) = d (x, k^*) \bigl(2 \Delta - \tilde g (x, k^{**}) \bigr).\]Combined with the normalization constraint this yields\[d (x, k) = \frac{1_{k = k^*} + 1_{k \neq k^*} 1_{\tilde g (x, k) < 2 \Delta} \frac{2 \Delta - \tilde g (x, k^{**})}{2 \Delta + \tilde g (x, k)}}{1 + \sum_{k' \neq k^*} 1_{\tilde g (x, k') < 2 \Delta} \left( \frac{2 \Delta - \tilde g (x, k^{**})}{2 \Delta + \tilde g (x, k')} \right)},\]which is a bit easier to read in the case that only $k^{**}$ satisfies $\tilde g (x, k) < 2 \Delta$,\[d (x, k) = \begin{cases} \frac{1}{2} - \frac{\tilde g (x, k^{**})}{4 \Delta} & k = k^{**}; \\\frac{1}{2} + \frac{\tilde g (x, k^{**})}{4 \Delta} & k = k^*; \\0 & \mbox{otherwise}.\end{cases}\]

Thus the robust optimization with respect to regret replaces the hard argmin with a "softmax" that tends to choose the class with the best estimate, but also chooses classes with comparable estimates with a probability that decreases as their estimates degrade relative to the best choice. I've used softmax-style decision rules in the past, but I always thought of them as a necessary cost to pay in order to explore (e.g., to track nonstationarity) in whatever problem I was working on. The idea that a softmax might itself be an optimal strategy for generalization, given the inevitable inaccuracy of the estimates employed, hadn't occurred to me until now.
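For the curious, here's a sketch (mine, with assumed values for $g$ and $\Delta$) of the closed-form mixed strategy derived above:

```python
import numpy as np

def robust_regret_decision(g, Delta):
    """Mixed strategy from the indifference argument above: classes with
    tilde_g(x, k) >= 2*Delta get zero mass; every other k != k* gets weight
    (2*Delta - tilde_g(k**)) / (2*Delta + tilde_g(k)), k* gets weight 1,
    and the weights are normalized to sum to one."""
    g = np.asarray(g, dtype=float)
    order = np.argsort(g)
    k_star, k_star2 = order[0], order[1]      # k* and k**
    tilde_g = g - g[k_star]

    w = np.zeros_like(g)
    w[k_star] = 1.0
    for k in range(len(g)):
        if k != k_star and tilde_g[k] < 2 * Delta:
            w[k] = (2 * Delta - tilde_g[k_star2]) / (2 * Delta + tilde_g[k])
    return w / w.sum()

# Hypothetical estimates g(x, k) and error bound Delta.
print(robust_regret_decision([0.7, 0.2, 0.9], Delta=0.3))  # ~[0.083, 0.917, 0.]
```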
Given that a softmax activation function like \[ d (x, k ; \beta) = \frac{e^{-\tilde g (x, k) / \beta}}{\sum_{k'} e^{-\tilde g (x, k') / \beta}} \] goes from a hard argmin ($\beta \to 0$) to uniformly random selection ($\beta \to \infty$), it would be interesting to see whether performance on a problem is actually best for some non-zero value of $\beta$.
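For comparison, here's what sweeping $\beta$ in such a rule might look like (again a sketch with assumed estimates, applied to the cost differences $\tilde g$, so it behaves as a soft argmin):

```python
import numpy as np

def softmax_decision(g, beta):
    """Temperature-beta rule over cost estimates: beta -> 0 recovers the
    hard argmin, beta -> infinity tends toward uniform random selection."""
    g = np.asarray(g, dtype=float)
    tilde_g = g - g.min()              # \tilde g(x, k); the best class sits at 0
    w = np.exp(-tilde_g / beta)
    return w / w.sum()

g = [0.7, 0.2, 0.9]                    # hypothetical estimates g(x, k)
for beta in (0.01, 0.1, 1.0, 10.0):
    print(beta, softmax_decision(g, beta))
```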