Continuing with my current obsession, I decided to really understand the reduction of cost-sensitive multiclass classification (CSMC) to regression. In particular I wanted to derive the well-known error and regret bounds for losses in the $L^p$ family. My hope is that I'll be able to reuse some of these techniques for more complicated problems involving supplying estimated coefficients to standard OR solvers.
Preliminaries. I'm considering the standard cost-per-example setup of cost-sensitive multiclass classification. There is a finite set $K$ of classes, which forms the decision space. There is a distribution $D = D_x \times D_{c|x}$ from which pairs $(x, c)$ are drawn IID, where $x$ is the feature vector and $c: K \to \mathbb{R}$ is the cost vector, with $c (k)$ the cost of choosing class $k$ for this instance. A cost-sensitive multiclass classifier $h: X \to K$ has regret
\[ r_{csmc} (h) = E_{x \sim D_x} \left[ E_{c \sim D_{c|x}} [ c (h (x)) ] - \min_{k \in K}\; E_{c \sim D_{c|x}} [ c (k) ] \right]. \] An argmax regression strategy for cost-sensitive multiclass classification is a function $g: X \times K \to \mathbb{R}$ which defines an associated cost-sensitive multiclass classifier $h_g: X \to K$ according to \[ h_g (x) = \underset{k \in K}{\operatorname{arg\,min\,}} g (x, k). \] I would like to bound $r_{csmc} (h_g)$ in terms of the error of $g$ on the regression problem, \[ q_{L} (g) = E_{(x, c) \sim D} \left[ \frac{1}{|K|} \sum_{k \in K} L \bigl (g (x, k), c (k) \bigr) \right], \] where $L$ is a loss function for the regression problem. Also of interest are bounds on $r_{csmc} (h_g)$ in terms of the regret of $g$ on the regression problem, \[ \epsilon_{L} (g) = q_{L} (g) - \min_{g^\prime}\; q_{L} (g^\prime). \]
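To make the reduction concrete, here is a minimal Python sketch of the argmax regression strategy. Everything in it is an illustrative assumption not taken from the setup above: costs arrive as a dense matrix, the class is one-hot encoded into the regression features, and $g$ is an ordinary least-squares fit (any regression learner would do).

import numpy as np

def to_regression_dataset(X, C):
    # One regression example ((x, k), c(k)) per (instance, class) pair,
    # with the class one-hot encoded alongside the original features.
    n, num_classes = C.shape
    Z = np.hstack([np.repeat(X, num_classes, axis=0),
                   np.tile(np.eye(num_classes), (n, 1))])
    y = C.reshape(-1)
    return Z, y

def fit_g(X, C):
    # g is an ordinary least-squares regressor over the joint (x, k) features.
    Z, y = to_regression_dataset(X, C)
    w, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return w

def h_g(w, x, num_classes):
    # The induced classifier picks the class with the smallest estimated cost.
    Z = np.hstack([np.tile(x, (num_classes, 1)), np.eye(num_classes)])
    return int(np.argmin(Z @ w))

# Example usage with synthetic data (shapes only; the data is meaningless).
X = np.random.default_rng(0).normal(size=(100, 3))
C = np.random.default_rng(1).uniform(size=(100, 4))
print(h_g(fit_g(X, C), X[0], 4))

The reduction only needs the predictions $g (x, k)$ to rank classes within each instance; the particular regression learner is irrelevant to the bounds below.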
Error Bounds. Here I'll derive a bound on $r_{csmc} (h_g)$ in terms of $q_{L} (g)$ for $L^p$ loss with $p > 1$, where \[ L^p (\hat y, y) = |\hat y - y|^p. \] The basic idea is to consider an instance $(x, c)$ and suppose our regressor has cost estimates which differ from the actual cost by $\delta (x, k)$, \[ g (x, k) = c (k) + \delta (x, k). \] Imagine an adversary which attempts to create a certain amount of cost-sensitive multiclass classification regret on this instance while minimizing the amount of regression error on this instance. This adversary is faced with the following problem: \[ \begin{aligned} & \min_{\delta}\; \sum_{k \in K} L \bigl (c (k) + \delta (x, k), c (k) \bigr) \\ \mathrm{s.t.} \quad & c (h_g (x)) - \min_k\; c (k) = r. \end{aligned} \] Clearly $r$ can only take one of $|K|$ values corresponding to the possible decisions made by $h_g$. Consider each such possible decision $k^{**}$; furthermore let $k^* = {\operatorname{arg\,min\,}}_k c(k)$. Then the adversary has a family of choices of the following form: \[ \begin{aligned} & \min_{\delta}\; \sum_{k \in K} |\delta (x, k)|^p \\ \mathrm{s.t.} \; \forall k \neq k^{**}, \; & c (k^{**}) + \delta (x, k^{**}) \leq c (k) + \delta (x, k). \end{aligned} \] The KKT stationarity condition implies \[ \begin{aligned} p \ \Theta (\delta (x, k)) \ |\delta (x, k)|^{p-1} - \mu_k &= 0 \quad (k \neq k^{**}), \\ p \ \Theta (\delta (x, k^{**})) \ |\delta (x, k^{**})|^{p-1} + \sum_{k \neq k^{**}} \mu_k &= 0, \end{aligned} \] where $\Theta (x)$ is the sign function. Complementary slackness indicates $\mu_k = \delta (x, k) = 0$ for classes other than the two being transposed; the remaining stationarity conditions imply $\delta (x, k^*) = -\delta (x, k^{**})$, and dual feasibility implies $\delta (x, k^*) > 0$. Intuitively, since errors are penalized convexly, the adversary does best by concentrating equal and opposite error on the two classes being transposed. Substituting into the inequality constraint for $k^*$ yields \[ c (k^{**}) - c (k^*) \leq 2 \delta (x, k^*) = 2 \sqrt[p]{ \frac{|K|}{2} \frac{1}{|K|} \sum_{k \in K} |\delta (x, k)|^p }. \] Taking the expectation with respect to $D$ (and noting that $E_{c \sim D_{c|x}} [\min_k c (k)] \leq \min_k E_{c \sim D_{c|x}} [c (k)]$, so the expected per-instance regret dominates $r_{csmc}$) yields the result \[ r_{csmc} (h_g) \leq 2 \sqrt[p] {\frac{|K|}{2}} \left( E_{(x, c) \sim D} \left[ \sqrt[p]{\frac{1}{|K|} \sum_{k \in K} |\delta (x, k)|^p } \right] \right) \leq 2 \sqrt[p] {\frac{|K|}{2} q_{L^p} (g)}, \] where the last inequality follows from Jensen's inequality. This formula can be shown correct in the limit $p \to 1$ directly (note KKT does not apply at $p = 1$ due to lack of continuous differentiability), where it takes the form \[ r_{csmc} (h_g) \leq |K| q_{L^1} (g). \] Since I hope to achieve small error, the $p$-th root is working against me, so the best choice with respect to this bound is $p = 1$. (Trying to go below $p = 1$ eliminates the use of Jensen's inequality, so the above derivation stops working.)
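As a sanity check, the bound is easy to probe numerically. The sketch below uses arbitrary synthetic choices (uniform costs, Gaussian estimation errors, $p = 2$) that are purely my own assumptions; it just compares the empirical per-instance CSMC regret of $h_g$ against $2 \sqrt[p]{(|K| / 2) \, q_{L^p} (g)}$.

import numpy as np

rng = np.random.default_rng(0)
n, num_classes, p = 100000, 5, 2.0

C = rng.uniform(0.0, 1.0, size=(n, num_classes))      # realized cost vectors c
delta = rng.normal(0.0, 0.1, size=(n, num_classes))   # estimation errors delta(x, k)
G = C + delta                                         # estimates g(x, k) = c(k) + delta(x, k)

chosen = G.argmin(axis=1)                             # decisions h_g(x)
regret = (C[np.arange(n), chosen] - C.min(axis=1)).mean()   # empirical per-instance regret
q = (np.abs(delta) ** p).mean()                             # empirical q_{L^p}(g)
bound = 2.0 * (num_classes / 2.0 * q) ** (1.0 / p)

print(f"regret={regret:.4f}  bound={bound:.4f}  holds={regret <= bound}")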
As observed by Tu and Lin, because $\delta (x, k^*) > 0$ and $\delta (x, k^{**}) < 0$ the above bounds also hold in the case of one-sided $L^p$ loss, defined as \[ L_{(1)}^p (\hat y, y) = \lfloor (2\ 1_{k = k^*} - 1) (\hat y - y) | \hat y - y |^{p-1} \rfloor, \] where $\lfloor x \rfloor = \max (x, 0)$ and the indicator records whether the class $k$ being estimated is the lowest cost class for the instance. Basically, underestimating the value of the lowest cost class for each instance is not penalized, and overestimating the value of the non-lowest cost classes for each instance is not penalized. In this case I also have \[ r_{csmc} (h_g) \leq 2 \sqrt[p] {\frac{|K|}{2} q_{L_{(1)}^p} (g)} \] for $p \geq 1$. Since one-sided $L^p$ loss is bounded above by $L^p$ loss, this bound is an improvement.
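For concreteness, here is a short Python helper computing the one-sided $L^p$ loss for a batch of instances; the dense-matrix representation and the names are my own assumptions, not anything from Tu and Lin. The final assertion checks the domination by the two-sided loss mentioned above.

import numpy as np

def one_sided_lp(G, C, p):
    # One-sided L^p loss per (instance, class): penalize overestimates of the
    # lowest cost class and underestimates of every other class; the rest is free.
    n = C.shape[0]
    is_best = np.zeros_like(C, dtype=bool)
    is_best[np.arange(n), C.argmin(axis=1)] = True
    signed = np.where(is_best, G - C, C - G)   # only the positive part is penalized
    return np.maximum(signed, 0.0) ** p

# The one-sided loss never exceeds the two-sided L^p loss.
rng = np.random.default_rng(0)
C = rng.uniform(size=(1000, 5))
G = C + rng.normal(scale=0.1, size=C.shape)
assert np.all(one_sided_lp(G, C, 2) <= np.abs(G - C) ** 2 + 1e-12)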
Regret Bound. On a hard problem, achieving small error might not be possible due to intrinsic variability in the costs: when $D_{c|x}$ is noisy, even the best regressor has nonzero error. Bounding the regret on CSMC via the regret on the underlying regression problem is better, since low regret can be achieved even on hard problems.
Looking back at the error bound derivation, one observation is that the adversary has too many degrees of freedom: the regressor is defined solely as a function of $x$, but the adversary was allowed to choose $\delta (x, k)$ differently for each realization of $c$. I will eliminate this problem when deriving the regret bound.
Consider a single instance feature $x$ with associated conditional per-instance cost-vector distribution $D_{c|x}$, and suppose our regressor has cost estimates which differ from a minimum-error regressor's estimate $c^* (x, k)$ by $\delta (x, k)$, \[ g (x, k) = c^* (x, k) + \delta (x, k). \] Imagine an adversary which attempts to create a certain amount of cost-sensitive multiclass classification regret on this instance while minimizing the amount of regression regret on this instance. This adversary is faced with the following problem: \[ \begin{aligned} & \min_{\delta}\; E_{c \sim D_{c|x}} \left[ \sum_{k \in K} L \bigl (c^* (x, k) + \delta (x, k), c (k) \bigr) - L \bigl (c^* (x, k), c (k) \bigr) \right] \\ \mathrm{s.t.} \quad & E_{c \sim D_{c|x}} \left[ c (h_g (x)) \right] - \min_k\; E_{c \sim D_{c|x}} \left[ c (k) \right] = r. \end{aligned} \] Again $r$ can only take one of $|K|$ values corresponding to the possible decisions made by $h_g$. Consider each such possible decision $k^{**}$; then the adversary has a family of choices of the following form: \[ \begin{aligned} &\min_{\delta}\; E_{c \sim D_{c|x}} \left[ \sum_{k \in K} L \bigl (c^* (x, k) + \delta (x, k), c (k) \bigr) - L \bigl (c^* (x, k), c (k) \bigr) \right] \\ \mathrm{s.t.} \; \forall k \neq k^{**}, \; & c^* (x, k^{**}) + \delta (x, k^{**}) \leq c^* (x, k) + \delta (x, k). \end{aligned} \] This is hard to deal with in general, but for the case $p = 2$ something magical happens. Considering the objective function (which is $|K|$ times the conditional regression regret at $x$) for a moment, \[ \begin{aligned} &\quad E_{c \sim D_{c|x}} \left[ \sum_{k \in K} \bigl (c^* (x, k) + \delta (x, k) - c (k) \bigr)^2 - \bigl (c^* (x, k) - c (k) \bigr)^2 \right] \\ &= \sum_{k \in K} 2 \delta (x, k) \left( c^*(x, k) - E_{c \sim D_{c|x}} \left[ c (k) \right ] \right) + \delta (x, k)^2 \\ &= \sum_{k \in K} \delta (x, k)^2, \end{aligned} \] where I have used a property of $L^2$ loss, namely that its minimizer is the conditional mean, $c^*(x, k) = E_{c \sim D_{c|x}} \left[ c (k) \right ]$. Now I have the same objective function for $L^2$ loss as in the error bound above. Analogous reasoning applies (the adversary again concentrates equal and opposite error on the two classes being transposed, and Jensen's inequality handles the expectation over $D_x$) and I end up with \[ r_{csmc} (h_g) \leq \sqrt{2 |K| \epsilon_{L^2} (g)}, \] which looks similar to the error bound for $L^2$, but the right hand side is in terms of regret. What about other losses, like $L^p$ loss for $p \neq 2$ or one-sided variants? Simple counter-examples indicate that zero regret on the regression problem for these losses does not imply zero regret on CSMC: for instance, with $L^1$ loss the optimal regressor predicts conditional medians, which can order the classes differently than the conditional means. Because the key condition $c^*(x, k) = E_{c \sim D_{c|x}} \left[ c (k) \right ]$ only holds for $L^2$, I expect additional terms to appear in the objective function for the adversary's problem, changing the result.
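The $L^2$ regret bound can also be probed numerically. In the sketch below the conditional mean costs are drawn uniformly and the deviations $\delta$ are Gaussian; these are illustrative assumptions only. Since $\epsilon_{L^2} (g)$ equals the mean of $\delta (x, k)^2$ by the identity above, the check needs only the conditional means and the deviations.

import numpy as np

rng = np.random.default_rng(1)
n, num_classes = 100000, 5

M = rng.uniform(0.0, 1.0, size=(n, num_classes))      # conditional means c*(x, k) = E[c(k) | x]
delta = rng.normal(0.0, 0.1, size=(n, num_classes))   # deviations from the minimum-error regressor
G = M + delta                                         # g(x, k) = c*(x, k) + delta(x, k)

chosen = G.argmin(axis=1)                             # decisions h_g(x)
regret = (M[np.arange(n), chosen] - M.min(axis=1)).mean()   # empirical r_csmc(h_g)
eps = (delta ** 2).mean()                                   # empirical epsilon_{L^2}(g)
bound = np.sqrt(2.0 * num_classes * eps)

print(f"regret={regret:.4f}  bound={bound:.4f}  holds={regret <= bound}")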
Finally, it should be noted that there are reductions of CSMC to binary classification that achieve regret bounds linear in the binary regret, e.g., all pairs and the filter tree.