In particular I'm interested in subsampling data associated with the more prevalent label in a binary classification problem. This is different than the typical active learning scenario: I'm assuming all the data labels are available and can be inspected at no cost, but for some reason you don't want to process it all. Maybe the data is collected by a distributed sensor network and you want to minimize the costs of communication by dropping data in the periphery, or maybe you have a big data set and you want to generate a sample that fits on your laptop so you can experiment while traveling.
I've had good luck empirically with subsampling negative examples when faced with classification problems involving very imbalanced data in favor of negatives. This practice is very well justified when the objective function is AUC, since there is a deviation bound for empirical AUC from actual AUC which is minimized for a given budget of examples when the data set has an equal number of positives and negatives. However I'm not aware of a similarly clean result for classification, even though empirically I've had good luck with classification losses and biased subsampling. Therefore I came up with the following motivating example.
Context-Free Classification
Let me begin by describing a game of context-free classification.- Adversary chooses $q$.
- Referee generates a training set $S$ by repeating the following IID until $n$ samples are accepted:
- Choose $y = \mathrm{heads}$ with probability $q$, otherwise $y = \mathrm{tails}$.
- Player inspects training set $S$, minimizes empirical risk over $\{ \mathrm{heads}, \mathrm{tails} \}$ breaking ties via equal randomization, and reports $\hat y (S)$.
- Referee collects expected regret per example from Player, defined as \[
|2 q - 1| 1_{\hat y (S) \neq y^* (q)},
\] where \[
y^* (q) = \begin{cases}
\mathrm{heads} & \mathrm{if}\;\; q \geq \frac{1}{2} \\
\mathrm{tails} & \mathrm{otherwise}
\end{cases}.
\] In other words Player pays based upon how many more prediction mistakes Player makes on average than the best constant.
\mathrm{Pr} (\hat y \neq y^* | q; n) = \begin{cases}
\sum_{c = 0}^{\lfloor n / 2 \rfloor} {n \choose c} q^c (1 - q)^{n - c} & \mathrm{if}\;\; q \geq 1/2 \\
\sum_{c = \lfloor n / 2 \rfloor + 1}^n {n \choose c} q^c (1 - q)^{n - c} & \mathrm{if}\;\; q < 1/2
\end{cases}.
\] When $n$ is even there is an extra term corresponding to the randomization given a tie. The probability of Player making a mistake is maximized as $q \to 1/2$, however the expected regret is proportional to $|2 q - 1|$; therefore optimal play for the Adversary is to choose an intermediate $q$. For $n = 21$ it looks like the following,
As $n$ increases, the Adversary's best plays move towards $q = 1/2$, and the value of the game to the Adversary decreases.
Importance-Weighted Context Free Classification
Now consider what happens when the Player decides to subsample heads in the training data and then minimize importance-weighted empirical risk. Just to be clear, the game is as follows:- Player chooses a heads subsampling rate $w$.
- Adversary chooses $q$.
- Referee generates a training set $S$ by repeating the following IID until $n$ samples are accepted:
- Choose $y = \mathrm{heads}$ with probability $q$, otherwise $y = \mathrm{tails}$.
- If $y = \mathrm{heads}$, reject sample with probability $w$.
- Player inspects training set $S$, minimizes importance-weighted empirical risk over $\{ \mathrm{heads}, \mathrm{tails} \}$ breaking ties via equal randomization, and reports $\hat y (S)$.
- Referee collects expected regret per example from Player (as above).
What Theory Suggests
Subsampling has two effects. The first is to change the probability of heads in the training set to \[\tilde q (q, w) = \frac{q w}{1 - q + q w}.
\] The second is to change the number of counts required to guess heads from $n / 2$ to $n w / (1 + w)$.
When $q > 1/2$ and $0 < w \leq 1$, $\frac{w}{1 + w} < \tilde q (q, w)$, therefore the probability of being incorrect is bounded above by the Azuma-Hoeffding inequality on the martingale $\sum (Y_i - \tilde q (q, w))$, \[
\begin{aligned}
\mathrm{Pr} (\hat y \neq y^* | q; w; n) &= \mathrm{Pr} \left(Y < \frac{n w}{1 + w} \right) + \frac{1}{2} \mathrm{Pr} \left( Y = \frac{n w}{1 + w} \right) \\
&\leq \mathrm{Pr} \left( Y \leq \frac{n w}{1 + w} \right) \\
&= \mathrm{Pr} \left( Y - n \tilde q (q, w) \leq n \left( \frac{w}{1 + w} - \tilde q (q, w) \right) \right) \\
%&\leq \exp (\frac{- n^2 \left( \frac{w}{1 + w} - \tilde q (q, w) \right)^2}{ 2 n \max\{ \tilde q (q, w)^2, (1 - \tilde q (q, w))^2 \} } ) \\
&\leq \exp \left( -\frac{n}{2 \max\{ \tilde q (q, w)^2, (1 - \tilde q (q, w))^2 \}} \left(\tilde q (q, w) - \frac{w}{1 + w} \right)^2 \right) \\
&\doteq \varphi (q, n, w).
\end{aligned}
\] Here's what's interesting about this bound: it is minimized when $\tilde q (q, w) = \frac{1}{2}$, because \[
\begin{aligned}
\frac{d}{dw} \log \varphi (q, n, w) &= \begin{cases}
-\frac{n w}{4 (1 - 3 q + 2 q^2)^2 (1 + w)^3} < 0 & \mathrm{if}\;\; \tilde q (q, w) < \frac{1}{2} \\ \frac{n}{4 (1 - 2 q)^2 q^2 (1 + w)^3} > 0 & \mathrm{if}\;\; \tilde q (q, w) > \frac{1}{2}
\end{cases}.
\end{aligned}
\] Therefore the bound suggests subsampling for balanced label expectation. This is in retrospect not surprising given the role of entropy in large deviations theory.
Note if the Adversary chooses $q < 1/2$ we can relabel the coins and repeat the argument (i.e., subsampling the other coin). However, since in the game the subsampling rate is chosen before the adversary chooses $q$, Player should choose $w = 1$ to be minimax optimal, unless the Adversary is constrained to have $q > 1/2$. This game is too simple to constrain the Adversary in this manner without making Player strategy trivial, but with a richer context this constraint can make sense.
What Actually Happens
Unfortunately the bound is somewhat loose, and in the context-free game suggests more aggressive subsampling than is warranted by exact computation. Here is the expected regret as a function of $w$ for $n = 21$ when Adversary plays $q = 3/4$,The subsampling rate is continuous but the realizations are discrete, so there are discrete jumps in the regret as the ``breakeven'' number of clicks passes through an integer value (red = $\epsilon$ above = ``breaking ties in favor of tails''; blue = $\epsilon$ below = ``breaking ties in favor of heads''). The dashed line is the regret at $w = 1$, and the green line is the bound. There does appear to be an optimal $w < 1$, but it might be due to the discrete nature of the realizations rather than something fundamental. Now let's look at the same graph with $n = 105$.
Now it is clear that there is an optimal $w < 1$, and it's also clear that the bound is suggesting a $w$ which is too aggressive.
Back to the drawing board.
No comments:
Post a Comment