I got some good questions about subsampling zero-reward examples whose answers I thought would make a good blog post.
Why Do It?
I realize I'm bucking a trend here; after all, ``there's no data like more data.'' If you can comfortably handle all of your data, then by all means, use it all. The NIPS workshop on Learning on Cores, Clusters, and Clouds was all about scaling up machine learning. Still, in practice I think there are many conditions in which one cannot handle all the data even with such parallelism, and in those cases biased subsampling is better than uniform subsampling if you know the data is very imbalanced. Here are some scenarios:
- In mobile applications, one might have to choose between processing the data locally (using precious power) or transmitting the data for central processing (using precious bandwidth). Subsampling can make either choice less costly.
- In online learning applications (not an online learning algorithm applied to an offline data set, but actually applied online) one needs a strategy for flow control when the input data stream exceeds the learner's throughput.
- In online learning with a feedback loop (e.g., advertising), active learning is my guess of how the most sophisticated systems of the future will control the return flow. However, biased subsampling is easy to implement right now :)
- When doing experimentation, the human machine-learning expert does not want a lot of learning latency when trying things out, even if learning latency for the final product is tolerable. Biased subsampling is better than uniform subsampling at maintaining tight bounds between empirical and actual risk for a fixed budget of examples (maybe: see below). My advisor in grad school told me that HMMs always kicked the ass of neural networks in speech recognition, not because HMMs were inherently better, but because they could be trained faster, so in practice one could try lots more things. (Oops, now I sound ancient.)
Subsampling gains tend to compose with parallelization gains, i.e., if you get two orders of magnitude from parallelization and two orders of magnitude from subsampling, then together you get four orders of magnitude.
Does It Work?
I have some empirical anecdotes.
At eHarmony we ended up doing the following sequence of experiments, which in hindsight appear rational and scientific. What actually happened is that each stage here represents another instance of model building, and being impatient people, we kept wondering how we could do things faster than last time. We were scared of screwing something up, however (code even more than math), so we double-checked at each stage against a control.
- [Stage 0]: How we started: a classification task (actually, density estimation on a binary variable).
- non-subsampled data for training, calibration, and validation.
- [Stage 1]: Baby steps on a classification problem.
- subsampled data for training vs. non-subsampled data for training.
- non-subsampled data for calibration and validation.
- noted that out-of-sample generalization (validation) was not impacted (statistically speaking) by training on subsampled data.
- [Stage 2]: Gaining confidence on a classification problem.
- subsampled data for training.
- subsampled data for calibration vs. non-subsampled data for calibration.
- non-subsampled data for validation.
- noted that out-of-sample generalization (validation) was not impacted (statistically speaking) by calibrating on subsampled data.
- [Stage 3]: Wanting to go as fast as possible on a classification problem.
- subsampled data for training and calibration.
- subsampled data for validation vs. non-subsampled data for validation.
- noted that both validation techniques gave statistically identical estimates of generalization error.
- [Stage 4]: Wanting to go as fast as possible on a regression problem.
- did a minor rethink of all of the subsample machinery so that it applied to regression and not just classification.
- felt our wheaties: just tried subsampled data everywhere like with classification.
- liked the results, declared victory.
The net result is that nowadays we work exclusively with subsampled data at all stages of model building.
One thing I never tried, unfortunately, is comparing uniform to biased subsampling, i.e., fixing the number of total examples. All of the above experiments compare no subsampling to biased subsampling, i.e., conserving the number of positive-reward examples and experimenting with using fewer zero-reward examples. Furthermore, all of the above experiments asked the question ``are the results just as good with subsampling?'' In contrast, a comparison of uniform to biased subsampling with a fixed number of total examples could ask the question ``are the subsampled results better?''
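For concreteness, here is a minimal sketch of how the two data sets for such a fixed-budget comparison could be constructed. It is not something we ran at eHarmony: the synthetic data, the 1% positive rate, and the function names are all assumptions made for illustration.

```python
import random


def biased_subsample(examples, keep_zero_prob, rng):
    """Keep every non-zero-reward example; keep each zero-reward example
    independently with probability keep_zero_prob."""
    return [(x, y) for (x, y) in examples
            if y != 0 or rng.random() < keep_zero_prob]


def uniform_subsample(examples, budget, rng):
    """Keep `budget` examples chosen uniformly at random, ignoring the reward."""
    return rng.sample(examples, budget)


rng = random.Random(0)
# Synthetic, highly imbalanced data: roughly 1% non-zero rewards (illustrative).
data = [(i, 1 if rng.random() < 0.01 else 0) for i in range(100_000)]

# Biased subsample first, then a uniform subsample with the same total size.
biased = biased_subsample(data, keep_zero_prob=0.01 / 0.99, rng=rng)
uniform = uniform_subsample(data, budget=len(biased), rng=rng)

n_pos = sum(y for _, y in data)
print("total positives:", n_pos)
print("biased  size/positives:", len(biased), sum(y for _, y in biased))
print("uniform size/positives:", len(uniform), sum(y for _, y in uniform))
```

Both subsamples have the same number of examples, but only the biased one conserves all of the positives. Training on each (with importance weights for the biased one) and validating on held-out data would answer the ``are the subsampled results better?'' question.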
Should It Work?
Generally I think about having a fixed budget of examples and then optimizing a deviation bound between empirical and actual risk.
I discussed in a
previous post that for AUC loss, the deviation bound for empirical AUC from actual AUC is minimized for a given budget of examples when the data set has an equal number of positives and negatives. Subsampling for AUC loss problems therefore is very well justified.
For more general losses, e.g. corresponding to regression or classification, in a
previous post I discussed the bound of
Cortes et al. specialized to the case of subsampling a highly biased set,
R (h) \leq \widehat R_w (h) + \frac{2 (\log |H| + \log \frac{1}{\delta})}{3 m} \frac{p_0}{\beta} + \sqrt{\frac{2 (\log |H| + \log \frac{1}{\delta})}{m} \left(1 - \frac{(\beta - p_0)^2}{\beta (1 - \beta)} \right)}.
Here
p_0 is the fraction of zero-reward examples in the original distribution and
\beta is the fraction of zero-reward examples in the subsampled distribution. Minimizing this bound with respect to
\beta for small
m and
p_0 \to 1 yields
\beta^* = \frac{4 \Psi}{8 \Psi - 9 m} + O (p_0 - 1),
where
\Psi = 2 \left( \log |H| + \log \frac{1}{\delta} \right).
So for
m \ll \Psi this suggests subsampling to roughly equal proportions is the best choice. However
m \ll \Psi is uninteresting since it implies the bound is loose. For large
m the bound is minimized via
\beta^* = p_0 + O \left(\frac{1}{\sqrt{m}} \right),
suggesting that no subsampling (or uniform subsampling) is the best choice. Hey, that's not the result I wanted ... I need a better bound :)
Perhaps the right answer is a schedule where initially zero-reward examples are aggressively subsampled, and then as examples flow in subsampling becomes less aggressive, until at the end the original distribution is being used (with importance weighting in use the entire time, and the importance weights approaching unity as subsampling diminishes).
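A rough sketch of what such a schedule might look like follows. The geometric interpolation, the `horizon` parameter, and the function names are assumptions of mine, not something I have tested. The weights are simply the ratio of original to subsampled label probabilities when zero-reward examples are kept with probability `l`, which assumes the zero-reward probability is known.

```python
def keep_prob_schedule(t, horizon, l_start):
    """Keep probability for zero-reward examples at step t.

    Interpolates geometrically from l_start (aggressive) at t = 0 to 1.0
    (no subsampling) at t >= horizon. The functional form is an assumption.
    """
    frac = min(max(t / horizon, 0.0), 1.0)
    return l_start ** (1.0 - frac)


def importance_weights(p0, l):
    """Weights P/Q when zero-reward examples are kept with probability l."""
    w_zero = p0 + (1.0 - p0) / l
    w_nonzero = p0 * l + (1.0 - p0)
    return w_zero, w_nonzero


p0 = 0.99                    # assumed zero-reward probability
l_start = (1.0 - p0) / p0    # start at roughly equal proportions
for t in (0, 250_000, 500_000, 1_000_000):
    l = keep_prob_schedule(t, horizon=1_000_000, l_start=l_start)
    print(t, round(l, 4), [round(w, 4) for w in importance_weights(p0, l)])
```

At the start of the schedule the weights are the equal-proportion values, and both weights reach 1 once the keep probability reaches 1.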
Overall the theoretical case for subsampling for regression or classification problems is currently less compelling than the theoretical case for subsampling AUC loss problems. What can I say? I still do it all the time and I've been happy with the results. YMMV.
How Sensitive is the Recipe to the Guess?
In the
previous post I gave a simple recipe based upon a guess of the true zero-reward probability
\hat p_0. This guess determines the zero-reward subsampling rate
l = (1 - \hat p_0) / \hat p_0, as well as the importance weights
w (x, 0) = 2 \hat p_0 and
w (x, y \neq 0) = 2 (1 - \hat p_0). The guess will be off a bit, however, so do these values still work?
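As a reminder of what the recipe amounts to, here is a minimal sketch in Python. The function names and the synthetic stream are mine, purely for illustration; the formulas are the ones stated above, and the guess is assumed to be at least 1/2 so that the sampling rate is a valid probability.

```python
import random


def recipe(p0_hat):
    """Subsampling rate and importance weights from a guess p0_hat >= 1/2."""
    l = (1.0 - p0_hat) / p0_hat          # keep probability for zero rewards
    w_zero = 2.0 * p0_hat                # weight for kept zero-reward examples
    w_nonzero = 2.0 * (1.0 - p0_hat)     # weight for non-zero-reward examples
    return l, w_zero, w_nonzero


def subsample(stream, p0_hat, rng):
    """Yield (x, y, weight), keeping all non-zero rewards and a fraction l
    of the zero rewards."""
    l, w_zero, w_nonzero = recipe(p0_hat)
    for x, y in stream:
        if y != 0:
            yield x, y, w_nonzero
        elif rng.random() < l:
            yield x, y, w_zero


rng = random.Random(0)
# Synthetic stream with a true zero-reward probability of 0.99 (illustrative).
stream = [(i, 0 if rng.random() < 0.99 else 1) for i in range(200_000)]
kept = list(subsample(stream, p0_hat=0.99, rng=rng))
zero_fraction = sum(1 for _, y, _ in kept if y == 0) / len(kept)
print(len(kept), round(zero_fraction, 3))
```

When the guess matches the true rate, about half of the kept examples are zero-reward.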
Since the sampling factor (
l) is a free parameter, there is no way to get it ``wrong'', but the importance weights depend upon
p_0 and
l and so could be incorrect. If the true zero-reward probability is
p_0 then
\begin{aligned}
w (x, 0) &= 2 \hat p_0 + \frac{1 - 2 \hat p_0}{1 - \hat p_0} (p_0 - \hat p_0), \\
w (x, y \neq 0) &= 2 (1 - \hat p_0) + \frac{1 - 2 \hat p_0}{\hat p_0} (p_0 - \hat p_0).
\end{aligned}
The latter line indicates robustness but the former line is a concern, because as
\hat p_0 \to 1 the zero-reward importance weight is increasingly sensitive to differences between
\hat p_0 and
p_0. Essentially what is happening is that the correct importance weight is 1 if
p_0 = 1, but in that nonsensical limit every zero-reward example is rejected and no data is observed. Stepping back from that extreme, as
p_0 \to 1 slightly underestimating the true zero-reward rate will lead to more than 1/2 of the subsampled examples being zero-reward implying
w (x, 0) is too large, and slightly overestimating the true zero-reward rate will lead to less than 1/2 of the subsampled examples being zero-reward implying
w (x, 0) is too small.
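The following sketch checks this numerically. It computes the exact weights as the ratio of original to subsampled label probabilities for a fixed sampling factor, and compares them with the two expressions above and with the recipe's estimates. The specific values (a guess of 0.98 against a true rate of 0.99) and the function names are illustrative assumptions.

```python
def exact_weights(p0, p0_hat):
    """Exact importance weights when the rate l comes from the guess p0_hat
    but the true zero-reward probability is p0."""
    l = (1.0 - p0_hat) / p0_hat
    q_zero = p0 * l / (p0 * l + (1.0 - p0))   # zero fraction after subsampling
    return p0 / q_zero, (1.0 - p0) / (1.0 - q_zero), q_zero


def formula_weights(p0, p0_hat):
    """The two expressions from the text, as functions of p0 - p0_hat."""
    eps = p0 - p0_hat
    w_zero = 2.0 * p0_hat + (1.0 - 2.0 * p0_hat) / (1.0 - p0_hat) * eps
    w_nonzero = 2.0 * (1.0 - p0_hat) + (1.0 - 2.0 * p0_hat) / p0_hat * eps
    return w_zero, w_nonzero


p0, p0_hat = 0.99, 0.98                  # slight underestimate of the true rate
w_zero, w_nonzero, q_zero = exact_weights(p0, p0_hat)
f_zero, f_nonzero = formula_weights(p0, p0_hat)
print("zero fraction after subsampling:", round(q_zero, 4))
print("w(x, 0): exact", round(w_zero, 4), "recipe estimate", 2.0 * p0_hat)
print("w(x, y != 0): exact", round(w_nonzero, 4),
      "recipe estimate", round(2.0 * (1.0 - p0_hat), 4))
```

With an underestimate, more than half of the subsampled examples are zero-reward and the recipe's zero-reward weight overshoots the exact one, while the non-zero-reward weight barely moves.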
However the entire situation is mitigated by the fact that the correct
w (x, 0) is lower bounded by 1 and the estimate is upper bounded by 2. Thus when using an SGD optimization approach, this is equivalent to tweaking the learning rate by at most a factor of 2 (since the ratio
w (x, y \neq 0) / w (x, 0) = l is correct). This contrasts sharply with using (incorrect!) weights
\tilde w (x, 0) = l^{-1},
\tilde w (x, 1) = 1, which when coupled with SGD is equivalent to scaling the learning rate by a diverging factor.
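To make the contrast concrete, this small sketch prints the zero-reward weight under each scheme for a few guesses. The function names are hypothetical, and the ratio of the two weights is the factor by which the unnormalized scheme rescales the effective SGD learning rate relative to the recipe.

```python
def recipe_zero_weight(p0_hat):
    """Recipe weight for zero-reward examples; never exceeds 2."""
    return 2.0 * p0_hat


def unnormalized_zero_weight(p0_hat):
    """The (incorrect) weight 1 / l for zero-reward examples."""
    l = (1.0 - p0_hat) / p0_hat
    return 1.0 / l


for p0_hat in (0.9, 0.99, 0.999):
    factor = unnormalized_zero_weight(p0_hat) / recipe_zero_weight(p0_hat)
    print(p0_hat, recipe_zero_weight(p0_hat), round(factor, 2))
```

The factor equals 1 / (2 (1 - \hat p_0)), so it grows without bound as the guess approaches 1, whereas the recipe weight stays at or below 2.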
So overall I feel very good about using the recipe for speeding up online learning when using SGD as the optimization strategy. On the other hand, if a non-SGD based online algorithm is being applied to an offline pile of data, it's probably better to start with the recipe weights as unnormalized weights and then normalize the weights as described in
Cortes et al. section 6. If a non-SGD based online algorithm is being used online, I'm not sure exactly what to do, but perhaps an online scheme analogous to normalizing the weights would work, e.g., normalizing over recent (subsampled) history.
What about Informed Subsampling?
In the recipe I talked about subsampling based entirely on the reward (
y) oblivious to the context (
x). What about also looking at
x? I intuit this is a good idea, especially if there are obvious segmentations of the input space that greatly influence the reward distribution. At eHarmony we have such a segmentation in terms of customer type (new user, paying customer, formerly paying customer, etc.). There are only a few of these customer types, each of them has lots of support in the historical data, and they have very different base rates for just about everything we like to model. So in that case we have a handful of guesses
\hat p_0 (x) based upon the customer type, with the importance weight and sampling rate given by the recipe values in each region of constant
\hat p_0 (x). When I've done this I end up building completely different models for each customer type, but that's because I'm using Vowpal Wabbit and I want to implicitly interact customer type with everything else. I believe this approach should still work even if the data is all fed to one single learner, but, full disclosure, I've never tried that.
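A minimal sketch of the per-segment version follows, assuming a small lookup table of guesses keyed by customer type. The segment names, the guessed rates, and the synthetic stream are hypothetical and are not eHarmony's actual values. Each example gets the recipe's sampling rate and weight for its own segment.

```python
import random

# Hypothetical per-segment guesses of the zero-reward probability.
SEGMENT_P0_HAT = {"new_user": 0.99, "paying": 0.9, "formerly_paying": 0.995}


def recipe(p0_hat):
    """Sampling rate and importance weights for one region of constant p0_hat."""
    return (1.0 - p0_hat) / p0_hat, 2.0 * p0_hat, 2.0 * (1.0 - p0_hat)


def informed_subsample(stream, p0_hat_by_segment, rng):
    """Yield (segment, x, y, weight) using the recipe values of each segment."""
    for segment, x, y in stream:
        l, w_zero, w_nonzero = recipe(p0_hat_by_segment[segment])
        if y != 0:
            yield segment, x, y, w_nonzero
        elif rng.random() < l:
            yield segment, x, y, w_zero


rng = random.Random(0)
# Synthetic stream whose true per-segment rates match the guesses (illustrative).
stream = [(seg, i, 0 if rng.random() < p0 else 1)
          for seg, p0 in SEGMENT_P0_HAT.items() for i in range(100_000)]

kept = list(informed_subsample(stream, SEGMENT_P0_HAT, rng))
zero_fraction = {}
for seg in SEGMENT_P0_HAT:
    ys = [y for s, _, y, _ in kept if s == seg]
    zero_fraction[seg] = sum(1 for y in ys if y == 0) / len(ys)
print({seg: round(f, 3) for seg, f in zero_fraction.items()})
```

The kept examples could then go to separate per-segment models, or to a single learner with the segment as a feature, though the latter is the variant I have not tried.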
In the limit of knowing
p_0 (x) = E_P [1_{y = 0} | x], subsampling would produce a learning distribution
Q such that at each point zero and non-zero reward labels are equiprobable. The
Cortes et al. bound doesn't indicate that this is advantageous (the
d_2 (P || Q) term presumably would increase and the other term is not improved). However it also doesn't indicate that biased subsampling based only upon
y is advantageous either, except for small
m. So once again I've seen this work empirically, but I don't have a good theoretical explanation for it, therefore YMMV.