- Model selection. Out-of-bag estimates function analogously to cross-validated estimates, so monitoring them instead of progressive validation loss is the better choice when looping over a data set multiple times (see the sketch after this list).
- Confidence intervals. When developing a standalone decision system, a point prediction is usually sufficient; but when developing a subsystem to be integrated into a larger decision system, interval predictions provide a richer summary to downstream consumers.
- Better results. Ensembling can improve generalization by reducing variance. I haven't seen any real lift on the data sets I've tried so far, however, and this is understood and expected: linear models have low variance and so don't benefit much from bagging. To benefit from bagging you need a learner whose final output is more sensitive to differences in the training data, e.g., neural networks, trees, or linear models with automated data-driven feature selection.
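The first two points are easiest to see in code, so here is a minimal Python sketch of bootstrap bagging with out-of-bag estimation and interval prediction. This is not vee-dub's internal reduction: `make_learner` is a stand-in for any scikit-learn-style base learner with fit/predict, and `X`, `y` are assumed to be numpy arrays.

```python
import numpy as np

def bag(X, y, make_learner, n_bags=20, seed=0):
    # Train each bag member on a bootstrap resample; examples not drawn
    # for a given member are "out-of-bag" for that member.
    rng = np.random.default_rng(seed)
    n = len(y)
    learners, oob_masks = [], []
    for _ in range(n_bags):
        idx = rng.integers(0, n, size=n)
        oob = np.ones(n, dtype=bool)
        oob[idx] = False
        learners.append(make_learner().fit(X[idx], y[idx]))
        oob_masks.append(oob)
    return learners, oob_masks

def oob_estimate(X, y, learners, oob_masks, loss):
    # Score each example only with the members that never trained on it;
    # this plays the same role as a cross-validated estimate.
    per_example = []
    for i in range(len(y)):
        preds = [m.predict(X[i:i+1])[0]
                 for m, oob in zip(learners, oob_masks) if oob[i]]
        if preds:
            per_example.append(loss(y[i], np.mean(preds)))
    per_example = np.array(per_example)
    return per_example.mean(), per_example.std() / np.sqrt(len(per_example))

def interval_prediction(learners, x, lo=10, hi=90):
    # The spread across bag members gives a rough interval prediction
    # for downstream consumers.
    preds = np.array([m.predict(x.reshape(1, -1))[0] for m in learners])
    return np.percentile(preds, [lo, hi])
```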
We can model this as an importance-weighted classification problem, transforming pairs $(x_i, r_i)$ into triples $(x_i, y_i, c_i)$ according to \[
(x_i, r_i) \to (x_i, 1_{r_i > 0.68}, |r_i - 0.68|).
\] For the evaluation set we can decode the decision of whether or not to mail directly as the model's predicted class. In this case the constant predictor corresponds to a policy that mails no one or mails everyone; as anybody who has ever had a mailbox will attest, it is more profitable to mail everyone than to mail no one, so ``mail everybody'' is the baseline policy to beat. (Parenthetically: maybe instead of machine learning trade-secret spam filters, we should just machine learn and openly distribute a model of who actually wants to buy Viagra and save everybody else from being bothered.)
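Here is a minimal Python sketch of the reduction and its decoding. The \$0.68 is the per-mailing cost implied by the transformation above, and `model.predict` is an assumed interface that returns the predicted class for one example.

```python
COST = 0.68  # per-piece mailing cost used in the transformation above

def to_importance_weighted(pairs):
    # (x_i, r_i) -> (x_i, y_i, c_i): the label says whether mailing was
    # profitable, the importance weight says by how many dollars.
    return [(x, int(r > COST), abs(r - COST)) for x, r in pairs]

def policy_profit(model, eval_pairs):
    # Decode the mail/don't-mail decision directly from the predicted class
    # and tally realized profit net of mailing cost.
    return sum(r - COST for x, r in eval_pairs if model.predict(x) == 1)

def mail_everybody_profit(eval_pairs):
    # The baseline policy: mail everyone.
    return sum(r - COST for _, r in eval_pairs)
```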
I put together a naive ``kitchen-sink'' model by taking every feature in the training set, encoding dates numerically as months since January 1900, encoding numerics using splines with empirical deciles as knot-points, and encoding all other variables categorically. I also put together a model with 10 features gleaned from the write-ups of the winners, encoded using the same strategy as the kitchen sink (a sketch of the encoding follows the results). Here are the results: \[
\begin{array}{c|c|c|c}
\mbox{Model } &\mbox{ Training Set Profit } &\mbox{ Training Set Profit (out-of-bag) } &\mbox{ Evaluation Set Profit } \\ \hline
\mbox{Mail everybody } & \$10788.54 & \$10704 \pm \$529 & \$10560.08 \\
\mbox{10 features } & \$17596.35 & \$14617 \pm \$434 & \$14896.07 \\
\mbox{Kitchen sink } & \$24353.80 & \$493 \pm \$80 & \$154.92 \\
\end{array}
\] So the kitchen sink severely overfits the training set, but out-of-bag estimates detect this. Woot.
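For concreteness, here is a rough Python sketch of the kitchen-sink encoding. The hinge-function spline basis is my assumption about what ``splines with empirical deciles as knot-points'' looks like; these helpers are illustrative rather than the actual preprocessing code.

```python
import numpy as np

def encode_date(d):
    # d is a datetime.date: months since January 1900.
    return (d.year - 1900) * 12 + (d.month - 1)

def spline_features(column):
    # Empirical deciles as knot points; the raw value plus one hinge
    # feature per knot (an assumed form of the spline encoding).
    column = np.asarray(column, dtype=float)
    knots = np.percentile(column, np.arange(10, 100, 10))
    return np.column_stack([column] + [np.maximum(0.0, column - k) for k in knots])

def one_hot(column):
    # Categorical encoding: one indicator per distinct value.
    values = sorted(set(column))
    return np.array([[1.0 if v == u else 0.0 for u in values] for v in column])
```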
Unfortunately the bagging reduction is not really faster than running multiple vee-dubs. It is faster for simple models where I/O overhead is substantial, but for more complicated models amortizing the I/O is only a modest benefit. It is more convenient to have everything orchestrated internally, and the reduction should compose with the distributed learning capabilities of vee-dub, but in the laptop zone it's not a huge win.
Of course once you have bagging, the natural next step is to try some unstable learning methods. Stay tuned!