Some notes:
- No frills. One hidden layer with a tanh activation; the only configuration is the number of hidden units. (A sketch of the implied functional form follows these notes.)
- Composition with other reductions should work but is not extensively tested. In isolation the reduction does binary classification and regression.
- It's not a complete dog but it's not as fast as possible either: more optimizations are possible. Nonetheless, if your problem is high-dimensional and sparse, the basic infrastructure of vee-dub (parsing, hashing, etc.) should produce a significant win over neural network implementations that assume dense feature vectors.
- It should work with any available loss function and all available learning algorithm variants.
- Quadratics and ngrams work and are applied at the input-to-hidden layer.
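To make the architecture in the first note concrete, here is a sketch of the implied functional form (my notation, not taken from the implementation): with k hidden units, input features x, input-to-hidden weights w_j and hidden biases b_j, and output weights v_j with output bias c, the raw prediction fed to the loss is \[
\hat{y} (x) = c + \sum_{j=1}^{k} v_j \tanh \left( b_j + w_j^\top x \right).
\] Quadratic and n-gram expansions (the last note) enlarge x before the input-to-hidden weights are applied; the hidden-to-output layer stays linear. Treat this as a sketch of the architecture described above rather than a transcription of the code.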
Those who enjoy solving toy problems will be pleased to know that vee-dub can now solve 3-parity with two hidden units.
pmineiro@pmineirovb-931% ./make-parity 3
-1 |f 1:1 2:-1 3:-1
-1 |f 1:-1 2:1 3:-1
1 |f 1:1 2:1 3:-1
-1 |f 1:-1 2:-1 3:1
1 |f 1:1 2:-1 3:1
1 |f 1:-1 2:1 3:1
-1 |f 1:1 2:1 3:1
1 |f 1:-1 2:-1 3:-1
pmineiro@pmineirovb-932% ./make-parity 3 | ../vowpalwabbit/vw --nn 2 --passes 2000 -k -c --cache_file cache -f model -l 10 --invariant
final_regressor = model
Num weight bits = 18
learning rate = 10
initial_t = 1
power_t = 0.5
decay_learning_rate = 1
randomly initializing neural network output weights and hidden bias
creating cache_file = cache
Warning: you tried to make two write caches.  Only the first one will be made.
Reading from
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
1.550870   1.550870            3         3.0   1.0000  -1.0000        4
1.919601   2.288332            6         6.0   1.0000   0.7762        4
2.011137   2.120980           11        11.0   1.0000  -1.0000        4
2.154878   2.298620           22        22.0   1.0000   0.3713        4
2.354256   2.553635           44        44.0  -1.0000   1.0000        4
2.286332   2.216827           87        87.0  -1.0000   1.0000        4
2.222494   2.158657          174       174.0   1.0000   0.8935        4
1.716414   1.210335          348       348.0  -1.0000  -0.9598        4
1.368982   1.021549          696       696.0   1.0000   0.9744        4
1.151838   0.934694         1392      1392.0   1.0000   1.0000        4
0.976327   0.800816         2784      2784.0   1.0000   1.0000        4
0.756642   0.536958         5568      5568.0   1.0000   1.0000        4
0.378355   0.000000        11135     11135.0  -1.0000  -1.0000        4

finished run
number of examples = 16000
weighted example sum = 1.6e+04
weighted label sum = 0
average loss = 0.2633
best constant = -6.25e-05
total feature number = 64000
pmineiro@pmineirovb-933% ./make-parity 3 | ../vowpalwabbit/vw -i model -t -p /dev/stdout --quiet
-1.000000
-1.000000
1.000000
-1.000000
1.000000
1.000000
-1.000000
1.000000

With -q ff, I can do 4-parity with 2 hidden units. Woot.
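The make-parity script itself isn't shown above. Purely for reproduction purposes, here is a rough sketch of a generator that emits the same output for n = 3; the label sign convention is inferred from that output, so treat it as an approximation rather than the actual script.

import sys

def make_parity(n):
    # Enumerate the 2^n inputs; feature j is +1 or -1 according to bit j of i.
    for i in range(1, 2 ** n + 1):
        bits = [1 if (i >> j) & 1 else -1 for j in range(n)]
        # Label convention inferred from the 3-parity output above:
        # +1 when the number of +1 features is even, -1 otherwise.
        label = 1 if sum(b > 0 for b in bits) % 2 == 0 else -1
        feats = " ".join("%d:%d" % (j + 1, b) for j, b in enumerate(bits))
        print("%d |f %s" % (label, feats))

if __name__ == "__main__":
    make_parity(int(sys.argv[1]))

Piping its output into vw as above should reproduce the 3-parity run, and generating 4-parity data with --nn 2 -q ff corresponds to the 4-parity claim.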
Going more realistic, I tried mnist. The task is multiclass classification of handwritten digits, so I used the one-against-all reduction with the neural network reduction. The raw pixel values are decent features because the digits are size-normalized and centered. An example line looks like
pmineiro@pmineirovb-658% zcat test.data.gz | head -1
|p 202:0.328125 203:0.72265625 204:0.62109375 205:0.58984375 206:0.234375 207:0.140625 230:0.8671875 231:0.9921875 232:0.9921875 233:0.9921875 234:0.9921875 235:0.94140625 236:0.7734375 237:0.7734375 238:0.7734375 239:0.7734375 240:0.7734375 241:0.7734375 242:0.7734375 243:0.7734375 244:0.6640625 245:0.203125 258:0.26171875 259:0.4453125 260:0.28125 261:0.4453125 262:0.63671875 263:0.88671875 264:0.9921875 265:0.87890625 266:0.9921875 267:0.9921875 268:0.9921875 269:0.9765625 270:0.89453125 271:0.9921875 272:0.9921875 273:0.546875 291:0.06640625 292:0.2578125 293:0.0546875 294:0.26171875 295:0.26171875 296:0.26171875 297:0.23046875 298:0.08203125 299:0.921875 300:0.9921875 301:0.4140625 326:0.32421875 327:0.98828125 328:0.81640625 329:0.0703125 353:0.0859375 354:0.91015625 355:0.99609375 356:0.32421875 381:0.50390625 382:0.9921875 383:0.9296875 384:0.171875 408:0.23046875 409:0.97265625 410:0.9921875 411:0.2421875 436:0.51953125 437:0.9921875 438:0.73046875 439:0.01953125 463:0.03515625 464:0.80078125 465:0.96875 466:0.2265625 491:0.4921875 492:0.9921875 493:0.7109375 518:0.29296875 519:0.98046875 520:0.9375 521:0.22265625 545:0.07421875 546:0.86328125 547:0.9921875 548:0.6484375 572:0.01171875 573:0.79296875 574:0.9921875 575:0.85546875 576:0.13671875 600:0.1484375 601:0.9921875 602:0.9921875 603:0.30078125 627:0.12109375 628:0.875 629:0.9921875 630:0.44921875 631:0.00390625 655:0.51953125 656:0.9921875 657:0.9921875 658:0.203125 682:0.23828125 683:0.9453125 684:0.9921875 685:0.9921875 686:0.203125 710:0.47265625 711:0.9921875 712:0.9921875 713:0.85546875 714:0.15625 738:0.47265625 739:0.9921875 740:0.80859375 741:0.0703125

Here are some results. \[
\begin{array}{c|c|c}
\mbox{Model } &\mbox{ Test Errors } &\mbox{ Notes } \\ \hline
\mbox{Linear } &\mbox{ 848 } & \\
\mbox{Ngram } &\mbox{ 436 } & \verb!--ngram 2 --skips 1! \\
\mbox{Quadratic } &\mbox{ 301 } & \verb!-q pp! \\
\mbox{NN } &\mbox{ 273 } & \verb!--nn 40!
\end{array}
\] Quadratic gives good results but is s-l-o-w to train (each example has > 10000 features). NGram gives some of the boost of quadratic and is fairly zippy (due to the encoding, these are horizontal n-grams; vertical n-grams would presumably also help). A 40 hidden unit neural network does better than Quadratic and is also faster to train. The command line invocation looks something like this:
pmineiro@pmineirovb-74% ./vw --oaa 10 -l 0.5 --loss_function logistic --passes 10 --hash all -b 22 --adaptive --invariant --random_weights 1 --random_seed 14 --nn 40 -k -c --cache_file nncache train.data.gz -f nnmodel
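As a back-of-the-envelope check on why the 40 hidden unit network trains faster than full quadratic here (the counts below are my rough estimates, using the single example line shown earlier, which has roughly 120 nonzero pixels): \[
\begin{array}{c|c}
\verb!-q pp! &\mbox{ on the order of } 120^2 \approx 1.4 \times 10^4 \mbox{ pair features per example} \\
\verb!--nn 40! &\mbox{ about } 40 \times 120 \approx 4.8 \times 10^3 \mbox{ input-to-hidden multiplies per example}
\end{array}
\] So the hidden layer does a few times less per-example work than the quadratic expansion, which is consistent with the relative training speeds reported above.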