The splice site dataset from the 2008 Pascal Large Scale Learning Challenge is a collection of 50 million labeled DNA sequences, each 200 base pairs in length. For our purposes, it is a binary classification problem on strings over a four-letter alphabet. Here are some examples:
% paste -d' ' <(bzcat dna_train.lab.bz2) <(bzcat dna_train.dat.bz2) | head -3
-1 AGGTTGGAGTGCAGTGGTGCGATCATAGCTCACTGCAGCCTCAAACTCCTGGGCTCAAGTGATCCTCCCATCTCAGCCTCCCAAATAGCTGGGCCTATAGGCATGCACTACCATGCTCAGCTAATTCTTTTGTTGTTGTTGTTGAGACGAAGCCTCGCTCTGTCGCCCAGGCTGGAGTGCAGTGGCACAATCTCGGCTCG
-1 TAAAAAAATGACGGCCGGTCGCAGTGGCTCATGCCTGTAATCCTAGCACTTTGGGAGGCCGAGGCGGGTGAATCACCTGAGGCCAGGAGTTCGAGATCAGCCTGGCCAACATGGAGAAATCCCGTCTCTACTAAAAATACAAAAATTAGCCAGGCATGGTGGCGGGTGCCTGTAATCCCAGCTACTCGGGAGGCTGAGGT
-1 AAAAGAGGTTTAATTGGCTTACAGTTCCGCAGGCTCTACAGGAAGCATAGCGCCAGCATCTCACAATCATGACAGAAGATGAAGAGGGAGCAGGAGCAAGAGAGAGGTGAGGAGGTGCCACACACTTTTAAACAACCAGATCTCACGAAAACTCAGTCACTATTGCAAGAACAGCACCAAGGGGACGGTGTTAGAGCATT

It turns out that if you break these strings up into $n$-grams, logistic regression does quite well. Here's a small program that processes the DNA sequences into 4-grams and outputs a vee-dub compatible format.
% less quaddna2vw.cpp
#include <iostream>
#include <string>

namespace
{
  using namespace std;

  // map a DNA base onto two bits: A -> 0, C -> 1, G -> 2, T -> 3
  unsigned int codec (const string::const_iterator& c)
    {
      return *c == 'A' ? 0 : *c == 'C' ? 1 : *c == 'G' ? 2 : 3;
    }
}

int main (void)
{
  using namespace std;

  while (! cin.eof ())
    {
      string line;

      getline (cin, line);

      if (line.length ())
        {
          // four iterators form a sliding window over successive 4-grams
          string::const_iterator ppp = line.begin ();
          string::const_iterator pp = ppp + 1;
          string::const_iterator p = pp + 1;
          unsigned int offset = 1;

          cout << " |f";

          for (string::const_iterator c = p + 1;
               c != line.end ();
               ++ppp, ++pp, ++p, ++c)
            {
              // each 4-gram is a number in [0, 256); each position gets
              // its own block of 256 feature indices via the offset
              unsigned int val = 64 * codec (ppp) +
                                 16 * codec (pp) +
                                  4 * codec (p) +
                                      codec (c);

              cout << " " << offset + val << ":1";
              offset += 256;
            }

          cout << endl;
        }
    }

  return 0;
}

I'll use the following Makefile to drive the learning pipeline.
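Before wiring that in, here's a quick sanity check of the encoding (my own illustrative example, not part of the original pipeline): each position's 4-gram gets its own block of 256 feature indices, so a 200-base sequence yields 197 sparse features.

% echo ACGTA | ./quaddna2vw
 |f 28:1 365:1

Here ACGT encodes to $64 \cdot 0 + 16 \cdot 1 + 4 \cdot 2 + 3 = 27$ in the first block (offset 1, hence index 28), and CGTA to $108$ in the second block (offset 257, hence index 365). The program emits only the feature part of each line; the label gets pasted onto the front by the pipeline below, completing the vw input format.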
% less Makefile
SHELL=/bin/zsh
CXXFLAGS=-O3

.SECONDARY:

all:

%.check:
        @test -x "$$(which $*)" || {                            \
          echo "ERROR: you need to install $*" 1>&2;            \
          exit 1;                                               \
        }

dna_train.%.bz2: wget.check
        wget ftp://largescale.ml.tu-berlin.de/largescale/dna/dna_train.$*.bz2

quaddna2vw: quaddna2vw.cpp

quaddna.model.nn%: dna_train.lab.bz2 dna_train.dat.bz2 quaddna2vw vw.check
        time paste -d' '                                        \
          <(bzcat $(word 1,$^))                                 \
          <(bzcat $(word 2,$^) | ./quaddna2vw) |                \
        tail -n +1000000 |                                      \
        vw -b 24 -l 0.05 --adaptive --invariant                 \
          --loss_function logistic -f $@                        \
          $$([ $* -gt 0 ] && echo "--nn $* --inpass")

quaddna.test.%: dna_train.lab.bz2 dna_train.dat.bz2 quaddna.model.% quaddna2vw vw.check
        paste -d' '                                             \
          <(bzcat $(word 1,$^))                                 \
          <(bzcat $(word 2,$^) | ./quaddna2vw) |                \
        head -n +1000000 |                                      \
        vw -t --loss_function logistic -i $(word 3,$^) -p $@

quaddna.perf.%: dna_train.lab.bz2 quaddna.test.% perf.check
        paste -d' '                                             \
          <(bzcat $(word 1,$^))                                 \
          $(word 2,$^) |                                        \
        head -n +1000000 |                                      \
        perf -ROC -APR

Here's the result of one sgd pass through the data using logistic regression.
% make quaddna.perf.nn0
g++ -O3 -I/home/pmineiro/include -I/usr/local/include -L/home/pmineiro/lib -L/usr/local/lib quaddna2vw.cpp -o quaddna2vw
time paste -d' ' \
  <(bzcat dna_train.lab.bz2) \
  <(bzcat dna_train.dat.bz2 | ./quaddna2vw) | \
tail -n +1000000 | \
vw -b 24 -l 0.05 --adaptive --invariant \
  --loss_function logistic -f quaddna.model.nn0 \
  $([ 0 -gt 0 ] && echo "--nn 0 --inpass")
final_regressor = quaddna.model.nn0
Num weight bits = 24
learning rate = 0.05
initial_t = 0
power_t = 0.5
using no cache
Reading from
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
0.673094   0.673094            3         3.0  -1.0000  -0.0639      198
0.663842   0.654590            6         6.0  -1.0000  -0.0902      198
0.623277   0.574599           11        11.0  -1.0000  -0.3074      198
0.579802   0.536327           22        22.0  -1.0000  -0.3935      198
...
0.011148   0.009709     22802601  22802601.0  -1.0000 -12.1878      198
0.009952   0.008755     45605201  45605201.0  -1.0000 -12.7672      198

finished run
number of examples = 49000001
weighted example sum = 4.9e+07
weighted label sum = -4.872e+07
average loss = 0.009849
best constant = -0.9942
total feature number = 9702000198
paste -d' ' <(bzcat dna_train.lab.bz2)   53.69s user 973.20s system 36% cpu 46:22.36 total
tail -n +1000000  3.87s user 661.57s system 23% cpu 46:22.36 total
vw -b 24 -l 0.05 --adaptive --invariant --loss_function logistic -f   286.54s user 1380.19s system 59% cpu 46:22.43 total
paste -d' ' \
  <(bzcat dna_train.lab.bz2) \
  <(bzcat dna_train.dat.bz2 | ./quaddna2vw) | \
head -n +1000000 | \
vw -t --loss_function logistic -i quaddna.model.nn0 -p quaddna.test.nn0
only testing
Num weight bits = 24
learning rate = 10
initial_t = 1
power_t = 0.5
predictions = quaddna.test.nn0
using no cache
Reading from
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
0.000020   0.000020            3         3.0  -1.0000 -17.4051      198
0.000017   0.000014            6         6.0  -1.0000 -17.3808      198
0.000272   0.000578           11        11.0  -1.0000  -5.8593      198
0.000168   0.000065           22        22.0  -1.0000 -10.5622      198
...
0.008531   0.008113       356291    356291.0  -1.0000 -14.7463      198
0.008372   0.008213       712582    712582.0  -1.0000  -7.1162      198

finished run
number of examples = 1000000
weighted example sum = 1e+06
weighted label sum = -9.942e+05
average loss = 0.008434
best constant = -0.9942
total feature number = 198000000
paste -d' ' \
  <(bzcat dna_train.lab.bz2) \
  quaddna.test.nn0 | \
head -n +1000000 | \
perf -ROC -APR
APR    0.51482
ROC    0.97749

That's 47 minutes of wall-clock training time for a test APR of 0.515. (If you read the above closely, you'll note I'm using the first million lines of the file as test data and the rest as training data.) This is fairly respectable: entries to the large-scale learning challenge achieved an APR of circa 0.2, which is what you would get from unigram logistic regression, whereas the best known approaches on this dataset take multiple core-days to compute and achieve an APR of circa 0.58.
During the above run, quaddna2vw uses 100% of one CPU and vw uses about 60% of another. In other words, vw is not the bottleneck, so we can spend a little extra CPU on learning without any actual wall-clock impact. Ergo, it's time to go one louder, by specifying a small number of hidden units and direct connections from the input to the output layer via --nn 8 --inpass. Everything else remains identical.
% make quaddna.perf.nn8
time paste -d' ' \
  <(bzcat dna_train.lab.bz2) \
  <(bzcat dna_train.dat.bz2 | ./quaddna2vw) | \
tail -n +1000000 | \
vw -b 24 -l 0.05 --adaptive --invariant \
  --loss_function logistic -f quaddna.model.nn8 \
  $([ 8 -gt 0 ] && echo "--nn 8 --inpass")
final_regressor = quaddna.model.nn8
Num weight bits = 24
learning rate = 0.05
initial_t = 0
power_t = 0.5
using input passthrough for neural network training
randomly initializing neural network output weights and hidden bias
using no cache
Reading from
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
0.600105   0.600105            3         3.0  -1.0000  -0.2497      198
0.576544   0.552984            6         6.0  -1.0000  -0.3317      198
0.525074   0.463309           11        11.0  -1.0000  -0.6047      198
0.465905   0.406737           22        22.0  -1.0000  -0.7760      198
...
0.010760   0.009331     22802601  22802601.0  -1.0000 -11.5363      198
0.009633   0.008505     45605201  45605201.0  -1.0000 -11.7959      198

finished run
number of examples = 49000001
weighted example sum = 4.9e+07
weighted label sum = -4.872e+07
average loss = 0.009538
best constant = -0.9942
total feature number = 9702000198
paste -d' ' <(bzcat dna_train.lab.bz2)   58.24s user 1017.98s system 38% cpu 46:23.54 total
tail -n +1000000  3.77s user 682.93s system 24% cpu 46:23.54 total
vw -b 24 -l 0.05 --adaptive --invariant --loss_function logistic -f   2341.03s user 573.53s system 104% cpu 46:23.61 total
paste -d' ' \
  <(bzcat dna_train.lab.bz2) \
  <(bzcat dna_train.dat.bz2 | ./quaddna2vw) | \
head -n +1000000 | \
vw -t --loss_function logistic -i quaddna.model.nn8 -p quaddna.test.nn8
only testing
Num weight bits = 24
learning rate = 10
initial_t = 1
power_t = 0.5
predictions = quaddna.test.nn8
using input passthrough for neural network testing
using no cache
Reading from
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
0.000041   0.000041            3         3.0  -1.0000 -15.2224      198
0.000028   0.000015            6         6.0  -1.0000 -16.5099      198
0.000128   0.000247           11        11.0  -1.0000  -6.7542      198
0.000093   0.000059           22        22.0  -1.0000 -10.7089      198
...
0.008343   0.007864       356291    356291.0  -1.0000 -14.3546      198
0.008138   0.007934       712582    712582.0  -1.0000  -7.0710      198

finished run
number of examples = 1000000
weighted example sum = 1e+06
weighted label sum = -9.942e+05
average loss = 0.008221
best constant = -0.9942
total feature number = 198000000
paste -d' ' \
  <(bzcat dna_train.lab.bz2) \
  quaddna.test.nn8 | \
head -n +1000000 | \
perf -ROC -APR
APR    0.53259
ROC    0.97844

From a wall-clock standpoint this is free: total training time increased by 1 second, and vw and quaddna2vw now have roughly equal throughput. Meanwhile, test APR has increased from 0.515 to 0.533. This illustrates the basic idea: engineer features that are good for your linear model, and then, when you run out of steam, try adding a few hidden units. It's like turning the design matrix up to eleven.
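Stripped of the pipeline plumbing, the recipe amounts to a one-flag change in the vw invocation; schematically (same flags as used above):

# baseline: sparse logistic regression on the 4-gram features
vw -b 24 -l 0.05 --adaptive --invariant --loss_function logistic -f quaddna.model.nn0

# one louder: the same model plus 8 hidden units and direct input-to-output connections
vw -b 24 -l 0.05 --adaptive --invariant --loss_function logistic -f quaddna.model.nn8 \
   --nn 8 --inpass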
I speculate that something akin to gradient boosting is happening, due to the learning rate schedule induced by the adaptive gradient. Specifically, if the direct connections converge more rapidly, then the hidden units are effectively being asked to model the residual from the linear model. This suggests that a more explicit form of boosting might yield an even better free lunch.
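To make that intuition concrete, here is my own sketch of the model (assuming, as I believe vw's nn reduction with --inpass does, a linear term plus tanh hidden units feeding the same output). Before the logistic link, the prediction is roughly

$$\hat{y}(x) \;=\; w^\top x \;+\; \sum_{j=1}^{8} v_j \, \tanh(u_j^\top x).$$

If the adaptive learning rates let $w$ converge quickly, the gradients reaching $v_j$ and $u_j$ are dominated by whatever $w^\top x$ fails to explain, so the hidden units end up fitting something like the residual of the linear model, which is the role the weak learners play in gradient boosting.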