Tuesday, August 27, 2013

Cosplay

If you're a fan of primal linear methods, then you've probably spent a lot of time thinking about feature engineering. If you've used kernel learning, then you've probably spent a lot of time thinking about kernels that are appropriate to your problem, which is another way of thinking about feature engineering. It turns out there is a way to leverage the work of the kernel community while solving a primal convex optimization problem: random feature maps. This idea has been around for a while: the paper by Rahimi and Recht that really kicked things off is from 2007, Alex has blogged about it and refined the technique, and Veeramachaneni has a nice blog post which graphically explores the technique. One general strategy is to find an integral representation of a kernel, and then approximate the integral representation via Monte Carlo. At the end of the day, this looks like a randomized feature map $\phi$ for which dot products $\phi (x)^\top \phi(y)$ in the primal are converging to the value of kernel function $k (x, y)$. When using the Fourier basis for the integral representation of the kernel, the random feature map consists of cosines, so we call it ``cosplay'' over in here in CISL.

The technique deserves to be more well known than it is, because it gives good learning performance (when the associated kernel is a good choice for the problem), it is straightforward to implement, and it is very fast. I'm hoping that I can increase awareness by providing a simple implementation on a well-known dataset, and in that spirit here is a Matlab script which applies the technique to mnist. Before running this you need to download mnist in matlab format, and download maxent and lbfgs for matlab.

rand('seed',867);
randn('seed',5309);

tic
fprintf('loading mnist');

% get mnist from http://cs.nyu.edu/~roweis/data/mnist_all.mat
load('mnist_all.mat');

trainx=single([train0; train1; train2; train3; train4; train5; train6; train7; train8; train9])/255.0;
testx=single([test0; test1; test2; test3; test4; test5; test6; test7; test8; test9])/255.0;
st=[size(train0,1); size(train1,1); size(train2,1); size(train3,1); size(train4,1); size(train5,1); size(train6,1); size(train7,1); size(train8,1); size(train9,1)];
ss=[size(test0,1); size(test1,1); size(test2,1); size(test3,1); size(test4,1); size(test5,1); size(test6,1); size(test7,1); size(test8,1); size(test9,1)];
paren = @(x, varargin) x(varargin{:});
yt=[]; for i=1:10; yt=[yt; repmat(paren(eye(10),i,:),st(i),1)]; end
ys=[]; for i=1:10; ys=[ys; repmat(paren(eye(10),i,:),ss(i),1)]; end

clear i st ss
clear train0 train1 train2 train3 train4 train5 train6 train7 train8 train9
clear test0 test1 test2 test3 test4 test5 test6 test7 test8 test9

fprintf(' finished: ');
toc

tic
fprintf('computing random feature map');

% (uncentered) pca to 50 ... makes subsequent operations faster,
% but also makes the random projection more efficient by focusing on
% where the data is

opts.isreal = true; 
[v,~]=eigs(double(trainx'*trainx),50,'LM',opts);
trainx=trainx*v;
testx=testx*v; 
clear v opts;

% estimate kernel bandwidth using the "median trick"
% this is a standard Gaussian kernel technique

[n,k]=size(yt);
[m,p]=size(testx);
sz=3000;
perm=randperm(n);
sample=trainx(perm(1:sz),:);
norms=sum(sample.^2,2);
dist=norms*ones(1,sz)+ones(sz,1)*norms'-2*sample*sample';
scale=1/sqrt(median(dist(:)));

clear sz perm sample norms dist;

% here is the actual feature map:
% Gaussian random matrix, uniform phase, and cosine

d=4000;
r=randn(p,d);
b=2.0*pi*rand(1,d);
trainx=cos(bsxfun(@plus,scale*trainx*r,b));
testx=cos(bsxfun(@plus,scale*testx*r,b));

fprintf(' finished: ');
toc

tic
fprintf('starting logistic regression (this takes a while)\n');

% get @maxent and lbfgs.m from http://www.cs.grinnell.edu/~weinman/code/
% if you get an error about randint being undefined, change it to randi

addpath recognition;
addpath opt;
addpath local;

C0=maxent(k,d);
[~,trainy]=max(yt');
options.MaxIter=300; 
options.Display='off';
C1=train(C0,trainy,trainx,'gauss',4.2813,[],[],[],options);
% regularizer was chosen by cross-validation as follows
%perm=randperm(n);
%it=logical(zeros(1,n));
%it(perm(1:int32(0.8*n)))=1;
%[C1,V]=cvtrain(C0,trainy(perm),trainx(perm,:),'gauss',10.^linspace(-4,4,20), ...
%               [],0,[],it,[],@accuracy);
        
fprintf('finished: ');
toc
fprintf('train accuracy is %g\n',accuracy(C1,trainy,trainx));
[~,testy]=max(ys');
fprintf('test accuracy is %g\n',accuracy(C1,testy,testx));

Here's the result of running the script on my laptop:
>> clear all; cosplay
loading mnist finished: Elapsed time is 2.227499 seconds.
computing random feature map finished: Elapsed time is 6.994094 seconds.
starting logistic regression (this takes a while)
finished: Elapsed time is 219.007670 seconds.
train accuracy is 0.99905
test accuracy is 0.9822
This approaches the performance of the Gaussian kernel SVM, but with simplicity and speed. By trying different random feature maps, you can improve upon this result.

If you like this sort of thing, make sure to check out the Randomized Methods for Machine Learning workshop at NIPS 2013.