The next time somebody asks, "Who's Afraid of Non-Convex Loss Functions?", I'm going to raise my hand. This dyadic model is only mildly non-convex but is proving challenging to get right.
I downloaded the bookcrossing dataset. The ratings are ordinal so the standard evaluation metric is mean absolute error. Fortunately $\tau$-quantile loss also has an analytic solution to the dynamics, so I thought I was set. $\tau$-quantile loss is defined by \[
l (p, y) = \begin{cases} \tau (y - p) & y > p \\ (1 - \tau) (p - y) & y \leq p \end{cases}.
\] When $\tau = 1/2$ this is the same as mean absolute error (up to an overall factor of 2).
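In code, the loss and its derivative with respect to the prediction are just a branch (a trivial sketch, with my own naming):

def quantile_loss(p, y, tau=0.5):
    # tau-quantile loss as defined above
    return tau * (y - p) if y > p else (1.0 - tau) * (p - y)

def quantile_loss_derivative(p, y, tau=0.5):
    # derivative of the loss with respect to the prediction p
    return -tau if y > p else 1.0 - tau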
I fired this up on the bookcrossing dataset, but it didn't work, in the following sense. As I increased the learning rate, it was possible to exhibit pathological behaviour, e.g., progressive validation loss diverging. This is despite the fact that the importance aware update cannot overrun the label. What I believe is happening (based upon printf) is that there are many ways to cause the loss on the current example to go to zero, and some of those ways involve a large dyadic parameter multiplied by a small dyadic parameter; unfortunately this combination is unlikely to generalize well.
$L^2$ regularization will favor dyadic parameter configurations of uniform magnitude, and fortunately if only the dyadic parameters are regularized there is an analytical solution for $\tau$-quantile loss (and hinge loss). The dynamics with regularization are given by \[
\begin{aligned}
a' (h) &= -\eta b (h) \frac{\partial l}{\partial p}\biggr|_{p = a (h)^\top b (h) + (w - s (h) x)^\top x} - \eta \lambda a (h), \\
b' (h) &= -\eta a (h) \frac{\partial l}{\partial p}\biggr|_{p = a (h)^\top b (h) + (w - s (h) x)^\top x} - \eta \lambda b (h), \\
s' (h) &= \eta \frac{\partial l}{\partial p}\biggr|_{p = a (h)^\top b (h) + (w - s (h) x)^\top x},
\end{aligned}
\] which for $\tau$-quantile loss has the solution \[
\begin{aligned}
a (h) &= e^{-\eta \lambda h} \left( a_0 \cosh (-s (h)) + b_0 \sinh (-s (h)) \right), \\
b (h) &= e^{-\eta \lambda h} \left( b_0 \cosh (-s (h)) + a_0 \sinh (-s (h)) \right), \\
s (h) &= -\eta h \begin{cases} \tau & y > p \\ (\tau - 1) & y \leq p \end{cases}.
\end{aligned}
\] Without regularization, once $p = y$ the dynamics halt. With regularization this is not necessarily true; if $\lambda$ is large enough the dynamics can go past the label making progress on the regularizer. Even if $p = y$ is an attractor, it's possible to trade off linear parameters (which are not regularized) for dyadic ones while maintaining a constant prediction. I ignored those weird cases and just stopped updating when $p = y$.
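Putting the closed form together, one regularized update step looks roughly like the following sketch (my own naming and structure; it does not handle the cutoff at $p = y$, which requires finding the $h$ at which the prediction crosses the label):

import math

def regularized_quantile_update(a0, b0, w, x, y, h, eta, lam, tau=0.5):
    # prediction before the update
    p0 = sum(ai * bi for ai, bi in zip(a0, b0)) + sum(wi * xi for wi, xi in zip(w, x))
    # dl/dp is constant on each branch of the tau-quantile loss
    g = -tau if y > p0 else 1.0 - tau
    s = eta * h * g                      # s(h) = eta * h * dl/dp
    decay = math.exp(-eta * lam * h)     # only the dyadic parameters are regularized
    c, sh = math.cosh(-s), math.sinh(-s)
    a = [decay * (ai * c + bi * sh) for ai, bi in zip(a0, b0)]
    b = [decay * (bi * c + ai * sh) for ai, bi in zip(a0, b0)]
    w_new = [wi - s * xi for wi, xi in zip(w, x)]
    return a, b, w_new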
So I tried that, and it also didn't work, but in a different way. It does not exhibit pathological behaviour with high learning rates as long as $\lambda$ is non-zero (this is good: it means reductions and active learning will not go haywire). In addition, it is possible to drive the progressive validation loss on the training set much lower with dyadic dimensions than without. However this improvement does not translate to the test set. In other words, it fit the training data better, but it didn't generalize well.
So I actually implemented the approach from Menon and Elkan in order to compare. I found that it also didn't generalize well on the bookcrossing dataset. Then I noticed that the number of ratings, users, etc. reported in their paper didn't match what I had. I'm not sure what's up with the bookcrossing dataset, so I decided to switch to the 1M movielens dataset. The Menon and Elkan approach did show some uplift here, not as large as reported in the paper, but on the other hand I didn't implement all the bells and whistles from their paper like variable regularization and reduced parameterization. Now I had a dataset for which I expected to make progress. Funny thing about machine learning: it's a lot easier to get something to work when you know it's supposed to :)
Here's how my dyadic fit to the 1M movielens ratings performs.\[
\begin{array}{c|c|c}
\mbox{Model } &\mbox{ $0.5$-quantile loss on Test Set } &\mbox{ Comments } \\ \hline
\mbox{Best Constant } &\mbox{ 0.428 } &\mbox{ Median rating is 4 } \\ \hline
\mbox{Linear } &\mbox{ 0.364 } & \mbox{Prediction is $\alpha_{rid} + \beta_{tid} + c$} \\ \hline
\mbox{Dyadic $k=1$ } &\mbox{ 0.353 } & \mbox{Prediction is $a_{rid}^\top b_{tid} + \alpha_{rid} + \beta_{tid} + c$} \\ \hline
\mbox{Dyadic $k=2$ } &\mbox{ 0.351 } &\mbox{ Like previous with $a_{rid}, b_{tid}$ as 2-vectors } \\ \hline
\mbox{Dyadic $k=5$ } &\mbox{ 0.349 } &\mbox{ Like previous with 5-vectors instead of 2-vectors}
\end{array}
\] Ok once again I say, where's the beef? Most of the predictive power over the best constant is coming from the linear components. Looking at table 3 from the Menon and Elkan paper, I can see that they also don't show dramatic lift with increasing latent dimensionality.
Friday, May 27, 2011
Friday, May 20, 2011
A Dyadic Importance Aware Update: Part II
Implementation Odds and Ends
It turns out the dyadic model I discussed in my previous post was a little simpler than is typically used. A more typical model structure is \[p = a^\top b + a^\top 1 + 1^\top b + w^\top x
\] however, with a change of variables \[
\begin{aligned}
\tilde a (h) &= a (h) + 1 \\
\tilde b (h) &= b (h) + 1 \\
\tilde p &= p + 1^\top 1
\end{aligned}
\] one recovers the original equations. This also has the effect of moving the saddle point of the gradient dynamics away from zero; however initializing all model parameters to zero still has problems because there is nothing to break the symmetry between the model parameters. (I chose to break the symmetry by applying a small perturbation to the dyadic parameters on their first update.) In what follows I'll be referring to the original equations to ease the exposition, but in practice the change of variables is employed.
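For concreteness, the change of variables works because \[
\tilde a^\top \tilde b + w^\top x = a^\top b + a^\top 1 + 1^\top b + 1^\top 1 + w^\top x = p + 1^\top 1 = \tilde p,
\] i.e., the shifted parameters obey the simpler model form with the prediction offset by the constant $1^\top 1$.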
Numerically computing the prediction for a particular step-size, \[
p (s) = a_0^\top b_0 \cosh (2 s) - (||a_0||^2 + ||b_0||^2) \cosh (s) \sinh (s) + x^\top (w - s x)
\] and the implicitly defined inverse $p^{-1} (y)$ turn out to be important. This form is very convenient because only the norms of $a_0$ and $b_0$ along with their inner product are required to simulate a particular step. In terms of software: the interface for a loss function needs to be augmented with two extra scalar inputs, $||a_0||^2 + ||b_0||^2$ and $a_0^\top b_0$ (and, with the change of variables, a third, $1^\top 1$).
In the forward direction, $p (s)$ is a recipe for catastrophic cancellation so some care is required when computing it. Once this is addressed, the reverse direction $p^{-1} (y)$ is mostly well-behaved and can be solved using Newton's method with backtracking line search. However, when $a_0^\top b_0 \approx \frac{1}{2} (||a_0||^2 + ||b_0||^2)$ and $x \approx 0$, precision issues dominate. Essentially, when there are no linear features ($x \approx 0$), the dyadic terms lie near a hyperdiagonal (e.g., $a_0 \approx \pm b_0$), and the prediction has the wrong sign, it takes a very large $s$ to change the sign of the prediction. Fortunately, putting a constant linear feature in the model ensures $x^\top x \geq 1$. In this case encountering a hyperdiagonal essentially implies no learning for the dyadic model parameters, but hopefully dyads are paired ``sufficiently randomly'' in the training sequence that this is not a problem in practice.
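One way to make the cancellation explicit (this is just one possible rearrangement, not necessarily what the implementation does) is to rewrite the dyadic part of $p (s)$ as \[
a_0^\top b_0 \cosh (2 s) - (||a_0||^2 + ||b_0||^2) \cosh (s) \sinh (s) = \tfrac{1}{4} e^{-2 s} ||a_0 + b_0||^2 - \tfrac{1}{4} e^{2 s} ||a_0 - b_0||^2,
\] so that near a hyperdiagonal the small coefficient $||a_0 - b_0||^2$ (or $||a_0 + b_0||^2$) is computed directly, rather than arising as the difference of two large hyperbolic terms.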
The function $p^{-1} (y)$ is used directly to enforce the boundary for hinge loss. It is also related to the smallest importance weight that would cause the prediction to change sign, a quantity that comes up during active learning. (Well it comes up in active learning with linear models; whether the same theory applies to dyadic models is not clear, but I'm hoping sampling ``near the boundary'' and using importance weights will do something reasonable.) The smallest importance weight that causes the model output to be $x$ when the label is $y$ is given by $h_{min} (x, y) = s^{-1} (p^{-1} (x); y)$. Leveraging the separation of variables trick from Karampatziakis and Langford, the update function $s (h; y)$ can be inverted without solving the initial value problem via \[
\begin{aligned}
s' (h; y) &= \eta \frac{\partial l (p, y)}{\partial p}\biggr|_{p = p (s)} = \eta F (s; y), \\
dh &= \frac{1}{\eta F (s; y)} ds \\
s^{-1} (x; y) &= \frac{1}{\eta} \int_{0}^x \frac{d s}{F (s; y)}.
\end{aligned}
\] For hinge loss this boils down to $h_{min} (x, y) = -p^{-1} (x) / (\eta y)$; for other losses the integral has to be done numerically, but for active learning purposes presumably only a rough numerical integration need be done.
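For what it's worth, that rough numerical integration can be as crude as a trapezoid rule; here dl_dp and p_of_s stand in for the loss derivative and the prediction $p (s)$, which the caller would supply:

def s_inverse(target_s, y, eta, dl_dp, p_of_s, n=100):
    # (1 / eta) * integral from 0 to target_s of ds / F(s; y),
    # where F(s; y) = dl/dp evaluated at p = p(s); simple trapezoid rule
    step = target_s / n
    total = 0.0
    prev = 1.0 / dl_dp(p_of_s(0.0), y)
    for i in range(1, n + 1):
        cur = 1.0 / dl_dp(p_of_s(step * i), y)
        total += 0.5 * (prev + cur) * step
        prev = cur
    return total / eta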
Does It Work?
Well, there are many ways to answer that question :) I had some dyadic data lying around corresponding to the results from a Mechanical Turk HIT in which I asked users to guess the ethnicity of Twitter users based upon their public profile: there are 10,000 tasks and 5 raters each, for a total of 50,000 data points. I split the data up, trained on most of it (15/16ths), and tested on the rest (1/16th). I'm only using the identifiers as features, e.g., an input line might look like
2 2|rater r_2737 |task t_89 |id r_2737 t_89
for a model with one latent dimension and
2 2|rater r_2737 r2_2737 |task t_89 t2_89 |id r_2737 t_89
If that input seems funny, that's because right now I don't emit linear features for feature spaces that are dot-producted together (in this case, "rater" and "task" are dot-producted). This violates DRY so I'll probably change this. (Also, I shouldn't have to regenerate the input to change the number of latent dimensions, so that needs to change.)
The nominallabelextract model I put together to process this data is akin to a dyadic model with one latent dimension. With the above framework I can easily add additional latent dimensions, which is reminiscent of the Welinder et al. multidimensional embedding. The goal is a bit different here, however: in those cases, the idea is to extract ``ground truth'', i.e., a probability distribution over the ``true'' ethnicity for the task. Here the goal will be merely to predict the label that a rater will assign to a task. This is related to the true underlying label of the task, but also related to properties of the rater.
As an aside, to the extent that it is possible to reliably predict the test set labels from the training set labels, that would mean that I paid too much money for my labels since the ones in the test set were redundant. Using active learning techniques on this problem might mitigate the redundancy. Perversely, however, this might reward a rater answering randomly with more work.
I randomly did the training/test split, trained each model on the same training set and tested each model on the same test set. The idea is to have a lot of twinning of individual raters and individual tasks between the two data sets (but not of pairs; each pair occurs at most once). I'm using a scoring filter tree reduction to reduce multiclass to binary, and then I'm using hinge loss for the binary classifier. Here are some results. \[
\begin{array}{c|c|c}
\mbox{Model } & \mbox{0/1 Loss on Test Set }(\pm 0.007) &\mbox{ Comments } \\ \hline
\mbox{Best Constant } &\mbox{ 0.701 } & \mbox{Most frequent ethnicity occurs }\approx 30\% \mbox{ of the time} \\ \hline
\mbox{Linear } &\mbox{ 0.214 } & \mbox{Prediction is }\alpha_{rid} + \beta_{tid} + c \\ \hline
\mbox{Dyadic }k=1 &\mbox{ 0.198 } & \mbox{Prediction is }a_{rid}^\top b_{tid} + \alpha_{rid} + \beta_{tid} + c \\ \hline
\mbox{Dyadic }k=2 &\mbox{ 0.198 } &\mbox{ Like previous with }a_{rid}, b_{tid} \mbox{ as 2-vectors } \\ \hline
\mbox{Dyadic }k=3 &\mbox{ 0.199 } &\mbox{ Like previous with 3-vectors instead of 2-vectors}
\end{array}
\] Ok so this is not blowing my socks off $\ldots$ the differences are not statistically significant (using a bogus binomial variance estimate). There seems to be a tiny bit of juice when adding a single latent dimension, and after that it poops out. In retrospect this is a poor choice of dataset for testing since absent adversarial behaviour by the raters the important variables are the rater bias and the true task label, both of which are nicely captured by the linear model. By the way I chose all the learning parameters to give the best result for the linear model, then reused them for the other models.
On the other hand, operationally things look promising. First the speed penalty is fairly minor between dyadic $k=3$ and linear (80 seconds vs 50 seconds to do 160 passes). Since an input record is bigger in the dyadic $k=3$ case, I tried training a linear model on the dyadic $k=3$ input data, in which case the difference disappears. I don't think the difference is parsing; rather I suspect more weight values are being accessed per line which is causing more cache misses. In any event, for an equal number of effective features, the overhead of computing $p^{-1} (y)$ is not noticeable. Of course, hinge loss has an analytic solution to the dynamics; for other loss functions I have to solve an initial value problem, which might slow things down considerably.
Second the results are robust with respect to the setting of the learning rate. There is definitely a best learning rate in terms of minimizing test set loss, but when I varied the learning rate over 4 orders of magnitude there was never any pathological behaviour. Perhaps in a more complicated model I would encounter trouble, or perhaps when using a different loss function (a large learning rate might complicate numerically solving the initial value problem). Also, I haven't implemented regularization, which might be required to get a more complicated model working.
So next steps are to apply this to some of the testing datasets mentioned in Menon and Elkan, e.g., the bookcrossing dataset; and to implement for other loss functions and see if I get bad results (due to approximating the multidimensional ODE with a one-dimensional ODE), instability with respect to the learning rate (arising when solving the initial value problem), or unacceptably slow throughput (due to all the additional numerics on each update).
Friday, May 13, 2011
A Dyadic Importance Aware Update
So recently I talked about dyadic modeling, and how it appears very close to the kind of model which is currently implemented inside vowpal wabbit. However there are several wrinkles that arise due to the ``apparently minor'' differences.
The basic algorithm in vowpal is essentially a GLM optimized with SGD. One of the neat tricks under the hood is the idea of importance weight aware updates. The key observation is that for a GLM all the gradients for a particular example point in the same direction and differ only in their magnitude, i.e., if the prediction is $p = w^\top x$ then $\nabla_w l (p, y) = x \frac{\partial l}{\partial p}|_{p=w^\top x}$. Karampatziakis and Langford exploit this to simulate the limit of a large number of small gradient steps via the solution to an ordinary differential equation. The resulting updates have three nice properties. The first is safety: for loss functions where it is possible to overshoot the label, the importance aware update will never overshoot the label, unlike a naive scaling of the gradient. The second is invariance: two sequential updates applied with importance weights $w_1$ and $w_2$ are equivalent to a single update with importance weight $w_1 + w_2$. Safety makes the results of SGD less sensitive to the exact choice of learning rate, whereas invariance is important for reductions that leverage importance weights. What's the third nice property? It turns out there are closed form updates for all sorts of commonly used loss functions, so the importance aware update is cheap to compute.
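Concretely (this is my restatement of their trick, in the notation used below): because every gradient is along $x$, the weights stay on the ray $w - s (h) x$, and the importance aware update solves the scalar initial value problem \[
s' (h) = \eta \frac{\partial l}{\partial p}\biggr|_{p = (w - s (h) x)^\top x}, \qquad s (0) = 0,
\] which has a closed-form solution for many common losses; the final update is $w \leftarrow w - s (h) x$.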
Dyadic models are not GLMs, but some of the principles still apply. Consider a simple dyadic model, \[
\begin{aligned}
p &= a^\top b, \\
\nabla_a l (p, y) &= b \frac{\partial l}{\partial p}|_{p=a^\top b}, \\
\nabla_b l (p, y) &= a \frac{\partial l}{\partial p}|_{p=a^\top b}.
\end{aligned}
\] Not all the gradients point in the same direction, but nonetheless attempting to create an importance weight aware update yields a pair of equations, \[
\begin{aligned}
a' (h) &= -\eta b (h) \frac{\partial l}{\partial p}\biggr|_{p = a (h)^\top b (h)}, \\
b' (h) &= -\eta a (h) \frac{\partial l}{\partial p}\biggr|_{p = a (h)^\top b (h)}.
\end{aligned}
\] Mathematica can only solve this for me with hinge loss; in the linear regime the solution looks like \[
\begin{aligned}
a (h) &= a_0 \cosh (\eta h y) + b_0 \sinh (\eta h y), \\
b (h) &= b_0 \cosh (\eta h y) + a_0 \sinh (\eta h y),
\end{aligned}
\] and then as soon as one hits the boundary where $a (h)^\top b (h) = y$ there are no more changes. Mathematica can also analytically obtain the smallest $h_{min}$ for which $a (h)^\top b (h) = y$ and it is not too bad to compute (involving $\cosh^{-1}$ and square root); although when $a_0 \approx b_0$ some care is required (there is a series expansion available).
This update has all three desirable properties. Stopping once the product hits the label ensures safety. Invariance holds and can be verified in the linear regime by leveraging the identity $\cosh (a + b) = \cosh (a) \cosh (b) + \sinh (a) \sinh (b)$. Finally the update is relatively cheap to compute.
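As a sketch (my own naming; the cap at $h_{min}$, i.e., stopping once $a (h)^\top b (h) = y$, is omitted here), the update in the linear regime is just:

import math

def dyadic_hinge_update(a0, b0, y, h, eta):
    # closed-form update for p = a^T b with hinge loss and y in {-1, +1},
    # valid in the linear regime; a full implementation caps h at h_min
    t = eta * h * y
    c, s = math.cosh(t), math.sinh(t)
    a = [ai * c + bi * s for ai, bi in zip(a0, b0)]
    b = [bi * c + ai * s for ai, bi in zip(a0, b0)]
    return a, b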
If one replaces the prediction with $p = w^\top x + a^\top b$, the story remains essentially the same. The equations are \[
\begin{aligned}
a' (h) &= -\eta b (h) \frac{\partial l}{\partial p}\biggr|_{p = a (h)^\top b (h) + (w - s (h) x)^\top x}, \\
b' (h) &= -\eta a (h) \frac{\partial l}{\partial p}\biggr|_{p = a (h)^\top b (h) + (w - s (h) x)^\top x}, \\
s' (h) &= \eta \frac{\partial l}{\partial p}\biggr|_{p = a (h)^\top b (h) + (w - s (h) x)^\top x},
\end{aligned}
\] which for hinge loss has solution \[
\begin{aligned}
a (h) &= a_0 \cosh (\eta h y) + b_0 \sinh (\eta h y), \\
b (h) &= b_0 \cosh (\eta h y) + a_0 \sinh (\eta h y), \\
s (h) &= -\eta h y.
\end{aligned}
\] However the computation of $h_{min}$ is more complicated; I can't get an analytic expression for it, so unfortunately one-dimensional root finding is required, \[
\text{Solve}\left[\frac{1}{2} \left(||a_0||^2+||b_0||^2\right) \sinh (2 \eta h y)+a_0^\top b_0 \cosh (2 \eta h y)+x^\top (\eta h x y+w)=y,h\right].
\] This might disqualify the update from being considered ``cheap to compute''.
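For example, the root can be bracketed and bisected; this is only a sketch of the idea, not necessarily how one would do it in production:

import math

def h_min_hinge(a0_dot_b0, sq_norms, w_dot_x, x_dot_x, y, eta, h_hi=1e3, iters=100):
    # bisect f(h) = 0.5 * (||a0||^2 + ||b0||^2) * sinh(2 eta h y)
    #             + a0^T b0 * cosh(2 eta h y) + x^T (eta h y x + w) - y
    def f(h):
        t = 2.0 * eta * h * y
        return (0.5 * sq_norms * math.sinh(t) + a0_dot_b0 * math.cosh(t)
                + eta * h * y * x_dot_x + w_dot_x - y)
    lo, hi = 0.0, h_hi
    if f(lo) * f(hi) > 0.0:
        return None  # no sign change in this bracket; give up (sketch only)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)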
Overall I think this update might have some relevance for dyadic classification problems where hinge loss would be a reasonable choice. Quantile loss is also amenable analytically but other losses are not.
When I do a parametric plot of $a (h)$ and $b (h)$ for a one-dimensional dyadic update starting from different starting points I get a totally cool picture which I wanted to post. It reminds me of a warp drive trail from science fiction or something.
Numerically integrating the above equations for different losses produces similar looking pictures, although for some losses like logistic there is no boundary so the swoosh keeps going. This suggests the hinge loss case captures the essential curvature of the update, but the magnitude of the update is different for different losses due to the nonlinearity in the loss. That further suggests a hack like \[
\begin{aligned}
a (h) &= a_0 \cosh (-s (h)) + b_0 \sinh (-s (h)), \\
b (h) &= b_0 \cosh (-s (h)) + a_0 \sinh (-s (h)), \\
s' (h) &= \eta \frac{\partial l}{\partial p}\biggr|_{p = a (h)^\top b (h) + (w - s (h) x)^\top x}.
\end{aligned}
\] So here are some guesses about this hack. First guess: this is a descent direction since if the loss is linearized at $h = 0$ I get the hinge loss update which the above equations reproduce. Second guess: this update is invariant, since by substitution I can reduce this to a first-order ODE, and cosh and sinh do not spoil the continuous differentiability of the loss function. Third guess: this update has the safety property.
Of course this update involves numerically integrating a one-dimensional ODE, and so might prove too expensive in practice. Basically the hack reduces a multi-dimensional ODE to a one-dimensional ODE which is a computational savings, but definitely not as cool as closed form.
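A crude rendering of the hack, using forward Euler on the scalar ODE (almost certainly too sloppy for production, but it shows the structure; dl_dp is an assumed callable for the loss derivative):

import math

def dyadic_hack_update(a0, b0, w, x, y, h, eta, dl_dp, steps=100):
    a0_dot_b0 = sum(ai * bi for ai, bi in zip(a0, b0))
    sq_norms = sum(ai * ai for ai in a0) + sum(bi * bi for bi in b0)
    w_dot_x = sum(wi * xi for wi, xi in zip(w, x))
    x_dot_x = sum(xi * xi for xi in x)
    s, dh = 0.0, h / steps
    for _ in range(steps):
        # p(s): the dyadic part via the two scalars, plus the linear part
        p = (a0_dot_b0 * math.cosh(2.0 * s)
             - 0.5 * sq_norms * math.sinh(2.0 * s)
             + w_dot_x - s * x_dot_x)
        s += dh * eta * dl_dp(p, y)  # s'(h) = eta * dl/dp at p = p(s)
    c, sh = math.cosh(-s), math.sinh(-s)
    a = [ai * c + bi * sh for ai, bi in zip(a0, b0)]
    b = [bi * c + ai * sh for ai, bi in zip(a0, b0)]
    w_new = [wi - s * xi for wi, xi in zip(w, x)]
    return a, b, w_new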
To get a feel for why invariance works, consider what the value of $a$ is after doing two sequential updates with importance weights $h_1$ and $h_2$, \[
\begin{aligned}
a (p_1, h_2) &= a (p_0, h_1) \cosh (-s (p_1, h_2)) + b (p_0, h_1) \sinh (-s (p_1, h_2)), \\
p_0 &= a_0^\top b_0 + w^\top x, \\
p_1 &= a (p_0, h_1)^\top b (p_0, h_1) + (w - s (p_0, h_1) x)^\top x,
\end{aligned}
\] where the dependence upon the initial prediction $p_0$ and the intermediate prediction $p_1$ has been made explicit. Now note \[
\begin{aligned}
a (p_0, h_1) &= a_0 \cosh (-s (p_0, h_1)) + b_0 \sinh (-s (p_0, h_1)), \\
b (p_0, h_1) &= b_0 \cosh (-s (p_0, h_1)) + a_0 \sinh (-s (p_0, h_1)). \\
\end{aligned}
\] Substituting into above and using the hyperbolic addition identities yields \[
a (p_1, h_2) = a_0 \cosh (-s (p_0, h_1) - s (p_1, h_2)) + b_0 \sinh (-s (p_0, h_1) - s (p_1, h_2)),
\] therefore if \[
s (p_0, h_1 + h_2) = s (p_0, h_1) + s (p_1, h_2),
\] then \[
a (p_0, h_1 + h_2) = a (p_1, h_2),
\] and similarly for $b$.
Tuesday, May 10, 2011
Cost Sensitive Multi Label: An Observation
I'm faced with a cost-sensitive multi-label classification (CSML) problem, i.e., there is a set of labels $K$ and I am to assign any or all of them to instances (in my case, text documents). One approach is to treat it as a cost-sensitive multi-class classification (CSMC) problem on the power set of labels $\mathcal{P} (K)$. At that point, I could reduce to binary classification using a filter tree. This has some advantages, such as consistency (zero regret on the induced subproblems implies zero regret on the original problem). It also has substantial disadvantages, both theoretical (the constant in the regret bound scales as $O (2^{|K|})$) and practical (the most general implementation would have time complexity scaling as $O (2^{|K|})$).
Another popular approach is to learn $|K|$ independent binary classifiers, query them independently at test time, and output the union. This has nice practical properties (time complexity scaling as $O (|K|)$). However decomposing a problem into a set of independent subproblems is generally a formula for creating an inconsistent reduction, so I was suspicious of this approach.
So here's a fun observation: learning a set of independent binary classifiers is equivalent to a filter tree approach on the CSMC problem under the following conditions.
- The filter tree is a scoring filter tree, i.e., the classifier at each node is \[
\Phi (x) = 1_{f (x; \lambda) > f (x; \phi)}.
\]
- The scoring function further decomposes into a weighted set of indicator functions, \[
f (x; \lambda) = \sum_{k \in K} 1_{k \in \lambda} f_k (x).
\]
- The loss function for the CSMC problem is Hamming loss.
- The tree is constructed such that at level $k$, the two inputs to each node differ only in the presence or absence of the $k^{\mathrm{th}}$ element of $K$. Thinking of the elements of $\mathcal{P} (K)$ as bit strings, the tournaments at level $k$ decide the value of the $k^{\mathrm{th}}$ bit in the binary expansion.
In this case, here's what happens. At the $k^\mathrm{th}$ level, the tournaments are all between sets that differ only in their $k^\mathrm{th}$ significant bit. The classifier for every node at this level has the form \[
\Phi (x) = 1_{f_k (x) > 0}
\] and all the importance weights for all the subproblems at this level are identical (because of the use of Hamming loss). Thus the entire $k^\mathrm{th}$ level of the tree is equivalent to a binary classifier which is independently learning whether or not to include the $k^\mathrm{th}$ element of $K$ into the final result.
So the good news is: this means learning independent binary classifiers is consistent if Hamming loss is the loss function on the CSML problem. Of course, the constant in the regret bound is huge, and the structural assumptions on the classifiers are significant, so in practice performance might be quite poor.
What about other loss functions? If the structural assumptions on the classifiers are retained, then each level of the filter tree can still be summarized with a single binary classifier. However, the importance weight of the examples fed to this classifier depend upon the output of the classifiers at the previous levels.
As an example, consider 0-1 loss on the entire set of labels. At the lowest level of the tree, the only tournament with a non-zero importance weight (cost difference) is the one which considers whether or not to include the first label conditional on all other labels being correct. At the second level of the tree, the only tournament that could possibly have a non-zero importance weight is the one that considers whether or not to include the second label conditional on all other labels being correct. However, if the first classifier made an error, this condition will not filter up the tree, and all importance weights will be zero. In general, as soon as one of the classifiers makes a mistake, training stops. So the training procedure can be roughly outlined as:
- Given a training datum $(x, Y) \in X \times \mathcal{P} (K)$,
- For each $k = 1 \ldots |K|$
- Let $\hat y_k$ be the prediction of classifier $k$ of whether label $k$ is in the target set.
- Add $(x, 1_{k \in Y})$ to training set for classifier $k$. Note all non-zero importance weights are 1 so this is just binary classification.
- If $\hat y_k \neq 1_{k \in Y}$, stop iterating over $k$ for this training datum.
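In code the outline above is short; the predict/learn interface of the binary learners here is hypothetical:

def train_csml_01_loss(data, classifiers):
    # data yields (x, Y) with Y the set of correct labels for x;
    # classifiers[k] is a binary learner for "is label k in the set?"
    for x, Y in data:
        for k, clf in enumerate(classifiers):
            target = 1 if k in Y else 0
            y_hat = clf.predict(x)
            clf.learn(x, target)   # all non-zero importance weights are 1
            if y_hat != target:
                break              # a mistake zeroes the remaining importance weights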
Friday, May 6, 2011
Latent Factors and Vowpal Wabbit
There's a nice paper by Menon and Elkan about dyadic prediction with latent features. It cares about the right things: ease of implementation, ability to incorporate multiple sources of information, and scalability. The model can be thought of as a mash-up of multinomial logit and matrix factorization.
It is easiest to describe in the binary case without side information. In this case it is basically logistic regression with dyadic input $x = (r, c)$, \[
\log \frac{p (y = 1 | x, w)}{p (y = 0 | x, w)} = \alpha_r^\top \beta_c,
\] where $\alpha$ is a vector of $k$ latent factors associated with $r$ and $\beta$ is a vector of $k$ latent factors associated with $c$. $r$ and $c$ are identifiers here, e.g., user ids and movie ids, user ids and advertisement ids, etc.
I can almost do this in vowpal wabbit right now. Suppose there are two latent dimensions in the model ($k = 2$). A training datum $(y, (r, c))$ would be encoded via
y |alpha latent_0_r latent_1_r |beta latent_0_c latent_1_c
e.g. for the training datum $(1, (16, 432))$,
1 |alpha latent_0_16 latent_1_16 |beta latent_0_432 latent_1_432
and then choose --quadratic ab and --loss logistic. Unfortunately this does not do the right thing. First, it creates some extra features (it is an outer product, not an inner product). Second, these extra features have their own independent weights, whereas in the model the weight is the product of the individual weights. A possible solution is to add a --dotproduct option to vowpal which would take two namespaces and emulate the features corresponding to their inner product (in this case the order of the features in the input would matter).
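To make the distinction concrete, here is a toy rendering of the two predictions (just an illustration of the weight structure, not how vowpal actually stores or hashes weights):

def quadratic_prediction(pair_weights, r, c, k=2):
    # --quadratic ab style: one independently learned weight per (i, j) pair of features
    return sum(pair_weights[("latent_%d_%s" % (i, r), "latent_%d_%s" % (j, c))]
               for i in range(k) for j in range(k))

def dotproduct_prediction(weights, r, c, k=2):
    # desired --dotproduct style: product of the two individual weights, summed over i
    return sum(weights["latent_%d_%s" % (i, r)] * weights["latent_%d_%s" % (i, c)]
               for i in range(k))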
If you've followed this far, you can probably see how additional features $s (x)$ associated with the dyad can be added in another namespace to augment the model with side information.
y |alpha latent_0_r latent_1_r |beta latent_0_c latent_1_c |side feature...
Similarly it is easy to incorporate side information associated with each component, which would not be placed inside the alpha and beta namespaces to avoid getting picked up by the --dotproduct (in fact, for typical side information associated with the components, --quadratic on the component side information would be reasonable). Note the authors report better results learning the latent model first, fixing the latent weights, and then learning the weights associated with the side information.
For multi-valued data the authors use multinomial logit but I suspect a scoring filter tree with a logistic base learner could get the job done.
Finally, the authors suggest that regularization is necessary for good results. Possibly I can get away with using the ``only do one pass through the training data'' style of regularization.