Let's start with the “talk of the conference”. I mean this in the spirit of Time's “Man of the Year”, i.e., I'm not condoning the content, just noting that it was the most impactful. And of course the winner is ... Ilya Sutskever's talk Sequence to Sequence Learning with Neural Networks. The swagger was jaw-dropping: as introductory material he declared that all supervised vector-to-vector problems are now solved thanks to deep feed-forward neural networks, and then proceeded to declare that all supervised sequence-to-sequence problems are now solved thanks to deep LSTM networks. Everybody had something to say about this talk. On the positive side, the inimitable John Hershey told me over drinks that LSTM has allowed his team to sweep away years of cruft in their speech-cleaning pipeline while getting better results. Others with less charitable interpretations of the talk probably don't want me blogging their intoxicated reactions.
It is fitting that the conference was in Montreal, underscoring that the giants of deep learning have transitioned from exiles to rockstars. As I learned the hard way, you have to show up to the previous talk if you want to get into the room when one of these guys is scheduled at a workshop. Here's an actionable observation: placing all the deep learning posters next to each other in the poster session is a bad idea, as it creates a ridiculous traffic jam. Next year they should be placed at the corners of the poster session, just like staples in a grocery store, to facilitate the exposure of other material.
Now for my personal highlights. First let me point out that the conference is so big now that I can only experience a small part of it, even with the single-track format, so you are getting a biased view. Also let me congratulate Anshu for getting a best paper award. He was an intern at Microsoft this summer and the guy is just super cool.
Distributed Learning
Since this is my day job, I'm of course paranoid that the need for distributed learning is diminishing as individual computing nodes (augmented with GPUs) become increasingly powerful. So I was ready for Jure Leskovec's workshop talk. Here is a killer screenshot. Jure said every grad student in his lab has one of these machines, and that almost every data set of interest fits in RAM. Contemplate that for a moment.
Nonetheless there was some good research in this direction.
- Weizhu Chen, Large Scale L-BFGS using Map-Reduce. Weizhu sits down the corridor from me and says I'm crazy for thinking distributed is dead, so talking to him reduces my anxiety level.
- Virginia Smith (presenting) et al., CoCoA: Communication-Efficient Distributed Dual Coordinate Ascent. Excellent talk, excellent algorithm, excellent analysis. Here's some free career advice: try to be a postdoc in Michael Jordan's lab.
- Inderjit S. Dhillon, NOMAD: A Distributed Framework for Latent Variable Models. I wasn't really joking when I made this poster. However I find Dhillon's approach to managing asynchronicity in the distributed setting to be attractive, as it seems possible to reason about and efficiently debug such a setup.
- McWilliams et al., LOCO: Distributing Ridge Regression with Random Projections. Another excellent algorithm backed by solid analysis. I think there could be good implications for privacy as well.
- Wang et al., Median Selection Subset Aggregation for Parallel Inference. I think of this as “cheaper distributed L1” via a communication-efficient way of combining L1 optimizations performed in parallel (there's a toy sketch of the flavor just after this list).
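To make that flavor concrete, here is a minimal toy sketch of my own, not the authors' procedure: each worker fits an L1-regularized model on its own shard, ships only its coefficient vector, and the coordinator combines them with an elementwise median. The shard count and regularization strength below are made up.

```python
# Toy sketch of "fit L1 in parallel, combine by medians" -- my illustration,
# not the estimator from the paper; all constants here are invented.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d, n_workers = 6000, 50, 6
w_true = np.zeros(d)
w_true[:5] = [3.0, -2.0, 1.5, -1.0, 0.5]            # sparse ground truth
X = rng.standard_normal((n, d))
y = X @ w_true + 0.1 * rng.standard_normal(n)

# each "worker" solves an L1 problem on its own shard; only coef_ is communicated
coefs = [Lasso(alpha=0.05).fit(Xs, ys).coef_
         for Xs, ys in zip(np.array_split(X, n_workers), np.array_split(y, n_workers))]

w_hat = np.median(np.vstack(coefs), axis=0)          # elementwise median across workers
print("recovered support:", np.flatnonzero(np.abs(w_hat) > 1e-3))
```

The point is the communication pattern: one d-dimensional vector per worker and no iterative synchronization, which is what makes it cheap.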
Other Trends
Randomized Methods: I'm really hot for randomized algorithms right now so I was glad to see healthy activity in the space. LOCO (mentioned above) was one highlight. Also very cool was Radagrad, which is a mashup of Adagrad and random projections. Adagrad in practice is implemented via a diagonal approximation (e.g., in vowpal wabbit), but Krummenacher and McWilliams showed that an approximation to the full Adagrad metric can be tractably obtained via random projections. It densifies the data, so perhaps it is not appropriate for text data (and vowpal wabbit focuses on sparse data currently), but the potential for dense data (e.g., vision, speech) and nonlinear models (e.g., neural networks) is promising. There's a toy sketch of the diagonal-versus-sketched contrast after the next paragraph.
Extreme Learning: Clearly someone internalized the most important lesson from deep learning: give your research program a sexy name. Extreme learning sounds like the research area for those who like skateboarding and consuming a steady supply of Red Bull. What it actually means is multiclass and multilabel classification problems where the number of classes is very large. I was pleased that Luke Vilnis' talk on generalized eigenvectors for large multiclass problems was well received. Anshu's best-paper-winning work on approximate maximum inner product search is also relevant to this area.
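Here's that promised sketch. It's a caricature I wrote to illustrate the contrast, not the Radagrad update from the paper: diagonal Adagrad keeps one accumulator per coordinate, while the sketched variant accumulates gradient outer products in a small random subspace and preconditions there. The sketch size, step size, and stand-in gradient are all made up.

```python
# Caricature of diagonal Adagrad vs. a random-projection ("sketched") full-matrix
# accumulator -- illustrative only, not the algorithm from the Radagrad paper.
import numpy as np

rng = np.random.default_rng(1)
d, k, eta, eps = 500, 20, 0.1, 1e-6            # ambient dim, sketch dim, step, damping
Pi = rng.standard_normal((k, d)) / np.sqrt(k)  # fixed random projection

g_diag = np.zeros(d)     # diagonal accumulator: running sum of squared gradients
S = np.zeros((k, k))     # sketched accumulator: running sum of projected outer products
w_diag = np.zeros(d)
w_sketch = np.zeros(d)

def grad(w):             # stand-in gradient of some smooth loss, for illustration only
    return w - 1.0 + 0.01 * rng.standard_normal(d)

for t in range(200):
    g = grad(w_diag)
    # diagonal Adagrad (what vowpal wabbit effectively does): per-coordinate scaling
    g_diag += g * g
    w_diag -= eta * g / np.sqrt(g_diag + eps)

    g = grad(w_sketch)
    # sketched variant: precondition with the inverse square root of the k x k accumulator
    p = Pi @ g
    S += np.outer(p, p)
    vals, vecs = np.linalg.eigh(S + eps * np.eye(k))
    inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
    w_sketch -= eta * (Pi.T @ (inv_sqrt @ p))
```

Note that this toy sketched update only ever moves within the span of Pi's rows; the actual paper is more careful about combining the sketched preconditioner with the raw gradient, so treat this purely as a picture of where the small k-by-k accumulator comes from.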
Discrete Optimization: I'm so clueless about this field that I ran into Jeff Bilmes at baggage claim and asked him to tell me his research interests. However, assuming Ilya is right, the future is in learning problems with more complicated output structures, and this field is pushing in an interesting direction.
Probabilistic Programming: Rob Zinkov didn't present (afaik), but he showed me some sick demos of Hakaru, the probabilistic programming framework his lab is developing.
Facebook Labs: I was happy to see that Facebook Labs is tackling ambitious problems in text understanding, image analysis, and knowledge base construction. They are thinking big ... extreme income inequality might be bad for the long-term stability of western democracy, but it's causing a golden age in AI research.