The keynote talks were all excellent, consistent with the integrative “big picture” heritage of the conference. My favorite was by Daphne Koller, who talked about the “other online learning”, i.e., pedagogy via telecommunications. Analogous to how moving conversations online allows us to precisely characterize the popularity of Snooki, moving instruction online facilitates the use of machine learning to improve human learning. Based upon the general internet arc from early infovore dominance to mature limbic-stimulating pablum, it's clear the ultimate application of the Coursera platform will be around courtship techniques, but in the interim a great number of people will experience more substantial benefits.
As far as overall themes go, I didn't detect any emergent technologies, unlike previous years when things like deep learning, randomized methods, and spectral learning experienced a surge. Intellectually the conference felt like a consolidation phase, as if the breakthroughs of previous years were still being digested. However, output representation learning and extreme classification (large cardinality multiclass or multilabel learning) represent interesting new frontiers, and hopefully next year will bring further progress in these areas.
There were several papers about improving the convergence of stochastic gradient descent which appeared broadly similar from a theoretical standpoint (Johnson and Zhang; Wang et al.; Zhang et al.). I like the control variate interpretation of Wang et al. best for building intuition, but if you want to implement something, Figure 1 of Johnson and Zhang has intelligible pseudocode.
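For the curious, here's a minimal sketch of the variance-reduced update from Johnson and Zhang's Figure 1, specialized to least squares; the step size, epoch length, and toy objective are my illustrative choices, not from the paper.

```python
import numpy as np

def svrg(X, y, eta=0.05, epochs=20, m=None, seed=0):
    """Variance-reduced SGD for least squares, after Figure 1 of Johnson & Zhang."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = m or 2 * n                                    # inner-loop length
    g = lambda w, i: (X[i].dot(w) - y[i]) * X[i]      # per-example gradient
    w_snap = np.zeros(d)
    for _ in range(epochs):
        mu = X.T.dot(X.dot(w_snap) - y) / n           # full gradient at the snapshot
        w = w_snap.copy()
        for _ in range(m):
            i = rng.integers(n)
            # recenter the stochastic gradient with the snapshot gradient;
            # the correction's variance shrinks as w approaches the optimum
            w -= eta * (g(w, i) - g(w_snap, i) + mu)
        w_snap = w                                    # last iterate becomes the new snapshot
    return w_snap

# toy usage
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
y = X.dot(w_true) + 0.01 * rng.normal(size=200)
print(np.linalg.norm(svrg(X, y) - w_true))            # should be small
```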
Covariance matrices were hot, and not just for PCA. The BIG & QUIC algorithm of Hsieh et al. for estimating large sparse inverse covariance matrices was technically very impressive and should prove useful for causal modeling of biological and neurological systems (presumably some hedge funds will also take interest). Bartz and Müller had some interesting ideas regarding shrinkage estimators, including the “orthogonal complement” idea that the top eigenspace should not be shrunk since the sample estimate is actually quite good.
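To make the geometry concrete, here's a hedged sketch of shrinking only the orthogonal complement of the top eigenspace; the subspace dimension k and the shrinkage intensity rho are illustrative placeholders (in practice you'd choose them analytically or by cross-validation), not Bartz and Müller's actual estimators.

```python
import numpy as np

def oc_shrinkage(X, k=1, rho=0.5):
    """Shrink the sample covariance only on the complement of its top-k eigenspace."""
    n, d = X.shape
    S = np.cov(X, rowvar=False)               # sample covariance
    evals, evecs = np.linalg.eigh(S)          # eigenvalues in ascending order
    U = evecs[:, -k:]                         # top-k eigenspace: left unshrunk
    P = np.eye(d) - U.dot(U.T)                # projector onto the complement
    S_perp = P.dot(S).dot(P)
    nu = np.trace(S_perp) / (d - k)           # average eigenvalue on the complement
    # Ledoit-Wolf-style convex combination, restricted to the complement
    S_perp_shrunk = (1 - rho) * S_perp + rho * nu * P
    top = U.dot(np.diag(evals[-k:])).dot(U.T)
    return top + S_perp_shrunk
```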
An interesting work in randomized methods was from McWilliams et al., in which two random feature maps are aligned with CCA over unlabeled data to extract the “useful” random features. This is a straightforward and computationally inexpensive way to leverage unlabeled data in a semi-supervised setup, and it is consistent with theoretical results on CCA regression. I'm looking forward to trying it out.
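Here's a rough sketch of the recipe as I understand it: build two independent random Fourier feature maps, align them with CCA on plentiful unlabeled data, then regress on the aligned features using the scarcer labeled data. The map widths, bandwidth, and component count are my assumptions, not the paper's exact setup.

```python
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.linear_model import Ridge

def rff(X, W, b):
    # random Fourier features approximating an RBF kernel
    return np.sqrt(2.0 / W.shape[1]) * np.cos(X.dot(W) + b)

rng = np.random.default_rng(0)
d, D, k = 10, 300, 20                     # input dim, features per map, CCA components

# two independent random feature maps (the two "views")
W1, W2 = rng.normal(size=(d, D)), rng.normal(size=(d, D))
b1, b2 = rng.uniform(0, 2 * np.pi, D), rng.uniform(0, 2 * np.pi, D)

# align the two views with CCA on plentiful unlabeled data
X_unlab = rng.normal(size=(5000, d))
cca = CCA(n_components=k).fit(rff(X_unlab, W1, b1), rff(X_unlab, W2, b2))

# regress on the aligned ("useful") features using scarce labeled data
X_lab = rng.normal(size=(100, d))
y = np.sin(X_lab).sum(axis=1)
Z, _ = cca.transform(rff(X_lab, W1, b1), rff(X_lab, W2, b2))
model = Ridge(alpha=1.0).fit(Z, y)
```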
The workshops were great, although as usual there were so many interesting things going on simultaneously that it made for difficult choices. I bounced between extreme classification, randomized methods, and big learning the first day. Michael Jordan's talk in big learning was excellent, particularly the part juxtaposing the decreasing computational complexity of various optimization relaxations with their increasing statistical risk (both effects due to the expansion of the feasible set). This is starting to get at the tradeoff between data and computation resources. Extreme classification is an exciting open area which is important (e.g., for structured prediction problems that arise in NLP) and appears tractable in the near-term. Two relevant conference papers were Frome et al. (which leverages word2vec to reduce extreme classification to regression with nearest-neighbor decode) and Cisse et al. (which exploits the near-disconnected nature of the label graph often encountered in practice with large-scale multi-label problems).
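To illustrate the flavor of the Frome et al. reduction, here's a toy sketch that regresses inputs onto label embeddings and decodes by nearest neighbor; the randomly generated label_emb is a stand-in for pretrained word2vec vectors, and the synthetic data is rigged so the reduction has something to recover.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, d, L, e = 1000, 50, 10000, 64          # examples, input dim, labels, embedding dim

# stand-in for pretrained word2vec label embeddings (one row per label)
label_emb = rng.normal(size=(L, e))
label_emb /= np.linalg.norm(label_emb, axis=1, keepdims=True)

# synthetic data: inputs are noisy linear images of their label's embedding
y = rng.integers(L, size=n)
A = rng.normal(size=(e, d)) / np.sqrt(e)
X = label_emb[y].dot(A) + 0.1 * rng.normal(size=(n, d))

# train: multivariate ridge regression from inputs to label embeddings
model = Ridge(alpha=1.0).fit(X, label_emb[y])

def predict(X_new):
    # decode by cosine similarity against all label embeddings
    Z = model.predict(X_new)
    Z /= np.linalg.norm(Z, axis=1, keepdims=True)
    return Z.dot(label_emb.T).argmax(axis=1)

print((predict(X) == y).mean())           # train accuracy of the toy reduction
```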
The second day I mostly hung out in spectral learning, but I saw Blei's talk in topic modeling. Spectral learning had a fun discussion session. The three interesting questions were:
- Why aren't spectral techniques more widely used?
- How can spectral methods be made more broadly and easily applicable, analogous to variational Bayes or MCMC for posterior inference?
- What are the consequences of model mis-specification, and how can spectral methods be made more robust to model mis-specification?