Happy New Year! My New Year's resolution is to be less afraid of non-convex optimization. Statistically there is a high likelihood that I will return to only optimizing convex losses by February :).
But here's a fun paper along these lines in the meantime: Deep Fried Convnets. The idea is to replace the fully connected final layers of a deep convolutional neural network with a fast kernel approximation. Gradients can be computed for the kernel approximation and passed through to the lower convolutional layers, so the entire architecture can be trained end-to-end using SGD, including fun tricks like dropout on the kernel approximation.
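To make that concrete, here is a rough sketch of an adaptive-Fastfood-style layer (my own illustration in PyTorch, not the authors' code): the dense weight matrix of a fully connected layer is replaced by the structured product S H G Π H B, where H is a fast Walsh-Hadamard transform, Π is a fixed random permutation, and the diagonals S, G, and B are ordinary learnable parameters. That gives O(d) parameters and O(d log d) time instead of O(d²), and since the transform is built from differentiable ops, gradients pass through it like any other layer.

```python
# A sketch only: assumes PyTorch, not the paper's implementation.
import torch
import torch.nn as nn


def hadamard_transform(x):
    # Differentiable fast Walsh-Hadamard transform along the last dimension,
    # built from reshapes and adds/subtracts so autograd handles it for free.
    # Assumes the last dimension is a power of two; unnormalized.
    *lead, d = x.shape
    x = x.reshape(-1, d)
    h = 1
    while h < d:
        x = x.view(-1, d // (2 * h), 2, h)
        a, b = x[:, :, 0, :], x[:, :, 1, :]
        x = torch.stack((a + b, a - b), dim=2).view(-1, d)
        h *= 2
    return x.view(*lead, d)


class AdaptiveFastfood(nn.Module):
    # Replaces a d-by-d dense matrix with the structured product S H G Pi H B:
    # O(d) parameters and O(d log d) time instead of O(d^2), with the diagonals
    # S, G, B trained by SGD along with the rest of the network.
    def __init__(self, d):
        super().__init__()
        assert d & (d - 1) == 0, "d must be a power of two"
        self.d = d
        self.B = nn.Parameter(torch.randint(0, 2, (d,)).float() * 2 - 1)  # random signs
        self.G = nn.Parameter(torch.randn(d))                             # Gaussian diagonal
        self.S = nn.Parameter(torch.randn(d).abs())                       # crude scale init
        self.register_buffer("perm", torch.randperm(d))                   # fixed permutation Pi

    def forward(self, x):                 # x: (batch, d) -> (batch, d)
        x = hadamard_transform(x * self.B)
        x = x[:, self.perm] * self.G
        x = hadamard_transform(x)
        return self.S * x / self.d        # 1/d offsets the two unnormalized Hadamards
```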
Alex Smola is a smart guy, and I think he gets the lessons from the recent success of deep learning. In fact it seems we have to re-learn these lessons every decade or so, namely that end-to-end training of a non-convex architecture can yield superior results and that SGD is extremely versatile. I see Deep Fried Convnets along the same lines as John Hershey's Deep Unfolding ideas for neural networks: one starts with a model (e.g., a kernel machine), creates a parameterized approximation to the model (e.g., Fastfood), and then optimizes the approximation (non-convexly) end-to-end using SGD.
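Purely for illustration, here is what that recipe looks like as a training step: the Fastfood-style layer sketched above stands in for a dense layer on top of a toy convolutional stack, dropout sits on the kernel approximation, and plain SGD updates everything jointly, from the classifier down through the convolutional filters. The architecture and hyperparameters here are made up for the example.

```python
# Illustrative only: a made-up convnet whose dense layer is replaced by the
# AdaptiveFastfood sketch above, trained end-to-end with SGD, with dropout
# applied on top of the kernel approximation.
import torch
import torch.nn as nn


class DeepFriedNet(nn.Module):
    def __init__(self, num_classes=10, d=1024):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
        )                                    # 64 channels * 4 * 4 = 1024 = d
        self.fastfood = AdaptiveFastfood(d)  # stands in for a dense layer
        self.dropout = nn.Dropout(0.5)       # dropout on the kernel approximation
        self.classifier = nn.Linear(d, num_classes)

    def forward(self, x):
        x = self.features(x).flatten(1)
        x = self.dropout(torch.relu(self.fastfood(x)))
        return self.classifier(x)


model = DeepFriedNet()
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
loss = nn.functional.cross_entropy(model(x), y)
opt.zero_grad()
loss.backward()   # gradients flow through the kernel approximation...
opt.step()        # ...and update the convolutional layers beneath it too
```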