Multi-label classification is interesting because it is a gateway drug to structured prediction. While it is possible to think about multi-label as multi-class over the power set of labels, that reduction yields 2^L classes for L labels, so it falls apart quickly unless the number of labels is small or the number of active labels per instance is limited. The structured prediction viewpoint is that multi-label inference involves a set of binary predictions subject to a joint loss, which satisfies the haiku definition of structured prediction.
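To make the blow-up and the "joint loss" point concrete, here's a toy snippet (the numbers are purely illustrative):

```python
import numpy as np

# With L binary labels, the power-set reduction yields 2**L classes:
L = 20
print(2 ** L)  # 1048576 "classes" for a mere 20 labels

# The structured prediction view instead scores a vector of binary
# predictions against a joint loss, e.g., subset 0/1 loss vs. Hamming loss:
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0])
subset_zero_one = float(np.any(y_true != y_pred))  # 1.0: all-or-nothing
hamming = float(np.mean(y_true != y_pred))         # 0.2: per-label average
print(subset_zero_one, hamming)
```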
Nikos and I independently discovered what Reed and Hollmén state eloquently in a recent paper:
Competitive methods for multi-label data typically invest in learning labels together. To do so in a beneficial way, analysis of label dependence is often seen as a fundamental step, separate and prior to constructing a classifier. Some methods invest up to hundreds of times more computational effort in building dependency models, than training the final classifier itself. We extend some recent discussion in the literature and provide a deeper analysis, namely, developing the view that label dependence is often introduced by an inadequate base classifier ...

Reed and Hollmén use neural network style nonlinearities, while Nikos and I use a combination of randomized embeddings and randomized kernel approximations, but our conclusion is similar: given a flexible and well-regularized generic nonlinearity, label dependencies can be directly modeled when constructing the classifier; furthermore, this is both computationally and statistically more efficient than current state-of-the-art approaches.
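Here is a minimal sketch of the flavor of the randomized approach, not our actual pipeline: it assumes scikit-learn's RBFSampler (random Fourier features, à la Rahimi and Recht) as the randomized kernel approximation, and independent regularized logistic regressions as the deliberately simple final classifier:

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import hamming_loss

# Synthetic multi-label data: 1000 instances, 20 features, 5 labels.
X, Y = make_multilabel_classification(n_samples=1000, n_features=20,
                                      n_classes=5, n_labels=2,
                                      random_state=42)

# Randomized approximation of an RBF kernel feature map.
rff = RBFSampler(gamma=0.1, n_components=500, random_state=42)
Z = rff.fit_transform(X)

# A well-regularized linear model per label on top of the random features;
# the generic nonlinearity does the work, with no separate, prior
# label-dependency modeling step.
clf = OneVsRestClassifier(LogisticRegression(C=1.0, max_iter=1000))
clf.fit(Z, Y)
print("train Hamming loss:", hamming_loss(Y, clf.predict(Z)))
```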
The use of neural network style nonlinearities is extremely reasonable for this setting, imho. Advancing the successes of deep learning into structured prediction is currently a hot topic of research, and it is partly tricky because it is unclear how to render an arbitrary structured prediction problem onto a structure amenable to (SGD) optimization (cf. LSTMs for sequential inference tasks). Fortunately, although multi-label has a structured prediction interpretation, existing deep architectures for multi-class require only slight modifications to apply to multi-label. (“Then why are you using randomized methods?”, asks the reader. The answer is that randomized methods distribute very well and I work in a Cloud Computing laboratory.)
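To see how slight the modification is, here is a sketch in PyTorch (chosen purely for illustration; any framework works the same way): keep the architecture, but swap the softmax cross-entropy head for an independent sigmoid per label, trained jointly.

```python
import torch
import torch.nn as nn

n_features, n_hidden, n_labels = 100, 64, 10
net = nn.Sequential(nn.Linear(n_features, n_hidden),
                    nn.ReLU(),
                    nn.Linear(n_hidden, n_labels))  # one logit per label

# Multi-class: softmax over mutually exclusive classes.
# loss_fn = nn.CrossEntropyLoss()   # targets: class indices

# Multi-label: independent sigmoid per label, trained jointly.
loss_fn = nn.BCEWithLogitsLoss()    # targets: 0/1 indicator vectors

x = torch.randn(32, n_features)
y = torch.randint(0, 2, (32, n_labels)).float()
loss = loss_fn(net(x), y)
loss.backward()

# Inference: threshold per-label probabilities instead of taking an argmax.
predictions = (torch.sigmoid(net(x)) > 0.5).float()
```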