The Approach
My approach is based upon DiscLDA (and its cousin Labeled LDA). DiscLDA modifies the original LDA such that a document class label $y_d$ can remap the original document topics $z$ into transformed topics $u$, which then control the word emission probabilities. Here's the plate representation from the DiscLDA paper, and for those who think algebraically, here's a description of the generative model.
- Draw topic proportions $\theta | \alpha \sim \mbox{Dir} (\alpha)$.
- Draw word emission vectors $\phi | \beta \sim \mbox{Dir} (\beta)$.
- Draw class label $y | \pi \sim p (y | \pi)$ from some prior distribution.
- For each word
- Draw topic assignment $z_n | \theta \sim \mbox{Mult} (\theta)$.
- Draw transformed topic assignment $u_n | z_n, T, y \sim \mbox{Mult} (T^y_{\cdot, z_n})$.
- Draw word $w_n | u_n, \phi_{1:K} \sim \mbox{Mult} (\phi_{u_n})$.
Here the transformation matrices $T^y$ are given by \[
T^y = \begin{pmatrix}
0 & I_{k_0} 1_{y = 1} \\
\vdots & \vdots \\
0 & I_{k_0} 1_{y = |Y|} \\
I_{k_1} & 0
\end{pmatrix}.
\] Note that there are $(k_0 + k_1)$ possible $z$ and $(|Y| k_0 + k_1)$ possible $u$. Furthermore a single $z_i$ will either be associated with a single $u$ independent of $y$ or a set of $|Y|$ possible $u$ which are indexed by $y$. When implementing I found it convenient to think in terms of a mapping \[
u^y_z = \begin{cases} z & z < k_1 \\ z + k_0 (y - 1) & z \geq k_1 \end{cases},
\] which mechanically suggests that changing $y$ causes a set of $k_0$ topics to be ``switched out''.
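To make the index bookkeeping concrete, here is a minimal sketch of that mapping in Python; the post contains no code, so the function and argument names here are mine.

```python
def transformed_topic(z, y, k0, k1):
    """Map a latent topic z and a (1-based) document label y to a transformed topic u.

    Topics z in [0, k1) are shared across labels and map to themselves; topics
    z in [k1, k1 + k0) are label-specific and are shifted into the block of k0
    transformed topics reserved for label y, matching the y - 1 offset above.
    """
    if z < k1:
        return z
    return z + k0 * (y - 1)
```

For example, with $|Y| = 2$, $k_0 = 10$, and $k_1 = 5$, the topic $z = 3$ maps to $u = 3$ regardless of the label, while $z = 7$ maps to $u = 7$ under $y = 1$ and $u = 17$ under $y = 2$.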
Although online variational Bayes is currently king of the hill in terms of learning algorithm speed and scalability, I found Gibbs sampling relatively easy to reason about and implement so I went with that. Assuming the label $y_{d_i}$ for document $d_i$ is observed, the Gibbs sampler is essentially the same as for unsupervised LDA, \[
p (z_i = j | w, z_{-i}, y) \propto \frac{n^{(w_i)}_{-i,u^{y_{d_i}}_j} + \beta}{n^{(\cdot)}_{-i, u^{y_{d_i}}_j} + W \beta} \frac{n^{(d_i)}_{-i,u^{y_{d_i}}_j} + \alpha}{n^{(d_i)}_{-i,\cdot} + (k_0 + k_1) \alpha}.
\] If sampling is conceptualized as being done on the $u$ variables, this is the result from Labeled LDA in the single-label case: the Gibbs sampler is the same as in original LDA, except that only topics feasible given the document label are allowed.
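To make the counts concrete, here is a rough sketch of that update for a single token; it is not the post's actual implementation, the array names are mine, the count arrays are assumed to already exclude token $i$, and it reuses the `transformed_topic` helper sketched above.

```python
import numpy as np

def resample_z(w_i, d_i, y_d, n_wu, n_u, n_dz, alpha, beta, k0, k1, W):
    """Collapsed Gibbs draw of z_i for one token whose document label y_d is observed.

    The count arrays already exclude token i (the "-i" counts):
    n_wu[w, u] -- times word w was assigned transformed topic u
    n_u[u]     -- times any word was assigned transformed topic u
    n_dz[d, z] -- times a token in document d was assigned latent topic z
    """
    num_topics = k0 + k1
    probs = np.empty(num_topics)
    for j in range(num_topics):
        u = transformed_topic(j, y_d, k0, k1)
        word_term = (n_wu[w_i, u] + beta) / (n_u[u] + W * beta)
        doc_term = (n_dz[d_i, j] + alpha) / (n_dz[d_i].sum() + num_topics * alpha)
        probs[j] = word_term * doc_term
    probs /= probs.sum()
    return np.random.choice(num_topics, p=probs)
```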
If $y_{d_i}$ is unobserved, I could integrate it out. However, for ease of implementation, I decided to Gibbs sample it as well, \[
p (y_d = k | w, z, y_{-d}) \propto p (w_d | w_{-d}, y_d = k, z, y_{-d}) p (y_d = k | z, y_{-d}).
The first term is the likelihood of the word sequence of document $d$, given the proposed label $y_d = k$, the topic assignments for all the documents, and the labels and words of the other documents. This is given by \[
p (w_d | w_{-d}, y_d = k, z, y_{-d}) \propto \prod_{\{i | d_i = d, z_i \geq k_1\}} \frac{n^{(w_i)}_{-d, u^k_{z_i}} + \beta}{n^{(\cdot)}_{-d, u^k_{z_i}} + W \beta}.
\] That reads: for each word in document $d$, count how many times it was assigned to topic $u^k_{z_i}$ in all other documents ($n^{(w_i)}_{-d, u^k_{z_i}}$), count how many times any word was assigned to topic $u^k_{z_i}$ in all other documents ($n^{(\cdot)}_{-d, u^k_{z_i}}$), and apply Dirichlet-multinomial conjugacy. Basically, in terms of the $u$ variables: how well does each set of per-label topics explain the words in the document? Note that common topics contribute the same factor to the likelihood for every label, so as an optimization the product can be taken over label-specific topics only. In particular, if the document only uses common topics, there is no preference for any label in this term.
The second term, $p (y_d = k | z, y_{-d}) = p (y_d = k | y_{-d})$, is the prior (to observing the document text) distribution over the document label. In an online setting, having the prior distribution over document labels depend upon previously observed document labels would be reasonable, but in a batch Gibbs sampler it is sufficient to ask the user to specify a fixed multinomial distribution over the class labels, i.e., $p (y_d = k | y_{-d}) = p_k$.
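Putting the likelihood term and the fixed prior together, a sketch of the label-resampling step might look like the following; again the names are mine, the count arrays are assumed to already exclude document $d$'s tokens, and it reuses `transformed_topic` and the numpy import from the earlier sketches. Working in log space just avoids underflow on long documents.

```python
def resample_label(doc_words, doc_z, n_wu, n_u, label_prior, k0, k1, W, beta):
    """Gibbs draw of y_d for a document whose label is unobserved.

    doc_words, doc_z -- the words in document d and their current latent topics
    n_wu, n_u        -- word/topic counts with document d's tokens removed ("-d")
    label_prior      -- the fixed multinomial (p_1, ..., p_|Y|) over labels
    """
    num_labels = len(label_prior)
    log_post = np.log(np.asarray(label_prior, dtype=float))
    for k in range(1, num_labels + 1):
        for w, z in zip(doc_words, doc_z):
            if z < k1:
                continue  # common topics contribute the same factor for every label
            u = transformed_topic(z, k, k0, k1)
            log_post[k - 1] += np.log((n_wu[w, u] + beta) / (n_u[u] + W * beta))
    post = np.exp(log_post - log_post.max())  # normalize in log space, then exponentiate
    post /= post.sum()
    return 1 + np.random.choice(num_labels, p=post)  # labels are 1-based
```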
Does it Work?
The original motivation was to extract features to build a better classifier, so the best way to formulate this question is: ``does using semi-supervised LDA result in features that are better for classification than either 1) the features obtained by running unsupervised LDA on all the data, or 2) the features obtained by running supervised LDA on the labeled subset of the data?'' I'm still working on answering that question.
However, there is another sense in which the technique might be ``working''. Since unlabeled data vastly outnumbers labeled data, one fear I had was that the semi-supervised approach would essentially degenerate into the unsupervised approach. I have some data that suggests this is not the case.
Here's the setup: I have circa 2 million ``documents'', where each document is the set of people that a particular Twitter profile is following. I have labeled roughly 1% of these documents as either male or female based upon Mechanical Turk results. I've got an initial Gibbs sampler implementation as outlined above, which I've run while tracking two statistics: the average log probability of each sampled topic, and the average log probability of each observed document label. Generally I would expect the topic probability to improve over time as the system learns coherent sets of words. However, what happens to the label probability? Possibly, observed labels could become less likely over time if the pressure to explain the document text overwhelms the pressure to conform to the observed labels. I was especially worried about this since I took the shortcut of Gibbs sampling the unobserved labels rather than integrating them out.
Here is what a typical run looks like on my data. (Sadly it takes half a day on my laptop to generate this output.)
On the first pass through the data, labels are assigned essentially randomly to the unlabeled data and those documents seem to cancel each other out; consequently, the observed labels are aggressively fit (average $\log(p) \approx -0.02$, which is like $p \approx 98\%$). The average label log probability is only computed on observed labels; so at this point the topics aren't very good at predicting the tokens in the documents, but they are very good at predicting the observed labels. As the sampler initially proceeds, topic probability increases but label probability decreases. When I was watching the output of a large run for the first time, I figured I was screwed and eventually the labels would get completely washed out. However, eventually the sampler starts explaining both the tokens and the observed labels fairly well. After 100 passes here I ended up with the observed labels still well explained, with average $\log (p) \approx -0.08$, which is like $p \approx 92\%$. While the unsupervised data has degraded the model's explanatory power with respect to the observed labels, this might turn out to be a good thing in terms of generalization (i.e., ``built-in regularization'').
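For reference, the conversions from average log probability to the quoted percentages are just exponentiation; a quick sanity check:

```python
import math

print(math.exp(-0.02))  # ~0.980, i.e. roughly 98%
print(math.exp(-0.08))  # ~0.923, i.e. roughly 92%
```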
So the next step is to compare this semi-supervised LDA with unsupervised LDA and supervised LDA in the context of my original classification problem to see if there is any point to all of this beyond my own amusement.
An alternative approach with similar motivations but perhaps a more efficient implementation is using semi-supervised Searn with latent structures = topics. I don't have any empirical evidence comparing these kinds of approaches yet.