My previous batch EM approaches can be considered as maximizing the auxiliary function \[
F (\alpha, \beta, \gamma, q) = E_{Z \sim q}[\log L (D | \alpha, \beta, \gamma, Z)] + \log P (\alpha, \beta, \gamma) + E_{Z \sim q}[\log P (Z)] + H (q),
\] where $\alpha$ are worker-indexed parameters, $\beta$ are item-indexed parameters, $\gamma$ are global parameters, $q$ is the joint distribution over all unobserved labels, $Z$ is the set of all unobserved labels, $D$ is the data set of item-worker-label triples, $\log L (D | \alpha, \beta, \gamma, Z)$ is the log-likelihood of the data set, $P (\alpha, \beta, \gamma)$ is the prior distribution over the generative model parameters, $P (Z)$ is the prior unobserved label distribution, and $H (q)$ is the entropy of the unobserved label distribution.
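As a sanity check on this form of the auxiliary function (a standard EM/variational identity, not anything specific to this model): writing the joint over data and labels as $L (D | \alpha, \beta, \gamma, Z) P (Z)$, for fixed $(\alpha, \beta, \gamma)$ we have \[
F (\alpha, \beta, \gamma, q) = \log P (D | \alpha, \beta, \gamma) + \log P (\alpha, \beta, \gamma) - KL \bigl( q (Z) \,\|\, P (Z | D, \alpha, \beta, \gamma) \bigr),
\] so maximizing over $q$ recovers the label posterior, and maximizing the resulting value over $(\alpha, \beta, \gamma)$ is MAP estimation of the parameters.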
The unobserved label distribution is assumed to factor over items, $q (Z) = \prod_i q_i (Z_i)$, as is the prior distribution $P (Z) = \prod_i P_i (Z_i)$. Put differently, only a constrained maximum of the auxiliary function is found, subject to this factorization. The data likelihood is assumed to decompose into independent per-item terms conditioned upon $(\alpha, \beta, Z)$, leading to \[
\begin{split}
F (\alpha, \beta, \gamma, q) &= \\
&\sum_i E_{Z_i \sim q_i} [ \log L (D_i | \alpha, \beta_i, \gamma, Z_i)] + \log P (\alpha, \beta, \gamma) \\
&+ \sum_i E_{Z_i \sim q_i} [ \log P_i (Z_i)] + \sum_i H (q_i),
\end{split}
\] where $i$ indexes items and $D_i$ is the set of data associated with item $i$. Further assuming the prior distribution is of the form $P (\alpha, \beta, \gamma) = P (\alpha, \gamma) \prod_i P (\beta_i)$ and rearranging yields \[
\begin{aligned}
F (\alpha, \beta, \gamma, q) &= \sum_i F_i (\alpha, \beta_i, \gamma, q_i), \\
F_i (\alpha, \beta_i, \gamma, q_i) &= E_{Z_i \sim q_i} [ \log L (D_i | \alpha, \beta_i, \gamma, Z_i)] + \frac{1}{|I|} \log P (\alpha, \gamma) + \log P (\beta_i) \\
&\quad + E_{Z_i \sim q_i} [ \log P_i (Z_i)] + H (q_i),
\end{aligned}
\] where $|I|$ is the total number of items. Now the objective function is a sum of terms in which each $\beta_i$ and $q_i$ appears only once. This indicates that, if the data were streamed in blocks corresponding to the same item and the optimal $\alpha$ and $\gamma$ were already known, each $\beta_i$ and $q_i$ could be individually maximized and then discarded. Of course, the optimal $\alpha$ and $\gamma$ are not known, but hopefully the estimates improve as more data is encountered. That suggests the following procedure (a code sketch of the loop follows the list):
1. Receive a block $D_i$ of item-worker-label triples corresponding to a single item.
2. Maximize $F_i (\alpha, \beta_i, \gamma, q_i)$ with respect to $\beta_i$ and $q_i$.
   - Basically, this runs EM on the block of data with $\alpha$ and $\gamma$ held fixed.
3. Set $\alpha \leftarrow \alpha + \eta_t \nabla_{\alpha} F_i\bigr|_{\alpha, \beta^*_i, \gamma, q^*_i}$ and $\gamma \leftarrow \gamma + \eta_t \nabla_{\gamma} F_i\bigr|_{\alpha, \beta^*_i, \gamma, q^*_i}$.
   - $\eta_t$ is a learning rate which decays over time, e.g., $\eta_t = \eta_0 (\tau_0 + t)^{-\rho}$.
   - $\eta_0 \geq 0$, $\tau_0 \geq 0$, and $\rho \geq 0$ are tuning parameters for the learning algorithm.
   - Effectively $|I|$ is also a tuning parameter which sets the relative importance of the prior.
4. If desired (e.g., ``inference mode''), output $\beta^*_i$ and $q^*_i$.
5. Discard $\beta^*_i$ and $q^*_i$.
6. Return to step 1.
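Here is a minimal sketch of that loop in Python. It assumes a simplified Dawid-Skene-style likelihood in which $\alpha_w$ is a per-worker confusion matrix (stored as logits), the per-item prior $P_i (Z_i)$ is a fixed categorical, and $\beta_i$ and $\gamma$ are dropped so that the inner maximization over $q_i$ reduces to a single closed-form E-step; none of the names below come from nominalonlineextract, and this is not the likelihood it implements, just an illustration of the control flow.

```python
# Sketch of the online loop above, under a simplified Dawid-Skene-style model:
# alpha[w] holds confusion-matrix logits for worker w (rows = true label),
# priorz plays the role of P_i(Z_i), and beta_i / gamma are omitted.
import numpy as np
from collections import defaultdict

K = 5                                          # number of label classes
priorz = np.full(K, 1.0 / K)                   # P_i(Z_i), shared across items
alpha = defaultdict(lambda: np.zeros((K, K)))  # worker id -> confusion logits
eta0, tau0, rho = 1.0, 10000.0, 0.5            # learning-rate schedule
n_items, sigma2 = 9859, 1.0                    # |I| and Gaussian prior variance on alpha

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def e_step(block):
    """Maximize F_i over q_i with alpha fixed; with beta_i dropped this is the
    closed form q_i(z) proportional to P_i(z) * prod_(w,l) P(l | z, alpha_w)."""
    logq = np.log(priorz)
    for worker, label in block:
        logq = logq + np.log(softmax(alpha[worker])[:, label])
    return softmax(logq)

def sgd_step(block, q, t):
    """Ascend grad_alpha F_i evaluated at q_i^*; the (1/|I|) log P(alpha) prior
    term is applied lazily, only to workers appearing in this block."""
    eta = eta0 * (tau0 + t) ** -rho
    for worker, label in block:
        p = softmax(alpha[worker])                  # K x K predicted rating probabilities
        grad = np.zeros_like(p)
        grad[:, label] += q                         # d E_q[log L] / d logits ...
        grad -= q[:, None] * p                      # ... = q(z) * (onehot(l) - p(. | z))
        grad -= alpha[worker] / (sigma2 * n_items)  # Gaussian prior, scaled by 1/|I|
        alpha[worker] += eta * grad

# usage: each block is the list of (worker id, observed label) pairs for one item
for t, block in enumerate([[("w1", 2), ("w7", 2)], [("w1", 0), ("w3", 1)]]):
    q_star = e_step(block)          # fit q_i (and beta_i, if it were present)
    sgd_step(block, q_star, t)      # gradient step on the global parameters
    # q_star could be emitted here ("inference mode"); then it is discarded
```

With $\beta_i$ present, the closed-form e_step would be replaced by a few inner EM iterations alternating between $q_i$ and $\beta_i$, which is what step 2 describes.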
Scalability with respect to the number of workers is a potential issue. This is because $\alpha$ is maintained as state, and it is indexed by worker (e.g., in nominallabelextract, $\alpha_w$ is the confusion matrix for worker $w$). To overcome this I use the hashing trick: I have a fixed finite number of $\alpha$ parameters and I hash the worker id to get the $\alpha$ for that worker. When I get a hash collision, I end up treating two (or more) workers as equivalent, but this allows me to bound the space usage of the algorithm up front. In practice, hashing tricks like this always seem to work out fabulously. In this particular context, in the limit of a very large number of workers I will model every worker with the population confusion matrix. This is a graceful way to degrade as the sample complexity overwhelms the (fixed) model complexity. (I don't actually anticipate having a large number of workers; the way crowdsourcing seems to go is that one does some small tasks to identify high-quality workers and then runs a larger version of the task restricted to those workers.)
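Concretely, the hashing trick amounts to replacing the worker-keyed dictionary in the sketch above with a fixed-size table indexed by a hash of the worker id; the table size and the choice of hash below are illustrative, not what nominalonlineextract does internally.

```python
# Bounding worker state with the hashing trick: a fixed table of confusion
# logits indexed by a hash of the worker id.  Colliding workers share (and
# jointly update) the same alpha; the table size bounds memory up front.
import hashlib
import numpy as np

N_WORKER_SLOTS = 65536                          # fixed up-front bound on alpha state
K = 5
alpha_table = np.zeros((N_WORKER_SLOTS, K, K))

def worker_slot(worker_id: str) -> int:
    """Map an arbitrary worker id to one of N_WORKER_SLOTS parameter rows
    (hashlib is used so the mapping is deterministic across runs)."""
    digest = hashlib.md5(worker_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little") % N_WORKER_SLOTS

alpha_w = alpha_table[worker_slot("worker_12345")]   # a view into the shared table
```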
Here's an example run involving 40 passes over a small test dataset.
```
% time ~/src/nincompoop/nominalonlineextract/src/nominalonlineextract --initial_t 10000 --n_items 9859 --n_labels 5 --priorz 1,1,1,1,1 --model flass --data <(./multicat 40 =(sort -R ethnicity4.noe.in)) --eta 1 --rho 0.5
initial_t = 10000
eta = 1.000000
rho = 0.500000
n_items = 9859
n_labels = 5
n_workers = 65536
symmetric = false
test_only = false
prediction file = (no output)
priorz = 0.199987,0.199987,0.199987,0.199987,0.199987
cumul      since      example  current current current
avg q      last       counter  label   predict ratings
-1.183628  -1.183628        2       -1       0       5
-1.125888  -1.092893        5       -1       0       5
-1.145204  -1.162910       10       -1       0       5
-1.081261  -1.009520       19        0       0       5
-1.124367  -1.173712       36       -1       3       3
-1.083097  -1.039129       69       -1       0       4
-1.037481  -0.988452      134       -1       1       2
-0.929367  -0.820539      263       -1       1       5
-0.820125  -0.709057      520       -1       4       5
-0.738361  -0.653392     1033       -1       1       4
-0.658806  -0.579719     2058       -1       1       5
-0.610473  -0.562028     4107       -1       4       5
-0.566530  -0.522431     8204       -1       0       3
-0.522385  -0.478110    16397       -1       2       4
-0.487094  -0.451771    32782       -1       0       3
-0.460216  -0.433323    65551       -1       4       5
-0.441042  -0.421860   131088       -1       2       5
-0.427205  -0.413365   262161       -1       0       5
-0.420944  -0.408528   394360       -1       1      -1
~/src/nincompoop/nominalonlineextract/src/nominalonlineextract --initial_t   85.77s user 0.22s system 99% cpu 1:26.41 total
```

If that output format looks familiar, it's because I've jacked vowpal wabbit's output style (again). The first column is the progressive validated auxiliary function, i.e., the (averaged over items) $F_i$ function evaluated prior to updating the model parameters ($\alpha$ and $\gamma$). It is akin to a log-likelihood and if everything is working well it should get bigger as more data is consumed.
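For concreteness, the bookkeeping behind those first three columns looks roughly like the following, with `eval_F_i` and `update` standing in for the actual model code and a doubling report schedule approximating the one above.

```python
# Progressive validation: score each block with the current parameters
# *before* updating on it, keep running averages, and report at roughly
# doubling example counts.  eval_F_i and update are placeholders here.
def progressive_report(blocks, eval_F_i, update):
    cumul_sum = since_sum = 0.0
    cumul_n = since_n = 0
    next_report = 2
    for block in blocks:
        f_i = eval_F_i(block)        # evaluate F_i prior to the parameter update ...
        update(block)                # ... then take the gradient step on alpha, gamma
        cumul_sum += f_i
        since_sum += f_i
        cumul_n += 1
        since_n += 1
        if cumul_n >= next_report:
            print(f"{cumul_sum / cumul_n:.6f} {since_sum / since_n:.6f} {cumul_n:9d}")
            since_sum, since_n = 0.0, 0
            next_report *= 2
```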
nominallabelextract, the implementation of the batch EM analog to the above, converges in about 90 seconds on this dataset, so the run times are a dead heat. For larger datasets there is less need for so many passes over the data, so I would expect the online version to become increasingly advantageous. Furthermore, I've been improving the performance of nominallabelextract for several months, whereas I just wrote nominalonlineextract, so there might be additional speed improvements in the latter. Nonetheless, it appears that for datasets that fit into memory, batch EM is competitive.
nominalonlineextract is available from the nincompoop code repository on Google Code. I'll be putting together online versions of the other algorithms in the near term (the basic approach holds for all of them, but there are different tricks for each specific likelihood).