Raykar et al. note that a classifier trained on the crowdsourced data will ultimately agree or disagree with particular crowdsourced labels. It would be nice to use this to inform the model of each worker's likely errors, but in the sequential procedure I've been using so far, there is no possibility of this: first the ground truth is estimated, then the classifier is estimated. Consequently they propose to jointly estimate ground truth and the classifier to allow each to inform the other.
At this point let me offer some plate diagrams to help elucidate.
This is a plate diagram corresponding to the generative models I've been using so far. An unobserved ground truth label $z$ combines with a per-worker model parametrized by vector $\alpha$ and scalar item difficulty $\beta$ to create an observed worker label $l$ for an item. $\mu$, $\rho$, and $p$ are hyperprior parameters for the prior distributions of $\alpha$, $\beta$, and $z$ respectively. Depending upon the problem (multiclass, ordinal multiclass, or multilabel), the details of how $z$, $\alpha$, and $\beta$ produce a distribution over $l$ change, but the general structure is given by the above diagram.
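The exact emission distributions are problem-dependent, but to fix ideas, here is a minimal generative sketch for the multiclass case. The functional forms (sigmoid accuracy, exponential difficulty, scalar $\alpha$) are my own illustrative assumptions, not necessarily the ones used elsewhere in this post.

```python
import numpy as np

rng = np.random.default_rng(8675309)
n_labels = 3

# Hyperprior parameters; names follow the diagram, but the distributions
# below are assumptions chosen purely for illustration.
p = np.full(n_labels, 1.0 / n_labels)  # prior over the true label z
mu, rho = 2.0, 1.0

def sample_crowdsourced_item(n_workers):
    z = int(rng.choice(n_labels, p=p))   # unobserved ground truth
    beta = rng.exponential(rho)          # scalar item difficulty
    labels = []
    for _ in range(n_workers):
        alpha = rng.normal(mu, 1.0)      # per-worker skill (a scalar here for brevity)
        # A worker labels correctly with probability increasing in alpha
        # and decreasing in beta; otherwise errs uniformly at random.
        p_correct = 1.0 / (1.0 + np.exp(beta - alpha))
        if rng.random() < p_correct:
            labels.append(z)
        else:
            labels.append(int(rng.choice([j for j in range(n_labels) if j != z])))
    return z, labels

z, labels = sample_crowdsourced_item(4)
```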
Raykar et al. extend the generative model to allow for observed item features.
The diagram supposes that item features $\psi$ and worker labels $l$ are emitted conditionally independently given the true label $z$. This sounds bogus, since presumably the item features drive the worker directly, or at least indirectly via the scalar difficulty, unless perhaps the item features are completely inaccessible to the crowdsource worker. It might be a reasonable next step to try to enrich the above diagram to address this concern, but the truth is all generative models are convenient fictions, so I'm using the above for now. Raykar et al. provide a batch EM algorithm for the joint estimation, but the above fits nicely into the online algorithm I've been using.
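One step worth making explicit: under that conditional independence assumption, the posterior over the true label factors into the classifier output and the usual worker likelihoods,
\[
P (z | \psi, \{ (w_i, l_i) \}) \propto P (z | \psi) \prod_i P (l_i | z, \alpha_{w_i}, \beta),
\]
which is exactly why the classifier output can be dropped in as the prior over $z$ in the online procedure below.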
Here's the online procedure, for each input pair $(\psi, \{ (w_i, l_i) \})$.
- Using the item features $\psi$, interrogate a classifier trained using a proper scoring rule, and interpret the output as $P (z | \psi)$.
- Use $P (z | \psi)$ as the prior distribution for $z$ in the online algorithms previously discussed for processing the set of crowdsourced labels $\{ (w_i, l_i) \}$. This produces result $P (z | \psi, \{ (w_i, l_i ) \})$.
- Update the classifier using SGD on the expected prior scoring rule loss against distribution $P (z | \psi, \{ (w_i, l_i ) \})$. For instance, with log loss (multiclass logistic regression) the objective function is the cross-entropy, \[
\sum_j P (z = j | \psi, \{ (w_i, l_i) \}) \log P (z = j | \psi).
\]
Note that if you observe ground truth $\tilde z$ for a particular instance, the worker model is updated using $P (z = j | \psi) = 1_{j = \tilde z}$ as the prior distribution, and the classifier is updated using $P (z = j | \psi, \{ (w_i, l_i) \}) = 1_{j = \tilde z}$ as the target posterior. In this case the classifier update is the same as ``vanilla'' logistic regression, so this can be considered a generalization of logistic regression to crowdsourced data.
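In code, the whole per-example update looks something like the following sketch. The `classifier` and `worker_model` objects and their methods are hypothetical stand-ins for the pieces described above; this is not playerpiano's actual code.

```python
import numpy as np

def process_example(classifier, worker_model, psi, ratings, true_label=None):
    """One online update.  psi is the item feature vector; ratings is the
    list of (worker, label) pairs for this item."""
    # Step 1: interrogate the classifier; interpret its output as P(z | psi).
    prior = classifier.predict_proba(psi)

    if true_label is None:
        # Step 2: fold in the crowdsourced labels to get
        # P(z | psi, {(w_i, l_i)}), updating the worker model as it goes.
        posterior = worker_model.update(prior, ratings)
    else:
        # Observed ground truth: both models see a point mass at true_label.
        posterior = np.zeros_like(prior)
        posterior[true_label] = 1.0
        worker_model.update(posterior, ratings)

    # Step 3: SGD step on the cross-entropy of the classifier output
    # against the posterior; with a point-mass posterior this reduces to
    # vanilla logistic regression.
    classifier.sgd_step(psi, target=posterior)
    return posterior
```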
I always add the constant item feature to each input. Thus in the case where there are no item features, the algorithm is the same as before, except that it is learning the prior distribution over $z$. Great, that's one less thing to specify. In the case where there are item features, however, things get more interesting. If there is a feature which is strongly indicative of the ground truth (e.g., lang=es on a Twitter profile being strongly indicative of Hispanic ethnicity), the model can potentially identify accurate workers who happened to disagree with their peers on every item they labeled, provided those workers agree with other workers on items which share some dispositive features. This might occur if a worker happens to get unlucky and colocate on several tasks with multiple inaccurate workers. The payoff really comes when those multiple inaccurate workers have their influence reduced on other, more ambiguous items.
Here is a real-life example. The task is prediction of the gender of a Twitter profile. Mechanical Turk workers are asked to visit a particular profile and then choose a gender: male, female, or neither. ``Neither'' is mostly intended for the Twitter accounts of organizations like the Los Angeles Dodgers, not necessarily RuPaul. The item features are whatever can be obtained via GET users/lookup (note that all of these features are readily apparent to the Mechanical Turk worker). Training examples end up looking like
```
A26E8CJMP5S4WN:2,A8H56XB9K7DB5:2,AU9LVYE38Q6S2:2,AHGJTOTIPCL8X:2 WONBOTTLES,180279525|firstname taste |restname this ? ?? |lang en |description weed girls life cool #team yoooooooo #teamblasian #teamgemini #teamcoolin #teamcowboys |utc_offset utc_offset_-18000 |profile sidebar_252429 background_1a1b1f |location spacejam'n in my jet fool
```

If that looks like Vowpal Wabbit, it's because I ripped off their input format again, but the label specification is enriched. In particular, zero or more worker:label pairs can be specified, as well as an optional true label (just a label, no worker).
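For concreteness, here's a minimal sketch of parsing that enriched label field. The grammar is my guess from the example above; playerpiano's actual parser may differ.

```python
def parse_label_field(field):
    """Parse zero or more worker:label pairs plus an optional bare true
    label (a token with no colon) from a comma-separated label field."""
    ratings, true_label = [], None
    if field:
        for token in field.split(','):
            if ':' in token:
                worker, label = token.rsplit(':', 1)
                ratings.append((worker, int(label)))
            else:
                true_label = int(token)
    return ratings, true_label

# parse_label_field("A26E8CJMP5S4WN:2,A8H56XB9K7DB5:2")
#   -> ([('A26E8CJMP5S4WN', 2), ('A8H56XB9K7DB5', 2)], None)
```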
Here's what multiple passes over a training set look like.

```
initial_t = 10000
eta = 1.0
rho = 0.9
n_items = 10130
n_labels = 3
n_worker_bits = 16
n_feature_bits = 16
test_only = false
prediction file = (no output)
data file = (stdin)
   cumul     since     cumul     since   example  current  current  current  current
   avg q      last    avg ce      last   counter    label  predict  ratings features
-0.52730  -0.52730  -0.35304  -0.35304         2       -1        0        4        7
-0.65246  -0.73211  -0.29330  -0.25527         5       -1        0        4       23
-0.62805  -0.60364  -0.33058  -0.36786        10       -1        1        4       13
-0.73103  -0.86344  -0.29300  -0.24469        19       -1        0        4       12
-0.76983  -0.81417  -0.25648  -0.21474        36       -1        0        4       20
-0.75015  -0.72887  -0.26422  -0.27259        69       -1        2        4       12
-0.76571  -0.78134  -0.25690  -0.24956       134       -1        2        4       37
-0.76196  -0.75812  -0.24240  -0.22752       263       -1        0        4       21
-0.74378  -0.72467  -0.25171  -0.26148       520       -1        2        4       12
-0.75463  -0.76554  -0.24286  -0.23396      1033       -1        2        2       38
-0.72789  -0.70122  -0.24080  -0.23874      2058       -1        0        4       30
-0.68904  -0.65012  -0.25367  -0.26656      4107       -1        2        4       25
-0.61835  -0.54738  -0.25731  -0.26097      8204       -1        0        4       11
-0.55034  -0.48273  -0.24362  -0.23001     16397       -1        2        3       12
-0.49055  -0.43083  -0.20390  -0.16423     32782       -1        2        3       29
-0.44859  -0.40666  -0.15410  -0.10434     65551       -1        2        4       12
-0.42490  -0.40117  -0.11946  -0.08477    131088       -1        0        4        9
-0.41290  -0.40090  -0.10018  -0.08089    262161       -1        2        4        9
-0.40566  -0.39841  -0.08973  -0.07927    524306       -1        0        4       33
-0.40206  -0.39846  -0.08416  -0.07858   1048595       -1        2        4       22
-0.40087  -0.39869  -0.08206  -0.07822   1620800       -1        0        4       18
applying deferred prior updates ... finished
gamma:
      \  ground truth
       |       0        1        2
 label |
     0 | -1.0000   0.0023   0.0038
     1 |  0.0038  -1.0000   0.0034
     2 |  0.0038   0.0018  -1.0000
```

That output takes about 3 minutes to produce on my laptop. If that looks like Vowpal Wabbit, it's because I ripped off their output format again. The first two columns are the EM auxiliary function, which is akin to a log-likelihood, so increasing numbers indicate the worker model is better able to predict the worker labels. The next two columns are the (negated) cross-entropy for the classifier, so increasing numbers indicate the classifier is better able to predict the posterior (with respect to crowdsource worker labels) over ground truth from the item features.
The above software is available from the Google Code repository. It's called playerpiano, since I find the process of using crowdsource workers to provide training data for classifiers reminiscent of Vonnegut's dystopia, in which the last generation of human master craftsmen had their movements recorded onto tape before being permanently evicted from industrial production. Right now playerpiano only supports nominal problems, but I've written things so that hopefully it will be easy to add ordinal and multilabel support to the same executable.