In my previous post I discussed my ongoing difficulties with the results from a Mechanical Turk HIT. I indicated that I would hand-label some of the data and then implement clamping (i.e., fixing known labels) in my generative model to attempt to improve the results. Since then I've implemented clamping and released it to nincompoop.
Well, the first thing I learned trying to hand-label the data is that I basically asked the Turkers to do the impossible. It is not possible to reliably distinguish between whites and hispanics (somewhat ill-defined terms, actually) on the basis of a photo alone. The only reason I'm able to disambiguate is because I have access to additional information (e.g., the person's real name). Lesson learned: always try to perform the HIT yourself to determine feasibility before sending it to Mechanical Turk.
I hand-labeled about 20% of the profiles, held out 1/4 of the hand labels to assess the quality of the label estimation, and clamped the rest. I ended up with the following results on the held-out labels: columns are the label assigned by nominallabelextract (i.e., $\operatorname{arg\,max}_k\; p (Z=k)$), and rows are the labels assigned by ``Mechanical Me''. (Note: invalid was one of the choices from the HIT, indicating that the photo was improper.) \[
\begin{array}{c|c|c|c|c|c|c}
& \mbox{black} & \mbox{white} & \mbox{asian} & \mbox{hispanic} & \mbox{other} & \mbox{invalid} \\ \hline
\mbox{black} & 106 & 0 & 0 & 2 & 0 & 8 \\
\mbox{white} & 0 & 35 & 0 & 1 & 0 & 7 \\
\mbox{asian} & 4 & 7 & 39 & 13 & 16 & 23 \\
\mbox{hispanic} & 0 & 4 & 1 & 3 & 1 & 1 \\
\end{array}
\] Now it is interesting to compare this to how the model does without access to any of the clamped values: \[
\begin{array}{c|c|c|c|c|c|c}
& \mbox{black} & \mbox{white} & \mbox{asian} & \mbox{hispanic} & \mbox{other} & \mbox{invalid} \\ \hline
\mbox{black} & 106 & 0 & 0 & 2 & 0 & 8 \\
\mbox{white} & 0 & 35 & 0 & 1 & 0 & 7 \\
\mbox{asian} & 4 & 7 & 42 & 11 & 12 & 26 \\
\mbox{hispanic} & 0 & 5 & 0 & 2 & 2 & 1 \\
\end{array}
\] It's a wash; if anything, clamping has slightly degraded things.
My dream of labeling a small amount of data to rescue the larger pile has been destroyed. What's happening? Intuitively, for clamping to help there need to be Mechanical Turk workers who label like I do, such that nominallabelextract can extrapolate from agreement on the known set to high reliability on the unknown set. When I spot-checked, however, there were cases where I clamped a value (e.g., hispanic) but all 5 workers from Mechanical Turk agreed on a different label (e.g., white). Therefore I suspect there are no workers who label like I do, because none of them have access to the additional information that I have.
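To be concrete about what clamping means here: in an EM loop like the one nominallabelextract runs, a clamped image simply has its posterior over the hidden true label pinned to a point mass instead of being re-estimated. A minimal sketch of that idea, assuming a generic E-step (all names hypothetical, not the actual implementation):
```python
import numpy as np

def e_step(log_prior, log_lik, clamped):
    """Posterior over true labels Z for each image.

    log_prior : (J, K) log prior over labels per image
    log_lik   : (J, K) summed log-likelihood of the observed worker
                labels under each candidate true label
    clamped   : dict mapping image index -> known (hand-labeled) label index
    """
    logp = log_prior + log_lik
    logp -= logp.max(axis=1, keepdims=True)      # numerical stability
    post = np.exp(logp)
    post /= post.sum(axis=1, keepdims=True)
    # clamping: replace the inferred posterior with a point mass
    for j, k in clamped.items():
        post[j, :] = 0.0
        post[j, k] = 1.0
    return post
```
The M-step then treats the clamped images like any others, so (in principle) agreement with the known labels propagates into the per-worker reliability estimates.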
So basically I have to redesign the HIT to contain additional information.
Wednesday, January 26, 2011
Monday, January 24, 2011
Modeling Mechanical Turk Part II
In a previous post I talked about a multi-valued image labeling problem for which I was utilizing Mechanical Turk to get training data. I discussed a generative model of Mechanical Turk labeling which entailed a model of each worker's confusion matrix. At that time I noted that in fact the workers seemed to be mostly making similar errors, in particular, being systematically quite bad at distinguishing between whites and hispanics, hispanics and asians, and whites and asians. I therefore mused that using a hierarchical model for the confusion matrix would allow me to use population-level information to inform my per-worker confusion matrix model, improving the fit.
Since then I've made that change to the nominallabelextract software in the nincompoop project by putting a hierarchical Gaussian prior on the elements of the confusion matrix. The model is now \[
\begin{aligned}
\gamma_{kk} &= 0 \\
\gamma_{kl} &\sim N (1, 1) \;\; (k \neq l) \\
\alpha_i^{(k,k)} &= 0 \\
\alpha_i^{(k,l)} &\sim N (\gamma_{kl}, 1) \;\; (k \neq l) \\
\log \beta_j &\sim N (1, 1) \\
p (L_{ij} = l | Z_j = k, \alpha_i, \beta_j) &\propto e^{-\alpha_i^{(k,l)} \beta_j}
\end{aligned}
\] where \[
\begin{array}{c|c}
\mbox{Variable} & \mbox{Description} \\ \hline
k, l & \mbox{index labels} \\
j & \mbox{indexes images} \\
i & \mbox{indexes workers} \\
\gamma & \mbox{label-pair reliability hyperprior} \\
\alpha_i & \mbox{per-worker label-pair reliability} \\
\beta_j & \mbox{per-image difficulty} \\
L_{ij} & \mbox{observed label assigned to image by worker} \\
Z_j & \mbox{unknown true label associated with image}
\end{array}
\] Training still proceeds via ``Bayesian EM''. I folded the $\gamma$ estimation into the m-step, which appears numerically stable.
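For the curious, here is one way the fold-in can work, assuming MAP point estimates for the per-worker $\alpha$ and the unit variances above: each $\gamma_{kl}$ is set to its conjugate Gaussian posterior mean given the current $\alpha$ estimates. A sketch under those assumptions, not the literal nominallabelextract code:
```python
import numpy as np

def update_gamma(alpha, prior_mean=1.0):
    """M-step update for the hyperprior means gamma_{kl}.

    alpha : (n_workers, K, K) current per-worker confusion parameters,
            with the diagonal fixed at 0.
    With gamma_{kl} ~ N(prior_mean, 1) and alpha_i^{(k,l)} ~ N(gamma_{kl}, 1),
    the posterior mean of gamma_{kl} is a precision-weighted average of the
    prior mean and the per-worker estimates.
    """
    n = alpha.shape[0]
    gamma = (prior_mean + alpha.sum(axis=0)) / (1.0 + n)
    np.fill_diagonal(gamma, 0.0)   # diagonal stays pinned at 0
    return gamma
```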
I ran the new hyperprior-enabled model on the data from my previous blog post; here's the resulting $\gamma$ estimate. Note: row labels are the true labels $Z$ and column labels are the observed labels $L$. \[
\begin{array}{c|c|c|c|c|c}
\gamma & \mbox{black} & \mbox{white} & \mbox{asian} & \mbox{hispanic} & \mbox{other} \\ \hline
\mbox{black} & 0 & 1.969921 & 1.608217 & 1.538128 & 2.104743 \\
\mbox{white} & 1.822261 & 0 & 1.062852 & 1.160873 & 1.767781 \\
\mbox{asian} & 1.494157 & 0.911748 & 0 & 1.003832 & 1.644094 \\
\mbox{hispanic} & 0.811841 & 0.383368 & 0.190436 & 0 & 1.338488 \\
\mbox{other} & 1.017143 & 0.579123 & -0.225708 & 0.607709 & 0\\
\end{array}
\] Since the diagonal elements are 0, cells where $\gamma_{kl} < 0$ indicate that a priori a rater is more likely to output the wrong label than the correct one. So, for instance, the model says that when the true label is other, a rater is a priori more likely to label it asian than other. Of course, if a rater is unlikely to output the true label, that raises the question of how the model can figure this out. It could potentially be identifying a small set of raters that are consistent with each other with respect to assigning the other label, and using that to infer that the typical rater is likely to mislabel others. However, Murphy's Law being what it is, I think the above $\gamma$ matrix is telling me that my data is not very good and I'm in the weeds.
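To make that concrete, plugging the ``other'' row of $\gamma$ into the likelihood for a hypothetical rater whose $\alpha$ sits at the prior mean, on an image with $\beta_j = 1$, gives the implied prior label distribution:
```python
import numpy as np

labels = ["black", "white", "asian", "hispanic", "other"]
# gamma row for true label "other", taken from the table above
gamma_other = np.array([1.017143, 0.579123, -0.225708, 0.607709, 0.0])

beta = 1.0                               # a nominal image difficulty
p = np.exp(-gamma_other * beta)          # unnormalized p(L = l | Z = other)
p /= p.sum()

for l, prob in zip(labels, p):
    print(f"{l:9s} {prob:.3f}")
# asian comes out around 0.34 versus roughly 0.27 for other itself,
# i.e., this prior rater is more likely to say "asian" than "other".
```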
So does this extra hyperprior machinery make a difference in label assignments? Here's a matrix of counts, where the rows are the non-hyperprior model assignments and the columns are the hyperprior model assignments. \[
\begin{array}{c|c|c|c|c|c}
& \mbox{black} & \mbox{white} & \mbox{asian} & \mbox{hispanic} & \mbox{other} \\ \hline
\mbox{black} & 1689 & 0 & 0 & 0 & 0 \\
\mbox{white} & 1 & 908 & 1 & 4 & 0 \\
\mbox{asian} & 0 & 0 & 872 & 9 & 59 \\
\mbox{hispanic} & 4 & 2 & 9 & 470 & 7 \\
\mbox{other} & 0 & 2 & 4 & 3 & 208
\end{array}
\] Mostly they agree, although the hyperprior model converts a sizeable chunk of asians into others. In addition, the magnitudes of the $p (Z)$ vectors can be slightly different without affecting the label (i.e., $\operatorname{arg\,max}_k\; p (Z=k)$), and the magnitudes can matter when doing cost-sensitive multiclass classification. However, I don't think it will matter much here. Basically my data is pretty crappy in some areas and it's hard to overcome crappy data with statistical legerdemain. Still, I'm glad I implemented the hyperprior machinery, as it makes it very easy to see exactly how I am screwed.
Fortunately, although there is ``no data like good data'', I still have a possibility for success. If I implement clamping (i.e., the ability to assign known values to some of the hidden labels) and hand-label some of the examples, I might be able to leverage a small amount of high quality data to clean up a large amount of low quality data. So I'll try that next. If that fails, there is going to be a lot of ``Mechanical Me'' in the future.
Tuesday, January 18, 2011
Modeling Mechanical Turk
There was a nice paper by Welinder et al. at NIPS this year about building a statistical model of Mechanical Turkers in order to better infer the ``ground truth'' that is typically subsequently used by supervised learning algorithms. It was an aha! moment for me as I realized that I had been using Mechanical Turk without thinking very deeply about what I was doing. I resolved to do better on the next occasion I had to use Mechanical Turk, and that occasion has arrived.
My (sub)problem is basically identifying the race of a person from a headshot of that person. Acceptable choices are ``black'', ``white'', ``asian'', ``hispanic'', ``other'', or to reject the photo as not being an actual headshot of an actual person (like any data set, mine has some funny business). I made a simple image labeling HIT, loaded 5000 images into Mechanical Turk, and asked for each one to be labeled by 5 unique workers.
It turns out there have been multiple papers in the last few years regarding crowdsourcing generally and Mechanical Turk in particular. I'm going to focus on an earlier paper describing the GLAD framework of Whitehill et al., which is similar in purpose to the Welinder et al. paper. There are three reasons for this. First, I found Whitehill et al. a bit easier to understand and adapt to the multiclass case. Second, Whitehill et al. provide a reference software implementation which served as a useful consistency check when implementing the multiclass version. Third, one of the authors on the Whitehill et al. paper is a former advisor of mine.
Empirical Performance, Binary Case
Although my problem is multiclass, I decided to start with a binary version of it in order to build intuition and test-drive the reference implementation. So for the moment I'll be discussing ``is this a photo of a black person or not?''
Majority Voting
The following table summarizes the performance of the majority voting heuristic. \[\begin{array}{c|c|c|c}
\mbox{Criterion } &\mbox{ Positive Labels } &\mbox{ False Positives } &\mbox{ Cumulative Labels } \\ \hline
\mbox{Exactly 5 out of 5 } &\mbox{ 920 } &\mbox{ 0/100 } &\mbox{ 920 } \\
\mbox{Exactly 4 out of 5 } &\mbox{ 460 } &\mbox{ 0/100 } &\mbox{ 1380 } \\
\mbox{Exactly 3 out of 5 } &\mbox{ 221 } &\mbox{ 4/100 } &\mbox{ 1601 } \\
\mbox{Exactly 2 out of 5 } &\mbox{ 41 } &\mbox{ 6/41 } &\mbox{ 1642 }
\end{array}
\] For the false positives column, I selected 100 random examples from the set in question and manually labelled them. For the ``exactly 2 out of 5'' criterion, I also required that no other label received more than 1 vote (so strictly speaking, this requires access to the original multiclass ratings and not binary versions of them).
Overall, majority voting does pretty well. If I stick with 4 out of 5 or better, the number of false positives is expected to be low. Here's a similar table, but looking for negative labels. \[
\begin{array}{c|c|c|c}
\mbox{Criterion } &\mbox{ Negative Labels } &\mbox{ False Negatives } &\mbox{ Cumulative Labels } \\ \hline
\mbox{Exactly 0 out of 5 } &\mbox{ 2849 } &\mbox{ 1/100 } &\mbox{ 2849 } \\
\mbox{Exactly 1 out of 5 } &\mbox{ 351 } &\mbox{ 6/50 } &\mbox{ 3200 }
\end{array}
\] As before, the false negatives column is actually me manually labelling a subset satisfying the criterion. It looks like high quality negative labels are only available at the ``exactly 0 out of 5'' level. Choosing ``4 or more out of 5'' for positives and ``0 out of 5'' for negatives results in a ratio of positive to negative examples of roughly 1:2. It also leaves 771 images unlabelled, implying the true ratio of positive to negative examples in the training set could be as high as 3:4 and as low as 2:9. Being incorrect about the relative frequencies would manifest itself in the classifier resulting from training on this data as a bias towards false negatives or false positives.
GLAD
Now for the GLAD estimation strategy, for which I used the downloadable reference implementation. I tried to pick thresholds in $p (Z = 1)$ corresponding to coverage points from the majority voting strategy. Here's a table for positive labels. \[\begin{array}{c|c|c|c}
\mbox{Criterion } &\mbox{ Positive Labels } &\mbox{ False Positives } &\mbox{ Cumulative Labels } \\ \hline
\mbox{$p (Z = 1) = 1$ } &\mbox{ 923 } &\mbox{ see below } &\mbox{ 923 } \\
\mbox{$0.9869 \leq p (Z = 1) < 1$} &\mbox{ 460 } &\mbox{ see below } &\mbox{ 1383 } \\
\mbox{$0.6 \leq p (Z = 1) < 0.9869$} &\mbox{ 219 } &\mbox{ see below } &\mbox{ 1602 } \\
\mbox{$0.395 \leq p (Z = 1) < 0.6$} &\mbox{ 41 } &\mbox{ 6/41 } &\mbox{ 1642 } \\
\end{array}
\] The ``exactly 5 out of 5'' set and the ``$p (Z = 1) = 1$'' set are the same except that the latter contains 3 extra images. I spot checked the extra 3 images and they were all true positives. The ``exactly 4 out of 5'' set and the ``$0.9869 \leq p (Z = 1) < 1$'' set differ by 13 pictures (26 total), so I labelled those manually. All of them were true positives. The ``exactly 3 out of 5'' set and the ``$0.6 \leq p (Z = 1) < 0.9869$'' set share 201 common images, so I labelled the differences manually. 2 out of 20 images were false positives in the ``exactly 3 out of 5'' set, while 0 out of 18 images were false positives in the ``$0.6 \leq p (Z = 1) < 0.9869$'' set. The ``exactly 2 out of 5'' set and the ``$0.395 \leq p (Z = 1) < 0.6$'' set share only 13 common images, so I labelled all the images in the latter. The false positive rate was identical, even though only 1 false positive is shared between the two sets.
Here's a table for negative labels. \[
\begin{array}{c|c|c|c}
\mbox{Criterion } &\mbox{ Negative Labels } &\mbox{ False Negatives } &\mbox{ Cumulative Labels } \\ \hline
\mbox{$p (Z = 1) = 0$ } &\mbox{ 2850 } &\mbox{ see below } &\mbox{ 2850 } \\
\mbox{$0 < p (Z = 1) < 0.022$ } &\mbox{ 351 } &\mbox{ see below } &\mbox{ 3201 }
\end{array}
\] The ``exactly 0 out of 5'' set and the ``$p (Z = 1) = 0$'' set are the same except that the latter contains one extra image. I spot checked the extra image and it is a true negative. The ``exactly 1 out of 5'' set and the ``$0 < p (Z = 1) < 0.022$'' set differ by 17 pictures (34 total), so I labelled those manually. 10 out of 17 of the ``exactly 1 out of 5'' uniques were false negatives, whereas 6 out of 17 of the ``$0 < p (Z = 1) < 0.022$'' uniques were false negatives.
Overall the GLAD strategy shows some modest precision lift over majority voting for this data set. If 1601 positive labels are desired, GLAD would have an estimated 7 false positives to majority voting's 9 false positives. Meanwhile if 3200 negative labels are desired, GLAD would have an estimated 38 false negatives to majority voting's 42.
Generalization to Multiclass
At the core of the GLAD technique is an assumption about the structure of the probability of error, \[p (L_{ij} = Z_j | \alpha_i, \beta_j) = \frac{1}{1 + e^{-\alpha_i \beta_j}}
\] This is equivalent to assuming a confusion matrix probability of the form \[
p (L_{ij} = k | Z_j = l, \alpha_i, \beta_j) \propto e^{-\alpha_i^{(k,l)} \beta_j} : \; k, l \in \{ 0, 1 \},
\] where $\alpha_i^{(k,k)} = 0$ and $\alpha_i^{(k,l)} = \alpha_i^{(l,k)}$. The latter symmetry property basically says a rater is equally likely to confuse negatives for positives and positives for negatives, an assumption that Welinder et al. relax.
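To spell out the equivalence in the binary case: since the diagonal entries are pinned at zero, the two-outcome version of the confusion matrix likelihood collapses to a logistic function of the single off-diagonal parameter, \[
p (L_{ij} = l | Z_j = l, \alpha_i, \beta_j) = \frac{e^{-\alpha_i^{(l,l)} \beta_j}}{e^{-\alpha_i^{(l,l)} \beta_j} + e^{-\alpha_i^{(k,l)} \beta_j}} = \frac{1}{1 + e^{-\alpha_i^{(k,l)} \beta_j}} \;\; (k \neq l),
\] which under the symmetry assumption $\alpha_i^{(0,1)} = \alpha_i^{(1,0)} = \alpha_i$ is exactly the GLAD form above.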
The confusion matrix formulation suggests a straightforward generalization to the multiclass case where $k$ and $l$ range over more than just $0$ and $1$. I'll drop the symmetry assumption in $\alpha$ so I'll be able to model a per-rater bias. Although I do not have access to the paper, I suspect this model is the same as that proposed by Dawid and Skene in 1979 (regarding, apparently, errors in medical patient records: could they have envisioned how their model would be applied 30 years later?). As with the original GLAD, training proceeds via ``Bayesian EM'' (see software release below).
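To make the multiclass model concrete, here is a rough sketch of the E-step under the confusion matrix likelihood above. This is an illustration of the model rather than the nominallabelextract implementation; the names and the externally supplied prior over $Z$ are my assumptions:
```python
import numpy as np

def label_posteriors(ratings, alpha, beta, log_prior):
    """E-step sketch: posterior p(Z_j = k) for every image j.

    ratings   : iterable of (worker i, image j, observed label l) triples
    alpha     : (n_workers, K, K) confusion parameters; alpha[i, k, l] plays
                the role of alpha_i^{(k,l)} above, so
                p(L = k | Z = l) is proportional to exp(-alpha[i, k, l] * beta[j])
    beta      : (n_images,) per-image difficulty
    log_prior : (K,) log prior over the true label
    """
    n_images = beta.shape[0]
    logp = np.tile(log_prior, (n_images, 1)).astype(float)
    for i, j, l in ratings:
        # log p(L_ij = l | Z_j = k) for every candidate true label k
        numer = -alpha[i, l, :] * beta[j]
        denom = np.log(np.exp(-alpha[i, :, :] * beta[j]).sum(axis=0))
        logp[j] += numer - denom
    logp -= logp.max(axis=1, keepdims=True)      # numerical stability
    post = np.exp(logp)
    return post / post.sum(axis=1, keepdims=True)
```
The M-step would then re-estimate $\alpha$ and $\beta$ against these posteriors, and the two steps alternate until convergence.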
In practice, this is $|K| (|K| - 1)$ parameters per rater when there are $|K|$ labels (30 parameters per rater for my 6 labels), which might be too much model complexity. In my data set there were 167 workers for my 5000 images, with a median number of ratings per worker of 71. Workers with 71 or more ratings are responsible for 22960 out of the 25000 ratings I collected. If numbers like that are typical, there is definitely room for a larger number of model parameters per rater, so for binary ground truth it is presumably always beneficial to drop the symmetry assumption and model per-rater bias.
However, when the number of classes $|K|$ gets very large, having $|K| (|K| - 1)$ parameters per rater is not useful without additional assumptions. One possibility is to assume all errors are equally probable as in Welinder and Perona, but that doesn't fit the error patterns I'm seeing in the data. I suspect there is room for additional papers in this area elaborating a useful hierarchical prior for multiclass observations where the per-rater $\alpha$ would be informed by a population level estimate of the probability of confusing two classes or even the probability of having a particular confusion matrix. I'll leave this improvement for the future. (But not too far in the future: I can tell just from spot checking the data that most workers are making the same kinds of errors).
Empirical Performance
For my multiclass data, I tried both majority voting (3 out of 5) and the multiclass generalization of GLAD. \[\begin{array}{c|c|c|c|c|c|c|c}
\mbox{Method } &\mbox{ Asian } &\mbox{ Black } &\mbox{ Hispanic } &\mbox{ Other } &\mbox{ White } &\mbox{ Invalid } &\mbox{ No Label } \\ \hline
\mbox{Multiclass GLAD (all) } &\mbox{ 941 } &\mbox{ 1690 } &\mbox{ 490 } &\mbox{ 217 } &\mbox{ 914 } &\mbox{ 748 } &\mbox{ n/a } \\
\mbox{Majority Vote } &\mbox{ 950 } &\mbox{ 1601 } &\mbox{ 137 } &\mbox{ 27 } &\mbox{ 818 } &\mbox{ 676 } &\mbox{ 793 } \\
\mbox{Multiclass GLAD (threshold) } &\mbox{ 724 } &\mbox{ 1599 } &\mbox{ 325 } &\mbox{ 148 } &\mbox{ 794 } &\mbox{ 617 } &\mbox{ 793 } \\
\mbox{MV} \bigcap \mbox{M-GLAD (threshold)} &\mbox{ 686 } &\mbox{ 1579 } &\mbox{ 115 } &\mbox{ 27 } &\mbox{ 742 } &\mbox{ 586 } &\mbox{ 423 }
\end{array}
\] Majority voting is unable to assign a label unless 3 raters agree, leading to 793 images not being assigned a label. For Multiclass GLAD (threshold), I chose a minimum label probability of 0.8461 in order to assign an image that label; this led to the same number of images not being assigned a label. I also forced Multiclass GLAD to assign a label to every image, and the result suggests that labels on images of lower label confidence are less likely to be ``black'' than labels on images of high label confidence.
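The 0.8461 cutoff isn't special; it appears to be whatever confidence level makes the thresholded model abstain on the same number of images (793) as majority voting. A sketch of one way to pick such a coverage-matching threshold, assuming a matrix of per-image posteriors (names hypothetical):
```python
import numpy as np

def matching_threshold(posteriors, n_unlabeled_target):
    """Pick a confidence cutoff that abstains on roughly a target number of images.

    posteriors : (n_images, K) array of p(Z_j = k) from the model
    Thresholding at the returned value (assign a label only when the max
    posterior is at least the cutoff) leaves approximately
    n_unlabeled_target images unlabeled, modulo ties.
    """
    confidence = posteriors.max(axis=1)
    # sort ascending: the n_unlabeled_target least-confident images abstain
    return np.sort(confidence)[n_unlabeled_target]

# usage sketch:
#   cutoff = matching_threshold(p_z, 793)
#   labels = np.where(p_z.max(axis=1) >= cutoff, p_z.argmax(axis=1), -1)
```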
For each label, I took a random sample of images that were given that label exclusively by either Multiclass GLAD (threshold) or Majority Vote (i.e., I ignored images that were assigned the same label by both algorithms). I labelled these manually in order to assess the error rate on the difference set. \[
\begin{array}{c|c|c}
\mbox{Label } & \Delta \mbox{MV Error Rate} & \Delta \mbox{M-GLAD Error Rate} \\ \hline
\mbox{Asian } &\mbox{ 1/38 } &\mbox{ 1/38 } \\
\mbox{Black } &\mbox{ 4/22 } &\mbox{ 1/20 } \\
\mbox{Hispanic } &\mbox{ 15/22 } &\mbox{ 18/22 } \\
\mbox{White } &\mbox{ 11/20 } &\mbox{ 6/20 } \\
\mbox{Other } &\mbox{ n/a } &\mbox{ 10/20 }
\end{array}
\] Overall, distinguishing between hispanics and asians is very difficult for the Mechanical Turk community (in some cases, I can only do better because I have access to side information associated with the photos). Since Majority Voting assigns fewer hispanic labels, and has a lower error rate on the difference label sample, it is doing a better job. This might be a good application of the ``clamping'' capability of generative models of Mechanical Turkers, where hand labeling a subset of the corpus converts hidden variables to known variables and helps nail down the parameters of the raters. In particular, I should implement clamping and then hand label a subset of the images marked hispanic by Multiclass GLAD.
Distinguishing between whites and hispanics, and whites and asians, is also difficult for the Mechanical Turk community. Since Majority Vote is assigning more of these labels, and has a higher error rate on the difference label sample, it is doing a worse job.
Multiclass GLAD assigns the ``other'' label to a strict superset of the images that Majority Vote does. The error rate here is very high: while there are many Arabs given this label, there are also many photos that would be better assigned one of the four main labels.
In practice, since I'm using these labels as training data in a supervised learning problem, I don't have to make a discrete decision at this point. Instead I can take the per-image $p(Z=k)$ vector and construct a cost-sensitive multiclass classification instance with it.
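For example, one simple encoding (my choice of cost function, not something prescribed by the model or any particular learner) is to charge each class the posterior probability that it is not the true label:
```python
def cost_vector(posterior):
    """Convert p(Z = k) into per-class costs for cost-sensitive learning.

    posterior : dict mapping label -> probability, summing to 1.
    The cost of predicting label k is the probability mass on the other
    labels, i.e., the expected 0/1 loss of predicting k.
    """
    return {k: 1.0 - p for k, p in posterior.items()}

# e.g. an image with posterior {"white": 0.6, "hispanic": 0.4} yields
# costs {"white": 0.4, "hispanic": 0.6}: predicting white on such images
# is penalized less than predicting hispanic, but neither is free.
```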
Software Release
I'm releasing an initial version on Google Code of the multiclass GLAD software that I used to get the above results in the hope others might find it useful. On Google Code it is called nominallabelextract and is part of the nincompoop project.
Overall the multiclass GLAD extension described above appears promising, but not decisively better than majority voting, and I still don't have data that is sufficiently high quality for me to attack my original problem. One possible direction is to implement clamping and do some hand labeling to get a better estimate for the easily confused labels (e.g., hispanic vs. asian); another is to introduce a hierarchical prior on the confusion matrix. If I do those things I'll keep the Google Code up to date.
Sunday, January 2, 2011
Happiness is a Warm Tweet
Often obtaining training labels is a limiting factor for getting things done. Towards this end, Ramage et al. make an interesting observation regarding Twitter: tweets are rich with annotation corresponding to Twitter conventions. Emoticons, hash-tags, and at-directions provide information about the emotional, semantic, and social content of the tweet.
So for kicks I decided to make a Twitter happiness detector. With a little help from Wikipedia I learned what the common emoticons are for indicating happiness and sadness. For any particular tweet, I called it happy if it contained at least one happy emoticon and no sad emoticons; I called it sad if it contained at least one sad emoticon and no happy emoticons; and otherwise I called it ambiguous. Most tweets are ambiguous, and the goal is to generalize to them; but for training I'll only use the unambiguous tweets.
There are several features in the new vowpal wabbit that come together and make this problem pleasant to attack: support for sparse high-cardinality features; model complexity control via the hashing trick; the --adaptive flag, which is like an automated, improved tf-idf weighting; native support for n-gram expansion; and of course, the ability to churn through tens of millions of tweets on my laptop before lunch is over. Training on a week of tweets and testing on a future day of tweets leads to an AUC of 0.87, i.e., given a random happy tweet and a random sad tweet, the resulting regressor is 87% likely to score the happy tweet as happier than the sad tweet. (Note: one has to remove emoticons when parsing; otherwise the AUC is 1. The point here is to generalize to ambiguous tweets.) At this point I'm not leveraging the LDA support in vowpal, I'm just encoding each token nominally, so presumably this can be improved.
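Here is a rough sketch of the data preparation, with a deliberately tiny emoticon list (the real lists came from Wikipedia) and vowpal wabbit's plain ``label |namespace tokens'' input format. The namespace name, the -1/+1 label encoding, and the whitespace tokenization are my assumptions, not a record of what was actually run:
```python
import re

HAPPY = {":)", ":-)", ":D", "=)", ";)"}      # tiny illustrative subset
SAD = {":(", ":-(", ":'(", "D:", ":-/"}

def label_tweet(text):
    """Return +1 (happy), -1 (sad), or None (ambiguous) from emoticons."""
    tokens = text.split()
    has_happy = any(t in HAPPY for t in tokens)
    has_sad = any(t in SAD for t in tokens)
    if has_happy and not has_sad:
        return 1
    if has_sad and not has_happy:
        return -1
    return None

def vw_line(text, label):
    """One vowpal wabbit training line, with emoticons stripped so the
    model has to generalize from the remaining tokens."""
    tokens = [t for t in text.split() if t not in HAPPY | SAD]
    tokens = [re.sub(r"[|:]", "", t) for t in tokens]   # vw reserves | and :
    return f"{label} |t {' '.join(t for t in tokens if t)}"

for tweet in ["had an awesome day :)", "stomach hurts :( why me"]:
    y = label_tweet(tweet)
    if y is not None:                        # skip ambiguous tweets for training
        print(vw_line(tweet, y))
# prints "1 |t had an awesome day" and "-1 |t stomach hurts why me"
```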
I took another random sample of 10000 tweets from even further in the future, which turn out to be mostly ambiguous tweets since that is what most tweets are. I then ranked them, and here are the 10 happiest and 10 saddest tweets. Just to reiterate, emoticons are stripped from the tweets during parsing: that several of the happiest and saddest tweets have emoticons in them indicates the ease of predicting emoticon presence from the other tokens in the tweet.
10 Happiest Tweets
@WRiTExMiND no doubt! <--guess who I got tht from? Bwahaha anyway doe I like surprising people it's kinda my thing so ur welcome! And hi :) @skvillain yeh wiz is dope, got his own lil wave poppin! I'm fuccin wid big sean too he signed to kanye label g.o.o.d music And @pumahbeatz opened for @MarshaAmbrosius & blazed! So proud of him! Go bro! & Marsha was absolutely amazing! Awesome night all around. =) Awesome! RT @robscoms: Great 24 hours with nephews. Watched Tron, homemade mac & cheese for dinner, Wii, pancakes & Despicable Me this am! Good Morning 2 U Too RT @mzmonique718: Morningggg twitt birds!...up and getting ready for church...have a good day and LETS GO GIANTS! Goodmorning #cleveland, have a blessed day stay focused and be productive and thank god for life AMEN!!!>>>RT @DrSanlare: Daddy looks soooo good!!! God is amazing! To GOD be the glory and victory #TeamJesus Glad I serve an awesome God AGREED!! RT @ILoveElizCruz: Amen to dat... We're some awesome people! RT @itsVonnell_Mars: @ILoveElizCruz gotta love my sign lol #word thanks! :) RT @Steph0e: @IBtunes HAppy Birthday love!!! =) still a fan of ya movement... yay you get another year to be dope!!! YES!! Happy bday isaannRT @isan_coy: Selamatt ulang tahun yaaa RT @Phitz_bow: Selamat siangg RT @isan_coy: Slamat pagiiii
10 Saddest Tweets
Migraine, sore throat, cough & stomach pains. Why me God? Ik moet werken omg !! Ik lig nog in bed en ben zo moe .. Moet alleen opstaan en tis koud buitn :( I Feel Horrible ' My Voice Is Gone Nd I'm Coughing Every 5 Minutes ' I Hate Feeling Like This :-/ SMFH !!! Stomach Hurting ; Aggy ; Upset ; Tired ;; Madd Mixxy Shyt Yo ! Worrying about my dad got me feeling sick I hate this!! I wish I could solve all these problems but I am only 1 person & can do so much.. Malam2 menggigil+ga bs napas+sakit kepala....badan remuk redam *I miss my husband's hug....#nangismanja# Waking up with a sore throat = no bueno. Hoping someone didn't get me ill and it's just from sleeping. D: Aaaa ini tenggorokan gak enak, idung gatel bgt bawaannya pengen bersin terus. Calon2 mau sakit nih -___- I'm scared of being alone, I can't see to breathe when I am lost in this dream, I need you to hold me? Why the hell is suzie so afraid of evelyn! Smfh no bitch is gonna hav me scared I dnt see it being possible its not!
Observations
First, emoticons are universal, which allows this trick to work for many different languages (I think? I can't read some of those).
Second, Twitter data is very clean. Silly ideas like this never worked with web data, because there was always this huge hurdle of separating the content from the other parts of the payload (navigational, structural, etc.). In addition, web pages are long where tweets are short, so a tweet can have a clear emotional tone where a web page might mix several.
Third, vowpal should be in the list of tools that people try when faced with a text classification problem. This took less than half a day to put together, and most of that was data wrangling.