In my previous post I discussed my ongoing difficulties with the results from a Mechanical Turk HIT. I indicated that I would hand-label some of the data and then implement clamping (treating a label as known) in my generative model to attempt to improve the results. Since then I've implemented clamping and released it to nincompoop.
Well, the first thing I learned from trying to hand-label the data is that I basically asked the Turkers to do the impossible. It is not possible to reliably distinguish between whites and hispanics (somewhat ill-defined terms, actually) on the basis of a photo alone. The only reason I'm able to disambiguate is that I have access to additional information (e.g., the person's real name). Lesson learned: always try to perform the HIT yourself to determine feasibility before sending it to Mechanical Turk.
I hand-labeled about 20% of the profiles, held out 1/4 of the hand labels to assess the quality of the label estimation, and clamped the rest. I ended up with the following results on the held-out labels: columns are the labels assigned by nominallabelextract (i.e., $\operatorname{arg\,max}_k\; p(Z=k)$), and rows are the labels assigned by "Mechanical Me". (Note: invalid was one of the choices in the HIT, indicating that the photo was improper.) \[
\begin{array}{c|c|c|c|c|c|c}
& \mbox{black} & \mbox{white} & \mbox{asian} & \mbox{hispanic} & \mbox{other} & \mbox{invalid} \\ \hline
\mbox{black} & 106 & 0 & 0 & 2 & 0 & 8 \\
\mbox{white} & 0 & 35 & 0 & 1 & 0 & 7 \\
\mbox{asian} & 4 & 7 & 39 & 13 & 16 & 23 \\
\mbox{hispanic} & 0 & 4 & 1 & 3 & 1 & 1 \\
\end{array}
\] Now it is interesting to compare this to how the model does without access to any of the clamped values: \[
\begin{array}{c|c|c|c|c|c|c}
& \mbox{black} & \mbox{white} & \mbox{asian} & \mbox{hispanic} & \mbox{other} & \mbox{invalid} \\ \hline
\mbox{black} & 106 & 0 & 0 & 2 & 0 & 8 \\
\mbox{white} & 0 & 35 & 0 & 1 & 0 & 7 \\
\mbox{asian} & 4 & 7 & 42 & 11 & 12 & 26 \\
\mbox{hispanic} & 0 & 5 & 0 & 2 & 2 & 1 \\
\end{array}
\] It's a wash, or, if anything, clamping has slightly degraded things.
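To put a number on that comparison, here is a minimal sketch that sums the diagonal of each table above (the cells where the model's label matches my hand label) and divides by the number of held-out profiles. The variable and function names are mine, chosen for illustration; this is not part of nincompoop.

```python
# Rows: hand labels (black, white, asian, hispanic).
# Columns: model labels (black, white, asian, hispanic, other, invalid).
clamped = [
    [106, 0, 0, 2, 0, 8],
    [0, 35, 0, 1, 0, 7],
    [4, 7, 39, 13, 16, 23],
    [0, 4, 1, 3, 1, 1],
]
unclamped = [
    [106, 0, 0, 2, 0, 8],
    [0, 35, 0, 1, 0, 7],
    [4, 7, 42, 11, 12, 26],
    [0, 5, 0, 2, 2, 1],
]


def agreement(table):
    """Return (matches, total): the diagonal sum and the sum of all cells."""
    matches = sum(row[i] for i, row in enumerate(table))
    total = sum(sum(row) for row in table)
    return matches, total


for name, table in [("clamped", clamped), ("unclamped", unclamped)]:
    matches, total = agreement(table)
    print(f"{name}: {matches}/{total} = {matches / total:.3f}")
# clamped: 183/271 = 0.675
# unclamped: 185/271 = 0.683
```

The two runs differ by two profiles out of 271, all of it in the asian and hispanic rows.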
My dream of labeling a small amount of data to rescue the larger pile has been destroyed. What's happening? Intuitively, for clamping to help, there need to be Mechanical Turk workers who label like I do, so that nominallabelextract can extrapolate from agreement on the known set to high reliability on the unknown set. When I spot-checked, however, there were cases where I clamped a value (e.g., hispanic), but all 5 workers from Mechanical Turk agreed on a different label (e.g., white). Therefore I suspect there are no workers who label like I do, because none of them have access to the additional information that I have.
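The spot check described above could be automated along these lines. This is only a sketch: the function name, the dictionary layout, and the toy profiles are all hypothetical, made up for illustration rather than taken from my data or from nincompoop.

```python
def unanimous_disagreements(worker_labels, clamped_labels):
    """Find items where every worker chose the same label and it differs from the clamped one.

    worker_labels:  dict mapping item id -> list of labels from the workers (assumed layout)
    clamped_labels: dict mapping item id -> hand-assigned (clamped) label
    Returns a dict mapping item id -> (clamped label, workers' unanimous label).
    """
    flagged = {}
    for item, clamped in clamped_labels.items():
        votes = set(worker_labels.get(item, []))
        if len(votes) == 1:  # all workers agreed on a single label
            (label,) = votes
            if label != clamped:
                flagged[item] = (clamped, label)
    return flagged


# Hypothetical toy data: three profiles, five workers each.
worker_labels = {
    "profile_a": ["white"] * 5,
    "profile_b": ["black"] * 5,
    "profile_c": ["white", "hispanic", "white", "white", "other"],
}
clamped_labels = {
    "profile_a": "hispanic",
    "profile_b": "black",
    "profile_c": "hispanic",
}

print(unanimous_disagreements(worker_labels, clamped_labels))
# {'profile_a': ('hispanic', 'white')}
```

A large number of flagged items would suggest that no amount of reweighting workers can recover my labels, since the information needed is simply not in the photo.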
So basically I have to redesign the HIT to contain additional information.