I say ``inspired by'' because the model is quite a bit simpler. In particular, since my data sets typically have very few ratings per item (e.g., 3), I continue my tradition of a simple item model (namely, a single scalar difficulty parameter $\beta$). Therefore, instead of embedding items, I embed the hidden labels. Each worker is modeled as a probabilistic classifier driven by the distance from the hidden label prototype, \[
p (l_{ij} = r | \alpha, \beta, \tau, z) \propto \exp (-\beta_j \lVert \tau_{z_j} + \alpha_{i z_j} - \tau_r - \alpha_{ir} \rVert^2).
\] Here $l_{ij}$ is the label reported by worker $i$ on item $j$, $\alpha_{ir}$ is the $d$-dimensional bias vector for worker $i$ and label $r$, $\beta_j$ is the difficulty parameter for item $j$, $\tau_r$ is the $d$-dimensional prototype vector for label $r$, $z_j$ is the true hidden label for item $j$, and $d$ is the dimensionality of the embedding. Although the $\tau$ need to be randomly initialized to break symmetry, this parameterization ensures that $\alpha_{ir} = 0$ is a reasonable starting condition. The $\alpha$ are $L^2$ regularized (Gaussian prior) but the $\tau$ are not (uninformative prior). A note about invariances: $d$ symmetries are eliminated by translating and rotating the $\tau$ into canonical position ($\tau_0$ is constrained to be at the origin, $\tau_1$ is constrained to be in the subspace spanned by the first unit vector, etc.).
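As a rough illustration of the worker model above, here is a minimal sketch in plain Python. It assumes the first bias term is the worker's bias for the true label, $\alpha_{i z_j}$ (consistent with the worker prototype $\tau_z + \alpha_{iz}$ used in the plots later on). The function name, argument names, and numbers are hypothetical and are not taken from playerpiano.

```python
import math

def worker_label_probs(tau, alpha_i, beta_j, z):
    """Distribution over the label reported by one worker on one item.

    tau     : list of d-dim prototype vectors, one per label
    alpha_i : list of d-dim bias vectors for this worker, one per label
    beta_j  : scalar difficulty parameter for the item
    z       : index of the item's true hidden label
    """
    def proto(r):
        # worker-specific prototype for label r: tau_r + alpha_ir
        return [t + a for t, a in zip(tau[r], alpha_i[r])]

    truth = proto(z)
    scores = []
    for r in range(len(tau)):
        sq = sum((u - v) ** 2 for u, v in zip(truth, proto(r)))
        scores.append(-beta_j * sq)

    # normalize, subtracting the max score for numerical stability
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical values: three labels in d = 2, an unbiased worker (alpha = 0).
tau = [[0.0, 0.0], [2.0, 0.0], [0.0, 5.0]]
alpha_i = [[0.0, 0.0] for _ in tau]
probs = worker_label_probs(tau, alpha_i, beta_j=1.0, z=0)
```

With these made-up numbers the worker most often reports the true label, and confuses it with the nearer prototype more often than with the farther one. In this sketch a larger `beta_j` concentrates the distribution on the true label.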
Although my motivation was visualization (corresponding to $d = 2$ or $d = 3$), there are two other possible uses. $d = 1$ is akin to a non-monotonic ordinal constraint and might be appropriate for some problems. Larger $d$ are potentially useful since there is a reduction of per-worker parameters from $O (|L|^2)$ to $O (d |L|)$, which might be relevant for multi-label problems handled by reduction.
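To make the parameter reduction concrete, here is a small sketch that counts per-worker parameters under each model. It assumes the confusion-style worker model stores a full $|L| \times |L|$ table, which is a guess at its exact parameterization; the function name is hypothetical.

```python
def per_worker_params(n_labels, n_dims):
    """Rough per-worker parameter counts for the two worker models."""
    confusion = n_labels * n_labels  # assumed: one weight per (true, reported) label pair
    embedding = n_dims * n_labels    # one d-dim bias vector per label
    return confusion, embedding

# n_labels = 9 and n_dims = 2 are the settings of the demonstration run in this post.
confusion, embedding = per_worker_params(n_labels=9, n_dims=2)
```

Under these assumptions the count goes from 81 to 18 per worker, and the gap widens as the number of labels grows.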
Inference proceeds as before (I used multinomial logistic regression for the classifier), except of course the worker model has changed. In practice this worker model is roughly 3x slower than the multinomial worker model, but since this worker model results in a reduction of per-worker parameters perhaps the fair comparison is against a low-rank approximation, which is also slower. Here is the software working through my canonical demonstration task, predicting the ethnicity of a Twitter user from their profile.
    strategy = nominalembed
    initial_t = 10000
    eta = 1.0
    rho = 0.9
    n_items = 16547
    n_labels = 9
    n_worker_bits = 16
    n_feature_bits = 18
    n_dims = 2
    seed = 45
    test_only = false
    prediction file = (no output)
    data file = (stdin)
    cumul    since    cumul    since      example current current current  current
    avg q    last     avg ce   last       counter   label predict ratings features
    -1.64616 -1.64616 -1.90946 -1.90946         2      -1       2       4       30
    -1.60512 -1.56865 -1.93926 -1.95912         5      -1       2       3       32
    -1.38015 -1.15517 -2.13355 -2.32784        10      -1       1       4       28
    -1.11627 -0.82685 -2.08542 -2.03194        19      -1       2       3       21
    -0.89318 -0.63424 -1.89668 -1.68574        36      -1       1       3       35
    -0.90385 -0.91498 -1.62015 -1.31849        69      -1       8       4       27
    -0.99486 -1.0903  -1.5287  -1.43162       134      -1       1       4       54
    -0.93116 -0.86077 -1.42049 -1.30809       263      -1       1       4       45
    -0.90436 -0.87592 -1.47783 -1.5365        520      -1       1       3       13
    -0.92706 -0.95001 -1.42042 -1.36223      1033      -1       2       1       11
    -0.96477 -1.00259 -1.33948 -1.25791      2058      -1       8       3       21
    -0.95079 -0.93672 -1.2513  -1.16272      4107      -1       1       3       44
    -0.91765 -0.88423 -1.13014 -1.0087       8204      -1       0       3       26
    -0.90145 -0.88529 -0.98977 -0.84921     16397      -1       8       3       23
    -0.86520 -0.82882 -0.80860 -0.62731     32782      -1       8       3       20
    -0.83186 -0.79852 -0.63999 -0.47132     65551      -1       1       3       56
    -0.79732 -0.76279 -0.50123 -0.36243    131088      -1       2       3       35
    -0.77279 -0.74826 -0.40255 -0.30386    262161      -1       8       3       41
    -0.75345 -0.73413 -0.33804 -0.27352    524306      -1       2       3       43
    -0.74128 -0.72911 -0.29748 -0.25692   1048595      -1       1       4       45
    -0.73829 -0.72691 -0.28774 -0.25064   1323760      -1       1       3       27
    applying deferred prior updates ... finished
    tau:
         \  latent dimension
          |    0        1
    label |
        0 |  0.0000   0.0000
        1 |  2.6737   0.0000
        2 |  3.5386  -1.3961
        3 |  1.3373  -1.2188
        4 | -1.5965  -1.4927
        5 |  0.0136  -2.9098
        6 | -2.4236   1.4345
        7 | -0.0450   2.2672
        8 |  2.1513  -1.5638
    447.48s user 1.28s system 97% cpu 7:38.84 total

The above process produces estimates (posterior distributions) over the hidden labels for each item, as well as a classifier that will attempt to generalize to novel instances and a worker model that will attempt to generalize to novel workers.
In addition several visualizable things fall out of this:
- The hidden label prototype vectors $\tau_r$. Being closer together suggests two labels are more likely to be confused.
- The per-worker noise vector $\alpha_{ir}$. These adjust the hidden label prototypes per user, leading to differences in bias and accuracy.
- The items can be placed into the latent space by forming a convex combination of hidden label prototype vectors via the posterior distribution over labels.
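The item placement in the last bullet can be sketched in a few lines. The prototype vectors are copied from the `tau` table printed above; the posterior is a made-up example and `embed_item` is a hypothetical name, not a playerpiano function.

```python
def embed_item(posterior, tau):
    """Convex combination of label prototypes, weighted by the label posterior."""
    d = len(tau[0])
    return [sum(p * t[k] for p, t in zip(posterior, tau)) for k in range(d)]

# Prototype vectors from the tau table in the run above (labels 0 through 8).
tau = [
    [0.0000, 0.0000], [2.6737, 0.0000], [3.5386, -1.3961],
    [1.3373, -1.2188], [-1.5965, -1.4927], [0.0136, -2.9098],
    [-2.4236, 1.4345], [-0.0450, 2.2672], [2.1513, -1.5638],
]

# Hypothetical posterior: label 1 or label 8 with equal probability.
posterior = [0, 0.5, 0, 0, 0, 0, 0, 0, 0.5]
point = embed_item(posterior, tau)
```

An item whose posterior is concentrated on a single label lands on that label's prototype; an uncertain item, like the one here, lands between the prototypes it is torn between.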
Results depend on the random seed. The most popular labels (Asian, Hispanic, Black, White and N/A) maintain their relative positions, but the less popular labels move around. Here's the above plot for a different random seed; note the x-axis has shrunk, but this will be more convenient for subsequent plots.
I'll stick with this random seed for the remainder of the plots. Now I'll place a dot for each worker's prototype vector ($\tau_z + \alpha_{iz}$) on the plot.
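For concreteness, here is a minimal sketch of how such dots could be computed, along with a per-axis spread summary for the dots around one label. All names and numbers are hypothetical; the real $\alpha$ values come from the fitted model.

```python
from statistics import pstdev

def worker_dots(tau, alphas, z):
    """One point per worker for label z: tau_z + alpha_iz."""
    return [[t + a for t, a in zip(tau[z], alpha_i[z])] for alpha_i in alphas]

def per_axis_spread(points):
    """Population standard deviation of the points along each latent dimension."""
    d = len(points[0])
    return [pstdev([p[k] for p in points]) for k in range(d)]

# Hypothetical setup: two labels in d = 2 and three workers.
tau = [[0.0, 0.0], [2.0, 0.0]]
alphas = [
    [[0.0, 0.0], [0.5, 0.1]],    # worker 0
    [[0.0, 0.0], [-0.5, -0.1]],  # worker 1
    [[0.0, 0.0], [0.0, 0.0]],    # worker 2
]
dots = worker_dots(tau, alphas, z=1)
spread = per_axis_spread(dots)
```

In this toy example the dots for label 1 vary more along the first axis than the second, which is the kind of horizontal versus vertical spread one would look for in the plot.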
The pattern of dots provides some intuition about the distribution of error patterns across the worker population. For instance, the dots around the Hispanic label have more horizontal than vertical spread. That suggests there is more variation in distinguishing between Whites and Hispanics than in distinguishing between Blacks and Hispanics. The distinction between Whites and Hispanics is more cultural than racial: the US Census Bureau lists White as a race but ``Hispanic or Latino'' as an ethnicity. In some sense this is poor experimental design, but since advertisers care strongly about this distinction, I have to make it work.
Finally, here are some profile photos embedded into the latent space according to the posterior distribution over the hidden label for the profile.
In some cases the photos don't appear to make sense given their embedded location. Some of this is because the workers are noisy labelers. However, the workers have access to the entire profile and base their labeling decisions on it. Therefore, these photos are best thought of as ``examples of the kind of profile photo that particular ethnicities choose to use'', rather than examples of pictures of people of that ethnicity per se.
The latest version of playerpiano is available from the Google Code repository.