- Compute term frequency vector for each hashtag.
  - For each tweet $X$:
    - For each hashtag $h \in X$:
      - For each non-hashtag token $t \in X$:
        - Increment count for $(h, t)$.
- Compute pairwise cosine for each hashtag pair in the top 1000 most frequent hashtags, $\cos \theta (h, h^\prime)$.
- Define the dissimilarity (the angle between the two term vectors) as $s (h, h^\prime) = \arccos (\cos \theta (h, h^\prime))$.
  - I get fewer negative eigenvalues doing this than with $1.0 - \cos \theta (h, h^\prime)$. This is the difference between moving around the hypersphere and cutting across a hyperchord.
- Do 60-dimensional MDS on $s$.
  - Two of the 60 eigenvalues were negative, so I treated them as 0. In effect this is 58-dimensional MDS.
  - I was a bit surprised to get any negative eigenvalues here, since all my term vectors occupy the positive hyperquadrant of a hypersphere. Clearly my hyperintuition needs hypertuning ... or maybe I have a bug.
- Feed the resulting 60-dimensional representation into t-SNE.
  - I used perplexity 10.
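
The steps above, up to and including the MDS, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code used for the post: the function names, the whitespace tokenization, the toy tweets, and the choice to keep the eigenvalues of largest magnitude are all hypothetical, and NumPy stands in for whatever tooling was actually used. The t-SNE step is left out.

```python
from collections import Counter, defaultdict

import numpy as np


def hashtag_term_counts(tweets):
    """Count how often each non-hashtag token co-occurs with each hashtag."""
    counts = defaultdict(Counter)
    for tweet in tweets:
        tokens = tweet.lower().split()  # assumption: whitespace tokenization
        hashtags = [t for t in tokens if t.startswith("#")]
        words = [t for t in tokens if not t.startswith("#")]
        for h in hashtags:
            for t in words:
                counts[h][t] += 1
    return counts


def angular_distances(counts, tags):
    """Pairwise angle arccos(cos theta) between hashtag term vectors."""
    vocab = sorted({t for h in tags for t in counts[h]})
    index = {t: j for j, t in enumerate(vocab)}
    M = np.zeros((len(tags), len(vocab)))
    for i, h in enumerate(tags):
        for t, c in counts[h].items():
            M[i, index[t]] = c
    # Every tag taken from `counts` has at least one nonzero entry,
    # so the row norms are nonzero.
    M /= np.linalg.norm(M, axis=1, keepdims=True)
    cos = np.clip(M @ M.T, -1.0, 1.0)  # guard against rounding outside [-1, 1]
    np.fill_diagonal(cos, 1.0)
    return np.arccos(cos)


def classical_mds(D, k):
    """Classical MDS on a distance matrix; negative eigenvalues become 0."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    B = -0.5 * J @ (D ** 2) @ J          # double-centered squared distances
    vals, vecs = np.linalg.eigh(B)
    # Assumption: keep the k eigenvalues of largest magnitude.
    order = np.argsort(-np.abs(vals))[:k]
    kept = np.clip(vals[order], 0.0, None)
    return vecs[:, order] * np.sqrt(kept)


# Toy data, invented purely for illustration.
tweets = [
    "#python makes data analysis fun",
    "data analysis with #python and #numpy",
    "#numpy arrays speed up data analysis",
    "watching the game tonight #football",
    "great game by the home team #football #soccer",
    "#soccer game tonight",
]
counts = hashtag_term_counts(tweets)
# Keep the most frequent hashtags (the post uses the top 1000).
by_total = sorted(counts.items(), key=lambda kv: -sum(kv[1].values()))
tags = [h for h, _ in by_total[:1000]]
D = angular_distances(counts, tags)
X = classical_mds(D, k=2)  # the post uses k=60
```

The rows of `X` would then be the input to t-SNE. Columns whose eigenvalue was clipped to 0 are all zeros, which is how 60 requested dimensions can turn into 58 effective ones.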
Finally, when I plotted this, I tried to randomize the colors to give some chance of seeing something when all the tags are on top of each other. The PNG really does not do it justice; you should get the PDF version and zoom in.
Thanks for posting this, it's interesting stuff. Did you try just feeding the term frequency vectors directly into t-SNE?
Yes. The C++ t-SNE implementation does an admirable job once the data is loaded, actually, but it wants to read the entire dataset as dense vectors into memory before proceeding, so as I tried to scale up it became a problem. The MDS middleman helps with that.
I could also have tried the Matlab implementation and fed the distances in, which would have needed only 1000x1000 space. However, at that point I had a nice workflow going with the C++ version, so I ended up doing what I did.
Did you try generating a 3-D image instead? And the follow-up question: is there a simple rotate/view tool for 3-D matrices out there somewhere -- it really seems to be what t-SNE needs.
@doug: Yes, I did. I used Mathematica, which lets you rotate and zoom into 3D plots in the notebook interface (not exactly simple, sorry). It seemed to use the third dimension to separate the clusters further from each other, but this didn't result in a qualitatively different view. YMMV.
I also tried a 3D version using color as the third dimension. This turned out not to be very readable so I didn't post it.