One of the hot features in the new vowpal is Online LDA (thanks Matt Hoffman!). However, tweets are really tiny, so it's natural to ask whether models like LDA are effective for such short documents. Ramage et al. wondered the same thing:
"While LDA and related models have a long history of application to news articles and academic abstracts, one open question is if they will work on documents as short as Twitter posts and with text that varies greatly from the traditionally studied collections – here we find that the answer is yes."

So I took a sample of 4 million tweets, tokenized them, and fed them to vowpal asking for a 10 topic model. Running time: 3 minutes. I'll spare you the details of tokenization, except to note that on average a tweet ends up with 11 tokens (i.e., not many).
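To make the preprocessing concrete, here is a minimal sketch that turns one tweet per line into vowpal's unlabeled input format, one bag-of-words example per line. This is not the tokenizer actually used above; the regex, lowercasing, and per-token counting are just placeholder choices.

```python
# Minimal sketch: raw tweets -> vowpal wabbit LDA input (one example per line).
# The tokenizer here is a placeholder, not the one used for the experiments above.
import re
import sys
from collections import Counter

TOKEN_RE = re.compile(r"[#@]?\w+")  # keep words, #hashtags, @mentions

def tweet_to_vw(text):
    """Return one unlabeled VW example of the form '| token:count token:count ...'."""
    counts = Counter(tok.lower() for tok in TOKEN_RE.findall(text))
    if not counts:
        return None
    return "| " + " ".join(f"{tok}:{n}" for tok, n in counts.items())

if __name__ == "__main__":
    # e.g.  python tweets_to_vw.py < tweets.txt > tweets.vw
    for line in sys.stdin:
        example = tweet_to_vw(line)
        if example is not None:
            print(example)
```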
Although 10 topics is really too small to get anything but broad brushstrokes (I was just warming up), the result is funny so I thought I'd paste it here. Here are the top tokens for each topic, one topic per row.
arenas carter villain guiding hoooo ipods amir crazzy confessions snort #awesome
de la a y que el en no me mi es
the to a my is and in for of you on
na ka sa ko ng mo ba ang ni pa wa
di yg ga ada aja ya ini ke mau gw dan
#fb alpha atlantic 2dae orgy und tales ich fusion koolaid creme
ik de je een en met is in op niet het
maggie paula opposition gems oiii kemal industrial cancun ireng unplug controllers
9700 t0 bdae concentration 0ut day' armpit kb 2007 0f s0
yu ma ii lmaoo lml youu juss mee uu yeaa ohh

In addition to being a decent language detector, the model has ascertained what Twitter users consider awesome (snorting, ipod-toting villains in arenas) and what people choose to selectively tweet simultaneously to Facebook (orgies, creme, and koolaid).
Scaling up, a 100 topic model run on 35 million tweets took 3 hours and 15 minutes to complete on my laptop. Ramage et al. report training a roughly 800 topic Labeled LDA model on 8 million tweets in 96 machine-days (24 machines for 4 days). It's not quite apples-to-apples, but 96 machine-days is roughly 2,300 machine-hours versus a bit over 3 machine-hours here (on over four times as many tweets), so I figure the online LDA implementation in vowpal is somewhere between 2 and 3 orders of magnitude faster.
Congratulations on the new gig! I hope the blog continues -- I'm really enjoying reading it. Can you say anything about the new job? In particular, are you still working on decision making type problems?
The new job is at a startup which owns several popular Twitter mobile clients and, among other things, wants to use machine learning to make the Twitter experience way better.
So regarding decision type problems, absolutely. There are several recognized "problems" about the Twitter user experience that machine learning can help with, e.g., defining streams based upon content and not identity of tweeter; filtering and prioritizing twitter streams for more efficient consumption; and discovering new Twitter accounts to follow. And of course there are monetization issues which will leverage machine learning for efficiency.
Do you remember what VW options you used?
This was very early in the implementation, so I would advise consulting the latest documentation. Also the Vowpal Wabbit yahoo group (http://tech.groups.yahoo.com/group/vowpal_wabbit/) is extremely friendly and Matt Hoffman hangs out there, so you can get the best possible information by asking questions there.
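That said, for anyone landing here later, a representative invocation using the LDA options documented in current releases looks roughly like the sketch below. These are not the options used for the runs above, and the hyperparameter values and file names are placeholders.

```python
# Representative vw online LDA invocation (not the options used above; values
# and paths are placeholders -- check the current vw documentation).
import subprocess

subprocess.run([
    "vw", "tweets.vw",
    "--lda", "10",                      # number of topics
    "--lda_D", "4000000",               # rough number of documents in the corpus
    "--lda_alpha", "0.1",               # prior on per-document topic weights
    "--lda_rho", "0.1",                 # prior on per-topic word weights
    "--minibatch", "256",               # documents per online update
    "-b", "16",                         # feature-hash bits
    "--readable_model", "topics.txt",   # dump per-feature topic weights for inspection
], check=True)
```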