Sunday, January 2, 2011

Happiness is a Warm Tweet

Often obtaining training labels is a limiting factor for getting things done. Towards this end, Ramage et. al. make an interesting observation regarding Twitter: tweets are rich with annotation corresponding to Twitter conventions. Emoticons, hash-tags, and at-directions provide information about the emotional, semantic, and social content of the tweet.

So for kicks I decided to make a Twitter happiness detector. With a little help from Wikipedia I learned what the common emoticons are for indicating happiness and sadness. For any particular tweet, I called it happy if it contained at least one happy emoticon and no sad emoticons; I called it sad if it contained at least one sad emoticon and no happy emoticons; and otherwise I called it ambiguous. Most tweets are ambiguous, and the goal is to generalize to them; but for training I'll only use the unambiguous tweets.

There are several features in the new vowpal wabbit that come together and make this problem pleasant to attack: support for sparse high-cardinality features; model complexity control via the hashing trick; the --adaptive flag which is like an automated improved tf-idf weighting; native support for n-gram expansion; and of course, the ability to churn through dozens of millions of tweets on my laptop before lunch is over. Training on a week of tweets and testing on a future day of tweets leads to an AUC of 0.87, i.e., given a random happy and sad tweet the resulting regressor is 87% likely to score the happy tweet as more happy than the sad tweet. (Note: one has to remove emoticons when parsing, otherwise, the AUC is 1. The point here is to generalize to ambiguous tweets.) At this point I'm not leveraging the LDA support in vowpal, I'm just encoding each token nominally, so presumably this can be improved.

I took another random sample of 10000 tweets from even further in the future, which turn out to be mostly ambiguous tweets since that is what most tweets are. I then ranked them, and here are the 10 happiest and 10 saddest tweets. Just to reiterate, emoticons are stripped from the tweets during parsing: that several of the happiest and saddest tweets have emoticons in them indicates the ease of predicting emoticon presence from the other tokens in the tweet.

10 Happiest Tweets

@WRiTExMiND no doubt! <--guess who I got tht from? Bwahaha anyway doe I like surprising people it's kinda my thing so ur welcome! And hi :)
@skvillain yeh wiz is dope, got his own lil wave poppin! I'm fuccin wid big sean too he signed to kanye label g.o.o.d music
And @pumahbeatz opened for @MarshaAmbrosius & blazed! So proud of him! Go bro! & Marsha was absolutely amazing! Awesome night all around. =)
Awesome! RT @robscoms: Great 24 hours with nephews. Watched Tron, homemade mac & cheese for dinner, Wii, pancakes & Despicable Me this am!
Good Morning 2 U Too RT @mzmonique718: Morningggg twitt birds!...up and getting ready for church...have a good day and LETS GO GIANTS!
Goodmorning #cleveland, have a blessed day stay focused and be productive and thank god for life
AMEN!!!>>>RT @DrSanlare: Daddy looks soooo good!!! God is amazing! To GOD be the glory and victory #TeamJesus Glad I serve an awesome God
AGREED!! RT @ILoveElizCruz: Amen to dat... We're some awesome people! RT @itsVonnell_Mars: @ILoveElizCruz gotta love my sign lol
#word thanks! :) RT @Steph0e: @IBtunes HAppy Birthday love!!! =) still a fan of ya movement... yay you get another year to be dope!!! YES!!
Happy bday isaannRT @isan_coy: Selamatt ulang tahun yaaa RT @Phitz_bow: Selamat siangg RT @isan_coy: Slamat pagiiii

10 Saddest Tweets

Migraine, sore throat, cough & stomach pains. Why me God?
Ik moet werken omg !! Ik lig nog in bed en ben zo moe .. Moet alleen opstaan en tis koud buitn :(
I Feel Horrible ' My Voice Is Gone Nd I'm Coughing Every 5 Minutes ' I Hate Feeling Like This :-/
SMFH !!! Stomach Hurting ; Aggy ; Upset ; Tired ;; Madd Mixxy Shyt Yo !
Worrying about my dad got me feeling sick I hate this!! I wish I could solve all these problems but I am only 1 person & can do so much..
Malam2 menggigil+ga bs napas+sakit kepala....badan remuk redam *I miss my husband's hug....#nangismanja#
Waking up with a sore throat = no bueno. Hoping someone didn't get me ill and it's just from sleeping. D:
Aaaa ini tenggorokan gak enak, idung gatel bgt bawaannya pengen bersin terus. Calon2 mau sakit nih -___-
I'm scared of being alone, I can't see to breathe when I am lost in this dream, I need you to hold me?
Why the hell is suzie so afraid of evelyn! Smfh no bitch is gonna hav me scared I dnt see it being possible its not!

Observations


First, emoticons are universal, which allows this trick to work for many different languages (I think? I can't read some of those).

Second, Twitter data is very clean. Silly ideas like this never worked with web data, because there was always this huge hurdle of separating the content from other parts of the payload (navigational, structural, etc.). In addition web pages are big where tweets are short, so tweets can have a clear emotional tone where web pages might have multiple.

Third, vowpal should be in the list of tools that people try when faced with a text classification problem. This took less than half a day to put together, and most of that was data wrangling.

2 comments:

  1. FWIW, Google Translate identified the ones I couldn't read as Indonesian, Dutch, and 2 more Indonesian. I wonder if it's a coincidence that Indonesia used to be (more or less) the Dutch East Indies? And are there that many Indonesian twitterers relative to English-speakers, or are they that much more detectably emotional?

    ReplyDelete
  2. The company I'm working for owns several Twitter mobile apps. My Twitter data comes from there, not from Twitter, and I've noticed there is some peculiarities in the data. In particular they have a client which is very popular in Indonesia, so I have a lot of geo tags from that area of the globe.

    ReplyDelete