LDA on Graphs
The strategy is to treat the edge set at each vertex of the social graph as a document and then apply LDA to the resulting document corpus, similar to Zhang et al. Since I'm considering Twitter's social graph, the latent factors might represent interests or communities, but I don't actually care as long as the resulting features improve my supervised classifiers.
When LDA was first applied in Computer Vision, it was used essentially without modification, with some success. Then the generative model was adapted to the problem domain to improve performance (e.g., in the case of Computer Vision, by incorporating spatial structure). Things are done in this order for a very practical reason: when you apply the standard generative model, you get to leverage someone else's optimized and correct implementation! For the same reason I'm sticking with the original LDA here, but I've noticed some aspects that are not a perfect fit.
- On directed social graphs (such as Twitter) there are two kinds of edges, which is analogous to two different kinds of tokens being present in a document. LDA only has one token type. Possibly this can be worked around by prefixing every edge with a '+' or '-' indicating direction. In practice I sidestep this problem by only modeling the outgoing edges (i.e., the set of people that someone follows).
- An edge can only exist once in an edge set, whereas with vanilla LDA a token can occur multiple times in a text document. Taking into account this negative correlation between edge emission probabilities might improve results.
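To make the edge-set-as-document idea concrete, here is a minimal sketch of how a directed edge list might be turned into a corpus. The function name `edge_sets_to_documents`, the `include_incoming` flag, and the toy graph are all hypothetical and exist only for illustration; this is not the code used for the results in this post. Outgoing edges get a '+' prefix and, optionally, incoming edges get a '-' prefix, as suggested in the first bullet. Using a set per vertex reflects the second bullet: each edge shows up at most once.

```python
from collections import defaultdict


def edge_sets_to_documents(edges, include_incoming=False):
    """Turn directed (follower, followee) pairs into one token list per vertex.

    Outgoing edges become tokens prefixed with '+'. If include_incoming is
    True, incoming edges are also added, prefixed with '-'.
    """
    docs = defaultdict(set)  # a set, so a repeated edge is stored only once
    for follower, followee in edges:
        docs[follower].add("+" + followee)
        if include_incoming:
            docs[followee].add("-" + follower)
    # Sort the tokens so the output is deterministic.
    return {user: sorted(tokens) for user, tokens in docs.items()}


# Hypothetical toy graph of (follower, followee) pairs.
edges = [
    ("alice", "nytimes"),
    ("alice", "TheOnion"),
    ("bob", "nytimes"),
    ("bob", "alice"),
    ("alice", "nytimes"),  # duplicate edge, collapses into the existing one
]

outgoing_only = edge_sets_to_documents(edges)
both_directions = edge_sets_to_documents(edges, include_incoming=True)
```

Each token list can then be handed to an off-the-shelf LDA implementation as if it were the words of a text document. With `include_incoming=False` only vertices that follow someone get a document, which matches the outgoing-edges-only shortcut described above.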
Broad Social Topics
Even though I don't actually care about understanding the latent factors, it makes for entertaining blog fodder. So now for the fun. I ran a 10-topic LDA model over the edge sets from a random sample of Twitter users, in order to get a broad overview of the graph structure. Here are the top 10 most likely Twitter accounts for each topic:
1 | Ugglytruth globovision LuisChataing juanes tusabiasque AlejandroSanz Calle13Oficial shakira Erikadlv ChiguireBipolar ricky_martin BlackberryVzla miabuelasabia CiudadBizarra ElUniversal chavezcandanga luisfonsi ElChisteDelDia noticias24 |
2 | detikcom SoalCINTA sherinamunaf Metro_TV soalBOWBOW radityadika kompasdotcom TMCPoldaMetro IrfanBachdim10 ayatquran agnezmo pepatah AdrieSubono desta80s cinema21 fitrop vidialdiano ihatequotes sarseh |
3 | RevRunWisdom NICKIMINAJ drakkardnoir TreySongz kanyewest chrisbrown iamdiddy myfabolouslife KevinHart4real LilTunechi KimKardashian MissKeriBaby 50cent RealWizKhalifa lilduval MsLaurenLondon BarackObama Ludacris Tyrese |
4 | justinbieber radityadika Poconggg IrfanBachdim10 snaptu AdrieSubono MentionKe TheSalahGaul vidialdiano FaktanyaAdalah TweetRAMALAN soalBOWBOW unfollowr disneywords DamnItsTrue SoalCINTA sherinamunaf widikidiw PROMOTEfor |
5 | NICKIMINAJ KevinHart4real TreySongz RevRunWisdom RealWizKhalifa chrisbrown drakkardnoir Wale kanyewest lilduval Sexstrology myfabolouslife LilTunechi ZodiacFacts 106andpark BarackObama Tyga FreakyFact KimKardashian |
6 | ConanOBrien cnnbrk shitmydadsays BarackObama THE_REAL_SHAQ TheOnion jimmyfallon nytimes StephenAtHome BreakingNews mashable google BillGates rainnwilson twitter espn ochocinco TIME SarahKSilverman |
7 | ladygaga KimKardashian katyperry taylorswift13 britneyspears PerezHilton KhloeKardashian aplusk TheEllenShow KourtneyKardash rihanna jtimberlake justinbieber RyanSeacrest ParisHilton nicolerichie LaurenConrad selenagomez Pink |
8 | iambdsami Z33kCare4women DONJAZZYMOHITS MriLL87WiLL chineyIee NICKIMINAJ MrStealYaBitch FreddyAmazin ProducerHitmann MI_Abaga DoucheMyCooch WomenLoveBrickz Uncharted_ WhyYouMadDoe MrsRoxxanne I_M_Ronnie GuessImLucky BlitheDemeanor Tahtayy |
9 | Woodytalk vajiramedhi chocoopal PM_Abhisit js100radio kalamare Trevornoah GarethCliff suthichai Domepakornlam ploy_chermarn crishorwang paulataylor Noom_Kanchai jjetrin Khunnie0624 ThaksinLive DJFreshSA Radioblogger |
10 | myfabolouslife IAMBIGO NICKIMINAJ GuessImLucky DroManoti GFBIVO90 Sexstrology FASTLANE_STUDDA PrettyboiSunny Ms_MAYbeLLine ZodiacFacts FlyLikeSpace RobbRF50PKF CLOUD9ACE Jimmy_Smacks LadieoloGistPKF TreySongz Prince_Japan GerardThaPrince |
And yes, this data was collected prior to Charlie Sheen's meteoric rise.
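For anyone wanting to produce a listing like the one above from their own model, here is a rough sketch of pulling the most probable accounts out of each topic. It assumes the fitted model can be exported as one dictionary per topic, mapping account name to probability; that layout, the function name `top_accounts`, and the probabilities below are made up for illustration and are not taken from my model.

```python
import heapq


def top_accounts(topic_account_probs, k=10):
    """Return the k most probable accounts for each topic, most probable first."""
    return [
        heapq.nlargest(k, probs, key=probs.get)
        for probs in topic_account_probs
    ]


# Hypothetical per-topic distributions over accounts (invented numbers).
topics = [
    {"nytimes": 0.5, "TheOnion": 0.3, "ladygaga": 0.2},
    {"ladygaga": 0.6, "katyperry": 0.3, "nytimes": 0.1},
]

top2 = top_accounts(topics, k=2)
```

An account can appear in more than one topic's list, as several do in the table above, because each topic is a separate distribution over the same set of accounts.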
hey you might be interested in my research http://www.akshaybhat.com/LPMR/
I have used complete Twitter social network from 2009 (36 million users), and implemented a community detection algorithm on Hadoop.