Eric Bonfadini
You can tell a man by his tweets

Some time ago I read a tweet by Kenneth Reitz, a very well-known Python developer I follow on Twitter, asking:

Starting from this, I decided to analyze some tweets from pretty popular Python devs in order to understand how they use Twitter, what they tweet about, and what can be gathered from the Twitter APIs alone.
Obviously, you can apply the same analysis to a different list of Twitter accounts.

Setting up the environment

For my analysis I set up a Python 3.6 virtual environment with tweepy (for the Twitter APIs) and pandas (for data processing) as the main libraries.

Some extra libraries will be introduced later on, along with the steps where I used them.

In order to access the Twitter APIs I registered my app and then provided the tweepy library with my consumer_key, consumer_secret, access_token and access_token_secret.
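Here's a minimal sketch of that setup; the placeholder strings stand in for your app's real credentials:

```python
import tweepy

# Placeholder credentials from the registered Twitter app
consumer_key = "..."
consumer_secret = "..."
access_token = "..."
access_token_secret = "..."

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

# wait_on_rate_limit makes tweepy sleep through rate-limit windows
api = tweepy.API(auth, wait_on_rate_limit=True)
```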

We're now ready to get some real data from Twitter!

Choosing a list of Twitter accounts

First of all, I chose a list of 8 popular Python devs, starting from the top 360 most-downloaded packages on PyPI and selecting some libraries I know or use daily.

Here's my final list, with the Twitter accounts and the libraries (from the above-mentioned list) these guys are known for:

  • @kennethreitz (Kenneth Reitz): requests
  • @mitsuhiko (Armin Ronacher): Flask
  • @zzzeek (Mike Bayer): SQLAlchemy
  • @teoliphant (Travis Oliphant): NumPy, SciPy
  • @benoitc (Benoît Chesneau): Gunicorn
  • @asksol (Ask Solem): Celery
  • @wesmckinn (Wes McKinney): pandas
  • @cournape (David Cournapeau): NumPy, SciPy

Getting data from Twitter

I got all the data with two endpoints only:

  • with a call to lookup users I could get all the information about the accounts (creation date, description, counts, location, etc.)
  • with a call to user timeline I could get the tweets of a single user and all the information related to every single tweet; I configured the call to also return retweets and replies

I saved the results of the two calls in two pandas dataframes to ease the data processing, and then into CSV files to be used as the starting point for the next steps without calling the Twitter APIs each time.
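A rough sketch of the two calls and the CSV caching, assuming the `api` object from the setup above; the exact columns I keep here are illustrative:

```python
import pandas as pd

usernames = ["kennethreitz", "mitsuhiko", "zzzeek", "teoliphant",
             "benoitc", "asksol", "wesmckinn", "cournape"]

# One lookup call returns the full profile for every account
users = api.lookup_users(screen_names=usernames)
users_df = pd.DataFrame([u._json for u in users])

# Paginate through each timeline (retweets and replies included)
tweets = []
for name in usernames:
    for status in tweepy.Cursor(api.user_timeline, screen_name=name,
                                count=200, include_rts=True,
                                tweet_mode="extended").items():
        tweets.append({"username": name,
                       "created_at": status.created_at,
                       "full_text": status.full_text,
                       "favorite_count": status.favorite_count,
                       "retweet_count": status.retweet_count,
                       # raw HTML anchor, parsed in a later step
                       "source": status._json["source"]})
tweets_df = pd.DataFrame(tweets)

# Cache to CSV so later steps don't hit the API again
users_df.to_csv("users.csv", index=False)
tweets_df.to_csv("tweets.csv", index=False)
```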

Preprocessing tweets

The users dataframe contained all the information I needed; I just created three more columns (see the sketch after the list):

  • a followers/following ratio, a sort of "popularity" indicator
  • a tweets-per-day ratio, dividing the total number of tweets by the number of days since the creation of the account
  • the coordinates derived from the location, if available, using Geopy. @benoitc doesn't have a location, while @zzzeek has a generic "northeast", geolocated in Nebraska :-)
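A sketch of the three derived columns; it assumes the raw API fields were already renamed to the column names shown in the table below (e.g. friends_count to following_count), and the Nominatim user agent is a hypothetical name (mind the service's rate limits):

```python
import pandas as pd
from geopy.geocoders import Nominatim

users_df["followers/following"] = (users_df["followers_count"]
                                   / users_df["following_count"])

age_days = (pd.Timestamp.now()
            - pd.to_datetime(users_df["created_at"])).dt.days
users_df["tweets_per_day"] = users_df["total_tweets"] / age_days

geolocator = Nominatim(user_agent="tweet-analysis")  # hypothetical agent

def to_coordinates(location):
    """Geocode a free-text location, returning None when it's missing."""
    if not isinstance(location, str) or not location.strip():
        return None
    match = geolocator.geocode(location)
    return [match.latitude, match.longitude] if match else None

users_df["location_coo"] = users_df["location"].apply(to_coordinates)
```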

Here's the final users dataframe:

| screen_name | name | verified | description | created_at | location | location_coo | time_zone | total_tweets | favourites_count | followers_count | following_count | listed_count | followers/following | tweets_per_day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| kennethreitz | Kenneth Reitz | False | @DigitalOcean & @ThePSF. Creator of Requests: HTTP for Humans. I design art — with code, cameras, musical instruments, and the English language. | 2009-06-24 23:28:06 | Winchester, VA | [39.1852184, -78.1652404] | Eastern Time (US & Canada) | 58195 | 26575 | 18213 | 480 | 822 | 37.94375 | 17.950339296730412 |
| mitsuhiko | Armin Ronacher | True | Creator of Flask; Building stuff at @getsentry; prev @fireteamltd, @splashdamage — writing and talking about system architecture, API design and lots of Python | 2008-02-01 23:12:59 | Austria | [47.2000338, 13.199959] | Vienna | 31774 | 2059 | 21801 | 593 | 941 | 36.76391231028668 | 8.470807784590775 |
| zzzeek | mike bayer | False |  | 2008-06-11 19:22:19 | northeast | [41.7370229, -99.5873816] | Quito | 14771 | 1106 | 3013 | 209 | 194 | 14.416267942583731 | 4.080386740331492 |
| teoliphant | Travis Oliphant | False | Creator of SciPy, NumPy, and Numba; founder and Director of Anaconda, Inc. Founder of NumFOCUS. CEO of Quansight | 2009-04-17 20:04:57 | Austin, TX | [30.2711286, -97.7436995] | Central Time (US & Canada) | 3875 | 1506 | 18052 | 483 | 746 | 37.374741200828154 | 1.1706948640483383 |
| benoitc | benoît chesneau | False | web craftsman | 2007-02-13 16:53:37 |  |  | Paris | 27172 | 548 | 1971 | 704 | 247 | 2.799715909090909 | 6.620857699805068 |
| asksol | Ask Solem | False | sound, noise, stream processing, distributed systems, data, open source python, etc. Works at @robinhoodapp 🌳🌳🌴🌵 | 2007-11-11 18:56:33 | San Francisco, CA | [45.4423543, -73.4373087] | Pacific Time (US & Canada) | 3249 | 191 | 2509 | 513 | 126 | 4.89083820662768 | 0.8476389251239238 |
| wesmckinn | Wes McKinney | True | Data science toolmaker at https://t.co/YVn0VFqgj0. Creator of pandas, @IbisData. @ApacheArrow @ApacheParquet PMC. Wrote Python for Data Analysis. Views my own | 2010-02-18 21:01:15 | New York, NY | [40.7306458, -73.9866136] | Eastern Time (US & Canada) | 7749 | 3021 | 33130 | 784 | 1277 | 42.25765306122449 | 2.5804195804195804 |
| cournape | David Cournapeau | False | Python and Stats geek. Former NumPy/Scipy core contributor. Lead the ML engineering team @cogentlabs Occasional musings on economics/music/Japanese culture | 2010-06-14 03:17:21 | Tokyo, Japan | [34.6968642, 139.4049033] | Amsterdam | 15577 | 505 | 800 | 427 | 112 | 1.873536299765808 | 5.395566331832352 |

The tweets dataframe, on the contrary, needed some extra preprocessing.

First of all, I discovered an annoying limitation of the Twitter user timeline API: there's a maximum number of tweets that can be returned (roughly 3200, including retweets and replies). Therefore I decided to group the tweets by username and get the oldest tweet date for each user:

| username | count | min | max |
|---|---|---|---|
| asksol | 2991 | 2009-07-19 19:24:33 | 2018-05-10 14:58:17 |
| benoitc | 3199 | 2017-02-21 14:55:37 | 2018-05-11 19:36:21 |
| cournape | 3179 | 2017-06-12 19:13:20 | 2018-05-11 16:55:39 |
| kennethreitz | 3200 | 2017-08-26 20:48:35 | 2018-05-11 21:07:46 |
| mitsuhiko | 3226 | 2017-06-14 13:23:57 | 2018-05-11 19:26:07 |
| teoliphant | 3201 | 2013-02-19 03:54:16 | 2018-05-11 16:48:39 |
| wesmckinn | 3205 | 2014-01-26 17:45:07 | 2018-05-03 14:51:29 |
| zzzeek | 3214 | 2015-05-05 13:35:38 | 2018-05-11 14:17:02 |
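The grouping above is a pandas one-liner, assuming `created_at` has been parsed to datetimes:

```python
tweets_df["created_at"] = pd.to_datetime(tweets_df["created_at"])
ranges = (tweets_df.groupby("username")["created_at"]
          .agg(["count", "min", "max"]))
```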

Then I filtered out all the tweets older than the most recent of those first dates (2017-08-26 20:48:35).
With these data, @kennethreitz drives the cut date because he tweets a lot more than some of the other users; but this way we at least get the same timeframe for all the users and can compare tweets from the same period.
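Sketched with the `ranges` frame from above:

```python
cutoff = ranges["min"].max()  # 2017-08-26 20:48:35, set by @kennethreitz
tweets_df = tweets_df[tweets_df["created_at"] >= cutoff]
```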

After this filter I was left with 25418-14470=10948 tweets, split as follows:

| username | count | min | max |
|---|---|---|---|
| asksol | 52 | 2017-09-11 18:36:19 | 2018-05-10 14:58:17 |
| benoitc | 1849 | 2017-08-26 20:49:13 | 2018-05-11 19:36:21 |
| cournape | 1888 | 2017-08-27 01:02:11 | 2018-05-11 16:55:39 |
| kennethreitz | 3200 | 2017-08-26 20:48:35 | 2018-05-11 21:07:46 |
| mitsuhiko | 2328 | 2017-08-27 08:22:19 | 2018-05-11 19:26:07 |
| teoliphant | 443 | 2017-08-26 22:57:23 | 2018-05-11 16:48:39 |
| wesmckinn | 591 | 2017-08-28 18:07:08 | 2018-05-03 14:51:29 |
| zzzeek | 596 | 2017-08-26 22:14:11 | 2018-05-11 14:17:02 |

Other preprocessing steps (see the code sketch below):

  • I parsed the source information with Beautiful Soup, because it contained HTML entities
  • I removed smart quotes from the text
  • I converted the text to lower case
  • I removed URLs and numbers

After these steps I filtered out all the tweets left with empty text (i.e. tweets containing only URLs, etc.) and got 10948-125=10823 tweets.

I finally created a new column containing the "tweet type" (standard, reply or retweet) and another column with the tweet length.
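In code, these steps look roughly like this; the regexes and the prefix-based tweet-type check are a simplified sketch (the API's retweeted_status and in_reply_to_status_id fields are a more robust way to detect the tweet type):

```python
import re
from bs4 import BeautifulSoup

def clean_text(text):
    for quote in ("\u2018", "\u2019", "\u201c", "\u201d"):
        text = text.replace(quote, " ")        # smart quotes
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)   # URLs
    text = re.sub(r"\d+", "", text)            # numbers
    return text.strip()

# The raw source field is an HTML anchor, e.g. <a href="...">IFTTT</a>
tweets_df["source"] = tweets_df["source"].apply(
    lambda s: BeautifulSoup(s, "html.parser").get_text())

tweets_df["text_clean"] = tweets_df["full_text"].apply(clean_text)
tweets_df = tweets_df[tweets_df["text_clean"].str.len() > 0]

def tweet_type(full_text):
    if full_text.startswith("RT @"):
        return "retweet"
    if full_text.startswith("@"):
        return "reply"
    return "standard"

tweets_df["tweet_type"] = tweets_df["full_text"].apply(tweet_type)
tweets_df["tweet_len"] = tweets_df["full_text"].str.len()
```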

Here are some columns from the final tweets dataframe (first 5 rows):

| username | created_at | full_text | text_clean | lang | favorite_count | retweet_count | source | tweet_type | tweet_len |
|---|---|---|---|---|---|---|---|---|---|
| kennethreitz | 2018-05-11 21:07:46 | RT @IAmAru: Trio is @kennethreitz-approved. #PyCon2018 | rt trio is approved | en | 0.0 | 1 | Tweetbot for iOS | retweet | 54 |
| kennethreitz | 2018-05-10 21:55:35 | If you want to say hi, swing by the DigitalOcean booth during the opening reception! #PyCon2018 | if you want to say hi swing by the digitalocean booth during the opening reception | en | 20.0 | 2 | Tweetbot for iOS | standard | 95 |
| kennethreitz | 2018-05-10 20:34:20 | @dbinoj 24x 1x dynos right now | x x dynos right now | en | 0.0 | 0 | Tweetbot for iOS | reply | 30 |
| kennethreitz | 2018-05-10 20:11:37 | Swing by the @IndyPy booth for your chance to win a signed copy of The Hitchhiker's Guide to Python! ✨🍰✨ https://t.co/CZhd2If5s0 https://t.co/3kUaqu5TMX | swing by the booth for your chance to win a signed copy of the hitchhiker s guide to python | en | 25.0 | 3 | IFTTT | standard | 152 |
| kennethreitz | 2018-05-10 13:53:31 | Let's do this https://t.co/6xLCE4WCqA https://t.co/ERiMmffe8L | let s do this | en | 22.0 | 1 | IFTTT | standard | 61 |

Exploratory Data Analysis

The users dataframe itself already shows some insights:

  • There are only two accounts with the verified flag: @mitsuhiko and @wesmckinn
  • @wesmckinn, @kennethreitz, @teoliphant and @mitsuhiko are the most popular accounts in the list, according to my "popularity" indicator [chart: followers/following ratio]
  • Since the creation of his account, @kennethreitz has written at least twice as many tweets per day as the other devs in the list [chart: tweets per day]
  • Most of the accounts in the list are located in the US; I used Folium to create a map showing the locations (sketched below) [map]
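The map takes only a few lines; a sketch assuming the `location_coo` column holds [lat, lon] pairs as built earlier (the output file name is hypothetical):

```python
import folium

m = folium.Map(location=[30, -40], zoom_start=3)
for _, row in users_df.dropna(subset=["location_coo"]).iterrows():
    folium.Marker(row["location_coo"], popup=row["screen_name"]).add_to(m)
m.save("devs_map.html")
```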

The tweets dataframe, instead, needs some manipulation before we can gather good insights.

First of all, let's check the tweet "style" of each account. From the following chart we can see, for example, that @cournape retweets a lot, while @mitsuhiko replies a lot. [chart: tweet types per user]

We can also group by username and tweet type, and chart the mean tweet length. @kennethreitz, for example, writes shorter replies than standard tweets, while @teoliphant writes longer tweets than the other guys (exceeding the old 140-character limit). [chart: mean tweet length]
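Both charts come straight out of pandas; a sketch with matplotlib:

```python
import matplotlib.pyplot as plt

# Tweet "style": counts of standard/reply/retweet per user
pd.crosstab(tweets_df["username"], tweets_df["tweet_type"]).plot(
    kind="bar", stacked=True)

# Mean tweet length per user and tweet type
(tweets_df.groupby(["username", "tweet_type"])["tweet_len"]
 .mean().unstack().plot(kind="bar"))
plt.show()
```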

Ok, now let's filter out the retweets and focus on the machine-detected language of standard tweets and replies. The five most common languages are: English, German, French, undefined and a rather surprising Tagalog (ISO 639-1 code "tl", maybe an error in auto-detection?). Most of the tweets are in English; @mitsuhiko tweets a lot in German, while @benoitc tweets in French. [chart: languages per user]

So, let's just select tweets in English or with undefined language: all the next charts consider only tweets and replies in English (but you can obviously tune your analysis differently).
Let's group by username and get statistics about the number of favorites/retweets per user:

| username | count | favorite_count max | favorite_count mean | favorite_count std | retweet_count max | retweet_count mean | retweet_count std |
|---|---|---|---|---|---|---|---|
| asksol | 46 | 41.0 | 1.608695652173913 | 6.111840097055933 | 3.0 | 0.10869565217391304 | 0.48204475908203187 |
| benoitc | 1009 | 30.0 | 0.6531219028741329 | 1.8313280878865186 | 17.0 | 0.13676907829534193 | 0.7644934696088941 |
| cournape | 214 | 60.0 | 1.2757009345794392 | 4.449367547428712 | 25.0 | 0.205607476635514 | 1.7481758044670788 |
| kennethreitz | 2637 | 3932.0 | 10.062571103526736 | 82.09998594317476 | 2573.0 | 2.620781190747061 | 50.79602602503255 |
| mitsuhiko | 1547 | 752.0 | 9.657401422107304 | 41.06463543974671 | 220.0 | 1.8526179702650292 | 9.932970595417615 |
| teoliphant | 186 | 808.0 | 26.080645161290324 | 69.54002504187612 | 134.0 | 7.806451612903226 | 17.085639972995896 |
| wesmckinn | 433 | 2081.0 | 45.750577367205544 | 142.2699008271913 | 695.0 | 12.270207852193995 | 48.083342617014644 |
| zzzeek | 439 | 85.0 | 2.173120728929385 | 6.417876507767421 | 28.0 | 0.44874715261959 | 1.896040581838119 |
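A sketch of the aggregation behind this table, applying the language filter first:

```python
# Standard tweets and replies only, in English or undefined language
en_df = tweets_df[(tweets_df["tweet_type"] != "retweet")
                  & (tweets_df["lang"].isin(["en", "und"]))]

stats = (en_df.groupby("username")[["favorite_count", "retweet_count"]]
         .agg(["count", "max", "mean", "std"]))
```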

From this table we can see that:

  • @kennethreitz has the most retweeted and most favorited tweet in the dataframe
  • @wesmckinn has the second most retweeted and favorited tweet in the dataframe
  • @wesmckinn has the highest mean values for both retweet count and favorite count

Since @wesmckinn also has the highest followers count, how do these stats change if we normalize them by the followers count?
Obviously a tweet can get favorited/retweeted even by non-followers, but this normalization should produce fairer results, because the higher the followers count, the more a tweet will probably be viewed.
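As a sketch, the normalization divides each tweet's counts by its author's followers count (times 100), then re-runs the same aggregation:

```python
followers = users_df.set_index("screen_name")["followers_count"]

en_df = en_df.copy()
for col in ("favorite_count", "retweet_count"):
    en_df[col + "_perc"] = (en_df[col]
                            / en_df["username"].map(followers) * 100)

norm_stats = (en_df.groupby("username")
              [["favorite_count_perc", "retweet_count_perc"]]
              .agg(["count", "max", "mean", "std"]))
```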

| username | count | favorite_count_perc max | favorite_count_perc mean | favorite_count_perc std | retweet_count_perc max | retweet_count_perc mean | retweet_count_perc std |
|---|---|---|---|---|---|---|---|
| asksol | 46 | 1.634117178158629 | 0.06411700486942658 | 0.243596655920922 | 0.11956954962136308 | 0.0043322300587450395 | 0.019212624913592345 |
| benoitc | 1009 | 1.5220700152207 | 0.0331365754882869 | 0.09291365235345102 | 0.8625063419583967 | 0.0069390704360904 | 0.038787086230791176 |
| cournape | 214 | 7.5 | 0.1594626168224299 | 0.556170943428589 | 3.125 | 0.02570093457943925 | 0.21852197555838485 |
| kennethreitz | 2637 | 21.588974908032725 | 0.055249388368344046 | 0.4507768404061633 | 14.127271728984791 | 0.014389618353632417 | 0.27889982992935036 |
| mitsuhiko | 1547 | 3.4493830558231275 | 0.04429797450624903 | 0.18836124691411713 | 1.009128021650383 | 0.008497857760034094 | 0.04556199530029643 |
| teoliphant | 186 | 4.475958342565921 | 0.14447510060541938 | 0.38522061290647097 | 0.7423000221582097 | 0.043244247800261586 | 0.09464679798911974 |
| wesmckinn | 433 | 6.281316027769393 | 0.138094106149126 | 0.42942922072801465 | 2.0977965590099608 | 0.037036546490172004 | 0.1451353535074393 |
| zzzeek | 439 | 2.8211085297046132 | 0.07212481675836008 | 0.21300619010180605 | 0.9293063391968137 | 0.01489369905806803 | 0.06292866185987782 |

After the normalization we can see that @cournape and @teoliphant get higher mean values in terms of retweets and favorites.

We can also see how the monthly number of tweets changes over time, per user. From the following chart we can see, for example, that @kennethreitz tweeted a lot in September 2017 (more than 800 tweets). [chart: monthly tweets per user]
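Sketched with a monthly period groupby:

```python
# Monthly tweet counts per user
monthly = (tweets_df.groupby(["username",
                              tweets_df["created_at"].dt.to_period("M")])
           .size().unstack(level=0).fillna(0))
monthly.plot()
```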

Or we can even see which tools are used the most to tweet, per user. [chart: tweet sources per user]
I grouped a lot of rarely used tools under "Other" (Tweetbot for iOS, Twitter for iPad, OS X, Instagram, Foursquare, Facebook, LinkedIn, Squarespace, Medium, Buffer).
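A sketch of the grouping; keeping the five most used clients is my own illustrative cutoff, not necessarily the one behind the chart:

```python
top_sources = tweets_df["source"].value_counts().nlargest(5).index
tweets_df["source_grouped"] = tweets_df["source"].where(
    tweets_df["source"].isin(top_sources), "Other")

pd.crosstab(tweets_df["username"], tweets_df["source_grouped"]).plot(
    kind="bar", stacked=True)
```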

Finally, we can build a kind of punchcard chart for each user, aggregating tweet dates by day of the week and hour of the day. [chart: punchcards per user]
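The punchcard aggregation is a weekday/hour groupby; a sketch for a single user:

```python
user_df = tweets_df[tweets_df["username"] == "kennethreitz"]
punch = (user_df.groupby([user_df["created_at"].dt.dayofweek,
                          user_df["created_at"].dt.hour])
         .size().unstack(fill_value=0))
# `punch` is a 7x24 grid (Monday=0 .. Sunday=6) ready for a heatmap
```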

Topics

But what are the devs in the list talking about?

Let's start with a simple visualization: a word cloud.
After some basic preprocessing of the text from standard tweets only (tokenization, POS tagging, stopwords removal, bigrams, etc.), I grouped the tweets by username and got the most common words for each one:

| username | tweet count | most_common |
|---|---|---|
| @asksol | 6 | [('python', 3), ('enjoy', 1), ('seeing', 1), ('process', 1), ('handle', 1)] |
| @benoitc | 488 | [('like', 40), ('erlang', 33), ('use', 31), ('code', 30), ('people', 30)] |
| @cournape | 43 | [('japan', 8), ('japanese', 6), ('#pyconjp', 6), ('shibuya', 4), ('python', 4)] |
| @kennethreitz | 1109 | [('pipenv', 157), ('python', 84), ('new', 77), ('requests', 64), ('released', 53)] |
| @mitsuhiko | 399 | [('rust', 53), ('like', 36), ('people', 27), ('new', 25), ('way', 20)] |
| @teoliphant | 113 | [('#pydata', 39), ('#python', 36), ('@anacondainc', 18), ('great', 18), ('new', 15)] |
| @wesmckinn | 129 | [('@apachearrow', 32), ('data', 21), ('pandas', 16), ('python', 12), ('new', 10)] |
| @zzzeek | 170 | [('like', 14), ('years', 11), ('python', 10), ('time', 10), ('use', 9)] |

Then I created a word cloud for each username using word_cloud. All the guys are talking about Python or their own libraries (like pipenv, pandas, sqlalchemy, etc.); we can also spot some other programming languages, like Erlang and Rust. [image: word clouds per user]
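A sketch of the pipeline, assuming the NLTK data (punkt, stopwords) is downloaded; my tokenizer drops non-alphabetic tokens, so hashtags and mentions would need extra handling to survive as in the table above:

```python
from collections import Counter

import nltk
from nltk.corpus import stopwords
from wordcloud import WordCloud

stop_words = set(stopwords.words("english"))

def tokens(text):
    return [t for t in nltk.word_tokenize(text)
            if t.isalpha() and t not in stop_words]

standard = tweets_df[tweets_df["tweet_type"] == "standard"]
for name, group in standard.groupby("username"):
    counts = Counter(t for text in group["text_clean"]
                     for t in tokens(text))
    wc = WordCloud(width=800, height=400).generate_from_frequencies(counts)
    wc.to_file(f"cloud_{name}.png")
```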

The next step is to identify real topics, using an LdaModel from Gensim. I used standard tweets from the two accounts with the highest number of tweets (@kennethreitz and @mitsuhiko) and performed the same preprocessing used for the word cloud generation.
I ran the model sweeping two parameters (see the sketch after the list):

  • the number of topics (ranging between 2 and 14)
  • the alpha value (with possible values 0.2, 0.3 and 0.4)

I then chose the best solution using Gensim's built-in CoherenceModel with the c_v metric: the optimal model is the one with 9 topics and alpha=0.2. [chart: coherence scores]
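A sketch of the sweep, assuming `docs` is the list of token lists produced by the word-cloud preprocessing:

```python
from gensim import corpora
from gensim.models import CoherenceModel, LdaModel

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

best = None
for num_topics in range(2, 15):
    for alpha in (0.2, 0.3, 0.4):
        lda = LdaModel(corpus=corpus, id2word=dictionary,
                       num_topics=num_topics, alpha=alpha,
                       random_state=42)
        score = CoherenceModel(model=lda, texts=docs,
                               dictionary=dictionary,
                               coherence="c_v").get_coherence()
        if best is None or score > best[0]:
            best = (score, num_topics, alpha, lda)
```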

Here are the topics:

| topic number | top words |
|---|---|
| 0 | `0.125*"way" + 0.094*"favorite" + 0.076*"feature" + 0.067*"oh" + 0.063*"think"` |
| 1 | `0.140*"pipenv" + 0.124*"released" + 0.098*"pipenv_released" + 0.082*"want" + 0.073*"code"` |
| 2 | `0.271*"python" + 0.132*"today" + 0.093*"people" + 0.039*"month" + 0.036*"kenneth"` |
| 3 | `0.183*"requests" + 0.134*"love" + 0.081*"work" + 0.071*"html" + 0.057*"github"` |
| 4 | `0.164*"like" + 0.100*"rust" + 0.098*"time" + 0.058*"day" + 0.047*"things"` |
| 5 | `0.297*"pipenv" + 0.062*"support" + 0.058*"includes" + 0.045*"right" + 0.044*"making"` |
| 6 | `0.271*"new" + 0.076*"getting" + 0.075*"better" + 0.058*"use" + 0.049*"photos"` |
| 7 | `0.161*"good" + 0.097*"going" + 0.092*"got" + 0.067*"happy" + 0.058*"current"` |
| 8 | `0.114*"great" + 0.091*"ipad" + 0.076*"finally" + 0.066*"heroku" + 0.057*"working"` |

We can check the intertopic distance map and the most relevant terms for each topic using pyLDAvis; you can explore the interactive visualization in the Jupyter notebook on my GitHub account. [chart: pyLDAvis intertopic distance map]
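A sketch of the pyLDAvis call, reusing the `best` tuple from the sweep above (in recent pyLDAvis releases the Gensim adapter moved to pyLDAvis.gensim_models):

```python
import pyLDAvis
import pyLDAvis.gensim  # pyLDAvis.gensim_models in newer releases

score, num_topics, alpha, lda = best
vis = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.display(vis)  # renders inline in a Jupyter notebook
```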

Conclusions and future steps

In this post I showed how to get data from the Twitter APIs and how to perform some simple analyses in order to learn, in advance, some features of an account (e.g. tweet style, statistics about tweets, topics).
Your mileage may vary depending on the initial account list and on the configuration of the algorithms (especially for topic detection).

I uploaded a Jupyter notebook to my GitHub, with the snippets I used to create this blog post.

Next steps:

  • Improve preprocessing using lemmatization and stemming
  • Try different algorithms for topic detection using Gensim (e.g. AuthorTopicModel or LdaMallet) or scikit-learn
  • Add sentiment analysis

Top comments (3)

Andrea La Scola: Stalking level: over 9000!! Awesome Job! 🚀

Ryan Palo: This is awesome! Great analysis and write up, thanks!

Erin Moore: Does it only work on men?