Over Christmas I tried Julia Evans' twitter meme challenge in python.
This is about using a suffix array data structure to find phrases that get repeated a lot on twitter:
The idea here is that memes are substrings like “me as a kpop fan” that many different people are using. The tricky thing is that you don’t really know how long those substrings will be, and maybe you’re interested in phrases of different lengths.
You can probably do this challenge without using anything fancy (with a hashmap of phrases or something) but I think it’s a nice opportunity to play with a fun data structure: suffix arrays!
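To make the idea concrete, here's a toy sketch (made up for illustration, not the challenge code) that builds a suffix array the naive way, just to show why repeated phrases end up next to each other once the suffixes are sorted:

```python
# Toy sketch: build a suffix array naively (O(n^2 log n)) by sorting
# every suffix's start index lexicographically. Real tools like
# libdivsufsort do this much faster, but the idea is the same.
tweets = [
    b"actual footage of me at work",
    b"something unrelated",
    b"actual footage of me on monday",
]
text = b"\n".join(tweets)
sa = sorted(range(len(text)), key=lambda i: text[i:])

# Because the suffixes are sorted, every suffix starting with the
# shared phrase lands in one contiguous run -- that's what makes
# memes easy to spot by scanning down the array.
run = [i for i in sa if text[i:].startswith(b"actual footage")]
print([text[i:i + 25] for i in run])
```

Scanning the sorted suffixes top to bottom, a meme shows up as a block of near-identical lines, however long the repeated phrase happens to be.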
If you want to try the challenge, she has a SQLite database of tweets you can download from dropbox.
Here are some of the memes I found:
- Actual footage of me after chugging a venti iced coffee prepping for this presentation
- Actual footage of me before a phone interview.
- Actual footage of me holding my life together during December.
This is a pattern that got shared a lot:
born in: 🇷🇺 i live in: 🇷🇺 first country i traveled to: 🇧🇻 last country i traveled to: 🇹🇳 country i miss the most: 🇸🇯 countries i hope to see: 🇰🇷🇯🇵🇨🇳🇮🇸🇷🇴🇵🇱🇬🇧🇺🇸🇮🇹🇪🇸🇩🇪 countries already been to: 🇳🇴🇸🇪🇫🇮🇹🇭🇹🇳🇹🇷🇪🇬 country likely visit next: who knows..
It seems like twitter did something that prompted a bunch of people to post lots of variations on this message:
DISCLAIMER! Due to Twitter’s new policy, I’m posting a disclaimer and would like to mention that I am not any of the celebrities I post about. This account is not affiliated with any of them. This is purely a fan account for entertainment purposes only.
In her blog post, Julia describes sifting through the suffix array manually to find interesting looking memes. I did pretty much the same thing, but with a few changes to make it a bit easier.
My python code for iterating through the suffix array and printing it out was pretty slow, so I decided to just throw away 90% of the data. I figured that I didn't need a ton of examples to recognise the memes.
I also noticed that a lot of the suffix array was actually irrelevant to the challenge, because huge chunks of it contained strings that didn't have many characters in common. To make my job easier, I wrote some code to filter out strings that don't have enough overlap with the preceding one:
```python
last_sample = None
for suffix in suffix_array.recover_suffixes():
    tweet_suffix = first_line(suffix.strip())
    sample = tweet_suffix[:20]
    if sample != last_sample:
        # Too different, ignore
        last_sample = sample
        continue
```
I was initially confused about why there were so many complete words in my dataset. Since all suffixes of a string get represented in the suffix array, I was expecting to see suffixes beginning with partial words (such as "ello world") sorted in the middle of full words.
It turned out that I was looking at suffixes beginning with a space character (e.g. " a year ago"), but stripping off the space when I printed them out (doh). Although confusing, this was useful for the challenge, because I wasn't interested in substrings that don't start on word boundaries.
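A tiny made-up example shows what was going on: the space byte sorts before every letter, so suffixes that start at a word boundary cluster at the front, and stripping them makes them look like whole words:

```python
text = b"hello world"
# Naive suffix array: start indices sorted by the suffix they name.
sa = sorted(range(len(text)), key=lambda i: text[i:])
suffixes = [text[i:] for i in sa]

# The space byte (0x20) sorts before every letter, so the suffix that
# begins at the word boundary comes first...
print(suffixes[0])          # b' world'
# ...and stripping it makes it look like a complete word.
print(suffixes[0].strip())  # b'world'
```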
I used pydivsufsort to actually build the suffix array. Although it isn't packaged up for easy installation with pip, it was quite straightforward to compile libdivsufsort, the C library it's wrapping.
I added a couple of helper methods to help me debug the SuffixArray data structure:
```python
def __str__(self):
    return str(list(self.sa))

def recover_suffixes(self):
    """Return the suffixes in sorted order"""
    return (self.text[i:] for i in self.sa)
```
However, when I tried using it, I got weird results:
```python
>>> import divsufsort
>>> a = divsufsort.SuffixArray('bananas')
>>> print(a)
[6, 5, 1, 2, 3, 4, 0]
>>> print(list(a.recover_suffixes()))
['s', 'as', 'ananas', 'nanas', 'anas', 'nas', 'bananas']  # this isn't sorted at all!
```
It turned out that this library accepts regular unicode strings but only works properly with bytestrings:
```python
>>> a = divsufsort.SuffixArray(b'bananas')
>>> print(a)
[1, 3, 5, 0, 2, 4, 6]
>>> print(list(a.recover_suffixes()))
[b'ananas', b'anas', b'as', b'bananas', b'nanas', b'nas', b's']
```
This took me ages to figure out, because I wasn't sure if I had actually misunderstood something about the data structure. But thankfully, pydivsufsort came with tests, so inspecting those helped me to figure out what I was doing wrong.
Once I got past this problem I was pretty happy with pydivsufsort. However, if you are interested in coding a suffix array construction algorithm from scratch instead, you could try this guide on the SA-IS algorithm for programmers.
Using bytestrings throughout meant I couldn't rely on the `print()` function to print out non-latin characters, like emojis. 😭
In python, if you print a byte string, you get its literal representation printed out instead of the actual text, with non-ASCII bytes shown as hexadecimal escapes, which is very annoying!
```python
>>> print(':D')
:D
>>> print(b'\xf0\x9f\xa4\xa3')
b'\xf0\x9f\xa4\xa3'
```
My first instinct was to call `decode('utf8')` on all my output, and then print that. But given the amount of text I was processing, it seemed unnecessary to decode all my text to unicode strings, only for `sys.stdout` to re-encode it again.
This prompted me to look at the python documentation, which notes that:
> To write or read binary data from/to the standard streams, use the underlying binary buffer object. For example, to write bytes to stdout, use `sys.stdout.buffer.write(b'abc')`.
>
> However, if you are writing a library (and do not control in which context its code will be executed), be aware that the standard streams may be replaced with file-like objects like `io.StringIO` which do not support the buffer attribute.
This means that the following works, as long as nobody has overwritten sys.stdout to point to something else:
```python
>>> bytes_written = sys.stdout.buffer.write(b'\xf0\x9f\xa4\xa3\n')
🤣
```
This is the code I ended up with.
```python
from divsufsort import SuffixArray
import sqlite3
import sys

conn = sqlite3.connect('twitter.db')
c = conn.cursor()
c.execute('''select tweet from tweets limit 100000''')
tweets = []
for row in c.fetchall():
    # Lower case all the text to make things simpler
    tweet = row[0].lower()
    # Get rid of the line breaks in the middle of tweets to make it
    # easier to scan the output
    lines = tweet.splitlines()
    tweets.append('⏎ '.join(lines).encode('utf8'))
conn.close()
print('loaded all tweets')

all_tweets = b'\n'.join(tweets)
print('joined all tweets')

suffix_array = SuffixArray(all_tweets)
print('made suffix array')

def first_line(long_string):
    try:
        idx = long_string.index(b'\n')
    except ValueError:
        return long_string
    return long_string[:idx]

last_sample = None
for suffix in suffix_array.recover_suffixes():
    tweet_suffix = first_line(suffix.strip())
    sample = tweet_suffix[:20]
    if sample != last_sample:
        # Too different, ignore
        last_sample = sample
        continue
    sys.stdout.buffer.write(tweet_suffix)
    sys.stdout.buffer.write(b'\n')
    sys.stdout.buffer.flush()
```
This was a pretty interesting way of exploring a big dataset of text.
Besides the memes, I noticed a lot of spam and marketing, as well as other viral stuff like petitions.
But you can also spot phrases that show up a lot naturally, such as "dropping out of the race" (this kind of phrasing showed up in about 400 tweets).
A fun way to see what people were excited about is to look at suffixes starting with "also!":
also! check your local library for art books and see if they do art classes for kids/teens
also! crime junkie popped my podcast virginity this year❣️
also! hire me to wrap all your presents! because i love that shit and i’m good at it
also! i lost almost 30 lbs since my appointment last year!!!
The best thing about this is that after all the "also!"s you get suffixes with increasing numbers of exclamation marks!
also!!!! we have heat again!!!!