Cameron Archer

Posted on Dec 7, 2022 • Originally published at tinybird.co

Measuring World Cup sentiment with Twitter and Tinybird

This is a blog post about something I love (soccer), something I find interesting (analytics), and one thing I really don’t care for (Twitter) but nonetheless find inextricably bound to my line of work (Content Marketing at Tinybird).

I’m a huge soccer fan. I’m also American, which is why I call it soccer (sorry, not sorry).

Ted calls it soccer.

If you like soccer as much as me, you’re no doubt aware that we’re right in the thick of the World Cup. I write this on the eve of the first knockout round matches, my country’s team having survived the group stage to face the Oranje of Holland tomorrow morning. (Edit: 😥)

I love watching soccer. I find that no other sport quite sets you on edge with emotional tension. You agonize over every pass, tackle, and set piece. Goals are hard to come by. When the ball does finally find the back of the net, your heart either soars with delight or plummets in despair. The buildup and sudden release of emotion is quite a rush. At least I think so.

And that’s why I’m here, writing this. Because having experienced the emotion of watching many international soccer matches, I wondered what that emotion might actually look like if we analyzed it. What if we could actually measure and visualize the rollercoaster emotions that so many fans across the world will experience during this World Cup?

This blog post documents my journey to measure the aggregate emotions of soccer fans around the world as their teams participated in World Cup matches. If you like soccer, data, analytics, Python, or Twitter, you might find this interesting.

This is what I’ve created so far. Keep reading to learn what it means, and how I got there.

The sentiment towards Japan (red) and Spain (yellow) during and after Japan's surprising (and controversial) win over Spain in the group stage.

The Starting Point

Of course, I didn’t just decide to measure the emotions of the World Cup on a whim. I had some inspiration.

Earlier this year, my colleague alrocar took on a project to measure the real-time sentiment of his Twitter timeline, and he documented his work here. If you take a look at the Tinybird Twitter banner you’ll see the fruits of his labor.

The sentiment of Tinybird's Twitter timeline.

I thought maybe I could do something similar, but instead of measuring the sentiment of my Twitter timeline, I wanted to measure the emotions of the World Cup.

Thus, I decided to use Twitter as my data source. After all, Mr. Musk himself promised that Twitter would be the best source of real-time World Cup coverage, and its global reach and live nature promised to make it useful for streaming analytics. It seemed like a good place to start.

Drawing from alrocar’s work, I decided to use the Twitter API to capture tweets about the World Cup and attempt to measure the emotion surrounding each team during the matches in which it participated.

But how do you measure emotion on Twitter?

Yeah, this is a tough one. I thought about taking the same approach as alrocar, using the TextBlob Python library which offers a sentiment analysis function, spitting out a polarity based on its analysis. I tested it out a bit, but the results weren’t great.

What made this problem especially challenging is the very nature of the World Cup. As a global competition, tweets about it are published in many different languages. Using a simple natural language processing library was prone to some error (at best) or even significant bias towards a language such as English. So I decided to look for another way.

The obvious second choice was to use emojis. Theoretically, I could try to measure the sentiment of the various emojis used in tweets associated with a particular team. More 😀 means more positive emotions, more 🤬 means more negative.

But to be honest, this just felt really hard. I’m a moonlighting Python/SQL hack, not a developer. Plus I’ve got a wife and kid. I didn’t have time for that!

I wondered: “What’s the simplest way to measure emotion towards one’s country?”

And then I had it. What better way to show support for your country than by “waving your flag”? I could measure sentiment towards a World Cup team by tracking the number of times that team’s country flag emoji was used in tweets related to the World Cup.

What better way to track sentiment towards a World Cup team than by counting the number of times its flag was "waved" on Twitter?
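To make the idea concrete: a flag emoji is a pair of Unicode regional indicator symbols, so counting "waves" comes down to counting substrings. Here is a minimal Python sketch of that idea. The helper names `flag_for` and `count_flag` are hypothetical and not part of the project.

```python
def flag_for(country_code: str) -> str:
    """Build a flag emoji from a two-letter ISO 3166-1 alpha-2 code."""
    base = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A
    return "".join(chr(base + ord(c) - ord("A")) for c in country_code.upper())


def count_flag(tweet: str, flag: str) -> int:
    """Count non-overlapping occurrences of a flag emoji in a tweet."""
    return tweet.count(flag)
```

One caveat of plain substring counting: when flags are typed back to back, the second half of one flag and the first half of the next can form a third flag (🇺🇸🇮🇷 contains the pair for "SI"), so counts for some countries could pick up accidental matches.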

And so that’s what I did. Here’s how I did it.

The basic architecture

Three basic components made up this project: some Python code to create a streaming connection to the Twitter API and handle the data it created, a Tinybird data project to analyze that data and publish APIs from the analysis, and a simple Retool dashboard to serve as a frontend for my APIs.

The basic architecture of my World Cup sentiment project.

Here’s the project repo, if you’d like to follow along or try it yourself.

Using the Twitter Filtered Stream API

The Twitter API is one of the richest sources of real-time data on the internet. There’s so much you can do with it. For this project, I decided to use the Filtered Stream in the v2 Twitter API.

Filtered Stream lets you create a streaming connection and define “stream rules” to filter which tweets are included in the stream.

I chose Python to handle the streaming because a) it’s the language I’m strongest in, and b) it has Tweepy.

Tweepy

Tweepy is a pretty full-featured Python library to interact with the Twitter API. I was able to use Tweepy to create a streaming client, define my filter rules, and then handle sending data to other functions when that data is received from the stream.

Creating a streaming client

Tweepy has a StreamingClient class to interact with the Twitter v2 Filtered Stream API. It includes methods such as add_rules(), filter(), and on_data() that I used to capture World Cup tweets and send them on for processing. I defined my own StreamingClient subclass to add some methods and override on_data() to process the data created in the stream (more on that later).
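
As a rough sketch of the parsing step an on_data() override might delegate to: this assumes each raw message is JSON bytes with the tweet under a `data` key, which matches the `json_data['data']` lookup used later in this post. `parse_stream_payload` is a hypothetical helper name, not part of Tweepy, and the blank-line check is purely defensive.

```python
import json


def parse_stream_payload(raw_data: bytes):
    """Decode one raw stream message; return the tweet dict, or None."""
    if not raw_data.strip():
        return None  # nothing to parse (e.g. an empty keep-alive line)
    payload = json.loads(raw_data)
    # Messages without a 'data' key (such as error payloads) yield None.
    return payload.get("data")
```

Keeping the parsing in a plain function like this makes it easy to test without opening a live stream.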

Adding stream rules

I used the add_rules() function to define rules for my filtered stream. Originally, I wanted to create a rule for every team, but there were 32 teams originally participating in the World Cup, and at my Twitter Developer project level (Elevated) I was limited to 25 rules.
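
For what it's worth, one way around a rule cap is to OR several terms together inside a single rule. I didn't go this route, but a sketch might look like the following. The 25-rule cap is the one mentioned above. The 512-character limit per rule is my assumption about this access level, so check it against your own plan. `pack_rules` is a hypothetical helper.

```python
def pack_rules(terms, max_rules=25, max_length=512):
    """Greedily OR-join terms into as few rule strings as fit the limits."""
    rules, current = [], ""
    for term in terms:
        if len(term) > max_length:
            raise ValueError(f"term too long for a single rule: {term!r}")
        candidate = f"{current} OR {term}" if current else term
        if len(candidate) <= max_length:
            current = candidate
        else:
            # The current rule is full; start a new one with this term.
            rules.append(current)
            current = term
    if current:
        rules.append(current)
    if len(rules) > max_rules:
        raise ValueError(f"{len(rules)} rules needed, only {max_rules} allowed")
    return rules
```

Each returned string could then be passed as the value of its own stream rule.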

I ended up setting up a very generic rule as follows:

def set_filter(self):
    print('set filter rules')
    rule = 'WorldCup OR "World Cup" OR Qatar2022 OR FIFA'
    self.add_rules(tweepy.StreamRule(value=rule))

This added to my stream any tweets that included (exact match) any one of those 4 phrases. Ultimately, I should have refined this a bit more, because I ended up burning through all of my 2M tweet cap during the group stage of the World Cup, but it also allowed me to do a fun “bonus” task which I describe at the end of this post.

Sending tweets to Tinybird

There’s no better tool for building fast analytics APIs on streaming data than Tinybird. Am I biased? Hell yeah I am, and I’ll die on this hill. Tinybird, and the ClickHouse SQL functions it provides, made the analytics and API publishing part of this project so easy. I’ll get to that in a bit.

But before writing any SQL, I had to get data into Tinybird.

This also proved relatively trivial with the Tinybird Events API. It’s a simple HTTP endpoint that can process and store up to 1000 req/s and 20 MB/s of streaming data into a Tinybird Data Source. This project didn’t come close to those limits.

The Tinybird Events API made it very easy to send tweet data to Tinybird for analysis.

I borrowed some of the buffering concepts from alrocar’s project (which used the Data Sources API), but the basic flow here is to just take some data from the tweet stream, do some parsing and pre-processing, then format it as NDJSON:

tweet = json_data['data']
# Fall back to the current time if the tweet has no creation timestamp
if 'created_at' not in tweet:
    timestamp = datetime.now()
else:
    timestamp = datetime.strptime(tweet['created_at'], '%Y-%m-%dT%H:%M:%S.%fZ')

text = tweet['text']
# Keep only the fields needed for the analysis
tt = {
    'timestamp': timestamp.strftime("%Y-%m-%d %H:%M:%S"),
    'tweet': text,
}
data.append(tt)
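
The NDJSON step itself is small. Here is a minimal sketch of how a buffered list of records like the one above could be serialized, one JSON object per line. `to_ndjson` is a hypothetical helper name and may not match what the repo does.

```python
import json


def to_ndjson(records):
    """Serialize a list of dicts as newline-delimited JSON."""
    # ensure_ascii=False keeps emojis readable; encode as UTF-8 when sending.
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)
```

Because json.dumps escapes any newline inside a tweet's text, each record still occupies exactly one line.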

Then it’s just a few more lines of code to send that data to Tinybird thanks to the requests library.

params = {
   'name': self.datasource,
   'token': self.token,
   'host': self.host
}
response = requests.post(self.url, params=params, data=data)
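
If you'd rather not depend on requests, the same call can be sketched with the standard library. This is an illustration, not the project's code: it assumes the Events API lives at `/v0/events`, takes the Data Source name as the `name` query parameter, and accepts the token as a Bearer header, which is my understanding of the API. `build_events_request` is a hypothetical helper.

```python
import urllib.parse
import urllib.request


def build_events_request(host, datasource, token, ndjson_body):
    """Build (but do not send) a POST request for the Events API."""
    query = urllib.parse.urlencode({"name": datasource})
    return urllib.request.Request(
        url=f"{host}/v0/events?{query}",
        data=ndjson_body.encode("utf-8"),  # NDJSON payload as UTF-8 bytes
        headers={"Authorization": f"Bearer {token}"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen() would actually send it.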

During the matches*, I’d run this script on my laptop and easily stream hundreds of tweets a second to Tinybird.

*The World Cup was already several matches in before I got everything working, and I also missed a few other matches due to a Google Fiber outage :(

Analysis and publication in Tinybird

To measure emotion during the matches, I counted the total number of flag emojis used per minute for each team during the match. Thanks to the ClickHouse function countSubstrings() supported in Tinybird, this was very easy.

I created a dual-node SQL Pipe that first counted the number of flags for each team in each tweet, then aggregated the total number of flags for each team over the match time period.

I used SQL in Tinybird to calculate the total number of flag emojis used for each team per minute where only one team was mentioned in a Tweet.

In this aggregation, I used the ClickHouse sumIf() function to only include tweets where just one of the two flags was mentioned. I figured that tweets including both flags were probably more intended to summarize or comment on match progress, rather than express support for a particular team.

This is how that SQL looks in Tinybird:

--FIRST NODE
%
SELECT
  tweet,
  timestamp,
  countSubstrings(tweet, {{String(team_1_flag, default='🇺🇸', description='The flag for the first team in the match', required=True)}}) AS team_1_matches,
  countSubstrings(tweet, {{String(team_2_flag, default='🇮🇷', description='The flag for the second team in the match', required=True)}}) AS team_2_matches
FROM tweets_match_2
WHERE timestamp >= toDateTime({{DateTime(match_start, default="2022-11-30 15:00:00", description="The match start time in GMT", required=True)}}) - INTERVAL 15 minute
AND timestamp < toDateTime({{DateTime(match_start, default="2022-11-30 15:00:00", description="The match start time in GMT", required=True)}}) + INTERVAL 180 minute

--SECOND NODE
%
SELECT
    toStartOfMinute(timestamp) AS minute,
    sumIf(team_1_matches, team_2_matches==0) AS total_team_1_matches,
    sumIf(team_2_matches, team_1_matches==0) AS total_team_2_matches
FROM matching_flags
GROUP BY minute
ORDER BY minute DESC

Note that in Tinybird you can split SQL queries into discrete nodes within a Pipe to avoid window functions and CTEs.
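
If you read Python more easily than SQL, the two nodes can be mirrored roughly like this. It's an illustrative sketch only (the real work happens in Tinybird), and `flags_per_minute` is a hypothetical name. It takes tweets as (timestamp, text) pairs.

```python
from collections import defaultdict
from datetime import timedelta


def flags_per_minute(tweets, flag_1, flag_2, match_start):
    """Return [(minute, team_1_total, team_2_total), ...], newest first."""
    window_start = match_start - timedelta(minutes=15)
    window_end = match_start + timedelta(minutes=180)
    totals = defaultdict(lambda: [0, 0])
    for timestamp, text in tweets:
        if not (window_start <= timestamp < window_end):
            continue  # outside the match window (first node's WHERE clause)
        n1, n2 = text.count(flag_1), text.count(flag_2)
        bucket = totals[timestamp.replace(second=0, microsecond=0)]
        # Like sumIf(): only count a flag when the other team's is absent.
        if n2 == 0:
            bucket[0] += n1
        if n1 == 0:
            bucket[1] += n2
    return [(minute, t1, t2) for minute, (t1, t2) in sorted(totals.items(), reverse=True)]
```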

You’ll notice the use of the Tinybird templating language to create parameters in the queries. These served as query parameters for the API I published from this SQL.

Publishing APIs from SQL queries

I’ll keep this brief. In Tinybird, this is nearly automatic:

Publishing an API from an SQL query in Tinybird only takes a few seconds.

Doesn’t get easier than that. So with a little SQL and Tinybird, I had an API that would give me the total number of flag emojis used for each team participating in a match from 15 minutes before the match’s start time to 3 hours after it.
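
Calling the published API is then just a matter of building a URL with the right query parameters. A minimal sketch, assuming the usual `/v0/pipes/<pipe name>.json` endpoint format. The pipe name `match_flags` in the usage note is made up for illustration, and `build_pipe_url` is a hypothetical helper.

```python
from urllib.parse import urlencode


def build_pipe_url(host, pipe_name, token, **params):
    """Build the URL for a published pipe, with template parameters as query args."""
    query = urlencode({"token": token, **params})
    return f"{host}/v0/pipes/{pipe_name}.json?{query}"
```

For example, build_pipe_url("https://api.tinybird.co", "match_flags", token, team_1_flag="🇺🇸", team_2_flag="🇮🇷", match_start="2022-11-30 15:00:00") would produce a URL with the flags and start time safely percent-encoded.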

Visualizing with Retool

I love Retool. Sure, I can hack together some HTML/CSS/JavaScript and build something myself, but for a project like this Retool is perfect. I can just add the Tinybird APIs as resource queries, and then use pre-built React components with a drag-and-drop editor. Easy peasy.

Here’s the Retool dashboard I created. I ended up creating some additional Tinybird APIs to build the live tweet feed, get team-oriented colors for the lines on the line chart, and automatically populate the chart with data from the most current match. The image below shows how the dashboard looked during the match between Tunisia and France on November 30th, a surprising result in which Tunisia defeated a much more talented French side 1-0. The first red spike is when Tunisia scored its goal, and the final spike is when the match ended.

Tunisia upset France during the World Cup. You can see spikes of Tunisian flag usage where Tunisia scored a goal, and after the match ended and they had won.

My hypothesis was correct! Twitter took to “waving the flag” of the victorious countries, releasing its collective fan emotion in a burst of flag emojis whenever a goal was scored or a match ended.

Bonus: GOOOOOOOOOOOOOOAL!

If you are into soccer and live in an English- or Spanish-speaking country, you know that enthusiastic announcers tend to yell “GOAL!” or “GOL!” when a goal is scored. The quality of the goal is often emphasized by the length of the yell.

“GOOOOAL” = pretty good goal

“GOOOOOOOOOOOOOOOOOOOOOOOOOOAL” = amazing goal

“GOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOAL” = golazo

I decided to create an additional Tinybird API to return the tweet containing the longest such proclamation and the number of characters included in the “GOAL.” I expected that somebody might use all 280 characters available to them, and I wasn’t disappointed.

SELECT
  tweet,
  timestamp,
  extract(upper(tweet), 'GO+A*L+') AS goal,
  length(extract(upper(tweet), 'GO+A*L+')) AS goal_length
FROM tweets_match_2
ORDER BY goal_length DESC
LIMIT 1

That’s a “G”, an “L”, and 278 “O”s. I even made an API to power a histogram chart showing the frequency of various “GOOOOAL” lengths on Twitter.
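
For the curious, the same pattern is easy to try in Python. This sketch mirrors the SQL above using the re module, plus a simple length histogram. The function names are hypothetical and this is not the code behind the API.

```python
import re
from collections import Counter

GOAL_PATTERN = re.compile(r"GO+A*L+")  # same pattern as in the SQL


def extract_goal(tweet):
    """Return the first GOAL-like shout in a tweet, or '' if there is none."""
    match = GOAL_PATTERN.search(tweet.upper())
    return match.group(0) if match else ""


def longest_goal(tweets):
    """Return the tweet containing the longest shout."""
    return max(tweets, key=lambda t: len(extract_goal(t)))


def goal_length_histogram(tweets):
    """Count how many tweets contain a shout of each length."""
    lengths = (len(extract_goal(t)) for t in tweets)
    return Counter(n for n in lengths if n > 0)
```

Note that the pattern is permissive: words like "GOLD" or "GOLF" also match on their first three letters, so the short end of the histogram will include some noise.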

The frequency of various character-length "GOOO...LS" on Twitter.

This was a fun project, bringing together passions, work, and necessary evils (love ya, Twitter 😘).

If you like what you’ve seen here, share it around, or if you have ideas for the next iterations of this project, shoot me a note on Twitter or fork the repo! I’d love to hear from you.
