Arctype Team for Arctype

Posted on • Originally published at arctype.com

Partitioning a billion-row table of soccer data using data context

In this article, you will learn how to use the semantics behind your data when you partition your database. This can drastically improve your application's performance. And, most importantly, you will discover that you should tailor your partitioning criteria to your unique application domain.

I have collaborated with a startup to develop a web app for sports experts to make decisions and explore data. The application supports any sport, but we are based in Europe - and Europeans love soccer. Each of the hundreds of games played every day worldwide comes with thousands of rows. In just a few months, the Events table in our app reached half a billion rows!

By understanding how soccer experts were querying our data, we could partition the database intelligently. In the average case, queries on the new partitioned tables became 20x to 40x faster; across all queries, the improvement was 5x to 10x.

Let's now delve into this scenario and learn why you cannot ignore your data context when partitioning a database.

Presenting the context

Our sports application offers both raw and aggregated data, although the professionals who adopted it prefer the latter. The underlying database contains terabytes of complex, unstructured, heterogeneous data from several providers. So, the biggest challenge was designing a reliable, fast, and easy-to-explore database.

Application domain

In this industry, many providers offer their clients access to the events of the most important soccer games. Specifically, they provide you with data related to what happened during a game, such as goals, assists, yellow cards, passes, and much more. The table containing this data is by far the largest one we had to work with.

VPS specs, technologies, and architecture

My team has been developing the backend application that provides the most crucial data exploration features. We adopted Kotlin v1.6 running on top of a JVM (Java Virtual Machine) as the programming language, Spring Boot 2.5.3 as the framework, and Hibernate 5.4.32.Final as the ORM (Object Relational Mapping). The main reason why we opted for this technology stack is that speed is one of the most crucial business requirements. So, we needed a technology that could leverage heavy multi-thread processing, and Spring Boot turned out to be a reliable solution.

We deployed our backend on a 16 GB, 8 CPU VPS through a Docker container managed by Dokku. The application can use at most 15 GB of RAM, because the remaining 1 GB is dedicated to a Redis-based caching system that we added to improve performance and avoid overloading the backend with repeated operations.

Database and table structure

As for the database, we decided to opt for MySQL 8. An 8GB and 2 CPU VPS currently hosts the database server, which supports up to 200 concurrent connections. The backend application and the database are in the same server farm to avoid communication overhead. We designed the database structure to avoid duplication and with performance in mind. We decided to adopt a relational database because we wanted to have a consistent structure to convert the data received from the providers. This way, we standardize the sports data, making it easier to explore and present it to the end-users.

The database contains hundreds of tables at the time of writing, and I cannot present them all because of the NDA I signed. Luckily, one table is enough to thoroughly analyze why we ended up adopting the data context-based partition you are about to see. The real challenge came when we started to perform heavy queries on the Events table. But before diving into that, let's see what the Events table looks like:

[Screenshot of the Events table structure, viewed in Arctype]

As you can see, it does not involve many columns, although keep in mind that I had to omit some of them for confidentiality reasons. What really matters here are the parameterId and gameId columns: we use these two foreign keys to select a type of parameter (e.g., goal, yellow card, pass, penalty) and the games in which it occurred.
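To make the rest of the article easier to follow, here is a minimal sketch of what the table could look like in MySQL. Only parameterId and gameId were mentioned above; teamId, value, and minutes are inferred from the example query shown later, and all the column types are purely illustrative.

-- Simplified, illustrative sketch of the Events table (the real schema is under NDA)
CREATE TABLE Events (
    id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    parameterId INT UNSIGNED NOT NULL, -- FK: type of event (goal, yellow card, pass, ...)
    gameId INT UNSIGNED NOT NULL,      -- FK: game in which the event happened
    teamId INT UNSIGNED NOT NULL,      -- assumed from the example query shown later
    value DECIMAL(10, 2) NULL,         -- assumed from the example query shown later
    minutes DECIMAL(10, 2) NULL        -- assumed from the example query shown later
);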

Performance issues

The Events table reached half a billion rows in just a few months. As we have already covered in depth in this blog post, the main problem is that we need to perform aggregate operations using slow IN queries. This is because the individual events of a single game are rarely interesting on their own. Instead, sports experts want to analyze aggregated data to find trends and make decisions based on them.

Also, although they generally analyze the entire season or the last 5 or 10 games, users often want to exclude some particular games from their analysis. This is because they do not want a game played particularly poorly or particularly well to skew their results. We cannot pre-generate the aggregated data because we would have to do so for every possible combination of games, which is not feasible. So, we have to store all the data and aggregate it on the fly.

Understanding the Performance Problem

Now, let's dive into the central aspect that led to the performance issues we had to face.

Million-row tables are slow

If you've ever dealt with tables containing hundreds of millions of rows, you know that they are inherently slow. You cannot even think of running JOINs on such large tables. Yet, you can perform SELECT queries in a reasonable amount of time. This is particularly true when these queries involve simple WHERE conditions. On the other hand, they become terribly slow when using aggregate functions or IN clauses. In these cases, they can easily take up to 80 seconds, which is simply too much.

Indexes are not enough

To improve the performance, we decided to define some indexes. This was our first attempt at solving the performance issues, but it led to another problem: indexes take time and space to build. This is generally insignificant, but not when dealing with such large tables. It turned out that defining complex indexes based on the most common queries took several hours and several GBs of space. Also, indexes are helpful, but they are not magic.
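As a simplified example, one of these composite indexes, matching the filter columns of our most common aggregate query, would look something like the statement below (the index name is made up). On a table of this size, even a single index like this one took hours to build and occupied several GBs.

-- Illustrative composite index covering the usual filter columns
CREATE INDEX idx_events_parameter_game ON Events (parameterId, gameId);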

Data context-based database partitioning as a solution

Since we could not solve the performance problem with custom-defined indexes, we decided to try a new approach. We talked with other experts, looked online for solutions, read articles based on similar scenarios, and finally decided that partitioning the database was the right approach to follow.

Why traditional partitioning may not be the right approach

Before partitioning all our largest tables, we studied the topic both in the MySQL official documentation and in interesting articles. Although we all agreed that this was the way to go, we also realized that applying partitioning without taking our particular application domain into account would be a mistake. Specifically, we understood how crucial it was to find the proper criteria when partitioning a database. Some experts in partitioning taught us that the traditional approach is to partition on the number of rows. But we wanted to find something more intelligent and more efficient than that.

Delving into the application domain to find the partitioning criteria

We learned an essential lesson by analyzing the application domain and interviewing our users. Sports experts tend to analyze aggregated data from games in the same competition. For example, a competition in soccer can be a league, a tournament, or a single match where you can win a trophy. There are thousands of different competitions. The most important ones in Europe are the Champions League, Premier League, LaLiga, Serie A, Bundesliga, Eredivisie, Ligue 1, and Primeira Liga.

This means that our users very rarely take into account data coming from different competitions, and they prefer to explore data season by season. In other words, they seldom leave the context represented by a sports competition played in a particular season. Our database structure expressed this concept with a table called SeasonCompetition, whose goal is to associate a competition with a specific season. So, we realized that a good approach would be to partition our larger tables into sub-tables related to a particular SeasonCompetition instance.

Specifically, we defined the following name format for these new tables: <tableName>_<seasonCompetitionId>.

Consequently, if we had 100 rows in the SeasonCompetition table, we would have to split the large Events table into the smaller Events_1, Events_2, …, Events_100 tables. Based on our analysis, this approach would lead to a considerable performance boost in the average case, although introducing some overhead in the rarest cases.
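As a rough sketch, creating and filling one of these sub-tables could look like the statements below. The Game table and its seasonCompetitionId column are assumptions used only to illustrate the idea: every row copied into Events_42 belongs to a game of SeasonCompetition 42.

-- Create a sub-table with the same structure (and indexes) as Events
CREATE TABLE Events_42 LIKE Events;

-- Copy into it only the events of games belonging to SeasonCompetition 42
-- (the Game table and its seasonCompetitionId column are assumptions)
INSERT INTO Events_42
SELECT e.*
FROM Events e
JOIN Game g ON g.id = e.gameId
WHERE g.seasonCompetitionId = 42;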

Matching the criteria with the most common queries

Before coding and launching the scripts to execute this complex operation, with no easy way to roll it back, we validated our analysis by looking at the most common queries performed by our backend application. In doing so, we found out that the vast majority of queries involved only games played within a single SeasonCompetition. This convinced us that we were right, so we partitioned all the large tables in the database with the approach just defined. As a reference, here is one of the most common queries we had to run on the original Events table:


SELECT AVG(`value`) AS `value`, SUM(`minutes`) AS `minutes`
FROM `Events`
WHERE `parameterId` = 15 AND `gameId` IN (223, 241, 245, 212, 201, 299, 187, 304, 187, 205)
GROUP BY `teamId`
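After the split, the same query only targets the sub-table of the SeasonCompetition the games belong to, for example (the id 42 is made up):

SELECT AVG(`value`) AS `value`, SUM(`minutes`) AS `minutes`
FROM `Events_42`
WHERE `parameterId` = 15 AND `gameId` IN (223, 241, 245, 212, 201, 299, 187, 304, 187, 205)
GROUP BY `teamId`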

Now, let's study the pros and cons of this decision.

Pros

  • Running queries on a table containing at most half a million rows is much more performant than doing it on a table of half a billion rows, especially when it comes to aggregate queries.
  • Smaller tables are easier to manage and update. Adding a column or index is not even comparable to before in terms of time and space. Plus, each SeasonCompetition is different and requires different analyses. Consequently, it may require special columns and indexes, and the aforementioned partitioning allows us to deal with this easily.
  • The provider might amend some data. This forces us to perform delete and update queries, which are infinitely faster on such small tables. Plus, they always concern only some games of a particular SeasonCompetition, so we now need to operate on a single table (as in the sketch right after this list).
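As a toy example of the last point (all the values are made up), a data correction received from the provider now touches a single small table:

-- Hypothetical correction from the provider: only one sub-table is affected
UPDATE Events_42
SET `value` = 2
WHERE `gameId` = 223 AND `parameterId` = 15;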

Cons

  • Before making a query on these sub-tables, we need to know the seasonCompetitionId associated with the games of interest. This is because the seasonCompetitionId value is part of the table name. Therefore, our backend needs to retrieve this info before running the query by looking at the games under analysis, which represents a small overhead.
  • When a query involves a set of games spanning several SeasonCompetitions, the backend application must run a query on each sub-table. So, in these cases, we can no longer aggregate the data at the database level, and we must do it at the application level (see the sketch after this list). This introduces some complexity into the backend logic. At the same time, we can execute these queries in parallel and aggregate the retrieved data efficiently once they return.
  • Managing a database with thousands of tables is not easy and can be challenging to explore in a client. Similarly, adding a new column or updating an existing column in each table is cumbersome and requires a custom script.
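To make the first two points more concrete, here is a hedged sketch of the flow (again, the Game table and its seasonCompetitionId column are assumptions). The backend first resolves which SeasonCompetitions the selected games belong to, then runs the aggregate query on each involved sub-table, possibly in parallel, and merges the result sets in memory.

-- 1) Find the SeasonCompetitions of the games under analysis
SELECT DISTINCT seasonCompetitionId
FROM Game
WHERE id IN (223, 241, 245, 212, 201, 299);

-- 2) Supposing the result is 42 and 57, run the aggregate query on each
--    sub-table and merge the rows in the backend
SELECT `teamId`, AVG(`value`) AS `value`, SUM(`minutes`) AS `minutes`
FROM `Events_42`
WHERE `parameterId` = 15 AND `gameId` IN (223, 241, 245)
GROUP BY `teamId`;

SELECT `teamId`, AVG(`value`) AS `value`, SUM(`minutes`) AS `minutes`
FROM `Events_57`
WHERE `parameterId` = 15 AND `gameId` IN (212, 201, 299)
GROUP BY `teamId`;

Note that merging averages correctly at the application level requires either weighting them or returning sums and counts from each sub-table, which is part of the extra backend complexity mentioned above.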

Effects of data context-based partitioning on performance

Let's now look at the time improvement achieved when executing a query in the new partitioned database.

  • Time improvement in the average case (query involving only one SeasonCompetition): from 20x to 40x
  • Time improvement in the general case (query involving one or more SeasonCompetitions): from 5x to 10x

Final thoughts

Partitioning your database is undoubtedly an excellent way to improve performance, especially on large databases. However, doing it without considering your particular application domain might be a mistake or lead to an inefficient solution. Instead, taking the time to study the domain, by interviewing experts and your users and by looking at the most executed queries, is crucial to devising highly efficient partitioning criteria. This article showed you how to do this and demonstrated the results of such an approach through a real-world case study.
