Antonello Zanini for Writech

Posted on Jan 19 • Edited on Oct 26

Improving Performance with SQL Aggregate Functions

#sql #data #database #performance

In this article, you will learn how SQL aggregate functions can represent an easy way to significantly improve the performance of your application. Particularly, you will see how they were a game-changer in a real-world scenario based on a data-driven application developed for a startup operating in the sports industry.

Let's now dig deeper into this scenario and learn why SQL aggregate functions are so crucial when it comes to data science.

Introducing the Scenario

The application I recently worked on aims to offer advanced data exploration features in the sports world through the web. In detail, it needs to allow exploration of both raw and aggregated data. Since the database involves terabytes of heterogeneous and unstructured data, the challenges were mostly backend and database side. Now, let's dive into this scenario.

Technologies, server specs, and architecture

We developed the backend in Kotlin with the Spring Boot 2.5.3 framework and the Hibernate 5.4.32.Final ORM (Object Relational Mapping). We deployed it on an 8GB 4 CPU VPS through a Docker container managed by Dokku. The initial heap size was set to 2GB and limited to 7GB, while the remaining GB of RAM was allocated to a Redis-based caching system. We developed the web application with performance in mind. Specifically, it is based on the multi-layered Spring Boot architecture described here and it involves multi-thread processing.

Database structure

The database is a MySql server running on an 8GB 2 CPU VPS. The same server farm also hosts the backend application and the database, but they do not share the same VPS. Since the data coming from sports is simple but highly heterogeneous, the database was structured to avoid duplication and encourage standardization. This is why a relational database was chosen. Right now, the database involves hundreds of tables and it cannot be presented entirely here due to an NDA.

Luckily, the most problematic tables share more or less the same structure. So, analyzing just one table should be enough. In particular, this is what the PositionalData table looks like this:

CREATE TABLE `PositionalData` (
  `id` int(11) NOT NULL AUTO_INCREMENT PRIMARY KEY,
  `area1` double DEFAULT NULL,
  `area2` double DEFAULT NULL,
  `area3` double DEFAULT NULL,
  `area4` double DEFAULT NULL,
  `area5` double DEFAULT NULL,
  .
  .
  .
  `area140` double DEFAULT NULL,
  `area141` double DEFAULT NULL,
  `area142` double DEFAULT NULL,
  `area143` double DEFAULT NULL,
  `area144` double DEFAULT NULL,
  `value` double DEFAULT NULL,
  `parameterId` int(11) NOT NULL,
  `gameId` int(11) NOT NULL,
  `createdAt` datetime DEFAULT CURRENT_TIMESTAMP,
  `createdBy` int(11) DEFAULT NULL,
  `updatedAt` datetime DEFAULT CURRENT_TIMESTAMP,
  `updatedBy` int(11) DEFAULT NULL,
  `deletedAt` datetime DEFAULT NULL,
  `deletedBy` int(11) DEFAULT NULL,
  `active` tinyint(1) DEFAULT '1',
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

As you can see, it involves more than 100 columns, and it has more than 4 external IDs. On average, each of these tables contains at least 15 million rows.

Performance issues

One of the key features of the frontend application is to let users analyze the aggregated values of hundreds of different sports parameters (e.g., passes, throws, blocks) coming from all the selected games of one or more seasons. To retrieve this data, we developed a backend API performing a query on the aforementioned table. Such a query was nothing more than a trivial SELECT returning from 10k to 20k rows. Then, this data is aggregated with a multi-thread process, stored in the Redis cache, and finally serialized in JSON and returned to the frontend application. The first time users call that API—and so before the result is available in the Redis cache—it takes from 2 to 4 seconds. This is unacceptable.

Delving Into the Performance Problem

Let's now see what the downsides of the above approach are.

ORM data transformation process bottleneck

Most advanced ORMs abstract how data is represented at the database level. In other terms, the ORM performs the query, retrieves the desired data from the database, and takes care of transforming it into its application-level representation. This data transformation process is performed behind the scenes, but it undoubtedly represents an overhead. Although it is usually negligible in terms of performance, when it comes to thousands of rows it can easily become a bottleneck. This is especially true when using OO (Object Oriented) languages. In fact, creating a new class instance takes time and resources. To limit the object size and then the heap usage, a solution might be to select only the strictly necessary set of columns. This would make each object lighter, even though the object creation process is the main overhead here. Thus, the time spent performing this transformation process would not change significantly.

Looping takes time

Performing simple operations like sum or average on arrays of objects containing thousands of elements is not performance-free. Although this does not compare to the time spent by the ORM to transform the data, it certainly represents an additional overhead. Fortunately, Java supports many thread-safe collections to perform operations concurrently. On the other hand, opening and managing threads is a complex and time-consuming operation.

Adopting SQL Aggregate Functions

Let's see how a couple of SQL aggregate functions helped me solve the performance issue.

What are SQL aggregate functions?

SQL aggregate functions allow you to perform a calculation involving several rows and obtain one value as a result. Even though each SQL language has its own aggregate function, the most common ones are:

COUNT(): it returns a count of the number of rows selected
MIN(): it extracts the minimum value
MAX(): it extracts the maximum value
SUM(): it performs the sum operation
AVG(): it performs the average operation

They represent a particularly powerful and useful tool when used in association with the GROUP BY statement. Thanks to it, you can first group the desired data and then aggregate it by harnessing them. If you want to delve into MySQL aggregate functions, you can find all the supported ones here. I also recommend checking out this and this.

Replacing application-level operations with queries

While SQL aggregation functions seem promising, I did not know if they could make a difference before seeing them in action. Specifically, the application-level operation was generating a data structure containing the average value on the value column and the sum of each areaX (with X from 1 to 144) column on each parameter chosen over the selected games. You can easily translated into the following query:

SELECT SUM(`area1`) as `area1`,  SUM(`area2`) as `area2`, SUM(`area3`) as `area3`, 
...
SUM(`area142`) as `area142`, SUM(`area143`) as `area143`, SUM(`area144`) as `area144`, 
AVG(`total`) as `total`, `parameterId`
FROM `PositionalData`
WHERE `parameterId` IN (:parameterIds) AND `gameId` IN (:gameIds)
GROUP BY `parameterId`

As you can see, this query takes advantage of the SQL aggregate functions to return aggregate data at the database level. All this while filtering over the desired data using the IN statement on gameId and parameterId and grouping it based on the same parameterId. In other words, data is first filtered based on the selected game of the season and the desired parameters to analyze. Then, the resulting data is grouped by parameter, and finally aggregated by the SQL aggregate functions.

Defining the right indexes

Since that query involves GROUP BY, IN, and SQL aggregate statements, it might be slow. This is why defining the right indexes is so important. In detail, the most important and performance-effective index applied was the following one:

ALTER TABLE `PositionalData` ADD INDEX `PositionalData_parameterId_gameId` (`parameterId`, `gameId`) USING BTREE;

Pros

Database-level aggregation is much faster than executing the same aggregation logic at the application level while looping on large arrays.
Using queries involving SQL aggregate functions with the GROUP BY statement allows you to drastically reduce the number of rows returned. In detail, this allowed me to go from about 10k rows to a number of rows equal to the number of parameters analyzed. Consequently, this makes the data transformation process performed by the ORM irrelevant in terms of time, preventing it from being a bottleneck.
Aggregating at the database level allows you to take advantage of the performance benefits coming from the database cache when executing the same requests. This can make it less crucial to have an application-level cache, consequently leading to a lighter architecture.

Cons

SQL aggregate functions are generally used at SELECT time. When dealing with strongly typed programming languages, the ORM needs to the type the result will have. And not all the ORMs allow you to define easily, sometimes even restricting SQL aggregate functions to native queries only. This means losing the advantage of abstraction introduced by the ORM, and discouraging their use.
The query to extract the desired data involving SQL aggregate functions is always slower than a SELECT involving a simple WHERE clause. At the same time, the execution time should remain in the order of tenths of a second, and in any case much less than to perform the same operation at the application level.
The SQL aggregate operations available are usually a dozen, of which only 5 or 6 are actually mathematical operations.

Performance Comparison

Let's compare the result in terms of response time when calling the same API involving data aggregation with no cache and the same parameters.

Response time when performing aggregation at the application level: ~2/4s
Response time when performing aggregation at the database level: ~800ms

Final Thoughts on SQL Aggregate Functions

SQL aggregate functions are with no doubt a great tool to take the performance to the next level when dealing with data science. Using them is easy and effective, although not all the ORM can fully or natively support them. Either way, knowing how to take advantage of them may become essential to improve performance, and explaining it through a real-world case study was what this article was about.

The post "Improving Performance with SQL Aggregate Functions" appeared first on Writech.

DEV Community