This is part 1 of a series of articles on micro-benchmarks for matrix computations. This first article focuses on a math.js benchmark, and part 2 will discuss a TensorFlow benchmark. Make sure to subscribe if you don’t want to miss it!
In this article, you will learn how performing parallel computations can speed up the multiplication of two matrices.
I recently had occasion to revisit some of the math I learned in high school. Finally, I can see the use of all those matrix multiplication exercises! My background is in IT engineering, but I have to admit that AI involves much more math than IT does.
Matrix multiplication is a recurring operation performed in many domains, such as signal processing, data analysis and, more recently, AI.
In these use cases, the matrices involved are rather large, frequently containing more than a thousand rows. Let’s assume we are multiplying two matrices, each one with dimensions 1000 × 1000. The number of operations that would need to be performed would be:
2 × 1,000³ − 1,000² = 1,999,000,000
That’s right — nearly 2 billion operations! It’s no surprise the CPU is so busy when performing such computations. With so much on its plate, it can’t do anything else! So let’s see what we can do to free up the main CPU thread and event loop and speed up the process.
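To see where those operations come from, here is a minimal single-threaded reference implementation using plain nested arrays (not math.js or Starnode code, just a sketch of the textbook algorithm):

```javascript
// Naive multiplication of two n × n matrices.
// Each of the n² result cells needs n multiplications and n − 1 additions,
// which is where the 2 × n³ − n² operation count comes from.
function multiply(A, B) {
  const n = A.length;
  const R = Array.from({ length: n }, () => new Array(n).fill(0));
  for (let i = 0; i < n; i++) {
    for (let j = 0; j < n; j++) {
      let sum = 0;
      for (let k = 0; k < n; k++) {
        sum += A[i][k] * B[k][j];
      }
      R[i][j] = sum;
    }
  }
  return R;
}
```

All of this work runs on a single thread, which is exactly why the event loop is blocked for the duration of the computation.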
To parallelize matrix multiplication efficiently, be it with Starnode technology or any other parallelization technique, one must start by identifying independent blocks of operations that can run concurrently, with minimal overhead for splitting and recombining and minimal data transfer.
We tried two different approaches: splitting the matrices band-wise in the first, and tile-wise in the second. Band-wise splitting worked well for small matrices, but when we tried larger matrices (400 rows or more), we found that tile-wise splitting was the best way to go.
Below, one can see how these two input-matrix splitting schemes are implemented for the product R = A × B:
- In the case of a band-wise split, A is split into blocks of consecutive rows. Each block Ai is then multiplied by the full matrix B, yielding the result Ri, which constitutes a block of consecutive rows in the product matrix R.
Figure 1a: band-wise split
- In a tile-wise split, A is split into blocks of consecutive rows and B into blocks of consecutive columns. Each block Ai is then multiplied by the block Bi, yielding Ri, which constitutes a “tile” in the product matrix R.
Figure 1b: tile-wise split
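The two split schemes above can be sketched as follows. These helper functions are illustrative only (Starnode’s actual API is not shown here); they assume matrices stored as plain nested arrays:

```javascript
// Band-wise split: cut A into blocks of consecutive rows.
// Each band Ai is multiplied by the full matrix B, and the results
// are stacked to form consecutive row blocks of R.
function bandSplit(A, numBands) {
  const rowsPerBand = Math.ceil(A.length / numBands);
  const bands = [];
  for (let i = 0; i < A.length; i += rowsPerBand) {
    bands.push(A.slice(i, i + rowsPerBand));
  }
  return bands;
}

// Tile-wise split: additionally cut B into blocks of consecutive columns.
// Multiplying a row band of A by a column block of B yields one tile of R.
function columnSplit(B, numBlocks) {
  const colsPerBlock = Math.ceil(B[0].length / numBlocks);
  const blocks = [];
  for (let j = 0; j < B[0].length; j += colsPerBlock) {
    blocks.push(B.map(row => row.slice(j, j + colsPerBlock)));
  }
  return blocks;
}
```

Each band or tile multiplication is independent of the others, so each one can be handed to a separate worker and the results merged at the end.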
Matrix shape has little impact for a given number of elements, as long as the matrix is not excessively rectangular. With small matrices, band-wise splits entail slightly less parallelization overhead than tile-wise splits, thanks to faster reads of the B matrix and a very straightforward merge of the result blocks into the product matrix. This advantage vanishes rapidly, however, as the B matrix grows, because every process reading the full B array causes cache-hierarchy conflicts.
As our approach effectively uses all the resources of your computer, you can expect the fans to run faster, the temperature to increase and your matrices to be computed in a snap!
We ran all our tests on a dedicated server with an Intel i7-7700 CPU (4 cores/8 threads, 4.2 GHz) and 32 GB of RAM.
The following chart shows the time required to multiply math.js matrices of various sizes in node.js without Starnode and with Starnode, as well as the speedup factor when using Starnode in each case. As you can see, the larger the matrix is, the larger the speedup!
This chart shows only the results of using the tile-wise parallelization method, as this method provided the best performance with node.js for matrices larger than 400 × 400.
As you can see, node.js with Starnode completed matrix multiplication up to six times faster than regular node.js!
You can find below the detailed results for the two split methods. In this table:
- m is the number of rows in the A matrix
- p is the number of rows in the B matrix (as well as the number of columns in A)
- n is the number of columns in the B matrix
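With these definitions, the operation count generalizes from the 1000 × 1000 example above. A small helper (illustrative, not part of the benchmark code) makes the formula explicit:

```javascript
// Operation count for multiplying an m × p matrix by a p × n matrix:
// each of the m·n result cells needs p multiplications and p − 1 additions,
// giving m·n·(2p − 1) operations in total.
function opCount(m, p, n) {
  return m * n * (2 * p - 1);
}
```

For m = p = n = 1000 this reduces to 2 × 1000³ − 1000², the 1,999,000,000 operations quoted earlier.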
We are very excited about these results, as we initially expected a speedup factor of only 2 or 3 at this scale of parallelization. Surprisingly, Starnode parallelization requires very little overhead to make two processes “talk to each other,” resulting in much-improved computation speeds. For example, for a 2000 × 1200 matrix multiplication, we achieved a speedup factor of 6.1! ⚡
The team is also currently working on a TensorFlow benchmark with the same operating mode, which I will link to here soon. Make sure to subscribe to learn new math skills to impress your colleagues! 🤓
Thank you for reading! If you liked this article (or if you didn’t), feel free to leave a comment. We'll do our best to reply and update this article accordingly.