*This article was originally published on Medium by Dominique Péré, a member of WarpJS.*

This is part 1 of a series of articles on micro-benchmarks for matrix computations. This first article focuses on a math.js benchmark, and part 2 will discuss a TensorFlow benchmark. Make sure to subscribe if you don’t want to miss it!

**In this article, you will learn how performing parallel computations can speed up the multiplication of two matrices.**

I recently had occasion to revisit some of the math I learned in high school. Finally, I can see the use of all those matrix multiplication exercises! My background is in IT engineering, but I have to admit that AI involves much more math than IT does.

I am now working for the company that is developing Starnode, a JavaScript library designed to speed up node.js. The problem with **JavaScript is that it can only carry out computations using a single thread, a single process and the CPU** (it’s like a restaurant with only one chef in the kitchen!). Why is JavaScript designed like this? The purpose is to keep it simple and non-blocking. You can find out a lot more about this aspect of JavaScript in this article.

# Why matrix computations take forever

Matrix multiplication is a recurring operation performed in many domains, such as signal processing, data analysis and, more recently, AI.

In these use cases, the matrices involved are rather large, frequently containing more than a thousand rows. Let’s assume we are multiplying two matrices, each with dimensions 1,000 × 1,000. The number of operations that would need to be performed would be:

2 × 1,000³ − 1,000² = 1,999,000,000

That’s right — nearly 2 billion operations! It’s no surprise the CPU is so busy when performing such computations. With so much on its plate, it can’t do anything else! So let’s see what we can do to free up the main CPU thread and event loop and speed up the process.
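That operation count follows from the naive algorithm: each of the n² entries of the product requires n multiplications and n − 1 additions. A quick sketch in plain JavaScript (the function name is just for illustration):

```javascript
// Scalar operations needed to multiply two n × n matrices naively:
// n² result entries × (n multiplications + (n − 1) additions) each,
// which simplifies to 2n³ − n².
function naiveOpCount(n) {
  return 2 * n ** 3 - n ** 2;
}

console.log(naiveOpCount(1000)); // 1999000000 (nearly 2 billion operations)
```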

# The key to speeding up matrix computation: parallelization

Here is the challenge: speeding up the multiplication of two large matrices in single-threaded node.js. We could have used the child_process module to fork another process and assign parts of the job to it (or done the same with worker threads), but we wanted to keep our code simple and come up with a solution that works with a variable number of CPUs/threads. By chance, we have some of the most skilled virtual machine PhDs and engineers working with us to help us optimize parallelization, and we created Starnode, a very simple API that can be used to parallelize any standard JavaScript function. Now, with the ability to perform fine-grained parallelization, we worked to determine how much time could be saved with large matrix computations.

My hardware engineer colleague (who happens to be a former math professor!) and I focused on possible ways to parallelize a sequential algorithm, as this would allow us to split operations for large matrices between multiple processing resources using the JavaScript-based ScaleDynamics “warp,” a dynamic compiler technology (more on this in another story).

# Splitting and computing in parallel

To parallelize matrix multiplication efficiently, be it with Starnode technology or using any other parallelization technique, one must start by identifying independent blocks of operations that can take place concurrently, with minimal overhead time for the execution of splits and recombinations and minimum data transfer.

We tried two different approaches: splitting the matrices band-wise in the first approach, and splitting them tile-wise in the second. Band-wise splitting worked well for small matrices, but when we tried it with larger matrices (400 rows or more), we found that tile-wise splitting was the best way to go.

Below, one can see how these two input-matrix splitting schemes are implemented for the product R = A × B:

- In the case of a band-wise split, A is split into blocks of consecutive rows. Each block Ai is then multiplied by the full matrix B, yielding the result Ri, which constitutes a block of consecutive rows in the product matrix R.

*Figure 1a: band-wise split*
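To make the band-wise scheme concrete, here is a minimal sketch in plain JavaScript, with matrices represented as arrays of rows. The function names are illustrative, not the Starnode API; in the real setup, each band would be handed to a separate processing resource instead of being computed in a plain loop:

```javascript
// Multiply one band (a block of consecutive rows of A) by the full matrix B.
function multiplyBand(bandOfA, B) {
  const p = B.length, n = B[0].length;
  return bandOfA.map(row => {
    const out = new Array(n).fill(0);
    for (let k = 0; k < p; k++) {
      for (let j = 0; j < n; j++) out[j] += row[k] * B[k][j];
    }
    return out;
  });
}

// Band-wise split: cut A into row bands, multiply each band by B,
// then recombine the partial results Ri as consecutive rows of R.
function bandWiseMultiply(A, B, numBands) {
  const bandSize = Math.ceil(A.length / numBands);
  const bands = [];
  for (let i = 0; i < A.length; i += bandSize) {
    bands.push(A.slice(i, i + bandSize));
  }
  // Each call below is independent, so each band could run on its own worker.
  const results = bands.map(band => multiplyBand(band, B));
  return results.flat();
}
```

Note that every band needs read access to the whole of B, which is exactly the data-sharing property discussed further down.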

- In a tile-wise split, A is split into blocks of consecutive rows and B into blocks of consecutive columns. Each block Ai is then multiplied by each block Bj, yielding Rij, which constitutes a “tile” in the product matrix R.

*Figure 1b: tile-wise split*
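A minimal sketch of the tile-wise scheme in plain JavaScript follows (again illustrative, not the Starnode API). Each (row block of A, column block of B) pair yields one independent tile of R, so every tile computation could be dispatched to a different worker:

```javascript
// Multiply a block of rows of A (r × p) by a block of columns of B (p × c),
// producing one r × c tile of the result.
function multiplyTile(rowsOfA, colsOfB) {
  return rowsOfA.map(row =>
    colsOfB[0].map((_, j) =>
      row.reduce((sum, a, k) => sum + a * colsOfB[k][j], 0)
    )
  );
}

// Tile-wise split: cut A into row blocks and B into column blocks,
// compute each tile independently, then write it into place in R.
function tileWiseMultiply(A, B, blocks) {
  const rowStep = Math.ceil(A.length / blocks);
  const colStep = Math.ceil(B[0].length / blocks);
  const R = A.map(() => new Array(B[0].length).fill(0));
  for (let i = 0; i < A.length; i += rowStep) {
    for (let j = 0; j < B[0].length; j += colStep) {
      // Each tile only reads its own slice of B's columns.
      const colsOfB = B.map(row => row.slice(j, j + colStep));
      const tile = multiplyTile(A.slice(i, i + rowStep), colsOfB);
      tile.forEach((tileRow, di) =>
        tileRow.forEach((v, dj) => { R[i + di][j + dj] = v; })
      );
    }
  }
  return R;
}
```

Unlike the band-wise version, each tile touches only a slice of B's columns, which is what keeps the caches happy as B grows.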

Matrix shapes have little impact for a given number of elements, as long as the form factor of the matrix is not excessively rectangular. With small matrices, band-wise splits entail slightly less parallelization overhead than tile-wise splits, thanks to faster B-matrix reads and a very straightforward process for merging blocks into the product matrix. This advantage vanishes rapidly, however, as the size of the B matrix increases, due to the cache-hierarchy conflicts that result from all processes using the full B array.

# The CPUs are burning!

As our approach effectively uses all the resources of your computer, you can expect the fans to run faster, the temperature to increase and your matrices to be computed in a snap!

We ran all our tests on a dedicated server with an Intel i7-7700 CPU (4 cores/8 threads, 4.2 GHz) and 32 GB of RAM.

The following chart shows the time required to multiply math.js matrices of various sizes in node.js without Starnode and with Starnode, as well as the speedup factor when using Starnode in each case. As you can see, the larger the matrix is, the larger the speedup!

This chart shows only the results of using the tile-wise parallelization method, as this method provided the best performance with node.js for matrices larger than 400 × 400.

*As you can see, node.js with Starnode completed matrix multiplication up to six times faster than regular node.js!*

The detailed results for the two split methods are given below. In this table:

- m is the number of rows in the A matrix
- p is the number of rows in the B matrix (as well as the number of columns in A)
- n is the number of columns in the B matrix

We are very excited about these results, as we initially only expected to achieve a speedup factor of 2 or 3 at this scale of parallelization. Surprisingly, when implementing Starnode parallelization, very little overhead is needed to make two processes “talk to each other,” resulting in much-improved computation speeds. For example, for a 2,000 × 1,200 matrix multiplication, we achieved a speedup factor of 6.1! ⚡

The team is also currently working on a TensorFlow benchmark with the same operating mode, which I will link to here soon. Make sure to subscribe to learn new math skills to impress your colleagues! 🤓

Thank you for reading! If you liked this article (or if you didn’t), feel free to leave a comment. We'll do our best to reply and update this article accordingly.
