## DEV Community

ndesmic

# Fast Matrix Math in JS

There are a lot of ways to represent matrix math in JS. Some are readable and some are fast. I wanted to explore the differences a bit. How much are certain techniques actually saving me?

For this I'll just be looking at a single operation, element-wise addition, to reduce the total number of cases. Different operations could change the overall values slightly, especially ones like matrix multiplication which have slightly more complex application rules. The stats are also from my computer, a slightly older i7 8700K, using Deno, which is V8 under the hood. A different runtime like Bun might behave very differently if it has different optimizations.

## The simple functional way

I thought I'd start here as this is how I might write it for a first draft. It's optimized for code readability, but I suspect the performance is actually pretty bad. Immutability is great for avoiding bugs but terrible for perf, especially when JS doesn't have intelligent copies.

``````
//mat.js
export function addMatrixFunc(a, b) {
	return a.map((row, ri) => row.map((val, ci) => b[ri][ci] + val));
}
``````

The matrix representation is an array of arrays. The outer array is rows, the inner arrays are columns.
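To make the layout concrete, here is a small sketch of a 2x3 matrix in this nested form. The names `m`, `rows`, `cols`, and `bottomLeft` are only for illustration.

```javascript
// A 2x3 matrix: 2 rows (outer array), 3 columns (each inner array)
const m = [
	[1, 2, 3],
	[4, 5, 6]
];

const rows = m.length;    // 2
const cols = m[0].length; // 3

// Indexing is m[row][col]
const bottomLeft = m[1][0]; // 4
```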

Using Deno's built-in benchmarking tool, we can see how this performs on different size matrices.

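``````
import { addMatrixFunc } from "./mat.js";
import { mat100A, mat100B } from "./data/mat-data.js";

// Small matrices are written inline (the values here are arbitrary)
Deno.bench("Add 1x1 (Func)", () => {
	addMatrixFunc([[1]], [[2]]);
});

Deno.bench("Add 2x2 (Func)", () => {
	addMatrixFunc([[1, 2], [3, 4]], [[5, 6], [7, 8]]);
});

/* ... */

Deno.bench("Add 100x100 (Func)", () => {
	addMatrixFunc(mat100A, mat100B);
});
``````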

The `mat100A` and `mat100B` are pre-generated 100x100 matrices which are way too big to put in the test file.

One note here: I don't think Deno lets you set the iterations or warmup iterations, at least not anymore. I think it just looks for the numbers to converge. The actual number of runs is displayed in the JSON output and it's slightly different per test.
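If you do want explicit control over warmup and iteration counts, a hand-rolled timer is one option. The sketch below is an assumption-laden illustration, not what `deno bench` does internally: `timeIt` and its `warmup`/`iterations` options are hypothetical names, and a simple average like this is much noisier than the tool's statistics.

```javascript
// Hypothetical helper: runs fn with explicit warmup and iteration counts
// and reports the average time per call in nanoseconds.
function timeIt(fn, { warmup = 1000, iterations = 10000 } = {}) {
	let lastResult; // keep the result around so the call has an observable effect
	for (let i = 0; i < warmup; i++) lastResult = fn();

	const start = performance.now();
	for (let i = 0; i < iterations; i++) lastResult = fn();
	const elapsedMs = performance.now() - start;

	return { avgNs: (elapsedMs * 1e6) / iterations, lastResult };
}

// Same logic as addMatrixFunc, inlined so the sketch is self-contained
const addFunc = (a, b) => a.map((row, ri) => row.map((val, ci) => b[ri][ci] + val));

const matA = [[1, 2], [3, 4]];
const matB = [[5, 6], [7, 8]];
const result = timeIt(() => addFunc(matA, matB));
console.log(`avg: ${result.avgNs.toFixed(1)}ns`);
```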

Here's how we do:

| Name | min | max | avg | p75 | p99 | p995 |
| --- | --- | --- | --- | --- | --- | --- |
| Add 1x1 (Func) | 63ns | 180ns | 70ns | 74ns | 113ns | 124ns |
| Add 2x2 (Func) | 144ns | 208ns | 152ns | 158ns | 184ns | 196ns |
| Add 4x4 (Func) | 312ns | 373ns | 329ns | 335ns | 370ns | 373ns |
| Add 8x8 (Func) | 694ns | 930ns | 724ns | 731ns | 930ns | 930ns |
| Add 16x16 (Func) | 1798ns | 1942ns | 1836ns | 1843ns | 1942ns | 1942ns |
| Add 32x32 (Func) | 5274ns | 6599ns | 5495ns | 5605ns | 6599ns | 6599ns |
| Add 64x64 (Func) | 13000ns | 2331200ns | 17451ns | 16300ns | 41900ns | 60700ns |
| Add 100x100 (Func) | 30800ns | 512800ns | 40269ns | 38200ns | 105700ns | 218300ns |

## Loops

So the first way I think we can improve is loops. Functions have overhead so if we remove that and be a little more imperative we can be a bit faster.

``````
export function addMatrixLoop(a, b) {
	const out = [];
	for (let row = 0; row < a.length; row++) {
		const arrayRow = [];
		for (let col = 0; col < a[0].length; col++) {
			arrayRow.push(a[row][col] + b[row][col]);
		}
		out.push(arrayRow);
	}
	return out;
}
``````

Note that I'm not going to do strict bounds checking. We're just assuming `a` and `b` are the same size, as the bounds checks just add overhead.
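For reference, this is roughly what such a check could look like. It is only a sketch: `assertSameShape` and `addMatrixLoopChecked` are hypothetical names, and none of the numbers in this article include this extra work.

```javascript
// Hypothetical guard, not used in the benchmarks: throws if the shapes differ
function assertSameShape(a, b) {
	if (a.length !== b.length) {
		throw new RangeError(`Row count mismatch: ${a.length} vs ${b.length}`);
	}
	for (let row = 0; row < a.length; row++) {
		if (a[row].length !== b[row].length) {
			throw new RangeError(`Column count mismatch in row ${row}`);
		}
	}
}

function addMatrixLoopChecked(a, b) {
	assertSameShape(a, b); // extra pass over the rows on every call
	const out = [];
	for (let row = 0; row < a.length; row++) {
		const arrayRow = [];
		for (let col = 0; col < a[row].length; col++) {
			arrayRow.push(a[row][col] + b[row][col]);
		}
		out.push(arrayRow);
	}
	return out;
}
```

Whether the safety is worth the cost depends on where the matrices come from. For internal hot paths with known shapes, skipping it seems reasonable.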

| Name | min | max | avg | p75 | p99 | p995 |
| --- | --- | --- | --- | --- | --- | --- |
| Add 1x1 (Loop) | 28ns | 210ns | 46ns | 47ns | 142ns | 168ns |
| Add 2x2 (Loop) | 55ns | 163ns | 71ns | 76ns | 125ns | 143ns |
| Add 4x4 (Loop) | 122ns | 227ns | 143ns | 151ns | 195ns | 225ns |
| Add 8x8 (Loop) | 360ns | 807ns | 411ns | 422ns | 744ns | 807ns |
| Add 16x16 (Loop) | 1179ns | 1246ns | 1208ns | 1217ns | 1246ns | 1246ns |
| Add 32x32 (Loop) | 5031ns | 5216ns | 5090ns | 5105ns | 5216ns | 5216ns |
| Add 64x64 (Loop) | 14300ns | 362400ns | 20651ns | 19200ns | 52900ns | 110500ns |
| Add 100x100 (Loop) | 38200ns | 425400ns | 54401ns | 54100ns | 227700ns | 256300ns |

Loops start out faster, but once we hit around 32x32 they are about equal to `.map`, and beyond that `.map` is faster. Very surprising!

## Pre-allocating arrays

My next idea was to pre-allocate the arrays since pushing into an array can cause re-sizes and maybe these are why this is slower.

``````
export function addMatrixLoopPreAlloc(a, b) {
	const out = new Array(a.length);
	for (let row = 0; row < a.length; row++) {
		const arrayRow = new Array(a[0].length);
		for (let col = 0; col < a[0].length; col++) {
			arrayRow[col] = a[row][col] + b[row][col];
		}
		out[row] = arrayRow;
	}
	return out;
}
``````

| Name | min | max | avg | p75 | p99 | p995 |
| --- | --- | --- | --- | --- | --- | --- |
| Add 1x1 (Loop Prealloc) | 13ns | 137ns | 18ns | 20ns | 56ns | 73ns |
| Add 2x2 (Loop Prealloc) | 25ns | 65ns | 28ns | 27ns | 45ns | 53ns |
| Add 4x4 (Loop Prealloc) | 61ns | 152ns | 73ns | 78ns | 124ns | 129ns |
| Add 8x8 (Loop Prealloc) | 203ns | 444ns | 228ns | 232ns | 348ns | 434ns |
| Add 16x16 (Loop Prealloc) | 710ns | 942ns | 762ns | 768ns | 942ns | 942ns |
| Add 32x32 (Loop Prealloc) | 2648ns | 2769ns | 2700ns | 2716ns | 2769ns | 2769ns |
| Add 64x64 (Loop Prealloc) | 9500ns | 372100ns | 10926ns | 10100ns | 25000ns | 35800ns |
| Add 100x100 (Loop Prealloc) | 24500ns | 515800ns | 28392ns | 26300ns | 62100ns | 204400ns |

That did the trick! Going by the averages, we're about 1.4x faster than where we started at 100x100, and roughly 3x to 5x faster at 8x8 and below!

## Unrolling the loops

What if we just removed all the loops and wrote it out long-hand?

``````
export function addMatrix4x4(a, b) {
	return [
		[a[0][0] + b[0][0], a[0][1] + b[0][1], a[0][2] + b[0][2], a[0][3] + b[0][3]],
		[a[1][0] + b[1][0], a[1][1] + b[1][1], a[1][2] + b[1][2], a[1][3] + b[1][3]],
		[a[2][0] + b[2][0], a[2][1] + b[2][1], a[2][2] + b[2][2], a[2][3] + b[2][3]],
		[a[3][0] + b[3][0], a[3][1] + b[3][1], a[3][2] + b[3][2], a[3][3] + b[3][3]]
	];
}
``````

This is not very flexible as you need a function for every shape of matrix you want to add. However, in some cases like 3D this isn't too bad because you have a very finite amount of things, usually only ever 4x4. In machine learning this might cause problems.

Here's a function that generates the javascript text for the unrolled loop:

``````
export function genMatAddBody(rows, cols) {
	let funcBody = "return [\n";

	for (let r = 0; r < rows; r++) {
		funcBody += "\t\t[";
		for (let c = 0; c < cols; c++) {
			funcBody += `a[${r}][${c}] + b[${r}][${c}]${c < cols - 1 ? ", " : ""}`;
		}
		funcBody += `]${r < rows - 1 ? ", " : ""}\n`;
	}

	funcBody += `\t];\n`;
	return funcBody;
}
``````

I was also curious if making this dynamic generation changes much:

``````
export function genMatAddFunc(rows, cols) {
	rows = Number(rows); //prevents code injection
	cols = Number(cols);
	const body = genMatAddBody(rows, cols);
	return new Function("a", "b", body);
}
``````

Since `new Function` is effectively eval, we should make sure to sanitize the input, which is what the `Number()` conversions are for.
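`Number()` turns any non-numeric string into `NaN`, so arbitrary text never reaches the generated source. It does still let `NaN` and fractional values through, though. A stricter check might look like the sketch below, where `toDimension` is a hypothetical helper that is not part of the benchmarked code.

```javascript
// Hypothetical stricter validator: accepts only positive integers
function toDimension(value) {
	const n = Number(value);
	if (!Number.isInteger(n) || n < 1) {
		throw new RangeError(`Invalid matrix dimension: ${value}`);
	}
	return n;
}

// Number() alone already turns code-like strings into NaN
const coerced = Number("1]; doEvil(); [");
const isNeutralized = Number.isNaN(coerced); // true

const ok = toDimension("4"); // 4
```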

``````
const addMatrix1x1Dyn = genMatAddFunc(1,1);
// etc.
``````

| Name | min | max | avg | p75 | p99 | p995 |
| --- | --- | --- | --- | --- | --- | --- |
| Add 1x1 (unrolled) | 7ns | 34ns | 8ns | 8ns | 19ns | 20ns |
| Add 1x1 (unrolled dynamic) | 7ns | 40ns | 8ns | 7ns | 19ns | 20ns |
| Add 2x2 (unrolled) | 11ns | 46ns | 13ns | 12ns | 26ns | 29ns |
| Add 2x2 (unrolled dynamic) | 11ns | 39ns | 12ns | 12ns | 27ns | 29ns |
| Add 4x4 (unrolled) | 36ns | 159ns | 59ns | 72ns | 124ns | 130ns |
| Add 4x4 (unrolled dynamic) | 36ns | 236ns | 67ns | 84ns | 156ns | 181ns |
| Add 8x8 (unrolled) | 92ns | 243ns | 130ns | 142ns | 235ns | 242ns |
| Add 8x8 (unrolled dynamic) | 89ns | 262ns | 113ns | 119ns | 186ns | 209ns |
| Add 16x16 (unrolled) | 500ns | 672800ns | 734ns | 600ns | 3400ns | 10500ns |
| Add 16x16 (unrolled dynamic) | 500ns | 2052000ns | 799ns | 600ns | 6400ns | 10600ns |
| Add 32x32 (unrolled) | 73800ns | 562500ns | 83976ns | 85200ns | 136400ns | 160600ns |
| Add 32x32 (unrolled dynamic) | 73000ns | 908200ns | 90772ns | 90900ns | 137900ns | 162600ns |
| Add 64x64 (unrolled) | 328700ns | 737300ns | 350104ns | 343900ns | 574500ns | 587000ns |
| Add 64x64 (unrolled dynamic) | 327600ns | 698800ns | 349201ns | 345400ns | 573900ns | 592400ns |
| Add 100x100 (unrolled) | 829600ns | 1250900ns | 876580ns | 873700ns | 1143900ns | 1157500ns |
| Add 100x100 (unrolled dynamic) | 816900ns | 1416300ns | 891844ns | 894500ns | 1227700ns | 1288200ns |

It's a big improvement for small sizes, beating the pre-allocated loop by about 1.5 to 2x, but it's not good for large ones, where it's considerably slower. I'm not sure why that's the case; maybe it has to do with the size of the function itself? The generated code is massive. Also, dynamic generation performs basically the same as writing the functions out by hand. So if you want to save payload (and aren't limited by CSP) you can dynamically create these at no penalty.
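If you go the dynamic route, you probably only want to compile each shape once. Here's a minimal sketch of a cache keyed by shape. `adderCache` and `getUnrolledAdder` are hypothetical names, and the generator is repeated (with simplified indentation) only so the example is self-contained.

```javascript
// Same idea as genMatAddBody above, repeated so this sketch runs on its own
function genMatAddBody(rows, cols) {
	let funcBody = "return [\n";
	for (let r = 0; r < rows; r++) {
		funcBody += "\t[";
		for (let c = 0; c < cols; c++) {
			funcBody += `a[${r}][${c}] + b[${r}][${c}]${c < cols - 1 ? ", " : ""}`;
		}
		funcBody += `]${r < rows - 1 ? ", " : ""}\n`;
	}
	funcBody += "];\n";
	return funcBody;
}

// Hypothetical cache keyed by "rows x cols", so each shape is compiled once
const adderCache = new Map();

function getUnrolledAdder(rows, cols) {
	rows = Number(rows);
	cols = Number(cols);
	const key = `${rows}x${cols}`;
	let adder = adderCache.get(key);
	if (!adder) {
		adder = new Function("a", "b", genMatAddBody(rows, cols));
		adderCache.set(key, adder);
	}
	return adder;
}

const add2x2 = getUnrolledAdder(2, 2);
const sum = add2x2([[1, 2], [3, 4]], [[10, 20], [30, 40]]);
// sum is [[11, 22], [33, 44]]
```

The map lookup adds a little overhead per call, so in a hot path you'd likely grab the function once and hold on to it.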

## Flattening the arrays

Another thing where I think we can save is the arrays. We technically don't need to have a lot of nested arrays and they'll add some overhead to create. So now a 2x2 array looks like this:

``````
[
	4, 7,
	10, 5
]
``````

However you now need to know the dimensions for this to work because different rectangular shapes can have the same number of elements. So maybe let's make it an object.

``````
{
	shape: [2,2],
	data: [
		4, 7,
		10, 5
	]
}
``````
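To see why the shape has to travel with the data, here's a small sketch. The helper `at` is a hypothetical name. The same six values give different answers depending on how many columns you assume.

```javascript
// The same six values can be read as a 2x3 or a 3x2 matrix
const data = [1, 2, 3, 4, 5, 6];

// Hypothetical helper: look up (row, col) given the number of columns
function at(data, cols, row, col) {
	return data[row * cols + col];
}

const as2x3 = at(data, 3, 1, 0); // row 1 of a 2x3 starts at index 3, so this is 4
const as3x2 = at(data, 2, 1, 0); // row 1 of a 3x2 starts at index 2, so this is 3
```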

The shape is an array rather than properties because we could scale this idea up into N-dimensional tensors. In fact, this is how libraries like tensorflowjs do it. For convenience let's build some functions to convert between the formats.

``````
export function nestedArrayToFlat(nested){
	return {
		shape: [nested.length, nested[0].length],
		data: nested.flat(Infinity)
	}
}

export function flatToNestedArray(flat){
	const data = new Array(flat.shape[0]);
	for(let row = 0; row < flat.shape[0]; row++){
		const rowArray = new Array(flat.shape[1]);
		for(let col = 0; col < flat.shape[1]; col++){
			rowArray[col] = flat.data[row * flat.shape[1] + col];
		}
		data[row] = rowArray;
	}
	return data;
}
``````

So far I think pre-allocated arrays and loops have the best general performance scaling to larger values, so we'll stick to that for now. This also means I'll be omitting the functional and plain loop variants, since they aren't winning in any category, as well as the dynamic variant, because it performs the same as unrolled.

``````
export function addMatrixFlat(a, b) {
	const out = {
		shape: a.shape,
		data: new Array(a.data.length)
	};
	for (let row = 0; row < a.shape[0]; row++) {
		for (let col = 0; col < a.shape[1]; col++) {
			const index = (row * a.shape[1]) + col;
			out.data[index] = a.data[index] + b.data[index];
		}
	}
	return out;
}
``````

| Name | min | max | avg | p75 | p99 | p995 |
| --- | --- | --- | --- | --- | --- | --- |
| Add 1x1 (flat) | 9ns | 53ns | 10ns | 10ns | 24ns | 29ns |
| Add 2x2 (flat) | 14ns | 49ns | 15ns | 15ns | 29ns | 30ns |
| Add 4x4 (flat) | 32ns | 107ns | 40ns | 46ns | 86ns | 94ns |
| Add 8x8 (flat) | 97ns | 167ns | 110ns | 113ns | 143ns | 157ns |
| Add 16x16 (flat) | 400ns | 548ns | 436ns | 447ns | 517ns | 548ns |
| Add 32x32 (flat) | 1985ns | 2900ns | 2222ns | 2276ns | 2900ns | 2900ns |
| Add 64x64 (flat) | 8512ns | 10514ns | 8775ns | 8715ns | 10514ns | 10514ns |
| Add 100x100 (flat) | 15500ns | 701100ns | 23261ns | 21800ns | 54200ns | 194800ns |
It's about 20% faster on larger matrices than our previous best but 20% slower on 1x1 and 2x2 than unrolled. Since those aren't too important I'd say this is a big win.

### Row vs column major

Does it matter if we traverse over rows versus columns? One might suspect that it could when CPU caching and stuff gets involved, but let's test.

``````
export function addMatrixFlatColMajor(a, b) {
	const out = {
		shape: a.shape,
		data: new Array(a.data.length)
	};
	for (let col = 0; col < a.shape[1]; col++) {
		for (let row = 0; row < a.shape[0]; row++) {
			const index = (row * a.shape[1]) + col;
			out.data[index] = a.data[index] + b.data[index];
		}
	}
	return out;
}
``````

| Name | min | max | avg | p75 | p99 | p995 |
| --- | --- | --- | --- | --- | --- | --- |
| Add 1x1 (flat col major) | 9ns | 41ns | 10ns | 9ns | 21ns | 22ns |
| Add 2x2 (flat col major) | 14ns | 41ns | 15ns | 14ns | 29ns | 32ns |
| Add 4x4 (flat col major) | 32ns | 79ns | 37ns | 37ns | 61ns | 67ns |
| Add 8x8 (flat col major) | 101ns | 156ns | 114ns | 116ns | 147ns | 153ns |
| Add 16x16 (flat col major) | 423ns | 532ns | 453ns | 465ns | 513ns | 532ns |
| Add 32x32 (flat col major) | 2047ns | 3228ns | 2199ns | 2258ns | 3228ns | 3228ns |
| Add 64x64 (flat col major) | 7500ns | 413800ns | 10417ns | 10200ns | 26200ns | 37000ns |
| Add 100x100 (flat col major) | 19800ns | 575300ns | 25090ns | 23500ns | 63000ns | 198500ns |

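As it turns out, column major traversal is actually a tiny bit slower than row major traversal at the larger sizes (the small sizes are within noise). This is likely because the data is laid out row by row, so row major traversal reads memory sequentially and uses cache lines more efficiently.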
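To picture the difference, this sketch lists the flat indices in the order each traversal touches them. `visitOrder` is a hypothetical helper written just for this illustration.

```javascript
// Hypothetical helper: lists the flat indices in the order each traversal visits them
function visitOrder(rows, cols, colMajor) {
	const order = [];
	const outer = colMajor ? cols : rows;
	const inner = colMajor ? rows : cols;
	for (let i = 0; i < outer; i++) {
		for (let j = 0; j < inner; j++) {
			const row = colMajor ? j : i;
			const col = colMajor ? i : j;
			order.push(row * cols + col);
		}
	}
	return order;
}

const rowOrder = visitOrder(2, 3, false); // [0, 1, 2, 3, 4, 5]
const colOrder = visitOrder(2, 3, true);  // [0, 3, 1, 4, 2, 5]
```

Row order walks the array front to back, while column order jumps by `cols` on every step. With a big enough matrix those jumps land on different cache lines.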

However, since element-wise add is so simple we can actually just ditch the loop structure and just add all the elements linearly with a single loop.

``````
export function addMatrixFlatSimple(a, b) {
	const out = {
		shape: a.shape,
		data: new Array(a.data.length)
	};
	for(let i = 0; i < a.data.length; i++){
		out.data[i] = a.data[i] + b.data[i];
	}
	return out;
}
``````

| Name | min | max | avg | p75 | p99 | p995 |
| --- | --- | --- | --- | --- | --- | --- |
| Add 1x1 (flat simple) | 7ns | 46ns | 8ns | 8ns | 18ns | 20ns |
| Add 2x2 (flat simple) | 9ns | 54ns | 10ns | 10ns | 23ns | 26ns |
| Add 4x4 (flat simple) | 18ns | 77ns | 24ns | 28ns | 51ns | 56ns |
| Add 8x8 (flat simple) | 55ns | 159ns | 73ns | 78ns | 125ns | 136ns |
| Add 16x16 (flat simple) | 276ns | 405ns | 315ns | 335ns | 393ns | 405ns |
| Add 32x32 (flat simple) | 1387ns | 1682ns | 1490ns | 1547ns | 1682ns | 1682ns |
| Add 64x64 (flat simple) | 6381ns | 7219ns | 6602ns | 6675ns | 7219ns | 7219ns |
| Add 100x100 (flat simple) | 9000ns | 598000ns | 17166ns | 15700ns | 49400ns | 178400ns |

This is 20%+ faster than the two-loop flat version.

### Unrolled

We can also unroll these too and see what happens, maybe the simpler structure helps? Using this code:

``````
export function genMatAddFlatBody(rows, cols){
	const total = rows * cols;
	let funcBody = "return [\n";

	for (let r = 0; r < rows; r++) {
		for (let c = 0; c < cols; c++) {
			const i = r * cols + c;
			// comma after every element except the last
			funcBody += `a[${i}] + b[${i}]${i < total - 1 ? ", " : ""}`;
		}
	}

	funcBody += `];\n`;
	return funcBody;
}
``````

We can generate functions like this:

``````
export function addMatrixFlat2x2(a, b) {
	return [
		a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]];
}
``````

We can dynamically create them with eval like this:

``````
export function genMatAddFlatFunc(rows, cols) {
	rows = Number(rows);
	cols = Number(cols);
	const body = genMatAddFlatBody(rows, cols);
	return new Function("a", "b", body);
}
``````

| Name | min | max | avg | p75 | p99 | p995 |
| --- | --- | --- | --- | --- | --- | --- |
| Add 1x1 (flat unrolled) | 6ns | 53ns | 7ns | 7ns | 19ns | 22ns |
| Add 2x2 (flat unrolled) | 7ns | 62ns | 8ns | 8ns | 21ns | 23ns |
| Add 4x4 (flat unrolled) | 24ns | 136ns | 37ns | 41ns | 84ns | 93ns |
| Add 8x8 (flat unrolled) | 61ns | 185ns | 81ns | 86ns | 131ns | 144ns |
| Add 16x16 (flat unrolled) | 300ns | 564700ns | 508ns | 400ns | 1000ns | 6100ns |
| Add 32x32 (flat unrolled) | 63600ns | 826700ns | 74574ns | 75200ns | 133000ns | 162600ns |
| Add 64x64 (flat unrolled) | 263500ns | 788800ns | 286503ns | 280600ns | 502900ns | 528900ns |
| Add 100x100 (flat unrolled) | 706400ns | 1760300ns | 764369ns | 758900ns | 1102800ns | 1118900ns |

It just edges out the simple loop at 1x1 and 2x2; beyond that it loses and gets much, much worse at large sizes.

## Typed arrays

So the next area of possible optimization I can see is to actually use types. We can do this in Javascript using typed-arrays. This will allow us to allocate a block of memory and reduce the overhead of any array structure. There's a more important reason, though: by using typed-arrays we can reduce conversions. APIs like WASM, WebGL and WebGPU deal with blocks of memory, and the less we have to convert the faster we're going to wind up. So I think even if it turns out this is a little bit slower, there are still good reasons to pursue it.

The downside is that we end up with different paths, one for floats and one for integers, and even then it might matter if we choose different bit-widths. Also, since we've already shown that flat structures perform better overall, we don't need to consider nested arrays. For brevity I'm not going to test all typed-array combos, as we'll start to see a general pattern.
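The typed-array variants aren't listed in this post, so the following is only a sketch of what one plausibly looks like, assuming the same `{ shape, data }` structure and single flat loop as before. `addMatrixFloat64` is a hypothetical name, and the benchmarked code is in the repo linked at the end.

```javascript
// Assumed shape of a typed-array variant; not copied from the benchmarked code
function addMatrixFloat64(a, b) {
	const out = {
		shape: a.shape,
		data: new Float64Array(a.data.length) // one contiguous, zero-filled buffer
	};
	for (let i = 0; i < a.data.length; i++) {
		out.data[i] = a.data[i] + b.data[i];
	}
	return out;
}

const matA = { shape: [2, 2], data: new Float64Array([4, 7, 10, 5]) };
const matB = { shape: [2, 2], data: new Float64Array([1, 1, 1, 1]) };
const sum = addMatrixFloat64(matA, matB);
// sum.data holds 5, 8, 11, 6
```

Swapping `Float64Array` for `Float32Array` or `Int32Array` would give the other variants under the same assumption.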

### Float 64

| Name | min | max | avg | p75 | p99 | p995 |
| --- | --- | --- | --- | --- | --- | --- |
| Add 1x1 (F64) | 330ns | 1600ns | 400ns | 397ns | 663ns | 1600ns |
| Add 2x2 (F64) | 329ns | 598ns | 393ns | 409ns | 493ns | 598ns |
| Add 4x4 (F64) | 393ns | 1786ns | 490ns | 503ns | 662ns | 1786ns |
| Add 8x8 (F64) | 490ns | 778ns | 621ns | 664ns | 778ns | 778ns |
| Add 16x16 (F64) | 1024ns | 5425ns | 1311ns | 1334ns | 5425ns | 5425ns |
| Add 32x32 (F64) | 3346ns | 4707ns | 3772ns | 4115ns | 4707ns | 4707ns |
| Add 64x64 (F64) | 8000ns | 2309700ns | 14203ns | 12700ns | 35300ns | 44800ns |
| Add 100x100 (F64) | 23200ns | 3328400ns | 35026ns | 33300ns | 82400ns | 231000ns |

Javascript numbers are float 64s, so it's really surprising that these are slower than a normal javascript array. Small matrices are actually slower than `array.map`. I'm guessing this has something to do with how the engine treats them. As the matrices get larger the gap narrows, but even at 100x100 it's still quite a bit slower than a normal flat array.

### Float 32

| Name | min | max | avg | p75 | p99 | p995 |
| --- | --- | --- | --- | --- | --- | --- |
| Add 1x1 (F32) | 324ns | 554ns | 380ns | 391ns | 506ns | 554ns |
| Add 2x2 (F32) | 324ns | 594ns | 391ns | 408ns | 520ns | 594ns |
| Add 4x4 (F32) | 396ns | 658ns | 463ns | 489ns | 569ns | 658ns |
| Add 8x8 (F32) | 508ns | 822ns | 620ns | 673ns | 822ns | 822ns |
| Add 16x16 (F32) | 1148ns | 1784ns | 1345ns | 1422ns | 1784ns | 1784ns |
| Add 32x32 (F32) | 3258ns | 3840ns | 3344ns | 3337ns | 3840ns | 3840ns |
| Add 64x64 (F32) | 10500ns | 1101800ns | 18473ns | 21600ns | 66500ns | 101200ns |
| Add 100x100 (F32) | 25800ns | 1797500ns | 37062ns | 35800ns | 99800ns | 245400ns |

F32 arrays have the same problem as F64 arrays. The performance is nearly identical despite the elements being smaller, so for pure speed there's no point in choosing them. In fact at 100x100 F64 arrays are slightly faster. The only benefit we get is halving our memory, which might be a reason to choose these.
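The memory saving is easy to check directly. The sketch below compares the buffer sizes for a 100x100 matrix and also shows the trade-off that comes with it, which is that values get rounded to 32-bit precision when stored.

```javascript
const elements = 100 * 100;

const f64 = new Float64Array(elements);
const f32 = new Float32Array(elements);

const f64Bytes = f64.byteLength; // 80000 (8 bytes per element)
const f32Bytes = f32.byteLength; // 40000 (4 bytes per element)

// The cost: values are rounded to 32-bit precision when stored
f32[0] = 0.1;
const isExact = f32[0] === 0.1; // false, the stored value is only close to 0.1
```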

### Int 32

| Name | min | max | avg | p75 | p99 | p995 |
| --- | --- | --- | --- | --- | --- | --- |
| Add 1x1 (I32) | 321ns | 1015ns | 390ns | 398ns | 704ns | 1015ns |
| Add 2x2 (I32) | 324ns | 570ns | 390ns | 403ns | 501ns | 570ns |
| Add 4x4 (I32) | 372ns | 530ns | 426ns | 443ns | 488ns | 530ns |
| Add 8x8 (I32) | 455ns | 621ns | 539ns | 575ns | 616ns | 621ns |
| Add 16x16 (I32) | 784ns | 1202ns | 913ns | 966ns | 1202ns | 1202ns |
| Add 32x32 (I32) | 2111ns | 2704ns | 2182ns | 2182ns | 2704ns | 2704ns |
| Add 64x64 (I32) | 8742ns | 9569ns | 9138ns | 9305ns | 9569ns | 9569ns |
| Add 100x100 (I32) | 12600ns | 2578300ns | 22470ns | 21600ns | 50300ns | 72200ns |

I32s again behave similarly but start to see much bigger gains at larger matrices. In fact at 100x100 the I32 matrix is about equal to a flat matrix. Not amazing but if you are dealing with large integer matrices this is probably your best choice.
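One thing to keep in mind if you go this route: an `Int32Array` silently converts whatever you store into a 32-bit signed integer. A quick sketch of what that means for sums that are fractional or too large:

```javascript
const ints = new Int32Array(3);

ints[0] = 2.9;            // fractional part is dropped, stores 2
ints[1] = -2.9;           // truncates toward zero, stores -2
ints[2] = 2147483647 + 1; // one past the 32-bit max wraps to -2147483648
```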

## Conclusion

For simple, single-threaded javascript we've observed a few things (in Deno/V8 @ 2023-03-31):

• Loops will mostly perform better than `.map`, except at very large sizes and only with nested arrays (I tried a flat array and the difference wasn't notable enough to copy-paste the data).
• Bespoke unrolled functions work well at very small sizes (4x4 or less) but don't beat a simple loop beyond 2x2, and they fall off very, very quickly.
• Reducing structure makes a big difference.
• Pre-allocating arrays makes a huge difference, always do this if you can.
• Typed-arrays provide no speed advantage (but we might get less conversion overhead and some space savings).

There are more ways we can deal with matrices. I'd like to maybe look at what WASM and WebGPU look like, with high overhead but potentially massive speed increases in the actual calculation due to parallelism. Web Workers as well. Also, different ops can vary widely. Matrix multiplication uses the left-hand and right-hand structures differently and might require some different strategies. But I think the biggest take-away is:

Your best bet for a generalized element-wise matrix op is a single flat loop over a normal JS array, as it's fast and scales well.
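As a rough sketch of what "generalized" could mean here, the single flat loop can take the operation as a parameter. `elementWise` is a hypothetical function that wasn't benchmarked, and the callback adds a function call per element, so it may not match the numbers above.

```javascript
// Hypothetical generalization of the single flat loop: op is applied to each pair of elements
function elementWise(a, b, op) {
	const out = {
		shape: a.shape,
		data: new Array(a.data.length) // pre-allocated, as in the faster variants
	};
	for (let i = 0; i < a.data.length; i++) {
		out.data[i] = op(a.data[i], b.data[i]);
	}
	return out;
}

const matA = { shape: [2, 2], data: [4, 7, 10, 5] };
const matB = { shape: [2, 2], data: [1, 2, 3, 4] };

const sum = elementWise(matA, matB, (x, y) => x + y);     // data: [5, 9, 13, 9]
const product = elementWise(matA, matB, (x, y) => x * y); // data: [4, 14, 30, 20]
```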

## The Data

| Name | min | max | avg | p75 | p99 | p995 |
| --- | --- | --- | --- | --- | --- | --- |
| Add 1x1 (Func) | 63ns | 180ns | 70ns | 74ns | 113ns | 124ns |
| Add 1x1 (Loop) | 28ns | 210ns | 46ns | 47ns | 142ns | 168ns |
| Add 1x1 (Loop Prealloc) | 13ns | 137ns | 18ns | 20ns | 56ns | 73ns |
| Add 1x1 (unrolled) | 7ns | 34ns | 8ns | 8ns | 19ns | 20ns |
| Add 1x1 (unrolled dynamic) | 7ns | 40ns | 8ns | 7ns | 19ns | 20ns |
| Add 1x1 (flat) | 9ns | 53ns | 10ns | 10ns | 24ns | 29ns |
| Add 1x1 (flat col major) | 9ns | 41ns | 10ns | 9ns | 21ns | 22ns |
| Add 1x1 (flat simple) | 7ns | 46ns | 8ns | 8ns | 18ns | 20ns |
| Add 1x1 (flat unrolled) | 6ns | 53ns | 7ns | 7ns | 19ns | 22ns |
| Add 1x1 (F64) | 330ns | 1600ns | 400ns | 397ns | 663ns | 1600ns |
| Add 1x1 (F32) | 324ns | 554ns | 380ns | 391ns | 506ns | 554ns |
| Add 1x1 (I32) | 321ns | 1015ns | 390ns | 398ns | 704ns | 1015ns |
| Add 2x2 (Func) | 144ns | 208ns | 152ns | 158ns | 184ns | 196ns |
| Add 2x2 (Loop) | 55ns | 163ns | 71ns | 76ns | 125ns | 143ns |
| Add 2x2 (Loop Prealloc) | 25ns | 65ns | 28ns | 27ns | 45ns | 53ns |
| Add 2x2 (unrolled) | 11ns | 46ns | 13ns | 12ns | 26ns | 29ns |
| Add 2x2 (unrolled dynamic) | 11ns | 39ns | 12ns | 12ns | 27ns | 29ns |
| Add 2x2 (flat) | 14ns | 49ns | 15ns | 15ns | 29ns | 30ns |
| Add 2x2 (flat col major) | 14ns | 41ns | 15ns | 14ns | 29ns | 32ns |
| Add 2x2 (flat simple) | 9ns | 54ns | 10ns | 10ns | 23ns | 26ns |
| Add 2x2 (flat unrolled) | 7ns | 62ns | 8ns | 8ns | 21ns | 23ns |
| Add 2x2 (F64) | 329ns | 598ns | 393ns | 409ns | 493ns | 598ns |
| Add 2x2 (F32) | 324ns | 594ns | 391ns | 408ns | 520ns | 594ns |
| Add 2x2 (I32) | 324ns | 570ns | 390ns | 403ns | 501ns | 570ns |
| Add 4x4 (Func) | 312ns | 373ns | 329ns | 335ns | 370ns | 373ns |
| Add 4x4 (Loop) | 122ns | 227ns | 143ns | 151ns | 195ns | 225ns |
| Add 4x4 (Loop Prealloc) | 61ns | 152ns | 73ns | 78ns | 124ns | 129ns |
| Add 4x4 (unrolled) | 36ns | 159ns | 59ns | 72ns | 124ns | 130ns |
| Add 4x4 (unrolled dynamic) | 36ns | 236ns | 67ns | 84ns | 156ns | 181ns |
| Add 4x4 (flat) | 32ns | 107ns | 40ns | 46ns | 86ns | 94ns |
| Add 4x4 (flat col major) | 32ns | 79ns | 37ns | 37ns | 61ns | 67ns |
| Add 4x4 (flat simple) | 18ns | 77ns | 24ns | 28ns | 51ns | 56ns |
| Add 4x4 (flat unrolled) | 24ns | 136ns | 37ns | 41ns | 84ns | 93ns |
| Add 4x4 (F64) | 393ns | 1786ns | 490ns | 503ns | 662ns | 1786ns |
| Add 4x4 (F32) | 396ns | 658ns | 463ns | 489ns | 569ns | 658ns |
| Add 4x4 (I32) | 372ns | 530ns | 426ns | 443ns | 488ns | 530ns |
| Add 8x8 (Func) | 694ns | 930ns | 724ns | 731ns | 930ns | 930ns |
| Add 8x8 (Loop) | 360ns | 807ns | 411ns | 422ns | 744ns | 807ns |
| Add 8x8 (Loop Prealloc) | 203ns | 444ns | 228ns | 232ns | 348ns | 434ns |
| Add 8x8 (unrolled) | 92ns | 243ns | 130ns | 142ns | 235ns | 242ns |
| Add 8x8 (unrolled dynamic) | 89ns | 262ns | 113ns | 119ns | 186ns | 209ns |
| Add 8x8 (flat) | 97ns | 167ns | 110ns | 113ns | 143ns | 157ns |
| Add 8x8 (flat col major) | 101ns | 156ns | 114ns | 116ns | 147ns | 153ns |
| Add 8x8 (flat simple) | 55ns | 159ns | 73ns | 78ns | 125ns | 136ns |
| Add 8x8 (flat unrolled) | 61ns | 185ns | 81ns | 86ns | 131ns | 144ns |
| Add 8x8 (F64) | 490ns | 778ns | 621ns | 664ns | 778ns | 778ns |
| Add 8x8 (F32) | 508ns | 822ns | 620ns | 673ns | 822ns | 822ns |
| Add 8x8 (I32) | 455ns | 621ns | 539ns | 575ns | 616ns | 621ns |
| Add 16x16 (Func) | 1798ns | 1942ns | 1836ns | 1843ns | 1942ns | 1942ns |
| Add 16x16 (Loop) | 1179ns | 1246ns | 1208ns | 1217ns | 1246ns | 1246ns |
| Add 16x16 (Loop Prealloc) | 710ns | 942ns | 762ns | 768ns | 942ns | 942ns |
| Add 16x16 (unrolled) | 500ns | 672800ns | 734ns | 600ns | 3400ns | 10500ns |
| Add 16x16 (unrolled dynamic) | 500ns | 2052000ns | 799ns | 600ns | 6400ns | 10600ns |
| Add 16x16 (flat) | 400ns | 548ns | 436ns | 447ns | 517ns | 548ns |
| Add 16x16 (flat col major) | 423ns | 532ns | 453ns | 465ns | 513ns | 532ns |
| Add 16x16 (flat simple) | 276ns | 405ns | 315ns | 335ns | 393ns | 405ns |
| Add 16x16 (flat unrolled) | 300ns | 564700ns | 508ns | 400ns | 1000ns | 6100ns |
| Add 16x16 (F64) | 1024ns | 5425ns | 1311ns | 1334ns | 5425ns | 5425ns |
| Add 16x16 (F32) | 1148ns | 1784ns | 1345ns | 1422ns | 1784ns | 1784ns |
| Add 16x16 (I32) | 784ns | 1202ns | 913ns | 966ns | 1202ns | 1202ns |
| Add 32x32 (Func) | 5274ns | 6599ns | 5495ns | 5605ns | 6599ns | 6599ns |
| Add 32x32 (Loop) | 5031ns | 5216ns | 5090ns | 5105ns | 5216ns | 5216ns |
| Add 32x32 (Loop Prealloc) | 2648ns | 2769ns | 2700ns | 2716ns | 2769ns | 2769ns |
| Add 32x32 (unrolled) | 73800ns | 562500ns | 83976ns | 85200ns | 136400ns | 160600ns |
| Add 32x32 (unrolled dynamic) | 73000ns | 908200ns | 90772ns | 90900ns | 137900ns | 162600ns |
| Add 32x32 (flat) | 1985ns | 2900ns | 2222ns | 2276ns | 2900ns | 2900ns |
| Add 32x32 (flat col major) | 2047ns | 3228ns | 2199ns | 2258ns | 3228ns | 3228ns |
| Add 32x32 (flat simple) | 1387ns | 1682ns | 1490ns | 1547ns | 1682ns | 1682ns |
| Add 32x32 (flat unrolled) | 63600ns | 826700ns | 74574ns | 75200ns | 133000ns | 162600ns |
| Add 32x32 (F64) | 3346ns | 4707ns | 3772ns | 4115ns | 4707ns | 4707ns |
| Add 32x32 (F32) | 3258ns | 3840ns | 3344ns | 3337ns | 3840ns | 3840ns |
| Add 32x32 (I32) | 2111ns | 2704ns | 2182ns | 2182ns | 2704ns | 2704ns |
| Add 64x64 (Func) | 13000ns | 2331200ns | 17451ns | 16300ns | 41900ns | 60700ns |
| Add 64x64 (Loop) | 14300ns | 362400ns | 20651ns | 19200ns | 52900ns | 110500ns |
| Add 64x64 (Loop Prealloc) | 9500ns | 372100ns | 10926ns | 10100ns | 25000ns | 35800ns |
| Add 64x64 (unrolled) | 328700ns | 737300ns | 350104ns | 343900ns | 574500ns | 587000ns |
| Add 64x64 (unrolled dynamic) | 327600ns | 698800ns | 349201ns | 345400ns | 573900ns | 592400ns |
| Add 64x64 (flat) | 8512ns | 10514ns | 8775ns | 8715ns | 10514ns | 10514ns |
| Add 64x64 (flat col major) | 7500ns | 413800ns | 10417ns | 10200ns | 26200ns | 37000ns |
| Add 64x64 (flat simple) | 6381ns | 7219ns | 6602ns | 6675ns | 7219ns | 7219ns |
| Add 64x64 (flat unrolled) | 263500ns | 788800ns | 286503ns | 280600ns | 502900ns | 528900ns |
| Add 64x64 (F64) | 8000ns | 2309700ns | 14203ns | 12700ns | 35300ns | 44800ns |
| Add 64x64 (F32) | 10500ns | 1101800ns | 18473ns | 21600ns | 66500ns | 101200ns |
| Add 64x64 (I32) | 8742ns | 9569ns | 9138ns | 9305ns | 9569ns | 9569ns |
| Add 100x100 (Func) | 30800ns | 512800ns | 40269ns | 38200ns | 105700ns | 218300ns |
| Add 100x100 (Loop) | 38200ns | 425400ns | 54401ns | 54100ns | 227700ns | 256300ns |
| Add 100x100 (Loop Prealloc) | 24500ns | 515800ns | 28392ns | 26300ns | 62100ns | 204400ns |
| Add 100x100 (unrolled) | 829600ns | 1250900ns | 876580ns | 873700ns | 1143900ns | 1157500ns |
| Add 100x100 (unrolled dynamic) | 816900ns | 1416300ns | 891844ns | 894500ns | 1227700ns | 1288200ns |
| Add 100x100 (flat) | 15500ns | 701100ns | 23261ns | 21800ns | 54200ns | 194800ns |
| Add 100x100 (flat col major) | 19800ns | 575300ns | 25090ns | 23500ns | 63000ns | 198500ns |
| Add 100x100 (flat simple) | 9000ns | 598000ns | 17166ns | 15700ns | 49400ns | 178400ns |
| Add 100x100 (flat unrolled) | 706400ns | 1760300ns | 764369ns | 758900ns | 1102800ns | 1118900ns |
| Add 100x100 (F64) | 23200ns | 3328400ns | 35026ns | 33300ns | 82400ns | 231000ns |
| Add 100x100 (F32) | 25800ns | 1797500ns | 37062ns | 35800ns | 99800ns | 245400ns |
| Add 100x100 (I32) | 12600ns | 2578300ns | 22470ns | 21600ns | 50300ns | 72200ns |

## Code

https://github.com/ndesmic/fast-mat