ndesmic

Posted on Apr 3, 2023 • Edited on May 22, 2023

Fast Matrix Math in JS

#javascript #performance #deno #math

There are a lot of ways to represent matrix math in JS. Some are readable and some are fast. I wanted to explore the differences a bit. How much are certain techniques actually saving me?

For this I'll look at just a single operation, element-wise addition, to reduce the total number of cases. Different operations could change the overall values somewhat, especially ones like matrix multiplication which require slightly more complex application rules. The stats are also from my computer, a slightly older i7 8700K, using Deno, which is V8 under the hood. A different runtime like Bun might behave very differently if it has different optimizations.

The simple functional way

I thought I'd start here as this is how I might write it for a first draft. It's optimized for compact code, but I'd suspect the performance is actually pretty bad. Immutability is great for avoiding bugs but terrible for perf, especially when JS doesn't have intelligent copies.



//mat.js
export function addMatrixFunc(a, b) {
    return a.map((row, ri) => row.map((val, ci) => b[ri][ci] + val));
}

The matrix representation is an array of arrays. The outer array is rows, the inner arrays are columns.
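
As a quick illustration of this representation, here is a small usage sketch. The function is repeated so the snippet runs on its own, and the sample values are arbitrary.

```javascript
// Same function as above, repeated so the snippet is self-contained.
function addMatrixFunc(a, b) {
    return a.map((row, ri) => row.map((val, ci) => b[ri][ci] + val));
}

// A 2x3 matrix: 2 rows (outer array), 3 columns (inner arrays).
const a = [
    [1, 2, 3],
    [4, 5, 6]
];
const b = [
    [10, 20, 30],
    [40, 50, 60]
];

const sum = addMatrixFunc(a, b);
console.log(sum); // [[11, 22, 33], [44, 55, 66]]
console.log(a[0][0]); // 1, the inputs are left untouched
```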

Using Deno's built-in benchmarking tool, we can see how this performs on different size matrices.



import { addMatrixFunc } from "./mat.js";
import { mat100A, mat100B } from "./data/mat-data.js";

Deno.bench("Add 1x1", () => {
    addMatrixFunc([[44]], [[65]]);
});

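Deno.bench("Add 2x2", () => {
    // sample values are arbitrary
    addMatrixFunc([[44, 12], [7, 31]], [[65, 3], [18, 90]]);
});

Deno.bench("Add 4x4", () => {
    addMatrixFunc(
        [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]],
        [[16, 15, 14, 13], [12, 11, 10, 9], [8, 7, 6, 5], [4, 3, 2, 1]]
    );
});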

/* ... */

Deno.bench("Add 100x100", () => {
    addMatrixFunc(mat100A, mat100B);
});

The mat100A and mat100B are pre-generated 100x100 matrices which are way too big to put in the test file.
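
The data file itself isn't shown here, so the following is only a guess at how such matrices could be produced. `makeRandomMatrix` is a hypothetical helper, and the value range is arbitrary. Generating the data once, outside the benchmark callback, keeps the generation cost out of the measurement.

```javascript
// Hypothetical helper: builds a rows x cols nested-array matrix of random integers in [0, 100).
function makeRandomMatrix(rows, cols) {
    const out = new Array(rows);
    for (let row = 0; row < rows; row++) {
        const arrayRow = new Array(cols);
        for (let col = 0; col < cols; col++) {
            arrayRow[col] = Math.floor(Math.random() * 100);
        }
        out[row] = arrayRow;
    }
    return out;
}

// Stand-ins for the pre-generated matrices (assumed shape: 100x100).
const mat100A = makeRandomMatrix(100, 100);
const mat100B = makeRandomMatrix(100, 100);
```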

One note here: I don't think Deno lets you set the iterations or warmup iterations, at least not anymore. I think it just looks for convergence of the numbers. The actual number of runs is displayed in the JSON output and it's slightly different per test.
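
If you do want explicit control over warmup and iteration counts, a rough manual harness is one option. This is only a sketch: `timeIt` and its option names are made up for illustration, it relies on the global `performance.now()` (whose resolution varies by runtime), and it reports just an average rather than the percentiles Deno's tool gives.

```javascript
// Hypothetical helper: times fn with explicit warmup and iteration counts.
// Returns the average time per call in nanoseconds plus the last result.
function timeIt(fn, { warmup = 1000, iterations = 10000 } = {}) {
    let sink; // keep the result around so the call is less likely to be optimized away
    for (let i = 0; i < warmup; i++) {
        sink = fn();
    }
    const start = performance.now();
    for (let i = 0; i < iterations; i++) {
        sink = fn();
    }
    const elapsedMs = performance.now() - start;
    return { avgNs: (elapsedMs * 1e6) / iterations, sink };
}

const result = timeIt(
    () => [[44]].map((row) => row.map((val) => val + 65)),
    { warmup: 100, iterations: 1000 }
);
console.log(`${result.avgNs.toFixed(1)}ns per call`);
```

Treat numbers from something like this as rough comparisons only; the tables in this post come from Deno's tool, not from this helper.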

Here's how we do:

Name	min	max	avg	p75	p99	p995
Add 1x1 (Func)	63ns	180ns	70ns	74ns	113ns	124ns
Add 2x2 (Func)	144ns	208ns	152ns	158ns	184ns	196ns
Add 4x4 (Func)	312ns	373ns	329ns	335ns	370ns	373ns
Add 8x8 (Func)	694ns	930ns	724ns	731ns	930ns	930ns
Add 16x16 (Func)	1798ns	1942ns	1836ns	1843ns	1942ns	1942ns
Add 32x32 (Func)	5274ns	6599ns	5495ns	5605ns	6599ns	6599ns
Add 64x64 (Func)	13000ns	2331200ns	17451ns	16300ns	41900ns	60700ns
Add 100x100 (Func)	30800ns	512800ns	40269ns	38200ns	105700ns	218300ns

Loops

So the first way I think we can improve is loops. Functions have overhead so if we remove that and be a little more imperative we can be a bit faster.



export function addMatrixLoop(a, b) {
    const out = [];
    for (let row = 0; row < a.length; row++) {
        const arrayRow = [];
        for (let col = 0; col < a[0].length; col++) {
            arrayRow.push(a[row][col] + b[row][col])
        }
        out.push(arrayRow);
    }
    return out;
}

Note that I'm not going to do strict bounds checking, we're just assuming a and b are the same size as the bounds checks just add overhead.
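
If you do need the safety, one option is to validate the shapes once up front and keep the inner loop unchecked. A minimal sketch, where `assertSameShape` and `addMatrixLoopChecked` are names invented for this example:

```javascript
// Hypothetical helper: throws if two nested-array matrices differ in shape.
function assertSameShape(a, b) {
    if (a.length !== b.length) {
        throw new RangeError(`Row count mismatch: ${a.length} vs ${b.length}`);
    }
    for (let row = 0; row < a.length; row++) {
        if (a[row].length !== b[row].length) {
            throw new RangeError(`Column count mismatch in row ${row}`);
        }
    }
}

function addMatrixLoopChecked(a, b) {
    assertSameShape(a, b); // validate once, outside the hot loop
    const out = [];
    for (let row = 0; row < a.length; row++) {
        const arrayRow = [];
        for (let col = 0; col < a[row].length; col++) {
            arrayRow.push(a[row][col] + b[row][col]);
        }
        out.push(arrayRow);
    }
    return out;
}

console.log(addMatrixLoopChecked([[1, 2]], [[3, 4]])); // [[4, 6]]
```

The check costs one extra pass over the rows, so it was left out of the benchmarked functions.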

Name	min	max	avg	p75	p99	p995
Add 1x1 (Loop)	28ns	210ns	46ns	47ns	142ns	168ns
Add 2x2 (Loop)	55ns	163ns	71ns	76ns	125ns	143ns
Add 4x4 (Loop)	122ns	227ns	143ns	151ns	195ns	225ns
Add 8x8 (Loop)	360ns	807ns	411ns	422ns	744ns	807ns
Add 16x16 (Loop)	1179ns	1246ns	1208ns	1217ns	1246ns	1246ns
Add 32x32 (Loop)	5031ns	5216ns	5090ns	5105ns	5216ns	5216ns
Add 64x64 (Loop)	14300ns	362400ns	20651ns	19200ns	52900ns	110500ns
Add 100x100 (Loop)	38200ns	425400ns	54401ns	54100ns	227700ns	256300ns

Loops start out faster, but once we hit around 32x32 they are about equal to .map, and at larger sizes .map is faster. Very surprising!

Pre-allocating arrays

My next idea was to pre-allocate the arrays since pushing into an array can cause re-sizes and maybe these are why this is slower.



export function addMatrixLoopPreAlloc(a, b) {
    const out = new Array(a.length);
    for (let row = 0; row < a.length; row++) {
        const arrayRow = new Array(a[0].length);
        for (let col = 0; col < a[0].length; col++) {
            arrayRow[col] = a[row][col] + b[row][col];
        }
        out[row] = arrayRow;
    }
    return out;
}

Name	min	max	avg	p75	p99	p995
Add 1x1 (Loop Prealloc)	13ns	137ns	18ns	20ns	56ns	73ns
Add 2x2 (Loop Prealloc)	25ns	65ns	28ns	27ns	45ns	53ns
Add 4x4 (Loop Prealloc)	61ns	152ns	73ns	78ns	124ns	129ns
Add 8x8 (Loop Prealloc)	203ns	444ns	228ns	232ns	348ns	434ns
Add 16x16 (Loop Prealloc)	710ns	942ns	762ns	768ns	942ns	942ns
Add 32x32 (Loop Prealloc)	2648ns	2769ns	2700ns	2716ns	2769ns	2769ns
Add 64x64 (Loop Prealloc)	9500ns	372100ns	10926ns	10100ns	25000ns	35800ns
Add 100x100 (Loop Prealloc)	24500ns	515800ns	28392ns	26300ns	62100ns	204400ns

That did the trick! At 100x100 we're about 1.4x faster than where we started, and at small sizes the gap is even larger (almost 4x at 1x1).
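
Before trusting timing comparisons it is worth confirming that the variants return the same result. Here is a small sanity check, with the three functions repeated so it runs standalone. `sameMatrix` is a helper made up for this sketch.

```javascript
// The three nested-array variants from above, repeated so the check runs standalone.
function addMatrixFunc(a, b) {
    return a.map((row, ri) => row.map((val, ci) => b[ri][ci] + val));
}

function addMatrixLoop(a, b) {
    const out = [];
    for (let row = 0; row < a.length; row++) {
        const arrayRow = [];
        for (let col = 0; col < a[0].length; col++) {
            arrayRow.push(a[row][col] + b[row][col]);
        }
        out.push(arrayRow);
    }
    return out;
}

function addMatrixLoopPreAlloc(a, b) {
    const out = new Array(a.length);
    for (let row = 0; row < a.length; row++) {
        const arrayRow = new Array(a[0].length);
        for (let col = 0; col < a[0].length; col++) {
            arrayRow[col] = a[row][col] + b[row][col];
        }
        out[row] = arrayRow;
    }
    return out;
}

// Hypothetical helper: compares two nested-array matrices element by element.
function sameMatrix(x, y) {
    return x.length === y.length &&
        x.every((row, ri) => row.length === y[ri].length &&
            row.every((val, ci) => val === y[ri][ci]));
}

const a = [[1, 2], [3, 4]];
const b = [[5, 6], [7, 8]];
const expected = addMatrixFunc(a, b);
console.log(sameMatrix(addMatrixLoop(a, b), expected)); // true
console.log(sameMatrix(addMatrixLoopPreAlloc(a, b), expected)); // true
```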

Unrolling the loops

What if we just removed all the loops and wrote it out long-hand?



export function addMatrix4x4(a, b) {
    return [
        [a[0][0] + b[0][0], a[0][1] + b[0][1], a[0][2] + b[0][2], a[0][3] + b[0][3]],
        [a[1][0] + b[1][0], a[1][1] + b[1][1], a[1][2] + b[1][2], a[1][3] + b[1][3]],
        [a[2][0] + b[2][0], a[2][1] + b[2][1], a[2][2] + b[2][2], a[2][3] + b[2][3]],
        [a[3][0] + b[3][0], a[3][1] + b[3][1], a[3][2] + b[3][2], a[3][3] + b[3][3]]
    ];

}

This is not very flexible as you need a function for every shape of matrix you want to add. However, in some cases like 3D this isn't too bad because you have a very finite amount of things, usually only ever 4x4. In machine learning this might cause problems.

Here's a function that generates the javascript text for the unrolled loop:



export function genMatAddBody(rows, cols) {
    let funcBody = "return [\n";

    for (let r = 0; r < rows; r++) {
        funcBody += "\t\t["
        for (let c = 0; c < cols; c++) {
            funcBody += `a[${r}][${c}] + b[${r}][${c}]${c < cols - 1 ? ", " : ""}`
        }
        funcBody += `]${r < rows - 1 ? ", " : ""}\n`
    }

    funcBody += `\t];\n`
    return funcBody;
}

I was also curious whether generating these functions dynamically changes much:



export function genMatAddFunc(rows, cols) {
    rows = Number(rows); //prevents code injection
    cols = Number(cols);
    const body = genMatAddBody(rows, cols);
    return new Function("a", "b", body);
}

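Since new Function is a form of eval, we should make sure to sanitize the input. Converting rows and cols to numbers means no arbitrary text ends up in the generated code.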
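
One thing to be aware of: `Number()` turns bad input into `NaN`, in which case the generator loops never run and the generated function quietly returns an empty array. If you would rather fail loudly, a stricter check is possible. `toDimension` below is a hypothetical helper, not part of the code above.

```javascript
// Hypothetical stricter check: only positive integers are accepted as dimensions.
function toDimension(value, name) {
    const n = Number(value);
    if (!Number.isInteger(n) || n < 1) {
        throw new RangeError(`${name} must be a positive integer, got ${value}`);
    }
    return n;
}

console.log(Number("2; alert(1)")); // NaN, so no injected text reaches the function body
console.log(toDimension("4", "rows")); // 4
```

Calling `toDimension(rows, "rows")` in place of `Number(rows)` would reject `NaN`, fractions, zero and negative sizes before any code is generated.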



const addMatrix1x1Dyn = genMatAddFunc(1,1);
const addMatrix2x2Dyn = genMatAddFunc(2,2);
const addMatrix4x4Dyn = genMatAddFunc(4,4);
// etc.
const addMatrix100x100Dyn = genMatAddFunc(100,100);

Name	min	max	avg	p75	p99	p995
Add 1x1 (unrolled)	7ns	34ns	8ns	8ns	19ns	20ns
Add 1x1 (unrolled dynamic)	7ns	40ns	8ns	7ns	19ns	20ns
Add 2x2 (unrolled)	11ns	46ns	13ns	12ns	26ns	29ns
Add 2x2 (unrolled dynamic)	11ns	39ns	12ns	12ns	27ns	29ns
Add 4x4 (unrolled)	36ns	159ns	59ns	72ns	124ns	130ns
Add 4x4 (unrolled dynamic)	36ns	236ns	67ns	84ns	156ns	181ns
Add 8x8 (unrolled)	92ns	243ns	130ns	142ns	235ns	242ns
Add 8x8 (unrolled dynamic)	89ns	262ns	113ns	119ns	186ns	209ns
Add 16x16 (unrolled)	500ns	672800ns	734ns	600ns	3400ns	10500ns
Add 16x16 (unrolled dynamic)	500ns	2052000ns	799ns	600ns	6400ns	10600ns
Add 32x32 (unrolled)	73800ns	562500ns	83976ns	85200ns	136400ns	160600ns
Add 32x32 (unrolled dynamic)	73000ns	908200ns	90772ns	90900ns	137900ns	162600ns
Add 64x64 (unrolled)	328700ns	737300ns	350104ns	343900ns	574500ns	587000ns
Add 64x64 (unrolled dynamic)	327600ns	698800ns	349201ns	345400ns	573900ns	592400ns
Add 100x100 (unrolled)	829600ns	1250900ns	876580ns	873700ns	1143900ns	1157500ns
Add 100x100 (unrolled dynamic)	816900ns	1416300ns	891844ns	894500ns	1227700ns	1288200ns

It's a big improvement for small sizes, beating the pre-allocated loop by about 1.5 to 2x, but not good for large ones, where it is considerably slower. I'm not sure why that's the case; maybe it has to do with the size of the function itself? The generated code is massive. Also, dynamic generation performs basically the same as writing the functions out by hand. So if you want to save payload (and aren't limited by CSP) you can dynamically create these at no penalty.
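
If you go the dynamic route, you probably only want to generate each shape once. A minimal sketch of a cache keyed by shape follows; `matAddCache` and `getMatAddFunc` are names invented for this example, and the generator is repeated so the snippet is self-contained.

```javascript
// Same generator as above, repeated so the snippet is self-contained.
function genMatAddBody(rows, cols) {
    let funcBody = "return [\n";
    for (let r = 0; r < rows; r++) {
        funcBody += "\t\t[";
        for (let c = 0; c < cols; c++) {
            funcBody += `a[${r}][${c}] + b[${r}][${c}]${c < cols - 1 ? ", " : ""}`;
        }
        funcBody += `]${r < rows - 1 ? ", " : ""}\n`;
    }
    funcBody += "\t];\n";
    return funcBody;
}

// Hypothetical cache keyed by "rowsxcols", so each shape is generated only once.
const matAddCache = new Map();

function getMatAddFunc(rows, cols) {
    rows = Number(rows); // same sanitizing step as genMatAddFunc
    cols = Number(cols);
    const key = `${rows}x${cols}`;
    let fn = matAddCache.get(key);
    if (fn === undefined) {
        fn = new Function("a", "b", genMatAddBody(rows, cols));
        matAddCache.set(key, fn);
    }
    return fn;
}

const add2x2 = getMatAddFunc(2, 2);
console.log(add2x2([[1, 2], [3, 4]], [[10, 20], [30, 40]])); // [[11, 22], [33, 44]]
console.log(getMatAddFunc(2, 2) === add2x2); // true, the second call hits the cache
```

The lookup adds a little overhead of its own, so in a hot path you would fetch the function once and hold on to it.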

Flattening the arrays

Another thing where I think we can save is the arrays. We technically don't need to have a lot of nested arrays and they'll add some overhead to create. So now a 2x2 array looks like this:



[
    4, 7,
    10, 5
]

However you now need to know the dimensions for this to work because different rectangular shapes can have the same number of elements. So maybe let's make it an object.



{
    shape: [2,2],
    data: [
        4, 7,
        10, 5
    ]
}

The shape is an array rather than separate properties because we could scale this idea up to N-dimensional tensors. In fact, this is how libraries like TensorFlow.js do it. For convenience let's build some functions to convert between the formats.



export function nestedArrayToFlat(nested){
    return {
        shape: [nested.length, nested[0].length],
        data: nested.flat(Infinity)
    }
}

export function flatToNestedArray(flat){
    const data = new Array(flat.shape[0]);
    for(let row = 0; row < flat.shape[0]; row++){
        const rowArray = new Array(flat.shape[1]);
        for(let col = 0; col < flat.shape[1]; col++){
            rowArray[col] = flat.data[row * flat.shape[1] + col]; 
        }
        data[row] = rowArray;
    }
    return data;
}
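
A quick round trip shows how the two formats relate. The conversion functions are repeated so the snippet runs on its own, and `getAt` is a hypothetical accessor added only to spell out the row-major index formula `row * cols + col`.

```javascript
function nestedArrayToFlat(nested) {
    return {
        shape: [nested.length, nested[0].length],
        data: nested.flat(Infinity)
    };
}

function flatToNestedArray(flat) {
    const data = new Array(flat.shape[0]);
    for (let row = 0; row < flat.shape[0]; row++) {
        const rowArray = new Array(flat.shape[1]);
        for (let col = 0; col < flat.shape[1]; col++) {
            rowArray[col] = flat.data[row * flat.shape[1] + col];
        }
        data[row] = rowArray;
    }
    return data;
}

// Hypothetical accessor: element at (row, col) lives at index row * cols + col.
function getAt(flat, row, col) {
    return flat.data[row * flat.shape[1] + col];
}

const nested = [
    [1, 2, 3],
    [4, 5, 6]
];
const flat = nestedArrayToFlat(nested);
console.log(flat.shape); // [2, 3]
console.log(flat.data); // [1, 2, 3, 4, 5, 6]
console.log(getAt(flat, 1, 2)); // 6, found at index 1 * 3 + 2 = 5
console.log(flatToNestedArray(flat)); // [[1, 2, 3], [4, 5, 6]]
```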

So far I think pre-allocated arrays and loops have the best general performance when scaling to larger sizes, so we'll stick to that for now. This also means I'll be omitting the flat versions of .map and the push-based loop, since they aren't winning in any category, as well as the dynamic version because it performs the same as unrolled.



export function addMatrixFlat(a, b) {
    const out = {
        shape: a.shape,
        data: new Array(a.data.length)
    };
    for (let row = 0; row < a.shape[0]; row++) {
        for (let col = 0; col < a.shape[1]; col++) {
            const index = (row * a.shape[1]) + col;
            out.data[index] = a.data[index] + b.data[index];
        }
    }
    return out;
}

Name	min	max	avg	p75	p99	p995
Add 1x1 (flat)	9ns	53ns	10ns	10ns	24ns	29ns
Add 2x2 (flat)	14ns	49ns	15ns	15ns	29ns	30ns
Add 4x4 (flat)	32ns	107ns	40ns	46ns	86ns	94ns
Add 8x8 (flat)	97ns	167ns	110ns	113ns	143ns	157ns
Add 16x16 (flat)	400ns	548ns	436ns	447ns	517ns	548ns
Add 32x32 (flat)	1985ns	2900ns	2222ns	2276ns	2900ns	2900ns
Add 64x64 (flat)	8512ns	10514ns	8775ns	8715ns	10514ns	10514ns
Add 100x100 (flat)	15500ns	701100ns	23261ns	21800ns	54200ns	194800ns

It's about 20% faster on larger matrices than our previous best but 20% slower on 1x1 and 2x2 than unrolled. Since those aren't too important I'd say this is a big win.

Row vs column major

Does it matter if we traverse over rows versus columns? One might suspect that it could when CPU caching and such gets involved, but let's test.



export function addMatrixFlatColMajor(a, b) {
    const out = {
        shape: a.shape,
        data: new Array(a.data.length)
    };
    for (let col = 0; col < a.shape[1]; col++) {
        for (let row = 0; row < a.shape[0]; row++) {
            const index = (row * a.shape[1]) + col;
            out.data[index] = a.data[index] + b.data[index];
        }
    }
    return out;
}

Name	min	max	avg	p75	p99	p995
Add 1x1 (flat col major)	9ns	41ns	10ns	9ns	21ns	22ns
Add 2x2 (flat col major)	14ns	41ns	15ns	14ns	29ns	32ns
Add 4x4 (flat col major)	32ns	79ns	37ns	37ns	61ns	67ns
Add 8x8 (flat col major)	101ns	156ns	114ns	116ns	147ns	153ns
Add 16x16 (flat col major)	423ns	532ns	453ns	465ns	513ns	532ns
Add 32x32 (flat col major)	2047ns	3228ns	2199ns	2258ns	3228ns	3228ns
Add 64x64 (flat col major)	7500ns	413800ns	10417ns	10200ns	26200ns	37000ns
Add 100x100 (flat col major)	19800ns	575300ns	25090ns	23500ns	63000ns	198500ns

As it turns out, column-major traversal is actually a tiny bit slower than row-major traversal. This is likely because row-major traversal reads the data in the order it is laid out in memory, so cache lines are used more efficiently.
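
Note that the data is stored row by row in both versions; only the visiting order changes. To make the difference concrete, this sketch lists the flat indices in the order each traversal touches them. `visitOrder` is a helper made up for illustration.

```javascript
// Hypothetical helper: lists the flat indices in the order each traversal visits them.
function visitOrder(rows, cols, colMajor) {
    const order = [];
    const outer = colMajor ? cols : rows;
    const inner = colMajor ? rows : cols;
    for (let i = 0; i < outer; i++) {
        for (let j = 0; j < inner; j++) {
            const row = colMajor ? j : i;
            const col = colMajor ? i : j;
            order.push(row * cols + col);
        }
    }
    return order;
}

console.log(visitOrder(2, 3, false)); // [0, 1, 2, 3, 4, 5]
console.log(visitOrder(2, 3, true)); // [0, 3, 1, 4, 2, 5]
```

Row-major traversal walks the array sequentially, while column-major traversal jumps ahead by the column count on every step.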

However, since element-wise add is so simple we can actually just ditch the loop structure and just add all the elements linearly with a single loop.



export function addMatrixFlatSimple(a, b) {
    const out = {
        shape: a.shape,
        data: new Array(a.data.length)
    };
    for(let i = 0; i < a.data.length; i++){
        out.data[i] = a.data[i] + b.data[i];
    }
    return out;
}

Name	min	max	avg	p75	p99	p995
Add 1x1 (flat simple)	7ns	46ns	8ns	8ns	18ns	20ns
Add 2x2 (flat simple)	9ns	54ns	10ns	10ns	23ns	26ns
Add 4x4 (flat simple)	18ns	77ns	24ns	28ns	51ns	56ns
Add 8x8 (flat simple)	55ns	159ns	73ns	78ns	125ns	136ns
Add 16x16 (flat simple)	276ns	405ns	315ns	335ns	393ns	405ns
Add 32x32 (flat simple)	1387ns	1682ns	1490ns	1547ns	1682ns	1682ns
Add 64x64 (flat simple)	6381ns	7219ns	6602ns	6675ns	7219ns	7219ns
Add 100x100 (flat simple)	9000ns	598000ns	17166ns	15700ns	49400ns	178400ns

This is 20% or more faster than the two-loop flat version at every size tested.

Unrolled

We can also unroll these and see what happens; maybe the simpler structure helps? Using this code:



export function genMatAddFlatBody(rows, cols){
    let funcBody = "return [\n";
    const last = rows * cols - 1; // flat index of the final element

    for (let r = 0; r < rows; r++) {
        for (let c = 0; c < cols; c++) {
            const index = r * cols + c;
            // comma after every element except the last (also correct for 1xN and Nx1 shapes)
            funcBody += `a[${index}] + b[${index}]${index < last ? ", " : ""}`
        }
    }

    funcBody += `];\n`
    return funcBody;
}

We can generate functions like this:



export function addMatrixFlat2x2(a, b) {
    return [
        a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]];

}

We can dynamically create them with eval like this:



export function genMatAddFlatFunc(rows, cols) {
    rows = Number(rows);
    cols = Number(cols);
    const body = genMatAddFlatBody(rows, cols);
    return new Function("a", "b", body);
}
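
Note that the generated functions take and return bare arrays rather than the { shape, data } objects used by addMatrixFlat. If you want the same interface, a thin wrapper is one way to get it. This is a sketch: `genMatAddFlatObjFunc` is a hypothetical name, the body generator is rewritten as a single loop over the flat index, and the extra call and object allocation mean its timings may not match the table below.

```javascript
// Body generator written as one loop over the flat index.
function genMatAddFlatBody(rows, cols) {
    const count = rows * cols;
    let funcBody = "return [\n";
    for (let i = 0; i < count; i++) {
        funcBody += `a[${i}] + b[${i}]${i < count - 1 ? ", " : ""}`;
    }
    funcBody += "];\n";
    return funcBody;
}

// Hypothetical wrapper: adapts the generated function to the { shape, data } format.
function genMatAddFlatObjFunc(rows, cols) {
    rows = Number(rows);
    cols = Number(cols);
    const addData = new Function("a", "b", genMatAddFlatBody(rows, cols));
    return (a, b) => ({ shape: a.shape, data: addData(a.data, b.data) });
}

const add2x2 = genMatAddFlatObjFunc(2, 2);
const result = add2x2(
    { shape: [2, 2], data: [4, 7, 10, 5] },
    { shape: [2, 2], data: [1, 1, 1, 1] }
);
console.log(result); // { shape: [2, 2], data: [5, 8, 11, 6] }
```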

Name	min	max	avg	p75	p99	p995
Add 1x1 (flat unrolled)	6ns	53ns	7ns	7ns	19ns	22ns
Add 2x2 (flat unrolled)	7ns	62ns	8ns	8ns	21ns	23ns
Add 4x4 (flat unrolled)	24ns	136ns	37ns	41ns	84ns	93ns
Add 8x8 (flat unrolled)	61ns	185ns	81ns	86ns	131ns	144ns
Add 16x16 (flat unrolled)	300ns	564700ns	508ns	400ns	1000ns	6100ns
Add 32x32 (flat unrolled)	63600ns	826700ns	74574ns	75200ns	133000ns	162600ns
Add 64x64 (flat unrolled)	263500ns	788800ns	286503ns	280600ns	502900ns	528900ns
Add 100x100 (flat unrolled)	706400ns	1760300ns	764369ns	758900ns	1102800ns	1118900ns

It just edges out the simple loop at 1x1 and 2x2; beyond that it loses, and it gets much, much worse at large sizes.

Typed arrays

So the next area of possible optimization I can see is to actually use types. We can do this in JavaScript using typed arrays. This will allow us to allocate a block of memory and reduce the overhead of any array structure. There is actually a more important reason, though: by using typed arrays we can reduce conversions. APIs like WASM, WebGL and WebGPU deal with blocks of memory, and the less we have to convert the faster we're going to wind up. So I think even if it turns out this is a little bit slower, there are still good reasons to pursue it. We do end up with different paths, though, one for floats and one for integers, and even then it might matter which bit-width we choose. Also, since we've already shown that flat structures perform better overall we don't need to consider nested arrays. For brevity I'm not going to test all typed array combos as we'll start to see a general pattern.
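
The benchmark code for the typed-array variants isn't shown, so the following is only a guess at what such a function could look like, based on the single flat loop from earlier. `addMatrixTyped` is a hypothetical name, and it assumes both inputs use the same kind of typed array.

```javascript
// Sketch of a typed-array variant of the single flat loop.
// Assumes a.data and b.data are the same kind of typed array and the same length.
function addMatrixTyped(a, b) {
    const out = {
        shape: a.shape,
        data: new a.data.constructor(a.data.length) // same element type as the input
    };
    for (let i = 0; i < a.data.length; i++) {
        out.data[i] = a.data[i] + b.data[i];
    }
    return out;
}

const a = { shape: [2, 2], data: new Float64Array([4, 7, 10, 5]) };
const b = { shape: [2, 2], data: new Float64Array([1, 2, 3, 4]) };
console.log(addMatrixTyped(a, b).data); // Float64Array [5, 9, 13, 9]

// Integer arrays wrap on overflow instead of switching to a larger representation.
const big = { shape: [1, 1], data: new Int32Array([2147483647]) };
const one = { shape: [1, 1], data: new Int32Array([1]) };
console.log(addMatrixTyped(big, one).data[0]); // -2147483648
```

The wrap-around in the last example is one reason the integer and float paths are not interchangeable.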

Float 64

Name	min	max	avg	p75	p99	p995
Add 1x1 (F64)	330ns	1600ns	400ns	397ns	663ns	1600ns
Add 2x2 (F64)	329ns	598ns	393ns	409ns	493ns	598ns
Add 4x4 (F64)	393ns	1786ns	490ns	503ns	662ns	1786ns
Add 8x8 (F64)	490ns	778ns	621ns	664ns	778ns	778ns
Add 16x16 (F64)	1024ns	5425ns	1311ns	1334ns	5425ns	5425ns
Add 32x32 (F64)	3346ns	4707ns	3772ns	4115ns	4707ns	4707ns
Add 64x64 (F64)	8000ns	2309700ns	14203ns	12700ns	35300ns	44800ns
Add 100x100 (F64)	23200ns	3328400ns	35026ns	33300ns	82400ns	231000ns

JavaScript numbers are float 64s, so it's really surprising that these are slower than a normal JavaScript array. For small matrices they are actually slower than array.map. I'm guessing this has something to do with how the engine treats them. As the matrices get larger these get relatively faster, but even at 100x100 they are still quite a bit slower than a normal flat array.

Float 32

Name	min	max	avg	p75	p99	p995
Add 1x1 (F32)	324ns	554ns	380ns	391ns	506ns	554ns
Add 2x2 (F32)	324ns	594ns	391ns	408ns	520ns	594ns
Add 4x4 (F32)	396ns	658ns	463ns	489ns	569ns	658ns
Add 8x8 (F32)	508ns	822ns	620ns	673ns	822ns	822ns
Add 16x16 (F32)	1148ns	1784ns	1345ns	1422ns	1784ns	1784ns
Add 32x32 (F32)	3258ns	3840ns	3344ns	3337ns	3840ns	3840ns
Add 64x64 (F32)	10500ns	1101800ns	18473ns	21600ns	66500ns	101200ns
Add 100x100 (F32)	25800ns	1797500ns	37062ns	35800ns	99800ns	245400ns

Float32 arrays have the same problem as Float64 arrays. The performance is nearly identical despite the elements being smaller, so for pure speed there's no point in choosing them. In fact at 64x64 and 100x100 the F64 arrays are a bit faster. The only benefit we get is halving our memory, which might be a reason to choose these.
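
The memory difference is easy to confirm with `byteLength`. A small sketch for a 100x100 matrix, which also shows the precision trade-off that comes with the smaller type:

```javascript
const elements = 100 * 100;
console.log(new Float64Array(elements).byteLength); // 80000 (8 bytes per element)
console.log(new Float32Array(elements).byteLength); // 40000 (4 bytes per element)

// The trade-off is precision: values are rounded to the nearest 32-bit float.
console.log(new Float32Array([0.1])[0] === 0.1); // false
```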

Int 32

Name	min	max	avg	p75	p99	p995
Add 1x1 (I32)	321ns	1015ns	390ns	398ns	704ns	1015ns
Add 2x2 (I32)	324ns	570ns	390ns	403ns	501ns	570ns
Add 4x4 (I32)	372ns	530ns	426ns	443ns	488ns	530ns
Add 8x8 (I32)	455ns	621ns	539ns	575ns	616ns	621ns
Add 16x16 (I32)	784ns	1202ns	913ns	966ns	1202ns	1202ns
Add 32x32 (I32)	2111ns	2704ns	2182ns	2182ns	2704ns	2704ns
Add 64x64 (I32)	8742ns	9569ns	9138ns	9305ns	9569ns	9569ns
Add 100x100 (I32)	12600ns	2578300ns	22470ns	21600ns	50300ns	72200ns

I32s again behave similarly at small sizes, but we start to see much bigger gains on larger matrices. In fact at 100x100 the I32 matrix is about equal to a flat matrix. Not amazing, but if you are dealing with large integer matrices this is probably your best choice.

Conclusion

For simple, single-threaded javascript we've observed a few things (in Deno/V8 @ 2023-03-31):

Loops will mostly perform better than .map, except at very large sizes and only with nested arrays (I tried a flat array and the difference wasn't notable enough to copy-paste the data).
Bespoke unrolled functions work well on very small sizes (4x4 or less) but don't beat a simple loop, and they fall off very, very quickly.
Reducing structure makes a big difference.
Pre-allocating arrays makes a huge difference; always do this if you can.
Typed arrays provide no speed advantage (but we might get less conversion overhead and some space savings).

There are more ways we can deal with matrices. I'd like to maybe look at what WASM and WebGPU look like, with their high overhead but potentially massive speed increases in actual calculation due to parallelism. Web Workers as well. Also, different ops can vary widely. Matrix multiplication uses the left-hand and right-hand structures differently and might require some different strategies. But I think the biggest take-away is:

Your best bet for a generalized element-wise matrix op is a single flat loop over a normal JS array, as it's fast and scales well.

The Data

Name	min	max	avg	p75	p99	p995
Add 1x1 (Func)	63ns	180ns	70ns	74ns	113ns	124ns
Add 1x1 (Loop)	28ns	210ns	46ns	47ns	142ns	168ns
Add 1x1 (Loop Prealloc)	13ns	137ns	18ns	20ns	56ns	73ns
Add 1x1 (unrolled)	7ns	34ns	8ns	8ns	19ns	20ns
Add 1x1 (unrolled dynamic)	7ns	40ns	8ns	7ns	19ns	20ns
Add 1x1 (flat)	9ns	53ns	10ns	10ns	24ns	29ns
Add 1x1 (flat col major)	9ns	41ns	10ns	9ns	21ns	22ns
Add 1x1 (flat simple)	7ns	46ns	8ns	8ns	18ns	20ns
Add 1x1 (flat unrolled)	6ns	53ns	7ns	7ns	19ns	22ns
Add 1x1 (F64)	330ns	1600ns	400ns	397ns	663ns	1600ns
Add 1x1 (F32)	324ns	554ns	380ns	391ns	506ns	554ns
Add 1x1 (I32)	321ns	1015ns	390ns	398ns	704ns	1015ns
Add 2x2 (Func)	144ns	208ns	152ns	158ns	184ns	196ns
Add 2x2 (Loop)	55ns	163ns	71ns	76ns	125ns	143ns
Add 2x2 (Loop Prealloc)	25ns	65ns	28ns	27ns	45ns	53ns
Add 2x2 (unrolled)	11ns	46ns	13ns	12ns	26ns	29ns
Add 2x2 (unrolled dynamic)	11ns	39ns	12ns	12ns	27ns	29ns
Add 2x2 (flat)	14ns	49ns	15ns	15ns	29ns	30ns
Add 2x2 (flat col major)	14ns	41ns	15ns	14ns	29ns	32ns
Add 2x2 (flat simple)	9ns	54ns	10ns	10ns	23ns	26ns
Add 2x2 (flat unrolled)	7ns	62ns	8ns	8ns	21ns	23ns
Add 2x2 (F64)	329ns	598ns	393ns	409ns	493ns	598ns
Add 2x2 (F32)	324ns	594ns	391ns	408ns	520ns	594ns
Add 2x2 (I32)	324ns	570ns	390ns	403ns	501ns	570ns
Add 4x4 (Func)	312ns	373ns	329ns	335ns	370ns	373ns
Add 4x4 (Loop)	122ns	227ns	143ns	151ns	195ns	225ns
Add 4x4 (Loop Prealloc)	61ns	152ns	73ns	78ns	124ns	129ns
Add 4x4 (unrolled)	36ns	159ns	59ns	72ns	124ns	130ns
Add 4x4 (unrolled dynamic)	36ns	236ns	67ns	84ns	156ns	181ns
Add 4x4 (flat)	32ns	107ns	40ns	46ns	86ns	94ns
Add 4x4 (flat col major)	32ns	79ns	37ns	37ns	61ns	67ns
Add 4x4 (flat simple)	18ns	77ns	24ns	28ns	51ns	56ns
Add 4x4 (flat unrolled)	24ns	136ns	37ns	41ns	84ns	93ns
Add 4x4 (F64)	393ns	1786ns	490ns	503ns	662ns	1786ns
Add 4x4 (F32)	396ns	658ns	463ns	489ns	569ns	658ns
Add 4x4 (I32)	372ns	530ns	426ns	443ns	488ns	530ns
Add 8x8 (Func)	694ns	930ns	724ns	731ns	930ns	930ns
Add 8x8 (Loop)	360ns	807ns	411ns	422ns	744ns	807ns
Add 8x8 (Loop Prealloc)	203ns	444ns	228ns	232ns	348ns	434ns
Add 8x8 (unrolled)	92ns	243ns	130ns	142ns	235ns	242ns
Add 8x8 (unrolled dynamic)	89ns	262ns	113ns	119ns	186ns	209ns
Add 8x8 (flat)	97ns	167ns	110ns	113ns	143ns	157ns
Add 8x8 (flat col major)	101ns	156ns	114ns	116ns	147ns	153ns
Add 8x8 (flat simple)	55ns	159ns	73ns	78ns	125ns	136ns
Add 8x8 (flat unrolled)	61ns	185ns	81ns	86ns	131ns	144ns
Add 8x8 (F64)	490ns	778ns	621ns	664ns	778ns	778ns
Add 8x8 (F32)	508ns	822ns	620ns	673ns	822ns	822ns
Add 8x8 (I32)	455ns	621ns	539ns	575ns	616ns	621ns
Add 16x16 (Func)	1798ns	1942ns	1836ns	1843ns	1942ns	1942ns
Add 16x16 (Loop)	1179ns	1246ns	1208ns	1217ns	1246ns	1246ns
Add 16x16 (Loop Prealloc)	710ns	942ns	762ns	768ns	942ns	942ns
Add 16x16 (unrolled)	500ns	672800ns	734ns	600ns	3400ns	10500ns
Add 16x16 (unrolled dynamic)	500ns	2052000ns	799ns	600ns	6400ns	10600ns
Add 16x16 (flat)	400ns	548ns	436ns	447ns	517ns	548ns
Add 16x16 (flat col major)	423ns	532ns	453ns	465ns	513ns	532ns
Add 16x16 (flat simple)	276ns	405ns	315ns	335ns	393ns	405ns
Add 16x16 (flat unrolled)	300ns	564700ns	508ns	400ns	1000ns	6100ns
Add 16x16 (F64)	1024ns	5425ns	1311ns	1334ns	5425ns	5425ns
Add 16x16 (F32)	1148ns	1784ns	1345ns	1422ns	1784ns	1784ns
Add 16x16 (I32)	784ns	1202ns	913ns	966ns	1202ns	1202ns
Add 32x32 (Func)	5274ns	6599ns	5495ns	5605ns	6599ns	6599ns
Add 32x32 (Loop)	5031ns	5216ns	5090ns	5105ns	5216ns	5216ns
Add 32x32 (Loop Prealloc)	2648ns	2769ns	2700ns	2716ns	2769ns	2769ns
Add 32x32 (unrolled)	73800ns	562500ns	83976ns	85200ns	136400ns	160600ns
Add 32x32 (unrolled dynamic)	73000ns	908200ns	90772ns	90900ns	137900ns	162600ns
Add 32x32 (flat)	1985ns	2900ns	2222ns	2276ns	2900ns	2900ns
Add 32x32 (flat col major)	2047ns	3228ns	2199ns	2258ns	3228ns	3228ns
Add 32x32 (flat simple)	1387ns	1682ns	1490ns	1547ns	1682ns	1682ns
Add 32x32 (flat unrolled)	63600ns	826700ns	74574ns	75200ns	133000ns	162600ns
Add 32x32 (F64)	3346ns	4707ns	3772ns	4115ns	4707ns	4707ns
Add 32x32 (F32)	3258ns	3840ns	3344ns	3337ns	3840ns	3840ns
Add 32x32 (I32)	2111ns	2704ns	2182ns	2182ns	2704ns	2704ns
Add 64x64 (Func)	13000ns	2331200ns	17451ns	16300ns	41900ns	60700ns
Add 64x64 (Loop)	14300ns	362400ns	20651ns	19200ns	52900ns	110500ns
Add 64x64 (Loop Prealloc)	9500ns	372100ns	10926ns	10100ns	25000ns	35800ns
Add 64x64 (unrolled)	328700ns	737300ns	350104ns	343900ns	574500ns	587000ns
Add 64x64 (unrolled dynamic)	327600ns	698800ns	349201ns	345400ns	573900ns	592400ns
Add 64x64 (flat)	8512ns	10514ns	8775ns	8715ns	10514ns	10514ns
Add 64x64 (flat col major)	7500ns	413800ns	10417ns	10200ns	26200ns	37000ns
Add 64x64 (flat simple)	6381ns	7219ns	6602ns	6675ns	7219ns	7219ns
Add 64x64 (flat unrolled)	263500ns	788800ns	286503ns	280600ns	502900ns	528900ns
Add 64x64 (F64)	8000ns	2309700ns	14203ns	12700ns	35300ns	44800ns
Add 64x64 (F32)	10500ns	1101800ns	18473ns	21600ns	66500ns	101200ns
Add 64x64 (I32)	8742ns	9569ns	9138ns	9305ns	9569ns	9569ns
Add 100x100 (Func)	30800ns	512800ns	40269ns	38200ns	105700ns	218300ns
Add 100x100 (Loop)	38200ns	425400ns	54401ns	54100ns	227700ns	256300ns
Add 100x100 (Loop Prealloc)	24500ns	515800ns	28392ns	26300ns	62100ns	204400ns
Add 100x100 (unrolled)	829600ns	1250900ns	876580ns	873700ns	1143900ns	1157500ns
Add 100x100 (unrolled dynamic)	816900ns	1416300ns	891844ns	894500ns	1227700ns	1288200ns
Add 100x100 (flat)	15500ns	701100ns	23261ns	21800ns	54200ns	194800ns
Add 100x100 (flat col major)	19800ns	575300ns	25090ns	23500ns	63000ns	198500ns
Add 100x100 (flat simple)	9000ns	598000ns	17166ns	15700ns	49400ns	178400ns
Add 100x100 (flat unrolled)	706400ns	1760300ns	764369ns	758900ns	1102800ns	1118900ns
Add 100x100 (F64)	23200ns	3328400ns	35026ns	33300ns	82400ns	231000ns
Add 100x100 (F32)	25800ns	1797500ns	37062ns	35800ns	99800ns	245400ns
Add 100x100 (I32)	12600ns	2578300ns	22470ns	21600ns	50300ns	72200ns