ndesmic

# Fast Matrix Math 2

Last time we looked at a number of ways to add two matrices, starting from naive and applying simplifications until we basically just had a flat loop over the data, which turned out to be the fastest, beating out even TypedArrays. In some ways I regret choosing a case as simple as element-wise addition, because it reduced to testing the speed of flat loops; matrix multiplication might have been a better choice. Oh well, maybe next time.

There were some shortcomings: because of the sheer number of permutations I couldn't get to everything, so we'll start off by really looking at byte representations as a source of speed. In addition, I've swapped the 100x100 test for a 128x128 one and added a final 256x256 to really bring out differences with "big" matrices.

This actually changes some strategies, as things like the simple flat loop get a lot worse than the preallocated loops, and typed arrays really pull ahead. My guess is that the compiler doesn't trust allocating large JS arrays.

## Simple Typed Array Loops

I neglected to test typed arrays with flat loops, since we know we don't need the structured row-column loops we had before. There's not much to these:

``````
export function addMatrixFloat64Flat(a, b) {
  const out = {
    shape: a.shape,
    data: new Float64Array(a.data.length)
  };
  for (let i = 0; i < a.data.length; i++) {
    out.data[i] = a.data[i] + b.data[i];
  }
  return out;
}
``````

| size | F64 (avg ns) | F64 flat (avg ns) | F32 (avg ns) | F32 flat (avg ns) | I32 (avg ns) | I32 flat (avg ns) |
| --- | --- | --- | --- | --- | --- | --- |
| 1x1 | 353.9186842105266 | 364.64897959183673 | 348.8261688311688 | 348.4743506493506 | 350.96366013071895 | 352.73506578947377 |
| 2x2 | 356.2128476821191 | 339.31120253164556 | 362.29460000000006 | 343.3510897435899 | 348.14629870129863 | 342.82884615384626 |
| 4x4 | 380.7755319148939 | 365.68258503401336 | 370.45848275862056 | 366.86794520547926 | 357.1944666666666 | 352.2581699346405 |
| 8x8 | 477.342155172414 | 671.9946511627907 | 479.1566086956522 | 435.98608000000013 | 424.4869531249997 | 408.53601503759376 |
| 16x16 | 868.6750000000002 | 777.2616 | 879.089552238806 | 738.4939743589744 | 674.5704761904763 | 662.666627906977 |
| 32x32 | 2606.12 | 2246.1545454545458 | 2592.8500000000004 | 1986.655277777778 | 1733.6833333333332 | 1722.907 |
| 64x64 | 9949 | 12755 | 9609.210000000001 | 8915 | 6712.572222222221 | 6916.692352941176 |
| 128x128 | 54327 | 40977 | 42542 | 33077 | 29148 | 29749 |
| 256x256 | 156667 | 148573 | 136578 | 116117 | 101296 | 105838 |

As expected the flat versions are mostly a tad faster. Strangely, the exception is I32, which at the larger sizes (64x64 and up) is slightly faster with the inner loop bound checks. Weird.

## Fixing Tests for Float32 Arrays

One issue that came up, and that I simply ignored last time, was that I couldn't write tests for Float32Arrays because of the precision differences between them and 64-bit floats. It turns out there's a function in the JavaScript `Math` library called `Math.fround` that rounds a number to the nearest 32-bit float. While generating data I can `fround` the arguments and `fround` the result to keep the expected values accurate for 32-bit floats.
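As a rough sketch of how that can look (the helper names `randomF32` and `addF32Expected` are made up for illustration and are not taken from the repo), the expected values are rounded the same way a `Float32Array` would round them when storing:

```javascript
// Hypothetical test-data helpers; the names are illustrative only.

// Round a random value to the nearest 32-bit float so it survives
// a round trip through a Float32Array unchanged.
function randomF32() {
  return Math.fround(Math.random() * 100);
}

// Expected element-wise sum, rounded the way a Float32Array stores it.
function addF32Expected(a, b) {
  return a.map((value, i) => Math.fround(value + b[i]));
}

const a = Array.from({ length: 4 }, randomF32);
const b = Array.from({ length: 4 }, randomF32);
const expected = addF32Expected(a, b);

// What a typed array version actually produces.
const fa = new Float32Array(a);
const fb = new Float32Array(b);
const actual = new Float32Array(4);
for (let i = 0; i < actual.length; i++) {
  actual[i] = fa[i] + fb[i];
}

console.log(expected.every((value, i) => value === actual[i])); // true
console.log(Math.fround(0.1) === 0.1); // false, 0.1 is not exactly representable in 32 bits
```

Without the `fround` calls the expected values would keep 64-bit precision and a strict equality check against the `Float32Array` result would fail for most inputs.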

## WASM

Another way we can look to speed things up is WASM. WASM is often described as a math co-processor for the web. At least, people think that if you move your code to WASM you'll see a big speed-up. This, however, is much more complicated than we might think. The overhead of calling WASM is high because memory has to be copied between the host and the WASM module. This means that we might not see big gains, but hopefully the low internal overhead helps speed things up.

I built a module in pure WAT, the WebAssembly Text Format. Normally we wouldn't do this and would instead compile a language like Rust or C to WASM, but compilation adds a lot of complexity: weird stuff might make it into the module and change our results, and it's hard to optimize. To keep it as close to the metal as possible, WAT is the best option. Thankfully what we are doing is not complicated and can reasonably be written in WAT.

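``````
;;mat.wat
(module
  ;; the export name is an assumption; use whatever name the host code calls
  (func $matrix_add_f64 (export "matrix_add_f64") (param $lhs_ptr i32) (param $rhs_ptr i32) (param $length i32) (param $out_ptr i32)
    (local $i i32)
    (local $offset i32)
    (local.set $i (i32.const 0))
    (block
      (loop
        ;; byte offset of element $i (8 bytes per f64)
        (local.set $offset (i32.mul (local.get $i) (i32.const 8)))

        ;; out[i] = lhs[i] + rhs[i]
        (f64.store
          (i32.add (local.get $out_ptr) (local.get $offset))
          (f64.add
            (f64.load (i32.add (local.get $lhs_ptr) (local.get $offset)))
            (f64.load (i32.add (local.get $rhs_ptr) (local.get $offset)))
          )
        )

        (local.set $i (i32.add (local.get $i) (i32.const 1)))
        (br_if 1 (i32.eq (local.get $i) (local.get $length))) ;; leave the block when done
        (br 0) ;; otherwise go around again
      )
    )
  )
  (memory (export "memory") 1)
)
``````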

I'm not going to explain this much as it's out of scope, but it's basically the same as the flat simple version. It treats the data as a single array and iterates over it while adding. The parameters are the pointer to the left-hand array, the pointer to the right-hand array, the total length, and where to store the output. Technically we could have passed just the length, since if everything starts at 0 we know exactly where each array is, but to start with I wanted it to be explicit.

### Compiling WAT

To actually compile this we can use a tool called WABT (the WebAssembly Binary Toolkit). Building it is basically a mess that requires CMake; I couldn't get it to run on WSL and I wasn't going to install MinGW. Instead there's a nice tool called WAPM from Wasmer, which works like npm for WebAssembly packages, and since WABT has itself been compiled down to WebAssembly we can run it in any environment. In fact we don't even need to add configuration so long as wapm is installed. We can run `wax wat2wasm -- wat/mat.wat -o wasm/mat.wasm`. `wax` is to wapm what npx is to npm. If you're wondering, the command we give `wax` is defined by the `wasmer/wabt` package: https://wapm.io/wasmer/wabt. Also, for some reason you can't prefix local paths with `./`, so `wax wat2wasm -- ./wat/mat.wat` doesn't work, which took me a while to figure out. Anyway, this provides a nice simple compile environment if you want to work on raw WAT files.

This only works while all three arrays fit in the single 64KiB page we declared. Three 64x64 f64 arrays already need 96KiB, so at the larger sizes we get an out-of-bounds access. The memory we allocated simply can't fit all the data we need! We can however grow the memory in response to larger inputs.

With page growing:

``````
export function addMatrixWasmF64(a, b) {
  const lhsElementOffset = 0;
  const lhsByteOffset = lhsElementOffset * 8;
  const rhsElementOffset = lhsElementOffset + a.data.length;
  const rhsByteOffset = rhsElementOffset * 8;
  const resultElementOffset = lhsElementOffset + a.data.length + b.data.length;
  const resultByteOffset = resultElementOffset * 8;
  const elementLength = a.data.length;

  //grow memory if needed
  const spaceLeftover = matInstance.exports.memory.buffer.byteLength - (a.data.length * 3 * 8);
  if (spaceLeftover < 0) {
    const pagesNeeded = Math.ceil((a.data.length * 3 * 8) / (64 * 1024));
    const pagesHave = matInstance.exports.memory.buffer.byteLength / (64 * 1024);
    matInstance.exports.memory.grow(pagesNeeded - pagesHave);
  }

  //create the view after growing, since grow() detaches the old buffer
  const f64View = new Float64Array(matInstance.exports.memory.buffer);
  f64View.set(a.data, lhsElementOffset);
  f64View.set(b.data, rhsElementOffset);

  //run the add in WASM (assumes the module exports the function as matrix_add_f64)
  matInstance.exports.matrix_add_f64(lhsByteOffset, rhsByteOffset, elementLength, resultByteOffset);

  return {
    shape: a.shape,
    data: f64View.slice(resultElementOffset, resultElementOffset + elementLength)
  };
}
``````

This might be how we'd do a production version so things don't explode, but for performance reasons I'm going to leave that out and make the memory static. We'll only go up to 256x256 matrices, so we need a total of 24 pages allocated (3 arrays × 65,536 elements × 8 bytes, at 64KiB per page).

``````
(memory (export "memory") 24)
``````
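For reference, here is where the 24 comes from, as a small sketch. The `pagesNeeded` helper and the standalone `WebAssembly.Memory` are stand-ins for illustration (the real code uses `matInstance.exports.memory`). It also shows why any typed array view has to be created after growing:

```javascript
const PAGE_SIZE = 64 * 1024; // a WebAssembly page is 64KiB

// Pages needed to hold lhs, rhs and the result back to back.
// Hypothetical helper, not part of the repo.
function pagesNeeded(elementCount, bytesPerElement) {
  return Math.ceil((elementCount * 3 * bytesPerElement) / PAGE_SIZE);
}

console.log(pagesNeeded(256 * 256, Float64Array.BYTES_PER_ELEMENT)); // 24
console.log(pagesNeeded(256 * 256, Float32Array.BYTES_PER_ELEMENT)); // 12

// A standalone memory standing in for matInstance.exports.memory.
const memory = new WebAssembly.Memory({ initial: 1 });
const staleView = new Float64Array(memory.buffer);

// grow() takes a number of pages to add, not a target size.
memory.grow(pagesNeeded(256 * 256, 8) - 1);

console.log(memory.buffer.byteLength / PAGE_SIZE); // 24
console.log(staleView.length); // 0, the old buffer was detached by grow()
```

The 32-bit variants only need half of that, so a static 24 pages leaves plenty of headroom for them.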

The benchmark results are a tad disappointing. As expected, it starts out at the bottom of the pack, since loading data into the WASM module is expensive overhead. It slowly climbs the ladder, but even at 128x128 it's still slower than a simple preallocated loop. Not having to allocate makes a big difference. This means that unless more of the data can be kept on the WASM side, it's unlikely WASM will provide much of a benefit. Still, WASM has one more trick.
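To make the copying concrete, here is a small sketch using a standalone `WebAssembly.Memory` as a stand-in for the module's exported memory (an assumption for illustration; no WASM function is actually called here). Every call copies the inputs in with `set` and copies the result out with `slice`:

```javascript
// Standalone memory standing in for matInstance.exports.memory.
const memory = new WebAssembly.Memory({ initial: 1 });
const view = new Float64Array(memory.buffer);

const input = [1, 2, 3, 4];
view.set(input, 0); // copy: host array -> WASM memory

const copied = view.slice(0, input.length);     // copy: WASM memory -> new host array
const aliased = view.subarray(0, input.length); // no copy: a window onto WASM memory

view[0] = 42; // pretend a later call overwrote this part of memory

console.log(copied[0]);  // 1, the slice is independent of WASM memory
console.log(aliased[0]); // 42, the subarray still points into WASM memory
```

Returning a `subarray` would skip one of the copies, but the result would then be clobbered by the next call, so it only helps if the caller is happy to leave the data on the WASM side.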

## SIMD

Unlike JavaScript, WASM has access to SIMD instructions, which can be used to parallelize the iteration, and these are pretty well supported. SIMD stands for Single Instruction, Multiple Data. The idea is that it lets us add, load, etc. more than one value at a time, provided everything lines up correctly. WASM, at least at the time of writing, only has a 128-bit vector type. The way it works is that we load a 128-bit value and at the operation level choose how to interpret it, which in our case will be as 2 64-bit floats. When we increment `$i` we now do so by 2.

It's actually not too hard to modify the existing code. Instead of stepping by 8 bytes, we step by 16 bytes, loading 128-bit values instead of 64-bit ones. Then we use `f64x2.add` to add both pairs of floats at the same time and store the result back to memory as a 128-bit value.

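``````
;; the export name is an assumption; use whatever name the host code calls
(func $matrix_add_simd_f64 (export "matrix_add_simd_f64") (param $lhs_ptr i32) (param $rhs_ptr i32) (param $length i32) (param $out_ptr i32)
  (local $i i32)
  (local $offset i32)
  (local.set $i (i32.const 0))
  (block
    (loop
      ;; $i steps by 2, so the byte offset moves in 128-bit (16-byte) blocks
      (local.set $offset (i32.mul (local.get $i) (i32.const 8)))
      ;; out[i..i+1] = lhs[i..i+1] + rhs[i..i+1]
      (v128.store
        (i32.add (local.get $out_ptr) (local.get $offset))
        (f64x2.add
          (v128.load (i32.add (local.get $lhs_ptr) (local.get $offset)))
          (v128.load (i32.add (local.get $rhs_ptr) (local.get $offset)))
        )
      )
      (local.set $i (i32.add (local.get $i) (i32.const 2)))
      (br_if 1 (i32.ge_u (local.get $i) (local.get $length)))
      (br 0)
    )
  )
)
``````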

Another thing to think about is what happens if we don't have an even number of floats to add. We need one modification, which is to change the branch condition to greater-than-or-equal (`i32.ge_u`) rather than equal (`i32.eq`) to account for possibly overrunning by one. But let's also think about the memory bounds. If we have just one f64, the second lane will load garbage from outside our array, but we never write (`v128.store`) in a way that bumps into the next segment (assuming the pointers don't overlap), since only the final array at `$out_ptr` is ever written to. Since we only read back the exact number of elements we want, the odd value is just ignored. We also can't read outside the bounds of allocated memory, because memory is always allocated in whole pages.
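The odd-length case is easier to see in a plain JavaScript emulation. This is not real SIMD, just the same two-at-a-time indexing, and `addPairs` is a made-up name that takes element offsets rather than byte pointers:

```javascript
// Emulates the f64x2 loop: each step handles two lanes, like one
// v128 load/add/store. Offsets are in elements, not bytes.
function addPairs(memory, lhs, rhs, out, length) {
  for (let i = 0; i < length; i += 2) {
    const lane0 = memory[lhs + i] + memory[rhs + i];
    const lane1 = memory[lhs + i + 1] + memory[rhs + i + 1];
    memory[out + i] = lane0;
    memory[out + i + 1] = lane1;
  }
}

const length = 3; // odd on purpose
const memory = new Float64Array(16); // room to spare, like page-sized WASM memory
memory.set([1, 2, 3], 0);            // lhs lives at elements 0..2
memory.set([10, 20, 30], length);    // rhs lives at elements 3..5
addPairs(memory, 0, length, length * 2, length); // out lives at elements 6..8

// Only read back exactly `length` elements.
console.log(Array.from(memory.slice(6, 6 + length))); // [11, 22, 33]

// The extra lane wrote garbage just past the output, where nothing reads it.
console.log(memory[9]); // 21 (rhs[0] + out[0], read from the neighbouring segments)

// The inputs are untouched because only the output segment is written.
console.log(Array.from(memory.slice(0, 6))); // [1, 2, 3, 10, 20, 30]
```

The garbage lane lands one element past the output array, which is harmless as long as the output is the last segment and the memory has some slack after it.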

I had to do a little debugging to get this right. A simple way is to import little functions for printing values:

``````
const { instance: matInstance } = await WebAssembly.instantiate(wasm, {
  lib: {
    print_int: function(num){
      console.log(num);
    },
    print_float: function(num){
      console.log(num);
    },
    print_brk: function(){
      console.log("---")
    }
  }
});
``````

``````
(import "lib" "print_int" (func $print_int (param i32)))
(import "lib" "print_float" (func $print_float (param f32)))
(import "lib" "print_brk" (func $print_brk))

;;---stuff

(call $print_int (local.get $offset)) ;;print offset
(call $print_int (local.get $i)) ;;print $i
(call $print_brk) ;;add a visual barrier between element logs
``````

We claw our way up the ladder a bit further, but we're still not beating a preallocated loop at 256x256. So it's still not very usable.

## SIMD F32 and I32

We can still mess with the precision. 32-bit values would give us an advantage of being able to add 4 at a time. We also know that integers generally work a bit faster.
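The lane arithmetic is simple enough to write down. A quick sketch (the helper names `lanesPerVector` and `vectorSteps` are illustrative, not from the repo):

```javascript
const VECTOR_BYTES = 16; // a v128 is 128 bits

// How many elements of a given typed array fit in one 128-bit vector.
function lanesPerVector(TypedArray) {
  return VECTOR_BYTES / TypedArray.BYTES_PER_ELEMENT;
}

// How many loop iterations it takes to cover elementCount elements.
function vectorSteps(elementCount, TypedArray) {
  return Math.ceil(elementCount / lanesPerVector(TypedArray));
}

console.log(lanesPerVector(Float64Array)); // 2
console.log(lanesPerVector(Float32Array)); // 4
console.log(lanesPerVector(Int32Array));   // 4

console.log(vectorSteps(256 * 256, Float64Array)); // 32768
console.log(vectorSteps(256 * 256, Float32Array)); // 16384
```

So for the same matrix the 32-bit versions run half as many loop iterations and copy half as many bytes in and out.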

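``````
;; export names are assumptions; $length is taken to be the element count
(func $matrix_add_simd_f32 (export "matrix_add_simd_f32") (param $lhs_ptr i32) (param $rhs_ptr i32) (param $length i32) (param $out_ptr i32)
  (local $i i32)
  (local $offset i32)
  (local.set $i (i32.const 0))
  (block
    (loop
      ;; 4 bytes per element and $i steps by 4, so still moving in 128-bit blocks
      (local.set $offset (i32.mul (local.get $i) (i32.const 4)))
      (v128.store
        (i32.add (local.get $out_ptr) (local.get $offset))
        (f32x4.add
          (v128.load (i32.add (local.get $lhs_ptr) (local.get $offset)))
          (v128.load (i32.add (local.get $rhs_ptr) (local.get $offset)))
        )
      )
      (local.set $i (i32.add (local.get $i) (i32.const 4)))
      (br_if 1 (i32.ge_u (local.get $i) (local.get $length)))
      (br 0)
    )
  )
)
(func $matrix_add_simd_i32 (export "matrix_add_simd_i32") (param $lhs_ptr i32) (param $rhs_ptr i32) (param $length i32) (param $out_ptr i32)
  (local $i i32)
  (local $offset i32)
  (local.set $i (i32.const 0))
  (block
    (loop
      ;; 4 bytes per element and $i steps by 4, so still moving in 128-bit blocks
      (local.set $offset (i32.mul (local.get $i) (i32.const 4)))
      (v128.store
        (i32.add (local.get $out_ptr) (local.get $offset))
        (i32x4.add
          (v128.load (i32.add (local.get $lhs_ptr) (local.get $offset)))
          (v128.load (i32.add (local.get $rhs_ptr) (local.get $offset)))
        )
      )
      (local.set $i (i32.add (local.get $i) (i32.const 4)))
      (br_if 1 (i32.ge_u (local.get $i) (local.get $length)))
      (br 0)
    )
  )
)
``````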
``````
export function addMatrixWasmSimdF32(a, b) {
  const lhsElementOffset = 0;
  const lhsByteOffset = lhsElementOffset * 4;
  const rhsElementOffset = lhsElementOffset + a.data.length;
  const rhsByteOffset = rhsElementOffset * 4;
  const resultElementOffset = lhsElementOffset + a.data.length + b.data.length;
  const resultByteOffset = resultElementOffset * 4;
  const elementLength = a.data.length;

  //grow memory if needed (4 bytes per element here)
  // const spaceLeftover = matInstance.exports.memory.buffer.byteLength - (a.data.length * 3 * 4)
  // if (spaceLeftover < 0){
  //  const pagesNeeded = Math.ceil((a.data.length * 3 * 4) / (64 * 1024));
  //  const pagesHave = matInstance.exports.memory.buffer.byteLength / (64 * 1024);
  //  matInstance.exports.memory.grow(pagesNeeded - pagesHave);
  // }

  const f32View = new Float32Array(matInstance.exports.memory.buffer);
  f32View.set(a.data, lhsElementOffset);
  f32View.set(b.data, rhsElementOffset);

  //run the add in WASM (assumes the module exports the function as matrix_add_simd_f32)
  matInstance.exports.matrix_add_simd_f32(lhsByteOffset, rhsByteOffset, elementLength, resultByteOffset);

  return {
    shape: a.shape,
    data: f32View.slice(resultElementOffset, resultElementOffset + elementLength)
  };
}

//function name assumed by analogy with addMatrixWasmSimdF32
export function addMatrixWasmSimdI32(a, b) {
  const lhsElementOffset = 0;
  const lhsByteOffset = lhsElementOffset * 4;
  const rhsElementOffset = lhsElementOffset + a.data.length;
  const rhsByteOffset = rhsElementOffset * 4;
  const resultElementOffset = lhsElementOffset + a.data.length + b.data.length;
  const resultByteOffset = resultElementOffset * 4;
  const elementLength = a.data.length;

  //grow memory if needed (4 bytes per element here)
  // const spaceLeftover = matInstance.exports.memory.buffer.byteLength - (a.data.length * 3 * 4)
  // if (spaceLeftover < 0){
  //  const pagesNeeded = Math.ceil((a.data.length * 3 * 4) / (64 * 1024));
  //  const pagesHave = matInstance.exports.memory.buffer.byteLength / (64 * 1024);
  //  matInstance.exports.memory.grow(pagesNeeded - pagesHave);
  // }

  const i32View = new Int32Array(matInstance.exports.memory.buffer);
  i32View.set(a.data, lhsElementOffset);
  i32View.set(b.data, rhsElementOffset);

  //run the add in WASM (assumes the module exports the function as matrix_add_simd_i32)
  matInstance.exports.matrix_add_simd_i32(lhsByteOffset, rhsByteOffset, elementLength, resultByteOffset);

  return {
    shape: a.shape,
    data: i32View.slice(resultElementOffset, resultElementOffset + elementLength)
  };
}
``````

These are just basic substitutions of the original code, but with 4-byte elements. The WASM still moves in 128-bit blocks; it just treats each one as 4 values instead of 2. For arrays larger than 32x32, I32 and F32 are nearly equal and are our new speed kings! Still, it's not really a huge difference versus an `Int32Array` in plain JavaScript (~10%), so it's a fair bit of complexity for the gain.

## Conclusion

What we've observed this time is that WASM by itself does not yield great performance gains until it overcomes the overhead of the memory copy, which in our case happens around the 64x64 mark. Even then, typed arrays can still beat it without introducing that complexity; the compiler seems smart enough to optimize them. However, once we introduce SIMD, WASM can overtake plain JS starting around 64x64, pulling away as sizes get bigger.

## Code

https://github.com/ndesmic/fast-mat/tree/v1.1

## Data

| Name | min | max | avg | p75 | p99 | p995 |
| --- | --- | --- | --- | --- | --- | --- |
| Add 1x1 (Func) | 61ns | 198ns | 70ns | 66ns | 168ns | 171ns |
| Add 1x1 (Loop) | 40ns | 141ns | 57ns | 62ns | 113ns | 119ns |
| Add 1x1 (Loop Prealloc) | 13ns | 56ns | 15ns | 15ns | 35ns | 40ns |
| Add 1x1 (unrolled) | 8ns | 47ns | 8ns | 8ns | 21ns | 22ns |
| Add 1x1 (unrolled dynamic) | 6ns | 37ns | 7ns | 7ns | 19ns | 21ns |
| Add 1x1 (flat) | 8ns | 39ns | 9ns | 9ns | 21ns | 22ns |
| Add 1x1 (flat col major) | 8ns | 36ns | 9ns | 9ns | 21ns | 22ns |
| Add 1x1 (flat simple) | 6ns | 39ns | 7ns | 7ns | 18ns | 19ns |
| Add 1x1 (flat unrolled) | 5ns | 28ns | 6ns | 6ns | 17ns | 18ns |
| Add 1x1 (F64) | 320ns | 441ns | 354ns | 364ns | 406ns | 441ns |
| Add 1x1 (F32) | 304ns | 398ns | 349ns | 360ns | 398ns | 398ns |
| Add 1x1 (I32) | 302ns | 532ns | 351ns | 360ns | 469ns | 532ns |
| Add 1x1 (F64 flat) | 309ns | 696ns | 365ns | 369ns | 555ns | 696ns |
| Add 1x1 (F32 flat) | 299ns | 544ns | 348ns | 359ns | 411ns | 544ns |
| Add 1x1 (I32 flat) | 305ns | 601ns | 353ns | 364ns | 585ns | 601ns |
| Add 1x1 (flat func) | 56ns | 87ns | 59ns | 58ns | 79ns | 85ns |
| Add 1x1 (WASM F64) | 463ns | 818ns | 500ns | 507ns | 546ns | 818ns |
| Add 1x1 (WASM SIMD F64) | 461ns | 780ns | 513ns | 516ns | 751ns | 780ns |
| Add 1x1 (WASM SIMD F32) | 444ns | 555ns | 491ns | 503ns | 551ns | 555ns |
| Add 1x1 (WASM SIMD I32) | 441ns | 734ns | 499ns | 513ns | 644ns | 734ns |
| Add 2x2 (Func) | 128ns | 184ns | 136ns | 139ns | 168ns | 173ns |
| Add 2x2 (Loop) | 59ns | 168ns | 72ns | 76ns | 115ns | 128ns |
| Add 2x2 (Loop Prealloc) | 23ns | 70ns | 26ns | 25ns | 44ns | 48ns |
| Add 2x2 (unrolled) | 10ns | 34ns | 11ns | 11ns | 23ns | 24ns |
| Add 2x2 (unrolled dynamic) | 10ns | 37ns | 11ns | 11ns | 23ns | 25ns |
| Add 2x2 (flat) | 13ns | 43ns | 14ns | 14ns | 26ns | 27ns |
| Add 2x2 (flat col major) | 12ns | 39ns | 13ns | 13ns | 26ns | 27ns |
| Add 2x2 (flat simple) | 8ns | 49ns | 9ns | 9ns | 21ns | 23ns |
| Add 2x2 (flat unrolled) | 7ns | 41ns | 8ns | 7ns | 19ns | 21ns |
| Add 2x2 (F64) | 300ns | 457ns | 356ns | 374ns | 425ns | 457ns |
| Add 2x2 (F32) | 300ns | 570ns | 362ns | 377ns | 568ns | 570ns |
| Add 2x2 (I32) | 296ns | 412ns | 348ns | 367ns | 404ns | 412ns |
| Add 2x2 (F64 flat) | 292ns | 487ns | 339ns | 360ns | 393ns | 487ns |
| Add 2x2 (F32 flat) | 295ns | 420ns | 343ns | 366ns | 414ns | 420ns |
| Add 2x2 (I32 flat) | 293ns | 445ns | 343ns | 362ns | 401ns | 445ns |
| Add 2x2 (flat func) | 64ns | 113ns | 67ns | 66ns | 91ns | 95ns |
| Add 2x2 (WASM F64) | 434ns | 685ns | 494ns | 510ns | 678ns | 685ns |
| Add 2x2 (WASM SIMD F64) | 427ns | 551ns | 480ns | 500ns | 538ns | 551ns |
| Add 2x2 (WASM SIMD F32) | 434ns | 812ns | 482ns | 492ns | 768ns | 812ns |
| Add 2x2 (WASM SIMD I32) | 433ns | 672ns | 475ns | 487ns | 622ns | 672ns |
| Add 4x4 (Func) | 280ns | 341ns | 293ns | 299ns | 329ns | 341ns |
| Add 4x4 (Loop) | 113ns | 199ns | 126ns | 130ns | 168ns | 173ns |
| Add 4x4 (Loop Prealloc) | 58ns | 110ns | 63ns | 70ns | 91ns | 94ns |
| Add 4x4 (unrolled) | 42ns | 115ns | 50ns | 54ns | 78ns | 97ns |
| Add 4x4 (unrolled dynamic) | 42ns | 122ns | 50ns | 55ns | 78ns | 89ns |
| Add 4x4 (flat) | 28ns | 51ns | 30ns | 29ns | 43ns | 44ns |
| Add 4x4 (flat col major) | 28ns | 66ns | 31ns | 30ns | 45ns | 50ns |
| Add 4x4 (flat simple) | 17ns | 67ns | 20ns | 19ns | 36ns | 40ns |
| Add 4x4 (flat unrolled) | 26ns | 80ns | 32ns | 31ns | 53ns | 62ns |
| Add 4x4 (F64) | 338ns | 459ns | 381ns | 400ns | 455ns | 459ns |
| Add 4x4 (F32) | 330ns | 451ns | 370ns | 393ns | 432ns | 451ns |
| Add 4x4 (I32) | 312ns | 548ns | 357ns | 378ns | 485ns | 548ns |
| Add 4x4 (F64 flat) | 329ns | 560ns | 366ns | 379ns | 441ns | 560ns |
| Add 4x4 (F32 flat) | 317ns | 605ns | 367ns | 388ns | 568ns | 605ns |
| Add 4x4 (I32 flat) | 314ns | 550ns | 352ns | 372ns | 421ns | 550ns |
| Add 4x4 (flat func) | 95ns | 164ns | 101ns | 107ns | 127ns | 132ns |
| Add 4x4 (WASM F64) | 478ns | 610ns | 510ns | 518ns | 586ns | 610ns |
| Add 4x4 (WASM SIMD F64) | 466ns | 591ns | 497ns | 512ns | 572ns | 591ns |
| Add 4x4 (WASM SIMD F32) | 451ns | 548ns | 481ns | 494ns | 539ns | 548ns |
| Add 4x4 (WASM SIMD I32) | 453ns | 546ns | 478ns | 488ns | 537ns | 546ns |
| Add 8x8 (Func) | 656ns | 736ns | 679ns | 685ns | 736ns | 736ns |
| Add 8x8 (Loop) | 326ns | 372ns | 334ns | 333ns | 366ns | 372ns |
| Add 8x8 (Loop Prealloc) | 180ns | 259ns | 194ns | 197ns | 247ns | 254ns |
| Add 8x8 (unrolled) | 106ns | 479ns | 125ns | 127ns | 175ns | 185ns |
| Add 8x8 (unrolled dynamic) | 105ns | 480ns | 125ns | 127ns | 190ns | 207ns |
| Add 8x8 (flat) | 86ns | 165ns | 94ns | 100ns | 128ns | 149ns |
| Add 8x8 (flat col major) | 90ns | 150ns | 98ns | 105ns | 127ns | 134ns |
| Add 8x8 (flat simple) | 61ns | 124ns | 69ns | 75ns | 97ns | 104ns |
| Add 8x8 (flat unrolled) | 65ns | 155ns | 79ns | 84ns | 131ns | 136ns |
| Add 8x8 (F64) | 414ns | 725ns | 477ns | 495ns | 705ns | 725ns |
| Add 8x8 (F32) | 434ns | 725ns | 479ns | 497ns | 584ns | 725ns |
| Add 8x8 (I32) | 379ns | 583ns | 424ns | 441ns | 547ns | 583ns |
| Add 8x8 (F64 flat) | 379ns | 2317ns | 672ns | 591ns | 2317ns | 2317ns |
| Add 8x8 (F32 flat) | 392ns | 712ns | 436ns | 443ns | 667ns | 712ns |
| Add 8x8 (I32 flat) | 377ns | 544ns | 409ns | 419ns | 475ns | 544ns |
| Add 8x8 (flat func) | 223ns | 298ns | 240ns | 244ns | 282ns | 283ns |
| Add 8x8 (WASM F64) | 554ns | 2077ns | 802ns | 674ns | 2077ns | 2077ns |
| Add 8x8 (WASM SIMD F64) | 527ns | 2345ns | 778ns | 646ns | 2345ns | 2345ns |
| Add 8x8 (WASM SIMD F32) | 499ns | 875ns | 544ns | 550ns | 694ns | 875ns |
| Add 8x8 (WASM SIMD I32) | 500ns | 882ns | 533ns | 537ns | 601ns | 882ns |
| Add 16x16 (Func) | 1703ns | 1779ns | 1743ns | 1753ns | 1779ns | 1779ns |
| Add 16x16 (Loop) | 1016ns | 1099ns | 1044ns | 1052ns | 1099ns | 1099ns |
| Add 16x16 (Loop Prealloc) | 616ns | 689ns | 651ns | 660ns | 689ns | 689ns |
| Add 16x16 (unrolled) | 400ns | 258900ns | 600ns | 500ns | 1500ns | 9600ns |
| Add 16x16 (unrolled dynamic) | 600ns | 263200ns | 793ns | 700ns | 9400ns | 9900ns |
| Add 16x16 (flat) | 366ns | 609ns | 386ns | 391ns | 432ns | 609ns |
| Add 16x16 (flat col major) | 380ns | 489ns | 403ns | 410ns | 482ns | 489ns |
| Add 16x16 (flat simple) | 260ns | 348ns | 281ns | 285ns | 339ns | 348ns |
| Add 16x16 (flat unrolled) | 326ns | 4014ns | 366ns | 347ns | 387ns | 4014ns |
| Add 16x16 (F64) | 773ns | 1185ns | 869ns | 906ns | 1185ns | 1185ns |
| Add 16x16 (F32) | 830ns | 996ns | 879ns | 905ns | 996ns | 996ns |
| Add 16x16 (I32) | 617ns | 989ns | 675ns | 688ns | 989ns | 989ns |
| Add 16x16 (F64 flat) | 676ns | 1359ns | 777ns | 800ns | 1359ns | 1359ns |
| Add 16x16 (F32 flat) | 673ns | 1072ns | 738ns | 762ns | 1072ns | 1072ns |
| Add 16x16 (I32 flat) | 617ns | 932ns | 663ns | 685ns | 932ns | 932ns |
| Add 16x16 (flat func) | 773ns | 853ns | 805ns | 820ns | 853ns | 853ns |
| Add 16x16 (WASM F64) | 867ns | 1649ns | 971ns | 1002ns | 1649ns | 1649ns |
| Add 16x16 (WASM SIMD F64) | 783ns | 951ns | 844ns | 874ns | 951ns | 951ns |
| Add 16x16 (WASM SIMD F32) | 654ns | 1054ns | 705ns | 720ns | 1054ns | 1054ns |
| Add 16x16 (WASM SIMD I32) | 658ns | 792ns | 701ns | 723ns | 792ns | 792ns |
| Add 32x32 (Func) | 4961ns | 5128ns | 5038ns | 5052ns | 5128ns | 5128ns |
| Add 32x32 (Loop) | 4431ns | 4549ns | 4490ns | 4511ns | 4549ns | 4549ns |
| Add 32x32 (Loop Prealloc) | 2411ns | 2465ns | 2431ns | 2441ns | 2465ns | 2465ns |
| Add 32x32 (unrolled) | 72200ns | 478500ns | 79994ns | 81900ns | 125100ns | 155100ns |
| Add 32x32 (unrolled dynamic) | 72000ns | 1280700ns | 82065ns | 82600ns | 137600ns | 176500ns |
| Add 32x32 (flat) | 1720ns | 2557ns | 1998ns | 2053ns | 2557ns | 2557ns |
| Add 32x32 (flat col major) | 1772ns | 2037ns | 1827ns | 1845ns | 2037ns | 2037ns |
| Add 32x32 (flat simple) | 1217ns | 1616ns | 1276ns | 1287ns | 1616ns | 1616ns |
| Add 32x32 (flat unrolled) | 1200ns | 492700ns | 9236ns | 1900ns | 75800ns | 84200ns |
| Add 32x32 (F64) | 2494ns | 2927ns | 2606ns | 2642ns | 2927ns | 2927ns |
| Add 32x32 (F32) | 2499ns | 2695ns | 2593ns | 2630ns | 2695ns | 2695ns |
| Add 32x32 (I32) | 1665ns | 1891ns | 1734ns | 1763ns | 1891ns | 1891ns |
| Add 32x32 (F64 flat) | 2129ns | 2429ns | 2246ns | 2281ns | 2429ns | 2429ns |
| Add 32x32 (F32 flat) | 1917ns | 2107ns | 1987ns | 2015ns | 2107ns | 2107ns |
| Add 32x32 (I32 flat) | 1650ns | 1853ns | 1723ns | 1735ns | 1853ns | 1853ns |
| Add 32x32 (flat func) | 3215ns | 3515ns | 3314ns | 3380ns | 3515ns | 3515ns |
| Add 32x32 (WASM F64) | 2737ns | 3032ns | 2864ns | 2903ns | 3032ns | 3032ns |
| Add 32x32 (WASM SIMD F64) | 2372ns | 3883ns | 2588ns | 2500ns | 3883ns | 3883ns |
| Add 32x32 (WASM SIMD F32) | 1483ns | 1783ns | 1560ns | 1589ns | 1783ns | 1783ns |
| Add 32x32 (WASM SIMD I32) | 1499ns | 1645ns | 1555ns | 1584ns | 1645ns | 1645ns |
| Add 64x64 (Func) | 13200ns | 346600ns | 17719ns | 16700ns | 34600ns | 84700ns |
| Add 64x64 (Loop) | 13300ns | 246700ns | 18280ns | 17300ns | 39600ns | 97300ns |
| Add 64x64 (Loop Prealloc) | 8600ns | 337000ns | 9713ns | 9100ns | 21900ns | 27700ns |
| Add 64x64 (unrolled) | 318700ns | 640900ns | 339704ns | 339600ns | 498500ns | 513100ns |
| Add 64x64 (unrolled dynamic) | 316900ns | 688100ns | 341144ns | 340000ns | 504500ns | 519200ns |
| Add 64x64 (flat) | 7324ns | 7669ns | 7461ns | 7476ns | 7669ns | 7669ns |
| Add 64x64 (flat col major) | 8030ns | 9241ns | 8222ns | 8239ns | 9241ns | 9241ns |
| Add 64x64 (flat simple) | 5529ns | 5990ns | 5648ns | 5706ns | 5990ns | 5990ns |
| Add 64x64 (flat unrolled) | 241000ns | 645600ns | 266799ns | 266000ns | 437800ns | 446800ns |
| Add 64x64 (F64) | 5500ns | 1690200ns | 9949ns | 9300ns | 22700ns | 27100ns |
| Add 64x64 (F32) | 9501ns | 9758ns | 9609ns | 9634ns | 9758ns | 9758ns |
| Add 64x64 (I32) | 6333ns | 8882ns | 6713ns | 6585ns | 8882ns | 8882ns |
| Add 64x64 (F64 flat) | 5000ns | 3709000ns | 12755ns | 13900ns | 27900ns | 32800ns |
| Add 64x64 (F32 flat) | 5500ns | 4133600ns | 8915ns | 7800ns | 27300ns | 33300ns |
| Add 64x64 (I32 flat) | 6460ns | 8540ns | 6917ns | 6982ns | 8540ns | 8540ns |
| Add 64x64 (flat func) | 9900ns | 254800ns | 13897ns | 13200ns | 27200ns | 69400ns |
| Add 64x64 (WASM F64) | 7600ns | 272000ns | 11090ns | 10800ns | 19100ns | 27000ns |
| Add 64x64 (WASM SIMD F64) | 6200ns | 2670300ns | 10132ns | 9400ns | 23100ns | 27900ns |
| Add 64x64 (WASM SIMD F32) | 5154ns | 6094ns | 5360ns | 5378ns | 6094ns | 6094ns |
| Add 64x64 (WASM SIMD I32) | 5188ns | 5973ns | 5396ns | 5433ns | 5973ns | 5973ns |
| Add 128x128 (Func) | 45800ns | 271300ns | 60844ns | 61900ns | 185300ns | 195600ns |
| Add 128x128 (Loop) | 55200ns | 390500ns | 72186ns | 74300ns | 205700ns | 213500ns |
| Add 128x128 (Loop Prealloc) | 35100ns | 248300ns | 40496ns | 38600ns | 114200ns | 160200ns |
| Add 128x128 (unrolled) | 1278700ns | 1828300ns | 1343788ns | 1344500ns | 1620500ns | 1723800ns |
| Add 128x128 (unrolled dynamic) | 1274000ns | 1792100ns | 1336337ns | 1334900ns | 1586600ns | 1627400ns |
| Add 128x128 (flat) | 67700ns | 1414500ns | 94394ns | 83300ns | 393000ns | 485100ns |
| Add 128x128 (flat col major) | 119700ns | 4007600ns | 148592ns | 132900ns | 471900ns | 519400ns |
| Add 128x128 (flat simple) | 59900ns | 508700ns | 93954ns | 105200ns | 272900ns | 299900ns |
| Add 128x128 (flat unrolled) | 1202000ns | 1744700ns | 1291079ns | 1326300ns | 1670100ns | 1682700ns |
| Add 128x128 (F64) | 26500ns | 3763100ns | 54327ns | 62100ns | 131800ns | 268700ns |
| Add 128x128 (F32) | 27700ns | 3320900ns | 42542ns | 40700ns | 73400ns | 88700ns |
| Add 128x128 (I32) | 21200ns | 1250800ns | 29148ns | 26900ns | 54100ns | 68200ns |
| Add 128x128 (F64 flat) | 22300ns | 4049800ns | 40977ns | 37300ns | 104200ns | 244500ns |
| Add 128x128 (F32 flat) | 24800ns | 1078300ns | 33077ns | 31000ns | 62400ns | 73600ns |
| Add 128x128 (I32 flat) | 21000ns | 4687200ns | 29749ns | 26700ns | 60500ns | 69300ns |
| Add 128x128 (flat func) | 312400ns | 2369600ns | 409509ns | 356500ns | 1593000ns | 1767300ns |
| Add 128x128 (WASM F64) | 30100ns | 374300ns | 42977ns | 41900ns | 83300ns | 137000ns |
| Add 128x128 (WASM SIMD F64) | 27600ns | 1592700ns | 43336ns | 42700ns | 83900ns | 115400ns |
| Add 128x128 (WASM SIMD F32) | 18200ns | 2968600ns | 25829ns | 23300ns | 50000ns | 60700ns |
| Add 128x128 (WASM SIMD I32) | 16300ns | 3989800ns | 25720ns | 24100ns | 54400ns | 66100ns |
| Add 256x256 (Func) | 174800ns | 538100ns | 203033ns | 190800ns | 384200ns | 411800ns |
| Add 256x256 (Loop) | 282800ns | 707400ns | 315612ns | 300600ns | 526700ns | 596700ns |
| Add 256x256 (Loop Prealloc) | 147400ns | 440400ns | 163873ns | 159600ns | 314000ns | 324800ns |
| Add 256x256 (unrolled) | 5352600ns | 6199900ns | 5632107ns | 5749700ns | 6199900ns | 6199900ns |
| Add 256x256 (unrolled dynamic) | 1283700ns | 2213900ns | 1348989ns | 1344300ns | 1629500ns | 2088000ns |
| Add 256x256 (flat) | 243200ns | 4196000ns | 393873ns | 402300ns | 1048000ns | 1139200ns |
| Add 256x256 (flat col major) | 629600ns | 1093900ns | 703013ns | 713100ns | 951000ns | 1000000ns |
| Add 256x256 (flat simple) | 318600ns | 882200ns | 390075ns | 394200ns | 604500ns | 636800ns |
| Add 256x256 (flat unrolled) | 4992100ns | 6206500ns | 5352088ns | 5491100ns | 6097200ns | 6206500ns |
| Add 256x256 (F64) | 72100ns | 957800ns | 156667ns | 151300ns | 544800ns | 561800ns |
| Add 256x256 (F32) | 88900ns | 585400ns | 136578ns | 150000ns | 303100ns | 406400ns |
| Add 256x256 (I32) | 65900ns | 468900ns | 101296ns | 100200ns | 241400ns | 376000ns |
| Add 256x256 (F64 flat) | 69800ns | 724800ns | 148573ns | 132400ns | 540600ns | 554900ns |
| Add 256x256 (F32 flat) | 80700ns | 676500ns | 116117ns | 114500ns | 270200ns | 388400ns |
| Add 256x256 (I32 flat) | 67600ns | 3160600ns | 105838ns | 100600ns | 274300ns | 375900ns |
| Add 256x256 (flat func) | 1254400ns | 5091400ns | 1749203ns | 1452600ns | 5024400ns | 5042400ns |
| Add 256x256 (WASM F64) | 116900ns | 1158400ns | 201761ns | 213700ns | 570500ns | 655800ns |
| Add 256x256 (WASM SIMD F64) | 90500ns | 715600ns | 173352ns | 157100ns | 551100ns | 559100ns |
| Add 256x256 (WASM SIMD F32) | 60700ns | 651400ns | 93714ns | 92100ns | 255200ns | 359800ns |
| Add 256x256 (WASM SIMD I32) | 60500ns | 977300ns | 91504ns | 90400ns | 264200ns | 356300ns |