Fast Matrix Math 2
Last time we looked at a number of ways to add two matrices, starting from a naive version and applying simplifications until we basically just had a flat loop over the data, which turned out to be the fastest, beating out even TypedArrays. In some ways I regret choosing such a simple case as element-wise addition, because it just simplified into testing the speed of flat loops; matrix multiplication might have been a better subject. Oh well, maybe next time.
There were some shortcomings: because of the sheer number of permutations I couldn't get to everything, so we'll start off by really looking at byte representations as a source of speed. In addition, I've swapped the 100x100 test for a 128x128 one and added a final 256x256 to really challenge differences with "big" matrices.
This actually changes some strategies, as things like the simple flat loop actually get a lot worse than preallocated loops, and typed arrays really pull ahead. My guess is that the compiler doesn't trust allocating large JS arrays.
Simple Typed Array Loops
I neglected to test typed arrays with flat loops, since we know we don't need the structured row-column loops we had before. There's not much to these:
export function addMatrixFloat64Flat(a, b) {
    const out = {
        shape: a.shape,
        data: new Float64Array(a.data.length)
    };
    for (let i = 0; i < a.data.length; i++) {
        out.data[i] = a.data[i] + b.data[i];
    }
    return out;
}
size | F64 (avg ns) | F64 flat (avg ns) | F32 (avg ns) | F32 flat (avg ns) | I32 (avg ns) | I32 flat (avg ns) |
---|---|---|---|---|---|---|
1x1 | 353.9 | 364.6 | 348.8 | 348.5 | 351.0 | 352.7 |
2x2 | 356.2 | 339.3 | 362.3 | 343.4 | 348.1 | 342.8 |
4x4 | 380.8 | 365.7 | 370.5 | 366.9 | 357.2 | 352.3 |
8x8 | 477.3 | 672.0 | 479.2 | 436.0 | 424.5 | 408.5 |
16x16 | 868.7 | 777.3 | 879.1 | 738.5 | 674.6 | 662.7 |
32x32 | 2606.1 | 2246.2 | 2592.9 | 1986.7 | 1733.7 | 1722.9 |
64x64 | 9949 | 12755 | 9609.2 | 8915 | 6712.6 | 6916.7 |
128x128 | 54327 | 40977 | 42542 | 33077 | 29148 | 29749 |
256x256 | 156667 | 148573 | 136578 | 116117 | 101296 | 105838 |
As expected, the flat versions are a tad faster, except strangely for I32, which is slightly faster with the inner loop bounds checks. Weird.
Fixing Tests for Float32 Arrays
One issue that came up, and that I simply ignored last time, was that I couldn't write tests for Float32Arrays because of the precision differences between them and 64-bit floats. I found that there's actually a function in the JavaScript Math library called `Math.fround` that does floating-point precision rounding. While generating data I can `fround` the arguments and `fround` the result to keep it accurate for 32-bit floats.
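As a minimal sketch of how that might look (the helper name here is illustrative, not from the repo):

function makeFloat32Case(length) {
    const a = new Float32Array(length);
    const b = new Float32Array(length);
    const expected = new Float32Array(length);
    for (let i = 0; i < length; i++) {
        a[i] = Math.fround(Math.random()); //round the arguments to f32 precision
        b[i] = Math.fround(Math.random());
        //round the result too, so the oracle matches what a Float32Array stores
        expected[i] = Math.fround(a[i] + b[i]);
    }
    return { a, b, expected };
}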
WASM
Another way we can look to speed things up is WASM. WASM is often described as a math co-processor for the web; at least, people think that if you move your code to WASM you'll see a big speedup. This, however, is much more complicated than we might think. The overhead of calling into WASM is high because memory has to be moved between the host and the WASM module, so we might not see big gains, but hopefully the low internal overhead helps speed things up.
I built a module in pure WAT, the WebAssembly Text Format. Normally we wouldn't do this and would instead compile a language like Rust or C to WASM, but there's a lot of complexity in that compilation: weird stuff might make it into the module, change our results, and be hard to optimize. To keep things as close to the metal as possible, WAT is the best option. Thankfully, what we're doing is not complicated and can reasonably be written in WAT.
;;mat.wat
(module
  (func $matrix_add_f64 (param $lhs_ptr i32) (param $rhs_ptr i32) (param $length i32) (param $out_ptr i32)
    (local $i i32)
    (local $offset i32)
    (local.set $i (i32.const 0))
    (block
      (loop
        (local.set $offset (i32.mul (local.get $i) (i32.const 8)))
        (i32.add (local.get $out_ptr) (local.get $offset)) ;;out addr on stack
        (f64.add ;;result on stack
          (f64.load (i32.add (local.get $lhs_ptr) (local.get $offset)))
          (f64.load (i32.add (local.get $rhs_ptr) (local.get $offset))))
        (f64.store)
        (local.set $i (i32.add (local.get $i) (i32.const 1)))
        (br_if 1 (i32.eq (local.get $i) (local.get $length)))
        (br 0)
      )
    )
  )
  (export "addMatrix" (func $matrix_add_f64))
  (memory (export "memory") 1)
)
I'm not going to explain this much as it's out of scope, but it's basically the same as the flat simple version: it treats the data as a single array and iterates over it while adding. The parameters are a pointer to the left-hand array, a pointer to the right-hand array, the total length, and where to store the output. Technically we could have used just the length, since if we start at 0 we know exactly where everything is, but to start I wanted to be explicit.
Compiling WAT
To actually compile this we can use a tool called WABT (the WebAssembly Binary Toolkit). It's basically a mess that requires CMake; I couldn't get it to run on WSL and I wasn't going to install MinGW. Instead there's a nice tool called WAPM from Wasmer, which works like npm for WebAssembly packages, and since WABT has been compiled down to WebAssembly we can run it in any environment. In fact, we don't even need to add configuration so long as wapm is installed. We can run `wax wat2wasm -- wat/mat.wat -o wasm/mat.wasm`. `wax` is like npx for npm. If you're wondering, the command we give wax is defined by the `wasmer/wabt` package: https://wapm.io/wasmer/wabt. Also, for some reason you can't prefix local paths with `./`, so `wax wat2wasm -- ./wat/mat.wat` doesn't work, which took me a while to figure out. Anyway, this provides a nice, simple compile environment if you want to work on raw WAT files.
This will work up until we try a 128x128 matrix, where it gives an out-of-bounds access. The reason is that the memory we allocated can't fit all the data we need! We can, however, grow the pages in response to larger inputs.
With page growing:
export function addMatrixWasmF64(a, b) {
    const lhsElementOffset = 0;
    const rhsElementOffset = lhsElementOffset + a.data.length;
    const rhsByteOffset = rhsElementOffset * 8;
    const resultElementOffset = lhsElementOffset + a.data.length + b.data.length;
    const resultByteOffset = resultElementOffset * 8;
    const elementLength = a.data.length;
    //grow memory if needed
    const spaceLeftover = matInstance.exports.memory.buffer.byteLength - (a.data.length * 3 * 8);
    if (spaceLeftover < 0) {
        const pagesNeeded = Math.ceil((a.data.length * 3 * 8) / (64 * 1024));
        const pagesHave = matInstance.exports.memory.buffer.byteLength / (64 * 1024);
        matInstance.exports.memory.grow(pagesNeeded - pagesHave);
    }
    const f64View = new Float64Array(matInstance.exports.memory.buffer);
    f64View.set(a.data, lhsElementOffset);
    f64View.set(b.data, rhsElementOffset);
    matInstance.exports.addMatrix(0, rhsByteOffset, elementLength, resultByteOffset);
    return {
        shape: a.shape,
        data: f64View.slice(resultElementOffset, resultElementOffset + elementLength)
    };
}
This might be how we'd do a production version so things don't explode, but for performance reasons I'm going to leave that out and make the memory static. We'll only go up to 256x256 matrices, so we need a total of 24 pages allocated (256 × 256 elements × 8 bytes × 3 arrays = 1,572,864 bytes, which at 64KiB per page is exactly 24 pages).
(memory (export "memory") 24)
The benchmark results are a tad disappointing. It expectedly starts out at the bottom of the pack; loading data into the WASM module is expensive overhead. It slowly climbs the ladder, but even at 128x128 it's still slower than a simple loop; the lack of allocation makes a big difference. This means that unless more of the data can be stored on the WASM side, it's unlikely WASM will provide much of a benefit.
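Just to illustrate where that overhead would go, here's a hedged sketch of keeping operands resident in WASM memory (the `upload` helper is hypothetical and not something I benchmarked):

//sketch: upload operands once, then reuse them across many calls
//a and b are matrices in the {shape, data} format used throughout
const f64View = new Float64Array(matInstance.exports.memory.buffer);
function upload(matrix, elementOffset) {
    f64View.set(matrix.data, elementOffset);
    return { byteOffset: elementOffset * 8, length: matrix.data.length };
}
const lhs = upload(a, 0);
const rhs = upload(b, a.data.length);
const outByteOffset = (a.data.length + b.data.length) * 8;
//the copy above is paid once; each call below is just the WASM loop
for (let i = 0; i < 1000; i++) {
    matInstance.exports.addMatrix(lhs.byteOffset, rhs.byteOffset, lhs.length, outByteOffset);
}

Still, WASM has one more trick.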
SIMD
Unlike JavaScript, WASM has access to SIMD instructions, which can be used to parallelize the iteration, and these are pretty well supported. SIMD stands for Single Instruction Multiple Data. The idea is that it lets us load, add, etc. more than one value at a time, provided everything lines up correctly. WASM, at least as of the time of writing, only has a 128-bit type: we load a 128-bit value and choose at the operation level how to interpret it, which in our case corresponds to 2 64-bit floats. When we increment `$i` we now do so by 2.
It's actually not too hard to modify the existing code. Instead of stepping by 8 bytes, we step by 16 bytes, loading 128-bit values instead of 64. Then we use `f64x2.add` to add the two floats at the same time and store the result back to memory as a 128-bit value.
(func $matrix_add_simd_f64 (param $lhs_ptr i32) (param $rhs_ptr i32) (param $length i32) (param $out_ptr i32)
  (local $i i32)
  (local $offset i32)
  (local.set $i (i32.const 0))
  (block
    (loop
      (local.set $offset (i32.mul (local.get $i) (i32.const 8))) ;;$i steps by 2, so we move in 128-bit blocks
      (i32.add (local.get $out_ptr) (local.get $offset)) ;;out addr on stack
      (f64x2.add
        (v128.load (i32.add (local.get $lhs_ptr) (local.get $offset)))
        (v128.load (i32.add (local.get $rhs_ptr) (local.get $offset)))) ;;result on stack
      (v128.store)
      (local.set $i (i32.add (local.get $i) (i32.const 2)))
      (br_if 1 (i32.ge_u (local.get $i) (local.get $length)))
      (br 0)
    )
  )
)
Another thing to think about is what happens if we don't have an exact multiple of 2 floats to add. We need one modification: the branch condition becomes greater-than-or-equal (`i32.ge_u`) rather than equal (`i32.eq`), to account for possibly overrunning by one. But let's also think about the memory bounds. If we have just one f64, the second lane will load garbage from outside our logical bounds, but we never write (`v128.store`) in a way that bumps into the next segment (assuming the pointers don't overlap); only the output array at `$out_ptr` is ever written to. Since we only read back the exact number of elements we want, the odd value is just ignored. We also can't read outside the bounds of allocated memory, because it's always allocated in whole 64KiB pages.
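To make the overrun concrete, here's the 3-element case traced out (a toy walkthrough using the names from the wrapper above, not part of the benchmark code):

//toy trace of the f64x2 loop with $length = 3
//i=0: loads/adds/stores elements 0 and 1, then i becomes 2
//i=2: loads/adds/stores elements 2 and 3 (lane 3 is garbage), then i becomes 4
//i=4 >= 3, so br_if 1 exits the loop
//back in JS we slice exactly 3 elements, so the garbage lane never escapes:
const data = f64View.slice(resultElementOffset, resultElementOffset + 3);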
I had to do a little debugging to get this right. A simple way is to import little functions for printing values:
const { instance: matInstance } = await WebAssembly.instantiate(wasm, {
    lib: {
        print_int: function (num) {
            console.log(num);
        },
        print_float: function (num) {
            console.log(num);
        },
        print_brk: function () {
            console.log("---");
        }
    }
});
(import "lib" "print_int" (func $print_int (param i32)))
(import "lib" "print_float" (func $print_float (param f32)))
(import "lib" "print_brk" (func $print_brk))
;;---stuff
(call $print_int (local.get $offset)) ;;print offset
(call $print_int (local.get $i)) ;;print $i
(call $print_brk) ;;adds a visual barrier between element logs
We claw our way up the ladder a bit further, but we're still not beating a preallocated loop at 256x256. So, still not very usable.
SIMD F32 and I32
We can still mess with the precision: 32-bit values give us the advantage of adding 4 values at a time. We also know that integers generally work a bit faster.
(func $matrix_add_simd_f32 (param $lhs_ptr i32) (param $rhs_ptr i32) (param $length i32) (param $out_ptr i32)
  (local $i i32)
  (local $offset i32)
  (local.set $i (i32.const 0))
  (block
    (loop
      (local.set $offset (i32.mul (local.get $i) (i32.const 4))) ;;4-byte elements, $i steps by 4, still 128-bit blocks
      (i32.add (local.get $out_ptr) (local.get $offset)) ;;out addr on stack
      (f32x4.add
        (v128.load (i32.add (local.get $lhs_ptr) (local.get $offset)))
        (v128.load (i32.add (local.get $rhs_ptr) (local.get $offset)))) ;;result on stack
      (v128.store)
      (local.set $i (i32.add (local.get $i) (i32.const 4)))
      (br_if 1 (i32.ge_u (local.get $i) (local.get $length)))
      (br 0)
    )
  )
)
(func $matrix_add_simd_i32 (param $lhs_ptr i32) (param $rhs_ptr i32) (param $length i32) (param $out_ptr i32)
  (local $i i32)
  (local $offset i32)
  (local.set $i (i32.const 0))
  (block
    (loop
      (local.set $offset (i32.mul (local.get $i) (i32.const 4))) ;;4-byte elements, $i steps by 4, still 128-bit blocks
      (i32.add (local.get $out_ptr) (local.get $offset)) ;;out addr on stack
      (i32x4.add
        (v128.load (i32.add (local.get $lhs_ptr) (local.get $offset)))
        (v128.load (i32.add (local.get $rhs_ptr) (local.get $offset)))) ;;result on stack
      (v128.store)
      (local.set $i (i32.add (local.get $i) (i32.const 4)))
      (br_if 1 (i32.ge_u (local.get $i) (local.get $length)))
      (br 0)
    )
  )
)
export function addMatrixWasmSimdF32(a, b) {
    const lhsElementOffset = 0;
    const rhsElementOffset = lhsElementOffset + a.data.length;
    const rhsByteOffset = rhsElementOffset * 4;
    const resultElementOffset = lhsElementOffset + a.data.length + b.data.length;
    const resultByteOffset = resultElementOffset * 4;
    const elementLength = a.data.length;
    //memory growth omitted, memory is statically sized at 24 pages (see above)
    const f32View = new Float32Array(matInstance.exports.memory.buffer);
    f32View.set(a.data, lhsElementOffset);
    f32View.set(b.data, rhsElementOffset);
    matInstance.exports.addMatrixSimdF32(0, rhsByteOffset, elementLength, resultByteOffset);
    return {
        shape: a.shape,
        data: f32View.slice(resultElementOffset, resultElementOffset + elementLength)
    };
}
export function addMatrixWasmSimdI32(a, b) {
    const lhsElementOffset = 0;
    const rhsElementOffset = lhsElementOffset + a.data.length;
    const rhsByteOffset = rhsElementOffset * 4;
    const resultElementOffset = lhsElementOffset + a.data.length + b.data.length;
    const resultByteOffset = resultElementOffset * 4;
    const elementLength = a.data.length;
    //memory growth omitted, memory is statically sized at 24 pages (see above)
    const i32View = new Int32Array(matInstance.exports.memory.buffer);
    i32View.set(a.data, lhsElementOffset);
    i32View.set(b.data, rhsElementOffset);
    matInstance.exports.addMatrixSimdI32(0, rhsByteOffset, elementLength, resultByteOffset);
    return {
        shape: a.shape,
        data: i32View.slice(resultElementOffset, resultElementOffset + elementLength)
    };
}
These are just basic substitutions of the original code but with 4-byte elements. The WASM still jumps in 128-bit blocks; it's just treating them as 4 values instead of 2. For arrays larger than 32x32, I32 and F32 are nearly equal and our new speed kings! Still, it's not a huge difference versus an `Int32Array` in plain JavaScript (~+10%) for a fair bit of added complexity.
Conclusion
What we've observed this time is that WASM by itself does not yield great performance gains until it overcomes the overhead of the memory copy, which in our case is around the 64x64 mark. Below that, typed arrays can still beat it without introducing that complexity; the compiler seems smart enough to optimize them. However, once we introduce SIMD, WASM can overtake plain JS starting around 64x64 and pulls away as sizes get bigger.
Code
https://github.com/ndesmic/fast-mat/tree/v1.1
Data
Name | min | max | avg | p75 | p99 | p995 |
---|---|---|---|---|---|---|
Add 1x1 (Func) | 61ns | 198ns | 70ns | 66ns | 168ns | 171ns |
Add 1x1 (Loop) | 40ns | 141ns | 57ns | 62ns | 113ns | 119ns |
Add 1x1 (Loop Prealloc) | 13ns | 56ns | 15ns | 15ns | 35ns | 40ns |
Add 1x1 (unrolled) | 8ns | 47ns | 8ns | 8ns | 21ns | 22ns |
Add 1x1 (unrolled dynamic) | 6ns | 37ns | 7ns | 7ns | 19ns | 21ns |
Add 1x1 (flat) | 8ns | 39ns | 9ns | 9ns | 21ns | 22ns |
Add 1x1 (flat col major) | 8ns | 36ns | 9ns | 9ns | 21ns | 22ns |
Add 1x1 (flat simple) | 6ns | 39ns | 7ns | 7ns | 18ns | 19ns |
Add 1x1 (flat unrolled) | 5ns | 28ns | 6ns | 6ns | 17ns | 18ns |
Add 1x1 (F64) | 320ns | 441ns | 354ns | 364ns | 406ns | 441ns |
Add 1x1 (F32) | 304ns | 398ns | 349ns | 360ns | 398ns | 398ns |
Add 1x1 (I32) | 302ns | 532ns | 351ns | 360ns | 469ns | 532ns |
Add 1x1 (F64 flat) | 309ns | 696ns | 365ns | 369ns | 555ns | 696ns |
Add 1x1 (F32 flat) | 299ns | 544ns | 348ns | 359ns | 411ns | 544ns |
Add 1x1 (I32 flat) | 305ns | 601ns | 353ns | 364ns | 585ns | 601ns |
Add 1x1 (flat func) | 56ns | 87ns | 59ns | 58ns | 79ns | 85ns |
Add 1x1 (WASM F64) | 463ns | 818ns | 500ns | 507ns | 546ns | 818ns |
Add 1x1 (WASM SIMD F64) | 461ns | 780ns | 513ns | 516ns | 751ns | 780ns |
Add 1x1 (WASM SIMD F32) | 444ns | 555ns | 491ns | 503ns | 551ns | 555ns |
Add 1x1 (WASM SIMD I32) | 441ns | 734ns | 499ns | 513ns | 644ns | 734ns |
Add 2x2 (Func) | 128ns | 184ns | 136ns | 139ns | 168ns | 173ns |
Add 2x2 (Loop) | 59ns | 168ns | 72ns | 76ns | 115ns | 128ns |
Add 2x2 (Loop Prealloc) | 23ns | 70ns | 26ns | 25ns | 44ns | 48ns |
Add 2x2 (unrolled) | 10ns | 34ns | 11ns | 11ns | 23ns | 24ns |
Add 2x2 (unrolled dynamic) | 10ns | 37ns | 11ns | 11ns | 23ns | 25ns |
Add 2x2 (flat) | 13ns | 43ns | 14ns | 14ns | 26ns | 27ns |
Add 2x2 (flat col major) | 12ns | 39ns | 13ns | 13ns | 26ns | 27ns |
Add 2x2 (flat simple) | 8ns | 49ns | 9ns | 9ns | 21ns | 23ns |
Add 2x2 (flat unrolled) | 7ns | 41ns | 8ns | 7ns | 19ns | 21ns |
Add 2x2 (F64) | 300ns | 457ns | 356ns | 374ns | 425ns | 457ns |
Add 2x2 (F32) | 300ns | 570ns | 362ns | 377ns | 568ns | 570ns |
Add 2x2 (I32) | 296ns | 412ns | 348ns | 367ns | 404ns | 412ns |
Add 2x2 (F64 flat) | 292ns | 487ns | 339ns | 360ns | 393ns | 487ns |
Add 2x2 (F32 flat) | 295ns | 420ns | 343ns | 366ns | 414ns | 420ns |
Add 2x2 (I32 flat) | 293ns | 445ns | 343ns | 362ns | 401ns | 445ns |
Add 2x2 (flat func) | 64ns | 113ns | 67ns | 66ns | 91ns | 95ns |
Add 2x2 (WASM F64) | 434ns | 685ns | 494ns | 510ns | 678ns | 685ns |
Add 2x2 (WASM SIMD F64) | 427ns | 551ns | 480ns | 500ns | 538ns | 551ns |
Add 2x2 (WASM SIMD F32) | 434ns | 812ns | 482ns | 492ns | 768ns | 812ns |
Add 2x2 (WASM SIMD I32) | 433ns | 672ns | 475ns | 487ns | 622ns | 672ns |
Add 4x4 (Func) | 280ns | 341ns | 293ns | 299ns | 329ns | 341ns |
Add 4x4 (Loop) | 113ns | 199ns | 126ns | 130ns | 168ns | 173ns |
Add 4x4 (Loop Prealloc) | 58ns | 110ns | 63ns | 70ns | 91ns | 94ns |
Add 4x4 (unrolled) | 42ns | 115ns | 50ns | 54ns | 78ns | 97ns |
Add 4x4 (unrolled dynamic) | 42ns | 122ns | 50ns | 55ns | 78ns | 89ns |
Add 4x4 (flat) | 28ns | 51ns | 30ns | 29ns | 43ns | 44ns |
Add 4x4 (flat col major) | 28ns | 66ns | 31ns | 30ns | 45ns | 50ns |
Add 4x4 (flat simple) | 17ns | 67ns | 20ns | 19ns | 36ns | 40ns |
Add 4x4 (flat unrolled) | 26ns | 80ns | 32ns | 31ns | 53ns | 62ns |
Add 4x4 (F64) | 338ns | 459ns | 381ns | 400ns | 455ns | 459ns |
Add 4x4 (F32) | 330ns | 451ns | 370ns | 393ns | 432ns | 451ns |
Add 4x4 (I32) | 312ns | 548ns | 357ns | 378ns | 485ns | 548ns |
Add 4x4 (F64 flat) | 329ns | 560ns | 366ns | 379ns | 441ns | 560ns |
Add 4x4 (F32 flat) | 317ns | 605ns | 367ns | 388ns | 568ns | 605ns |
Add 4x4 (I32 flat) | 314ns | 550ns | 352ns | 372ns | 421ns | 550ns |
Add 4x4 (flat func) | 95ns | 164ns | 101ns | 107ns | 127ns | 132ns |
Add 4x4 (WASM F64) | 478ns | 610ns | 510ns | 518ns | 586ns | 610ns |
Add 4x4 (WASM SIMD F64) | 466ns | 591ns | 497ns | 512ns | 572ns | 591ns |
Add 4x4 (WASM SIMD F32) | 451ns | 548ns | 481ns | 494ns | 539ns | 548ns |
Add 4x4 (WASM SIMD I32) | 453ns | 546ns | 478ns | 488ns | 537ns | 546ns |
Add 8x8 (Func) | 656ns | 736ns | 679ns | 685ns | 736ns | 736ns |
Add 8x8 (Loop) | 326ns | 372ns | 334ns | 333ns | 366ns | 372ns |
Add 8x8 (Loop Prealloc) | 180ns | 259ns | 194ns | 197ns | 247ns | 254ns |
Add 8x8 (unrolled) | 106ns | 479ns | 125ns | 127ns | 175ns | 185ns |
Add 8x8 (unrolled dynamic) | 105ns | 480ns | 125ns | 127ns | 190ns | 207ns |
Add 8x8 (flat) | 86ns | 165ns | 94ns | 100ns | 128ns | 149ns |
Add 8x8 (flat col major) | 90ns | 150ns | 98ns | 105ns | 127ns | 134ns |
Add 8x8 (flat simple) | 61ns | 124ns | 69ns | 75ns | 97ns | 104ns |
Add 8x8 (flat unrolled) | 65ns | 155ns | 79ns | 84ns | 131ns | 136ns |
Add 8x8 (F64) | 414ns | 725ns | 477ns | 495ns | 705ns | 725ns |
Add 8x8 (F32) | 434ns | 725ns | 479ns | 497ns | 584ns | 725ns |
Add 8x8 (I32) | 379ns | 583ns | 424ns | 441ns | 547ns | 583ns |
Add 8x8 (F64 flat) | 379ns | 2317ns | 672ns | 591ns | 2317ns | 2317ns |
Add 8x8 (F32 flat) | 392ns | 712ns | 436ns | 443ns | 667ns | 712ns |
Add 8x8 (I32 flat) | 377ns | 544ns | 409ns | 419ns | 475ns | 544ns |
Add 8x8 (flat func) | 223ns | 298ns | 240ns | 244ns | 282ns | 283ns |
Add 8x8 (WASM F64) | 554ns | 2077ns | 802ns | 674ns | 2077ns | 2077ns |
Add 8x8 (WASM SIMD F64) | 527ns | 2345ns | 778ns | 646ns | 2345ns | 2345ns |
Add 8x8 (WASM SIMD F32) | 499ns | 875ns | 544ns | 550ns | 694ns | 875ns |
Add 8x8 (WASM SIMD I32) | 500ns | 882ns | 533ns | 537ns | 601ns | 882ns |
Add 16x16 (Func) | 1703ns | 1779ns | 1743ns | 1753ns | 1779ns | 1779ns |
Add 16x16 (Loop) | 1016ns | 1099ns | 1044ns | 1052ns | 1099ns | 1099ns |
Add 16x16 (Loop Prealloc) | 616ns | 689ns | 651ns | 660ns | 689ns | 689ns |
Add 16x16 (unrolled) | 400ns | 258900ns | 600ns | 500ns | 1500ns | 9600ns |
Add 16x16 (unrolled dynamic) | 600ns | 263200ns | 793ns | 700ns | 9400ns | 9900ns |
Add 16x16 (flat) | 366ns | 609ns | 386ns | 391ns | 432ns | 609ns |
Add 16x16 (flat col major) | 380ns | 489ns | 403ns | 410ns | 482ns | 489ns |
Add 16x16 (flat simple) | 260ns | 348ns | 281ns | 285ns | 339ns | 348ns |
Add 16x16 (flat unrolled) | 326ns | 4014ns | 366ns | 347ns | 387ns | 4014ns |
Add 16x16 (F64) | 773ns | 1185ns | 869ns | 906ns | 1185ns | 1185ns |
Add 16x16 (F32) | 830ns | 996ns | 879ns | 905ns | 996ns | 996ns |
Add 16x16 (I32) | 617ns | 989ns | 675ns | 688ns | 989ns | 989ns |
Add 16x16 (F64 flat) | 676ns | 1359ns | 777ns | 800ns | 1359ns | 1359ns |
Add 16x16 (F32 flat) | 673ns | 1072ns | 738ns | 762ns | 1072ns | 1072ns |
Add 16x16 (I32 flat) | 617ns | 932ns | 663ns | 685ns | 932ns | 932ns |
Add 16x16 (flat func) | 773ns | 853ns | 805ns | 820ns | 853ns | 853ns |
Add 16x16 (WASM F64) | 867ns | 1649ns | 971ns | 1002ns | 1649ns | 1649ns |
Add 16x16 (WASM SIMD F64) | 783ns | 951ns | 844ns | 874ns | 951ns | 951ns |
Add 16x16 (WASM SIMD F32) | 654ns | 1054ns | 705ns | 720ns | 1054ns | 1054ns |
Add 16x16 (WASM SIMD I32) | 658ns | 792ns | 701ns | 723ns | 792ns | 792ns |
Add 32x32 (Func) | 4961ns | 5128ns | 5038ns | 5052ns | 5128ns | 5128ns |
Add 32x32 (Loop) | 4431ns | 4549ns | 4490ns | 4511ns | 4549ns | 4549ns |
Add 32x32 (Loop Prealloc) | 2411ns | 2465ns | 2431ns | 2441ns | 2465ns | 2465ns |
Add 32x32 (unrolled) | 72200ns | 478500ns | 79994ns | 81900ns | 125100ns | 155100ns |
Add 32x32 (unrolled dynamic) | 72000ns | 1280700ns | 82065ns | 82600ns | 137600ns | 176500ns |
Add 32x32 (flat) | 1720ns | 2557ns | 1998ns | 2053ns | 2557ns | 2557ns |
Add 32x32 (flat col major) | 1772ns | 2037ns | 1827ns | 1845ns | 2037ns | 2037ns |
Add 32x32 (flat simple) | 1217ns | 1616ns | 1276ns | 1287ns | 1616ns | 1616ns |
Add 32x32 (flat unrolled) | 1200ns | 492700ns | 9236ns | 1900ns | 75800ns | 84200ns |
Add 32x32 (F64) | 2494ns | 2927ns | 2606ns | 2642ns | 2927ns | 2927ns |
Add 32x32 (F32) | 2499ns | 2695ns | 2593ns | 2630ns | 2695ns | 2695ns |
Add 32x32 (I32) | 1665ns | 1891ns | 1734ns | 1763ns | 1891ns | 1891ns |
Add 32x32 (F64 flat) | 2129ns | 2429ns | 2246ns | 2281ns | 2429ns | 2429ns |
Add 32x32 (F32 flat) | 1917ns | 2107ns | 1987ns | 2015ns | 2107ns | 2107ns |
Add 32x32 (I32 flat) | 1650ns | 1853ns | 1723ns | 1735ns | 1853ns | 1853ns |
Add 32x32 (flat func) | 3215ns | 3515ns | 3314ns | 3380ns | 3515ns | 3515ns |
Add 32x32 (WASM F64) | 2737ns | 3032ns | 2864ns | 2903ns | 3032ns | 3032ns |
Add 32x32 (WASM SIMD F64) | 2372ns | 3883ns | 2588ns | 2500ns | 3883ns | 3883ns |
Add 32x32 (WASM SIMD F32) | 1483ns | 1783ns | 1560ns | 1589ns | 1783ns | 1783ns |
Add 32x32 (WASM SIMD I32) | 1499ns | 1645ns | 1555ns | 1584ns | 1645ns | 1645ns |
Add 64x64 (Func) | 13200ns | 346600ns | 17719ns | 16700ns | 34600ns | 84700ns |
Add 64x64 (Loop) | 13300ns | 246700ns | 18280ns | 17300ns | 39600ns | 97300ns |
Add 64x64 (Loop Prealloc) | 8600ns | 337000ns | 9713ns | 9100ns | 21900ns | 27700ns |
Add 64x64 (unrolled) | 318700ns | 640900ns | 339704ns | 339600ns | 498500ns | 513100ns |
Add 64x64 (unrolled dynamic) | 316900ns | 688100ns | 341144ns | 340000ns | 504500ns | 519200ns |
Add 64x64 (flat) | 7324ns | 7669ns | 7461ns | 7476ns | 7669ns | 7669ns |
Add 64x64 (flat col major) | 8030ns | 9241ns | 8222ns | 8239ns | 9241ns | 9241ns |
Add 64x64 (flat simple) | 5529ns | 5990ns | 5648ns | 5706ns | 5990ns | 5990ns |
Add 64x64 (flat unrolled) | 241000ns | 645600ns | 266799ns | 266000ns | 437800ns | 446800ns |
Add 64x64 (F64) | 5500ns | 1690200ns | 9949ns | 9300ns | 22700ns | 27100ns |
Add 64x64 (F32) | 9501ns | 9758ns | 9609ns | 9634ns | 9758ns | 9758ns |
Add 64x64 (I32) | 6333ns | 8882ns | 6713ns | 6585ns | 8882ns | 8882ns |
Add 64x64 (F64 flat) | 5000ns | 3709000ns | 12755ns | 13900ns | 27900ns | 32800ns |
Add 64x64 (F32 flat) | 5500ns | 4133600ns | 8915ns | 7800ns | 27300ns | 33300ns |
Add 64x64 (I32 flat) | 6460ns | 8540ns | 6917ns | 6982ns | 8540ns | 8540ns |
Add 64x64 (flat func) | 9900ns | 254800ns | 13897ns | 13200ns | 27200ns | 69400ns |
Add 64x64 (WASM F64) | 7600ns | 272000ns | 11090ns | 10800ns | 19100ns | 27000ns |
Add 64x64 (WASM SIMD F64) | 6200ns | 2670300ns | 10132ns | 9400ns | 23100ns | 27900ns |
Add 64x64 (WASM SIMD F32) | 5154ns | 6094ns | 5360ns | 5378ns | 6094ns | 6094ns |
Add 64x64 (WASM SIMD I32) | 5188ns | 5973ns | 5396ns | 5433ns | 5973ns | 5973ns |
Add 128x128 (Func) | 45800ns | 271300ns | 60844ns | 61900ns | 185300ns | 195600ns |
Add 128x128 (Loop) | 55200ns | 390500ns | 72186ns | 74300ns | 205700ns | 213500ns |
Add 128x128 (Loop Prealloc) | 35100ns | 248300ns | 40496ns | 38600ns | 114200ns | 160200ns |
Add 128x128 (unrolled) | 1278700ns | 1828300ns | 1343788ns | 1344500ns | 1620500ns | 1723800ns |
Add 128x128 (unrolled dynamic) | 1274000ns | 1792100ns | 1336337ns | 1334900ns | 1586600ns | 1627400ns |
Add 128x128 (flat) | 67700ns | 1414500ns | 94394ns | 83300ns | 393000ns | 485100ns |
Add 128x128 (flat col major) | 119700ns | 4007600ns | 148592ns | 132900ns | 471900ns | 519400ns |
Add 128x128 (flat simple) | 59900ns | 508700ns | 93954ns | 105200ns | 272900ns | 299900ns |
Add 128x128 (flat unrolled) | 1202000ns | 1744700ns | 1291079ns | 1326300ns | 1670100ns | 1682700ns |
Add 128x128 (F64) | 26500ns | 3763100ns | 54327ns | 62100ns | 131800ns | 268700ns |
Add 128x128 (F32) | 27700ns | 3320900ns | 42542ns | 40700ns | 73400ns | 88700ns |
Add 128x128 (I32) | 21200ns | 1250800ns | 29148ns | 26900ns | 54100ns | 68200ns |
Add 128x128 (F64 flat) | 22300ns | 4049800ns | 40977ns | 37300ns | 104200ns | 244500ns |
Add 128x128 (F32 flat) | 24800ns | 1078300ns | 33077ns | 31000ns | 62400ns | 73600ns |
Add 128x128 (I32 flat) | 21000ns | 4687200ns | 29749ns | 26700ns | 60500ns | 69300ns |
Add 128x128 (flat func) | 312400ns | 2369600ns | 409509ns | 356500ns | 1593000ns | 1767300ns |
Add 128x128 (WASM F64) | 30100ns | 374300ns | 42977ns | 41900ns | 83300ns | 137000ns |
Add 128x128 (WASM SIMD F64) | 27600ns | 1592700ns | 43336ns | 42700ns | 83900ns | 115400ns |
Add 128x128 (WASM SIMD F32) | 18200ns | 2968600ns | 25829ns | 23300ns | 50000ns | 60700ns |
Add 128x128 (WASM SIMD I32) | 16300ns | 3989800ns | 25720ns | 24100ns | 54400ns | 66100ns |
Add 256x256 (Func) | 174800ns | 538100ns | 203033ns | 190800ns | 384200ns | 411800ns |
Add 256x256 (Loop) | 282800ns | 707400ns | 315612ns | 300600ns | 526700ns | 596700ns |
Add 256x256 (Loop Prealloc) | 147400ns | 440400ns | 163873ns | 159600ns | 314000ns | 324800ns |
Add 256x256 (unrolled) | 5352600ns | 6199900ns | 5632107ns | 5749700ns | 6199900ns | 6199900ns |
Add 256x256 (unrolled dynamic) | 1283700ns | 2213900ns | 1348989ns | 1344300ns | 1629500ns | 2088000ns |
Add 256x256 (flat) | 243200ns | 4196000ns | 393873ns | 402300ns | 1048000ns | 1139200ns |
Add 256x256 (flat col major) | 629600ns | 1093900ns | 703013ns | 713100ns | 951000ns | 1000000ns |
Add 256x256 (flat simple) | 318600ns | 882200ns | 390075ns | 394200ns | 604500ns | 636800ns |
Add 256x256 (flat unrolled) | 4992100ns | 6206500ns | 5352088ns | 5491100ns | 6097200ns | 6206500ns |
Add 256x256 (F64) | 72100ns | 957800ns | 156667ns | 151300ns | 544800ns | 561800ns |
Add 256x256 (F32) | 88900ns | 585400ns | 136578ns | 150000ns | 303100ns | 406400ns |
Add 256x256 (I32) | 65900ns | 468900ns | 101296ns | 100200ns | 241400ns | 376000ns |
Add 256x256 (F64 flat) | 69800ns | 724800ns | 148573ns | 132400ns | 540600ns | 554900ns |
Add 256x256 (F32 flat) | 80700ns | 676500ns | 116117ns | 114500ns | 270200ns | 388400ns |
Add 256x256 (I32 flat) | 67600ns | 3160600ns | 105838ns | 100600ns | 274300ns | 375900ns |
Add 256x256 (flat func) | 1254400ns | 5091400ns | 1749203ns | 1452600ns | 5024400ns | 5042400ns |
Add 256x256 (WASM F64) | 116900ns | 1158400ns | 201761ns | 213700ns | 570500ns | 655800ns |
Add 256x256 (WASM SIMD F64) | 90500ns | 715600ns | 173352ns | 157100ns | 551100ns | 559100ns |
Add 256x256 (WASM SIMD F32) | 60700ns | 651400ns | 93714ns | 92100ns | 255200ns | 359800ns |
Add 256x256 (WASM SIMD I32) | 60500ns | 977300ns | 91504ns | 90400ns | 264200ns | 356300ns |