ndesmic

Fast Matrix Math in JS 2: WASM


Last time we looked at a number of ways to add two matrices, starting from a naive implementation and applying simplifications until we were left with a flat loop over the data, which turned out to be the fastest, beating out even typed arrays. In some ways I regret choosing such a simple case; element-wise addition really just reduces to testing the speed of flat loops, and matrix multiplication might have been a better subject. Oh well, maybe next time.

There were some shortcomings last time: because of the sheer number of permutations I couldn't get to everything, so we'll start off by really looking at byte representations as a source of speed. I've also swapped the 100x100 test for a 128x128 one and added a final 256x256 to really stress the differences with "big" matrices.

This actually changes some strategies: things like the simple flat loop get a lot worse than preallocated loops, and typed arrays really pull ahead. My guess is that the engine doesn't optimize allocation of large plain JS arrays as aggressively.

Simple Typed Array Loops

I neglected to test typed arrays with flat loops last time, since we know we don't need the structured row-column loops we had before. There's not much to these:

export function addMatrixFloat64Flat(a, b) {
    const out = {
        shape: a.shape,
        data: new Float64Array(a.data.length)
    };
    //single flat loop over the backing buffer, no row/column indexing
    for (let i = 0; i < a.data.length; i++) {
        out.data[i] = a.data[i] + b.data[i];
    }
    return out;
}
size     F64 (avg ns)  F64 flat (avg ns)  F32 (avg ns)  F32 flat (avg ns)  I32 (avg ns)  I32 flat (avg ns)
1x1      354           365                349           348                351           353
2x2      356           339                362           343                348           343
4x4      381           366                370           367                357           352
8x8      477           672                479           436                424           409
16x16    869           777                879           738                675           663
32x32    2606          2246               2593          1987               1734          1723
64x64    9949          12755              9609          8915               6713          6917
128x128  54327         40977              42542         33077              29148         29749
256x256  156667        148573             136578        116117             101296        105838


As expected the flat versions are a tad faster, except strangely for I32, which at some sizes is slightly faster with the inner-loop bounds checks. Weird.

Fixing Tests for Float32 Arrays

One issue that came up, which I simply ignored last time, was that I couldn't write tests for Float32Arrays because of the precision differences between them and 64-bit floats. It turns out there's a function in the JavaScript Math library called Math.fround that rounds a number to 32-bit float precision. While generating data I can fround the arguments and fround the result to keep the expected values accurate for 32-bit floats.
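
To sketch the idea (this is illustrative, not the exact test harness from the repo; getFloat32AddTestCase is a made-up helper name):

function getFloat32AddTestCase(length) {
    const a = [];
    const b = [];
    const expected = [];
    for (let i = 0; i < length; i++) {
        //round the inputs so they're exactly representable as 32-bit floats
        a.push(Math.fround(Math.random()));
        b.push(Math.fround(Math.random()));
        //round the sum the way a 32-bit add would
        expected.push(Math.fround(a[i] + b[i]));
    }
    return { a, b, expected };
}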

WASM

Another way we can look to speed things up is WASM, which is often described as a math co-processor for the web. At least, people think that if you move your code to WASM you'll see a big speed-up. This is much more complicated than we might think: the overhead of calling into WASM is high because memory has to be copied between the host and the WASM module, so we might not see big gains, but hopefully the low internal overhead helps speed things up.

I built a module in pure WAT, the WebAssembly Text Format. Normally we wouldn't do this and would instead compile a language like Rust or C to WASM, but there's a lot of complexity to that compilation: weird stuff might make it into the module and change our results, and it's hard to control the optimization. To keep it as close to the metal as possible WAT is the best option, and thankfully what we're doing is not complicated and can reasonably be written in WAT.

;;mat.wat
(module
    (func $matrix_add_f64 (param $lhs_ptr i32) (param $rhs_ptr i32) (param $length i32) (param $out_ptr i32)
        (local $i i32)
        (local $offset i32)
        (local.set $i (i32.const 0))
        (block
            (loop               
                (local.set $offset (i32.mul (local.get $i) (i32.const 8)))

                (i32.add (local.get $out_ptr) (local.get $offset)) ;;out addr on stack
                (f64.add ;;result on stack
                    (f64.load (i32.add (local.get $lhs_ptr) (local.get $offset))) 
                    (f64.load (i32.add (local.get $rhs_ptr) (local.get $offset))))

                (f64.store)

                (local.set $i (i32.add (local.get $i) (i32.const 1)))
                (br_if 1 (i32.eq (local.get $i) (local.get $length)))
                (br 0)
            )
        )
    )
    (export "addMatrix" (func $matrix_add))
    (memory (export "memory") 1)
)

I'm not going to explain this much as it's out of scope, but it's basically the same as the flat simple version: it treats the data as a single array and iterates over it while adding. The parameters are a pointer to the left-hand array, a pointer to the right-hand array, the total element length, and a pointer to where to store the output. Technically we could have used just the length, since if everything starts at 0 we know exactly where each array lives, but to start I wanted to be explicit.

Compiling WAT

To actually compile this we can use a tool called WABT (the WebAssembly Binary Toolkit). It's basically a mess that requires CMake; I couldn't get it to run on WSL and I wasn't going to install MinGW. Instead there's a nice tool called WAPM from Wasmer, which works like npm for WebAssembly packages, and since WABT has been compiled down to WebAssembly it runs in any environment. In fact we don't even need any configuration so long as wapm is installed; wax is to wapm what npx is to npm:

wax wat2wasm -- wat/mat.wat -o wasm/mat.wasm

The commands available to wax are defined by the wasmer/wabt package: https://wapm.io/wasmer/wabt. Also, for some reason you can't prefix local paths with ./, so wax wat2wasm -- ./wat/mat.wat doesn't work, which took me a while to figure out. Anyway, this gives a nice simple compile environment if you want to work on raw WAT files.
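
Hooking the module up from JS looks something like this. This is just a sketch of the host side (I'm assuming Deno for reading the file, and this is the simple version before we deal with memory growth):

const wasm = await Deno.readFile("wasm/mat.wasm"); //assumes Deno; any byte source works
const { instance: matInstance } = await WebAssembly.instantiate(wasm);

export function addMatrixWasmF64(a, b) {
    const rhsElementOffset = a.data.length;
    const resultElementOffset = a.data.length + b.data.length;

    //copy both inputs into the module's linear memory
    const f64View = new Float64Array(matInstance.exports.memory.buffer);
    f64View.set(a.data, 0);
    f64View.set(b.data, rhsElementOffset);

    //pointers are byte offsets, one f64 element is 8 bytes
    matInstance.exports.addMatrix(0, rhsElementOffset * 8, a.data.length, resultElementOffset * 8);

    //copy the result back out of linear memory
    return {
        shape: a.shape,
        data: f64View.slice(resultElementOffset, resultElementOffset + a.data.length)
    };
}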


This will work up until we try a 128x128 matrix, at which point we get an out-of-bounds access error. The reason is that the memory we allocated can't fit all the data we need: one WASM page is 64KiB, and three 128x128 arrays of f64s take 128 × 128 × 8 × 3 = 393,216 bytes, or 6 pages. We can, however, grow the pages in response to larger inputs.

With page growing:

export function addMatrixWasmF64(a,b){
    const lhsElementOffset = 0;
    const rhsElementOffset = lhsElementOffset + a.data.length;
    const rhsByteOffset = rhsElementOffset * 8;
    const resultElementOffset = lhsElementOffset + a.data.length + b.data.length;
    const resultByteOffset = resultElementOffset * 8;
    const elementLength = a.data.length;

    //grow memory if needed
    const spaceLeftover = matInstance.exports.memory.buffer.byteLength - (a.data.length * 3 * 8)
    if (spaceLeftover < 0){
        const pagesNeeded = Math.ceil((a.data.length * 3 * 8) / (64 * 1024));
        const pagesHave = matInstance.exports.memory.buffer.byteLength / (64 * 1024);
        matInstance.exports.memory.grow(pagesNeeded - pagesHave);
    }

    const f64View = new Float64Array(matInstance.exports.memory.buffer);
    f64View.set(a.data, lhsElementOffset);
    f64View.set(b.data, rhsElementOffset);

    matInstance.exports.addMatrix(0, rhsByteOffset, elementLength, resultByteOffset);

    return {
        shape: a.shape,
        data: f64View.slice(resultElementOffset, resultElementOffset + elementLength)
    };
}

This might be how we'd do a production version so things don't explode, but for performance reasons I'm going to leave the growing out and make the memory static. We'll only go up to 256x256 matrices, so we need a total of 24 pages allocated.

(memory (export "memory") 24)
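
Where does 24 come from? Just the arithmetic, sketched in JS:

//3 arrays (lhs, rhs, result) of 256x256 f64 values, 64KiB per WASM page
const bytesNeeded = 256 * 256 * 8 * 3;                    //1,572,864 bytes
const pagesNeeded = Math.ceil(bytesNeeded / (64 * 1024)); //24 pages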


The benchmark results are a tad disappointing. As expected it starts out at the bottom of the pack, since loading data into the WASM module is expensive overhead. It slowly climbs the ladder, but even at 128x128 it's still slower than a simple preallocated loop; the lack of allocation makes a big difference. This means that unless more of the data can live on the WASM side, plain WASM is unlikely to provide much benefit. Still, WASM has one more trick.

SIMD

Unlike JavaScript, WASM has access to SIMD instructions, which can be used to parallelize the iteration, and these are pretty well supported. SIMD stands for Single Instruction Multiple Data. The idea is that it lets us load, add, etc. more than one value at a time, provided everything lines up correctly. As of the time of writing WASM only has a 128-bit SIMD type. The way it works is that we load a 128-bit value and choose at the operation level how to interpret it, which in our case will be as 2 64-bit floats. When we increment $i we now do so by 2.

It's actually not too hard to modify the existing code. Instead of stepping by 8 bytes we step by 16, loading 128-bit values instead of 64-bit ones. Then we use f64x2.add to add both pairs of floats at the same time and store the result back to memory as a 128-bit value.

(func $matrix_add_simd_f64 (param $lhs_ptr i32) (param $rhs_ptr i32) (param $length i32) (param $out_ptr i32)
    (local $i i32)
    (local $offset i32)
    (local.set $i (i32.const 0))
    (block
        (loop               
            (local.set $offset (i32.mul (local.get $i) (i32.const 8))) ;;moving in 128-bit blocks
            (i32.add (local.get $out_ptr) (local.get $offset)) ;;out addr on stack
            (f64x2.add 
                (v128.load (i32.add (local.get $lhs_ptr) (local.get $offset))) 
                (v128.load (i32.add (local.get $rhs_ptr) (local.get $offset)))) ;;result on stack
            (v128.store)
            (local.set $i (i32.add (local.get $i) (i32.const 2)))
            (br_if 1 (i32.ge_u (local.get $i) (local.get $length)))
            (br 0)
        )
    )
)

Another thing to think about is what happens if we don't have an even number of floats to add. We need one modification: the branch condition becomes greater-than-or-equal (i32.ge_u) rather than equal (i32.eq), to account for possible overrun by one. But let's also think about the memory bounds. If we have just one f64 left, the second lane will load garbage from beyond our bounds, but we never write (v128.store) in a way that bumps into the next segment (assuming the pointers don't overlap); only the final array at $out_ptr is ever written to. Since we only read back the exact number of elements we want, the odd value is simply ignored. We also can't read outside the bounds of allocated memory, because memory always comes in whole 64KiB pages.

I had to do a little debugging to get this right. A simple way is to import little functions for printing values:

const { instance: matInstance } = await WebAssembly.instantiate(wasm, {
    lib: {
        print_int: function(num){
            console.log(num);
        },
        print_float: function(num){
            console.log(num);
        },
        print_brk: function(){
            console.log("---")
        }
    }
});

(import "lib" "print_int" (func $print_int (param i32)))
(import "lib" "print_float" (func $print_float (param f32)))
(import "lib" "print_brk" (func $print_brk))

;;---stuff

(call $print_int (local.get $offset)) ;;print offset
(call $print_int (local.get $i)) ;;print $i
(call $print_brk) ;;add a visual barrier between element logs



We claw our way up the ladder a bit further, but we're still not beating a preallocated loop at 256x256. So, still not very usable.

SIMD F32 and I32

We can still mess with the precision. 32-bit values give us the advantage of adding 4 elements at a time, and we also know that integers generally work a bit faster.

(func $matrix_add_simd_f32 (param $lhs_ptr i32) (param $rhs_ptr i32) (param $length i32) (param $out_ptr i32)
    (local $i i32)
    (local $offset i32)
    (local.set $i (i32.const 0))
    (block
        (loop               
            (local.set $offset (i32.mul (local.get $i) (i32.const 4))) ;;4-byte elements, still moving in 128-bit blocks
            (i32.add (local.get $out_ptr) (local.get $offset)) ;;out addr on stack
            (f32x4.add 
                (v128.load (i32.add (local.get $lhs_ptr) (local.get $offset))) 
                (v128.load (i32.add (local.get $rhs_ptr) (local.get $offset)))) ;;result on stack
            (v128.store)
            (local.set $i (i32.add (local.get $i) (i32.const 4))) ;;4 elements per 128-bit block
            (br_if 1 (i32.ge_u (local.get $i) (local.get $length)))
            (br 0)
        )
    )
)
(func $matrix_add_simd_i32 (param $lhs_ptr i32) (param $rhs_ptr i32) (param $length i32) (param $out_ptr i32)
    (local $i i32)
    (local $offset i32)
    (local.set $i (i32.const 0))
    (block
        (loop               
            (local.set $offset (i32.mul (local.get $i) (i32.const 4))) ;;4-byte elements, still moving in 128-bit blocks
            (i32.add (local.get $out_ptr) (local.get $offset)) ;;out addr on stack
            (i32x4.add 
                (v128.load (i32.add (local.get $lhs_ptr) (local.get $offset))) 
                (v128.load (i32.add (local.get $rhs_ptr) (local.get $offset)))) ;;result on stack
            (v128.store)
            (local.set $i (i32.add (local.get $i) (i32.const 4))) ;;4 elements per 128-bit block
            (br_if 1 (i32.ge_u (local.get $i) (local.get $length)))
            (br 0)
        )
    )
)
export function addMatrixWasmSimdF32(a, b) {
    const lhsElementOffset = 0;
    const rhsElementOffset = lhsElementOffset + a.data.length;
    const rhsByteOffset = rhsElementOffset * 4;
    const resultElementOffset = lhsElementOffset + a.data.length + b.data.length;
    const resultByteOffset = resultElementOffset * 4;
    const elementLength = a.data.length;

    //grow memory if needed
    // const spaceLeftover = matInstance.exports.memory.buffer.byteLength - (a.data.length * 3 * 8)
    // if (spaceLeftover < 0){
    //  const pagesNeeded = Math.ceil((a.data.length * 3 * 8) / (64 * 1024));
    //  const pagesHave = matInstance.exports.memory.buffer.byteLength / (64 * 1024);
    //  matInstance.exports.memory.grow(pagesNeeded - pagesHave);
    // }

    const f32View = new Float32Array(matInstance.exports.memory.buffer);
    f32View.set(a.data, lhsElementOffset);
    f32View.set(b.data, rhsElementOffset);

    matInstance.exports.addMatrixSimdF32(0, rhsByteOffset, elementLength, resultByteOffset);

    return {
        shape: a.shape,
        data: f32View.slice(resultElementOffset, resultElementOffset + elementLength)
    };
}

export function addMatrixWasmSimdI32(a, b) {
    const lhsElementOffset = 0;
    const rhsElementOffset = lhsElementOffset + a.data.length;
    const rhsByteOffset = rhsElementOffset * 4;
    const resultElementOffset = lhsElementOffset + a.data.length + b.data.length;
    const resultByteOffset = resultElementOffset * 4;
    const elementLength = a.data.length;

    //grow memory if needed
    // const spaceLeftover = matInstance.exports.memory.buffer.byteLength - (a.data.length * 3 * 8)
    // if (spaceLeftover < 0){
    //  const pagesNeeded = Math.ceil((a.data.length * 3 * 8) / (64 * 1024));
    //  const pagesHave = matInstance.exports.memory.buffer.byteLength / (64 * 1024);
    //  matInstance.exports.memory.grow(pagesNeeded - pagesHave);
    // }

    const i32View = new Int32Array(matInstance.exports.memory.buffer);
    i32View.set(a.data, lhsElementOffset);
    i32View.set(b.data, rhsElementOffset);

    matInstance.exports.addMatrixSimdI32(0, rhsByteOffset, elementLength, resultByteOffset);

    return {
        shape: a.shape,
        data: i32View.slice(resultElementOffset, resultElementOffset + elementLength)
    };
}


These are just basic substitutions of the original code but with 4-byte elements; the WASM still moves in 128-bit chunks, it's just treating them as 4 values instead of 2. For arrays larger than 32x32, I32 and F32 are nearly equal and are our new speed kings! Still, it's not a huge difference versus an Int32Array in plain JavaScript (~+10%) for a fair bit of added complexity.

Conclusion

What we've observed this time is that WASM by itself does not yield great performance gains until it overcomes the overhead of the memory copy, which in our case is around the 64x64 mark. Below that, typed arrays can still beat it by not introducing that complexity; the compiler seems smart enough to optimize them well. However, once we bring SIMD into the picture WASM can overtake plain JS starting around 64x64, pulling away as sizes get bigger.

Code

https://github.com/ndesmic/fast-mat/tree/v1.1

Data

Name min max avg p75 p99 p995
Add 1x1 (Func) 61ns 198ns 70ns 66ns 168ns 171ns
Add 1x1 (Loop) 40ns 141ns 57ns 62ns 113ns 119ns
Add 1x1 (Loop Prealloc) 13ns 56ns 15ns 15ns 35ns 40ns
Add 1x1 (unrolled) 8ns 47ns 8ns 8ns 21ns 22ns
Add 1x1 (unrolled dynamic) 6ns 37ns 7ns 7ns 19ns 21ns
Add 1x1 (flat) 8ns 39ns 9ns 9ns 21ns 22ns
Add 1x1 (flat col major) 8ns 36ns 9ns 9ns 21ns 22ns
Add 1x1 (flat simple) 6ns 39ns 7ns 7ns 18ns 19ns
Add 1x1 (flat unrolled) 5ns 28ns 6ns 6ns 17ns 18ns
Add 1x1 (F64) 320ns 441ns 354ns 364ns 406ns 441ns
Add 1x1 (F32) 304ns 398ns 349ns 360ns 398ns 398ns
Add 1x1 (I32) 302ns 532ns 351ns 360ns 469ns 532ns
Add 1x1 (F64 flat) 309ns 696ns 365ns 369ns 555ns 696ns
Add 1x1 (F32 flat) 299ns 544ns 348ns 359ns 411ns 544ns
Add 1x1 (I32 flat) 305ns 601ns 353ns 364ns 585ns 601ns
Add 1x1 (flat func) 56ns 87ns 59ns 58ns 79ns 85ns
Add 1x1 (WASM F64) 463ns 818ns 500ns 507ns 546ns 818ns
Add 1x1 (WASM SIMD F64) 461ns 780ns 513ns 516ns 751ns 780ns
Add 1x1 (WASM SIMD F32) 444ns 555ns 491ns 503ns 551ns 555ns
Add 1x1 (WASM SIMD I32) 441ns 734ns 499ns 513ns 644ns 734ns
Add 2x2 (Func) 128ns 184ns 136ns 139ns 168ns 173ns
Add 2x2 (Loop) 59ns 168ns 72ns 76ns 115ns 128ns
Add 2x2 (Loop Prealloc) 23ns 70ns 26ns 25ns 44ns 48ns
Add 2x2 (unrolled) 10ns 34ns 11ns 11ns 23ns 24ns
Add 2x2 (unrolled dynamic) 10ns 37ns 11ns 11ns 23ns 25ns
Add 2x2 (flat) 13ns 43ns 14ns 14ns 26ns 27ns
Add 2x2 (flat col major) 12ns 39ns 13ns 13ns 26ns 27ns
Add 2x2 (flat simple) 8ns 49ns 9ns 9ns 21ns 23ns
Add 2x2 (flat unrolled) 7ns 41ns 8ns 7ns 19ns 21ns
Add 2x2 (F64) 300ns 457ns 356ns 374ns 425ns 457ns
Add 2x2 (F32) 300ns 570ns 362ns 377ns 568ns 570ns
Add 2x2 (I32) 296ns 412ns 348ns 367ns 404ns 412ns
Add 2x2 (F64 flat) 292ns 487ns 339ns 360ns 393ns 487ns
Add 2x2 (F32 flat) 295ns 420ns 343ns 366ns 414ns 420ns
Add 2x2 (I32 flat) 293ns 445ns 343ns 362ns 401ns 445ns
Add 2x2 (flat func) 64ns 113ns 67ns 66ns 91ns 95ns
Add 2x2 (WASM F64) 434ns 685ns 494ns 510ns 678ns 685ns
Add 2x2 (WASM SIMD F64) 427ns 551ns 480ns 500ns 538ns 551ns
Add 2x2 (WASM SIMD F32) 434ns 812ns 482ns 492ns 768ns 812ns
Add 2x2 (WASM SIMD I32) 433ns 672ns 475ns 487ns 622ns 672ns
Add 4x4 (Func) 280ns 341ns 293ns 299ns 329ns 341ns
Add 4x4 (Loop) 113ns 199ns 126ns 130ns 168ns 173ns
Add 4x4 (Loop Prealloc) 58ns 110ns 63ns 70ns 91ns 94ns
Add 4x4 (unrolled) 42ns 115ns 50ns 54ns 78ns 97ns
Add 4x4 (unrolled dynamic) 42ns 122ns 50ns 55ns 78ns 89ns
Add 4x4 (flat) 28ns 51ns 30ns 29ns 43ns 44ns
Add 4x4 (flat col major) 28ns 66ns 31ns 30ns 45ns 50ns
Add 4x4 (flat simple) 17ns 67ns 20ns 19ns 36ns 40ns
Add 4x4 (flat unrolled) 26ns 80ns 32ns 31ns 53ns 62ns
Add 4x4 (F64) 338ns 459ns 381ns 400ns 455ns 459ns
Add 4x4 (F32) 330ns 451ns 370ns 393ns 432ns 451ns
Add 4x4 (I32) 312ns 548ns 357ns 378ns 485ns 548ns
Add 4x4 (F64 flat) 329ns 560ns 366ns 379ns 441ns 560ns
Add 4x4 (F32 flat) 317ns 605ns 367ns 388ns 568ns 605ns
Add 4x4 (I32 flat) 314ns 550ns 352ns 372ns 421ns 550ns
Add 4x4 (flat func) 95ns 164ns 101ns 107ns 127ns 132ns
Add 4x4 (WASM F64) 478ns 610ns 510ns 518ns 586ns 610ns
Add 4x4 (WASM SIMD F64) 466ns 591ns 497ns 512ns 572ns 591ns
Add 4x4 (WASM SIMD F32) 451ns 548ns 481ns 494ns 539ns 548ns
Add 4x4 (WASM SIMD I32) 453ns 546ns 478ns 488ns 537ns 546ns
Add 8x8 (Func) 656ns 736ns 679ns 685ns 736ns 736ns
Add 8x8 (Loop) 326ns 372ns 334ns 333ns 366ns 372ns
Add 8x8 (Loop Prealloc) 180ns 259ns 194ns 197ns 247ns 254ns
Add 8x8 (unrolled) 106ns 479ns 125ns 127ns 175ns 185ns
Add 8x8 (unrolled dynamic) 105ns 480ns 125ns 127ns 190ns 207ns
Add 8x8 (flat) 86ns 165ns 94ns 100ns 128ns 149ns
Add 8x8 (flat col major) 90ns 150ns 98ns 105ns 127ns 134ns
Add 8x8 (flat simple) 61ns 124ns 69ns 75ns 97ns 104ns
Add 8x8 (flat unrolled) 65ns 155ns 79ns 84ns 131ns 136ns
Add 8x8 (F64) 414ns 725ns 477ns 495ns 705ns 725ns
Add 8x8 (F32) 434ns 725ns 479ns 497ns 584ns 725ns
Add 8x8 (I32) 379ns 583ns 424ns 441ns 547ns 583ns
Add 8x8 (F64 flat) 379ns 2317ns 672ns 591ns 2317ns 2317ns
Add 8x8 (F32 flat) 392ns 712ns 436ns 443ns 667ns 712ns
Add 8x8 (I32 flat) 377ns 544ns 409ns 419ns 475ns 544ns
Add 8x8 (flat func) 223ns 298ns 240ns 244ns 282ns 283ns
Add 8x8 (WASM F64) 554ns 2077ns 802ns 674ns 2077ns 2077ns
Add 8x8 (WASM SIMD F64) 527ns 2345ns 778ns 646ns 2345ns 2345ns
Add 8x8 (WASM SIMD F32) 499ns 875ns 544ns 550ns 694ns 875ns
Add 8x8 (WASM SIMD I32) 500ns 882ns 533ns 537ns 601ns 882ns
Add 16x16 (Func) 1703ns 1779ns 1743ns 1753ns 1779ns 1779ns
Add 16x16 (Loop) 1016ns 1099ns 1044ns 1052ns 1099ns 1099ns
Add 16x16 (Loop Prealloc) 616ns 689ns 651ns 660ns 689ns 689ns
Add 16x16 (unrolled) 400ns 258900ns 600ns 500ns 1500ns 9600ns
Add 16x16 (unrolled dynamic) 600ns 263200ns 793ns 700ns 9400ns 9900ns
Add 16x16 (flat) 366ns 609ns 386ns 391ns 432ns 609ns
Add 16x16 (flat col major) 380ns 489ns 403ns 410ns 482ns 489ns
Add 16x16 (flat simple) 260ns 348ns 281ns 285ns 339ns 348ns
Add 16x16 (flat unrolled) 326ns 4014ns 366ns 347ns 387ns 4014ns
Add 16x16 (F64) 773ns 1185ns 869ns 906ns 1185ns 1185ns
Add 16x16 (F32) 830ns 996ns 879ns 905ns 996ns 996ns
Add 16x16 (I32) 617ns 989ns 675ns 688ns 989ns 989ns
Add 16x16 (F64 flat) 676ns 1359ns 777ns 800ns 1359ns 1359ns
Add 16x16 (F32 flat) 673ns 1072ns 738ns 762ns 1072ns 1072ns
Add 16x16 (I32 flat) 617ns 932ns 663ns 685ns 932ns 932ns
Add 16x16 (flat func) 773ns 853ns 805ns 820ns 853ns 853ns
Add 16x16 (WASM F64) 867ns 1649ns 971ns 1002ns 1649ns 1649ns
Add 16x16 (WASM SIMD F64) 783ns 951ns 844ns 874ns 951ns 951ns
Add 16x16 (WASM SIMD F32) 654ns 1054ns 705ns 720ns 1054ns 1054ns
Add 16x16 (WASM SIMD I32) 658ns 792ns 701ns 723ns 792ns 792ns
Add 32x32 (Func) 4961ns 5128ns 5038ns 5052ns 5128ns 5128ns
Add 32x32 (Loop) 4431ns 4549ns 4490ns 4511ns 4549ns 4549ns
Add 32x32 (Loop Prealloc) 2411ns 2465ns 2431ns 2441ns 2465ns 2465ns
Add 32x32 (unrolled) 72200ns 478500ns 79994ns 81900ns 125100ns 155100ns
Add 32x32 (unrolled dynamic) 72000ns 1280700ns 82065ns 82600ns 137600ns 176500ns
Add 32x32 (flat) 1720ns 2557ns 1998ns 2053ns 2557ns 2557ns
Add 32x32 (flat col major) 1772ns 2037ns 1827ns 1845ns 2037ns 2037ns
Add 32x32 (flat simple) 1217ns 1616ns 1276ns 1287ns 1616ns 1616ns
Add 32x32 (flat unrolled) 1200ns 492700ns 9236ns 1900ns 75800ns 84200ns
Add 32x32 (F64) 2494ns 2927ns 2606ns 2642ns 2927ns 2927ns
Add 32x32 (F32) 2499ns 2695ns 2593ns 2630ns 2695ns 2695ns
Add 32x32 (I32) 1665ns 1891ns 1734ns 1763ns 1891ns 1891ns
Add 32x32 (F64 flat) 2129ns 2429ns 2246ns 2281ns 2429ns 2429ns
Add 32x32 (F32 flat) 1917ns 2107ns 1987ns 2015ns 2107ns 2107ns
Add 32x32 (I32 flat) 1650ns 1853ns 1723ns 1735ns 1853ns 1853ns
Add 32x32 (flat func) 3215ns 3515ns 3314ns 3380ns 3515ns 3515ns
Add 32x32 (WASM F64) 2737ns 3032ns 2864ns 2903ns 3032ns 3032ns
Add 32x32 (WASM SIMD F64) 2372ns 3883ns 2588ns 2500ns 3883ns 3883ns
Add 32x32 (WASM SIMD F32) 1483ns 1783ns 1560ns 1589ns 1783ns 1783ns
Add 32x32 (WASM SIMD I32) 1499ns 1645ns 1555ns 1584ns 1645ns 1645ns
Add 64x64 (Func) 13200ns 346600ns 17719ns 16700ns 34600ns 84700ns
Add 64x64 (Loop) 13300ns 246700ns 18280ns 17300ns 39600ns 97300ns
Add 64x64 (Loop Prealloc) 8600ns 337000ns 9713ns 9100ns 21900ns 27700ns
Add 64x64 (unrolled) 318700ns 640900ns 339704ns 339600ns 498500ns 513100ns
Add 64x64 (unrolled dynamic) 316900ns 688100ns 341144ns 340000ns 504500ns 519200ns
Add 64x64 (flat) 7324ns 7669ns 7461ns 7476ns 7669ns 7669ns
Add 64x64 (flat col major) 8030ns 9241ns 8222ns 8239ns 9241ns 9241ns
Add 64x64 (flat simple) 5529ns 5990ns 5648ns 5706ns 5990ns 5990ns
Add 64x64 (flat unrolled) 241000ns 645600ns 266799ns 266000ns 437800ns 446800ns
Add 64x64 (F64) 5500ns 1690200ns 9949ns 9300ns 22700ns 27100ns
Add 64x64 (F32) 9501ns 9758ns 9609ns 9634ns 9758ns 9758ns
Add 64x64 (I32) 6333ns 8882ns 6713ns 6585ns 8882ns 8882ns
Add 64x64 (F64 flat) 5000ns 3709000ns 12755ns 13900ns 27900ns 32800ns
Add 64x64 (F32 flat) 5500ns 4133600ns 8915ns 7800ns 27300ns 33300ns
Add 64x64 (I32 flat) 6460ns 8540ns 6917ns 6982ns 8540ns 8540ns
Add 64x64 (flat func) 9900ns 254800ns 13897ns 13200ns 27200ns 69400ns
Add 64x64 (WASM F64) 7600ns 272000ns 11090ns 10800ns 19100ns 27000ns
Add 64x64 (WASM SIMD F64) 6200ns 2670300ns 10132ns 9400ns 23100ns 27900ns
Add 64x64 (WASM SIMD F32) 5154ns 6094ns 5360ns 5378ns 6094ns 6094ns
Add 64x64 (WASM SIMD I32) 5188ns 5973ns 5396ns 5433ns 5973ns 5973ns
Add 128x128 (Func) 45800ns 271300ns 60844ns 61900ns 185300ns 195600ns
Add 128x128 (Loop) 55200ns 390500ns 72186ns 74300ns 205700ns 213500ns
Add 128x128 (Loop Prealloc) 35100ns 248300ns 40496ns 38600ns 114200ns 160200ns
Add 128x128 (unrolled) 1278700ns 1828300ns 1343788ns 1344500ns 1620500ns 1723800ns
Add 128x128 (unrolled dynamic) 1274000ns 1792100ns 1336337ns 1334900ns 1586600ns 1627400ns
Add 128x128 (flat) 67700ns 1414500ns 94394ns 83300ns 393000ns 485100ns
Add 128x128 (flat col major) 119700ns 4007600ns 148592ns 132900ns 471900ns 519400ns
Add 128x128 (flat simple) 59900ns 508700ns 93954ns 105200ns 272900ns 299900ns
Add 128x128 (flat unrolled) 1202000ns 1744700ns 1291079ns 1326300ns 1670100ns 1682700ns
Add 128x128 (F64) 26500ns 3763100ns 54327ns 62100ns 131800ns 268700ns
Add 128x128 (F32) 27700ns 3320900ns 42542ns 40700ns 73400ns 88700ns
Add 128x128 (I32) 21200ns 1250800ns 29148ns 26900ns 54100ns 68200ns
Add 128x128 (F64 flat) 22300ns 4049800ns 40977ns 37300ns 104200ns 244500ns
Add 128x128 (F32 flat) 24800ns 1078300ns 33077ns 31000ns 62400ns 73600ns
Add 128x128 (I32 flat) 21000ns 4687200ns 29749ns 26700ns 60500ns 69300ns
Add 128x128 (flat func) 312400ns 2369600ns 409509ns 356500ns 1593000ns 1767300ns
Add 128x128 (WASM F64) 30100ns 374300ns 42977ns 41900ns 83300ns 137000ns
Add 128x128 (WASM SIMD F64) 27600ns 1592700ns 43336ns 42700ns 83900ns 115400ns
Add 128x128 (WASM SIMD F32) 18200ns 2968600ns 25829ns 23300ns 50000ns 60700ns
Add 128x128 (WASM SIMD I32) 16300ns 3989800ns 25720ns 24100ns 54400ns 66100ns
Add 256x256 (Func) 174800ns 538100ns 203033ns 190800ns 384200ns 411800ns
Add 256x256 (Loop) 282800ns 707400ns 315612ns 300600ns 526700ns 596700ns
Add 256x256 (Loop Prealloc) 147400ns 440400ns 163873ns 159600ns 314000ns 324800ns
Add 256x256 (unrolled) 5352600ns 6199900ns 5632107ns 5749700ns 6199900ns 6199900ns
Add 256x256 (unrolled dynamic) 1283700ns 2213900ns 1348989ns 1344300ns 1629500ns 2088000ns
Add 256x256 (flat) 243200ns 4196000ns 393873ns 402300ns 1048000ns 1139200ns
Add 256x256 (flat col major) 629600ns 1093900ns 703013ns 713100ns 951000ns 1000000ns
Add 256x256 (flat simple) 318600ns 882200ns 390075ns 394200ns 604500ns 636800ns
Add 256x256 (flat unrolled) 4992100ns 6206500ns 5352088ns 5491100ns 6097200ns 6206500ns
Add 256x256 (F64) 72100ns 957800ns 156667ns 151300ns 544800ns 561800ns
Add 256x256 (F32) 88900ns 585400ns 136578ns 150000ns 303100ns 406400ns
Add 256x256 (I32) 65900ns 468900ns 101296ns 100200ns 241400ns 376000ns
Add 256x256 (F64 flat) 69800ns 724800ns 148573ns 132400ns 540600ns 554900ns
Add 256x256 (F32 flat) 80700ns 676500ns 116117ns 114500ns 270200ns 388400ns
Add 256x256 (I32 flat) 67600ns 3160600ns 105838ns 100600ns 274300ns 375900ns
Add 256x256 (flat func) 1254400ns 5091400ns 1749203ns 1452600ns 5024400ns 5042400ns
Add 256x256 (WASM F64) 116900ns 1158400ns 201761ns 213700ns 570500ns 655800ns
Add 256x256 (WASM SIMD F64) 90500ns 715600ns 173352ns 157100ns 551100ns 559100ns
Add 256x256 (WASM SIMD F32) 60700ns 651400ns 93714ns 92100ns 255200ns 359800ns
Add 256x256 (WASM SIMD I32) 60500ns 977300ns 91504ns 90400ns 264200ns 356300ns
