
ndesmic


Fast Matrix Math in JS 3: WebGL/WebGPU

Last time we looked at WASM and its interesting SIMD capability. One question we could ask: instead of adding 2-4 values at a time, what if there were a way to add many more at once? As it turns out, most computers, including phones and other devices, have dedicated hardware for this called a GPU. GPUs are typically associated with graphics; they were literally designed to perform hundreds to thousands of math operations in parallel in order to calculate pixel colors for rasterization. Think about the screen you are using right now. It's probably at least 1920x1080 pixels (1080p), which is many times larger than our matrices, and it does so over 3 color channels at least 60 times a second. Definitely seems like something we can take advantage of.

Housekeeping

I tried to consolidate all the generated output into the temp folder (not sure about the name, but at least it's all in one place). This way generated code isn't cluttering the project and is easy to delete. Still, there's no way to generate everything in the correct order, so if you're running the project you might need a little trial and error with the scripts.

Benchmarking on the web

Before, we were able to use Deno, which comes with a nice benchmarking tool. Unfortunately, support for GPUs in server-side runtimes is extremely lacking, depressingly so (hopefully some of my articles inspire you to want such things), so we're going to need to use browsers, and in the tradition of my posts I guess that means we need to roll our own benchmarking.

This isn't too hard though: we just need to run a function a certain number of times and get some stats from the runs. One thing to keep in mind is that because JavaScript compilation is tiered, we need to run the function a bunch of times first to "warm up". This makes sure we're measuring the fully optimized performance (depending on what you are measuring, cold starts are also a thing to consider, just not for our exercise). After the warm-up iterations we can go into the real iterations and take measurements like averages. Deno gave us a bit more than that, like the p90, p999, etc. We could do those, but to keep the topic somewhat narrow we're looking at averages. Deno also seemed to run some number of times until the measurement converged; we won't do anything fancy like that and will manually set the iterations.



export function sum(values) {
    return values.reduce((sum, v) => sum + v);
}
export function min(values) {
    return values.reduce((min, v) => Math.min(min, v));
}
export function max(values) {
    return values.reduce((max, v) => Math.max(max, v));
}
export function average(values) {
    return sum(values) / values.length;
}

export async function bench(name, options, benchFn){
    options.warmupIterations = options.warmupIterations ?? 500;
    options.iterations = options.iterations ?? 500;

    for(let i = 0; i < options.warmupIterations; i++){
        await benchFn();
    }

    const runs = [];

    for(let i = 0; i < options.iterations; i++){
        const start = performance.now();
        await benchFn();
        runs.push(performance.now() - start);
    }

    return { 
        name,
        group: options.group,
        results: [
            {
                ok : {
                    origin: window.location.href,
                    n: options.iterations,
                    min: min(runs),
                    max: max(runs),
                    avg: average(runs),
                    p75: NaN,
                    p99: NaN,
                    p995: NaN
                }
            }
        ]
    };
}



This is a simple benchmarking tool using performance.now.
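If you did want the p75/p99 values that Deno reports, a percentile helper over the collected runs is small. A sketch using the nearest-rank method (`percentile` is my name, not part of the repo):

```javascript
// nearest-rank percentile over the collected run times
function percentile(values, p) {
    const sorted = [...values].sort((a, b) => a - b);
    const rank = Math.ceil((p / 100) * sorted.length) - 1;
    return sorted[Math.max(0, rank)];
}

const runs = [5, 1, 4, 2, 3, 6, 8, 7, 10, 9];
console.log(percentile(runs, 75)); // 8
console.log(percentile(runs, 99)); // 10
```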

Test generation

Since we have to rewrite the tests (at least to some degree), I thought it would be better to generate them. All the cases add up to a lot of code, and it's all very formulaic. The few times I've had to modify tests I spent a long time copy-pasting, and it's time we grow out of that (it wasn't even that hard, I should have done this sooner...).



let testFile = "import {\n";

const sizes = [1,2,4,8,16,32,64,128,256];

//test data (nums)

for(const size of sizes){
    testFile += `\tmat${size}ANum,\n\tmat${size}BNum,\n`;
}

testFile += `} from "../data/mat-data-num.js"\n\n`


//strats

testFile += "import {\n";

const strategies = ["Loop"]; //etc

for(const strat of strategies){
    testFile += `\taddMatrix${strat},\n`
}

testFile += `} from "../mat.js"\n\n`

//libs

testFile += `import { bench } from "../web/bench.js"\n\n`;


//tests

for(const strat of strategies){
    for(const size of sizes){
        testFile += `await bench("Add ${size}x${size} (${strat})", { group: "${size}x${size}" }, () => {
    addMatrix${strat}(mat${size}ANum, mat${size}BNum);
});\n\n`
    }
}

testFile += `console.log("Complete!");\n`;

Deno.writeTextFileSync("./temp/tests.js", testFile);



This is a simple script that generates the tests for the web runner. We need to manually fill in the strategies with the ones that interest us. This file actually got really complicated over time due to all the little variations of things.
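To make the templates concrete, here's what one iteration of the generator's test loop appends to testFile (shown for a hypothetical 4x4/Loop case):

```javascript
// what one iteration of the generator's test loop appends to testFile
const size = 4;
const strat = "Loop";
const snippet = `await bench("Add ${size}x${size} (${strat})", { group: "${size}x${size}" }, () => {
    addMatrix${strat}(mat${size}ANum, mat${size}BNum);
});\n\n`;

console.log(snippet);
// await bench("Add 4x4 (Loop)", { group: "4x4" }, () => {
//     addMatrixLoop(mat4ANum, mat4BNum);
// });
```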

The test runner

Finally, it would be useful to make running the tests quick and easy from the command line without manually starting a server, navigating to the page, and then trying to copy-paste stuff out of the console.



import { typeByExtension } from "https://deno.land/std/media_types/mod.ts";
import { extname } from "https://deno.land/std/path/mod.ts";

function getChromeExecutablePath(){
    //windows
    return `C:/Program Files (x86)/Google/Chrome/Application/chrome.exe`;
}

function launchChrome(url){
    const command = new Deno.Command(getChromeExecutablePath(), {
        args: [
            url
        ]
    });
    command.outputSync();
}

const baseDir = ".";

Deno.serve(async req => {
    const url = new URL(req.url);
    let inputPath = url.pathname;

    if (inputPath.endsWith("/")) {
        inputPath += "index";
    } 


    if (!inputPath.includes(".")) {
        inputPath += ".html";
    }

    if (inputPath.includes("/web/index.html") && req.method === "POST"){
        const body = await req.json();
        await Deno.writeTextFile("./temp/web-bench.json", JSON.stringify(body, null, 4));
    }
    if(inputPath.includes("favicon.ico")) {
        inputPath = "/web/favicon.ico";
    }

    const serverPath = baseDir + inputPath;
    const ext = extname(serverPath);
    const file = await Deno.open(serverPath);

    return new Response(file.readable, {
        headers: {
            "Content-Type": typeByExtension(ext)
        }
    });
});

launchChrome("http://localhost:8000/web/");



This launches a very tiny server, basically an even simpler (and hackier) version of my dev server code. I also created a very simple Chrome launcher. Together they start a server, launch Chrome at the URL, and let it run. With the generated tests we can extract the data from the console. This version also includes a POST route so that the page script can post the bench data back and the server can write it to a file.
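On the page side, posting the results back is just a fetch to the same path the server checks. A sketch (`postResults` is my name; the results array is whatever the bench calls returned):

```javascript
// page-side helper: POST the collected bench results back to the dev server,
// which writes them to temp/web-bench.json
async function postResults(results) {
    return fetch("/web/index.html", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify(results)
    });
}
```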

It gets more complicated...

There are actually a lot of features I added for code generation and to automate things. Everything was made to run off a central JSON file with the list of strategies. Lots of little refactorings here and there, renames, file shuffling, more than I want to describe and certainly a lot of boring work. I think you could probably figure it out by reading the source code though.

Units and Resolution?

The Deno benchmark utility used nanoseconds as a unit; performance.now uses milliseconds, but as floating point values. However, we need to be careful: performance.now has security considerations (https://developer.mozilla.org/en-US/docs/Web/API/Performance/now#security_requirements) where it will only give us granularity down to 5 microseconds, and only in a cross-origin isolated context (browser implementations vary). To be sure we have the isolated context, we add some security headers to our response:



return new Response(file.readable, {
    headers: {
        "Content-Type": typeByExtension(ext),
        "Cross-Origin-Opener-Policy": "same-origin",
        "Cross-Origin-Embedder-Policy": "require-corp"
    }
});



Even still, this means tests that take less than 5000 nanoseconds (which includes many of the smaller ones) will not be representable. One way to get around this is to simply take more samples over a longer period and average them. So: more samples, and we're only getting averages, no more min and max either. There are much more clever ways to sample around this, but that's probably an entire post in itself.
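One hedged sketch of that idea: time a batch of calls per sample and divide, so each timed sample comfortably clears the 5 microsecond floor (`benchBatched` is my name, not part of the repo):

```javascript
// time batches of calls so each performance.now() sample is well above
// the 5µs timer resolution, then average down to per-call time
async function benchBatched(name, options, benchFn) {
    const batchSize = options.batchSize ?? 1000; // calls per timed sample
    const iterations = options.iterations ?? 100;
    const samples = [];
    for (let i = 0; i < iterations; i++) {
        const start = performance.now();
        for (let j = 0; j < batchSize; j++) {
            await benchFn();
        }
        // each sample is the average ms per call across the batch
        samples.push((performance.now() - start) / batchSize);
    }
    return { name, avg: samples.reduce((s, v) => s + v) / samples.length };
}
```

Note that min and max stop being meaningful here, since every sample is already an average.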

New Results

The results are roughly the same in relative terms, but because we can't get the same level of resolution, things vary a bit more. At least we know the data we've acquired so far seems to hold for Chrome as well. Links to CSVs because they're way too big now; just trust that the strategies are similar-ish, though some measurements are an order of magnitude different.

Web: https://github.com/ndesmic/fast-mat/blob/v3/snapshot/bench-by-strat.web.csv
Deno: https://github.com/ndesmic/fast-mat/blob/v3/snapshot/bench-by-start.deno.csv

WebGL

Okay, now on to the more fun stuff. One way we can try to improve performance is by using the GPU. The classic way to do this on the web is WebGL. Unfortunately, WebGL was designed in an era when GPUs really were Graphics Processing Units, so it's built to push pixels. Still, what is it really doing? It's taking pixel values (numbers) and applying small arithmetic programs to them to produce a finished canvas. So what we can do is pack our numbers into textures, do the arithmetic, and then read the results from the framebuffer.

The bad part

WebGL only supports a limited number of color formats. In fact, since we want floating point we need an extension, EXT_color_buffer_float, on the WebGL2 context (there's some other stuff for WebGL1 which I'm not going to talk about or implement). The combination of WebGL2 and this extension lets us have our float textures, but only at 32-bit precision.

The ugly part

WebGL leaves a lot to the browser implementation, so there's a lot of stuff that just doesn't work in non-obvious ways. The main one is: how exactly can you read floating point pixel values? If you tried (context.readPixels(0, 0, width, height, context.RED, context.FLOAT, buffer)) you would get an error about an invalid type, even with the extension. Browsers are only required to implement a single combination of format and type, RGBA with UNSIGNED_BYTE; you aren't guaranteed more. For the main backbuffer, that's all Chrome will ever give you. To make this work we need to render to a framebuffer we create; once it's bound, Chrome will allow another format, namely RED with FLOAT. It's not obvious at all. One helpful way to test what the browser will allow:



function glEnumToString(gl, value) {
    for (let key in gl) {
        if (gl[key] === value) {
            return key;
        }
    }
    return `0x${value.toString(16)}`;
}
const altFormat = glEnumToString(context, context.getParameter(context.IMPLEMENTATION_COLOR_READ_FORMAT));
const altType = glEnumToString(context, context.getParameter(context.IMPLEMENTATION_COLOR_READ_TYPE));
console.log(altFormat, altType);



https://webglfundamentals.org/webgl/lessons/webgl-qna-i-get-invalid-type-error-when-calling-readpixels.html

This value will change depending on the currently bound framebuffer's format, but if it doesn't report the format you want, it's not going to work.

Code

I'm not going to explain this too much as I've written plenty about it already. We set up a WebGL2 context and create a scene with a single quad so we can do pixel ops on top of it. For our input we create data textures with format RED and type FLOAT, so each pixel is represented by a single float-32 value. In the shader all we do is sample the pixel from each texture (using the UVs to line them up), add them, and write to the framebuffer. Then we use readPixels to read the result back into a Float32Array.



const canvas = document.createElement("canvas");
canvas.height = 256;
canvas.width = 256;
const context = canvas.getContext("webgl2");

context.getExtension("EXT_color_buffer_float");

function createDataTexture(context, data, textureIndex = 0, width = 32, height = 32) {
    context.activeTexture(context.TEXTURE0 + textureIndex);
    const texture = context.createTexture();
    context.bindTexture(context.TEXTURE_2D, texture);

    context.texParameteri(context.TEXTURE_2D, context.TEXTURE_WRAP_S, context.CLAMP_TO_EDGE);
    context.texParameteri(context.TEXTURE_2D, context.TEXTURE_WRAP_T, context.CLAMP_TO_EDGE);
    context.texParameteri(context.TEXTURE_2D, context.TEXTURE_MIN_FILTER, context.NEAREST);
    context.texParameteri(context.TEXTURE_2D, context.TEXTURE_MAG_FILTER, context.NEAREST);

    context.texImage2D(context.TEXTURE_2D, 0, context.R32F, width, height, 0, context.RED, context.FLOAT, data);
}

function compileProgram(context){
    const vertexShaderText = `#version 300 es
        precision highp float;
        in vec3 aPosition;
        in vec2 aUV;

        out vec2 uv;

        void main(){
            gl_Position = vec4(aPosition, 1.0);
            uv = aUV;
        }
    `;
    const vertexShader = context.createShader(context.VERTEX_SHADER);
    context.shaderSource(vertexShader, vertexShaderText);
    context.compileShader(vertexShader);

    const fragmentShaderText = `#version 300 es
        precision highp float;
        uniform sampler2D samplerA;
        uniform sampler2D samplerB;

        in vec2 uv;

        out float glColor;

        void main(){
            glColor = texture(samplerA, uv).r + texture(samplerB, uv).r;
        }
    `;
    const fragmentShader = context.createShader(context.FRAGMENT_SHADER);
    context.shaderSource(fragmentShader, fragmentShaderText);
    context.compileShader(fragmentShader);

    if (!context.getShaderParameter(vertexShader, context.COMPILE_STATUS)) {
        console.error(`⚠ Failed to compile vertex shader: ${context.getShaderInfoLog(vertexShader)}`);
    }
    if (!context.getShaderParameter(fragmentShader, context.COMPILE_STATUS)) {
        console.error(`⚠ Failed to compile fragment shader: ${context.getShaderInfoLog(fragmentShader)}`);
    }

    const program = context.createProgram();

    context.attachShader(program, vertexShader);
    context.attachShader(program, fragmentShader);

    context.linkProgram(program);
    context.useProgram(program);

    return program;
}

function createScene(context, program){
    const positions = new Float32Array([
        -1.0, -1.0,
        1.0, -1.0,
        1.0, 1.0,
        -1.0, 1.0
    ]);
    const positionBuffer = context.createBuffer();
    context.bindBuffer(context.ARRAY_BUFFER, positionBuffer);
    context.bufferData(context.ARRAY_BUFFER, positions, context.STATIC_DRAW);

    const positionLocation = context.getAttribLocation(program, "aPosition");
    context.enableVertexAttribArray(positionLocation);
    context.vertexAttribPointer(positionLocation, 2, context.FLOAT, false, 0, 0);

    const uvs = new Float32Array([
        0.0, 1.0,
        1.0, 1.0,
        1.0, 0.0,
        0.0, 0.0
    ]);
    const uvBuffer = context.createBuffer();
    context.bindBuffer(context.ARRAY_BUFFER, uvBuffer);
    context.bufferData(context.ARRAY_BUFFER, uvs, context.STATIC_DRAW);

    const texCoordLocation = context.getAttribLocation(program, "aUV");
    context.enableVertexAttribArray(texCoordLocation);
    context.vertexAttribPointer(texCoordLocation, 2, context.FLOAT, false, 0, 0);

    const indices = new Uint16Array([
        0, 1, 2,
        0, 2, 3
    ]);
    const indexBuffer = context.createBuffer();
    context.bindBuffer(context.ELEMENT_ARRAY_BUFFER, indexBuffer);
    context.bufferData(context.ELEMENT_ARRAY_BUFFER, indices, context.STATIC_DRAW);

    const samplerALocation = context.getUniformLocation(program, "samplerA");
    const samplerBLocation = context.getUniformLocation(program, "samplerB");
    context.uniform1i(samplerALocation, 0);
    context.uniform1i(samplerBLocation, 1);
}

function createFramebuffer(context, width, height){
    const framebufferTexture = context.createTexture();
    context.bindTexture(context.TEXTURE_2D, framebufferTexture);
    context.texImage2D(context.TEXTURE_2D, 0, context.R32F, width, height, 0, context.RED, context.FLOAT, null);

    const framebuffer = context.createFramebuffer();
    context.bindFramebuffer(context.FRAMEBUFFER, framebuffer);
    context.framebufferTexture2D(context.FRAMEBUFFER, context.COLOR_ATTACHMENT0, context.TEXTURE_2D, framebufferTexture, 0)
}

const program = compileProgram(context);
createScene(context, program);
createFramebuffer(context, canvas.width, canvas.height);

export function addMatrixWebGl(a, b){
    createDataTexture(context, a.data, 0, a.shape[0], a.shape[1]);
    createDataTexture(context, b.data, 1, b.shape[0], b.shape[1]);
    context.clear(context.COLOR_BUFFER_BIT | context.DEPTH_BUFFER_BIT);
    context.drawElements(context.TRIANGLES, 6, context.UNSIGNED_SHORT, 0);

    const result = new Float32Array(a.shape[0] * a.shape[1]);
    context.readPixels(0, 0, a.shape[0], a.shape[1], context.RED, context.FLOAT, result);

    return {
        shape: a.shape,
        data: result
    };
}




150 lines isn't small, but that's what it takes.
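One thing that helps while debugging the readPixels format issues is checking the GPU output against a trivial CPU reference (`addMatrixRef` is my helper name; the {shape, data} objects match the article's matrix format). If the GPU result disagrees, or comes back as zeros, the framebuffer/read format is the first suspect:

```javascript
// CPU reference implementation to validate the GPU result against
function addMatrixRef(a, b) {
    const data = new Float32Array(a.data.length);
    for (let i = 0; i < data.length; i++) {
        data[i] = a.data[i] + b.data[i];
    }
    return { shape: a.shape, data };
}

const a = { shape: [2, 2], data: new Float32Array([1, 2, 3, 4]) };
const b = { shape: [2, 2], data: new Float32Array([5, 6, 7, 8]) };
// compare element-wise against addMatrixWebGl(a, b).data
console.log(addMatrixRef(a, b).data); // Float32Array [6, 8, 10, 12]
```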

WebGPU

So WebGL is a bit hacky, but what about something tailor-made for number crunching? That's WebGPU. Unlike WebGL, we get access to compute shaders, which do pretty much the same thing we were doing without all the overhead of creating textures and representing things as pixels. We can run our math ops directly.

Code



const adapter = await navigator.gpu.requestAdapter();
const device = await adapter.requestDevice();

const module = device.createShaderModule({
    code: `
        @group(0) @binding(0)
        var<storage, read> inputA: array<f32>;

        @group(0) @binding(1)
        var<storage, read> inputB: array<f32>;

        @group(0) @binding(2)
        var<storage, read_write> output: array<f32>;

        @compute @workgroup_size(64)
        fn main(@builtin(global_invocation_id) global_id: vec3<u32>){
            output[global_id.x] = inputA[global_id.x] + inputB[global_id.x];
        }
    `
});

const bindGroupLayout = device.createBindGroupLayout({
    entries: [
        {
            binding: 0,
            visibility: GPUShaderStage.COMPUTE,
            buffer: {
                type: "read-only-storage"
            }
        },
        {
            binding: 1,
            visibility: GPUShaderStage.COMPUTE,
            buffer: {
                type: "read-only-storage"
            }
        },
        {
            binding: 2,
            visibility: GPUShaderStage.COMPUTE,
            buffer: {
                type: "storage"
            }
        }
    ]
});

const pipeline = device.createComputePipeline({
    layout: device.createPipelineLayout({
        bindGroupLayouts: [bindGroupLayout]
    }),
    compute: {
        module,
        entryPoint: "main"
    }
});


export async function addMatrixWebGpu(a, b){
    const bufferSize = a.shape[0] * a.shape[1] * 4;


    const inputA = device.createBuffer({
        size: bufferSize,
        usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST
    });

    const inputB = device.createBuffer({
        size: bufferSize,
        usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST
    });

    const output = device.createBuffer({
        size: bufferSize,
        usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC
    });

    const stagingBuffer = device.createBuffer({
        size: bufferSize,
        usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST
    });

    const bindGroup = device.createBindGroup({
        layout: bindGroupLayout,
        entries: [
            {
                binding: 0,
                resource: {
                    buffer: inputA
                }
            },
            {
                binding: 1,
                resource: {
                    buffer: inputB
                }
            },
            {
                binding: 2,
                resource: {
                    buffer: output
                }
            }
        ]
    });

    device.queue.writeBuffer(inputA, 0, a.data);
    device.queue.writeBuffer(inputB, 0, b.data);
    const commandEncoder = device.createCommandEncoder();
    const passEncoder = commandEncoder.beginComputePass();
    passEncoder.setPipeline(pipeline);
    passEncoder.setBindGroup(0, bindGroup);
    passEncoder.dispatchWorkgroups(Math.ceil((bufferSize / 4) / 64)); //bufferSize is in bytes; one workgroup of 64 per 64 f32 elements
    passEncoder.end();
    commandEncoder.copyBufferToBuffer(output, 0, stagingBuffer, 0, bufferSize);
    const commands = commandEncoder.finish();
    device.queue.submit([commands]);

    await stagingBuffer.mapAsync(GPUMapMode.READ, 0, bufferSize);
    const copyArrayBuffer = stagingBuffer.getMappedRange(0, bufferSize);
    const data = copyArrayBuffer.slice();
    stagingBuffer.unmap();

    return {
        shape: a.shape,
        data: new Float32Array(data)
    }
}



Nearly 10 lines shorter and less dense in general. Starting at the top, we get the adapter and request a device; this is basic WebGPU boilerplate. Next we create the shader code. Since it's just a compute shader, we only need one. The @group(x) @binding(y) attributes tell the compiler where each value comes from, in this case the bind groups, essentially the same as WebGL uniforms but a little more free-form. In the main function we add. We use the global invocation id to denote where we are; you can think of this as similar to the pixel coordinate in a fragment shader. For whatever reason it's 3-dimensional, but we're only using one dimension.

Also note the workgroup_size. This curious value determines the size of the workgroup; roughly translated, it's the number of parallel threads, but modeled as a 3-dimensional number, presumably because GPUs can do something smart with that layout. 64, or as it translates, 64x1x1, is a typical value. The global id derives from this: if you think of the workgroups as little 3D boxes, the global id is the global coordinate of the piece of work. Since we only gave the workgroups an X dimension, the global id's x value runs from 0 up to the array length. This also means that even for a 4x4 matrix with only 16 elements, the workgroup still launches all 64 invocations. Despite the fact that indices 16-63 don't exist, this still works because out-of-range accesses just clamp to the last value, meaning the last index gets (redundantly) added many extra times, but in parallel.

Moving on, we have the bindGroupLayout, which specifies the template (and locations) of the bind group data. We only read from the inputs, so they can be read-only storage, which the GPU might optimize. They are also only used in the compute stage (the only stage we have), so we tag that too.
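The dispatch arithmetic for the 1D layout, concretely (numbers here are just the 4x4 case described above):

```javascript
const elements = 4 * 4;   // a 4x4 matrix has 16 elements
const workgroupSize = 64; // matches @workgroup_size(64) in the shader
const workgroups = Math.ceil(elements / workgroupSize);
console.log(workgroups);  // 1 workgroup, which still launches 64 invocations
// invocations 16..63 fall past the end of the array and clamp to the last index
```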

Then we define the pipeline, which is what really sets WebGPU apart from WebGL. We can essentially customize the stages instead of a fixed vertex -> fragment flow. In this case, we just have a simple compute pipeline with a shader and bind groups as input/output. And finally we get to the function itself. We need to bind the arrays to GPU buffers, which we define with a size and some usage attributes to help the GPU optimize, then attach them to the bindGroup. The next few calls are what actually send commands to the GPU. The first two write the input data to the GPU buffers we just created (you can also use mappedAtCreation in the createBuffer call to do this). Then we create a command encoder, which encodes commands (binary instructions) for the GPU. Using the pass encoder, we set the pipeline and bindings and then actually run the shader with dispatchWorkgroups. Then we copy the output to a staging buffer, which is an odd thing that's required to access data from WebGPU; it's basically moving GPU buffer data to memory the CPU can access. Finally, we finish and submit the instructions.

At the end we can pull the data out of the staging buffer. But because it's linked to the GPU, that data will disappear when we unmap, so we need to make one last array buffer copy to make sure it's preserved on the JavaScript end. It's a complicated dance. Once we have that copy, we can return it.
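As an aside, the mappedAtCreation route mentioned above looks roughly like this (a sketch, not the repo's code; `createInputBuffer` is my name):

```javascript
// create the buffer pre-mapped and copy data in directly,
// instead of a separate queue.writeBuffer call
function createInputBuffer(device, data) {
    const buffer = device.createBuffer({
        size: data.byteLength,
        usage: GPUBufferUsage.STORAGE,
        mappedAtCreation: true
    });
    new Float32Array(buffer.getMappedRange()).set(data);
    buffer.unmap(); // hand the memory back to the GPU
    return buffer;
}
```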

Results

The results are disappointing. WebGPU takes much longer than the rest, way more than even WebGL, which already seemed inefficient. It's honestly not clear to me how to optimize it better, but there are a lot of buffer copies going on, so that could be a place to look.

Sizes 1x1 - 16x16 for WebGL and WebGPU

Sizes 32x32 - 256x256 for WebGL and WebGPU

Make it bigger

Being a bit skeptical that our GPU versions failed this spectacularly, I decided to ramp up the matrix sizes, hoping that at some point they'd have a real advantage; after all, GPUs are high latency but high throughput. To do this I constructed another test with 512- and 1024-sized matrices. In these tests I've excluded the unrolled versions because we know that would be an awful idea (and the many-thousand-line functions would slow the tests). In addition, I'm also throwing out the pre-computed matrices, instead just generating them on the fly, which speeds the benchmarks up a lot.

The scale-up also required some tweaks to our functions. For WASM we need to increase the page count again: 1024 * 1024 * 4 * 3 / 65,536 = 192 pages. For WebGL we need to increase the canvas size to 1024x1024. WebGPU is more difficult. Before, we were using 1-dimensional workgroups of size 64, and at 1024 we exceed the allowed work items per dimension. So instead we can reformulate the workgroups to 8x8: still 64 invocations in total, but spread along two dimensions so we're not exceeding the max in any one dimension. We can then dispatch these workgroups in 2 dimensions, which is actually more intuitive...kinda. The problem then is that indexing becomes tricky because the shader doesn't know the dimensions of the matrix. To remedy this we need to pass those in, which means turning our arrays into WGSL structs that look like our JS matrix objects. This complicates the data passing, since we need to encode the struct: we leave 8 bytes at the beginning of each buffer to pass the two 32-bit ints that represent the shape, meaning 2 more copies. Secondly, we run into issues of alignment. The minimum bytes we can pass for the struct is 16, and for a 1x1 matrix we'd only pass 12, so we need to make sure we're aligned to 16-byte chunks.



const dataSize = (a.shape[0] * a.shape[1] * 4);
const bufferSize = dataSize + 8; //adding dimensions to front
const alignedBufferSize = Math.ceil(bufferSize / 16) * 16; //must meet 16-byte alignment



The shader code is mostly the same, just with structs and the work spread over 2 dimensions:



struct Matrix {
    size: vec2<u32>,
    data: array<f32>
}
@group(0) @binding(0)
var<storage, read> inputA: Matrix;
@group(0) @binding(1)
var<storage, read> inputB: Matrix;
@group(0) @binding(2)
var<storage, read_write> output: array<f32>;
@compute @workgroup_size(8, 8)
fn main(
    @builtin(global_invocation_id) global_id: vec3<u32>
){
    if(global_id.x >= inputA.size.x || global_id.y >= inputA.size.y){
        return;
    }
    let idx = (inputA.size.x * global_id.y) + global_id.x;
    output[idx] = inputA.data[idx] + inputB.data[idx];
}



But we also have to be careful with the output buffers. We only want to read the number of bytes we need, not the encoded dimensions or the alignment padding. This took a bit of debugging for me to get right; if you start seeing chunks of zeros, it's probably because you aren't reading from the right place. Also, I didn't make the output a Matrix struct, mostly because I'd have to assign the shape on every run. Perhaps that's not actually a problem, but I instead opted to keep that in JS (a future bench test, I'm sure). See the final code link at the bottom for differences from the earlier implementations.
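The JS side of that encoding might look like this (`encodeMatrix` is my name, a sketch of the idea rather than the repo's exact code): pack the two u32 shape values into the first 8 bytes, then the f32 data, padded out to 16-byte alignment.

```javascript
// pack a {shape, data} matrix into an ArrayBuffer matching the WGSL struct:
//   size: vec2<u32> at offset 0, data: array<f32> at offset 8
function encodeMatrix(mat) {
    const dataSize = mat.shape[0] * mat.shape[1] * 4;
    const bufferSize = dataSize + 8;                           // 8 bytes for the shape
    const alignedBufferSize = Math.ceil(bufferSize / 16) * 16; // 16-byte alignment

    const buffer = new ArrayBuffer(alignedBufferSize);
    new Uint32Array(buffer, 0, 2).set(mat.shape);               // size: vec2<u32>
    new Float32Array(buffer, 8, mat.data.length).set(mat.data); // data: array<f32>
    return buffer;
}

const encoded = encodeMatrix({ shape: [2, 2], data: new Float32Array([1, 2, 3, 4]) });
console.log(encoded.byteLength); // 32 (16 data bytes + 8 header, padded to 16-byte multiple)
```

On readback, only the first shape[0] * shape[1] * 4 bytes of the output are the matrix; the rest is padding.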

After all of that we can get the "XL" sized results:


WebGL makes gains but WebGPU still sucks.

Conclusion

Most of the work here was trying to get a reasonable setup (I still don't have a good graphing setup, as I need to improve the labeling on mine, so I just copy-pasted into Google Docs). Making the two GPU versions was interesting because the APIs are pretty complicated and I had to learn a few new things. It was really disappointing that they don't do better, though, especially WebGPU. The fact that it was by far the loser really makes me think I did something wrong, but at this point I'm not really sure how to go about fixing it. All in all, for fast matrix math a preallocated loop is probably the best all around.

Code

https://github.com/ndesmic/fast-mat/tree/v3
