matmul benchmark
Benchmarking fp32 matmul kernels on 4096x4096 matrices.
- "naive" does a simple loop reduction with WebGPU block size
- "shmem-tiling" is tiled reduction with var<workgroup> memory
- "unroll4" has each thread compute a 4x4 block of output
- "unroll4x2" has 4x4 blocks of output, with 2x loop unroll
- "unroll4x4" has 4x4 blocks of output, with 4x loop unroll
- "onnx" runs a MatMul node in onnxruntime-web
- "tfjs" runs tf.matMul()
- "jax-js" runs jax.numpy.dot()
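
As a rough illustration of what separates the "naive" and "unroll4" variants, here is a minimal CPU-side sketch in TypeScript. It is not the benchmark's WGSL kernel code: the function names (matmulNaive, matmulBlock4, gflops) are hypothetical, the matrices are assumed to be square and row-major, and the 2 * n^3 FLOP count is the conventional estimate for a dense matmul rather than something taken from this benchmark.

```typescript
// Hypothetical CPU reference, for illustration only.
// Matrices are square, row-major Float32Array of length n * n.

// "naive" style: one reduction loop per output element.
function matmulNaive(a: Float32Array, b: Float32Array, n: number): Float32Array {
  const c = new Float32Array(n * n);
  for (let i = 0; i < n; i++) {
    for (let j = 0; j < n; j++) {
      let acc = 0;
      for (let k = 0; k < n; k++) {
        // Round after every step to approximate fp32 accumulation.
        acc = Math.fround(acc + a[i * n + k] * b[k * n + j]);
      }
      c[i * n + j] = acc;
    }
  }
  return c;
}

// "unroll4" style: each (i0, j0) step produces a 4x4 block of output,
// so every loaded value of `a` is reused for 4 outputs.
// Assumes n is a multiple of 4.
function matmulBlock4(a: Float32Array, b: Float32Array, n: number): Float32Array {
  if (n % 4 !== 0) throw new Error("n must be a multiple of 4");
  const c = new Float32Array(n * n);
  const acc = new Float32Array(16); // 4x4 accumulators
  for (let i0 = 0; i0 < n; i0 += 4) {
    for (let j0 = 0; j0 < n; j0 += 4) {
      acc.fill(0);
      for (let k = 0; k < n; k++) {
        for (let di = 0; di < 4; di++) {
          const av = a[(i0 + di) * n + k];
          for (let dj = 0; dj < 4; dj++) {
            acc[di * 4 + dj] += av * b[k * n + j0 + dj];
          }
        }
      }
      for (let di = 0; di < 4; di++) {
        for (let dj = 0; dj < 4; dj++) {
          c[(i0 + di) * n + j0 + dj] = acc[di * 4 + dj];
        }
      }
    }
  }
  return c;
}

// Conventional throughput estimate: 2 * n^3 floating-point operations
// (one multiply and one add per inner-loop step).
function gflops(n: number, seconds: number): number {
  return (2 * n ** 3) / seconds / 1e9;
}
```

In a GPU kernel the two outer loops would map to threads rather than run sequentially, but the reuse pattern is the same: the blocked version loads each value of the left matrix once per four outputs instead of once per output.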