matmul benchmark

Benchmarking fp32 matmul kernels on 4096x4096 matrices.
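
For comparing timings at this size, it can help to convert elapsed time into throughput. The sketch below assumes the standard count of 2*n^3 floating-point operations for an n x n matmul (one multiply and one add per reduction step). The function names and the elapsedMs parameter are hypothetical and are not part of the benchmark itself.

```typescript
// Floating-point operations for an n x n by n x n matmul:
// each of the n*n outputs needs n multiplies and n adds.
function matmulFlops(n: number): number {
  return 2 * n * n * n;
}

// Convert a measured time into GFLOP/s.
// `elapsedMs` is a hypothetical measurement supplied by the caller.
function gflops(n: number, elapsedMs: number): number {
  const seconds = elapsedMs / 1000;
  return matmulFlops(n) / seconds / 1e9;
}

// Bytes needed to store one n x n fp32 matrix (4 bytes per element).
function matrixBytes(n: number): number {
  return n * n * 4;
}
```

At n = 4096 this is 2^37 (about 1.37e11) operations per matmul, and each fp32 matrix occupies 64 MiB.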

"naive" does a simple loop reduction, using the WebGPU workgroup (block) size
"shmem-tiling" is a tiled reduction using var<workgroup> (shared) memory
"unroll4" has each thread compute a 4x4 block of output
"unroll4x2" has each thread compute a 4x4 block of output, with 2x loop unrolling
"unroll4x4" has each thread compute a 4x4 block of output, with 4x loop unrolling
"onnx" runs a MatMul node in onnxruntime-web
"tfjs" runs tf.matMul()
"jax-js" runs jax.numpy.dot()
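
To illustrate the difference between the "naive" and "unroll4" strategies, here is a minimal CPU-side sketch in TypeScript. It is not the benchmark's WGSL code. It assumes square, row-major matrices stored in Float32Array, and the function names are hypothetical. In the blocked version, each (bi, bj) iteration stands in for one GPU thread that owns a 4x4 block of output.

```typescript
// Naive: one simple reduction loop per output element.
function matmulNaive(a: Float32Array, b: Float32Array, n: number): Float32Array {
  const c = new Float32Array(n * n);
  for (let i = 0; i < n; i++) {
    for (let j = 0; j < n; j++) {
      let acc = 0;
      for (let k = 0; k < n; k++) {
        acc += a[i * n + k] * b[k * n + j];
      }
      c[i * n + j] = acc;
    }
  }
  return c;
}

// Blocked: each (bi, bj) pair computes a 4x4 block of output, keeping
// 16 accumulators so every loaded value of A is reused 4 times.
function matmulBlocked4x4(a: Float32Array, b: Float32Array, n: number): Float32Array {
  if (n % 4 !== 0) throw new Error("n must be a multiple of 4");
  const c = new Float32Array(n * n);
  const acc = new Float32Array(16);
  for (let bi = 0; bi < n; bi += 4) {
    for (let bj = 0; bj < n; bj += 4) {
      acc.fill(0);
      for (let k = 0; k < n; k++) {
        for (let di = 0; di < 4; di++) {
          const aVal = a[(bi + di) * n + k];
          for (let dj = 0; dj < 4; dj++) {
            acc[di * 4 + dj] += aVal * b[k * n + bj + dj];
          }
        }
      }
      // Write the finished 4x4 block back to the output matrix.
      for (let di = 0; di < 4; di++) {
        for (let dj = 0; dj < 4; dj++) {
          c[(bi + di) * n + bj + dj] = acc[di * 4 + dj];
        }
      }
    }
  }
  return c;
}
```

The blocked version accumulates in fp32 while the naive version accumulates in a JavaScript number (fp64) before storing, so results on general inputs may differ by rounding. On small integer inputs the two agree exactly.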