Matvec benchmark

Benchmarks fp32 matrix-vector kernels on a 4096×4096 matrix. Each timed run repeats the kernel 20× to amortize dispatch and readback overhead.
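For scale, a 4096×4096 fp32 matrix occupies 64 MiB, and one matrix-vector product performs 2·4096² ≈ 33.5 million floating-point operations. The sketch below is a plain CPU reference for the operation being benchmarked, not one of the page's kernels. The names `matvec` and `repeatMatvec` and the row-major layout are assumptions made for illustration.

```typescript
// Reference matrix-vector product y = A·x for a row-major matrix A.
// Hypothetical helper: name, signature, and layout are assumptions.
function matvec(
  a: Float32Array,
  x: Float32Array,
  rows: number,
  cols: number,
): Float32Array {
  if (a.length !== rows * cols || x.length !== cols) {
    throw new Error("shape mismatch");
  }
  const y = new Float32Array(rows);
  for (let r = 0; r < rows; r++) {
    const base = r * cols;
    let acc = 0; // accumulated as a JS double, rounded to fp32 on store
    for (let c = 0; c < cols; c++) {
      acc += a[base + c] * x[c];
    }
    y[r] = acc;
  }
  return y;
}

// Repeat the kernel `repeats` times, the way one timed run would
// (repeats = 20 in the description above). Returns the last result.
function repeatMatvec(
  a: Float32Array,
  x: Float32Array,
  rows: number,
  cols: number,
  repeats: number,
): Float32Array {
  let y: Float32Array = new Float32Array(rows);
  for (let i = 0; i < repeats; i++) {
    y = matvec(a, x, rows, cols);
  }
  return y;
}
```

A reference like this is mainly useful for checking that a faster kernel produces the same output within fp32 tolerance.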

Strategies

Each strategy gets one untimed warmup run, followed by one measured run.
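The warmup-then-measure procedure can be sketched as follows. This is a minimal synchronous illustration, not the page's actual harness: `measure` and `Clock` are hypothetical names, and the clock is passed in so that a higher-resolution timer (for example `() => performance.now()` in a browser) can replace the `Date.now()` default.

```typescript
// A clock returns the current time in milliseconds.
type Clock = () => number;

// Run `run` once untimed as a warmup, then once more under the clock.
// Returns the elapsed time of the measured run in milliseconds.
// Hypothetical helper: name and signature are assumptions.
function measure(run: () => void, now: Clock = () => Date.now()): number {
  run(); // warmup: result and duration are discarded

  const start = now();
  run(); // measured run
  return now() - start;
}
```

The warmup absorbs one-time costs such as compilation and cache population, so the measured run reflects steady-state behavior. With a single measured run there is no variance estimate, so small differences between strategies should be read with caution.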

Results

Sorted fastest first.
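Fastest-first ordering amounts to an ascending sort on elapsed time. The sketch below assumes a hypothetical `Sample` record with a strategy name and a time in milliseconds; the page's real data shape is not shown here.

```typescript
// Hypothetical result record: field names are assumptions.
interface Sample {
  strategy: string;
  ms: number;
}

// Return a new array ordered by ascending elapsed time (fastest first).
// The input array is left unmodified.
function sortFastestFirst(samples: readonly Sample[]): Sample[] {
  return [...samples].sort((a, b) => a.ms - b.ms);
}
```

Copying before sorting keeps the original insertion order available, which helps if results are also shown in run order.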

No samples yet.

Run an individual strategy or start the full queue.