conv2d benchmark

This benchmark measures fp32 conv2d kernels on a 1x64x256x256 input with 128 filters of size 3x3.
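
To turn a measured kernel time into a throughput figure, it helps to know how much arithmetic this problem size involves. The sketch below is illustrative only and rests on assumptions that the text above does not state: the input is read as NCHW (batch 1, 64 channels, 256x256 spatial), stride is 1, there is no dilation or grouping, and padding is left as a parameter because it is not specified. The function names are hypothetical helpers, not part of any benchmarked library.

```python
def conv2d_output_size(size, kernel, stride=1, padding=0):
    """Spatial output extent of a convolution along one dimension (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


def conv2d_flops(n, c_in, h, w, c_out, k_h, k_w, stride=1, padding=0):
    """FLOPs for a direct, dense conv2d, counting one multiply-add as 2 FLOPs."""
    h_out = conv2d_output_size(h, k_h, stride, padding)
    w_out = conv2d_output_size(w, k_w, stride, padding)
    # Each output element needs c_in * k_h * k_w multiply-accumulates.
    macs = n * c_out * h_out * w_out * c_in * k_h * k_w
    return 2 * macs


def tensor_bytes(*dims, bytes_per_element=4):
    """Size in bytes of a dense tensor; 4 bytes per element for fp32."""
    total = bytes_per_element
    for d in dims:
        total *= d
    return total


def gflops_per_second(flops, seconds):
    """Convert a FLOP count and a measured wall time into GFLOP/s."""
    return flops / seconds / 1e9


# Problem size from the description above (NCHW layout is an assumption).
N, C_IN, H, W = 1, 64, 256, 256
C_OUT, K = 128, 3

# Padding is not stated in the source, so both common choices are shown.
for pad in (0, 1):
    flops = conv2d_flops(N, C_IN, H, W, C_OUT, K, K, stride=1, padding=pad)
    print(f"padding={pad}: {flops / 1e9:.3f} GFLOP per forward pass")

print("input bytes: ", tensor_bytes(N, C_IN, H, W))
print("weight bytes:", tensor_bytes(C_OUT, C_IN, K, K))
```

Given a measured time t in seconds for one forward pass, gflops_per_second(flops, t) gives the achieved throughput. This count applies to a direct convolution; kernels based on Winograd or FFT perform fewer actual multiplications, so their figure computed this way is an "effective" throughput rather than a hardware operation count.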