Paper Title
Optimizing Block-Sparse Matrix Multiplications on CUDA with TVM
Authors
Abstract
We implemented and optimized matrix multiplication between dense and block-sparse matrices on CUDA. We used TVM, a deep learning compiler, to explore the schedule space of the operation and generate efficient CUDA code. With TVM's automatic parameter tuning, our implementation, which is based on cross-thread reduction, achieved performance competitive with or better than other state-of-the-art frameworks.
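To make the operation in the abstract concrete, the following is a minimal reference sketch of a dense-times-block-sparse product in plain Python. It assumes the sparse operand is stored in a BSR-style layout (a list of dense blocks plus block-column indices and block-row pointers); the abstract does not state the storage format, so that layout, the function name dense_times_bsr, and all parameter names are illustrative assumptions rather than the paper's implementation. It shows only the arithmetic, not the TVM schedule or the generated CUDA code.

```python
def dense_times_bsr(x, data, indices, indptr, block_shape, n_cols):
    """Return Y = X @ W, where W is block-sparse in a BSR-style layout.

    x           : dense M x K matrix as a list of rows
    data        : list of nonzero blocks, each a (br x bc) list of rows
    indices     : block-column index of each block in `data`
    indptr      : blocks of block-row i are data[indptr[i]:indptr[i + 1]]
    block_shape : (br, bc), the size of every block
    n_cols      : number of columns N of W
    (All names here are hypothetical, chosen for illustration.)
    """
    br, bc = block_shape
    m = len(x)
    y = [[0.0] * n_cols for _ in range(m)]
    for block_row in range(len(indptr) - 1):
        # Visit only the stored (nonzero) blocks of this block row.
        for p in range(indptr[block_row], indptr[block_row + 1]):
            block_col = indices[p]
            block = data[p]
            for i in range(m):
                for r in range(br):
                    x_val = x[i][block_row * br + r]
                    for c in range(bc):
                        y[i][block_col * bc + c] += x_val * block[r][c]
    return y


# W is 4 x 4 with 2 x 2 blocks; only the two diagonal blocks are stored:
# W = [[1, 2, 0, 0],
#      [3, 4, 0, 0],
#      [0, 0, 5, 6],
#      [0, 0, 7, 8]]
data = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
indices = [0, 1]
indptr = [0, 1, 2]

x = [[1, 0, 2, 0],
     [0, 1, 0, 1]]

y = dense_times_bsr(x, data, indices, indptr, (2, 2), 4)
print(y)  # [[1.0, 2.0, 10.0, 12.0], [3.0, 4.0, 7.0, 8.0]]
```

Each output element is a sum over the reduction axis restricted to the stored blocks. On a GPU, that sum is the part a cross-thread reduction would divide among cooperating threads before combining their partial results.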