Paper Title
Iris: Automatic Generation of Efficient Data Layouts for High Bandwidth Utilization
Paper Authors
Paper Abstract
Optimizing data movement is becoming one of the biggest challenges in heterogeneous computing as systems cope with the data deluge and, consequently, big data applications. When creating specialized accelerators, modern high-level synthesis (HLS) tools are increasingly effective at optimizing computation, but data transfers have not improved to the same degree. To address this, novel architectures such as High-Bandwidth Memory with wider data buses have been developed so that more data can be transferred in parallel. Designers must tailor their hardware/software interfaces to fully exploit the available bandwidth. HLS tools can automate this process, but the designer must follow strict coding-style rules. If the bus width is not evenly divisible by the data width (e.g., when using custom-precision data types) or if the array lengths are not powers of two, the HLS-generated accelerator will likely not fully utilize the available bandwidth, demanding even more manual effort from the designer. We propose a methodology to automatically find and implement a data layout that, when streamed between memory and an accelerator, uses a higher percentage of the available bandwidth than a naive or HLS-optimized design. We borrow concepts from multiprocessor scheduling to achieve this efficiency.
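To make the divisibility problem concrete, the following minimal sketch estimates how much of a bus carries payload under three simple layouts. It is not the Iris algorithm, which the abstract describes only as borrowing from multiprocessor scheduling. The function names, the layout labels (`padded`, `aligned`, `dense`), and the example widths (24-bit elements on a 256-bit bus) are illustrative assumptions, not values from the paper.

```python
from math import ceil


def padded_width(data_bits: int) -> int:
    """Round an element width up to the next power of two (assumed padding rule)."""
    width = 1
    while width < data_bits:
        width *= 2
    return width


def utilization(n_elems: int, data_bits: int, bus_bits: int, layout: str) -> float:
    """Fraction of transferred bits that are payload for a hypothetical layout.

    layout = "padded":  each element padded to a power-of-two width
    layout = "aligned": elements packed back to back, never crossing a bus word
    layout = "dense":   elements packed back to back across bus-word boundaries
    """
    payload = n_elems * data_bits
    if layout == "dense":
        words = ceil(payload / bus_bits)
    elif layout in ("padded", "aligned"):
        slot = padded_width(data_bits) if layout == "padded" else data_bits
        per_word = bus_bits // slot
        if per_word == 0:
            raise ValueError("element slot is wider than the bus")
        words = ceil(n_elems / per_word)
    else:
        raise ValueError(f"unknown layout: {layout}")
    return payload / (words * bus_bits)


if __name__ == "__main__":
    # Assumed example: 1000 elements of 24 bits on a 256-bit bus.
    for name in ("padded", "aligned", "dense"):
        print(name, round(utilization(1000, 24, 256, name), 4))
```

With these assumed numbers, padding each element to 32 bits leaves a quarter of every bus word unused, while packing without crossing word boundaries still wastes 16 bits per word. Dense packing wastes bits only in the final, partly filled word, but it requires the accelerator to reassemble elements that straddle two words, which is the kind of manual effort the abstract says an automated layout search aims to remove.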