Paper Title
CoMeFa: Compute-in-Memory Blocks for FPGAs
Paper Authors
Paper Abstract
Block RAMs (BRAMs) are the storehouses of FPGAs, providing extensive on-chip memory bandwidth to the compute units implemented using Logic Blocks (LBs) and Digital Signal Processing (DSP) slices. We propose modifying BRAMs to convert them to CoMeFa (Compute-In-Memory Blocks for FPGAs) RAMs. These RAMs provide highly parallel compute-in-memory by combining computation and storage capabilities in one block. CoMeFa RAMs utilize the true dual-port nature of FPGA BRAMs and contain multiple programmable single-bit bit-serial processing elements. CoMeFa RAMs can be used to compute in any precision, which is extremely important for evolving applications like Deep Learning. Adding CoMeFa RAMs to FPGAs significantly increases their compute density. We explore and propose two architectures of these RAMs: CoMeFa-D (optimized for delay) and CoMeFa-A (optimized for area). Compared to existing proposals, CoMeFa RAMs do not require changes to the underlying SRAM technology, such as simultaneously activating multiple rows on the same port, and are therefore practical to implement. CoMeFa RAMs are versatile blocks that find use in numerous diverse parallel applications like Deep Learning, signal processing, databases, etc. By augmenting an Intel Arria-10-like FPGA with CoMeFa-D (CoMeFa-A) RAMs at the cost of 3.8% (1.2%) area, and with algorithmic improvements and efficient mapping, we observe a geomean speedup of 2.5x (1.8x) across several representative benchmarks. Replacing all or some BRAMs with CoMeFa RAMs in FPGAs can make them better accelerators of modern compute-intensive workloads.
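The abstract describes single-bit, bit-serial processing elements that operate on data held in the RAM array and can compute at any precision. Below is a minimal conceptual sketch, not the paper's implementation, of how such bit-serial, compute-in-memory arithmetic works: operands are assumed to be stored in transposed form (bit i of every operand in row i), so reading one row from each port feeds every column's processing element in parallel. The names NUM_COLS and PRECISION are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Conceptual sketch of bit-serial compute-in-memory addition (assumed model,
# not the paper's RTL). One single-bit processing element per column; data is
# stored transposed so a single row read supplies one bit of every operand.

NUM_COLS = 160     # assumed number of columns / processing elements
PRECISION = 8      # operand width in bits; any precision works bit-serially

rng = np.random.default_rng(0)
a = rng.integers(0, 2**PRECISION, NUM_COLS)   # operand A, one value per column
b = rng.integers(0, 2**PRECISION, NUM_COLS)   # operand B, one value per column

# Transposed storage: row i holds bit i of every operand.
a_rows = np.array([(a >> i) & 1 for i in range(PRECISION)])
b_rows = np.array([(b >> i) & 1 for i in range(PRECISION)])

result = np.zeros(NUM_COLS, dtype=np.int64)
carry = np.zeros(NUM_COLS, dtype=np.int64)    # one carry register per PE

# One "cycle" per bit position: read a row of A and a row of B (one per port),
# compute sum and carry in every column simultaneously, write the sum bit back.
for i in range(PRECISION):
    ai, bi = a_rows[i], b_rows[i]
    s = ai ^ bi ^ carry                        # full-adder sum bit
    carry = (ai & bi) | (carry & (ai ^ bi))    # full-adder carry bit
    result |= s << i                           # sum bit written back to row i
result |= carry << PRECISION                   # final carry occupies one extra row

assert np.array_equal(result, a + b)           # all columns added in parallel
print(result[:8])
```

The sketch illustrates why precision is flexible in this style of computing: an N-bit add simply takes N bit-serial steps (plus one row for the final carry), while every column is processed in the same cycle, which is the source of the massive parallelism the abstract refers to.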