Paper Title

Scalable Coherent Optical Crossbar Architecture using PCM for AI Acceleration

Authors

Daniel Sturm, Sajjad Moazeni

Abstract

Optical computing has recently been proposed as a new compute paradigm to meet the demands of future AI/ML workloads in datacenters and supercomputers. However, implementations proposed so far suffer from a lack of scalability, large footprints, high power consumption, and incomplete system-level architectures, preventing integration into existing datacenter architectures for real-world applications. In this work, we present a truly scalable optical AI accelerator based on a crossbar architecture. We consider all major roadblocks and address them in this design. Weights are stored on-chip using phase change material (PCM) that can be monolithically integrated in silicon photonic processes. All electro-optical components and circuit blocks are modeled based on measured performance metrics in a 45nm monolithic silicon photonic process, which can be co-packaged with advanced CPUs/GPUs and HBM memories. We also present a system-level modeling and analysis of our chip's performance for ResNet-50 v1.5, considering all critical parameters, including memory size, array size, photonic losses, and the energy consumption of peripheral electronics. Both on-chip SRAM and off-chip DRAM energy overheads are included in this modeling. We additionally show how a dual-core crossbar design can eliminate programming time overhead at practical SRAM block sizes and batch sizes. Our results show that the proposed 128 × 128 architecture can achieve inferences per second (IPS) similar to an Nvidia A100 GPU at 15.4× lower power and 7.24× lower area.
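The core operation of the crossbar described above is an analog matrix-vector multiplication: input light is fanned out across rows, attenuated at each crosspoint by a PCM cell's optical transmission, and summed along columns. The following is a minimal idealized sketch of that computation, not the paper's actual model; the function name and the assumption that each weight is a transmission value in [0, 1] are illustrative only, and real hardware would include photonic losses and ADC/DAC quantization, which the authors model separately.

```python
def crossbar_mvm(weights, inputs):
    """Idealized N x M optical crossbar computing y = W x.

    Each entry of `weights` stands in for a PCM cell's optical
    transmission (assumed in [0, 1]); each output is the coherent
    sum of the inputs modulated by the crosspoint transmissions.
    Losses and quantization are ignored in this sketch.
    """
    assert all(len(row) == len(inputs) for row in weights)
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]

# Example: a 4 x 4 crossbar with illustrative transmission values.
W = [[0.5,  0.0,  0.25, 1.0],
     [1.0,  1.0,  0.0,  0.0],
     [0.0,  0.5,  0.5,  0.0],
     [0.25, 0.25, 0.25, 0.25]]
x = [1.0, 2.0, 3.0, 4.0]
print(crossbar_mvm(W, x))  # -> [5.25, 3.0, 2.5, 2.5]
```

In the paper's dual-core variant, one such array computes while the other is reprogrammed with the next layer's weights, hiding the PCM programming time behind the compute time.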
