论文标题

Exoshuffle:可扩展的洗牌建筑

Exoshuffle: An Extensible Shuffle Architecture

论文作者

Luan, Frank Sifei, Wang, Stephanie, Yagati, Samyukta, Kim, Sean, Lien, Kenneth, Ong, Isaac, Hong, Tony, Cho, SangBin, Liang, Eric, Stoica, Ion

论文摘要

Shuffle是分布式数据处理中最昂贵的沟通原始方法之一,并且难以扩展。先前的工作通过构建整体式洗牌系统来解决洗牌的可扩展性挑战。这些系统的开发成本很高,它们与仅提供高级API(例如SQL)的批处理处理框架紧密集成。新的应用程序(例如ML培训)需要更高的灵活性和随机潮流的粒度互操作性。他们通常无法利用现有的洗牌优化。 我们提出了可扩展的洗牌建筑。我们提出了Exoshuffle,这是一个用于分布式洗牌的库,比单片洗牌系统提供竞争性能和可扩展性以及更大的灵活性。我们设计了一个架构,该体系结构将散装控制平面与数据平面分解而不牺牲性能。 We build Exoshuffle on Ray, a distributed futures system for data and ML applications, and demonstrate that we can: (1) rewrite previous shuffle optimizations as application-level libraries with an order of magnitude less code, (2) achieve shuffle performance and scalability competitive with monolithic shuffle systems, and break the CloudSort record as the world's most cost-efficient sorting system, and (3) enable new applications such as ML training to轻松利用可扩展的混音。

Shuffle is one of the most expensive communication primitives in distributed data processing and is difficult to scale. Prior work addresses the scalability challenges of shuffle by building monolithic shuffle systems. These systems are costly to develop, and they are tightly integrated with batch processing frameworks that offer only high-level APIs such as SQL. New applications, such as ML training, require more flexibility and finer-grained interoperability with shuffle. They are often unable to leverage existing shuffle optimizations. We propose an extensible shuffle architecture. We present Exoshuffle, a library for distributed shuffle that offers competitive performance and scalability as well as greater flexibility than monolithic shuffle systems. We design an architecture that decouples the shuffle control plane from the data plane without sacrificing performance. We build Exoshuffle on Ray, a distributed futures system for data and ML applications, and demonstrate that we can: (1) rewrite previous shuffle optimizations as application-level libraries with an order of magnitude less code, (2) achieve shuffle performance and scalability competitive with monolithic shuffle systems, and break the CloudSort record as the world's most cost-efficient sorting system, and (3) enable new applications such as ML training to easily leverage scalable shuffle.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源