Paper Title
A Fast, Scalable, Universal Approach For Distributed Data Aggregations
Authors
Abstract
In the current era of Big Data, data engineering has become an essential field of study across many branches of science. Advancements in Artificial Intelligence (AI) have broadened the scope of data engineering and opened up new applications in both enterprise and research communities. Aggregations (also termed reduce in functional programming) are integral to these applications. They have traditionally been aimed at generating meaningful information from large datasets, and today they are also used to engineer more effective features for complex AI models. Aggregations are usually carried out on top of data abstractions such as tables or arrays, and are combined with other operations such as grouping of values. There are frameworks that excel in each of these domains individually. However, we believe there is an essential need for a data analytics tool that can universally integrate with existing frameworks and thereby increase the productivity and efficiency of the entire data analytics pipeline. Cylon endeavors to fill this void. In this paper, we present Cylon's fast and scalable aggregation operations, implemented on top of a distributed in-memory table structure that universally integrates with existing frameworks.