Does Big Data Require Complex Systems? A Performance Comparison Between Spark and Unicage Shell Scripts

Paper Title

Does Big Data Require Complex Systems? A Performance Comparison Between Spark and Unicage Shell Scripts

Authors

Duarte M. Nascimento, Miguel Ferreira, Miguel L. Pardal

Abstract

The paradigm of big data is characterized by the need to collect and process data sets of great volume, arriving at the systems with great velocity, in a variety of formats. Spark is a widely used big data processing system that can be integrated with Hadoop to provide powerful abstractions to developers, such as distributed storage through HDFS and resource management through YARN. When all the required configurations are made, Spark can also provide quality attributes, such as scalability, fault tolerance, and security. However, all of these benefits come at the cost of complexity, with high memory requirements, and additional latency in processing. An alternative approach is to use a lean software stack, like Unicage, that delegates most control back to the developer. In this work we evaluated the performance of big data processing with Spark versus Unicage, in a cluster environment hosted in the IBM Cloud. Two sets of experiments were performed: batch processing of unstructured data sets, and query processing of structured data sets. The input data sets were of significant size, ranging from 64 GB to 8192 GB in volume. The results show that the performance of Unicage scripts is superior to Spark for search workloads like grep and select, but that the abstractions of distributed storage and resource management from the Hadoop stack enable Spark to execute workloads with inter-record dependencies, such as sort and join, with correct outputs.
