Paper Title

Data Engineering for Data Analytics: A Classification of the Issues, and Case Studies

Authors

Alfredo Nazabal, Christopher K. I. Williams, Giovanni Colavizza, Camila Rangel Smith, Angus Williams

Abstract

Consider the situation where a data analyst wishes to carry out an analysis on a given dataset. It is widely recognized that most of the analyst's time will be taken up with data engineering tasks such as acquiring, understanding, cleaning and preparing the data. In this paper we provide a description and classification of such tasks into high-level groups, namely data organization, data quality and feature engineering. We also make available four datasets and example analyses that exhibit a wide variety of these problems, to encourage the development of tools and techniques that reduce this burden and to push forward research towards the automation or semi-automation of the data engineering process.
