Distributed data analytics

Paper title

Distributed data analytics

Authors

Mortier, Richard; Haddadi, Hamed; Servia, Sandra; Wang, Liang

Abstract

Machine Learning (ML) techniques have begun to dominate data analytics applications and services. Recommendation systems are a key component of online service providers. The financial industry has adopted ML to harness large volumes of data in areas such as fraud detection, risk management, and compliance. Deep learning is the technology behind voice-based personal assistants, among other applications. Deployment of ML technologies onto cloud computing infrastructures has benefited numerous aspects of our daily life. The advertising and associated online industries in particular have fuelled a rapid rise in the deployment of personal data collection and analytics tools. Traditionally, behavioural analytics relies on collecting vast amounts of data in centralised cloud infrastructure before using it to train machine learning models that allow user behaviour and preferences to be inferred. A contrasting approach, distributed data analytics, where code and models for training and inference are distributed to the places where data is collected, has been boosted by two recent, ongoing developments: increased processing power and memory capacity available in user devices at the edge of the network, such as smartphones and home assistants; and increased sensitivity to the highly intrusive nature of many of these devices and services and the attendant demands for improved privacy. Indeed, the potential for increased privacy is not the only benefit of distributing data analytics to the edges of the network: reducing the movement of large volumes of data can also improve energy efficiency, helping to ameliorate the ever-increasing carbon footprint of our digital infrastructure, and can enable much lower latency for service interactions than is possible when services are cloud-hosted. These approaches often introduce challenging trade-offs among privacy, utility, and efficiency, while still having to ensure fruitful user engagement.
