A Data Source Dependency Analysis Framework for Large Scale Data Science Projects

Paper title

A Data Source Dependency Analysis Framework for Large Scale Data Science Projects

Authors

Laurent Boué, Pratap Kunireddy, Pavle Subotić

Abstract

Dependency hell is a well-known pain point in the development of large software projects, and machine learning (ML) code bases are not immune to it. In fact, ML applications suffer from an additional form, namely "data source dependency hell". This term refers to the central role played by data and its unique quirks, which often lead to unexpected failures of ML models that cannot be explained by code changes. In this paper, we present an automated dependency mapping framework that allows MLOps engineers to monitor the whole dependency map of their models in a fast-paced engineering environment and thus mitigate ahead of time the consequences of any data source changes (e.g., re-train the model, ignore data, set default data, etc.). Our system is based on a unified and generic approach that employs static analysis techniques to reliably identify data sources for any type of dependency across a wide range of source languages and artefacts. The dependency mapping framework is exposed as a REST web API whose only input is the path to the Git repository hosting the code base. The framework is currently used by MLOps engineers at Microsoft, and we expect such dependency map APIs to be adopted more widely by MLOps engineers in the future.
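
To make the idea of finding data sources by static analysis concrete, the following is a minimal sketch and not the authors' system. It assumes a Python code base and treats a small, hypothetical set of reader function names (`READER_NAMES`) as data source accesses. It only recovers sources written as string literals; the framework described in the abstract covers many languages and artefacts, which this sketch does not attempt.

```python
import ast

# Hypothetical set of function names treated as data source accesses.
# This list is an assumption for illustration, not taken from the paper.
READER_NAMES = {"read_csv", "read_parquet", "read_json", "open"}


def find_data_sources(source: str) -> list[tuple[str, str, int]]:
    """Return (reader name, literal source, line number) for every reader
    call whose first positional argument is a string literal."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Handle both `open(...)` and `pd.read_csv(...)` call shapes.
        if isinstance(func, ast.Name):
            name = func.id
        elif isinstance(func, ast.Attribute):
            name = func.attr
        else:
            continue
        if name not in READER_NAMES or not node.args:
            continue
        first = node.args[0]
        # Sources held in variables are skipped: resolving them would
        # need data-flow analysis, which is beyond this sketch.
        if isinstance(first, ast.Constant) and isinstance(first.value, str):
            found.append((name, first.value, node.lineno))
    # ast.walk is breadth-first, so sort by line for a stable order.
    return sorted(found, key=lambda item: item[2])


if __name__ == "__main__":
    example = (
        "import pandas as pd\n"
        "train = pd.read_csv('data/train.csv')\n"
        "labels = pd.read_parquet('s3://bucket/labels.parquet')\n"
        "with open(config_path) as fh:\n"
        "    pass\n"
    )
    for reader, src, line in find_data_sources(example):
        print(f"line {line}: {reader} -> {src}")
```

Because the code is only parsed and never executed, the scan works on a repository checkout without installing its dependencies, which is one reason static analysis suits a service that takes nothing but a repository path as input.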
