Paper Title

Leveraging Organizational Resources to Adapt Models to New Data Modalities

Authors

Sahaana Suri, Raghuveer Chanda, Neslihan Bulut, Pradyumna Narayana, Yemao Zeng, Peter Bailis, Sugato Basu, Girija Narlikar, Christopher Ré, Abishek Sethi

Abstract

As applications in large organizations evolve, the machine learning (ML) models that power them must adapt the same predictive tasks to newly arising data modalities (e.g., a new video content launch in a social media application requires existing text or image models to extend to video). To solve this problem, organizations typically create ML pipelines from scratch. However, this fails to utilize the domain expertise and data they have cultivated from developing tasks for existing modalities. We demonstrate how organizational resources, in the form of aggregate statistics, knowledge bases, and existing services that operate over related tasks, enable teams to construct a common feature space that connects new and existing data modalities. This allows teams to apply methods for training data curation (e.g., weak supervision and label propagation) and model training (e.g., forms of multi-modal learning) across these different data modalities. We study how this use of organizational resources composes at production scale in over 5 classification tasks at Google, and demonstrate how it reduces the time needed to develop models for new modalities from months to weeks to days.
