Paper Title
Unnatural Language Processing: Bridging the Gap Between Synthetic and Natural Language Data
Paper Authors
Paper Abstract
Large, human-annotated datasets are central to the development of natural language processing models. Collecting these datasets can be the most challenging part of the development process. We address this problem by introducing a general purpose technique for "simulation-to-real" transfer in language understanding problems with a delimited set of target behaviors, making it possible to develop models that can interpret natural utterances without natural training data. We begin with a synthetic data generation procedure, and train a model that can accurately interpret utterances produced by the data generator. To generalize to natural utterances, we automatically find projections of natural language utterances onto the support of the synthetic language, using learned sentence embeddings to define a distance metric. With only synthetic training data, our approach matches or outperforms state-of-the-art models trained on natural language data in several domains. These results suggest that simulation-to-real transfer is a practical framework for developing NLP applications, and that improved models for transfer might provide wide-ranging improvements in downstream tasks.
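To make the projection step concrete, here is a minimal sketch that assumes the projection onto the support of the synthetic language is implemented as nearest-neighbour retrieval in sentence-embedding space; the abstract only states that learned sentence embeddings define the distance metric, so the `embed` stub, the `project_to_synthetic` helper, and the example utterances below are hypothetical illustrations rather than the authors' code.

```python
import numpy as np


def embed(sentences):
    # Placeholder sentence encoder: list of strings -> (n, d) array.
    # The paper relies on learned sentence embeddings; any pretrained
    # encoder (e.g. a sentence-transformer model) could stand in here.
    rng = np.random.default_rng(0)
    return rng.standard_normal((len(sentences), 16))


def project_to_synthetic(natural_utterance, synthetic_utterances, embed_fn=embed):
    """Approximate the projection onto the synthetic language's support
    as nearest-neighbour retrieval under cosine similarity."""
    vectors = embed_fn([natural_utterance] + list(synthetic_utterances))
    query, support = vectors[0], vectors[1:]
    sims = support @ query / (
        np.linalg.norm(support, axis=1) * np.linalg.norm(query) + 1e-12
    )
    return synthetic_utterances[int(np.argmax(sims))]


if __name__ == "__main__":
    # Hypothetical support set produced by the synthetic data generator.
    synthetic = [
        "turn on the kitchen light",
        "set an alarm for seven am",
        "play jazz music",
    ]
    # The model trained on synthetic data would then be run on the
    # retrieved synthetic utterance rather than the natural one.
    print(project_to_synthetic("could you switch the kitchen lights on please", synthetic))
```

Under this reading, generalization to natural language comes from the quality of the embedding space: the interpretation model itself only ever sees utterances that lie on the support of the synthetic data generator.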