Paper Title

SimAlign: High Quality Word Alignments without Parallel Training Data using Static and Contextualized Embeddings

Authors

Masoud Jalili Sabet, Philipp Dufter, François Yvon, Hinrich Schütze

Abstract

Word alignments are useful for tasks like statistical and neural machine translation (NMT) and cross-lingual annotation projection. Statistical word aligners perform well, as do methods that extract alignments jointly with translations in NMT. However, most approaches require parallel training data, and quality decreases as less training data is available. We propose word alignment methods that require no parallel data. The key idea is to leverage multilingual word embeddings, both static and contextualized, for word alignment. Our multilingual embeddings are created from monolingual data only without relying on any parallel data or dictionaries. We find that alignments created from embeddings are superior for four and comparable for two language pairs compared to those produced by traditional statistical aligners, even with abundant parallel data; e.g., contextualized embeddings achieve a word alignment F1 for English-German that is 5 percentage points higher than eflomal, a high-quality statistical aligner, trained on 100k parallel sentences.
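
Since the abstract's key idea is to align words using similarity between multilingual contextualized embeddings, the sketch below illustrates that idea with a mutual-argmax heuristic (in the spirit of SimAlign's Argmax method): embed both sentences with a multilingual encoder, build a cosine similarity matrix over word vectors, and keep pairs that are each other's nearest neighbor. The encoder choice (bert-base-multilingual-cased), the subword mean-pooling, and all helper names are illustrative assumptions, not the authors' exact pipeline.

```python
# Minimal sketch: embedding-based word alignment without parallel data.
# Assumptions (not the paper's exact setup): mBERT as encoder, mean-pooling
# of subword vectors into word vectors, mutual-argmax extraction.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "bert-base-multilingual-cased"  # assumed multilingual encoder
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)
model.eval()

def embed_words(words):
    """Return one contextualized vector per word (mean over its subwords)."""
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (num_subwords, dim)
    word_ids = enc.word_ids(0)  # maps each subword position to a word index
    vecs = []
    for w in range(len(words)):
        idx = [i for i, wid in enumerate(word_ids) if wid == w]
        vecs.append(hidden[idx].mean(dim=0))
    return torch.stack(vecs)  # (num_words, dim)

def argmax_align(src_words, tgt_words):
    """Keep (i, j) pairs where i and j are each other's most similar word."""
    s, t = embed_words(src_words), embed_words(tgt_words)
    sim = torch.nn.functional.normalize(s, dim=-1) @ \
          torch.nn.functional.normalize(t, dim=-1).T  # cosine similarities
    fwd = sim.argmax(dim=1)  # best target word for each source word
    bwd = sim.argmax(dim=0)  # best source word for each target word
    return [(i, int(fwd[i])) for i in range(len(src_words))
            if int(bwd[int(fwd[i])]) == i]

# Example: English-German, no parallel training data involved.
print(argmax_align(["the", "house", "is", "small"],
                   ["das", "Haus", "ist", "klein"]))
```

Note that nothing here is trained on sentence pairs: all cross-lingual signal comes from the multilingual encoder's monolingual pretraining, which is what lets this approach work where statistical aligners would need parallel data.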
