Paper Title

Are word boundaries useful for unsupervised language learning?

Paper Authors

Nguyen, Tu Anh, de Seyssel, Maureen, Algayres, Robin, Roze, Patricia, Dunbar, Ewan, Dupoux, Emmanuel

Abstract

Word or word-fragment based Language Models (LMs) are typically preferred over character-based ones in many downstream applications. This may not be surprising, as words seem to be more linguistically relevant units than characters. Words provide at least two kinds of relevant information: boundary information and meaningful units. However, word boundary information may be absent or unreliable in the case of speech input (word boundaries are not marked explicitly in the speech stream). Here, we systematically compare LSTMs as a function of the input unit (character, phoneme, word, word part), with or without gold boundary information. We probe linguistic knowledge in the networks at the lexical, syntactic, and semantic levels using three speech-adapted, psycholinguistically-inspired black-box NLP benchmarks (pWUGGY, pBLIMP, pSIMI). We find that the absence of boundaries costs between 2% and 28% in relative performance, depending on the task. We show that gold boundaries can be replaced by automatically found ones obtained with an unsupervised segmentation algorithm, and that even modest segmentation performance yields a gain on two of the three tasks compared to basic character/phone-based models without boundary information.
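As a concrete illustration of the input conditions the abstract compares, here is a minimal Python sketch (not the authors' code): the same transcript presented with gold boundaries, with boundary information stripped, and with boundaries proposed by a segmenter. The `toy_unsupervised_segmenter` below is a hypothetical placeholder standing in for a real unsupervised segmentation algorithm, which is what the paper actually evaluates.

```python
# Minimal sketch (not the authors' code) of the three input conditions:
# gold word boundaries, no boundaries, and boundaries proposed by an
# unsupervised segmenter. The segmenter here is a hypothetical stand-in.

GOLD = "the dog barks"               # gold boundaries marked by spaces
NO_BOUNDARY = GOLD.replace(" ", "")  # boundary information removed

def toy_unsupervised_segmenter(stream: str, chunk: int = 3) -> str:
    """Insert a boundary every `chunk` symbols. This is only a placeholder
    showing where a real unsupervised segmenter's output would plug in."""
    return " ".join(stream[i:i + chunk] for i in range(0, len(stream), chunk))

print("gold:      ", list(GOLD))         # units with boundary tokens
print("no bounds: ", list(NO_BOUNDARY))  # pure character/phone stream
print("discovered:", list(toy_unsupervised_segmenter(NO_BOUNDARY)))
```

The LM in each condition then sees the resulting symbol sequence, so the only difference across conditions is whether (and how accurately) boundary tokens appear in the input.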
