Paper Title
Efficient Induction of Language Models Via Probabilistic Concept Formation
Paper Authors
Paper Abstract
This paper presents a novel approach to the acquisition of language models from corpora. The framework builds on Cobweb, an early system for constructing taxonomic hierarchies of probabilistic concepts that used a tabular, attribute-value encoding of training cases and concepts, making it unsuitable for sequential input like language. In response, we explore three new extensions to Cobweb -- the Word, Leaf, and Path variants. These systems encode each training case as an anchor word and surrounding context words, and they store probabilistic descriptions of concepts as distributions over anchor and context information. As in the original Cobweb, a performance element sorts a new instance downward through the hierarchy and uses the final node to predict missing features. Learning is interleaved with performance, updating concept probabilities and hierarchy structure as classification occurs. Thus, the new approaches process training cases in an incremental, online manner that is very different from most methods for statistical language learning. We examine how well the three variants place synonyms together and keep homonyms apart, their ability to recall synonyms as a function of training set size, and their training efficiency. Finally, we discuss related work on incremental learning and directions for further research.
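To make the mechanism described in the abstract concrete, the following is a minimal Python sketch of an anchor-plus-context concept hierarchy with interleaved learning and prediction. It is an illustration only, assuming a fixed tree and a simple context-overlap scoring rule; the paper's actual Word, Leaf, and Path variants are based on Cobweb, which scores nodes with category utility and also restructures the hierarchy (creating, merging, and splitting concepts) as it learns.

```python
# Illustrative sketch, not the authors' implementation: concepts keep counts over
# anchor and context words, new instances are sorted down the hierarchy while the
# statistics along the path are updated, and the final node predicts a missing anchor.
from collections import Counter

class ConceptNode:
    def __init__(self):
        self.count = 0                   # number of instances covered by this concept
        self.anchor_counts = Counter()   # distribution over anchor words
        self.context_counts = Counter()  # distribution over surrounding context words
        self.children = []

    def update(self, anchor, context):
        """Incrementally fold one training case into this concept's statistics."""
        self.count += 1
        self.anchor_counts[anchor] += 1
        self.context_counts.update(context)

    def score(self, context):
        """Crude match score: probability mass this concept assigns to the context
        (a stand-in for Cobweb's category-utility-based choice)."""
        total = sum(self.context_counts.values()) or 1
        return sum(self.context_counts[w] for w in context) / total

    def categorize_and_learn(self, anchor, context):
        """Sort the instance downward, updating every concept along the path;
        learning is interleaved with classification, as in the abstract."""
        node = self
        while True:
            node.update(anchor, context)
            if not node.children:
                return node
            node = max(node.children, key=lambda c: c.score(context))

    def predict_anchor(self, context):
        """Classify by context alone and read off the most likely anchor word."""
        node = self
        while node.children:
            node = max(node.children, key=lambda c: c.score(context))
        if not node.anchor_counts:
            return None
        return node.anchor_counts.most_common(1)[0][0]
```

A training case here would be one word of a corpus (the anchor) paired with the words in a window around it (the context); prediction of a held-out anchor from its context is the kind of performance task the abstract's synonym and homonym evaluations rely on.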