Paper Title
Knowledge-based Document Classification with Shannon Entropy
Paper Authors
Paper Abstract
Document classification is the detection of specific content of interest in text documents. In contrast to data-driven machine learning classifiers, knowledge-based classifiers can be constructed from domain-specific knowledge, which usually takes the form of a collection of subject-related keywords. While typical knowledge-based classifiers compute a prediction score based on keyword abundance, they generally suffer from noisy detections due to the lack of a guiding principle for gauging keyword matches. In this paper, we propose a novel knowledge-based model equipped with Shannon Entropy, which measures the richness of information and favors uniform and diverse keyword matches. Without invoking any positive samples, this method provides a simple and explainable solution for document classification. We show that Shannon Entropy significantly improves recall at a fixed false positive rate. We also show that the model is more robust against changes in data distribution at inference time compared with traditional machine learning models, particularly when positive training samples are very limited.
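To make the idea concrete, below is a minimal sketch of an entropy-based keyword score. It assumes the score is simply the Shannon entropy of the empirical distribution of keyword matches in a document, so that uniform and diverse matches score higher than repeated hits on a single keyword; the function name `keyword_entropy_score` and the tokenization details are illustrative and not taken from the paper.

```python
import math
import re
from collections import Counter

def keyword_entropy_score(text: str, keywords: list[str]) -> float:
    """Score a document by the Shannon entropy of its keyword-match distribution.

    Documents whose matches are spread uniformly across many distinct keywords
    score higher than documents dominated by repetitions of one keyword,
    reflecting the preference for uniform and diverse matches described above.
    """
    keyword_set = set(keywords)
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t in keyword_set)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Empirical probability of each matched keyword, then H = -sum(p * log2 p).
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Illustrative usage: diverse matches yield higher entropy than repeated ones.
subject_keywords = ["entropy", "classifier", "keyword", "document"]
print(keyword_entropy_score("document classifier keyword entropy", subject_keywords))  # 2.0 bits
print(keyword_entropy_score("entropy entropy entropy entropy", subject_keywords))      # 0.0 bits
```

In practice such an entropy term would presumably be combined with, or thresholded alongside, the usual abundance-based score; the exact combination used in the paper is not specified in this abstract.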