Title
The Document Vectors Using Cosine Similarity Revisited
Authors
Abstract
The current state-of-the-art test accuracy (97.42\%) on the IMDB movie reviews dataset was reported by \citet{thongtan-phienthrakul-2019-sentiment} and achieved by the logistic regression classifier trained on the Document Vectors using Cosine Similarity (DV-ngrams-cosine) proposed in their paper and the Bag-of-N-grams (BON) vectors scaled by Naive Bayesian weights. While large pre-trained Transformer-based models have shown SOTA results across many datasets and tasks, the aforementioned model has not been surpassed by them, despite being much simpler and pre-trained on the IMDB dataset only. In this paper, we describe an error in the evaluation procedure of this model, which was found when we were trying to analyze its excellent performance on the IMDB dataset. We further show that the previously reported test accuracy of 97.42\% is invalid and should be corrected to 93.68\%. We also analyze the model performance with different amounts of training data (subsets of the IMDB dataset) and compare it to the Transformer-based RoBERTa model. The results show that while RoBERTa has a clear advantage for larger training sets, the DV-ngrams-cosine performs better than RoBERTa when the labelled training set is very small (10 or 20 documents). Finally, we introduce a sub-sampling scheme based on Naive Bayesian weights for the training process of the DV-ngrams-cosine, which leads to faster training and better quality.
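The "Bag-of-N-grams vectors scaled by Naive Bayesian weights" referred to in the abstract are commonly implemented as log-count ratios, in the style of NBSVM (Wang & Manning, 2012). The sketch below illustrates that standard formulation; the exact weighting used by \citet{thongtan-phienthrakul-2019-sentiment} may differ in details (smoothing, normalization), so treat the function name and parameters as illustrative assumptions:

```python
import numpy as np

def naive_bayes_weights(X, y, alpha=1.0):
    """NBSVM-style log-count ratios for a binary-labelled
    bag-of-n-grams count matrix X (n_docs x n_features).

    Illustrative sketch only; the paper's exact formulation
    may differ (e.g. in smoothing or normalization).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Smoothed total counts of each n-gram in positive / negative documents
    p = alpha + X[y == 1].sum(axis=0)
    q = alpha + X[y == 0].sum(axis=0)
    # Log ratio of the normalized count distributions
    return np.log((p / p.sum()) / (q / q.sum()))

# Toy example: 4 documents, 3 n-gram features
X = np.array([[2, 0, 1],
              [1, 0, 0],
              [0, 2, 1],
              [0, 1, 0]])
y = np.array([1, 1, 0, 0])  # 1 = positive review, 0 = negative
r = naive_bayes_weights(X, y)
X_scaled = X * r  # BON vectors scaled by Naive Bayesian weights
```

N-grams that occur mostly in positive documents receive positive weights and those occurring mostly in negative documents receive negative weights, so the scaled BON vectors emphasize class-discriminative features before the logistic regression classifier is trained.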