Improving State-of-the-Art in One-Class Classification by Leveraging Unlabeled Data

Paper title

Improving State-of-the-Art in One-Class Classification by Leveraging Unlabeled Data

Authors

Farid Bagirov, Dmitry Ivanov, Aleksei Shpilman

Abstract

When dealing with binary classification of data with only one labeled class, data scientists employ two main approaches: One-Class (OC) classification and Positive-Unlabeled (PU) learning. The former learns only from labeled positive data, whereas the latter also uses unlabeled data to improve overall performance. Since PU learning uses more data, one might assume that whenever unlabeled data is available, the go-to algorithms should come from the PU group. However, we find that this is not always the case if the unlabeled data is unreliable, i.e., contains limited or biased latent negative data. We perform an extensive experimental study of a wide range of state-of-the-art OC and PU algorithms across scenarios that differ in the reliability of the unlabeled data. Furthermore, we propose PU modifications of state-of-the-art OC algorithms that are robust to unreliable unlabeled data, as well as a guideline for similarly modifying other OC algorithms. Our main practical recommendation is to use state-of-the-art PU algorithms when the unlabeled data is reliable and to use the proposed modifications of state-of-the-art OC algorithms otherwise. Additionally, we outline procedures that use statistical tests to distinguish reliable from unreliable unlabeled data.
