Handling Imbalanced Datasets Through Optimum-Path Forest

Paper Title

Handling Imbalanced Datasets Through Optimum-Path Forest

Authors

Passos, Leandro Aparecido; Jodas, Danilo S.; Ribeiro, Luiz C. F.; Akio, Marco; de Souza, Andre Nunes; Papa, João Paulo

Abstract

In the last decade, machine learning-based approaches became capable of performing a wide range of complex tasks sometimes better than humans, demanding a fraction of the time. Such an advance is partially due to the exponential growth in the amount of data available, which makes it possible to extract trustworthy real-world information from them. However, such data is generally imbalanced since some phenomena are more likely than others. Such a behavior yields considerable influence on the machine learning model's performance since it becomes biased on the more frequent data it receives. Despite the considerable amount of machine learning methods, a graph-based approach has attracted considerable notoriety due to the outstanding performance over many applications, i.e., the Optimum-Path Forest (OPF). In this paper, we propose three OPF-based strategies to deal with the imbalance problem: the $\text{O}^2$PF and the OPF-US, which are novel approaches for oversampling and undersampling, respectively, as well as a hybrid strategy combining both approaches. The paper also introduces a set of variants concerning the strategies mentioned above. Results compared against several state-of-the-art techniques over public and private datasets confirm the robustness of the proposed approaches.
