Paper Title
Handling Imbalanced Datasets Through Optimum-Path Forest
Paper Authors
Abstract
Over the last decade, machine learning-based approaches have become capable of performing a wide range of complex tasks, sometimes better than humans and in a fraction of the time. This advance is partly due to the exponential growth in the amount of available data, which makes it possible to extract trustworthy real-world information. However, such data is generally imbalanced, since some phenomena are more likely than others. This imbalance considerably affects a machine learning model's performance, because the model becomes biased toward the more frequent data it receives. Among the many machine learning methods available, one graph-based approach, the Optimum-Path Forest (OPF), has attracted considerable attention owing to its outstanding performance across many applications. In this paper, we propose three OPF-based strategies to deal with the imbalance problem: $\text{O}^2$PF and OPF-US, which are novel approaches for oversampling and undersampling, respectively, and a hybrid strategy combining both. The paper also introduces a set of variants of these strategies. Comparisons against several state-of-the-art techniques over public and private datasets confirm the robustness of the proposed approaches.