Paper Title
Active Learning for Skewed Data Sets
Authors
Abstract
Consider a sequential active learning problem where, at each round, an agent selects a batch of unlabeled data points, queries their labels, and updates a binary classifier. While there exists a rich body of work on active learning in this general form, in this paper we focus on problems with two distinguishing characteristics: severe class imbalance (skew) and small amounts of initial training data. Both of these problems occur with surprising frequency in many web applications. For instance, detecting offensive or sensitive content in online communities (pornography, violence, and hate speech) is receiving enormous attention from industry as well as research communities. Such problems have both the characteristics we describe: a vast majority of content is not offensive, so the number of positive examples for such content is orders of magnitude smaller than the number of negative examples. Furthermore, there is usually only a small amount of initial training data available when building machine-learned models to solve such problems. To address both these issues, we propose a hybrid active learning algorithm (HAL) that balances exploiting the knowledge available through the currently labeled training examples with exploring the large amount of unlabeled data available. Through simulation results, we show that HAL makes significantly better choices of what points to label when compared to strong baselines like margin sampling. Classifiers trained on the examples selected for labeling by HAL easily outperform the baselines on target metrics (like area under the precision-recall curve) given the same budget for labeling examples. We believe HAL offers a simple, intuitive, and computationally tractable way to structure active learning for a wide range of machine learning applications.
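The exploit/explore balance described above can be illustrated with a minimal sketch. The abstract does not specify HAL's exact mixing rule, so the function below is a hypothetical interpretation: a fixed fraction of each batch (`explore_frac`, an assumed parameter) is drawn uniformly at random from the unlabeled pool (exploration), while the rest is chosen by margin sampling, i.e., the points whose predicted probabilities lie closest to the decision boundary (exploitation).

```python
import numpy as np

def hal_select_batch(scores, batch_size, explore_frac=0.5, rng=None):
    """Select a batch of unlabeled points by mixing exploitation and exploration.

    scores       : predicted P(y=1) for each point in the unlabeled pool.
    batch_size   : total number of points to query for labels this round.
    explore_frac : fraction of the batch drawn uniformly at random
                   (hypothetical parameter; the actual HAL mixing rule
                   is not given in the abstract).
    Returns indices into the unlabeled pool.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    scores = np.asarray(scores, dtype=float)
    n_explore = int(round(batch_size * explore_frac))
    n_exploit = batch_size - n_explore

    # Exploit: margin sampling -- points closest to the 0.5 decision boundary.
    margin = np.abs(scores - 0.5)
    exploit_idx = np.argsort(margin)[:n_exploit]

    # Explore: uniform random draws from the rest of the unlabeled pool.
    remaining = np.setdiff1d(np.arange(len(scores)), exploit_idx)
    explore_idx = rng.choice(remaining, size=n_explore, replace=False)

    return np.concatenate([exploit_idx, explore_idx])
```

On a heavily skewed pool, pure margin sampling can fixate on one region of the input space early on; the random component keeps probing the unlabeled data for rare positives the current model knows nothing about, which is the trade-off the abstract attributes to HAL.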