Paper title
Quantifying and Reducing Bias in Maximum Likelihood Estimation of Structured Anomalies
Paper authors
Abstract
Anomaly estimation, the problem of finding a subset of a dataset that differs from the rest of the dataset, is a classic problem in machine learning and data mining. In both theoretical work and applications, the anomaly is assumed to have a specific structure defined by membership in an $\textit{anomaly family}$. For example, in temporal data the anomaly family may be time intervals, while in network data the anomaly family may be connected subgraphs. The most prominent approach to anomaly estimation is to compute the Maximum Likelihood Estimator (MLE) of the anomaly. However, it was recently observed that for normally distributed data, the MLE is a $\textit{biased}$ estimator for some anomaly families. In this work, we demonstrate that in the normal means setting, the bias of the MLE depends on the size of the anomaly family. We prove that if the number of sets in the anomaly family that contain the anomaly is sub-exponential, then the MLE is asymptotically unbiased. We also provide empirical evidence that the converse is true: if the number of such sets is exponential, then the MLE is asymptotically biased. Our analysis unifies a number of earlier results on the bias of the MLE for specific anomaly families. Next, we derive a new anomaly estimator using a mixture model, and we prove that this estimator is asymptotically unbiased regardless of the size of the anomaly family. We illustrate the advantages of our estimator over the MLE on disease outbreak and highway traffic data.
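To make the MLE in the normal means setting concrete, the following is a minimal sketch for one anomaly family, time intervals. It is not the authors' implementation. It assumes unit-variance Gaussian observations whose mean is shifted by an unknown positive amount inside the anomaly, in which case maximizing the likelihood over the family reduces to maximizing the sum of the observations in a set divided by the square root of the set's size. The function names `interval_mle` and `simulate`, and all parameter values, are hypothetical and chosen only for illustration.

```python
import numpy as np


def interval_mle(x):
    """Brute-force MLE over the interval family (illustrative sketch).

    Assumes x[i] ~ N(mu, 1) inside the anomaly and N(0, 1) outside,
    with mu > 0 unknown. Under that assumption the MLE of the anomaly
    is the interval [s, e) maximizing sum(x[s:e]) / sqrt(e - s).
    Returns ((s, e), score).
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    # prefix[k] holds sum(x[:k]), so any interval sum is one subtraction.
    prefix = np.concatenate(([0.0], np.cumsum(x)))
    best_score, best = -np.inf, (0, 1)
    for s in range(n):
        for e in range(s + 1, n + 1):
            score = (prefix[e] - prefix[s]) / np.sqrt(e - s)
            if score > best_score:
                best_score, best = score, (s, e)
    return best, best_score


def simulate(n=200, start=80, length=20, mu=1.0, seed=0):
    """Draw N(0, 1) noise and shift the mean by mu on one interval."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    x[start:start + length] += mu
    return x


if __name__ == "__main__":
    x = simulate()
    (s, e), score = interval_mle(x)
    print(f"estimated interval [{s}, {e}), size {e - s}, score {score:.3f}")
```

The interval family contains only quadratically many sets, so exhaustive search is feasible here. Repeating the simulation over many seeds and comparing the estimated size `e - s` with the true `length` is one simple way to look at the estimator's behaviour empirically. For exponentially large families such as connected subgraphs, exhaustive search of this kind is generally not practical.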