Title
Compressing Large Sample Data for Discriminant Analysis
Authors
Abstract
Large-sample data have become prevalent as data acquisition has become cheaper and easier. While a large sample size has theoretical advantages for many statistical methods, it presents computational challenges. Sketching, or compression, is a well-studied approach to address these issues in regression settings, but considerably less is known about its performance in classification settings. Here we consider the computational issues due to large sample size within the discriminant analysis framework. We propose a new compression approach for reducing the number of training samples for linear and quadratic discriminant analysis, in contrast to existing compression methods, which focus on reducing the number of features. We support our approach with a theoretical bound on the misclassification error rate compared to the Bayes classifier. Empirical studies confirm the significant computational gains of the proposed method and its superior predictive ability compared to random sub-sampling.
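To make the idea of compressing samples (rather than features) concrete, the following is a minimal sketch and not the authors' algorithm, whose details are not given in the abstract. It assumes a generic Gaussian sketching matrix applied separately to the centered rows of each class, so that each class contributes only `m` compressed rows to a pooled covariance estimate for linear discriminant analysis. The function names `compress_class`, `fit_compressed_lda`, and `predict_lda`, the choice of a Gaussian sketch, and the synthetic data are all illustrative assumptions. The sketch shows the mechanics only and makes no claim about computational cost or accuracy relative to the proposed method.

```python
import numpy as np


def compress_class(X, m, rng):
    """Reduce the n rows of one class to m rows (assumed Gaussian sketch)."""
    n = X.shape[0]
    mean = X.mean(axis=0)
    # Entries are N(0, 1/m), so E[S.T @ S] is the n x n identity and
    # E[(S @ Xc).T @ (S @ Xc)] equals the centered scatter Xc.T @ Xc.
    S = rng.normal(loc=0.0, scale=1.0 / np.sqrt(m), size=(m, n))
    return mean, S @ (X - mean)


def fit_compressed_lda(X, y, m, rng):
    """Fit LDA using m compressed rows per class for the pooled covariance."""
    classes = np.unique(y)
    means, counts = [], []
    scatter = np.zeros((X.shape[1], X.shape[1]))
    for c in classes:
        Xc = X[y == c]
        mean, Z = compress_class(Xc, m, rng)
        means.append(mean)
        counts.append(Xc.shape[0])
        scatter += Z.T @ Z  # approximates the within-class scatter
    n_total = sum(counts)
    cov = scatter / (n_total - len(classes))
    priors = np.array(counts) / n_total
    return classes, np.array(means), cov, priors


def predict_lda(model, X):
    """Assign each row to the class with the largest linear discriminant score."""
    classes, means, cov, priors = model
    A = np.linalg.solve(cov, means.T)  # column g holds inv(cov) @ mean_g
    scores = X @ A - 0.5 * np.sum(means.T * A, axis=0) + np.log(priors)
    return classes[np.argmax(scores, axis=1)]


# Usage on synthetic two-class data (illustrative values only).
rng = np.random.default_rng(0)
n, p, m = 2000, 5, 50
X = np.vstack([rng.normal(-2.0, 1.0, size=(n, p)),
               rng.normal(2.0, 1.0, size=(n, p))])
y = np.repeat([0, 1], n)
model = fit_compressed_lda(X, y, m, rng)
accuracy = np.mean(predict_lda(model, X) == y)
print(f"training accuracy with {m} compressed rows per class: {accuracy:.3f}")
```

Random sub-sampling, the baseline named in the abstract, would correspond to replacing `S` with a matrix that selects `m` rows at random; the Gaussian sketch instead mixes all rows of a class into each compressed row.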