Optimal Test-Data Piling in HDLSS Classification with Covariance Heterogeneity

Paper Title

Optimal Test-Data Piling in HDLSS Classification with Covariance Heterogeneity

Authors

Taehyun Kim, Jeongyoun Ahn, Sungkyu Jung

Abstract

This work addresses a longstanding question in high-dimensional linear classification: Is perfect classification achievable in heterogeneous covariance structures? We focus on the phenomenon of data piling, where projected data points collapse onto discrete values. We provide a comprehensive characterization of two distinct types of data piling. The first type of data piling refers to the phenomenon where projecting the training data onto a certain direction yields exactly two distinct values, one for each class. This occurs universally when the data dimension $p$ exceeds the sample size $n$. The second type concerns independent test data and arises asymptotically as $p \to \infty$ with fixed $n$. While previous work established the existence of such double data piling under homogeneously spiked covariance structures using negatively ridged classifiers, our analysis extends to the more general and realistic case of heterogeneous covariance. We identify an optimal direction among all piling directions that maximizes the separation between test data piles, which is called the Second Maximal Data Piling direction. An algorithm based on data splitting is proposed to compute this direction using only training data. Our analysis reveals a key insight: the main obstacle to discovering this direction is the imbalance of the tail eigenvalues, rather than differences in spike count, spike magnitude, or the alignment of leading eigenspaces. Extensive simulations confirm our theoretical results and demonstrate the effectiveness of the proposed classifier across a wide range of high-dimensional scenarios.
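
The first type of data piling described in the abstract can be seen in a small numerical sketch. The Python example below is only an illustration under stated assumptions, not the paper's Second Maximal Data Piling algorithm: it builds one piling direction as the minimum-norm least-squares solution that maps the centered training data to centered class labels, a construction that exists whenever $p > n$ and the centered data have rank $n - 1$. The function name `piling_direction`, the simulated Gaussian data, and all parameter values are hypothetical choices made for this sketch.

```python
import numpy as np


def piling_direction(X, y):
    """Return a unit vector w whose projections X @ w are constant within each class.

    Assumes p > n and that the centered data matrix has rank n - 1
    (true with probability one for continuous data).
    """
    Xc = X - X.mean(axis=0)   # center the features over all training points
    yc = y - y.mean()         # center the +1 / -1 labels
    # Minimum-norm solution of the underdetermined system Xc @ w = yc.
    w, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
    return w / np.linalg.norm(w)


# Simulated HDLSS data: two classes with different means and different variances.
rng = np.random.default_rng(0)
n1, n2, p = 10, 12, 200
X = np.vstack([
    rng.normal(0.0, 1.0, size=(n1, p)),
    rng.normal(0.3, 2.0, size=(n2, p)),
])
y = np.concatenate([np.ones(n1), -np.ones(n2)])

w = piling_direction(X, y)
proj = X @ w

# Within-class spread of the projections (numerically zero) and the gap between piles.
spread1 = proj[:n1].max() - proj[:n1].min()
spread2 = proj[n1:].max() - proj[n1:].min()
gap = abs(proj[:n1].mean() - proj[n1:].mean())
print(f"within-class spreads: {spread1:.2e}, {spread2:.2e}; gap between piles: {gap:.3f}")
```

Projecting fresh test points onto `w` will generally not reproduce this collapse. That gap between training and test behavior is what the second type of data piling, and the search for an optimal direction, is about.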
