Paper Title
Robust Inference of Manifold Density and Geometry by Doubly Stochastic Scaling
Paper Authors
Paper Abstract
The Gaussian kernel and its traditional normalizations (e.g., row-stochastic) are popular approaches for assessing similarities between data points. Yet, they can be inaccurate under high-dimensional noise, especially if the noise magnitude varies considerably across the data, e.g., under heteroskedasticity or outliers. In this work, we investigate a more robust alternative -- the doubly stochastic normalization of the Gaussian kernel. We consider a setting where points are sampled from an unknown density on a low-dimensional manifold embedded in high-dimensional space and corrupted by possibly strong, non-identically distributed, sub-Gaussian noise. We establish that the doubly stochastic affinity matrix and its scaling factors concentrate around certain population forms, and provide corresponding finite-sample probabilistic error bounds. We then utilize these results to develop several tools for robust inference under general high-dimensional noise. First, we derive a robust density estimator that reliably infers the underlying sampling density and can substantially outperform the standard kernel density estimator under heteroskedasticity and outliers. Second, we obtain estimators for the pointwise noise magnitudes, the pointwise signal magnitudes, and the pairwise Euclidean distances between clean data points. Lastly, we derive robust graph Laplacian normalizations that accurately approximate various manifold Laplacians, including the Laplace-Beltrami operator, improving over traditional normalizations in noisy settings. We exemplify our results in simulations and on real single-cell RNA-sequencing data. For the latter, we show that in contrast to traditional methods, our approach is robust to variability in technical noise levels across cell types.
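For readers unfamiliar with the normalization named in the title, below is a minimal sketch of doubly stochastic scaling of a Gaussian kernel, computed with a damped symmetric Sinkhorn-type iteration. The function name `doubly_stochastic_gaussian` and the parameters `sigma`, `n_iter`, and `tol` are illustrative choices, not part of the paper; the vector `d` plays the role of the scaling factors whose concentration the abstract refers to, but the paper's downstream estimators (density, noise magnitudes, distances, graph Laplacians) are not reproduced here.

```python
# A minimal sketch (assumptions noted above): doubly stochastic normalization
# of a Gaussian kernel via a damped symmetric Sinkhorn-type iteration.
# Not the paper's reference implementation.
import numpy as np

def doubly_stochastic_gaussian(X, sigma=1.0, n_iter=5000, tol=1e-8):
    """Return W = diag(d) K diag(d) with unit row/column sums, where
    K_ij = exp(-||x_i - x_j||^2 / (2 * sigma^2)), together with d."""
    # Pairwise squared Euclidean distances.
    sq = np.sum(X * X, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.maximum(D2, 0.0, out=D2)            # clip tiny negatives from round-off
    K = np.exp(-D2 / (2.0 * sigma ** 2))

    # Scaling factors d with d_i * (K d)_i = 1 for all i, found by the
    # damped (geometric-mean) fixed-point update d <- sqrt(d / (K d)).
    d = np.ones(K.shape[0])
    for _ in range(n_iter):
        Kd = K @ d
        if np.max(np.abs(d * Kd - 1.0)) < tol:   # doubly stochastic residual
            break
        d = np.sqrt(d / Kd)
    return d[:, None] * K * d[None, :], d

# Usage: rows (and, by symmetry, columns) of W sum to approximately one.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))         # 200 points in 20 ambient dimensions
W, d = doubly_stochastic_gaussian(X, sigma=2.0)
print("max |row sum - 1|:", np.abs(W.sum(axis=1) - 1.0).max())
```

Unlike row-stochastic normalization, which rescales each row of the kernel independently, the doubly stochastic scaling couples all points through the joint conditions on `d`; this symmetric construction is the object whose robustness to heteroskedastic, high-dimensional noise the paper analyzes.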