Paper Title
Subspace Recovery from Heterogeneous Data with Non-isotropic Noise
Paper Authors
Paper Abstract
Recovering linear subspaces from data is a fundamental and important task in statistics and machine learning. Motivated by heterogeneity in Federated Learning settings, we study a basic formulation of this problem: principal component analysis (PCA), with a focus on dealing with irregular noise. Our data come from $n$ users, with user $i$ contributing data samples from a $d$-dimensional distribution with mean $\mu_i$. Our goal is to recover the linear subspace shared by $\mu_1,\ldots,\mu_n$ using the data points from all users, where every data point from user $i$ is formed by adding an independent mean-zero noise vector to $\mu_i$. If we only have one data point from every user, subspace recovery is information-theoretically impossible when the covariance matrices of the noise vectors can be non-spherical, necessitating additional restrictive assumptions in previous work. We avoid these assumptions by leveraging at least two data points from each user, which allows us to design an efficiently computable estimator under non-spherical and user-dependent noise. We prove an upper bound on the estimation error of our estimator in general scenarios where the number of data points and the amount of noise can vary across users, and prove an information-theoretic lower bound that not only matches the upper bound up to a constant factor, but also holds even for spherical Gaussian noise. This implies that our estimator does not introduce additional estimation error (up to a constant factor) due to irregularity in the noise. We show additional results for a linear regression problem in a similar setup.
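The abstract's key point is that two independent data points per user make subspace recovery possible even when the noise covariance is non-spherical and user-dependent. The sketch below is only an illustration of that idea, not the paper's actual estimator: averaging the cross-products $x_{i,1} x_{i,2}^\top$ over users cancels the noise covariance in expectation (the two noise vectors are independent and mean zero), so the top eigenvectors of the averaged matrix approximate the shared subspace. The dimensions, the diagonal noise model in `noisy_sample`, and the symmetrized cross-product are illustrative assumptions.

```python
# Illustrative sketch (assumed setup, not the paper's estimator):
# why two independent samples per user sidestep non-spherical noise.
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 20, 3, 5000  # ambient dimension, subspace dimension, number of users

# Ground-truth k-dimensional subspace shared by all user means.
U_true, _ = np.linalg.qr(rng.standard_normal((d, k)))

# Each user i has a mean mu_i lying in the span of U_true.
mus = (U_true @ rng.standard_normal((k, n))).T  # shape (n, d)

def noisy_sample(mu):
    # Non-spherical, user-dependent noise: random per-coordinate scales
    # (an assumed toy noise model for this demo).
    scales = rng.uniform(0.5, 3.0, size=mu.shape)
    return mu + scales * rng.standard_normal(mu.shape)

# Two independent samples per user.
X1 = np.stack([noisy_sample(mu) for mu in mus])
X2 = np.stack([noisy_sample(mu) for mu in mus])

# Cross-product estimator: E[x1 x2^T] = mu mu^T because the two noise vectors
# are independent and mean zero, so the noise covariance never enters the
# expectation. Symmetrize so the matrix is symmetric.
M = (X1.T @ X2 + X2.T @ X1) / (2 * n)

# Top-k eigenvectors of M estimate the shared subspace.
eigvals, eigvecs = np.linalg.eigh(M)
U_hat = eigvecs[:, -k:]

# Subspace distance via principal angles (close to 0 means good recovery).
_, s, _ = np.linalg.svd(U_true.T @ U_hat)
print("largest principal angle (radians):", np.arccos(np.clip(s.min(), -1.0, 1.0)))
```

In contrast, with a single sample per user one would have to use $x_i x_i^\top$, whose expectation is $\mu_i \mu_i^\top$ plus the (unknown, non-spherical) noise covariance, which is exactly the bias the two-sample trick avoids.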