Paper title
Impact of Sampling on Locally Differentially Private Data Collection
Authors
Abstract
With the recent boom in data, threats against individuals' private information have surged. Techniques for optimizing privacy-preserving data analysis have been a focus of research in recent years. In this paper, we analyse the impact of sampling on the utility of standard frequency-estimation techniques, which are at the core of large-scale data analysis, for locally differentially private data release under a pure protocol. We study a distributed data-sharing environment in which various nodes report their values to a central server, e.g., cross-device Federated Learning. We show that if random sampling of the nodes is introduced to reduce the communication cost, the standard existing estimators no longer remain unbiased. We propose a new unbiased estimator for the setting in which each node is sampled with a certain probability, and use it to compute various statistical summaries of the data. As a step towards further generalisation, we propose a way of sampling each node with a personalized sampling probability, which leads to some interesting open questions. We analyse the accuracy of our proposed estimators on synthetic datasets to gain insight into the trade-off between communication cost, privacy, and utility.
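As a rough illustration of the bias described in the abstract, the Python sketch below assumes k-ary generalized randomized response as the pure protocol and a single sampling probability s that the server knows. The function names, the parameter values, and the inverse-probability (Horvitz-Thompson style) correction are illustrative assumptions, not necessarily the estimator proposed in the paper.

```python
import math
import random


def grr_params(epsilon, k):
    """Return (p, q) for k-ary randomized response with privacy budget epsilon.

    p is the probability of reporting the true value, q the probability of
    reporting any one specific other value, so p + (k - 1) * q == 1.
    """
    e = math.exp(epsilon)
    return e / (e + k - 1), 1.0 / (e + k - 1)


def grr_perturb(value, k, p, rng):
    """Keep the true value with probability p, else pick a different value uniformly."""
    if rng.random() < p:
        return value
    other = rng.randrange(k - 1)
    return other if other < value else other + 1


def collect(values, k, epsilon, s, rng):
    """Sample each node independently with probability s; sampled nodes send one report."""
    p, _ = grr_params(epsilon, k)
    counts = [0] * k
    for v in values:
        if rng.random() < s:
            counts[grr_perturb(v, k, p, rng)] += 1
    return counts


def standard_estimate(counts, n, epsilon):
    """Usual pure-protocol count estimator, which assumes all n nodes reported."""
    p, q = grr_params(epsilon, len(counts))
    return [(c - n * q) / (p - q) for c in counts]


def sampling_corrected_estimate(counts, n, epsilon, s):
    """Assumed correction: rescale observed counts by 1/s before debiasing."""
    p, q = grr_params(epsilon, len(counts))
    return [(c / s - n * q) / (p - q) for c in counts]


# Hypothetical synthetic population of n = 20000 nodes over k = 4 values.
rng = random.Random(0)
k, n, epsilon, s = 4, 20000, 1.0, 0.5
values = [0] * 10000 + [1] * 6000 + [2] * 3000 + [3] * 1000
counts = collect(values, k, epsilon, s, rng)
print("standard: ", [round(x) for x in standard_estimate(counts, n, epsilon)])
print("corrected:", [round(x) for x in sampling_corrected_estimate(counts, n, epsilon, s)])
```

Under these assumptions, the expected count for a value held by n_v nodes is s(n_v p + (n - n_v) q), so plugging the raw counts into the standard estimator yields an expectation that depends on s, whereas dividing the counts by s first gives expectation n_v. Lowering s reduces communication but inflates the variance of the corrected estimate, which is the trade-off the abstract refers to.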