论文标题
可扩展和稀疏感知隐私的K-Means聚类,并应用于欺诈检测
Scalable and Sparsity-Aware Privacy-Preserving K-means Clustering with Application to Fraud Detection
论文作者
论文摘要
K均值是实践中使用最广泛的聚类模型之一。由于数据隔离问题和对高模型性能的要求,如何共同为多方建立实用和安全的K-均值已成为行业许多应用程序的重要主题。现有的工作主要是两种类型。第一种类型具有效率优势,但是信息泄漏增加了潜在的隐私风险。第二种类型是可证明的,但对于大规模数据稀疏方案而言,效率低下,甚至无助。在本文中,我们为具有三个特征的有效稀疏感的K均值提出了一个新的框架。首先,我们的框架分为数据独立的离线阶段和更快的在线阶段,并且离线阶段允许预先计算几乎所有的加密操作。其次,我们利用在线和离线阶段中的矢量化技术。第三,我们采用稀疏的矩阵乘法来对数据稀疏方案进行进一步提高效率。我们对三个合成数据集进行了全面的实验,并将模型部署在现实世界中的欺诈检测任务中。我们的实验结果表明,与最先进的解决方案相比,我们的模型在运行时间和沟通规模方面都能达到竞争性能,尤其是在稀疏数据集上。
K-means is one of the most widely used clustering models in practice. Due to the problem of data isolation and the requirement for high model performance, how to jointly build practical and secure K-means for multiple parties has become an important topic for many applications in the industry. Existing work on this is mainly of two types. The first type has efficiency advantages, but information leakage raises potential privacy risks. The second type is provable secure but is inefficient and even helpless for the large-scale data sparsity scenario. In this paper, we propose a new framework for efficient sparsity-aware K-means with three characteristics. First, our framework is divided into a data-independent offline phase and a much faster online phase, and the offline phase allows to pre-compute almost all cryptographic operations. Second, we take advantage of the vectorization techniques in both online and offline phases. Third, we adopt a sparse matrix multiplication for the data sparsity scenario to improve efficiency further. We conduct comprehensive experiments on three synthetic datasets and deploy our model in a real-world fraud detection task. Our experimental results show that, compared with the state-of-the-art solution, our model achieves competitive performance in terms of both running time and communication size, especially on sparse datasets.