Paper Title

Fed-TDA: Federated Tabular Data Augmentation on Non-IID Data

Authors

Duan, Shaoming; Liu, Chuanyi; Han, Peiyi; He, Tianyu; Xu, Yifeng; Deng, Qiyuan

Abstract

Non-independent and identically distributed (non-IID) data is a key challenge in federated learning (FL), usually hampering both the optimization convergence and the performance of FL. Existing data augmentation methods that address the non-IID problem, whether based on federated generative models or on raw data sharing strategies, still suffer from low performance, privacy concerns, and high communication overhead on decentralized tabular data. To tackle these challenges, we propose a federated tabular data augmentation method named Fed-TDA. The core idea of Fed-TDA is to synthesize tabular data for data augmentation using only simple statistics (e.g., the distribution of each column and the global covariance). Specifically, we propose multimodal distribution transformation and inverse cumulative distribution mapping, which synthesize the continuous and discrete columns of tabular data, respectively, from noise according to the pre-learned statistics. Furthermore, we show theoretically that Fed-TDA not only preserves data privacy but also maintains the distribution of the original data and the correlation between columns. Through extensive experiments on five real-world tabular datasets, we demonstrate the superiority of Fed-TDA over the state of the art in test performance and communication efficiency.
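To make the idea of inverse cumulative distribution mapping concrete, the following is a minimal sketch of generic inverse-transform sampling for a single discrete column. It is not the authors' implementation: the function name `inverse_cdf_map`, the choice of standard-normal noise, and the example categories and frequencies are all assumptions made for illustration. The only "statistics" it uses are the per-category frequencies of the column.

```python
import bisect
import random
from statistics import NormalDist


def inverse_cdf_map(z, categories, frequencies):
    """Map one noise value z to a category (hypothetical helper).

    Assumption: z is drawn from a standard normal distribution, so
    NormalDist().cdf(z) is uniform on (0, 1).
    """
    u = NormalDist().cdf(z)

    # Build the cumulative distribution from the column's frequencies.
    total = sum(frequencies)
    cumulative = []
    running = 0.0
    for f in frequencies:
        running += f / total
        cumulative.append(running)

    # Pick the first category whose cumulative probability reaches u;
    # clamp the index to guard against floating-point round-off.
    idx = bisect.bisect_left(cumulative, u)
    return categories[min(idx, len(categories) - 1)]


# Illustrative column statistics (made up for this sketch).
categories = ["a", "b", "c"]
frequencies = [50, 30, 20]

rng = random.Random(0)
synthetic = [
    inverse_cdf_map(rng.gauss(0.0, 1.0), categories, frequencies)
    for _ in range(20000)
]
share_a = synthetic.count("a") / len(synthetic)
```

With enough samples, the share of each synthesized category approaches its frequency in the supplied statistics, which is the sense in which such a mapping preserves a column's marginal distribution. Drawing the noise for several columns from a correlated Gaussian rather than independently would be one way to carry over inter-column correlation, though whether Fed-TDA does exactly this is not stated in the abstract.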
