Paper Title

FCT-GAN: Enhancing Table Synthesis via Fourier Transform

Authors

Zilong Zhao, Robert Birke, Lydia Y. Chen

Abstract

Synthetic tabular data has emerged as an alternative for sharing knowledge while adhering to restrictive data access regulations, e.g., the European General Data Protection Regulation (GDPR). Mainstream state-of-the-art tabular data synthesizers draw their methodology from Generative Adversarial Networks (GANs), which are composed of a generator and a discriminator. While convolutional neural networks have been shown to be a better architecture than fully connected networks for tabular data synthesis, two key properties of tabular data are overlooked: (i) the global correlation across columns, and (ii) synthesis that is invariant to column permutations of the input data. To address these problems, we propose a Fourier conditional tabular generative adversarial network (FCT-GAN). We introduce feature tokenization and Fourier networks to construct a transformer-style generator and discriminator that capture both local and global dependencies across columns. The tokenizer captures local spatial features and transforms the original data into tokens. The Fourier networks transform the tokens to the frequency domain and multiply them element-wise by a learnable filter. Extensive evaluation on benchmarks and real-world data shows that FCT-GAN can synthesize tabular data with high machine learning utility (up to 27.8% better than state-of-the-art baselines) and high statistical similarity to the original data (up to 26.5% better), while maintaining the global correlation across columns, especially on high-dimensional datasets.
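The abstract gives no implementation details, so the following is only a minimal sketch of the Fourier-filtering step it describes: tokens are taken to the frequency domain, multiplied element-wise by a filter, and transformed back. It is not the authors' code. The function name `fourier_filter`, the choice to apply the FFT along the token axis, and the filter shape are all assumptions made for illustration, and the filter here is a fixed array rather than a trained parameter.

```python
import numpy as np


def fourier_filter(tokens, filt):
    """Filter a token sequence in the frequency domain (illustrative only).

    tokens: real array of shape (n_tokens, dim)
    filt:   complex array of shape (n_tokens // 2 + 1, dim); in a trained
            model this would be a learnable parameter (assumption).
    """
    n_tokens = tokens.shape[0]
    # Real FFT along the token axis: each frequency bin mixes all tokens.
    freq = np.fft.rfft(tokens, axis=0)
    # Element-wise multiplication by the filter.
    freq = freq * filt
    # Back to the token domain, keeping the original sequence length.
    return np.fft.irfft(freq, n=n_tokens, axis=0)


rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 4))  # 8 hypothetical column tokens, width 4

# An all-ones filter leaves the tokens unchanged.
identity = np.ones((8 // 2 + 1, 4), dtype=complex)
same = fourier_filter(tokens, identity)

# A filter that keeps only the zero-frequency bin replaces every token
# with the mean over all tokens, i.e. purely global information.
dc_only = np.zeros((8 // 2 + 1, 4), dtype=complex)
dc_only[0] = 1.0
averaged = fourier_filter(tokens, dc_only)
```

Because each frequency bin depends on every token, a single element-wise multiplication in the frequency domain lets every output token depend on every input token, which is one way to read the abstract's claim about capturing global dependencies across columns.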
