Digital Audio Tampering Detection Based on ENF Spatio-temporal Features Representation Learning

Paper Title

Digital Audio Tampering Detection Based on ENF Spatio-temporal Features Representation Learning

Authors

Zeng, Chunyan; Kong, Shuai; Wang, Zhifeng; Wan, Xiangkui; Chen, Yunfan

Abstract

Most digital audio tampering detection methods based on the electrical network frequency (ENF) use only the static spatial information of the ENF and ignore how the ENF varies over time, which limits the representational power of ENF features and reduces tampering detection accuracy. This paper proposes a new method for digital audio tampering detection based on ENF spatio-temporal feature representation learning. A parallel spatio-temporal network model is built from a CNN and a BiLSTM to extract deep ENF spatial and temporal feature information, strengthening the feature representation and improving tampering detection accuracy. To extract the spatial and temporal features of the ENF, the method first applies high-precision Discrete Fourier Transform analysis to the digital audio to extract the ENF phase sequence. The phase sequences, which differ in length, are divided into frames by adaptive frame shifting to obtain feature matrices of the same size that represent the spatial features of the ENF. At the same time, the phase sequences are divided into frames based on ENF temporal variation information to represent the temporal features of the ENF. Deep spatial and temporal features are then extracted with the CNN and the BiLSTM, respectively, and an attention mechanism adaptively assigns weights to them to obtain spatio-temporal features with stronger representational capability. Finally, a deep neural network determines whether the audio has been tampered with. Experimental results show that the proposed method improves accuracy by 2.12% to 7.12% compared with state-of-the-art methods on the public databases Carioca and New Spanish.
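
The abstract does not spell out how adaptive frame shifting is computed, so the following is only a minimal sketch of the general idea under stated assumptions: the frame length and the number of frames are fixed, and the shift between frames is derived from the sequence length so that phase sequences of different lengths all yield a matrix of the same size. The function name `frame_with_adaptive_shift` and its parameters are hypothetical and are not taken from the paper.

```python
def frame_with_adaptive_shift(phase, frame_len, num_frames):
    """Split `phase` into exactly `num_frames` frames of `frame_len` samples.

    Assumption (not from the paper): the shift between frame starts is
    (len(phase) - frame_len) / (num_frames - 1), so it adapts to the
    sequence length and the output is always num_frames x frame_len.
    """
    n = len(phase)
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    if n < frame_len:
        raise ValueError("sequence is shorter than one frame")
    # Real-valued shift; longer sequences get a larger shift.
    shift = (n - frame_len) / (num_frames - 1)
    frames = []
    for i in range(num_frames):
        start = int(round(i * shift))
        frames.append(list(phase[start:start + frame_len]))
    return frames


# Two stand-in "phase sequences" of different lengths.
shorter = frame_with_adaptive_shift(list(range(10)), frame_len=4, num_frames=3)
longer = frame_with_adaptive_shift(list(range(25)), frame_len=4, num_frames=3)
```

Both calls return a 3 x 4 matrix. The shorter sequence is framed with overlapping frames, while the longer one is framed with gaps between frames, which is the trade-off a fixed output size implies in this sketch.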
