Spoken Term Detection and Relevance Score Estimation using Dot-Product of Pronunciation Embeddings

Paper title

Spoken Term Detection and Relevance Score Estimation using Dot-Product of Pronunciation Embeddings

Authors

Jan Švec, Luboš Šmídl, Josef V. Psutka, Aleš Pražák

Abstract

This paper describes a novel approach to Spoken Term Detection (STD) in large spoken archives using deep LSTM networks. The work builds on a previous approach that used Siamese neural networks for STD and naturally extends it to directly localize a spoken term and estimate its relevance score. A phoneme confusion network generated by a phoneme recognizer is processed by a deep LSTM network, which projects each segment of the confusion network into an embedding space. The searched term is projected into the same embedding space by another deep LSTM network. The relevance score is then computed as a simple dot-product in the embedding space and calibrated with a sigmoid function to predict the probability of occurrence. The location of the searched term is then estimated from the sequence of output probabilities. The deep LSTM networks are trained in a self-supervised manner from paired recognition hypotheses at the word and phoneme levels. The method is experimentally evaluated on MALACH data in English and Czech.
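
The scoring step described in the abstract (dot-product in the embedding space, sigmoid calibration, then localization from the probability sequence) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the two LSTM encoders have already produced the embeddings, that the searched term is represented by a single vector, that calibration is an affine map (the hypothetical `scale` and `bias` parameters) followed by a sigmoid, and that localization is done by thresholding runs of consecutive segments. None of these details are stated in the abstract.

```python
import math
from typing import List, Tuple


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def occurrence_probabilities(
    segment_embeddings: List[List[float]],
    query_embedding: List[float],
    scale: float = 1.0,  # hypothetical calibration parameter (assumption)
    bias: float = 0.0,   # hypothetical calibration parameter (assumption)
) -> List[float]:
    """Dot-product relevance score per segment, calibrated with a sigmoid."""
    probs = []
    for seg in segment_embeddings:
        score = sum(s * q for s, q in zip(seg, query_embedding))
        probs.append(sigmoid(scale * score + bias))
    return probs


def localize(
    probs: List[float], threshold: float = 0.5
) -> List[Tuple[int, int, float]]:
    """Return (start, end, peak probability) for each maximal run of
    consecutive segments whose probability is at least `threshold`.
    Indices are inclusive. Thresholding is an assumed localization rule."""
    hits = []
    start = None
    for i, p in enumerate(probs):
        if p >= threshold:
            if start is None:
                start = i
        elif start is not None:
            hits.append((start, i - 1, max(probs[start:i])))
            start = None
    if start is not None:
        hits.append((start, len(probs) - 1, max(probs[start:])))
    return hits


# Toy data: five confusion-network segments embedded in a 2-D space.
segments = [[-1.0, 0.0], [0.5, 0.5], [2.0, 1.0], [2.0, 2.0], [-1.0, -1.0]]
query = [1.0, 1.0]

probs = occurrence_probabilities(segments, query, scale=1.0, bias=-2.0)
hits = localize(probs, threshold=0.5)
print(hits)
```

In this toy run, only segments 2 and 3 have a calibrated probability above 0.5, so a single hit spanning those two segments is reported, with the peak probability serving as its relevance score.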
