Paper Title
Adaptive Deep Learning for Entity Resolution by Risk Analysis
Authors
Abstract
Deep learning has achieved state-of-the-art performance on entity resolution (ER). However, deep models are usually trained on large quantities of accurately labeled training data and cannot be easily tuned towards a target workload. Unfortunately, in real scenarios there may not be sufficient labeled training data and, even worse, its distribution usually differs to some degree from that of the target workload, even when both come from the same domain. To alleviate these limitations, this paper proposes a novel risk-based approach that tunes a deep model towards a target workload according to its particular characteristics. Building on recent advances in risk analysis for ER, the proposed approach first trains a deep model on labeled training data and then fine-tunes it by minimizing its estimated misprediction risk on unlabeled target data. Our theoretical analysis shows that risk-based adaptive training can correct the label status of a mispredicted instance with a fairly good chance. We have also empirically validated the efficacy of the proposed approach on real benchmark data through a comparative study. Our extensive experiments show that it can considerably improve the performance of deep models. Furthermore, under distribution misalignment, it similarly outperforms the state-of-the-art transfer learning alternatives by considerable margins. Using ER as a test case, we demonstrate that risk-based adaptive training is a promising approach potentially applicable to various challenging classification tasks.