Paper Title

Ensemble Knowledge Distillation for CTR Prediction

Authors

Jieming Zhu, Jinyang Liu, Weiqi Li, Jincai Lai, Xiuqiang He, Liang Chen, Zibin Zheng

Abstract

Recently, deep learning-based models have been widely studied for click-through rate (CTR) prediction and have led to improved prediction accuracy in many industrial applications. However, current research focuses primarily on building complex network architectures to better capture sophisticated feature interactions and dynamic user behaviors. The increased model complexity may slow down online inference and hinder its adoption in real-time applications. Instead, our work targets a new model training strategy based on knowledge distillation (KD). KD is a teacher-student learning framework that transfers knowledge learned by a teacher model to a student model. The KD strategy not only allows us to simplify the student model to a vanilla DNN model but also achieves significant accuracy improvements over the state-of-the-art teacher models. These benefits motivate us to further explore the use of a powerful ensemble of teachers for more accurate student model training. We also propose several novel techniques to facilitate ensembled CTR prediction, including teacher gating and early stopping by distillation loss. We conduct comprehensive experiments against 12 existing models and across three industrial datasets. Both offline and online A/B testing results demonstrate the effectiveness of our KD-based training strategy.
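To make the core idea concrete, below is a minimal, illustrative sketch of how a distillation objective for CTR prediction with an ensemble of teachers could look in PyTorch. The function name `kd_ctr_loss`, the weight `alpha`, and the uniform averaging of teacher predictions are assumptions made for illustration only; they are not the paper's exact formulation, and the paper's teacher gating and early stopping by distillation loss are not implemented here.

```python
import torch
import torch.nn.functional as F


def kd_ctr_loss(student_logits, teacher_probs_list, labels, alpha=0.5):
    """Illustrative KD objective: ground-truth CTR loss plus a distillation term.

    student_logits:     (batch,) raw logits from the student (e.g. a vanilla DNN)
    teacher_probs_list: list of (batch,) predicted CTRs from pre-trained teachers
    labels:             (batch,) binary click labels
    alpha:              hypothetical weight balancing the two terms
    """
    # Standard binary cross-entropy against the ground-truth clicks.
    ce_loss = F.binary_cross_entropy_with_logits(student_logits, labels)

    # Soft targets: a simple uniform average of the teachers' predicted CTRs.
    # The paper's teacher gating would replace this with learned teacher weights.
    soft_targets = torch.stack(teacher_probs_list, dim=0).mean(dim=0)

    # Distillation term: push the student's probabilities toward the soft targets.
    kd_loss = F.binary_cross_entropy_with_logits(student_logits, soft_targets)

    return ce_loss + alpha * kd_loss


if __name__ == "__main__":
    # Tiny usage example with dummy tensors.
    batch = 4
    labels = torch.randint(0, 2, (batch,)).float()
    student_logits = torch.randn(batch)
    teacher_probs = [torch.sigmoid(torch.randn(batch)) for _ in range(3)]
    print(kd_ctr_loss(student_logits, teacher_probs, labels))
```

In this sketch the student is trained jointly on hard labels and on the ensemble's soft predictions, which is the general pattern the abstract describes: a simple student model that absorbs knowledge from stronger, more expensive teachers while keeping online inference cheap.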
