Paper Title
A Semantic Alignment System for Multilingual Query-Product Retrieval
Authors
Abstract
This paper describes our winning solution (team name: www) to the Amazon ESCI Challenge of KDD Cup 2022, which achieved an NDCG score of 0.9043 and took first place on task 1, the query-product ranking track. Participants in this competition were given a real-world, large-scale multilingual shopping queries data set containing query-product pairs in English, Japanese, and Spanish. The competition posed three tasks: ranking the results list (task 1), classifying query-product pairs into the Exact, Substitute, Complement, or Irrelevant (ESCI) categories (task 2), and identifying substitute products for a given query (task 3). We focus mainly on task 1 and propose a semantic alignment system for multilingual query-product retrieval. Pre-trained multilingual language models (LMs) are used to obtain semantic representations of queries and products. All of our models are first trained with cross-entropy loss to classify query-product pairs into the four ESCI categories; we then take a weighted sum of the four class probabilities as the ranking score. To further improve the models, we also apply careful data preprocessing, data augmentation by translation, dedicated handling of English texts with English LMs, adversarial training with AWP and FGM, self-distillation, pseudo-labeling, label smoothing, and ensembling. Our solution outperforms the other submissions on both the public and private leaderboards.
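The scoring step mentioned in the abstract (a weighted sum over the four ESCI class probabilities) can be illustrated with a minimal sketch. The class order, the weight values, and the function names below are assumptions made for illustration only; the abstract does not state the weights actually used.

```python
import math

# Assumed class order for this sketch: Exact, Substitute, Complement, Irrelevant.
ESCI_CLASSES = ("E", "S", "C", "I")

# Hypothetical weights, chosen only so that E > S > C > I.
# The abstract does not give the values the authors used.
CLASS_WEIGHTS = (1.0, 0.1, 0.01, 0.0)


def softmax(logits):
    """Convert raw classifier logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ranking_score(probs, weights=CLASS_WEIGHTS):
    """Weighted sum of the four class probabilities for one query-product pair."""
    return sum(w * p for w, p in zip(weights, probs))


def rank_products(logits_by_product, weights=CLASS_WEIGHTS):
    """Return product ids for one query, sorted by descending ranking score."""
    scores = {
        pid: ranking_score(softmax(logits), weights)
        for pid, logits in logits_by_product.items()
    }
    return sorted(scores, key=scores.get, reverse=True)
```

For example, `rank_products({"p1": [0, 0, 0, 5], "p2": [5, 0, 0, 0]})` places `p2` first, because its probability mass sits on the Exact class. Turning the classifier output into a single scalar this way lets one model trained for four-way classification also produce the ordering needed for ranking.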