Paper Title

Sequential Search with Off-Policy Reinforcement Learning

Authors

Dadong Miao, Yanan Wang, Guoyu Tang, Lin Liu, Sulong Xu, Bo Long, Yun Xiao, Lingfei Wu, Yunjiang Jiang

Abstract

Recent years have seen a significant amount of interest in Sequential Recommendation (SR), which aims to understand and model the sequential user behaviors and the interactions between users and items over time. Surprisingly, despite the huge success Sequential Recommendation has achieved, there is little study on Sequential Search (SS), a twin learning task that takes into account a user's current and past search queries, in addition to behavior on historical query sessions. The SS learning task is even more important than the counterpart SR task for most E-commerce companies due to its much larger online serving demands as well as traffic volume. To this end, we propose a highly scalable hybrid learning model that consists of an RNN learning framework leveraging all features in short-term user-item interactions, and an attention model utilizing selected item-only features from long-term interactions. As a novel optimization step, we fit multiple short user sequences into a single RNN pass within a training batch, by solving a greedy knapsack problem on the fly. Moreover, we explore the use of off-policy reinforcement learning in multi-session personalized search ranking. Specifically, we design a pairwise Deep Deterministic Policy Gradient model that efficiently captures users' long-term reward in terms of pairwise classification error. Extensive ablation experiments demonstrate the significant improvement each component brings over its state-of-the-art baseline, on a variety of offline and online metrics.
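The "greedy knapsack" packing step mentioned in the abstract can be illustrated with a minimal sketch: short user sequences are packed into fixed-capacity RNN slots so that one RNN pass processes several sequences. The function name, the first-fit-decreasing heuristic, and all parameters below are assumptions for illustration, not the authors' exact algorithm.

```python
def pack_sequences(seq_lengths, capacity):
    """Greedily pack sequences (first-fit decreasing) into bins of size
    `capacity`; returns a list of bins, each a list of sequence indices.
    Hypothetical sketch of the paper's on-the-fly knapsack packing."""
    # Longest sequences first, so large items claim fresh slots early.
    order = sorted(range(len(seq_lengths)), key=lambda i: -seq_lengths[i])
    bins, remaining = [], []
    for i in order:
        length = seq_lengths[i]
        for b, free in enumerate(remaining):
            if length <= free:            # fits into an existing RNN slot
                bins[b].append(i)
                remaining[b] -= length
                break
        else:                             # no slot fits: open a new one
            bins.append([i])
            remaining.append(capacity - length)
    return bins

# Example: sessions of lengths 5, 3, 7, 2 packed into slots of length 10.
print(pack_sequences([5, 3, 7, 2], capacity=10))  # → [[2, 1], [0, 3]]
```

Packing this way reduces padding waste when user sequences are much shorter than the RNN's maximum unroll length.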
