Paper Title
Reward Shaping for User Satisfaction in a REINFORCE Recommender
Paper Authors
Paper Abstract
How might we design Reinforcement Learning (RL)-based recommenders that encourage aligning user trajectories with the underlying user satisfaction? Three research questions are key: (1) measuring user satisfaction, (2) combating the sparsity of satisfaction signals, and (3) adapting the training of the recommender agent to maximize satisfaction. For measurement, it has been found that surveys explicitly asking users to rate their experience with consumed items provide valuable information orthogonal to engagement/interaction data, acting as a proxy for the underlying user satisfaction. For sparsity, i.e., only being able to observe how satisfied users are with a tiny fraction of user-item interactions, imputation models are useful for predicting satisfaction levels for all items users have consumed. For learning satisfying recommender policies, we postulate that reward shaping in RL recommender agents is a powerful lever for driving satisfying user experiences. Putting everything together, we propose to jointly learn a policy network and a satisfaction imputation network: the role of the imputation network is to learn which actions are satisfying to the user, while the policy network, built on top of REINFORCE, decides which items to recommend, with a reward that incorporates the imputed satisfaction. We use both offline analysis and live experiments on an industrial large-scale recommendation platform to demonstrate the promise of our approach for satisfying user experiences.
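The abstract does not give implementation details, so the following is only a minimal, self-contained sketch of the core idea it describes: a REINFORCE policy over items whose reward is shaped by an imputed satisfaction term. All names and values here (ImputationNetwork, PolicyNetwork, LAMBDA_SAT, the toy corpus size and state dimension, the simulated engagement signal) are illustrative assumptions, not the authors' system.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_ITEMS = 50    # size of the toy recommendation corpus (assumed)
STATE_DIM = 16    # dimensionality of the user-state representation (assumed)
LAMBDA_SAT = 1.0  # weight of the imputed-satisfaction term in the shaped reward (assumed)


def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


class ImputationNetwork:
    """Linear stand-in for the satisfaction imputation model.

    In the paper this would be trained on sparse survey responses; here it
    simply maps a (state, item) pair to a predicted satisfaction in [0, 1].
    """

    def __init__(self):
        self.w = rng.normal(scale=0.1, size=(STATE_DIM, NUM_ITEMS))

    def predict(self, state, item):
        return 1.0 / (1.0 + np.exp(-state @ self.w[:, item]))


class PolicyNetwork:
    """Softmax policy over items, updated with the REINFORCE gradient."""

    def __init__(self, lr=0.05):
        self.theta = np.zeros((STATE_DIM, NUM_ITEMS))
        self.lr = lr

    def act(self, state):
        probs = softmax(state @ self.theta)
        item = rng.choice(NUM_ITEMS, p=probs)
        return item, probs

    def reinforce_update(self, state, item, probs, reward):
        # For a linear-softmax policy, grad log pi(a|s) = outer(s, 1[a] - pi).
        one_hot = np.zeros(NUM_ITEMS)
        one_hot[item] = 1.0
        grad_log_pi = np.outer(state, one_hot - probs)
        self.theta += self.lr * reward * grad_log_pi


imputer = ImputationNetwork()
policy = PolicyNetwork()

for step in range(1000):
    state = rng.normal(size=STATE_DIM)           # hypothetical user state
    item, probs = policy.act(state)
    engagement = float(rng.random() < 0.3)       # simulated engagement feedback
    satisfaction = imputer.predict(state, item)  # imputed satisfaction for the shown item
    reward = engagement + LAMBDA_SAT * satisfaction  # shaped reward
    policy.reinforce_update(state, item, probs, reward)
```

In this sketch the imputation model is frozen for clarity; the joint learning described in the abstract would additionally update the imputation network from the (sparse) observed survey ratings while the policy trains on the shaped reward.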