Paper Title

Data-Driven Offline Decision-Making via Invariant Representation Learning

Paper Authors

Han Qi, Yi Su, Aviral Kumar, Sergey Levine

Paper Abstract

The goal in offline data-driven decision-making is to synthesize decisions that optimize a black-box utility function, using a previously collected static dataset, with no active interaction. These problems appear in many forms: offline reinforcement learning (RL), where we must produce actions that optimize the long-term reward; bandits from logged data, where the goal is to identify the correct arm; and offline model-based optimization (MBO), where we must find the optimal design given access to only a static dataset. A key challenge in all these settings is distributional shift: when we optimize the input to a model trained on offline data, it is easy to produce an out-of-distribution (OOD) input that appears erroneously good. In contrast to prior approaches that tackle this problem with pessimism or conservatism, in this paper we formulate offline data-driven decision-making as domain adaptation, where the goal is to make accurate predictions for the value of optimized decisions (the "target domain") while training only on the dataset (the "source domain"). This perspective leads to invariant objective models (IOM), our approach for addressing distributional shift by enforcing invariance between the learned representations of the training dataset and of the optimized decisions. In IOM, if the optimized decisions are too different from the training dataset, the representation is forced to lose much of the information that distinguishes good designs from bad ones, making all choices seem mediocre. Critically, when the optimizer is aware of this representational tradeoff, it should choose not to stray too far from the training distribution, leading to a natural trade-off between distributional shift and learning performance.
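
The abstract describes IOM at a high level: fit a value model on the offline dataset while forcing the learned representations of the dataset and of the candidate optimized designs to match, so that designs far from the data lose the features that would make them look spuriously good. The sketch below is a minimal illustration of that idea in PyTorch, not the authors' implementation: the network sizes, the MMD penalty standing in for the invariance term, the `lambda_inv` weight, and the toy data are all assumptions chosen for clarity.

```python
# Illustrative sketch only: supervised value regression plus a representation-
# invariance penalty between dataset designs and candidate optimized designs.
import torch
import torch.nn as nn

class RepresentationModel(nn.Module):
    """Encoder phi(x) followed by a value head f(phi(x))."""
    def __init__(self, x_dim, z_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(x_dim, 64), nn.ReLU(), nn.Linear(64, z_dim))
        self.head = nn.Sequential(
            nn.Linear(z_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        z = self.encoder(x)
        return self.head(z).squeeze(-1), z

def mmd(z_a, z_b):
    """RBF-kernel MMD between two batches of representations (assumed penalty)."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2.0 * z_a.shape[1]))
    return k(z_a, z_a).mean() + k(z_b, z_b).mean() - 2.0 * k(z_a, z_b).mean()

# Toy offline dataset of designs x and noisy utilities y (placeholders for real data).
torch.manual_seed(0)
x_data = torch.randn(256, 8)
y_data = x_data[:, 0] - 0.5 * x_data[:, 1] ** 2 + 0.1 * torch.randn(256)

model = RepresentationModel(x_dim=8)
model_opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Candidate "optimized" designs, updated by gradient ascent on the predicted value.
x_cand = x_data[:32].clone().requires_grad_(True)
design_opt = torch.optim.Adam([x_cand], lr=1e-2)

lambda_inv = 1.0  # weight on the invariance penalty (assumed hyperparameter)

for step in range(500):
    # (1) Model update: regression loss on the dataset ("source domain") plus a
    #     penalty tying dataset and candidate representations together.
    pred, z_data = model(x_data)
    _, z_cand = model(x_cand.detach())
    model_loss = nn.functional.mse_loss(pred, y_data) + lambda_inv * mmd(z_data, z_cand)
    model_opt.zero_grad()
    model_loss.backward()
    model_opt.step()

    # (2) Design update: ascend the model's predicted value at the candidates.
    pred_cand, _ = model(x_cand)
    design_loss = -pred_cand.mean()
    design_opt.zero_grad()
    design_loss.backward()
    design_opt.step()

print("mean predicted value of optimized candidates:", model(x_cand)[0].mean().item())
```

Under these assumptions, a larger `lambda_inv` pushes the two sets of representations closer together, so candidates far from the data cannot obtain inflated predicted values, and the design step has little incentive to stray far from the training distribution, mirroring the trade-off the abstract describes.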
