Design-Based Confidence Sequences: A General Approach to Risk Mitigation in Online Experimentation

Paper Title

Design-Based Confidence Sequences: A General Approach to Risk Mitigation in Online Experimentation

Authors

Dae Woong Ham, Iavor Bojinov, Michael Lindon, Martin Tingley

Abstract

Randomized experiments have become the standard method for companies to evaluate the performance of new products or services. In addition to augmenting managers' decision-making, experimentation mitigates risk by limiting the proportion of customers exposed to innovation. Since many experiments are on customers arriving sequentially, a potential solution is to allow managers to "peek" at the results when new data becomes available and stop the test if the results are statistically significant. Unfortunately, peeking invalidates the statistical guarantees for standard statistical analysis and leads to uncontrolled type-1 error. Our paper provides valid design-based confidence sequences, sequences of confidence intervals with uniform type-1 error guarantees over time for various sequential experiments in an assumption-light manner. In particular, we focus on finite-sample estimands defined on the study participants as a direct measure of the incurred risks by companies. Our proposed confidence sequences are valid for a large class of experiments, including multi-arm bandits, time series, and panel experiments. We further provide a variance reduction technique incorporating modeling assumptions and covariates. Finally, we demonstrate the effectiveness of our proposed approach through a simulation study and three real-world applications from Netflix. Our results show that by using our confidence sequence, harmful experiments could be stopped after only observing a handful of units; for instance, an experiment that Netflix ran on its sign-up page on 30,000 potential customers would have been stopped by our method on the first day before 100 observations.
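
The abstract's claim that peeking leads to uncontrolled type-1 error can be illustrated with a small simulation. The sketch below is not the paper's design-based confidence sequence. It is a hypothetical A/A test (true effect of zero, unit-variance normal outcomes, a two-sided z-test at the nominal 5% level), and the function name, sample sizes, and threshold are all assumptions chosen for illustration. It compares a manager who checks significance after every new observation with one who checks only once at the end.

```python
import math
import random


def false_positive_rates(n_sims=1000, n_obs=500, first_look=10,
                         z_crit=1.96, seed=0):
    """Simulate A/A tests and return (peeking_rate, fixed_horizon_rate).

    Every setting here is an illustrative assumption, not taken from the paper.
    """
    rng = random.Random(seed)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(n_sims):
        total = 0.0
        crossed = False
        z = 0.0
        for n in range(1, n_obs + 1):
            total += rng.gauss(0.0, 1.0)  # true effect is exactly zero
            z = total / math.sqrt(n)      # z-statistic with known unit variance
            # Peeking: declare significance the first time |z| crosses the bound.
            if n >= first_look and abs(z) > z_crit:
                crossed = True
        peeking_hits += crossed
        # Fixed horizon: a single test using only the final z-statistic.
        fixed_hits += abs(z) > z_crit
    return peeking_hits / n_sims, fixed_hits / n_sims


peeking_rate, fixed_rate = false_positive_rates()
print(f"false positive rate with peeking:        {peeking_rate:.3f}")
print(f"false positive rate with a single look:  {fixed_rate:.3f}")
```

Under these assumptions, the single-look rate should stay near the nominal 5%, while the peeking rate should come out several times larger. A confidence sequence is constructed so that its coverage guarantee holds uniformly over all looks, which is what makes stopping at any time valid.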
