Estimation in Exponential Family Regression Based on Linked Data Contaminated by Mismatch Error

Paper Title

Estimation in Exponential Family Regression Based on Linked Data Contaminated by Mismatch Error

Authors

Zhenbang Wang, Emanuel Ben-David, Martin Slawski

Abstract

Identification of matching records in multiple files can be a challenging and error-prone task. Linkage error can considerably affect subsequent statistical analysis based on the resulting linked file. Several recent papers have studied post-linkage linear regression analysis with the response variable in one file and the covariates in a second file from the perspective of the "Broken Sample Problem" and "Permuted Data". In this paper, we present an extension of this line of research to exponential family response given the assumption of a small to moderate number of mismatches. A method based on observation-specific offsets to account for potential mismatches and $\ell_1$-penalization is proposed, and its statistical properties are discussed. We also present sufficient conditions for the recovery of the correct correspondence between covariates and responses if the regression parameter is known. The proposed approach is compared to established baselines, namely the methods by Lahiri-Larsen and Chambers, both theoretically and empirically based on synthetic and real data. The results indicate that substantial improvements over those methods can be achieved even if only limited information about the linkage process is available.
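
As a rough illustration of the offset idea, and not necessarily the exact formulation used in the paper, consider a canonical-link exponential family model with responses $y_i$, covariates $x_i$, and regression parameter $\beta$. One way to sketch the approach is to attach an offset $\gamma_i$ to each observation and penalize the offsets with an $\ell_1$ norm. The cumulant function $\psi$, the offsets $\gamma_i$, and the tuning parameter $\lambda > 0$ are notation assumed here for illustration.

$$
\min_{\beta,\,\gamma}\; \frac{1}{n}\sum_{i=1}^{n}\Big[\psi\big(x_i^\top\beta+\gamma_i\big) - y_i\,\big(x_i^\top\beta+\gamma_i\big)\Big] \;+\; \lambda\sum_{i=1}^{n}|\gamma_i|
$$

Under this reading, a nonzero $\gamma_i$ flags observation $i$ as a potential mismatch, and the $\ell_1$ penalty keeps most offsets at zero, in line with the assumption of a small to moderate number of mismatches.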
