Correction of overfitting bias in regression models

Paper title

Correction of overfitting bias in regression models

Authors

Massa, Emanuele; Jonker, Marianne; Roes, Kit; Coolen, Anthony

Abstract

Regression analysis based on many covariates is becoming increasingly common. However, when the number of covariates $p$ is of the same order as the number of observations $n$, maximum likelihood (ML) regression becomes unreliable due to overfitting. This typically leads to systematic estimation biases and increased estimator variances. It is crucial for inference and prediction to quantify these effects correctly. Several methods have been proposed in the literature to overcome overfitting bias or adjust estimates. The vast majority of these focus on the regression parameters. But failure to also estimate the nuisance parameters correctly may lead to significant errors in confidence statements and outcome prediction. In this paper we present a jackknife method for deriving a compact set of non-linear equations which describe the statistical properties of the ML estimator in the regime where $p=O(n)$ and under the hypothesis of normally distributed covariates. These equations enable one to compute the overfitting bias of ML estimators in parametric regression models as functions of $\zeta = p/n$. We then use these equations to compute shrinkage factors in order to remove the overfitting bias of ML estimators. This new derivation offers various benefits over the replica approach in terms of increased transparency and reduced assumptions. To illustrate the theory we performed simulation studies for multiple regression models. In all cases we find excellent agreement between theory and simulations.
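
To make the role of $\zeta = p/n$ concrete, here is a minimal simulation sketch for the simplest case, ordinary linear regression with Gaussian covariates. It does not reproduce the paper's jackknife derivation. It relies only on the classical result that the ML estimate of the noise variance, a nuisance parameter, satisfies $E[\hat\sigma^2_{ML}] = \sigma^2(1-\zeta)$, so dividing by $1-\zeta$ removes the bias. The function name `ml_noise_variance` and all parameter values are assumptions chosen for illustration.

```python
import numpy as np

def ml_noise_variance(n, p, sigma=1.0, trials=200, seed=0):
    """Average ML estimate of the noise variance over repeated simulated data sets.

    Hypothetical helper for illustration: linear model y = X @ beta + noise,
    with i.i.d. standard normal covariates and no intercept.
    """
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(trials):
        X = rng.standard_normal((n, p))
        beta = rng.standard_normal(p) / np.sqrt(p)  # arbitrary true coefficients
        y = X @ beta + sigma * rng.standard_normal(n)
        # For Gaussian noise the ML estimate of beta is the least-squares solution.
        beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta_hat
        # ML variance estimate divides by n, not by n - p.
        estimates.append(resid @ resid / n)
    return float(np.mean(estimates))

n, p = 200, 100
zeta = p / n
raw = ml_noise_variance(n, p)   # biased downwards by a factor of (1 - zeta)
corrected = raw / (1 - zeta)    # classical bias correction for this model
print(f"zeta = {zeta:.2f}, raw ML estimate = {raw:.3f}, corrected = {corrected:.3f}")
```

With a true noise variance of 1 and $\zeta = 0.5$, the raw ML estimate should land near $1-\zeta$ and the corrected value near 1, up to simulation noise. For other regression models no such closed form is generally available, which is the gap the equations described in the abstract are meant to fill.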

Paper title