Paper Title
Revisiting minimum description length complexity in overparameterized models
Paper Authors
Paper Abstract
Complexity is a fundamental concept underlying statistical learning theory that aims to inform generalization performance. Parameter count, while successful in low-dimensional settings, is not well-justified in overparameterized settings, where the number of parameters exceeds the number of training samples. We revisit complexity measures based on Rissanen's principle of minimum description length (MDL) and define a novel MDL-based complexity (MDL-COMP) that remains valid for overparameterized models. MDL-COMP is defined via an optimality criterion over the encodings induced by a good Ridge estimator class. We provide an extensive theoretical characterization of MDL-COMP for linear models and kernel methods and show that it is not just a function of parameter count, but rather a function of the singular values of the design or kernel matrix and the signal-to-noise ratio. For a linear model with $n$ observations, $d$ parameters, and i.i.d. Gaussian predictors, MDL-COMP scales linearly with $d$ when $d<n$, but the scaling is exponentially smaller -- $\log d$ -- for $d>n$. For kernel methods, we show that MDL-COMP informs minimax in-sample error, and can decrease as the dimensionality of the input increases. We also prove that MDL-COMP upper bounds the in-sample mean squared error (MSE). Via an array of simulations and real-data experiments, we show that a data-driven Prac-MDL-COMP informs hyperparameter tuning for optimizing test MSE with ridge regression in limited-data settings, sometimes improving upon cross-validation and (always) saving computational costs. Finally, our findings also suggest that the recently observed double descent phenomenon in overparameterized models might be a consequence of the choice of non-ideal estimators.
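To make the Prac-MDL-COMP idea from the abstract concrete, here is a minimal Python sketch of MDL-style hyperparameter selection for ridge regression. The objective below is an assumed code-length-style form, not the paper's exact definition: it combines the ridge residual, the ridge penalty, and a log-determinant term over the eigenvalues of $X^\top X$. The names prac_mdl_objective and select_lambda, the noise level sigma2, and the grid search are all illustrative.

```python
import numpy as np

def ridge_estimate(X, y, lam):
    """Ridge solution theta_hat = (X^T X + lam I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def prac_mdl_objective(X, y, lam, sigma2=1.0):
    """Assumed code-length-style objective: residual fit + ridge penalty
    + a log-determinant term over the eigenvalues of X^T X."""
    n = X.shape[0]
    theta = ridge_estimate(X, y, lam)
    resid = y - X @ theta
    eigs = np.clip(np.linalg.eigvalsh(X.T @ X), 0.0, None)  # guard round-off
    fit = resid @ resid / (2 * sigma2)
    penalty = lam * (theta @ theta) / (2 * sigma2)
    code = 0.5 * np.sum(np.log1p(eigs / lam))
    return (fit + penalty + code) / n

def select_lambda(X, y, lambdas, sigma2=1.0):
    """Pick the lambda minimizing the objective over a grid."""
    vals = [prac_mdl_objective(X, y, lam, sigma2) for lam in lambdas]
    return lambdas[int(np.argmin(vals))]

# Toy usage in an overparameterized setting (d > n)
rng = np.random.default_rng(0)
n, d = 50, 200
X = rng.standard_normal((n, d))
theta_star = rng.standard_normal(d) / np.sqrt(d)
y = X @ theta_star + 0.5 * rng.standard_normal(n)
grid = np.logspace(-3, 3, 25)
print("selected lambda:", select_lambda(X, y, grid, sigma2=0.25))
```

Selecting $\lambda$ by minimizing a single data-driven objective needs only one fit per candidate value, which is where the claimed computational savings over $k$-fold cross-validation (which needs $k$ fits per candidate) would come from.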