Title

Improving Computational Complexity in Statistical Models with Second-Order Information

Authors

Tongzheng Ren, Jiacheng Zhuo, Sujay Sanghavi, Nhat Ho

Abstract

It is known that when a statistical model is singular, i.e., the Fisher information matrix at the true parameter is degenerate, the fixed step-size gradient descent algorithm takes a polynomial number of steps in terms of the sample size $n$ to converge to a final statistical radius around the true parameter, which can be unsatisfactory in applications. To improve this computational complexity, we consider the use of second-order information in the design of optimization algorithms. Specifically, we study the normalized gradient descent (NormGD) algorithm for parameter estimation in parametric statistical models, a variant of gradient descent whose step size is scaled by the maximum eigenvalue of the Hessian matrix of the empirical loss function of the statistical model. When the population loss function, i.e., the limit of the empirical loss function as $n$ goes to infinity, is homogeneous in all directions, we demonstrate that the NormGD iterates reach a final statistical radius around the true parameter after a logarithmic number of iterations in terms of $n$. Therefore, for fixed dimension $d$, the NormGD algorithm achieves the optimal overall computational complexity $\mathcal{O}(n)$ to reach the final statistical radius. This computational complexity is cheaper than that of the fixed step-size gradient descent algorithm, which is of the order $\mathcal{O}(n^{\tau})$ for some $\tau > 1$, to reach the same statistical radius. We illustrate our general theory under two statistical models: generalized linear models and mixture models, and the experimental results support the predictions of our general theory.
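
To make the update rule concrete, the following is a minimal sketch of NormGD as described in the abstract: each step rescales the gradient step size by the largest eigenvalue of the Hessian of the empirical loss. The least-squares objective, the synthetic data, and the names norm_gd, grad_fn, and hess_fn are illustrative assumptions for this sketch, not the paper's models or implementation.

import numpy as np

def norm_gd(grad_fn, hess_fn, theta0, eta=1.0, num_iters=50):
    # NormGD update: theta <- theta - (eta / lambda_max(Hessian)) * gradient,
    # where lambda_max is the largest eigenvalue of the Hessian of the empirical loss.
    theta = np.asarray(theta0, dtype=float)
    for _ in range(num_iters):
        grad = grad_fn(theta)
        lam_max = np.linalg.eigvalsh(hess_fn(theta)).max()
        theta = theta - (eta / lam_max) * grad
    return theta

# Hypothetical example: least-squares loss L_n(theta) = ||X theta - y||^2 / (2n),
# used only as a stand-in objective to show the scaling of the step size.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
theta_star = np.array([1.0, -2.0, 0.5])
y = X @ theta_star + 0.1 * rng.normal(size=100)

grad_fn = lambda th: X.T @ (X @ th - y) / len(y)   # gradient of the empirical loss
hess_fn = lambda th: X.T @ X / len(y)              # Hessian of the empirical loss

print(norm_gd(grad_fn, hess_fn, np.zeros(3)))

For this quadratic objective the Hessian is constant, so NormGD reduces to gradient descent with step size eta over the largest Hessian eigenvalue; in the singular models studied in the paper the Hessian varies with the iterate, which is where the rescaling matters.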
