Paper Title

Chaotic Regularization and Heavy-Tailed Limits for Deterministic Gradient Descent

Paper Authors

Lim, Soon Hoe, Wan, Yijun, Şimşekli, Umut

Abstract

Recent studies have shown that gradient descent (GD) can achieve improved generalization when its dynamics exhibit chaotic behavior. However, to obtain the desired effect, the step-size should be chosen sufficiently large, a task which is problem-dependent and can be difficult in practice. In this study, we incorporate a chaotic component into GD in a controlled manner, and introduce multiscale perturbed GD (MPGD), a novel optimization framework where the GD recursion is augmented with chaotic perturbations that evolve via an independent dynamical system. We analyze MPGD from three different angles: (i) By building on recent advances in rough paths theory, we show that, under appropriate assumptions, as the step-size decreases, the MPGD recursion converges weakly to a stochastic differential equation (SDE) driven by a heavy-tailed Lévy-stable process. (ii) By making connections to recently developed generalization bounds for heavy-tailed processes, we derive a generalization bound for the limiting SDE and relate the worst-case generalization error over the trajectories of the process to the parameters of MPGD. (iii) We analyze the implicit regularization effect brought by the dynamical regularization and show that, in the weak perturbation regime, MPGD introduces terms that penalize the Hessian of the loss function. Empirical results are provided to demonstrate the advantages of MPGD.
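The core idea of the abstract — a GD recursion augmented with perturbations generated by an independent chaotic dynamical system — can be sketched as follows. This is a minimal illustration, not the paper's exact construction: the choice of chaotic system (here the logistic map), the coupling strength `eps`, and the scaling of the perturbation by the step-size are all illustrative assumptions.

```python
import numpy as np

def logistic_map(z, r=4.0):
    # Logistic map on [0, 1]; fully chaotic at r = 4.
    return r * z * (1.0 - z)

def mpgd(grad, theta0, z0, eta=0.05, eps=0.1, n_steps=200):
    """Sketch of an MPGD-style recursion: each GD step is augmented
    with a perturbation produced by an independent chaotic system.
    `eps` (coupling strength) and the logistic map are assumptions
    made for illustration only."""
    theta = np.asarray(theta0, dtype=float)
    z = np.asarray(z0, dtype=float)
    for _ in range(n_steps):
        z = logistic_map(z)                # chaotic state evolves on its own
        perturb = eps * eta * (z - 0.5)    # centred, step-size-scaled perturbation
        theta = theta - eta * grad(theta) + perturb
    return theta

# Usage: minimize f(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
theta_star = mpgd(lambda t: t, theta0=[1.0, -2.0], z0=[0.3, 0.7])
```

Because the perturbation is bounded and scaled by the step-size, the iterates settle into a small neighborhood of the minimizer rather than converging exactly, which is the "controlled" chaotic component the abstract describes.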
