Paper Title
Backtracking Gradient Descent allowing unbounded learning rates
Paper Authors
Abstract
In unconstrained optimisation on a Euclidean space, to prove convergence of Gradient Descent (GD) processes $x_{n+1}=x_n-\delta_n \nabla f(x_n)$, it is usually required that the learning rates $\delta_n$ are bounded: $\delta_n\leq \delta$ for some positive $\delta$. Under this assumption, if the sequence $x_n$ converges to a critical point $z$, then for large values of $n$ the update will be small, because $||x_{n+1}-x_n||\lesssim ||\nabla f(x_n)||$. This may also force the sequence to converge to a bad minimum. If we can allow, at least theoretically, the learning rates $\delta_n$ to be unbounded, then we may obtain better convergence to better minima. A previous joint paper by the author showed convergence of the usual version of Backtracking GD under very general assumptions on the cost function $f$. In this paper, we allow the learning rates $\delta_n$ to be unbounded, in the sense that there is a function $h:(0,\infty)\rightarrow (0,\infty)$ such that $\lim _{t\rightarrow 0}th(t)=0$ and $\delta_n\lesssim \max \{h(x_n),\delta\}$ satisfies Armijo's condition for all $n$, and we prove convergence under the same assumptions as in the mentioned paper. It will be shown that this growth rate of $h$ is the best possible if one wants convergence of the sequence $\{x_n\}$. A specific discrete scheme for choosing $\delta_n$ connects to Two-way Backtracking GD, defined in the mentioned paper. We provide some results which either improve on or are implicitly contained in those of the mentioned paper and another recent paper on avoidance of saddle points.
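To make the idea concrete, here is a minimal sketch (not the paper's exact algorithm) of Backtracking GD in which the starting learning rate is allowed to grow as the gradient shrinks: the trial rate is taken as $\max\{h(t),\delta_0\}$ with $t=||\nabla f(x_n)||$, and then reduced geometrically until Armijo's condition $f(x-\delta\nabla f(x))\leq f(x)-\alpha\,\delta\,||\nabla f(x)||^2$ holds. The particular choice $h(t)=1/\sqrt{t}$ is only an illustrative example satisfying $\lim_{t\rightarrow 0}th(t)=0$; the function names and constants below are assumptions for the sketch, not taken from the paper.

```python
import numpy as np

def backtracking_gd(f, grad_f, x0, delta0=1.0, alpha=0.5, beta=0.5,
                    h=lambda t: 1.0 / np.sqrt(t), n_iters=100, tol=1e-10):
    """Sketch of Backtracking GD with an (in principle) unbounded trial rate.

    The trial learning rate is max{h(||grad f(x)||), delta0}, where h
    satisfies t*h(t) -> 0 as t -> 0 (e.g. h(t) = 1/sqrt(t), since
    t*h(t) = sqrt(t)).  The rate is then halved (factor beta) until
    Armijo's condition holds.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        g = grad_f(x)
        gnorm = np.linalg.norm(g)
        if gnorm < tol:  # (numerically) at a critical point
            break
        # Trial rate may blow up as ||grad f(x)|| -> 0 ...
        delta = max(h(gnorm), delta0)
        # ... then backtrack until Armijo's condition holds:
        #   f(x - delta*g) <= f(x) - alpha * delta * ||g||^2
        fx = f(x)
        while f(x - delta * g) > fx - alpha * delta * gnorm**2:
            delta *= beta
        x = x - delta * g
    return x

# Illustrative usage on a simple quadratic cost:
x_min = backtracking_gd(lambda x: x @ x, lambda x: 2.0 * x, [3.0, -4.0])
```

Because the trial rate is re-chosen at every step from $h$, the accepted rates $\delta_n$ need not stay below any fixed bound, which is the point of the relaxation described above.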