Paper Title

High Probability Bounds for a Class of Nonconvex Algorithms with AdaGrad Stepsize

Paper Authors

Ali Kavis, Kfir Yehuda Levy, Volkan Cevher

Abstract


In this paper, we propose a new, simplified high-probability analysis of AdaGrad for smooth, nonconvex problems. More specifically, we focus on a particular accelerated gradient (AGD) template (Lan, 2020), through which we recover the original AdaGrad and its variant with averaging, and prove a convergence rate of $\mathcal{O}(1/\sqrt{T})$ with high probability without knowledge of the smoothness or the variance. We use a particular version of Freedman's concentration bound for martingale difference sequences (Kakade & Tewari, 2008), which enables us to achieve the best-known dependence of $\log(1/\delta)$ on the probability margin $\delta$. We present our analysis in a modular way and obtain a complementary $\mathcal{O}(1/T)$ convergence rate in the deterministic setting. To the best of our knowledge, this is the first high-probability result for AdaGrad with a truly adaptive scheme, i.e., one completely oblivious to knowledge of the smoothness and of a uniform variance bound, while simultaneously having the best-known $\log(1/\delta)$ dependence. We further prove a noise-adaptation property of AdaGrad under additional noise assumptions.
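For reference, the standard norm-version AdaGrad stepsize referred to in the title is typically written as

$$x_{t+1} = x_t - \eta_t\, g_t, \qquad \eta_t = \frac{\eta}{\sqrt{b_0^2 + \sum_{s=1}^{t} \|g_s\|^2}},$$

where $g_s$ are the observed (stochastic) gradients and $\eta$, $b_0$ are user-chosen constants. This is only a sketch of the generic scalar AdaGrad update, not necessarily the exact AGD-template variant analyzed in the paper; its key feature, which "truly adaptive" refers to in the abstract, is that neither the smoothness constant nor a variance bound appears anywhere in the update.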
