Paper Title

High Probability Bounds for a Class of Nonconvex Algorithms with AdaGrad Stepsize

Paper Authors

Ali Kavis, Kfir Yehuda Levy, Volkan Cevher

Abstract


In this paper, we propose a new, simplified high-probability analysis of AdaGrad for smooth, nonconvex problems. More specifically, we focus on a particular accelerated gradient (AGD) template (Lan, 2020), through which we recover the original AdaGrad and its variant with averaging, and prove a convergence rate of $\mathcal{O}(1/\sqrt{T})$ with high probability without knowledge of the smoothness or the variance. We use a particular version of Freedman's concentration bound for martingale difference sequences (Kakade & Tewari, 2008), which enables us to achieve the best-known dependence of $\log(1/\delta)$ on the probability margin $\delta$. We present our analysis in a modular way and obtain a complementary $\mathcal{O}(1/T)$ convergence rate in the deterministic setting. To the best of our knowledge, this is the first high-probability result for AdaGrad with a truly adaptive scheme, i.e., one completely oblivious to knowledge of the smoothness and of a uniform variance bound, while simultaneously having the best-known $\log(1/\delta)$ dependence. We further prove a noise-adaptation property of AdaGrad under additional noise assumptions.
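For reference, the standard norm-version AdaGrad stepsize referred to in the title is typically written as

$$x_{t+1} = x_t - \eta_t\, g_t, \qquad \eta_t = \frac{\eta}{\sqrt{b_0^2 + \sum_{s=1}^{t} \|g_s\|^2}},$$

where $g_s$ are the observed (stochastic) gradients and $\eta$, $b_0$ are user-chosen constants. This is only a sketch of the generic scalar AdaGrad update, not necessarily the exact AGD-template variant analyzed in the paper; its key feature, which "truly adaptive" refers to in the abstract, is that neither the smoothness constant nor a variance bound appears anywhere in the update.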
