Title
Feature selection with gradient descent on two-layer networks in low-rotation regimes
Authors
Abstract
This work establishes low test error of gradient flow (GF) and stochastic gradient descent (SGD) on two-layer ReLU networks with standard initialization, in three regimes where key sets of weights rotate little (either naturally due to GF and SGD, or due to an artificial constraint), and makes use of margins as the core analytic technique. The first regime is near initialization, specifically until the weights have moved by $\mathcal{O}(\sqrt m)$, where $m$ denotes the network width, which is in sharp contrast to the $\mathcal{O}(1)$ weight motion allowed by the Neural Tangent Kernel (NTK); here it is shown that GF and SGD only need a network width and number of samples inversely proportional to the NTK margin, and moreover that GF attains at least the NTK margin itself, which suffices to establish escape from bad KKT points of the margin objective, whereas prior work could only establish nondecreasing but arbitrarily small margins. The second regime is the Neural Collapse (NC) setting, where data lies in extremely well-separated groups, and the sample complexity scales with the number of groups; here the contribution over prior work is an analysis of the entire GF trajectory from initialization. Lastly, if the inner layer weights are constrained to change in norm only and cannot rotate, then GF with large widths achieves globally maximal margins, and its sample complexity scales with their inverse; this is in contrast to prior work, which required infinite width and a tricky dual convergence assumption. As purely technical contributions, this work develops a variety of potential functions and other tools which will hopefully aid future work.
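To make the first regime's central quantity concrete, the sketch below is a purely illustrative toy (not from the paper; the architecture details, hinge loss, and all hyperparameters are invented for this example): it trains a tiny two-layer ReLU network with single-sample SGD, then measures the total inner-weight movement $\|W_t - W_0\|_F$, which the abstract compares against the $\mathcal{O}(\sqrt m)$ scale, and the per-neuron rotation away from initialization.

```python
# Illustrative sketch only: a two-layer ReLU net y(x) = sum_j a_j * relu(w_j . x)
# with the outer layer a fixed at init, trained by single-sample SGD on hinge loss.
# We track (a) total inner-weight movement ||W - W0||_F, compared to sqrt(m),
# and (b) per-neuron cosine similarity to the initial direction (rotation).
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 10, 512, 200                              # input dim, width, samples
X = rng.standard_normal((n, d))
y = np.sign(X[:, 0] + 0.1 * rng.standard_normal(n)) # noisy linearly-separable-ish labels

W = rng.standard_normal((m, d)) / np.sqrt(d)        # standard Gaussian init
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)    # fixed outer weights
W0 = W.copy()

lr = 0.1
for _ in range(500):
    i = rng.integers(n)                             # single-sample SGD step
    h = np.maximum(X[i] @ W.T, 0.0)                 # hidden activations, shape (m,)
    out = h @ a
    if y[i] * out < 1.0:                            # hinge loss: update on margin violation
        # d out / d w_j = a_j * 1[w_j . x > 0] * x
        grad = -y[i] * (a * (h > 0))[:, None] * X[i][None, :]
        W -= lr * grad

movement = np.linalg.norm(W - W0)                   # total inner-weight motion
cos = np.sum(W * W0, axis=1) / (
    np.linalg.norm(W, axis=1) * np.linalg.norm(W0, axis=1)
)                                                   # per-neuron rotation: near 1 => little rotation
print(f"||W - W0||_F = {movement:.3f}  (sqrt(m) = {np.sqrt(m):.1f})")
print(f"min per-neuron cosine to init = {cos.min():.3f}")
```

With wide networks and small learning rates, the printed movement stays far below $\sqrt m$ and the cosines stay near 1, which is the kind of "low-rotation" behavior the abstract's first regime exploits beyond the stricter $\mathcal{O}(1)$ NTK window.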