Paper Title

Backward Feature Correction: How Deep Learning Performs Deep (Hierarchical) Learning

Paper Authors

Zeyuan Allen-Zhu, Yuanzhi Li

Paper Abstract

Deep learning is also known as hierarchical learning, where the learner _learns_ to represent a complicated target function by decomposing it into a sequence of simpler functions to reduce sample and time complexity. This paper formally analyzes how multi-layer neural networks can perform such hierarchical learning _efficiently_ and _automatically_ by SGD on the training objective. On the conceptual side, we present a theoretical characterization of how certain types of deep (i.e., super-constant layer) neural networks can still be trained sample- and time-efficiently on some hierarchical tasks for which no existing algorithm (including layerwise training, kernel methods, etc.) is known to be efficient. We establish a new principle called "backward feature correction," where errors in the lower-level features can be automatically corrected when training together with the higher-level layers. We believe this is a key to how deep learning performs deep (hierarchical) learning, as opposed to layerwise learning or simulating some non-hierarchical method. On the technical side, we show that for every input dimension $d > 0$, there is a concept class of degree-$\omega(1)$ multivariate polynomials such that, using $\omega(1)$-layer neural networks as learners, SGD can learn any function from this class in $\mathsf{poly}(d)$ time to any $\frac{1}{\mathsf{poly}(d)}$ error, by learning to represent it as a composition of $\omega(1)$ layers of quadratic functions using "backward feature correction." In contrast, we do not know of any simpler algorithm (including layerwise training, applying kernel methods sequentially, training a two-layer network, etc.) that can learn this concept class in $\mathsf{poly}(d)$ time even to any $d^{-0.01}$ error. As a side result, we prove $d^{\omega(1)}$ lower bounds for several non-hierarchical learners, including any kernel method.
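
The following is a minimal, hedged PyTorch sketch meant only to illustrate the mechanism the abstract describes, not the paper's actual construction or proof setup: a toy target built as a composition of quadratic maps is fit end-to-end by SGD on a deep network with quadratic activations, so that gradients from the higher layers keep updating (i.e., "correcting") the lower-level features during joint training. All sizes, the normalization, and the hyperparameters are illustrative assumptions.

```python
# Toy sketch (illustrative assumptions only; NOT the paper's construction):
# a target that is a composition of quadratic maps, fit by a deep quadratic-activation
# network trained end-to-end with SGD, so gradients from higher layers keep adjusting
# ("correcting") the lower layers -- the mechanism called "backward feature correction".
# Freezing lower layers (greedy layerwise training) would remove exactly this signal.
import torch

torch.manual_seed(0)
d, width, depth, n = 20, 64, 3, 4096  # illustrative sizes, not from the paper

# Hierarchical target: each level applies a random linear map followed by squaring.
teacher = [torch.randn(d, d) / d ** 0.5 for _ in range(depth)]

def target(x):
    h = x
    for W in teacher:
        h = (h @ W) ** 2                              # quadratic feature at this level
        h = h / (h.norm(dim=1, keepdim=True) + 1e-6)  # keep scales bounded
    return h.sum(dim=1, keepdim=True)

class QuadNet(torch.nn.Module):
    """Learner: `depth` layers with quadratic activations plus a linear head."""
    def __init__(self):
        super().__init__()
        dims = [d] + [width] * depth
        self.layers = torch.nn.ModuleList(
            [torch.nn.Linear(dims[i], dims[i + 1]) for i in range(depth)])
        self.head = torch.nn.Linear(width, 1)

    def forward(self, x):
        h = x
        for layer in self.layers:
            h = layer(h) ** 2                         # quadratic activation
        return self.head(h)

x = torch.randn(n, d)
y = target(x)

model = QuadNet()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
for step in range(2001):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()      # every layer receives a gradient jointly,
    opt.step()           # unlike layerwise training with frozen lower layers
    if step % 500 == 0:
        print(f"step {step:5d}  mse {loss.item():.4f}")
```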
