Paper Title
Why Are Convolutional Nets More Sample-Efficient than Fully-Connected Nets?
Paper Authors
Paper Abstract
Convolutional neural networks often dominate their fully-connected counterparts in generalization performance, especially on image classification tasks. This is often explained in terms of a 'better inductive bias'. However, this explanation has not been made mathematically rigorous, and the hurdle is that a fully-connected net can always simulate a convolutional net (for a fixed task). Thus the training algorithm must play a role. The current work describes a natural task on which a provable sample complexity gap can be shown for standard training algorithms. We construct a single natural distribution on $\mathbb{R}^d \times \{\pm 1\}$ on which any orthogonal-invariant algorithm (i.e., a fully-connected network trained with most gradient-based methods from Gaussian initialization) requires $\Omega(d^2)$ samples to generalize, while $O(1)$ samples suffice for convolutional architectures. Furthermore, we demonstrate a single target function such that learning it on all possible distributions leads to an $O(1)$ vs. $\Omega(d^2/\varepsilon)$ gap. The proof relies on the fact that SGD on fully-connected networks is orthogonal equivariant. Similar results hold for $\ell_2$ regression and for adaptive training algorithms such as Adam and AdaGrad, which are only permutation equivariant.
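To make the equivariance claim concrete, the following minimal NumPy sketch (not taken from the paper's code) checks that one SGD step on a one-hidden-layer ReLU net commutes with an orthogonal rotation $U$ of the inputs, provided the first-layer weights are rotated the same way. The dimensions, learning rate, and squared loss are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, lr = 8, 16, 0.1  # illustrative input dim, hidden width, step size

# Random orthogonal matrix U via QR decomposition.
U, _ = np.linalg.qr(rng.standard_normal((d, d)))

x = rng.standard_normal(d)        # a single training example
y = 1.0                           # its label
W = rng.standard_normal((h, d))   # first-layer weights (Gaussian init)
a = rng.standard_normal(h)        # output-layer weights

def relu(z):
    return np.maximum(z, 0.0)

def sgd_step(W, a, x, y):
    """One SGD step on the squared loss 0.5 * (a^T relu(Wx) - y)^2."""
    z = W @ x
    err = a @ relu(z) - y
    grad_W = err * np.outer(a * (z > 0), x)   # d loss / d W
    grad_a = err * relu(z)                    # d loss / d a
    return W - lr * grad_W, a - lr * grad_a

# Train on the original data vs. on rotated data with rotated initialization.
W1, a1 = sgd_step(W, a, x, y)
W2, a2 = sgd_step(W @ U.T, a, U @ x, y)

# The two trajectories remain related by the same rotation (W2 == W1 U^T),
# so the learned functions agree on correspondingly rotated inputs.
print(np.allclose(W2, W1 @ U.T), np.allclose(a2, a1))  # True True
```

Because a generic rotation destroys coordinate structure such as locality, an algorithm with this symmetry cannot exploit it, whereas convolutional architectures can; this is the intuition behind the sample complexity gap stated in the abstract.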