Paper Title
Challenging Common Assumptions about Catastrophic Forgetting
Paper Authors
Paper Abstract
Building learning agents that can progressively learn and accumulate knowledge is the core goal of the continual learning (CL) research field. Unfortunately, training a model on new data usually compromises its performance on past data. In the CL literature, this effect is referred to as catastrophic forgetting (CF). CF has been extensively studied, and a plethora of methods have been proposed to address it on short sequences of non-overlapping tasks. In such setups, CF always leads to a quick and significant drop in performance on past tasks. Nevertheless, despite CF, recent work showed that SGD training of linear models accumulates knowledge in a CL regression setup. This phenomenon becomes especially visible when tasks reoccur. We might then wonder whether DNNs trained with SGD, or any standard gradient-based optimization, accumulate knowledge in the same way. Such a phenomenon would have interesting consequences for applying DNNs to real continual scenarios. Indeed, standard gradient-based optimization methods are significantly less computationally expensive than existing CL algorithms. In this paper, we study progressive knowledge accumulation (KA) in DNNs trained with gradient-based algorithms on long sequences of tasks with data re-occurrence. We propose a new framework, SCoLe (Scaling Continual Learning), to investigate KA and discover that catastrophic forgetting has a limited effect on DNNs trained with SGD. When DNNs are trained on long sequences in which data re-occurs sparsely, their overall accuracy improves over time, which may seem counter-intuitive given the CF phenomenon. We empirically investigate KA in DNNs under various data occurrence frequencies and propose simple and scalable strategies to increase knowledge accumulation in DNNs.
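To make the described setup concrete, the following is a minimal sketch of a SCoLe-style training loop in PyTorch: a plain classifier is trained with vanilla SGD (no CL-specific mechanism) over a long sequence of short tasks, each task drawn from a small random subset of classes so that every class re-occurs sparsely, while overall accuracy on all classes is tracked over time. The choice of MNIST as the base dataset, the two-classes-per-task split, the helper names (make_task_loader, overall_accuracy), and all hyperparameters are illustrative assumptions, not the paper's exact configuration.

# Minimal sketch of a SCoLe-style protocol (illustrative; dataset choice,
# helper names, and hyperparameters are assumptions, not the paper's settings).
import random
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
tfm = transforms.ToTensor()
train_set = datasets.MNIST("data", train=True, download=True, transform=tfm)
test_set = datasets.MNIST("data", train=False, download=True, transform=tfm)
test_loader = DataLoader(test_set, batch_size=256)

# Plain classifier trained with vanilla SGD -- no replay, regularization, or
# other CL-specific mechanism.
model = nn.Sequential(
    nn.Flatten(), nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10)
).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# Pre-compute the indices of each class once.
targets = train_set.targets
class_indices = {c: (targets == c).nonzero(as_tuple=True)[0].tolist() for c in range(10)}

def make_task_loader(classes, samples_per_class=256):
    """Build a loader for one task: a small random subset of a few classes."""
    idx = [i for c in classes for i in random.sample(class_indices[c], samples_per_class)]
    return DataLoader(Subset(train_set, idx), batch_size=64, shuffle=True)

@torch.no_grad()
def overall_accuracy():
    """Accuracy over all 10 classes, used to track knowledge accumulation."""
    model.eval()
    correct = total = 0
    for x, y in test_loader:
        pred = model(x.to(device)).argmax(dim=1).cpu()
        correct += (pred == y).sum().item()
        total += y.numel()
    model.train()
    return correct / total

# Long sequence of short tasks; each task exposes only 2 of the 10 classes,
# so every class re-occurs sparsely over the sequence.
for task_id in range(500):
    classes = random.sample(range(10), k=2)
    for x, y in make_task_loader(classes):
        opt.zero_grad()
        loss_fn(model(x.to(device)), y.to(device)).backward()
        opt.step()
    if task_id % 50 == 0:
        print(f"task {task_id:4d}  classes={classes}  overall acc={overall_accuracy():.3f}")

Although each individual task overwrites part of what was learned before (the per-task forgetting the abstract mentions), tracking overall_accuracy across the long sequence is what makes any gradual knowledge accumulation visible.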