Paper Title
An Empirical Study on the Bugs Found while Reusing Pre-trained Natural Language Processing Models
Paper Authors
Paper Abstract
In NLP, reusing pre-trained models instead of training from scratch has gained popularity; however, NLP models are mostly black boxes, very large, and often require significant resources. To ease this burden, models trained on large corpora are made available, and developers reuse them for different problems. In contrast, for traditional DL problems, developers mostly build their models from scratch, which gives them control over the choice of algorithms, data processing, model structure, hyperparameter tuning, etc. In NLP, by contrast, the reuse of pre-trained models leaves developers little to no control over such design decisions; instead, they apply fine-tuning or transfer learning to pre-trained models to meet their requirements. Also, NLP models and their corresponding datasets are significantly larger than traditional DL models and require heavy computation. These factors often lead to bugs in systems that reuse pre-trained models. While bugs in traditional DL software have been intensively studied, the extensive reuse and black-box structure of pre-trained models motivate us to ask: What are the different types of bugs that occur while reusing NLP models? What are the root causes of those bugs? How do these bugs affect the system? To answer these questions, we studied the bugs reported while reusing 11 popular NLP models. We mined 9,214 issues from GitHub repositories and identified 984 bugs. We created a taxonomy of bug types, root causes, and impacts. Our observations led to several findings, including: limited access to model internals results in a lack of robustness; lack of input validation leads to the propagation of algorithmic and data bias; and high resource consumption causes more crashes. Our observations also suggest several bug patterns, which would greatly facilitate further efforts to reduce bugs in pre-trained models and in code reuse.
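To make the reuse pattern studied here concrete, below is a minimal sketch (not taken from the paper) of loading and reusing a pre-trained model with the Hugging Face transformers library; the model name bert-base-uncased and the binary-classification setup are illustrative assumptions.

```python
# Minimal sketch of the pre-trained-model reuse pattern the study examines.
# Assumptions (not from the paper): bert-base-uncased, a 2-label task.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Download a model pre-trained on a large corpus; its internals are
# effectively a black box to the reusing developer.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Run an input through the reused model.
inputs = tokenizer(
    "Reusing pre-trained models saves training cost.",
    return_tensors="pt",
)
outputs = model(**inputs)
logits = outputs.logits  # classification head is randomly initialized
```

Note that the added classification head is untrained, so the developer must still fine-tune on task data; this is exactly the tuning/transfer-learning step the abstract describes, and the stage at which many of the studied bugs (input mismatches, resource exhaustion, propagated bias) can surface.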