Paper Title

Moving Stuff Around: A study on efficiency of moving documents into memory for Neural IR models

Paper Authors

Arthur Câmara, Claudia Hauff

Paper Abstract

When training neural rankers using Large Language Models, it is expected that a practitioner would make use of multiple GPUs to accelerate the training time. By using more devices, deep learning frameworks like PyTorch allow the user to drastically increase the available VRAM pool, making larger batches possible during training and therefore shrinking training time. At the same time, one of the most critical processes, generally overlooked when running data-hungry models, is how data is managed between disk, main memory and VRAM. Most open-source research implementations ignore this memory hierarchy and instead resort to loading all documents from disk into main memory, then allowing the framework (e.g., PyTorch) to handle moving data into VRAM. Therefore, with the increasing sizes of datasets dedicated to IR research, a natural question arises: is this the optimal solution for optimizing training time? We study here how three different popular approaches to handling documents for IR datasets behave and how they scale with multiple GPUs. Namely, loading documents directly into memory, reading documents directly from text files with a lookup table, and using a library for handling IR datasets (ir_datasets) differ both in performance (i.e., samples processed per second) and memory footprint. We show that, when using the most popular libraries for neural ranker research (i.e., PyTorch and Hugging Face's Transformers), the practice of loading all documents into main memory is not always the fastest option and is not feasible for setups with more than a couple of GPUs. Meanwhile, a good implementation of data streaming from disk can be faster, while being considerably more scalable. We also show how popular techniques for improving loading times, like memory pinning, multiple workers, and RAMDISK usage, can reduce the training time further with minor memory overhead.
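To make the "lookup table" streaming approach from the abstract concrete, below is a minimal sketch, not the authors' implementation: it assumes a plain-text TSV collection where each line is `doc_id<TAB>text`, keeps only a doc_id-to-byte-offset table in memory, and reads each document from disk on demand. The names (`build_offset_table`, `StreamingDocDataset`) and the collection path are hypothetical; only standard PyTorch `Dataset`/`DataLoader` APIs are used.

```python
from torch.utils.data import DataLoader, Dataset


def build_offset_table(path):
    """Scan the collection once and record the byte offset of every `doc_id<TAB>text` line."""
    offsets = {}
    with open(path, "rb") as f:
        pos = f.tell()
        line = f.readline()
        while line:
            doc_id = line.split(b"\t", 1)[0].decode("utf-8")
            offsets[doc_id] = pos
            pos = f.tell()
            line = f.readline()
    return offsets


class StreamingDocDataset(Dataset):
    """Yields (query, document) text pairs, reading document text lazily from disk."""

    def __init__(self, pairs, collection_path, offsets):
        self.pairs = pairs                      # list of (query_text, doc_id) training pairs
        self.collection_path = collection_path  # e.g. a TSV dump of the document collection
        self.offsets = offsets                  # doc_id -> byte offset lookup table

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        query, doc_id = self.pairs[idx]
        # Opening per item keeps the sketch safe with forked DataLoader workers;
        # a per-worker cached file handle would be a straightforward optimisation.
        with open(self.collection_path, "rb") as f:
            f.seek(self.offsets[doc_id])        # jump straight to the document's line
            doc = f.readline().split(b"\t", 1)[1].decode("utf-8")
        return query, doc


# Hypothetical usage, mapping the abstract's loading-time techniques onto standard
# DataLoader arguments: multiple workers read documents in parallel, and pin_memory
# speeds up host-to-GPU transfers for a small amount of extra host memory. Placing
# the collection file on a tmpfs/RAMDISK mount is the remaining technique mentioned.
#
# offsets = build_offset_table("collection.tsv")
# dataset = StreamingDocDataset(train_pairs, "collection.tsv", offsets)
# loader  = DataLoader(dataset, batch_size=32, num_workers=4, pin_memory=True)
```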
