Paper Title

GROWN+UP: A Graph Representation Of a Webpage Network Utilizing Pre-training

Paper Authors

Benedict Yeoh, Huijuan Wang

Paper Abstract

Large pre-trained neural networks are ubiquitous and critical to the success of many downstream tasks in natural language processing and computer vision. However, within the field of web information retrieval, there is a stark contrast in the lack of similarly flexible and powerful pre-trained models that can properly parse webpages. Consequently, we believe that common machine learning tasks like content extraction and information mining from webpages have low-hanging gains that yet remain untapped. We aim to close the gap by introducing an agnostic deep graph neural network feature extractor that can ingest webpage structures, pre-train self-supervised on massive unlabeled data, and fine-tune to arbitrary tasks on webpages effectually. Finally, we show that our pre-trained model achieves state-of-the-art results using multiple datasets on two very different benchmarks: webpage boilerplate removal and genre classification, thus lending support to its potential application in diverse downstream tasks.
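The abstract describes ingesting webpage structure as a graph before self-supervised pre-training and task-specific fine-tuning. As a rough illustration of that first step only (this is not the authors' code; the node features, edge construction, and downstream model are simplified assumptions), the sketch below turns an HTML document's DOM into a node list and parent-child edge list using only Python's standard library:

```python
from html.parser import HTMLParser

class DOMGraphBuilder(HTMLParser):
    """Builds a simple graph (node list + edge list) from a webpage's DOM tree.

    Each opened tag becomes a node; an edge links every node to its parent.
    This mirrors, in a very reduced form, the kind of webpage-structure graph
    a GNN feature extractor could ingest. Text content is attached to the
    enclosing node as a crude stand-in for richer node features.
    """

    def __init__(self):
        super().__init__()
        self.nodes = []   # node_id -> {"tag": ..., "text": ...}
        self.edges = []   # (parent_id, child_id) pairs
        self._stack = []  # stack of currently open node ids

    def handle_starttag(self, tag, attrs):
        node_id = len(self.nodes)
        self.nodes.append({"tag": tag, "text": ""})
        if self._stack:  # connect the new node to its parent
            self.edges.append((self._stack[-1], node_id))
        self._stack.append(node_id)

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        if self._stack and data.strip():
            self.nodes[self._stack[-1]]["text"] += data.strip()

html = "<html><body><div><h1>Title</h1><p>Some boilerplate text.</p></div></body></html>"
builder = DOMGraphBuilder()
builder.feed(html)
print(builder.nodes)  # tag/text per DOM node
print(builder.edges)  # parent-child structure: [(0, 1), (1, 2), (2, 3), (2, 4)]
```

In the actual system the node features and graph construction are richer than this; the sketch only shows how a webpage's structure can be converted into nodes and edges suitable for a graph neural network.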
