Evaluating Generative Patent Language Models

Paper title

Evaluating Generative Patent Language Models

Authors

Lee, Jieh-Sheng

Abstract

Generative language models show promise for assisting human writing in various domains. This manuscript aims to build generative language models in the patent domain and to evaluate model performance from a human-centric perspective. The perspective is to measure the ratio of keystrokes that autocompletion based on generative patent language models can save. A higher ratio indicates a more effective model, one that saves more keystrokes. This metric can be used to benchmark model performance. It differs from conventional machine-centric metrics, which are token-based rather than keystroke-based. In terms of model size, the largest model built in this manuscript has 6B parameters, which is state-of-the-art in the patent domain. Based on the metric, the largest model is not necessarily the best for the human-centric metric. This finding suggests that continuing to increase model size in the patent domain might be unnecessary if the purpose is to assist human writing with autocompletion. Several patent language models are pre-trained from scratch in this research. The pre-trained models are released for future researchers, and several visualization tools are also provided. The importance of building generative language models in the patent domain lies in their potential to facilitate creativity and innovation in the future.
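To make the keystroke-based idea concrete, the following is a minimal sketch of how a ratio of saved keystrokes could be computed. It is not the paper's implementation: the abstract does not give the exact definition, so the details here are assumptions. In particular, the toy bigram predictor stands in for a real patent language model, and the names `train_bigram`, `suggest`, `keystrokes_saved_ratio`, `top_k`, and `accept_cost` are hypothetical. The sketch assumes that accepting a suggestion costs a fixed number of keystrokes, while a word that is not suggested must be typed character by character.

```python
from collections import Counter, defaultdict


def train_bigram(corpus_tokens):
    """Count which token follows which (a toy stand-in for a language model)."""
    following = defaultdict(Counter)
    for prev, nxt in zip(corpus_tokens, corpus_tokens[1:]):
        following[prev][nxt] += 1
    return following


def suggest(following, prev_token, top_k):
    """Return up to top_k of the most frequent continuations of prev_token."""
    return [tok for tok, _ in following[prev_token].most_common(top_k)]


def keystrokes_saved_ratio(following, target_tokens, top_k=3, accept_cost=1):
    """Fraction of keystrokes saved when typing target_tokens with suggestions.

    Assumption: typing a token manually costs len(token) + 1 keystrokes
    (its characters plus a trailing space); accepting a suggestion costs
    accept_cost keystrokes.
    """
    manual = 0      # keystrokes with no assistance
    assisted = 0    # keystrokes with autocompletion
    prev = None
    for tok in target_tokens:
        cost = len(tok) + 1
        manual += cost
        if prev is not None and tok in suggest(following, prev, top_k):
            assisted += min(accept_cost, cost)  # pick the suggestion
        else:
            assisted += cost                    # type it out in full
        prev = tok
    return (manual - assisted) / manual if manual else 0.0


if __name__ == "__main__":
    corpus = "a method for processing data a method for storing data".split()
    target = "a method for processing data".split()
    model = train_bigram(corpus)
    print(f"ratio of keystrokes saved: {keystrokes_saved_ratio(model, target):.3f}")
```

Under these assumptions the metric counts what the writer types, not what the model predicts: a correctly suggested long word saves many keystrokes, a correctly suggested short word saves few, and a token-level accuracy score would treat the two the same.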
