The Influence of Domain-Based Preprocessing on Subject-Specific Clustering

Paper title

The Influence of Domain-Based Preprocessing on Subject-Specific Clustering

Authors

Alexandra Gkolia, Nikhil Fernandes, Nicolas Pizzo, James Davenport, Akshar Nair

Abstract

The sudden move of most university teaching online due to the global Covid-19 pandemic has increased the workload of academics. One contributing factor is answering a high volume of queries from students. As these queries are not limited to the synchronous time frame of a lecture, many of them are likely to be related or even equivalent. One way to deal with this problem is to cluster the questions by topic. In our previous work, we aimed to find an improved, highly efficient clustering method using a recurring LDA model. Our data set contained questions posted online from a Computer Science course at the University of Bath. A significant number of these questions contained code excerpts, which caused a problem in clustering, as certain terms were treated as common English words rather than being recognised as specific code terms. To address this, we implemented tagging of these technical terms using Python as part of preprocessing the data set. In this paper, we explore the tagging of data sets, focusing on identifying code excerpts, and provide empirical results to justify our reasoning.
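
The tagging step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the line-level heuristic for spotting code, the `code_` prefix, and the names `CODE_LINE`, `IDENTIFIER`, and `tag_code_terms` are all assumptions made for illustration. The idea is only that a word such as "for" or "print" gets a distinct token when it appears inside a code excerpt, so a topic model no longer conflates it with the ordinary English word.

```python
import re

# Assumed heuristic (not from the paper): a line counts as code if it contains
# typical code punctuation or ends with a colon.
CODE_LINE = re.compile(r"[(){}\[\];=]|:\s*$")

# Words and identifiers: a letter or underscore followed by word characters.
IDENTIFIER = re.compile(r"[A-Za-z_]\w*")


def tag_code_terms(text, prefix="code_"):
    """Prefix identifiers on code-like lines so they form separate tokens
    from the same words used in ordinary prose."""
    tagged_lines = []
    for line in text.splitlines():
        names = IDENTIFIER.findall(line)
        if CODE_LINE.search(line):
            # Code-like line: keep the original case and add the hypothetical prefix.
            tokens = [prefix + name for name in names]
        else:
            # Prose line: lowercase the words, as is common before topic modelling.
            tokens = [name.lower() for name in names]
        tagged_lines.append(" ".join(tokens))
    return "\n".join(tagged_lines)


question = "Why does my loop print twice\nfor i in range(3):\n    print(i)"
print(tag_code_terms(question))
```

In this example, "print" in the prose line stays untagged while "print" in the code line becomes `code_print`. A heuristic this simple would also mis-tag prose lines that contain parentheses, so it should be read as a starting point under the stated assumptions and not as a reliable code detector.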
