Paper Title

Proactive Query Expansion for Streaming Data Using External Source

Authors

Alshanik, Farah, Apon, Amy, Du, Yuheng, Herzog, Alexander, Safro, Ilya

Abstract

Query expansion is the process of reformulating the original query by adding relevant words. Choosing which terms to add, in order to improve the performance of the query expansion method or the quality of the retrieved results, is an important aspect of any information retrieval system. Adding words that positively affect the quality of the search query, or that are informative enough, plays an important role in returning or gathering relevant documents that cover a certain topic, and can improve the efficiency of the information retrieval system. Typically, query expansion techniques add words to, or substitute words in, a given search query to collect relevant data. In this paper, we design and implement a pipeline for automated query expansion. We outline several tools that use different methods to expand the query. Our methods depend on targeting emergent events in streaming data over time and finding the hidden topics in the targeted documents using probabilistic topic models. We employ Dynamic Eigenvector Centrality to trigger the emergent events and Latent Dirichlet Allocation to discover the topics. We also use an external data source as a secondary stream to supplement the primary stream with relevant words, and expand the query using words from both the primary and secondary streams. An experimental study is performed on Twitter data (the primary stream) related to the events that happened during the protests in Baltimore in 2015. The quality of the retrieved results is measured using quality indicators of the streaming data: tweet count, hashtag count, and hashtag clustering.
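The abstract does not define Dynamic Eigenvector Centrality, so the following is only a minimal sketch of the general idea, under stated assumptions: each time window of tokenized tweets is turned into a word co-occurrence graph, eigenvector centrality is approximated by power iteration, and words whose centrality rises between two consecutive windows are appended to the query. The function names, the `min_gain` threshold, and the two-window comparison are hypothetical choices for illustration, not the authors' method; the Latent Dirichlet Allocation step and the secondary stream are omitted.

```python
from collections import defaultdict
from itertools import combinations
import math


def cooccurrence_graph(docs):
    """Build an undirected, weighted word co-occurrence graph from tokenized docs."""
    graph = defaultdict(lambda: defaultdict(float))
    for tokens in docs:
        for a, b in combinations(sorted(set(tokens)), 2):
            graph[a][b] += 1.0
            graph[b][a] += 1.0
    return graph


def eigenvector_centrality(graph, iterations=100, tol=1e-9):
    """Approximate eigenvector centrality by power iteration (L2-normalized)."""
    nodes = list(graph)
    if not nodes:
        return {}
    score = {n: 1.0 / math.sqrt(len(nodes)) for n in nodes}
    for _ in range(iterations):
        # Iterate with (A + I) rather than A: same dominant eigenvector,
        # but bipartite graphs no longer oscillate.
        new = {n: score[n] + sum(w * score[m] for m, w in graph[n].items())
               for n in nodes}
        norm = math.sqrt(sum(v * v for v in new.values()))
        new = {n: v / norm for n, v in new.items()}
        delta = sum(abs(new[n] - score[n]) for n in nodes)
        score = new
        if delta < tol:
            break
    return score


def emerging_terms(prev_docs, curr_docs, min_gain=0.1):
    """Words whose centrality grew by at least min_gain (assumed threshold)."""
    prev = eigenvector_centrality(cooccurrence_graph(prev_docs))
    curr = eigenvector_centrality(cooccurrence_graph(curr_docs))
    gains = {t: c - prev.get(t, 0.0) for t, c in curr.items()}
    rising = [t for t, g in gains.items() if g >= min_gain]
    return sorted(rising, key=lambda t: -gains[t])


def expand_query(query, prev_docs, curr_docs, k=3):
    """Append up to k emerging words that are not already in the query."""
    extra = [t for t in emerging_terms(prev_docs, curr_docs) if t not in query]
    return list(query) + extra[:k]


# Toy windows (made-up tokens, not data from the paper).
prev_window = [["baltimore", "weather"], ["baltimore", "orioles"]]
curr_window = [
    ["baltimore", "protest", "curfew"],
    ["protest", "curfew", "police"],
    ["baltimore", "protest"],
]
expanded = expand_query(["baltimore"], prev_window, curr_window, k=2)
```

In a streaming setting, `prev_window` and `curr_window` would be consecutive slices of the primary stream, and the expanded query would be used to collect the next slice.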
