Paper Title
A Survey of Relevant Text Mining Technology
Paper Authors
Abstract
Recent advances in text mining and natural language processing technology have enabled researchers to detect an author's identity or demographic characteristics, such as age and gender, in several text genres by automatically analysing the variation of linguistic characteristics. However, applying such techniques in the wild, i.e., in both cybercriminal and regular online social media, differs from more general applications in that its defining characteristics are both domain and process dependent. This gives rise to a number of challenges of which contemporary research has only scratched the surface. More specifically, a text mining approach applied to social media communications typically has no control over the dataset size: the number of available communications varies across users. Hence, the system has to be robust to limited data availability. Additionally, the quality of the data cannot be guaranteed. As a result, the approach needs to tolerate a certain degree of linguistic noise (for example, abbreviations, non-standard language use, spelling variations and errors). Finally, in the context of cybercriminal fora, it has to be robust to deceptive or adversarial behaviour, i.e., offenders who attempt to hide their criminal intentions (obfuscation) or who assume a false digital persona (imitation), potentially using coded language. In this work, we present a comprehensive survey that discusses the problems already addressed in the current literature and reviews potential solutions. Additionally, we highlight which areas need more attention.