Paper title
Modifications of Simon text model
Paper authors
Abstract
We discuss probabilistic text models and their modifications. We construct processes that count the different words and the unique words in a text, with the aim of matching the statistics of real texts. The infinite urn model (Karlin model) and the Simon model are the best-known text models, but neither simulates the number of unique words correctly. The infinite urn model sometimes gives an incorrect limit for the number of unique words relative to the number of different words. The Simon model implies linear growth of the numbers of different and unique words. We propose three modifications of the Karlin and Simon models. The first is an offline variant, in which the Simon model starts after the infinite urn scheme has completed; we prove limit theorems for this modification in embedded times only. The second modification introduces a compound Poisson process into the infinite urn model; we prove limit theorems for it. The third modification is an online variant, in which the Simon redistribution acts at every toss of the Karlin model. In contrast to the compound Poisson model, we have no analytical results for this modification. We test all the modifications by simulation and find good agreement with real texts.
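For orientation, the two baseline models named in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation and not any of the three proposed modifications. The function names, the power-law urn probabilities, the `exponent`, `vocab_size`, and `p_new` parameters, and the truncation of the infinite urn scheme to a finite vocabulary are all hypothetical choices made here. "Unique" is read as "occurring exactly once", and "different" as "occurring at least once".

```python
import random
from collections import Counter
from itertools import accumulate


def count_different_and_unique(words):
    """Return (number of distinct words, number of words seen exactly once)."""
    counts = Counter(words)
    different = len(counts)
    unique = sum(1 for c in counts.values() if c == 1)
    return different, unique


def simulate_infinite_urn(n, exponent=1.5, vocab_size=100_000, seed=0):
    """Karlin-type urn scheme: n independent tosses into urns 1, 2, ...

    Assumption: urn i has probability proportional to i**(-exponent), and the
    infinite scheme is truncated to vocab_size urns as an approximation.
    """
    rng = random.Random(seed)
    weights = [i ** -exponent for i in range(1, vocab_size + 1)]
    cum_weights = list(accumulate(weights))
    return rng.choices(range(1, vocab_size + 1), cum_weights=cum_weights, k=n)


def simulate_simon(n, p_new=0.1, seed=0):
    """Simon model: each new token is a brand-new word with probability p_new,
    otherwise a copy of a uniformly chosen earlier token, so an existing word
    is picked with probability proportional to its current frequency.
    """
    rng = random.Random(seed)
    text = [0]          # the first token is always a new word
    next_word = 1
    for _ in range(n - 1):
        if rng.random() < p_new:
            text.append(next_word)
            next_word += 1
        else:
            text.append(rng.choice(text))
    return text


if __name__ == "__main__":
    for n in (1_000, 10_000):
        print(n, "urn:", count_different_and_unique(simulate_infinite_urn(n)),
              "simon:", count_different_and_unique(simulate_simon(n)))
```

Running the script for increasing `n` gives a rough feel for the contrast the abstract draws: in the Simon sketch the number of different words grows roughly in proportion to `n`, because a new word appears at a constant rate `p_new`, while in the urn sketch growth is sublinear for the assumed power-law weights.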