Paper Title
Performance Analysis: Discovering Semi-Markov Models From Event Logs
Paper Authors
Paper Abstract
Process mining is a well-established discipline of data analysis focused on the discovery of process models from information systems' event logs. Recently, a subarea of process mining known as stochastic process discovery has started to emerge. Stochastic process discovery takes the frequencies of events in the event data into account and allows for a more comprehensive analysis. In particular, when the durations of activities are recorded in the event log, performance characteristics of the discovered stochastic models can be analyzed, e.g., the overall process execution time can be estimated. Existing performance analysis techniques usually discover stochastic process models from event data and then simulate these models to evaluate their execution times, so they rely on empirical approaches. This paper proposes analytical techniques for performance analysis that derive statistical characteristics of the overall process execution time in the presence of arbitrary time distributions of events modeled by semi-Markov processes. The proposed methods include express analysis, focused on estimating the mean execution time, and full analysis techniques that build probability density functions (PDFs) of process execution times in both continuous and discrete forms. These methods are implemented and tested on real-world event data, demonstrating their potential for what-if analysis by providing solutions without resorting to simulation. Specifically, we demonstrate that the discrete approach is more time-efficient than simulation when the support of the duration distributions is small. Furthermore, we show that the continuous approach, with PDFs represented as Gaussian Mixture Models (GMMs), facilitates the discovery of more compact and interpretable models.
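To make the idea of estimating a mean execution time without simulation concrete, the following is a minimal sketch and not the paper's own algorithm. It assumes a small absorbing semi-Markov model with a hypothetical transition matrix `P` and hypothetical mean sojourn times `D`, and applies standard first-step analysis: the expected time to absorption `m` satisfies `m_i = sum_j P[i][j] * (D[i][j] + m_j)`, with `m = 0` in absorbing states. The state names and all numbers are illustrative only.

```python
def mean_time_to_absorption(P, D, start=0):
    """Expected time until absorption, starting from `start`.

    P[i][j]: probability of moving from state i to state j.
    D[i][j]: mean time spent in state i when the next state is j.
    A state whose row in P sums to 0 is treated as absorbing.
    """
    n = len(P)
    transient = [i for i in range(n) if sum(P[i]) > 0]
    idx = {s: r for r, s in enumerate(transient)}
    k = len(transient)

    # Augmented system (I - P_TT) m = h, where
    # h_i = sum_j P[i][j] * D[i][j] is the mean sojourn time in state i.
    A = [[0.0] * (k + 1) for _ in range(k)]
    for s in transient:
        r = idx[s]
        A[r][r] = 1.0
        for j in range(n):
            if j in idx:
                A[r][idx[j]] -= P[s][j]
            A[r][k] += P[s][j] * D[s][j]

    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(A[r][c]))
        if abs(A[piv][c]) < 1e-12:
            raise ValueError("absorption is not certain from every state")
        A[c], A[piv] = A[piv], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                for col in range(c, k + 1):
                    A[r][col] -= f * A[c][col]

    m = [A[r][k] / A[r][r] for r in range(k)]
    return m[idx[start]] if start in idx else 0.0


# Hypothetical three-state process: 0 = register, 1 = check, 2 = done.
# "check" sends the case back to "register" with probability 0.2 (rework).
P = [[0.0, 1.0, 0.0],
     [0.2, 0.0, 0.8],
     [0.0, 0.0, 0.0]]
# Hypothetical mean durations (e.g., in hours) per transition.
D = [[0.0, 2.0, 0.0],
     [1.0, 0.0, 3.0],
     [0.0, 0.0, 0.0]]

expected = mean_time_to_absorption(P, D, start=0)
print(f"mean execution time from 'register': {expected:.2f}")
```

For this toy model the two equations are `m0 = 2 + m1` and `m1 = 0.2 * (1 + m0) + 0.8 * 3`, which give a mean of 5.75 time units from the first state. Only the means of the duration distributions enter the calculation, which is why a mean-only estimate needs nothing more than a small linear system, whereas a full PDF requires the distributions themselves.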