Leveraging Sub-Trace Clustering in Predictive Process Monitoring Tasks

Introduction and Background

For the analysis of business processes, techniques from Process Mining and Predictive Process Monitoring can provide insights into the drivers of (future) process performance, possible bottlenecks, and anomalies. The event log data that represents a business process consists of single events, shorter and longer sequences of events (i.e. traces), and the process attributes that accompany those events.
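To make these components concrete, the following minimal sketch groups the events of a toy event log into one trace per process instance. The log contents, the tuple layout (case id, activity, timestamp), and the function name `build_traces` are assumptions chosen for illustration; real event logs such as the BPI Challenge datasets carry many more attributes.

```python
from collections import defaultdict

# Hypothetical toy event log: (case_id, activity, timestamp) tuples.
event_log = [
    ("c1", "Register", 1),
    ("c2", "Register", 2),
    ("c1", "Check", 3),
    ("c2", "Reject", 4),
    ("c1", "Approve", 5),
]


def build_traces(events):
    """Group events by case id and order each group by timestamp."""
    by_case = defaultdict(list)
    for case_id, activity, timestamp in events:
        by_case[case_id].append((timestamp, activity))
    # Sorting the (timestamp, activity) pairs orders each trace in time.
    return {
        case_id: [activity for _, activity in sorted(case_events)]
        for case_id, case_events in by_case.items()
    }


traces = build_traces(event_log)
```

Each value in `traces` is the ordered activity sequence of one process instance, which is the unit that the remainder of this topic refers to as a trace.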

During execution, process instances pass through different phases or stages. Smaller chunks of a process trace can represent the starting and ending phases, or different paths and branches inside a trace, and may carry information about how the underlying real-world process is performing or which kind of overall process is being carried out. In an unsupervised setting, where no information about those phases (i.e. the subtraces of a process) is available, we either need to apply domain knowledge to classify or separate the subtraces correctly, or turn to data-driven approaches that cluster the distinct (atomic or composite) parts of process traces.

This thesis is concerned with identifying important patterns (subtraces) in process data. Possible methods come from the domain of Text Representation or from (Deep) Clustering in general. The goal is then to use the identified subtraces to reason about the respective process traces, i.e. to predict or explain process KPIs, outcomes, etc. You will investigate and analyze different ways of representing process trace data to extract the sequential information and cluster it into meaningful, possibly hierarchically arranged, subtraces or subtrace patterns. You will then analyze the identified subtraces within common process monitoring tasks such as Process Duration Prediction, Outcome Prediction, or Next Activity (Next Sub-trace) Prediction.
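As a rough illustration of the text analogy, activities can be treated as tokens and traces as sentences, so that fixed-length subtraces correspond to n-grams. The sketch below extracts all subtraces of length 2 with a sliding window and counts them. It is only a simple stand-in for the representation and clustering methods the thesis would investigate; the toy traces and the function name `subtraces` are assumptions.

```python
from collections import Counter


def subtraces(trace, n):
    """Return all contiguous subtraces of length n (activity n-grams)."""
    return [tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)]


# Hypothetical toy traces, one list of activities per process instance.
traces = [
    ["Register", "Check", "Approve", "Pay"],
    ["Register", "Check", "Reject"],
    ["Register", "Check", "Approve", "Pay"],
]

# Count how often each length-2 subtrace occurs across all traces.
counts = Counter(s for trace in traces for s in subtraces(trace, 2))
most_frequent = counts.most_common(1)[0]
```

Frequent subtraces found this way could serve as candidate patterns; grouping similar subtraces, or choosing the window length per dataset, is where clustering and the sub-questions below come in.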

Research Question

Main Research Question

How can important sub-traces of processes be effectively identified using
text representation techniques, and can the notion of sub-traces be
leveraged to improve the results of common Predictive Process Monitoring
(PPM) tasks?

Sub-questions

  • How can subtrace information be efficiently used in predictive process monitoring tasks?
  • Are there dataset-specific cluster granularities/subtrace lengths that lead to a larger improvement regarding the downstream task at hand?
  • What is the quantitative benefit, if any, in terms of predictive performance and/or explainability/comprehensibility of the trained predictive process monitoring models?

Tasks & Goals

  • Literature Review
    Study (sub-)trace clustering, (deep) clustering in general, and the current state of the art in predictive process monitoring.
  • Data Preparation
    Select benchmark event log datasets (e.g. BPI Challenges) and preprocess them if needed.
  • Pipeline Development
    Develop a pipeline for subtrace clustering (e.g. with techniques from Text Representation or (Deep) Clustering), followed by predictive model training and inference for different predictive process monitoring tasks.
  • Analysis & Benchmarking
    Compare different configurations of your implemented pipeline and identify benefits and drawbacks of model combinations. Compare your results with the current state-of-the-art predictive process monitoring models from the literature.
  • Thesis Documentation
    Summarize findings, discuss limitations, and outline future research directions.
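To indicate how subtrace information might feed a downstream task in such a pipeline, the following sketch predicts the next activity from the last subtrace of a running trace using simple frequency counts. It is a deliberately naive baseline under assumed names (`fit_next_activity`, `predict_next`) and toy data, not the model the thesis is expected to deliver.

```python
from collections import Counter, defaultdict


def fit_next_activity(traces, n=2):
    """Count which activity follows each subtrace of length n."""
    model = defaultdict(Counter)
    for trace in traces:
        for i in range(len(trace) - n):
            context = tuple(trace[i:i + n])
            model[context][trace[i + n]] += 1
    return model


def predict_next(model, prefix, n=2):
    """Return the most frequent follower of the prefix's last n activities."""
    followers = model.get(tuple(prefix[-n:]))
    # Unseen contexts yield None instead of a guess.
    return followers.most_common(1)[0][0] if followers else None


# Hypothetical toy traces used for fitting.
traces = [
    ["Register", "Check", "Approve", "Pay"],
    ["Register", "Check", "Reject"],
    ["Register", "Check", "Approve", "Pay"],
]
model = fit_next_activity(traces, n=2)
prediction = predict_next(model, ["Register", "Check"], n=2)
```

In an actual pipeline, the raw subtrace context would presumably be replaced by a learned cluster assignment, and the counting step by a trained predictive model, so that different configurations can be benchmarked against each other.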

Expected Outcome

  • A prototypical subtrace clustering + PPM pipeline for easily configurable and adjustable event log analysis.
  • Insights into which models work well for certain data characteristics and how they compare with existing techniques.
  • Accompanying code base and reproducible experiments.

Requirements

  • Study in the field of computer science, statistics or mathematics
  • Good understanding of machine learning and probabilistic modelling
  • Good programming skills in Python
  • Beneficial: Experience in the topics of (deep) clustering methods, text representation and/or NLP tasks (NLTK, etc.)

Contact

Simon Rauch

Please attach your CV and Transcript of Records.

References

Philipp Siebers, Christian Janiesch, and Patrick Zschech. A survey of text representation methods and their genealogy. IEEE Access, 10:96492–96513, 2022.
Minseok Song, Christian W. Günther, and Wil M. P. van der Aalst. Trace Clustering in Process Mining, pages 109–120. Springer Berlin Heidelberg, 2009.