The 2019 Comparison of Tools for the Analysis of Quantitative Formal Models (QComp 2019) was part of the TACAS 2019 TOOLympics. It was the first friendly competition among verification and analysis tools for quantitative formal models. Based on a curated subset of benchmarks from the Quantitative Verification Benchmark Set, QComp compares the performance, versatility, and usability of the participating tools.
The results of QComp 2019 are published in
Ernst Moritz Hahn, Arnd Hartmanns, Christian Hensel, Michaela Klauck, Joachim Klein, Jan Křetínský, David Parker, Tim Quatmann, Enno Ruijters, and Marcel Steinmetz: The 2019 Comparison of Tools for the Analysis of Quantitative Formal Models. Proceedings of the 25th International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS 2019). Lecture Notes in Computer Science, vol. 11429, Springer, 2019.
The complete data of the performance evaluation is available via the interactive results table. The tools and scripts used in the performance evaluation as well as the generated data are available in the replication package.
All deadlines are "anywhere on Earth" (UTC-12).
We ask that benchmark contributions as well as tool participation be announced to the organisers by October 20. This is to ensure that all benchmarks are properly tested and annotated with metadata in time for the submission deadline on October 27. Between October 27 and November 24, the organisers and participants will select a subset of the models and properties in the benchmark set for use in the performance comparison and discuss the reference results. Participants submit their tool descriptions by December 15, which will then be edited by the organisers for inclusion in the competition report. We currently expect the TOOLympics deadline for a first complete version of the competition report to be around January 10, 2019. The purpose of the re-evaluation phase in January 2019 is to allow participants to fix bugs in their tools that were not discovered in their own testing prior to the original tool submission; it is not intended for significant changes aimed at improving the tools' performance. We expect reviews of the competition report to arrive in February 2019, with the final-version deadline being the same as for the ETAPS 2019 conferences.
Every tool that solved a significant subset of the benchmark set was welcome, even if it did not support all model types, the JANI format, or all included kinds of properties. For example, a tool specialising in the analysis of stochastic Petri nets was not expected to solve JANI Markov chain models. Every participating tool is associated with exactly one participant (with exceptions), who co-authored the competition report. A participant could take part with multiple tools, or with multiple variants (e.g. analysis engines) of the same tool.
The following tools participated in QComp 2019:
Ernst Moritz Hahn: ePMC [HLSTZ14], an extensible probabilistic model checker for CTMC, DTMC, and MDP specified in the PRISM language or JANI, formerly IscasMC.
Arnd Hartmanns: mcsta [HH15], a disk-based explicit-state model checker for MDP, MA, and PTA specified in Modest or JANI, part of the Modest Toolset [HH14]; and modes [BDHS18], a statistical model checker for MDP, MA, and PTA specified in Modest or JANI, also part of the Modest Toolset [HH14].
Christian Hensel: Storm [DJKV17], a high-performance model checker for DTMC, CTMC, MDP, and MA specified in various modelling languages.
Michaela Klauck: Modest FRET-π LRTDP, a model checker based on probabilistic planning algorithms for MDP specified in Modest or JANI, using infrastructure of the Modest Toolset [HH14].
Joachim Klein and David Parker: PRISM [KNP11], a model checker for a wide range of probabilistic models including DTMC, CTMC, MDP, and PTA specified in the PRISM language.
Jan Křetínský: PRISM-TUMheuristics [BCCFKKPU14], an extension of PRISM with simulation-based learning heuristics and mean-payoff properties.
Enno Ruijters: DFTRES [RRBS17], a rare-event simulator using importance sampling for dynamic fault trees specified in the Galileo language or JANI.
Marcel Steinmetz: Probabilistic Fast Downward [SHB16], a probabilistic extension of the classical heuristic planner Fast Downward for MDP specified in PPDDL.
QComp 2019 was a flexible and friendly event: a comparison rather than a competition, with no global scoring or ranking of tools. Instead, we collected data and published comparison tables and diagrams about the performance of the participating tools on a curated selection of models from the Quantitative Verification Benchmark Set. We analysed and highlighted every tool's strengths and tradeoffs. Equally prominently, based on input and feedback from the participants, we described and compared the tools in terms of versatility and usability. The entire performance evaluation and report-writing process was carried out in close collaboration with the participants so that feedback could be incorporated quickly and any setup errors or misunderstandings resolved early. All participants co-authored the competition report.
All tools were evaluated by the organisers on a central server running an up-to-date 64-bit version of Ubuntu Linux 18.04: a standard desktop machine with an Intel Core i7-920 CPU and 12 GB of RAM. We used a timeout of 30 minutes per model and tool. The benchmarking scripts executed on the central server had been made available to participants for testing. Participants provided a tool package, installation instructions, and command lines/scripts to run the selected benchmarks and deliver the tool's results in a simple common format. We evaluated all tools once with the command lines provided by the participants, and once with default parameters for comparison. For every benchmark, participants were asked to briefly explain the tool settings used, why they were chosen for that particular benchmark, and what the capabilities of their tool are (e.g. which properties of the benchmark are supported and what kind of guarantee is attached to the results provided by the tool). We expected tools to provide results with a relative error of ±1e-3; statistical tools (e.g. simulators) were allowed to return a result outside of this error bound with a probability of at most 5%. The competition report does not contain a performance ranking, but provides tables, diagrams, and a discussion of the runtimes achieved by the tools. We did not directly evaluate memory usage; however, since the amount of RAM in the central server is not high, memory usage was evaluated indirectly by observing for which (large) models a tool failed due to an out-of-memory situation.
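For statistical tools, the error bound and the 5 % error probability together determine how many simulation runs suffice. A minimal sketch of this relationship, using the classical Chernoff-Hoeffding (Okamoto) bound for i.i.d. Bernoulli samples; note that this bound gives an absolute rather than relative error guarantee, so it only illustrates the order of magnitude involved, not the exact scheme any participating tool used:

```python
import math

def okamoto_bound(eps: float, delta: float) -> int:
    """Number of i.i.d. simulation runs sufficient to estimate a
    probability within absolute error eps, except with probability
    at most delta (Chernoff-Hoeffding / Okamoto bound)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

# Illustrative parameters in the spirit of QComp 2019:
# error bound 1e-3, error probability at most 5 %.
print(okamoto_bound(1e-3, 0.05))  # -> 1844440
```

The quadratic dependence on 1/eps is why tightening the error bound by one order of magnitude multiplies the required number of runs by a hundred, which is one reason rare-event simulators such as DFTRES rely on importance sampling instead of plain Monte Carlo.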
For quantitative benchmarks, it is often not clear what the true, correct numeric result for a property is. This is because most tools that scale to large models use inexact floating-point arithmetic, and because any result may be affected by bugs in whatever "reference" tool is used. At the same time, it does not make sense to report performance data when a tool provides an incorrect result, as the underlying error may drastically reduce or increase the analysis time. QComp therefore adopts the following pragmatic approach to evaluating the correctness of the results produced by the participating tools: after the submission deadline for benchmarks, the organisers use the most trustworthy analysis approach available (e.g. an exact-arithmetic solver for small models and a model checker using a sound iterative numerical method for large ones) to produce reference results for all selected models. Participants may use any other tool to try to refute the correctness of any of these reference results. If this happens, all participants discuss and agree on a new reference result, or decide to exclude the affected benchmark. During the performance evaluation phase, every result that does not agree with the reference result up to the desired precision is considered incorrect. The number of incorrect results produced by each tool is listed in the competition report.
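The acceptance criterion can be sketched as a small check, assuming the relative error bound of 1e-3 described above; the fallback for a zero-valued reference is a hypothetical choice for illustration, not part of the official rules:

```python
def agrees(result: float, reference: float, rel_err: float = 1e-3) -> bool:
    """Check whether a tool's result matches the reference value
    up to the given relative error."""
    if reference == 0.0:
        # Hypothetical fallback: relative error is undefined for a
        # zero reference, so compare absolutely instead.
        return abs(result) <= rel_err
    return abs(result - reference) <= rel_err * abs(reference)

print(agrees(0.12345, 0.12340))  # -> True  (within 0.1 % of reference)
print(agrees(0.12500, 0.12340))  # -> False (off by roughly 1.3 %)
```

A result failing this check counts as incorrect in the report, regardless of how quickly it was produced.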
As part of the competition report, we also describe the versatility and usability of all participating tools, based on descriptions provided by the participants. Versatility covers, for example, support for different input formats and formalisms, the availability of different and complementary analysis engines, and the tool's configurability (e.g. to make speed/precision tradeoffs). Usability covers the quality of the tool's documentation (online, or provided as help text or messages within the tool), the availability of an easy-to-use graphical interface, how involved the installation process is, which platforms the tool runs on, whether it provides sensible default parameter settings, and so on. It also includes whether the tool was able to achieve its results in the performance evaluation with few and consistent parameters across all models, or whether every model needed specifically tuned values.
As the first iteration of a quantitative analysis competition, QComp 2019 was intentionally limited to a small set of mathematical formalisms (from Markov chains to probabilistic timed automata) and properties (probabilistic reachability, expected rewards, and steady-state queries only). As part of the QComp 2019 evaluation, however, we also identify advanced and upcoming areas of quantitative modelling and analysis that may be considered for the next edition of QComp.