The 2020 Comparison of Tools for the Analysis of Quantitative Formal Models (QComp 2020) was part of ISoLA 2020. It is the second friendly competition among verification and analysis tools for quantitative formal models. Based on a curated subset of benchmarks from the Quantitative Verification Benchmark Set, QComp compares the performance, versatility, and usability of the participating tools.
QComp 2020 was a flexible and friendly event. There is no global scoring or ranking of tools. Instead, we collected data and publish comparison tables, diagrams, and text about the performance, versatility, and usability of the participating tools. We analyse and highlight every tool's strengths and tradeoffs. The entire performance evaluation and report-writing process was performed in close collaboration with the participants to quickly include feedback and resolve any setup errors or misunderstandings early. All participants co-authored the competition report.
The results of QComp 2020 are published in
Carlos E. Budde, Arnd Hartmanns, Michaela Klauck, Jan Křetínský, David Parker, Tim Quatmann, Andrea Turrini, and Zhen Zhang: On Correctness, Precision, and Performance in Quantitative Verification. Proceedings of the 9th International Symposium on Leveraging Applications of Formal Methods (ISoLA 2020). Lecture Notes in Computer Science, vol. 12479, Springer, 2021.
The complete data of the performance evaluation is available via the interactive results tables for the individual tracks:
The tools and scripts used in the performance evaluation as well as the generated data are available in the replication package.
All deadlines are "anywhere on Earth" (UTC-12) and all dates are in 2020.
The organisers will make the tool packaging instructions available to the participants around February 24. The purpose of the re-evaluation phase in May is to allow participants to fix bugs in their tools that were not discovered in their own testing prior to the original tool submission. It is not intended for significant changes with the aim of improving the tools' performance.
Every tool that solves a significant subset of the competition benchmarks was welcome, even if it may not support all model types, the JANI format, or all included kinds of properties. Every participating tool is associated to exactly one participant (with exceptions), who co-authored the competition report. A participant was allowed to participate with multiple tools, or multiple variants (e.g. analysis engines) of the same tool.
The following tools participated in QComp 2020:
|Carlos E. Budde||DFTRES [RRBS19]: a rare-event simulator using importance sampling for dynamic fault trees specified in the Galileo language or JANI|
|Andrea Turrini||ePMC [HLSTZ14]: an extensible probabilistic model checker for CTMC, DTMC, and MDP specified in the PRISM language or JANI, formerly IscasMC|
|Arnd Hartmanns||mcsta [HH15]: a disk-based explicit-state model checker for MDP, MA, and PTA specified in Modest or JANI, part of the Modest Toolset [HH14]|
|modes [BDHS18]: a statistical model checker for MDP, MA, and PTA specified in Modest or JANI, part of the Modest Toolset [HH14]|
|Michaela Klauck||Modest FRET-π LRTDP: a model checker based on probabilistic planning algorithms for MDP specified in Modest or JANI, using infrastructure of the Modest Toolset [HH14]|
|Jan Křetínský||PRISM-TUMheuristics [BCCFKKPU14]: an extension of PRISM with simulation-based learning heuristics and mean-payoff properties|
|David Parker||PRISM [KNP11]: a model checker for a wide range of probabilistic models including DTMC, CTMC, MDP, and PTA specified in the PRISM language|
|Tim Quatmann||Storm [HJKQV20]: a high-performance model checker for DTMC, CTMC, MDP, and MA specified in various modelling languages|
|Zhen Zhang||STAMINA [NMMZZ19]: a model checker for infinite-state CTMC based on state space truncation built atop PRISM and using the PRISM language|
The evaluation of performance, versatility, and usability was performed by the organisers.
All tools were evaluated on a standard desktop machine running an up-to-date 64-bit version of Ubuntu Linux 18.04. We used the same machine with an Intel Core i7-920 CPU and 12 GB of RAM that was previously used for QComp 2019. We used a timeout of 30 minutes per model and tool. The organisers had made the benchmarking scripts available to participants for testing. Participants provided a tool package, installation instructions, and scripts that generate command lines to run the selected benchmarks and extract the tool's results. We did not directly evaluate memory usage, but note that the amount of RAM available in the central server is not high. Thus memory usage was evaluated indirectly by considering for which (large) models a tool fails due to an out-of-memory situation.
Due to the different types of correctness guarantees on results provided by different analysis methods, we compare tools in three different tracks:
All tools were evaluated in two settings:
For quantitative benchmarks, it is often not clear what the true, correct numeric result for a property is. This is due to most tools that scale to large models using inexact floating-point arithmetic, and because any result may be affected by bugs in whatever "reference" tool is used. At the same time, it does not make sense to report performance data when a tool provides an incorrect result as this may be due to an error that drastically reduces or increases the analysis time. For this reason, QComp adopts the following pragmatic approach to evaluate the correctness of the results produced by the participating tools: The organisers have used the most trustworthy analysis approach available (e.g. an exact-arithmetic solver for small and a model checker using a sound iterative numerical method for large models) to produce reference results for all selected models. Participants may use any other tool to try to refute the correctness of any of those reference results. If this happens, all participants will discuss and agree on a new reference result, or decide to exclude the affected benchmark. During the performance evaluation phase, every result that does not agree with the reference result up to the desired precision is considered as incorrect. The number of incorrect results produced by a tool is listed in the competition report.
Several participants of QComp 2019 expressed an interest in comparing their tools' performance with respect to computing the probabilities of formulas in linear temporal logic (LTL). QComp 2020 will include such an evaluation if it includes tools from at least three participants, and those participants agree on a set of LTL benchmark instances by the tool package submission deadline. These benchmarks have to be available in JANI format, and eventually be added to the Quantitative Verification Benchmark Set. The organisers will help coordinate such an effort and perform the LTL performance evaluation.
QComp 2020 ultimately did not include an LTL evaluation.