QComp 2020

The 2020 Comparison of Tools for the Analysis of Quantitative Formal Models (QComp 2020) was part of ISoLA 2020. It is the second friendly competition among verification and analysis tools for quantitative formal models. Based on a curated subset of benchmarks from the Quantitative Verification Benchmark Set, QComp compares the performance, versatility, and usability of the participating tools.

QComp 2020 was a flexible and friendly event. There is no global scoring or ranking of tools. Instead, we collected data and publish comparison tables, diagrams, and text about the performance, versatility, and usability of the participating tools. We analyse and highlight every tool's strengths and tradeoffs. The entire performance evaluation and report-writing process was performed in close collaboration with the participants to quickly include feedback and resolve any setup errors or misunderstandings early. All participants co-authored the competition report.

Results

The results of QComp 2020 are published in

Carlos E. Budde, Arnd Hartmanns, Michaela Klauck, Jan Křetínský, David Parker, Tim Quatmann, Andrea Turrini, and Zhen Zhang: On Correctness, Precision, and Performance in Quantitative Verification. Proceedings of the 9th International Symposium on Leveraging Applications of Formal Methods (ISoLA 2020). Lecture Notes in Computer Science, vol. 12479, Springer, 2021.

The complete data of the performance evaluation is available via the interactive results tables for the individual tracks:

The tools and scripts used in the performance evaluation as well as the generated data are available in the replication package.

Timeline

All deadlines are "anywhere on Earth" (UTC-12) and all dates are in 2020.

February: Registration (announce intention to participate, via )
March 11: Tool package submission for performance evaluation
April 18: Tool and algorithm description submission for competition report
April 18: Announcement of the results of the performance evaluation
May 02: Deadline for objections to performance evaluation results
May 09: Tool package resubmission (for re-evaluation in case of objections)
June 02: Announcement of the results of the performance re-evaluation

The organisers will make the tool packaging instructions available to the participants around February 24. The purpose of the re-evaluation phase in May is to allow participants to fix bugs in their tools that were not discovered in their own testing prior to the original tool submission. It is not intended for significant changes with the aim of improving the tools' performance.

Participation

Every tool that solves a significant subset of the competition benchmarks was welcome, even if it may not support all model types, the JANI format, or all included kinds of properties. Every participating tool is associated to exactly one participant (with exceptions), who co-authored the competition report. A participant was allowed to participate with multiple tools, or multiple variants (e.g. analysis engines) of the same tool.

The following tools participated in QComp 2020:

Participant	Tools
Carlos E. Budde	DFTRES [RRBS19]: a rare-event simulator using importance sampling for dynamic fault trees specified in the Galileo language or JANI
Andrea Turrini	ePMC [HLSTZ14]: an extensible probabilistic model checker for CTMC, DTMC, and MDP specified in the PRISM language or JANI, formerly IscasMC
Arnd Hartmanns	mcsta [HH15]: a disk-based explicit-state model checker for MDP, MA, and PTA specified in Modest or JANI, part of the Modest Toolset [HH14]
Arnd Hartmanns	modes [BDHS18]: a statistical model checker for MDP, MA, and PTA specified in Modest or JANI, part of the Modest Toolset [HH14]
Michaela Klauck	Modest FRET-π LRTDP: a model checker based on probabilistic planning algorithms for MDP specified in Modest or JANI, using infrastructure of the Modest Toolset [HH14]
Jan Křetínský	PRISM-TUMheuristics [BCCFKKPU14]: an extension of PRISM with simulation-based learning heuristics and mean-payoff properties
David Parker	PRISM [KNP11]: a model checker for a wide range of probabilistic models including DTMC, CTMC, MDP, and PTA specified in the PRISM language
Tim Quatmann	Storm [HJKQV20]: a high-performance model checker for DTMC, CTMC, MDP, and MA specified in various modelling languages
Zhen Zhang	STAMINA [NMMZZ19]: a model checker for infinite-state CTMC based on state space truncation built atop PRISM and using the PRISM language

Evaluation

The evaluation of performance, versatility, and usability was performed by the organisers.

Hardware

All tools were evaluated on a standard desktop machine running an up-to-date 64-bit version of Ubuntu Linux 18.04. We used the same machine with an Intel Core i7-920 CPU and 12 GB of RAM that was previously used for QComp 2019. We used a timeout of 30 minutes per model and tool. The organisers had made the benchmarking scripts available to participants for testing. Participants provided a tool package, installation instructions, and scripts that generate command lines to run the selected benchmarks and extract the tool's results. We did not directly evaluate memory usage, but note that the amount of RAM available in the central server is not high. Thus memory usage was evaluated indirectly by considering for which (large) models a tool fails due to an out-of-memory situation.

Performance Evaluation Tracks

Due to the different types of correctness guarantees on results provided by different analysis methods, we compare tools in three different tracks:

correct results: tools/algorithms that unconditionally guarantee a difference of zero between the tool's result and the correct value. These tools could e.g. use techniques based on parametric model checking, state elimination with rational arithmetic, or rational search. Tools return a reduced fraction (as a string) that must be equal to the reference result.
floating-point correct results: tools/algorithms that unconditionally guarantee a difference of zero between the tool's result and the correct value modulo imprecisions of floating-point arithmetic. These tools could e.g. use techniques based on state elimination with floating-point arithmetic. We checked whether the results provided by tools are within a relative error of ± 1e-14 compared to the reference results. Tools that participate in the correct results track should participate in this track, too.
ε-correct results: tools/algorithms that unconditionally guarantee an a priori fixed bound on the difference between the tool's result and the correct value. They must provide results within a relative error of ± 1e-6 compared to the reference results. These tools typically use sound iterative algorithms with floating-point arithmetic like interval iteration. Tools that participate in the floating-point correct results track should participate in this track, too.
probably ε-correct results: tools/algorithms that guarantee an a priori fixed bound on the difference between the tool's result and the correct value with a certain probability. They must provide results within a relative error of ± 5e-2 compared to the reference results with probability 0.95. These tools typically use simulation-based approaches such as statistical model checking. Tools that participate in the ε-correct results track should participate in this track, too.
often ε-correct results: tools/algorithms that do not guarantee any a priori fixed bound on the difference between the tool's result and the correct value, but often deliver results that meet such a bound. These tools typically use unsound iterative algorithms like standard value iteration, or compute over- or underapproximations. Tools that participate in the probably ε-correct results track should participate in this track, too. We performed two evaluations in this track:
- one in which we report runtime and the number of incorrect results, for a relative error of ± 1e-3, and
- one where we only report the error that the tool achieved within a 10-minute timeout.

Tool Parameters

All tools were evaluated in two settings:

once with default parameters that must be the same for all benchmark instances (but that can differ per track, model type, and property type), and
once with specific parameters that may be different for each benchmark instance.

Correctness of Results

For quantitative benchmarks, it is often not clear what the true, correct numeric result for a property is. This is due to most tools that scale to large models using inexact floating-point arithmetic, and because any result may be affected by bugs in whatever "reference" tool is used. At the same time, it does not make sense to report performance data when a tool provides an incorrect result as this may be due to an error that drastically reduces or increases the analysis time. For this reason, QComp adopts the following pragmatic approach to evaluate the correctness of the results produced by the participating tools: The organisers have used the most trustworthy analysis approach available (e.g. an exact-arithmetic solver for small and a model checker using a sound iterative numerical method for large models) to produce reference results for all selected models. Participants may use any other tool to try to refute the correctness of any of those reference results. If this happens, all participants will discuss and agree on a new reference result, or decide to exclude the affected benchmark. During the performance evaluation phase, every result that does not agree with the reference result up to the desired precision is considered as incorrect. The number of incorrect results produced by a tool is listed in the competition report.

Linear Temporal Logic

Several participants of QComp 2019 expressed an interest in comparing their tools' performance with respect to computing the probabilities of formulas in linear temporal logic (LTL). QComp 2020 will include such an evaluation if it includes tools from at least three participants, and those participants agree on a set of LTL benchmark instances by the tool package submission deadline. These benchmarks have to be available in JANI format, and eventually be added to the Quantitative Verification Benchmark Set. The organisers will help coordinate such an effort and perform the LTL performance evaluation.

QComp 2020 ultimately did not include an LTL evaluation.

Organisers

Organisation & communication: Arnd Hartmanns (University of Twente, The Netherlands)
Technical setup & tool execution: Michaela Klauck (Saarland University, Germany)