The 2020 Comparison of Tools for the Analysis of Quantitative Formal Models (QComp 2020) will be part of ISoLA 2020. It is the second friendly competition among verification and analysis tools for quantitative formal models. Based on a curated subset of benchmarks from the Quantitative Verification Benchmark Set, QComp will compare the performance, versatility, and usability of the participating tools.
QComp 2020 is a flexible and friendly event. There is no global scoring or ranking of tools. Instead, we collect data and publish comparison tables, diagrams, and text about the performance, versatility, and usability of the participating tools. We analyse and highlight every tool's strengths and tradeoffs. The entire performance evaluation and report-writing process will be performed in close collaboration with the participants to quickly include feedback and resolve any setup errors or misunderstandings early. We expect all participants to co-author the competition report.
All deadlines are "anywhere on Earth" (UTC-12) and all dates are in 2020.
The organisers will make the tool packaging instructions available to the participants around February 24. The purpose of the re-evaluation phase in May is to allow participants to fix bugs in their tools that were not discovered in their own testing prior to the original tool submission. It is not intended for significant changes with the aim of improving the tools' performance.
Every tool that solves a significant subset of the competition benchmarks is welcome, even if it may not support all model types, the JANI format, or all included kinds of properties. Every participating tool must be associated to exactly one participant (with exceptions), who we expect to co-author the competition report. A participant may participate with multiple tools, or multiple variants (e.g. analysis engines) of the same tool. The currently registered participants are:
|Carlos E. Budde||DFTRES [RRBS17]: a rare-event simulator using importance sampling for dynamic fault trees specified in the Galileo language or JANI|
|Ernst Moritz Hahn||ePMC [HLSTZ14]: an extensible probabilistic model checker for CTMC, DTMC, and MDP specified in the PRISM language or JANI, formerly IscasMC|
|Arnd Hartmanns||mcsta [HH15]: a disk-based explicit-state model checker for MDP, MA, and PTA specified in Modest or JANI, part of the Modest Toolset [HH14]|
|modes [BDHS18]: a statistical model checker for MDP, MA, and PTA specified in Modest or JANI, part of the Modest Toolset [HH14]|
|Michaela Klauck||Modest FRET-π LRTDP: a model checker based on probabilistic planning algorithms for MDP specified in Modest or JANI, using infrastructure of the Modest Toolset [HH14]|
|Jan Křetínský||PRISM-TUMheuristics [BCCFKKPU14]: an extension of PRISM with simulation-based learning heuristics and mean-payoff properties|
|David Parker||PRISM [KNP11]: a model checker for a wide range of probabilistic models including DTMC, CTMC, MDP, and PTA specified in the PRISM language|
|Tim Quatmann||Storm [HJKQV20]: a high-performance model checker for DTMC, CTMC, MDP, and MA specified in various modelling languages|
|Marcel Steinmetz||Probabilistic Fast Downward [SHB16]: a probabilistic extension of the classical heuristic planner Fast Downward for MDP specified in PPDDL|
|Zhen Zhang||STAMINA [NMMZZ19]: a model checker for infinite-state CTMC based on state space truncation built atop PRISM and using the PRISM language|
The evaluation of performance, versatility, and usability will be performed by the organisers.
All tools will be evaluated on a standard desktop machine running an up-to-date 64-bit version of Ubuntu Linux 18.04. We plan to use the same machine with an Intel Core i7-920 CPU and 12 GB of RAM that was previously used for QComp 2019. We will use a timeout of 30 minutes per model and tool. The organisers will make the benchmarking scripts available to participants for testing. Participants provide a tool package, installation instructions, and scripts that generate command lines to run the selected benchmarks and extract the tool's results. We will not directly evaluate memory usage, but note that the amount of RAM available in the central server is not high. Thus memory usage will be evaluated indirectly by considering for which (large) models a tool fails due to an out-of-memory situation.
Due to the different types of correctness guarantees on results provided by different analysis methods, we will compare tools in three different tracks:
All tools will be evaluated in two settings:
We expect tools to provide results with a relative error of ± 1e-3. Statistical tools (e.g. simulators) should provide results outside of this error bound with a probability of at most 5 %.
For quantitative benchmarks, it is often not clear what the true, correct numeric result for a property is. This is due to most tools that scale to large models using inexact floating-point arithmetic, and because any result may be affected by bugs in whatever "reference" tool is used. At the same time, it does not make sense to report performance data when a tool provides an incorrect result as this may be due to an error that drastically reduces or increases the analysis time. For this reason, QComp adopts the following pragmatic approach to evaluate the correctness of the results produced by the participating tools: The organisers have used the most trustworthy analysis approach available (e.g. an exact-arithmetic solver for small and a model checker using a sound iterative numerical method for large models) to produce reference results for all selected models. Participants may use any other tool to try to refute the correctness of any of those reference results. If this happens, all participants will discuss and agree on a new reference result, or decide to exclude the affected benchmark. During the performance evaluation phase, every result that does not agree with the reference result up to the desired precision will be considered as incorrect. The number of incorrect results produced by a tool will be listed in the competition report.
As part of the competition report, we will also evaluate the versatility and usability of all participating tools. Versatility is determined by the types and sizes of models that the tool is able to successfully analyse in the performance evaluation. Usability is the quality of the tool's documentation (online, or provided as help text or messages within the tool), the availability of an easy-to-use graphical interface, how involved the installation process is, what platforms the tool runs on, whether it provides sensible default parameter settings, and so on. The usability evaluation will follow clearly defined criteria that will be agreed upon among the participants before the performance evaluation starts.
Several participants of QComp 2019 expressed an interest in comparing their tools' performance with respect to computing the probabilities of formulas in linear temporal logic (LTL). QComp 2020 will include such an evaluation if it includes tools from at least three participants, and those participants agree on a set of LTL benchmark instances by the tool package submission deadline. These benchmarks have to be available in JANI format, and eventually be added to the Quantitative Verification Benchmark Set. The organisers will help coordinate such an effort and perform the LTL performance evaluation.