1 Introduction
Static analysis tools play an important role in different stages of the software development cycle, such as code verification and optimization. However, building modern analyzers for programming languages presents significant challenges, since these systems are typically large and complex, and thus prone to bugs. This limits their applicability in real-life production compilers and development environments, where they are used for critical tasks that require assurance of the soundness of the analysis results.
However, the validation of static analyzers is a challenging problem, which is not well covered in the literature or by existing tools. This is probably because the direct application of formal methods is not always straightforward for code that is this large and complex, even without considering the problem of having precise specifications to check against, a clear instance of the classic problem of who checks the checker. In current practice, extensive testing is the most widespread and realistic validation technique, but it poses significant challenges of its own: testing separate components of the analyzer misses integration errors, and designing proper oracles for testing the complete tool is difficult.
In this paper we propose checkification, a simple, automatic technique for testing static analyzers. We believe the approach is general in nature, and can be applied effectively to a wide class of static analyzers, provided some kind of run-time checking is feasible for the properties inferred. Herein we develop the proposal for concreteness in the context of the Ciao (Hermenegildo et al. Reference Hermenegildo, Bueno, Carro, Lopez-Garcia, Mera, Morales and Puebla2012) logic programming-based, multi-paradigm language. The Ciao programming environment includes CiaoPP, a large and complex abstract interpretation-based static analysis tool which faces the specific challenges that we are addressing. Recently, there has been some interesting work (Stade et al. Reference Stade, Tilscher and Seidl2024) aimed at verifying the partial correctness of the PLAI analysis algorithm (also referred to as “the top-down solver”) that lies at the heart of CiaoPP using the Isabelle prover (Paulson, Reference Paulson and Odifreddi1990), but verification of the actual implementation remains a challenge. Like other “classic” analyzers, the CiaoPP formal framework has evolved for a long time, incorporating a large number of abstract domains, features, and techniques, adding up to over half a million lines of code. Over the years, these components have in turn reached different levels of maturity. While the essential parts, such as the fixpoint algorithms and the classic abstract domains, have been used routinely for a long time now and bugs in them are rare, other parts are less developed and yet others are prototypes or even proofs of concept (see Table 1 for an overview of some of the abstract domains that are bundled with the system and their maturity status). We show in Section 4.3 how our proposed method reveals bugs, not only in the less-developed parts of the system but also in corner cases of the more mature components, such as the handling of built-ins, run-time checking instrumentation, etc.
Table 1. Abstract domains

A feature of Ciao that will be instrumental to our approach is the use of a unified assertion language across the components of the Ciao framework (Hermenegildo et al. Reference Hermenegildo, Puebla, Bueno, Apt, Marek, Truszczynski and Warren1999, Reference Hermenegildo, Puebla, Bueno and Lopez-Garcia2003), which together implement a unique blend of static and dynamic assertion checking. These components (and their algorithms) include:
1. The static analyzer (Muthukumar and Hermenegildo, Reference Muthukumar and Hermenegildo1992; Hermenegildo et al. Reference Hermenegildo, Puebla, Marriott and Stuckey2000; Garcia-Contreras et al. Reference Garcia-Contreras, Morales and Hermenegildo2020) (the top-down analysis framework), which expresses the inferred information as assertions interspersed within the original program.
2. The assertion run-time checking framework (Stulova et al. Reference Stulova, Morales and Hermenegildo2015, Reference Stulova, Morales and Hermenegildo2016), which instruments the code to ensure that any assertions remaining after static verification are not violated at run time.
3. The (random) test case generation framework (Casso et al. Reference Casso, Morales, Lopez-Garcia, Hermenegildo and Gabbrielli2020), which generates random test cases satisfying the properties present in assertion preconditions.
4. The unit-test framework (Mera et al. Reference Mera, Lopez-Garcia and Hermenegildo2009), which executes those test cases.
In this paper, we propose an algorithm that combines these four basic components in a novel way that allows testing the static analyzer almost for free. Intuitively, it consists in checking, over a suite of benchmarks, that the properties inferred statically are satisfied dynamically. The overall testing process, for each benchmark, can be summarized as follows: first, the code is analyzed, obtaining the analysis results expressed as assertions interspersed within the original code. Then, the status of these assertions is changed (from true to check), so that they are turned into run-time checks which ensure that violations of those assertions are reported at execution time. Finally, random test cases are generated and executed to exercise those run-time checks.
Given that these assertions (the analyzer output) must cover all possible concrete executions, if any assertion violation is reported then, assuming that the run-time checks and our checking algorithm implementation are correct, the assertion was incorrectly inferred by the analyzer, thus revealing an error in the analyzer itself. The error can of course sometimes also be in the run-time checks, but typically run-time checking is simpler than inference. This process is automatable, and, if it is repeated for an extensive and varied enough suite of benchmarks, it can be used to effectively validate (even if not fully verify) the analyzer or to discover new bugs. Furthermore, the implementation, when framed within a tool environment that follows the Ciao assertion model, is comparatively simple, at least conceptually.
The idea of checking at run time the properties or assertions inferred by the analysis for different program points is not new. For example, Wu et al. (Reference Wu, Hu, Tang and Yang2013) successfully applied this technique for checking a range of different aliasing analyses. However, these approaches require the development of tailored instrumentation or monitoring, and significant effort in their design and implementation. We argue that the testing approach is made more applicable, general, and scalable by the use of a unified assertion-based framework for static analysis and dynamic debugging, such as the Ciao assertion model. As mentioned before, by developing the approach within such a framework, it can be implemented with many of the already existing algorithms and components in the system, in a very simple way. As a result, our initial prototype was quite simple, even if the current working version has inevitably grown quite a bit in order to add functionality, make it easier to use, include specific instrumentation, collect performance data, etc. Moreover, the components and algorithms of the Ciao system used by our implementation have been extensively validated, providing greater confidence in the correctness of our approach. Consequently, when our method flags a run-time checking error, we can be more certain that it identifies an actual error in the analyzer. If no error exists in the analyzer, such run-time checking errors help us locate and fix bugs in our implementation, and these fixes will likely also address bugs in other Ciao system components used by our implementation. Importantly, these components can also serve other purposes in the software development process. In conclusion, any run-time checking error flagged by our approach contributes to improving the entire Ciao system.
We also argue that our approach is particularly useful in a mixed production and research setting like that of CiaoPP, in which there is a mature and domain-parametric abstract interpretation framework used routinely, but new, experimental abstract domains and overall improvements are in constant development. Those domains can easily be tested relying only on the existing abstract-interpretation framework, run-time checking framework, and unified assertion language, provided only that the assertion language is extended to include the properties that are inferred by the domains.
The rest of the paper is structured as follows: Section 2 provides the background needed for describing the main ideas and contributions of the paper. In particular, we describe the basic components used in our approach. Section 3 then presents and discusses our proposed “checkification” algorithm for testing static analyzers, with an initial illustrative example, the basic reasoning behind the approach (Section 3.1), the operation of the algorithm (Section 3.2), and discussions of some additional issues involved (Sections 3.3 and 3.4). In Section 4 we present our experimental evaluation and results. We explain the evaluation setup (Section 4.1) including experiments, analyzer configuration, abstract domains and properties studied, and programs analyzed. We then present and discuss the results of this evaluation (Section 4.2), in terms of the errors found and cost of the technique. Section 4.3 then presents further discussion with examples of the classes of errors detected which also serves to go over some of the practical uses of the approach. We conclude by discussing additional related work in Section 5, and presenting some conclusions and perspectives in Section 6.
2 Basic components
2.1 Assertion Language
Assertions are syntactic objects which allow expressing properties of programs that should hold at certain points of program execution. Assertions are used everywhere in Ciao, from documentation and foreign interface definitions to static analysis and dynamic debugging. Two types of Ciao assertions that are relevant herein are predicate assertions (pred for short) and program-point assertions:Footnote 1 The first ones are declarations that provide partial specifications of a predicate. They have the following syntax:
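% Schematic form of a pred assertion (field names as used in the surrounding
% description; this is a sketch of the general shape):
:- pred Head : Calls => Success + Comp.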

and express that a) calls to predicate Head that satisfy precondition Calls are admissible and b) that, for such calls, the predicate must satisfy post-condition Success if it succeeds, and global computational properties Comp. If there are several pred assertions, the set of Calls fields define the admissible calls to the predicate. Program-point assertions are reserved literals that appear in the body of clauses and describe properties that hold in the run-time constraint store every time execution reaches that point in the clause at run time. Their syntax is Status(State). Both of these kinds of assertions can have different values in the Status field depending on their origin and intended use. Assertion statuses relevant herein include:
• true, which is the status of the assertions that are output from the analysis (and thus must be safe approximations of the concrete semantics);
• check, which indicates that the validity of the assertion is unknown and it must be checked, statically or dynamically, and is the default value of Status when not indicated; and,
• trust, which indicates that the analyzer should assume this assertion to be correct, even if it cannot be automatically inferred.
Example 2.1 (Some assertions). The following code fragment provides examples of both types of assertions, predicate and program-point; all these assertions have status check:
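% A sketch consistent with the description below; the exact code in the
% original example may differ slightly:
:- pred append(Xs,Ys,Zs) : (list(Xs), list(Ys), var(Zs)) => list(Zs) + det.
:- pred append(Xs,Ys,Zs) : (var(Xs), var(Ys), list(Zs)) => (list(Xs), list(Ys)) + multi.

append([], Ys, Ys).
append([X|Xs], Ys, [X|Zs]) :-
    append(Xs, Ys, Zs),
    check((list(Xs), list(Ys), list(Zs))).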

The first two pred assertions define two different ways in which append/3 is expected to be called. The first one states that append/3 may be called with the two first arguments instantiated to lists and the third a variable, and that, if such a call succeeds, then the third argument should be bound to a list. This first assertion also states that when called like this, the predicate should have only one solution and should not fail (det, a global computational property). The second pred assertion states that append/3 may also be called with the third argument instantiated to a list and the first two variables, and that, if such a call succeeds, then the first and second arguments should be bound to lists, and that in this case the predicate should produce one or more solutions (multi, also a global computational property), but, again, not fail. There is also a program-point assertion in the second clause of append/3 that states that if execution reaches that point in that clause, all of Xs, Ys, and Zs should be bound to lists. For all these assertions the check status indicates that these are desired properties that need to be checked, statically or dynamically, but have not been proven true or false yet.
Assertion fields Calls, Success, Comp and State are conjunctions of properties. Such properties are predicates, typically written in the source language (user-defined or in libraries), and thus runnable, so that they can be used as run-time checks. For our purposes herein, we will typically consider properties that are native to CiaoPP, that is, properties that can be abstracted and inferred by some domain in CiaoPP. This includes a wide range of properties, from types, modes and variable sharing, to determinacy, (non)failure, and resource consumption. We refer the reader to Puebla et al. (Reference Puebla, Bueno, Hermenegildo, Deransart, Hermenegildo and Maluszynski2000); Hermenegildo et al. (Reference Hermenegildo, Puebla, Bueno and Lopez-Garcia2005, Reference Hermenegildo, Bueno, Carro, Lopez-Garcia, Mera, Morales and Puebla2012) and their references for a full description of the Ciao assertion language.
In the Ciao assertion syntax, properties can also be in-lined in the predicate arguments, using what are referred to as modes. Such modes are property macros that serve to specify in a compact way several properties referring to a predicate argument. A specific syntax, resembling that of predicate assertions, is used to define modes.
Example 2.2 (Modes). For example, if the following modes are defined:Footnote 2
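% Illustrative only: a plausible pair of mode definitions in the style of the
% Ciao mode libraries (the actual definitions used in the example may differ):
:- modedef +(A) : nonvar(A).
:- modedef -(A) : var(A).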

then the assertions in the previous program can be expressed equivalently as follows:


Fig. 1. The Ciao assertion framework (CiaoPP’s verification/testing architecture).
Figure 1 depicts the overall architecture of the Ciao unified assertion framework. Hexagons represent tools, and arrows indicate the communication paths among them. Most of this communication is performed in terms of assertions. The input to the process is the user program, optionally including some assertions. Such assertions always include any assertions available in the libraries for built-ins (the basic operations of the source language), or for predicates exported by such libraries that are used by the code being analyzed (left part of Figure 1).
2.2 Static Program Analysis
Abstract interpretation is a formal framework for static analysis that allows inferring program properties that hold for all possible program executions. Different abstractions, called abstract domains, are used in this process for approximating sets of concrete run-time states. The CiaoPP analyzer is abstract interpretation-based, and its design consists of a common abstract-interpretation framework based on fixpoint computation parameterized by different, pluggable abstract domains (the Static Analysis hexagon in Figure 1). This means that the set of properties that are used in assertions is extensible with new abstract domains defined as plug-ins to support them. Depending on the selected domain or combination of domains for analysis, CiaoPP constructs a program analysis graph, starting from the program entry points. In this graph, nodes represent the different ways in which predicates are called. A predicate can have multiple nodes associated with it if it is called in different ways (calling contexts). For each calling context, properties are inferred that hold if the predicate succeeds (and also global properties). These properties will be emitted also as assertions, which will have status true (represented by the Analysis Info box in Figure 1). Optionally, a new source file is generated for the analyzed program, which is identical to the original but with true program-point assertions interspersed between every two consecutive literals of each clause, and with one or more true predicate assertions added for each predicate. In particular, if there are several different calling contexts for a given predicate, there will typically be a predicate assertion added for each of these contexts (this is also referred to as multivariance in the analysis). Further details of this process can be found in Muthukumar and Hermenegildo (Reference Muthukumar and Hermenegildo1989, Reference Muthukumar and Hermenegildo1990, Reference Muthukumar and Hermenegildo1992). In any case, when discussing the identified bugs later in the paper, we will provide more explanations as needed.
2.3 Run-Time Checking
Static analysis is used for compile-time checking of assertions. However, due to the inherent undecidability of static analysis, sometimes properties cannot be verified statically. In those cases, the remaining unproved (parts of) assertions are written into the output program with check status, and then this output program can (optionally) be instrumented with run-time checks to make it run-time safe. These dynamic checks will encode the semantics of the check assertions, ensuring that an error is reported at run time if any of these remaining assertions is violated (the dynamic part of the Ciao assertion model). Note that almost all current abstract interpretation systems assume in their semantics that the run-time checks will always be executed. However, CiaoPP does not make this assumption by default, that is, it is configurable as an option, since in some use cases run-time checks may in fact be disabled by the user for deployment.
Checking program state properties, such as traditional types and modes, at run time can be performed relatively easily: as mentioned before, most properties are runnable, and the check/1 wrapper will ensure that the check will succeed or raise an error, without binding any arguments. For example, calling check(list(X)) with X = [], X = [a], or X = [A,B], will succeed, without binding any variables, while calling check(list(X)) with X = a, X = f(a), or X = A, will raise an error.Footnote 3 In practice, a quite rich set of properties is checkable, including types, modes, variable sharing, exceptions, determinacy, (non-)failure, choice-points, and more, smoothly blending static and dynamic techniques. On the other hand, checking at run time other global properties such as cost and, especially, termination, is obviously less straightforward. While checking these types of properties could conceptually be done with our proposed algorithm, in this paper we concentrate on the other properties mentioned.
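The intended behavior of this wrapper can be conveyed by the following minimal sketch, which uses a hypothetical rtcheck_error/1 error term and is not the actual rtchecks library implementation:

check(Goal) :-
    ( \+ \+ call(Goal) ->          % double negation: succeed without leaving bindings
        true
    ; throw(rtcheck_error(Goal))   % hypothetical error term, for illustration only
    ).

With this reading, check(list(X)) for X = [a] succeeds silently and without binding anything, while for X = f(a) it raises an error, as described above.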
2.4 Unit Tests, Test Case Generation, and Assertion-based Testing
Test inputs can be provided by the user, by means of test assertions (unit tests). The run-time checking mechanism can test these assertions but also any other assertion in any predicate called by the test case that was not verified in the static checking. The unit-testing framework in principle requires the user to manually write individual test cases for each assertion to be tested. However, the Ciao model also includes mechanisms for generating test cases automatically from the assertion preconditions, using the corresponding property predicates as generators. For example, calling list(X) with X uninstantiated generates lazily, through backtracking, an infinite set of lists, X = [], X = [_], X = [_,_], etc. Stating a type for the list argument will then also generate concrete values for the list elements. This enumeration process can be combined in Ciao with the different supported search rules (breadth-first, iterative deepening, random search, etc.) to produce, for example, fair enumerations. This idea has been extended recently (Casso et al. Reference Casso, Morales, Lopez-Garcia, Hermenegildo and Gabbrielli2020) to a full random test case generation framework, which automatically generates, using the same technique, random test cases that satisfy assertion preconditions. We refer to the combination of this test generation mechanism with the run-time checking of the intervening assertions as assertion-based testing. In other words, assertion-based testing involves generating and running relevant test cases that exercise the run-time checks of the assertions in a program to test if those assertions are correct. This technique (present in the Ciao model since its origins) yields similar results to property-based testing (Claessen and Hughes, Reference Claessen and Hughes2000) but in a more integrated way within the overall assertion model and within CiaoPP, rather than as a separate technique. Such automatic generation is currently supported for native properties, regular types, and user-defined properties as long as they are restricted to pure Prolog with arithmetic or mode and sharing constraints. In addition, users can also write their own generators, and of course other test generation techniques and tools can be used (Fortz et al. Reference Fortz, Mesnard, Payet, Perrouin, Vanhoof, Vidal, Nakano and Sagonas2020).
3 The checkification algorithm
This section provides a detailed overview of the proposed algorithm for testing the static analyzer, which incorporates all the components mentioned above.
Illustrative example. Let us start by sketching the main idea of our approach with a motivating example. Assume we have the following simple Prolog program, where we use an entry assertion to define the entry point for the analysis. The entry assertion indicates that the predicate is called with its second argument instantiated to a list and the third a free variable (we use the mode definitions of Example 2.2):
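% A sketch of the example program; the entry assertion is written here with
% explicit properties rather than the modes of Example 2.2, to keep this
% reconstruction self-contained:
:- entry prepend(X,Xs,Ys) : (list(Xs), var(Ys)).

prepend(X,Xs,Ys) :-
    Ys = [X|Rest],
    Rest = Xs.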

Assume that we analyze it with a simple modes abstract domain that assigns to each variable in an abstract substitution one of the following abstract values:
• ground (the variable is ground),
• var (the variable is free),
• nonground (the variable is not ground),
• nonvar (the variable is not free),
• ngv (the variable is neither ground nor free), or
• any (nothing can be said about the variable).
Assume also that the analysis is incorrect because it does not consider sharing (aliasing) between variables, so when updating the abstract substitution after the Rest=Xs literal, the abstract value for Ys is not modified at all.Footnote 4 The result of the analysis will be represented, as explained in the previous section, as a new source file with interspersed assertions, as shown in Figure 2 (lines 3-5, 8, 10 and 12). Note that the correct result, if the analysis considered aliasing, would be that there is no groundness information for Ys at the end of the clause (line 12), since there is none for X or Xs at the beginning either. Ys could only be inferred to be nonvar, but instead is incorrectly inferred to be nonground too (line 10). Normally any/1 properties (i.e., top, or unknown) would not actually be included in the analysis output for conciseness, but are included in Figure 2 for clarity.

Fig. 2. An incorrect simple mode analysis.
The objective of our approach is to check dynamically the validity of these true assertions from the analyzer, which in this case contain an error. The insight is that, thanks to the different capabilities of the Ciao model presented previously, this can be achieved by (1) turning the status of the true assertions produced by the analyzer into check, as shown in Figure 3.Footnote 5 This would normally not make any sense since these true assertions have been proved by the analyzer. But that is exactly what we want to check, that is, whether the information inferred is incorrect. To do this, (2) we run the transformed program (Figure 3) again through CiaoPP (Figure 1) but without performing any analysis. In that case, the check literals (stemming from the true literals of the previous run) will not be simplified in the comparator (since there is no abstract information to compare against) and instead will be converted directly to run-time tests. In other words, the check(Goal) literals will be expanded and compiled to code that, every time that this program point is reached, in every execution, will check dynamically if the property (or properties) within the check literal (i.e., those in Goal) succeed, and an error message will be emitted if they do not. The only missing step to complete the automation of the approach is to (3) run prepend/3 on a set of test cases. These may in general be already available as test assertions in the program or, alternatively, the random test case generator can be used to generate them. For example, for prepend/3 the test generation framework will ensure that instances of the goal prepend(X,Xs,Ys) are generated, where Xs is constrained to be a list, and Ys remains a free variable. However, X and the elements of Xs will otherwise be instantiated to random terms. In this example, as soon as a test case is generated where both X and all elements in Xs are ground, the program will report a run-time error in the check in line 12, letting us know that the third program point, and thus the analysis, is incorrect.Footnote 6

Fig. 3. The instrumented program.
The same procedure can be followed to debug different analyses with different benchmarks. If the execution of any test case reports a run-time error for one assertion, it will mean that the assertion was not correct and the analyzer computed an incorrect over-approximation of the semantics of the program. Alternatively, if this experiment, which can be automated easily, is run for an extensive suite of benchmarks without errors, we can gain more confidence that our analysis implementation is correct, even if perhaps imprecise (although, of course, testing cannot in general guarantee correctness).
3.1 Basic reasoning behind the approach
We start by establishing more concretely the basic reasoning behind the approach in terms of abstract interpretation and safe upper and lower approximations. The mathematical notation in this subsection is meant to provide a more precise explanation rather than a deep formalization, which is arguably not really necessary thanks to the simplicity of the approach, which builds on the different parts of the system that act as a trusted base.
An abstract interpretation-based static analysis computes an over-approximation $S_P^+$ of the collecting semantics $S_P$ of a program $P$. Such collecting semantics can be broadly defined as a control flow graph for the program decorated at each node with the set of all possible states that could occur at run time at that program point. Different approximations of this semantics will have smaller or larger sets of possible states at each program point. Let us denote by $S'_P \subset_P S''_P$ the relation that establishes that an approximation of $S_P$, $S''_P$, is an over-approximation of another, $S'_P$. The analysis will be correct if indeed $S_P \subset_P S^+_P$.

Since $S_P$ is undecidable, this relation cannot be checked in general. However, if we had a good enough under-approximation $S^-_P$ of $S_P$, it could be tested as $S^-_P \subset_P S^+_P$. If this relation does not hold, that is, if $S^-_P \not\subset_P S^+_P$, then $S_P \not\subset_P S^+_P$, and thus the results of the analysis would be incorrect, that is, the computed $S_P^+$ would not actually be an over-approximation of $S_P$.

An under-approximation of the collecting semantics of $P$ is easy to compute: it suffices to run the program with a subset $I^-$ of the set $I$ of all possible initial states. We denote the resulting under-approximation $S^{I^-}_P$, and note that $S_P = S^I_P$, which would be computable if $I$ were finite and $P$ always terminated. This is the method that we propose for testing the analysis: selecting a large and varied enough $I^-$, computing $S^{I^-}_P$, and checking that $S^{I^-}_P \subset_P S^+_P$.
A direct implementation of this idea is challenging. It would require tailored instrumentation and monitoring to build and deal with a partially constructed under-approximation of the collecting semantics as a programming structure, which would then need to be compared to the one the analysis handles. However, as we have seen, the process can be greatly simplified by reusing some of the components already in the system, based on the following observations:
• We can work with one initial state $i$ at a time, following this reasoning: $S^{I^-}_P \subset_P S^+_P \iff \forall i \in I^-,\ S^{\{i\}}_P \subset_P S^+_P$.
• We can use the random test case generation framework for selecting each initial state $i$.
• Instead of checking $S^{\{i\}}_P \subset_P S^+_P$, we can instrument the code with run-time checks to ensure that the execution from initial state $i$ does not contradict the analysis at any point, that is, to make sure that the state of the program at any program point is contained in the over-approximation of the set of possible states that the analysis inferred and output as Ciao assertions.
3.2 Operation of the algorithm
We now show the concrete algorithm for implementing our proposal, that is, the driver that combines and inter-operates the different components of the framework to achieve the desired results. The essence of the algorithm (Alg. 1) is the following: non-deterministically choose a program $P$ and a domain $\mathcal{D}$ from a collection of benchmarks and domains, and execute the AnaTest($P,\mathcal{D}$) procedure until an error is found or a limit is reached. Unless the testing part is ensured to explore the complete execution space, it could in principle be useful to revisit the same $(P,\mathcal{D})$ pair more than once.Footnote 7 There is no restriction regarding the number of entry points or inputs to a program to be analyzed. It is common in tools related to ours to use as benchmarks programs with a single entry point and no inputs (for example, just a single void main() function as entry point for C). In Ciao, program signatures and types are optional. Admissible inputs (i.e., the initial set of possible states for analysis or test case generation) can be specified by writing assertions for the exported predicates, or skipped altogether. Note also that if the program $P$ had the restriction mentioned above (in our case, exporting only a main/0 predicate), then test case generation would not be needed for our algorithm. In the absence of assertions, the test case generation framework already has some mechanisms to generate relevant test cases, instead of random, nonsensical inputs which would exercise few run-time checks before failing. However, these generators have limitations, and the assertion-based testing framework is in fact best used with assertions that have descriptive-enough calling contexts, or with custom user-defined generators in their absence.
Algorithm 1 The “Checkification” Analysis Testing Algorithm
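The essence of the AnaTest($P,\mathcal{D}$) procedure can be summarized by the following Prolog-style sketch; all predicate names are illustrative placeholders rather than the actual CiaoPP interface:

% Illustrative sketch of AnaTest(P,D); predicate names are placeholders.
ana_test(Prog, Dom) :-
    analyze(Prog, Dom, ProgAn),          % ProgAn: Prog annotated with true assertions
    true_to_check(ProgAn, ProgCheck),    % "checkification": switch true -> check
    instrument_rtchecks(ProgCheck, ProgRT),
    ( % enumerate (e.g., randomly generated) test cases up to some budget
      test_case(ProgRT, Input),
      run_with_rtchecks(ProgRT, Input, Result),
      Result = error(Loc)                % a run-time check was violated
    ->
      report_error(Prog, Dom, Input, Loc)
    ;
      true                               % no violation found within the budget
    ).

Backtracking over test_case/2 enumerates inputs until a run-time check violation is observed or the test budget is exhausted.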

When the algorithm detects a faulty program-point assertion for some $input$ (Error($input,l$)), it means that the concrete execution reaches a state not captured by the (potentially safely over-approximated) result of the analysis. It is important to note that although error diagnosis and debugging are primarily left for the user to perform manually, our tool facilitates the task in some aspects. Firstly, it is possible to reconstruct (or store together with the test output) additional information comparing the concrete execution trace (which is logged during testing) with the analysis graph (recoverable from $P_{an}$, the program annotated with analysis results), domain operations (inspecting the analysis graph), and transfer functions (from predicates that are native to each domain). Secondly, the assertion-based testing tool supports shrinking of failed test cases, so we can expect reasonably small variable substitutions in the errors reported. Lastly, as sketched in Algorithm 1, the error location and trace reported by the run-time verification framework provide an approximate idea of the point where the analysis went wrong, even if not necessarily of the original reason why (which requires a separate diagnosis step). If the run-time check error points to a program-point assertion right after a call to an imported predicate, then either the analysis erred in applying the entry declaration for the predicate, the entry declaration was wrong, or, if there was no entry declaration, the analysis failed to compute the “topmost” abstract state reachable from the call abstract state.
3.3 Some considerations on properties
In order to test an analysis with the algorithm proposed, two conditions must be met. The first one is to have a translation from the internal representation of the abstract values in the domain to Ciao user-level properties. These properties that can represent the information inferred by a given domain or domains are called the native properties of the domain(s). Note that these are already requirements for any abstract domain intended to make full use of the framework, so normally all implemented domains include the definition of the corresponding native properties and the translation from abstract domain values to them.
The other condition is to be able to perform run-time checks for those properties, that is, that they can be used by the run-time checking framework. As discussed in Section 2, such run-time checks can range from very simple or even already built into the language (e.g., var/1), to intermediate (e.g., aliasing or groundness), to more complex and costly (e.g., costs or side-effects), to theoretically impossible (e.g., termination). But they can also be safely approximated to detect errors or to issue warnings (e.g., in the case of termination, by detecting repeated identical calls which lead to non-termination or by timeouts). It is also important to note that complex analyses such as termination are typically dependent on a number of other instrumental analyses which can themselves be checked.
In general, the availability of run-time checks is a standard requirement for domains to be able to make full use of the framework, in order to support the dynamic checks that are generated when properties cannot be proved statically. This functionality is normally implemented when a new abstract domain is added to the system, by also defining the related properties to be used in assertions. If the definition of these native properties is provided directly in the source language, then such properties are typically already runnable and thus available for run-time checking; however, it is also possible to provide an implementation specialized for run-time checking if desired. For properties that are declared native but are not written in the source language, then a run-time test version must be provided. In practice, most current Ciao abstract domains include the mentioned functionalities and can be tested as is with the proposed approach.
3.4 Multivariance and path-sensitivity
As presented, it could appear that our approach could miss some analysis errors even if the right test cases are used, since we have, to all appearances, disregarded multi-variance and path-sensitivity. In fact, in CiaoPP the information inferred is fully multi-variant, and separate path information is kept for each variant (i.e., calling context). However, by default, the analyzer produces an output that is easy for the programmer to inspect, that is close to the source program. This means that when outputting the analysis results, by default the different versions of each predicate (and the associated information) are combined into a single code version and a single combined assertion for each program point and predicate. If this default output is used when implementing our approach, it is indeed entirely possible that the analysis errs at a program point in one path but the algorithm never detects it: this can happen if, for example, in another path leading to the same program point (such that the two paths and their corresponding analysis results are collapsed –lubbed– together at the same program point) the analysis infers a too general value (higher in the domain lattice) at that program point and thus, the error is not detected. This issue is controlled by a flag that, when enabled, ensures that the different versions are not collapsed and are instead materialized into different predicate instances. This way, multiple versions may be generated for a given predicate, if there are separate paths to them with different abstract states, and the corresponding analysis information will be annotated separately for each abstract path through the program in the program text of the different versions, avoiding the problem mentioned above.
4 Evaluation
In this section we report on the different experiments that we have conducted in order to benchmark the checkification approach and assess its practicality.
4.1 Evaluation setup
4.1.1 The experiments
The experiments have consisted in, for a set of benchmarks, analyzing them with different abstract domains, performing the checkification transformation on the analysis results, and testing the resulting programs on sample inputs with run-time checks activated. The objective has been to assess whether we can indeed find errors using the technique and to estimate the cost involved in detecting those errors.
4.1.2 Analyzer configuration
The experiments were run with Ciao/CiaoPP version 1.23. The tested analyses used the standard, default configuration of the abstract interpretation framework, that is the default values of the different flags, such as, for example, using multi-variance on calls, using the original PLAI fixpoint algorithm, etc., but, of course, differ in the abstract domain selected for performing each analysis.
4.1.3 Properties and domains
In the experiments we used a wide range of analyses for different properties that are typically of interest when describing or verifying logic programs (the list of all the abstract domains used is provided in Table 1):Footnote 8
• The first class of properties is aimed at capturing variable instantiation state, that is which variables are bound to ground terms, or free, and, if they are not ground, the variable sharing relationships among them. These properties are approximated using aliasing and modes-style abstract domains.
• The second set of properties refers to the shapes of the data structures constructed by the program in memory. These properties are tracked by the term structure and types classes of abstract domains.
• The third class of properties that we have considered refers to the numerical relations among program variables, which are useful to describe properties of numerical parts of programs. For these, we use in our experiments numerical-style abstract domains.
• Finally, we also evaluate the approach on analyses for computational properties, that is properties of whole computation subtrees, in particular determinacy and (non)failure. These analyses sometimes do not provide the information at the program points between literals, but rather at the predicate level. In these cases, the transformation and tests are done at the predicate level.
4.1.4 Programs analyzed
The programs used in our experiments can be divided into two different groups:
The first group, listed in Table 2, comprises a number of well-known, classic benchmarks. Some of them also represent kernels of applications. For example, aiakl is the main part of an analyzer for the AKL language; boyer is the kernel of a theorem prover; and witt is the central part of a conceptual clustering application.
Table 2. Benchmarks

The second group comprises complete systems that are in current use:
• A filtering tool (Ferreiro et al. Reference Ferreiro, Morales, Abreu and Hermenegildo2023), used regularly for creating teaching materials. This tool contains 1.1K lines of code in its kernel, and uses also a number of Ciao libraries. It is an interesting example since it includes different built-in predicates for handling files and streams.
• Deepfind (Garcia-Contreras et al. Reference Garcia-Contreras, Morales and Hermenegildo2016), a tool that facilitates searching code repositories and libraries by querying for semantic characteristics of the code. Its kernel consists of around 10 files where the Prolog code is about 1.5K lines, and uses in addition a good number of CiaoPP libraries to perform program analysis and assertion comparison.
• The classic chat-80 program (Warren and Pereira Reference Warren and Pereira1982), a natural language interface to a geographical database. It comprises 4.8K lines of code across 22 files. While typically used more as a demo than a real application, we have included it in this group because it is of good size, contains a number of system libraries using different Prolog built-ins and library predicates, and is known to stress several abstract interpretation domains.
• LPdoc, a documenter for LP systems used by Ciao and XSB (Hermenegildo, Reference Hermenegildo2000; Hermenegildo and CLIP Group, Reference Hermenegildo1997). This is the largest example with its kernel Prolog code being analyzed comprising about 22K lines, plus the use of many Ciao libraries.
• Spectector (Guarnieri et al. Reference Guarnieri, Köpf, Morales, Reineke and Sánchez2020), a tool for automatically detecting leaks introduced by speculatively executed instructions in x64 assembly programs. It consists of 15 modules with around 1.6K lines of code.
• The s(CASP) (Arias et al. Reference Arias, Carro, Salazar, Marple and Gupta2018) system is a top-down interpreter for ASP programs with constraints. The Prolog code is distributed into 44 modules.
No special analysis-related criteria were used in benchmark selection, and the code was analyzed and run as is, without modifications. The classic benchmarks, by default, include annotations with program assertions that describe the expected behavior, while no additional information is provided in the real-world programs.
4.2 Results
4.2.1 Cost of the technique
While run-time overhead is not our primary focus, we have evaluated this aspect in order to study whether the algorithm has an acceptable cost. This cost obviously has two components: the analysis time and the testing time (the transformation time is negligible). Regarding the testing time, note that the execution time of run-time tests can be reduced significantly through caching techniques (Koukoutos and Kuncak, Reference Koukoutos, Kuncak, Bonakdarpour and Smolka2014; Stulova et al. Reference Stulova, Morales and Hermenegildo2015). However, we decided not to use these optimizations for a number of reasons: to simplify the implementation; to avoid dependence on the implementation of other parts; and to avoid any bugs that optimizations could potentially hide, making it harder to identify them. The results are presented in Tables 3, 4, 5 and 6 for the different benchmarks and domains. The experiments were run on a MacBook Air with the Apple M1 chip and 16 GB of RAM. Each column in these tables corresponds to an abstract domain and in turn contains two sub-columns. The first sub-column is the absolute execution time in seconds for each benchmark once checkification has been applied, for the properties inferred by the corresponding abstract domain. This time includes both the transformation process, in which the annotations of the analysis are modified by replacing the status of true assertions by check status and inserting run-time checks, as well as the testing time of the instrumented program. The numbers in parentheses in the second sub-column provide the analysis times in seconds for the benchmarks, again for the different abstract domains.
Table 3. Timings and errors detected (1)

Classes of bugs found are marked in the table with different patterns: abstract domain implementation (Class I), fixpoint algorithms (Class II), semantic inconsistencies (Class III), run-time check instrumentation (Class IV), and third-party libraries (Class V).
Table 4. Timings and errors detected (2)

Classes of bugs found are marked in the table with different patterns: abstract domain implementation (Class I), fixpoint algorithms (Class II), semantic inconsistencies (Class III), run-time check instrumentation (Class IV), and third-party libraries (Class V).
Table 5. Timings and errors detected (3)

Classes of bugs found are marked in the table with different patterns: abstract domain implementation (Class I), fixpoint algorithms (Class II), semantic inconsistencies (Class III), run-time check instrumentation (Class IV), and third-party libraries (Class V).
Table 6. Timings and errors detected (4)

Classes of bugs found are marked in the table with different patterns: abstract domain implementation (Class I), fixpoint algorithms (Class II), semantic inconsistencies (Class III), run-time check instrumentation (Class IV), and third-party libraries (Class V).
Regarding the testing times, when an error is found during a test run, the time to do so is typically negligible. In these cases, we report instead the testing time after fixing the analysis so that no bugs are detected and the testing runs to completion. This is obviously also the case when no bug is detected to begin with. Thus, the testing times reported are always for complete runs, which we feel are more useful for estimating the testing cost.
If the time is labeled as n/a, it indicates that there is no time recorded due to the presence of a timeout or an unresolved bug, or that no analysis results were available. The latter can be caused by a crash during analysis or by the analysis output being malformed, for example missing assertions.
The results show that the execution time of the tests is quite reasonable, typically taking no more than around 60 s. Performance can be improved if needed by activating only the run-time semantics of the predicate assertions and/or disabling multi-variance.
4.2.2 Errors found
We now turn our attention to the most important point of whether the technique can indeed find errors in the analyses. We have manually analyzed the root causes of the errors found and classified them into different categories. The classification is indicated in Tables 3 to 6 by surrounding the numbers with different patterns. These bug categories include (I) defects in the implementation of the abstract domain, (II) defects in the implementation of the fixpoint algorithms, (III) semantic inconsistencies between components of the framework, (IV) run-time check instrumentation issues, and (V) defects related to third-party libraries. Table 7 provides a summary of identified bugs. The “Status” column indicates whether the bug is new or it was already a known issue at the time of running the experiments. The “Class” column lists the bug category, while the “Description” column provides a more detailed description of each bug. We will discuss stylized examples from each bug category in Section 4.3. In each column, bugs are additionally marked with symbols (e.g., $\dagger$) to group those that correspond to the same issue.
Table 7. Details of defects found. The root causes of bugs (Class) again include: (I) abstract domain implementation; (II) fixpoint algorithms; (III) semantic inconsistencies; (IV) run-time check instrumentation; (V) third-party libraries

The results of the experiments conducted so far are promising, allowing us to draw several significant conclusions and observations.
First and foremost, a good number of bugs and inconsistencies were found using the technique. These bugs were quite diverse, illustrating the power of the algorithm in finding all sorts of issues of different nature.
No bugs were found for the most mature domains. On the other hand, bugs were indeed found in the more experimental and prototype domains, that is, domains which were only partially developed and/or which (or whose run-time tests) supported only a subset of the language at the time of the experiments (e.g., the polyhedra abstract domain). This included, for example, no support being available for certain built-in operations, or no run-time behavior being defined for some properties used by such domains. All this has greatly helped complete and strengthen these less mature domains, since most of the bugs found have now been fixed. One thing to take into account when reading the results in the tables is that sometimes many reported errors are due to just one bug. For example, almost all errors reported with the shfr+clique+def domain refer to the same bug, which affected the analysis of several benchmarks (see the markings in the tables mentioned before for other examples). Also, it is important to note that each program run detects at most one bug at a time, since the process halts when it detects a run-time error.
Another conclusion from the experiments is that benchmark selection is very important when testing specific domains, since each example uses different built-ins and library predicates, exhibits different properties, etc. Nevertheless, while complex examples such as witt have proved most effective at identifying a wide range of bugs across different domains, a drawback is that for very complex benchmarks the analysis with the least mature domains is more prone to fail, and then sometimes no bug is detected.
As mentioned before, in addition to standard benchmarks, we have also applied our tool to real-life applications. In this case, rather than generating test cases, the experiment consists in compiling the application with all of its source files transformed by the algorithm and then using the application as usual; for example, for the LPdoc documenter, this meant generating full manuals with it.
An interesting observation is that the benchmark program suite showed similar effectiveness to the real-world applications in exposing errors, that is most bugs identified in real applications had already been detected by the set of smaller benchmarks.
Also, the presence of duplicate bugs is indicative that the algorithm can identify problems consistently.
Finally, the fact that some bugs have already been fixed suggests that identifying the source of the detected errors in the implementation is relatively straightforward.
4.3 Further discussion of the bugs detected
This section further illustrates some of the defects found by the proposed technique during our experiments reported in Section 4.1. The programs presented are representative of the actual code fragments that triggered the detection of the bugs; however, we generally show simplified and distilled versions for brevity and clarity, since the actual code would need significant context to be understood. We divide the discussion according to the different classes of bugs (I to V) introduced in Section 4.2. The bug numbers refer to Table 7.
4.3.1 Abstract domain implementation (Class I)
As mentioned before, our testing technique can be seen as a sanity or coherence check, and thus it can be targeted to test different components of the system depending on which ones are assumed to be trusted. In general, the Ciao abstract interpretation engine (the fixpoint algorithms and all the surrounding infrastructure of the system, into which the domains are “plugged-in”) includes the components of the analyzer we trust most since they have been used and refined for a long time. Thus, it makes sense to start by taking this as the trusted base and aiming the error-finding task at the different abstract domains. This is especially relevant since CiaoPP is at the same time a production and a research tool, and new domains are constantly being developed.
Example 4.1 (Bug #4: Missing assertion in library predicate description). Abstract descriptions of the behavior of the built-ins and library predicates are provided for each abstract domain, either in the file(s) defining the domain, in the libraries themselves, or in both. A first use of our algorithm is to find errors in these specifications and in the implementations of these built-ins and library predicates for a given abstract domain. In particular, if during testing a run-time error is found in a program-point assertion right after a call to a built-in, this can be due to an error in either the abstract description of that built-in or in its actual implementation. For instance, the code of the length/2 library predicate includes the following (simplified) assertions:
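% Plausible reconstruction: only the last assertion is quoted verbatim later
% in the text; the first two are illustrative simplifications.
:- pred length(L,N) : (list(L), var(N)) => int(N) + det.
:- pred length(L,N) : (var(L), int(N)) => list(L) + det.
:- pred length(?,?) + nondet.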

where int/1 is a primitive property (in this case a primitive type) and list/1 is defined in a library by the standard list predicate, and also declared to be a property (in particular, a regular type) in the usual way:
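% A sketch of the usual definition (modulo documentation declarations):
:- regtype list/1.
list([]).
list([_|T]) :- list(T).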

During the tests, in a call to length/2, illustrated by the following code:
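% Illustrative calling context (reconstructed): N is unbound when length/2 is
% called, i.e., the second argument is uninstantiated.
p(Xs) :- length(Xs, N), q(N).
q(_).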

that is a call with the second argument uninstantiated, the following analysis output was obtained:
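% Illustrative shape of the (wrong) analysis output: the program point after
% the call to length/2 is inferred to be unreachable.
p(Xs) :- length(Xs, N), true(unreachable), q(N).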

that is the abstract interpreter (wrongly) inferred failure or error in the call to length/2 and thus that the point after that would be unreachable. The checkification algorithm translated this output into:
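% The corresponding checkified version (illustrative):
p(Xs) :- length(Xs, N), check(unreachable), q(N).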

During run-time testing the check(unreachable) literal was actually reached and executed, which threw the corresponding error.
This led to detecting that the last declaration in the abstract description for length/2 (i.e., “:- pred length(?,?) + nondet.” in the set of assertions for length/2 above), had been deleted by mistake while refactoring some analyzer code. That assertion is the only one that allows calling length/2 with a variable in the second argument (in fact, in both arguments).
Example 4.2 (Bug #10: Combined abstract domain outputs unknown variables). While experimenting with the CLIQUE-Sharing+Freeness+Def combined domain an error was detected which can be illustrated with the following code:

for which CiaoPP inferred the following information:

where _A and _B were superfluous variables that had their first occurrence in the true(ground(…)) literals, and thus cannot be ground. A run-time error was thus produced by the checkified program in the check(ground([X,_A])) call, which made us realize that the CLIQUE-Sharing+Freeness+Def abstract domain was not correctly implementing the combination of analyses. In many combined domains, the combination typically reuses the component analyses. For example, all abstract functions for CLIQUE-Sharing+Freeness+Def initially compute results for the sharefree_clique and def functions. Then, they compose the information from the different domains, eliminating redundancies over the information inferred by each analysis (Codish et al. Reference Codish, Mulkers, Bruynooghe, García de la Banda and Hermenegildo1995). This is performed by the reduce functions, which essentially compute the reduced product. The problem detected stemmed from an accidental reversal of the sequence of operations in the implementation. First, the abstract function of def was executed. However, the reduce function was applied before running the abstract function of sharefree_clique. The problem with this is that in some abstract functions a renaming of variables is performed. This renaming substitution replaces each variable in the term it is applied to with distinct fresh variables. If the reduce function is applied before the renaming substitution, it incorrectly treats some variables as distinct when, in fact, they are identical before renaming.
4.3.2 Fixpoint algorithms (Class II)
Another possible application of the approach is for testing the abstract interpretation engine (the fixpoint algorithms and all the surrounding infrastructure of the framework) instead of the domains. This can be done by using domains that are simple or developed enough to be used as a trusted base. While the classic fixpoint algorithms are quite stable, new fixpoint algorithms or modifications of existing fixpoint algorithms are sometimes added to the system. Some recent examples include a new modular and incremental fixpoint (Garcia-Contreras et al. Reference Garcia-Contreras, Morales and Hermenegildo2021) and new program transformations that speed-up set sharing-based analyses by reducing the number of variables in abstractions (Jurjo et al. Reference Jurjo, Morales, Lopez-Garcia and Hermenegildo2024). Clearly, these new contributions may introduce new bugs into the system.
A first abstract domain that is useful for this type of check is the concrete domain itself. To this end, we give the analysis a singleton set of initial states as entry point, that is, a concrete value, and the analyzer then behaves as a (tabling) interpreter for the program, starting from the entry point as the initial concrete state. This test will then detect whether the analyzer incorrectly marks reachable parts of the program as unreachable.
Example 4.3 (Bug #21: Code refactoring). A code refactoring in the implementation of Jurjo et al. (Reference Jurjo, Morales, Lopez-Garcia and Hermenegildo2024) introduced a bug in the handling of built-ins and cuts. While no issues were identified in the initial version of the code, the refactored version (where most changes involved renaming operations and variables) revealed a problem when reapplying the algorithm. Specifically, the analyzer incorrectly inferred that some predicates containing built-ins and cuts were dead code, although they were indeed reachable in the concrete domain.
4.3.3 Semantic inconsistencies between components of the framework (Class III)
Even if every part of the system is validated separately, our tool can still help find inconsistencies among these parts. Most components in our system interact via assertions and thus the semantics of the assertions and of the properties used in the assertions need to be consistent across all parts. An interesting case that can occur is when there is a mismatch between the way properties are understood by the analyzer and the actual definition of the property that is used in the run-time checks. Checkification helps detect such inconsistencies.
Example 4.4 (Bug #7: Wrong linear/1 run-time semantics). When the analyzer outputs linear(X), the semantics is that X is the list of all program variables that the analysis can guarantee to be bound to linear terms at that program point, that is, terms that do not contain any repeated variables. The property was inferred correctly by CiaoPP. In particular, for a substitution such as {X/f(A,B), Y/g(A,C)}, CiaoPP was correctly inferring true(linear([X,Y])). However, after the checkification conversion to check(linear([X,Y])), a run-time error was being thrown. This allowed us to notice that the implementation of the property as a run-time check used the wrong predicate: instead of linearity, it checked that the terms in the list did not share any variables, which failed in this case since they share the variable A.
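A minimal sketch of this scenario (the predicate is ours):

    % X and Y are bound to linear terms, but they share the variable A:
    p(X, Y) :- X = f(A, _B), Y = g(A, _C).

    % Analyzer output (correct):  true(linear([X,Y]))
    % Checkified version:         check(linear([X,Y]))
    % The buggy run-time definition tested for non-sharing rather than
    % linearity, so the check failed here because X and Y share A.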
Example 4.5 (Bug #13: Different representation of parametric types). This problem was found when analyzing with type domains. Assume that the analyzer has inferred, using, for example, the eterms domain, that after success of a call to p(X), the argument X is bound to a list of a’s. This is expressed in the output as:

This posed no problems with the run-time checking. However, since this style of output can sometimes be verbose, for readability the user can optionally switch the output so that simple cases are expressed using quoting (^/1) as follows:

At some point the output had been switched to this format by default, and we then discovered that, when executing check(list(^(a), Xs)), the run-time check instrumentation did not implement this abridged syntax correctly. This was another inconsistency between the possible representations of the abstraction and what was understood by the run-time checks.
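The two output styles can be sketched as follows (the name of the auto-generated regular type, rt0, is ours, and the exact output syntax may vary):

    % Verbose style: an auxiliary regular type is generated and used:
    %   :- regtype rt0/1.
    %   rt0(a).
    %   true(list(rt0,X))
    %
    % Abridged style, quoting simple cases with ^/1:
    %   true(list(^(a),X))
    % whose checkified form, check(list(^(a),X)), was the one mishandled by
    % the run-time check instrumentation.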
4.3.4 Inconsistencies between properties and their specialized run-time checks (Class IV)
This is a special case of the previous class, related to the fact that it is possible to write an alternative version of a property to be used specifically in run-time checks, while also keeping the general definition, which may be easier to read or easier for the static analyzer to understand.
Example 4.6 (Bug #9 and #18: Undefined run-time behavior). In a few instances, test case failures stemmed from the specialized run-time versions of some properties simply being declared but left undefined. For example, testing flagged that the run-time checking implementations of the comparatively less-used clique/1 and instance/2 properties were missing.
Example 4.7 (Bug #8: Incompatibility between run-time behaviors). The approach also detected more subtle issues that only arise when several properties appear together. While addressing the issue of the previous example, another problem was detected when the clique/1 and mshare/1 properties appeared in the same assertion. Set sharing domains approximate all possible variable sharing (aliasing) that occurs at a given program point. The property typically used to denote such sharing among variables is mshare/1, whose argument contains a set of sets of variables. For instance, let V = {X, Y, Z, W} be the set of variables of interest, normally the variables of the clause, and consider the abstraction $\lambda = \{\{X\}, \{X,Y\}, \{X,Y,Z\}, \{X,Z\}, \{Y\}, \{Y,Z\}, \{Z\}, \{W\}\}$. Here a set {X,Y} represents that, in the terms that X and Y are bound to at run time, there may be variables that appear in both X and Y; {W} that there may be a variable that appears only in W; etc. This $\lambda$ will be expressed in the analyzer output as: true(mshare([[X], [X,Y], [X,Y,Z], [X,Z], [Y], [Y,Z], [Z], [W]])). We can see that for the variables {X, Y, Z} no information (i.e., top) has been inferred. Indeed, for this set of variables any aliasing may be possible, since there may be run-time variables shared by any pair of the three program variables, by the three of them, or not shared at all. The idea of the clique property is to use a more compact representation for abstractions that contain a powerset, including as a widening when abstractions become too large. This has been shown to pay off in practice (Navas et al. Reference Navas, Bueno and Hermenegildo2006). In our example, the clique that conveys the same information with respect to {X, Y, Z} as the sharing set $S = \wp(\{X, Y, Z\})$ is simply clique([X,Y,Z]). The elements of $S$ are then eliminated from the full sharing set $\lambda$, since the clique makes them redundant. However, when checkifying the output, it was detected that the specialized mshare/1 run-time checking implementation was still considering all the variables in the clause ({X, Y, Z, W} in our case). Therefore, X, Y, and Z were incorrectly being tested for groundness, since they appear in no sharing set and are thus interpreted as being ground. Clearly, this is not what the analysis inferred. The mshare/1 run-time test was modified to stop considering such variables.
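For illustration, the compact form of the abstraction above could appear in the output roughly as follows (the exact concrete syntax of the combined clique/sharing output is an assumption on our part):

    %   true((clique([X,Y,Z]), mshare([[W]])))
    %
    % Only W remains in the sharing groups; X, Y, and Z are covered by the
    % clique. The buggy specialized mshare/1 check still considered all
    % clause variables, found X, Y, and Z in no sharing group, and wrongly
    % tested them for groundness.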
Example 4.8 (Bug #3: mshare/1 sensitivity to variable ordering). This case involved a subtle bug in the specialized run-time check for the sharing abstract domain. As mentioned before, a sharing group [X,Y] in an mshare/1 property means that X may share variables with Y at run time. Sharing is a symmetric property that does not depend on variable ordering, that is, [X,Y] has the same meaning as [Y,X]. Some cases were detected in which a run-time checking error was flagged even though the mshare/1 property inferred was correct. The problem was traced to an update of the specialized run-time definition of mshare/1 that introduced an optimization which made the result sensitive to the ordering of variables.
Example 4.9 (Bug #16: Incorrect definition of constraint/1). The constraint/1 property is used by some numerical (e.g., polyhedra-related) domains. Its argument is a list of linear (in)equalities that relate variables and integer values. However, the specialized run-time check of this property was incorrect: besides checking that the constraints were valid, it also checked that their arguments were variables, when it should instead have checked that they were numerical values.
4.3.5 Integration testing of the analyzer with libraries and third-party tools (Class V)
We have also used the approach to conduct integration tests of the Regular Types Library (an independent Ciao bundle) and check its correct use within the system. This library implements fundamental operations and procedures for regular types, such as type inclusion, equivalence, union, intersection, widening, simplification, etc., as well as storing and manipulating regular types. These operations are used in CiaoPP domains, analyses, and program transformations.
Example 4.10 (Bug #14: Regular types library type equivalence/simplification bug). The eterms domain includes the inferred regular types in the analysis output. In this process, eterms performs type equivalence and simplification operations in order to present the results to users in the most readable form possible. The following (simplified) fragment is from one of the test programs:

Here repeat_elem_list/1 is (incorrectly!) declared as a regular type. Checkification detected that, in the analysis output, eterms erroneously inferred that diff_elem_list/1 produces a repeat_elem_list/1 upon success:

This inference is obviously incorrect. It allowed us to notice that, after some changes, the Regular Types Library was no longer checking whether properties declared to be regular types actually were regular types, which in turn affected the inference of types.
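Although the original fragment is not reproduced above, its essence can be conveyed with hypothetical definitions (only the predicate names come from the test program):

    % Not a regular type: the repeated variable X imposes an equality
    % constraint between list elements that regular types cannot express,
    % yet the declaration below was being accepted without complaint.
    :- regtype repeat_elem_list/1.
    repeat_elem_list([X,X|_]).

    % diff_elem_list/1 is meant to produce lists of pairwise distinct
    % elements, so inferring repeat_elem_list/1 as its success type is
    % clearly wrong.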
In addition to detecting problems in the integration with libraries, the checkification approach also detected errors in the integration with external or third-party solvers that can be used by the analyzer. For example, the polyhedra domain uses the Parma Polyhedra Library (PPL) as back-end solver for handling numeric approximations. Using this domain we can detect errors stemming from the Ciao-PPL integration.
Example 4.11 (Bug #15: Parma Polyhedra Library). We found that polyhedra did not properly handle some non-numerical parts of programs. When analyzing with this abstract domain, a term like X = Y is translated into the constraint $1\times \texttt{X} - 1\times \texttt{Y} = 0$. If X and Y are numerical variables, they satisfy the equality dynamically. However, we discovered that the analysis did not take into account the types of X and Y: regardless of whether they were numerical or non-numerical, the analysis treated them as numerical values.
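A sketch of the kind of code affected (the predicate is ours, and the rendering of the inferred constraint is approximate):

    % X and Y are unified but end up bound to non-numerical terms:
    p(X, Y) :- X = Y, X = [1,2,3].

    % The polyhedra domain still abstracted X = Y as the linear constraint
    % 1*X - 1*Y = 0, i.e., as if both variables were numerical, so the
    % corresponding checkified constraint/1 check failed at run time on
    % list-valued arguments.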
4.3.6 Debugging trust assertions and custom transfer functions
Finally, the approach has also been useful in finding errors in other aspects of the framework. As an example, a feature of CiaoPP is that its analyses can be guided by the user, by providing the analyzer with information that can be assumed to hold at points where the analysis would otherwise lose precision. We have already mentioned one of these mechanisms, entry assertions, which allow providing information on the entry points to the module being analyzed (i.e., on the calls to the predicates exported by the module). Entry assertions are a special case of trust assertions. In addition to guiding the analyzer, trust assertions can be used to define custom abstract transfer functions, similar to those that need to be implemented for abstracting each built-in within each domain; trust assertions allow the user to do this for any predicate. Checkification can be used to detect errors in these assertions.
Example 4.12 (Defining language semantics). Let us consider this (again, distilled) part of the specification of the exponentiation predicate, where we use the trust assertion in line 1 to state that an integer to the power of an integer yields an integer:

However, in the ISO-Prolog standard, the outcome of this operation is always a floating-point number. When analyzing this code, CiaoPP trusts the assertion previously defined and infers that the result of the operation in the exp_list/3 predicate is an integer. In the checkified version:

when we execute a test case with two integers, an error is flagged due to this discrepancy, since the implementation correctly produces a float. This is a practical application of the proposed algorithm, since even a completely sound analyzer can produce unsound results if it assumes some assertion to be true when it is not, and thus there will always be a need to test such properties.
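A sketch along the lines of the distilled code discussed above (predicate names and the precise placement of the inserted check are assumptions on our part):

    % Trust assertion: an integer to the power of an integer yields an
    % integer (which contradicts ISO Prolog, where (**)/2 evaluates to a
    % float).
    :- trust pred exp(X, Y, Z) : (int(X), int(Y)) => int(Z).
    exp(X, Y, Z) :- Z is X ** Y.

    exp_list([], _N, []).
    exp_list([X|Xs], N, [E|Es]) :- exp(X, N, E), exp_list(Xs, N, Es).

    % Trusting the assertion, the analyzer infers int(E) after the call to
    % exp/3, so the checkified version contains check(int(E)) there; running
    % a test case with two integers then flags the discrepancy, since the
    % implementation correctly produces a float.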
In addition to being able to detect errors in trust assertions, the approach can also be used to find errors in the mechanisms used internally by the analyzer to apply trust assertions, if we assume instead the information provided in the trust assertion to be correct.
5 Other related work
In this section we mention other related work, in addition to the references interspersed throughout the previous sections. It is now widely recognized that the reliability of program analyzers has become crucial as they have become increasingly practical and widely adopted (Cadar and Donaldson, Reference Cadar and Donaldson2016), and this has led to significant research interest in recent years.
On the formal verification side, there have been some pen-and-paper proofs, such as that of the Astrée analyzer (Cousot et al. Reference Cousot, Cousot, Feret, Mauborgne, Miné, Monniaux and Rival2005), as well as some automatic and interactive proofs, such as those of Dubois (Reference Dubois, Aagaard and Harrison2000) and Shao et al. (Reference Shao, Saha, Trifonov and Papaspyrou2002). As mentioned in the introduction, a recent relevant effort has been aimed at verifying the partial correctness of the CiaoPP analysis algorithm (also referred to as the “top-down solver”), using the Isabelle prover (Paulson, Reference Paulson and Odifreddi1990; Stade et al. Reference Stade, Tilscher and Seidl2024), but mechanical verification of the actual implementations remains a challenge. A related approach is that of Blazy et al. (Reference Blazy, Laporte, Maroneze, Pichardie, Logozzo and Fähndrich2013) and Jourdan et al. (Reference Jourdan, Laporte, Blazy, Leroy and Pichardie2015), who have developed and proved the soundness of a static analyzer, extracting several abstract domains and a fixpoint verifier directly from Coq formalizations.
Significant work has also been done in the context of testing static analyzers. Midtgaard and Møller (Reference Midtgaard and Møller2017) and Bugariu et al. (Reference Bugariu, Wüstholz, Christakis and Müller2018) use mathematical properties of abstract domains as test oracles. Specifically, Bugariu et al. (Reference Bugariu, Wüstholz, Christakis and Müller2018) employ such properties to validate the soundness and precision of numerical abstract domains, while Midtgaard and Møller (Reference Midtgaard and Møller2017) apply that approach to checking type analysis using QuickCheck (Claessen and Hughes Reference Claessen and Hughes2000).
The closest works to ours are those that cross-check dynamically observed and statically inferred properties (Cuoq et al. Reference Cuoq, Monate, Pacalet, Prevosto, Regehr, Yakobowski, Yang, Goodloe and Person2012; Wu et al. Reference Wu, Hu, Tang and Yang2013; Andreasen et al. Reference Andreasen, Møller and Nielsen2017; Zhang et al. Reference Zhang, Su, Yan, Zhang, Pu and Su2019). In Wu et al. (Reference Wu, Hu, Tang and Yang2013) the pointer aliasing observed in concrete executions is cross-checked against the pointer aliasing inferred by an aliasing analyzer. They are also able to deal with multi-variance and path-sensitivity. Compared to our work, they require significant tailored instrumentation which cannot be reused for testing other analyses; however, their approach is agnostic to the (C) aliasing analyzer. Another cross-check is performed in Zhang et al. (Reference Zhang, Su, Yan, Zhang, Pu and Su2019) for C model checkers and the reachability property, but they obtain the assertions dynamically and check them statically, which is complementary to our approach. Unlike us, they again need tailored instrumentation that cannot be reused to test other analyses, and their benchmarks must be deterministic and take no input, the latter limiting the power of the approach as a testing tool; however, their approach is agnostic to the (C) model checker. In Cuoq et al. (Reference Cuoq, Monate, Pacalet, Prevosto, Regehr, Yakobowski, Yang, Goodloe and Person2012) a wide range of static analysis tests are performed over randomly generated programs. Among others, they check dynamically, at the end of the program, one assertion inferred statically, and they perform a sanity check to ensure that the analyzer behaves as an interpreter when run from a singleton set of initial states. Our approach differs from these previous works in that it identifies issues throughout the entire static analyzer framework, rather than focusing on specific components. In this setting, Klinger et al. (Reference Klinger, Christakis and Wüstholz2019) propose an automatic technique, based on differential testing, to evaluate the soundness of program analyzers. From seed programs, they generate program analysis benchmarks and compare the overall frameworks of software model checkers. Differential analysis presents a significant challenge, since it requires multiple analyzers with identical input/output behavior; if one or more analyzers behave differently, it is difficult to determine which is correct. He et al. (Reference He, Di, Ming, Zhang, Su, Li and Sui2024) apply two oracles to compare static analyzers. The first oracle is constructed using dynamic program executions: a random program generated by the generation-based C fuzzer Csmith (Yang et al. Reference Yang, Chen, Eide and Regehr2011) serves as test input, and its dynamic execution result (e.g., a run-time error on some program path) is used as a test oracle to validate the static analyzer. The second oracle is a static oracle based on metamorphic relations: a conditional expression of some statement in the program is selected as a target and an equivalent boolean expression is generated; the static analyzer should produce identical truth values for these equivalent expressions, and any discrepancy indicates a potential defect in the analyzer.
There is work on testing program analyzers in other domains. For instance, Brummayer and Biere (Reference Brummayer and Biere2009) introduce a grammar-based fuzzer to identify crashes in SMT solvers, using delta debugging to minimize the generated instances. Kapus and Cadar (Reference Kapus and Cadar2017) combine random program generation with differential testing to evaluate symbolic execution engines. Daniel et al. (Reference Daniel, Dig, Garcia and Marinov2007) implement Java programs as test oracles over abstract syntax trees for testing refactoring engines. There is also considerable work on testing compilers. Le et al. (2014); Le et al. (Reference Le, Sun and Su2015); Sun et al. (Reference Sun, Le and Su2016) apply Equivalence Modulo Inputs testing, which mutates unexecuted statements of an existing program under certain inputs to produce new equivalent test programs. Regehr et al. (Reference Regehr, Chen, Cuoq, Eide, Ellison and Yang2012) propose a test-case reduction technique to construct minimal inputs that trigger compiler bugs. Finally, building on Csmith (Yang et al. Reference Yang, Chen, Eide and Regehr2011), Lidbury et al. (Reference Lidbury, Lascu, Chong and Donaldson2015) introduce CLsmith, which provides six modes for generating test programs.
We argue that, in comparison with these related proposals, our approach provides a solution that is simpler, more elegant, more general, and easier to implement, especially when the different system components are integrated as in the Ciao assertion model.
6 Conclusions
We have proposed and studied checkification, an automatic method for testing static analysis tools by checking that the properties inferred statically are satisfied dynamically. A fundamental strength of our approach lies in its simplicity, which stems from framing it within the Ciao assertion model and using its assertion language. We have shown how checkification can be implemented with comparatively little effort by combining the static analyzer, the run-time checking framework, the random test case generator, and the unit-test framework, together with a reduced amount of glue code. This glue code implements the proposed algorithm and orchestrates the combination and interplay of the intervening components to realize the overall approach.
Following this approach, we have constructed a quite complete implementation of the checkification method and applied it to testing a large number of the abstract interpretation-based analyses in CiaoPP, representing different levels of code maturity, as well as to the framework itself, the interaction with libraries, the interaction with third-party code, etc. In the study we analyzed both standard benchmarks and real-world tools. We applied our technique not only to state properties such as variable sharing/aliasing, modes, linearity, numerical properties, types, and term structure, but also to computational properties such as determinacy and (non)failure. As we discussed, the technique should also be applicable to other, more complex, computational properties such as cost and even termination, although they obviously bring in specific challenges and theoretical limits for run-time checking.
The experimental results show that our tool can effectively discover and locate interesting, non-trivial and previously undetected bugs, with reasonable overhead, not only in the less-developed parts of the system but also in corner cases of the more mature components, such as the handling of built-ins, run-time checking instrumentation, etc. The approach has also proven useful for detecting issues in auxiliary stages of analysis and verification, including assertion simplification, pretty printing, abstract program optimizations and transformations, etc. We have also observed that it is generally not too hard to locate the source of the errors from the information produced by the tests, and as a result the vast majority of the detected issues were in fact fixed during the experimental study. While testing approaches are obviously ultimately insufficient for proving the correctness of analyzers, and thus it is clearly worthwhile to also pursue the avenue of code verification, we believe that our results show that the checkification approach can be a practical and effective technique for detecting errors in large and complex analysis tools.
Acknowledgements
Partially funded by MICINN projects PID2019-108528RB-C21 ProCode, TED2021-132464B-I00 PRODIGY, and FJC2021-047102-I, and by the Tezos foundation. We would also like to thank the anonymous reviewers for their very useful feedback. This paper is an extended version of our previous work published in LOPSTR’20.