Synthesizing Memory Models from Framework Sketches and Litmus Tests

James Bornholt and Emina Torlak
PLDI 2017

This document contains the artifact instructions for our paper, Synthesizing Memory Models from Framework Sketches and Litmus Tests, which appeared at PLDI 2017.

Artifact

Artifact overview

We have provided a VirtualBox image containing the implementation of MemSynth, which is based on Racket and Rosette. The image contains implementations for the paper’s experimental evaluation (Section 6).

After installing VirtualBox and the VirtualBox extension pack, import our VirtualBox image (File > Import Appliance).

When booted, the virtual machine will log in automatically, open a terminal to the ~/memsynth directory containing MemSynth, and open a copy of this instruction page. Should you need credentials, the virtual machine user is memsynth with password memsynth.

Directory layout

The ~/memsynth directory contains the source code for MemSynth and the case studies described below. It is laid out as follows:

Running via Docker

Some of our experiments (notably uniqueness experiments in the synthesis section) take several hours, and so may not be convenient to run locally in a virtual machine. We provide smaller versions of these experiments where possible. But we also offer a Docker image, identical to the virtual machine, that can be used to run these experiments on bare metal hardware or remotely (e.g., on EC2).

After installing Docker on a system in the usual way, pull our image from the Docker hub:

$ sudo docker pull memsynth/memsynth

and then run it:

$ sudo docker run -ti --name memsynth memsynth/memsynth bash

This command will drop you into a bash shell in the Docker container. The container’s file system is identical to the virtual machine’s (i.e., the code lives in ~/memsynth), so the instructions below work on either platform. You can detach from the container without stopping its work (press Ctrl-p, then Ctrl-q) and later re-attach with docker attach memsynth.

Case studies

The artifact addresses all three research questions in the paper’s main evaluation (Section 6):

Model synthesis

Section 6.1 of the paper demonstrates that MemSynth scales to real-world memory models by using it to synthesize specifications for PowerPC and x86. The results also show that MemSynth can identify ambiguities and redundancies in litmus tests and documentation for these architectures.

PowerPC

Section 6.1.1 applies MemSynth to the PowerPC architecture. We show that we can synthesize a model for PowerPC in 16 seconds, and automatically identify several ambiguities in that model.

To begin, enter the directory for this case study:

$ cd ~/memsynth/case-studies/synthesis/ppc
$ ls
compiled  ppc0.rkt  ppc0-unique.rkt  ppc1.rkt  sketch.rkt

The sketch.rkt file contains the sketch we use to synthesize PowerPC models. Most notably, the sketch includes the following definition for ab (note that ab is called fences in the paper for readability):

(define ab (+ (^ ab-sync) (^ ab-lwsync)))

As the paper notes, this definition captures the insight that Fence and LWFence (i.e., sync and lwsync) do not interact.


Claim: MemSynth synthesizes a model, which we call PPC0, that agrees with the hand-written model by Alglave et al. on all 768 tests. The synthesis takes 16 seconds, and … uses only 18 of the 768 tests

To reproduce:

$ racket ppc0.rkt

This program will first print some information about the litmus tests and size of the search space, and then print a message of the form

Synthesis complete!
time: 20177 ms
tests used: 18/768

followed by the synthesized model. This result says synthesis completed in 20 seconds, and only 18 of the 768 litmus tests were actually used by the incremental synthesis algorithm.

The program also verifies that the model agrees with Alglave et al.’s model on all 768 tests:

Verifying solution...
Verified 768 litmus tests
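The “tests used: 18/768” figure reflects the incremental synthesis loop: MemSynth synthesizes against a small working set of tests, verifies the candidate against the full suite, and adds any failing test to the working set. A toy Python sketch of this loop, under the simplifying assumption that candidate models are plain predicates rather than MemSynth’s solver-searched space:

```python
# Toy sketch of the incremental synthesis loop behind "tests used: 18/768".
# Candidate models are plain predicates here, purely for illustration.
def incremental_synthesize(candidates, all_tests, oracle):
    working = []                       # tests actually used for synthesis
    while True:
        # "Synthesize": find a candidate consistent with the working set.
        model = next((m for m in candidates
                      if all(m(t) == oracle(t) for t in working)), None)
        if model is None:
            return None, working       # sketch cannot match the oracle
        # "Verify": check the candidate against every test.
        failing = next((t for t in all_tests
                        if model(t) != oracle(t)), None)
        if failing is None:
            return model, working      # verified on the full suite
        working.append(failing)        # add counterexample and retry
```

The working set typically stays far smaller than the full suite, which is why only 18 of the 768 tests end up being used.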

Claim: We apply MemSynth’s uniqueness query to enlarge the set [of 768 tests] until it identifies a single model … MemSynth finds 10 new tests to add to the set

To reproduce (takes ~6 hours; see below for a smaller version):

$ racket ppc0-unique.rkt

This program first synthesizes PPC0 again, then outputs some information about the symbolic litmus test being used for uniqueness. It will then begin identifying new tests, which it outputs in the form:

New distinguishing test [#1]:
Test disambig0
=============================
P0            | P1           
-----------------------------
r1 <- [B]     | r2 <- [A]    
sync          | r3 <- r2 ^ r2
[A] <- 1      | [B+r3] <- 1  
=============================
r1=1 /\ r2=1
Allowed by oracle? #t

This output says a new test was found with two threads, and that the test is allowed by the oracle (Alglave et al.) model. The unusual r3 <- r2 ^ r2 instruction (an XOR that always evaluates to 0) introduces a data dependency from the read of A to the store to B.
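To see why this idiom works, here is a minimal Python check, with plain integers standing in for register values:

```python
# The artificial-dependency idiom from the test above: r3 <- r2 ^ r2.
# XOR of any value with itself is 0, so the store address B + r3 is
# always just B -- yet because r3 is computed from r2, the store to B
# cannot issue until the load of A (which produces r2) completes.
for r2 in [0, 1, 7, 0xFFFF]:
    r3 = r2 ^ r2
    assert r3 == 0   # constant value, but still data-dependent on r2
```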

Due to changes since paper submission, and solver non-determinism, this experiment finds 9 tests rather than the 10 stated in the paper.

Note that the program does not produce any output between discovering distinguishing tests (and there can be over an hour between each such test). Pass the -v argument to the program to see more frequent verbose output, tracking the progress of the metasketch solving (as described in the “Concretization” paragraph of Section 5.3).

Smaller version: run

$ racket ppc0-unique.rkt -s

to execute a version of the experiment with smaller bounds on the symbolic litmus test. This version will complete in 10 minutes, but finds only 2 distinguishing litmus tests rather than the 9 found by the full experiment.


Claim: MemSynth is able to use the results of [hardware] experiments as an oracle, and synthesizes a new model PPC1 in 22 seconds; the resulting model is not equivalent to PPC0.

To reproduce:

$ racket ppc1.rkt

This program will first synthesize PPC0 (with the same output as for ppc0.rkt above), and then synthesize PPC1. The synthesis for PPC1 uses fewer tests than for PPC0, because some litmus tests are inconclusive when run on the original hardware. The synthesis for PPC1 again prints a message of the form:

Synthesis complete!
time: 16383 ms
tests used: 15/588

Then, to prove that PPC0 and PPC1 are not equivalent, the program synthesizes a new test that distinguishes the two:

Test equiv0
=============================
P0            | P1           
-----------------------------
r1 <- [B]     | r2 <- [A]    
sync          | r3 <- r2 ^ r2
[A] <- 1      | [B+r3] <- 1  
=============================
r1=1 /\ r2=1
Allowed by PPC_0: #f
Allowed by PPC_1: #t

This test is allowed by PPC1, but not by PPC0, proving the two models are not equivalent.


Experimenting: we encourage experimentation with the PowerPC synthesis case study. The file ppc0.rkt defines the set of tests used, drawn from the set of all PowerPC tests in ~/memsynth/litmus/tests/ppc-all.rkt. This collection can be modified to, for example, exclude all tests containing certain instructions. See ~/memsynth/litmus/tests/ppc.rkt for examples.

The sketch in sketch.rkt can also be modified. For example, the numeric arguments to make-ppo-sketch etc. control the depth of expression sketch used. The ab sketch could also be modified to not contain the domain knowledge about the two barriers being independent (though this would require adding ^ to the list of non-terminals in the grammar).

x86

Section 6.1.2 applies MemSynth to the x86 architecture. We show that we can synthesize a model for x86 in 2 seconds using litmus tests from the Intel manual, and automatically identify several missing tests from that documentation.

To begin, enter the directory for this case study:

$ cd ~/memsynth/case-studies/synthesis/x86
$ ls
compiled  redundancy.rkt  sketch.rkt  tso0.rkt  tso0-unique.rkt

The sketch.rkt file contains the sketch we use to synthesize x86 models. Unlike for PowerPC, we do not model fences for x86, and so the sketch is much simpler.


Claim: MemSynth synthesizes a formalization TSO0 that is correct on the Intel manual’s 10 litmus tests in under two seconds.

To reproduce:

$ racket tso0.rkt

As with the PowerPC experiments, this program prints a message of the form

Synthesis complete!
time: 2363 ms
tests used: 6/10

followed by the synthesized model, and verifies that the model agrees with the documentation:

Verifying solution...
Verified 10 litmus tests

Claim: MemSynth’s uniqueness query determines that at least one other memory model also satisfies all 10 tests, while disagreeing with TSO0 on a new distinguishing test … repeating the uniqueness query after adding this new test finds 2 more distinguishing tests.

To reproduce (takes ~4 hours; see below for a smaller version):

$ racket tso0-unique.rkt

As with the PowerPC uniqueness experiment above, this program first synthesizes TSO0 again, and then begins identifying new tests, which it outputs in the form:

New distinguishing test [#1]:
Test disambig0
=============================
P0            | P1           
-----------------------------
r1 <- [B]     | LOCK [A] <- 1
r2 <- [A]     | LOCK [B] <- 1
=============================
r1=1 /\ r2=0
Allowed by oracle? #f

The search finds a total of 3 such tests.

As with the PowerPC uniqueness experiment, no output is produced between distinguishing tests. Pass the -v argument to the program to see more frequent verbose output.

Smaller version: run

$ racket tso0-unique.rkt -s

to execute a version of the experiment with smaller bounds on the symbolic litmus test. This version will complete in 4 minutes, and finds the same 3 distinguishing tests as the full version, but the uniqueness claim is less strong because fewer candidate tests are considered.


Claim: Sarkar et al. [32] write that “P8 may be redundant” … we found that if we omit [the] two tests [associated with P8] … the ambiguity experiment rediscovers them, suggesting they are needed to uniquely identify x86’s memory model.

To reproduce (takes ~4 hours; see below for a smaller version):

$ racket redundancy.rkt

This program first synthesizes an x86 model without the tests in question, and then searches for new distinguishing tests. While the previous x86 uniqueness experiment finds 3 such tests, this experiment finds 5. The two additional tests are exactly the two that were omitted.

As with the other TSO uniqueness experiment, no output is produced between distinguishing tests. Pass the -v argument to the program to see more frequent verbose output.

Smaller version: run

$ racket redundancy.rkt -s

to execute a version of the experiment with smaller bounds on the symbolic litmus test. This version will complete in 4 minutes, and finds the same 5 distinguishing tests as the full version, but the uniqueness claim is less strong because fewer candidate tests are considered.

Framework repair

Section 6.2 of the paper uses MemSynth to identify and repair a bug in a paper by Mador-Haim et al.

To begin, enter the directory for this case study:

$ cd ~/memsynth/case-studies/repair
$ ls
allowed.rkt  axioms.rkt  compiled  enumerate.rkt  repair.rkt

The file axioms.rkt contains an implementation of the Mador-Haim memory model framework, but with a hole (called X) standing in for the precondition of the suspect “Ignore local” axiom.

Running repair.rkt first tries several fixed completions of this hole, and then synthesizes a completion for the hole that makes the model correct.
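As a rough illustration of what completing such a hole involves (this is not MemSynth’s encoding; the relation names po and rf and the connectives here are illustrative stand-ins), a toy enumerator over depth-1 completions might look like:

```python
from itertools import product

# Toy expression-sketch completion: enumerate small combinations of base
# relations under a few connectives, and return the first completion that
# matches the desired outcome on every example. Purely illustrative.
BASE = {
    "po": lambda env: env["po"],
    "rf": lambda env: env["rf"],
}
OPS = {
    "and":     lambda p, q: (lambda env: p(env) and q(env)),
    "or":      lambda p, q: (lambda env: p(env) or q(env)),
    "and-not": lambda p, q: (lambda env: p(env) and not q(env)),
}

def complete(examples):
    """examples: list of (environment, expected-boolean) pairs."""
    for (an, a), (bn, b) in product(BASE.items(), repeat=2):
        for op_name, op in OPS.items():
            expr = op(a, b)
            if all(expr(env) == want for env, want in examples):
                return f"({op_name} {an} {bn})"
    return None

# Examples encoding "holds on po-but-not-rf edges" recover "(and-not po rf)".
examples = [({"po": True,  "rf": True},  False),
            ({"po": True,  "rf": False}, True),
            ({"po": False, "rf": False}, False)]
```

MemSynth performs the analogous search symbolically with a solver rather than by enumeration, which is what makes depth-3 sketches over richer grammars tractable.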

To reproduce:

$ racket repair.rkt

This program will produce output, described below, that addresses all the claims in Section 6.2.2. The program itself is quite readable, and we encourage reviewing it to determine exactly how it addresses the claims.

Claim: The paper states there should be 82 distinct models, but our implementation found only 12 distinct models

The first output from repair.rkt tries using the version of the axiom written in the paper, and reports:

Original axiom, as stated in paper: 12 distinct models

Claim: Omitting this axiom from our implementation gave 86 distinct models, not 82 as expected

The next output tries omitting the axiom (by setting its precondition to false), and reports:

With axiom omitted: 86 distinct models

Claim: Our implementation reported this test [test mh/L2] to be disallowed by all 90 memory models.

The program reports:

Models that allow test/madorhaim/L2: 0
Models that forbid test/madorhaim/L2: 90

Claim: We replaced [the precondition] with an expression sketch of depth 3, and synthesized a completion that gave the correct outcomes on 9 litmus tests from the original paper on both TSO and RMO memory models. We were able to synthesize the … repair in 15 seconds

The actual expression sketch is on lines 45–46 of repair.rkt:

(define X (expression-sketch 3 2 (list + - & -> SameAddr ~)
                                 (list MemoryEvent Reads Writes po rf)))

The output from repair.rkt first reports the correct outcomes for the 9 litmus tests from the original paper, as determined using the Alglave et al framework. Then, it synthesizes a completion for X using those outcomes as the specification, and reports:

Found a repair: (& (~ (- po rf)) (~ (-> Writes Reads)))
Time: 3268 ms

This repair is equivalent to the one stated in the paper after simplification (lifting ~ outside &), and due to changes since submission, the performance is much improved.

Claim: With the repaired axiom, our pairwise comparison results produce 82 distinct models, identical to the original paper.

The final output from repair.rkt states:

Produces 82 distinct models.

Models that allow test/madorhaim/L2: 36
Models that forbid test/madorhaim/L2: 54

These results confirm 82 distinct models, and that some models now allow the test mh/L2 that was forbidden by every model when using the original axiom.

Performance

Section 6.3 characterizes the performance of the MemSynth engine compared to existing memory model and relational logic tools.

Verification

To begin, enter the directory for this experiment:

$ cd ~/memsynth/case-studies/performance/verification
$ ls
alloy  alloy.sh  herd  herd.sh  memsynth.sh

The verification experiment compares the performance of verifying whether the Alglave et al PowerPC model allows or forbids the 768 litmus tests used in the PowerPC synthesis experiments above.

The paper compares MemSynth to two different tools: Alloy (in the alloy directory) uses a PowerPC model exported from MemSynth; Herd (in the herd directory) is a custom-built memory model tool by Alglave and Maranget that already supports PowerPC.

Claim: The results show that MemSynth outperforms Alloy by 10×, and is comparable to herd’s custom decision procedure for memory models.

To run each tool on the 768 tests, execute its respective shell script:

$ ./memsynth.sh
Verifying with Memsynth...

real    0m2.769s
user    0m2.592s
sys 0m0.164s

$ ./herd.sh
Verifying with Herd...

real    0m2.538s
user    0m2.332s
sys 0m0.108s

$ ./alloy.sh
Verifying with Alloy...

real    0m49.119s
user    1m19.312s
sys 0m1.556s

These results show that MemSynth and Herd have very similar performance, while Alloy is at least 10× slower than both tools.

Equivalence

To begin, enter the directory for this experiment:

$ cd ~/memsynth/case-studies/performance/equivalence
$ ls
alloy  alloy.sh  memsynth  memsynth.sh  models  plot.py

The equivalence experiment compares the performance of synthesizing distinguishing litmus tests for each pair of 10 candidate PowerPC models (which were generated during synthesis of the PowerPC model above).

The paper compares MemSynth to Alloy*, a higher-order version of Alloy. The experiment tries to synthesize a distinguishing test for all 45 pairs of models using both tools, and compares the distribution of synthesis times for each tool.
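The pairwise comparison can be pictured with a toy analogue in Python (MemSynth searches the space of litmus tests symbolically; here the “models” are plain predicates and the candidates an explicit list, purely for illustration):

```python
# Toy analogue of the equivalence query: two models are equivalent iff no
# test distinguishes them. MemSynth performs this search with a solver;
# this sketch just scans an explicit candidate list.
def distinguishing_test(model_a, model_b, candidate_tests):
    for test in candidate_tests:
        if model_a(test) != model_b(test):
            return test    # witness: the models disagree on this test
    return None            # equivalent on every candidate considered

# Two toy "models" that agree on some tests but not all:
loose  = lambda t: t % 2 == 0
strict = lambda t: t % 4 == 0
witness = distinguishing_test(loose, strict, range(10))   # finds t = 2
```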

Claim: MemSynth outperforms Alloy* on most of these queries: MemSynth can solve twice as many queries in under one second, and the hardest problem takes 25 seconds for MemSynth versus 18 min for Alloy*.

Both experiments are run by shell scripts, which output the performance for each pair of candidate models to a CSV file in the current directory.

To run the experiment for MemSynth (takes ~3 min):

$ ./memsynth.sh

To run the experiment for Alloy* (takes ~15 min):

$ ./alloy.sh

Finally, to plot and analyze the results:

$ ./plot.py

This command outputs the maximum time taken to compare two models by both MemSynth and Alloy*, and opens a plot that corresponds to Figure 8(b) in the paper. Note that for tractability, we set a lower timeout for the Alloy* experiment (2 minutes here versus 1 hour in the paper), so the x-axis is shorter than in the paper, the maximum time is lower, and Alloy* does not successfully complete all 45 pairs.

Synthesis

To begin, enter the directory for this experiment:

$ cd ~/memsynth/case-studies/performance/synthesis
$ ls
alloy  alloy.sh  memsynth  memsynth.sh

The synthesis experiment compares the performance of synthesizing a memory model from a trivial sketch. The experiment uses three litmus tests from the Intel manual, and tries to synthesize a model from a sketch that simply consists of binary choices between two or three models.

The paper compares MemSynth to Alloy* on the synthesis task. For both tools, we first try to synthesize a model from a sketch that chooses between sequential consistency (SC) and total store order (TSO). Then, in a second experiment, we add partial store order (PSO) to the set of choices.
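A toy version of this choice-sketch synthesis can be sketched in Python (this assumes nothing about MemSynth’s encoding; the model names and the test names “sb” for store buffering and “mp” for message passing are illustrative stand-ins):

```python
# Toy synthesis from a choice sketch: return the first candidate model
# that agrees with the oracle's allowed/forbidden verdict on every test.
def synthesize(candidates, tests, oracle):
    for name, model in candidates:
        if all(model(t) == oracle(t) for t in tests):
            return name
    return None

# SC forbids every relaxed outcome; TSO allows the store-buffering one.
sc_model  = lambda t: False
tso_model = lambda t: t == "sb"
oracle    = lambda t: t == "sb"   # observed behavior: sb allowed, mp forbidden
choice = synthesize([("SC", sc_model), ("TSO", tso_model)],
                    ["sb", "mp"], oracle)   # picks "TSO"
```

The real difficulty is not the choice itself but encoding the models and tests so the solver can evaluate each candidate, which is where MemSynth and Alloy* diverge in performance.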

Claim: When given {SC, TSO}, both MemSynth and Alloy* return in under a second. However, when given {SC, TSO, RMO}, MemSynth still returns in under a second, but Alloy* times out after one hour.

To run the experiment for MemSynth:

$ ./memsynth.sh

The output shows both models can be synthesized extremely quickly (under 20 milliseconds).

To run the experiment for Alloy*:

$ ./alloy.sh

The output shows that the choice between SC and TSO can be made rapidly, but choosing between SC, TSO, and PSO times out after 5 minutes (we set the timeout lower than in the paper for tractability).

Note that the experiment uses PSO rather than the RMO mentioned in the paper, for simplicity of export to Alloy*; the results are the same.