The llvm memory model was extended with better diagnostics and configurable
handling of undefined behavior. macaw-symbolic uses no undefined behavior
checking, as those operations are only undefined in C.
Before, we just discarded them during the translation. They are useful metadata
for generating diagnostics in Crucible, so this commit translates them. They
are no-ops during symbolic evaluation.
To make them truly useful, they need to include the address of the block that
they belong to (their data payload in macaw is just an offset from the start of
a block). This information wasn't available before, so it has to be plumbed
through in macaw-x86.
This cleanup consolidates the interface to macaw symbolic into two (and a half)
modules:
- Data.Macaw.Symbolic for clients who just need to symbolically simulate
machine code
- Data.Macaw.Symbolic.Backend for clients that need to implement new
architectures
- Data.Macaw.Symbolic.Memory provides a reusable example implementation of
machine pointer to LLVM memory model pointer mapping
Most functions are now documented and are grouped by use case. There are two
worked (compiling) examples in the haddocks that show how to translate Macaw
into Crucible and then symbolically simulate the results (including setting up
all aspects of Crucible). The examples are included in the symbolic/examples
directory and can be loaded with GHCi to type check them.
The Data.Macaw.Symbolic.Memory module still needs a worked example.
There were very few changes to actual code as part of this overhaul, but there
are a few places where complicated functions were hidden behind newtypes, as
users never need to construct the values themselves (e.g., MacawArchEvalFn and
MacawSymbolicArchFunctions). There was also a slight consolidation of
constraint synonyms to reduce duplication. All callers will have to be updated.
There is also now a README for macaw-symbolic that explains its purpose and
includes pointers to the new haddocks.
This commit also fixes up the (minor) breakage in the macaw-x86-symbolic
implementation from the API changes.
The refinement library provides supplemental functionality for
discovery of elements that macaw-symbolic is not able to discover via
pattern matching. This library will use crucible symbolic analysis to
attempt to determine elements that could not be identified by
macaw-symbolic. The identification provided by macaw-symbolic is
incomplete, and so is the identification by this macaw-refinement, but
macaw-refinement attempts to additionally "refine" the analysis to
achieve even more information which can then be provided back to the
macaw analysis.
* Terminator effects for incomplete blocks. For example, the target
IP address by symbolic evaluation (e.g. of jump tables). If the
current block does not provide sufficient information to
symbolically identify the target, previous blocks can be added to
the analysis (back to the entry block or a loop point).
* Argument liveness (determining which registers and memory
locations are used/live by a block allows determination of ABI
compliance (for transformations) and specific block
requirements (which currently start with a full register state and
blank memory).
* Call graphs. Determination of targets of call instructions that
cannot be identified by pattern matching via symbolic evaluation,
using techniques similar to those for identifying incomplete blocks.
The use of `Data.Parameterized.Map.fromList` in `mkRegStateM` was
showing up in profiling as a huge time sink. We don't actually need to
build the map from scratch there, though, since the keys are known ahead
of time. Adding an `archRegSet` variable to the `RegisterInfo` class
(with the obvious default implementation) ensures that a `MapF` with the
right keys will be built once and then reused.
This mostly affects x86. Previously, we threw away the write of the return
address to the stack when identifying calls for macaw-x86. This was partly for
hygiene and partly to support the "addresses written to memory are function
pointers" heuristic. Treating the return address as a potential function
pointer breaks function identification, so that is important.
The problem comes in the translation of macaw into crucible - we never write the
return address to the stack, but returns still read the return address from the
stack. If it wasn't written in the first place, this leads to a read
from (potentially) uninitialized memory, which causes errors in the symbolic
simulator. There are two solutions:
1. Make returns not read from the stack
2. Keep the write of the return address to the stack
Solution 1 is a problem, as we have a data dependency on the read. Eliding it
breaks Crucible generation later and produces an invalid CFG.
Solution 2 works well. The implementation is actually simple. We can keep
identifyCall the same for x86 and just construct the basic block not from the
return value but from the original list of statements (unaltered). We do need
to have identifyCall still give us the reduced statement list, which we use for
identifying possible function pointers written onto the stack (but not the
return address, which we do not want to treat as a function pointer).