The caller (e.g. macaw-refinement) can provide an additional Rewriter
operation that can operate on TermStmts for blocks (typically those
for which a previous Discovery was unable to determine a transfer
target). There is an additional
entrypoint ('addDiscoveredFunctionBlockTargets') that will allow this
additional rewriter to be supplied for updating an existing
DiscoveryState.
Some of the TermStmt and other elements might generate new blocks as
part of the Rewrite operation (e.g. adding a new 'Branch' TermStmt) so
this change allows the rewrite context to update the generated block
labels and collect these newly generated blocks for inclusing in the
results passed to the parser.
Before, we just discarded them during the translation. They are useful metadata
for generating diagnostics in Crucible, so this commit translates them. They
are no-ops during symbolic evaluation.
To make them truly useful, they need to include the address of the block that
they belong to (their data payload in macaw is just an offset from the start of
a block). This information wasn't available before, so it has to be plumbed
through in macaw-x86.
The use of `Data.Parameterized.Map.fromList` in `mkRegStateM` was
showing up in profiling as a huge time sink. We don't actually need to
build the map from scratch there, though, since the keys are known ahead
of time. Adding an `archRegSet` variable to the `RegisterInfo` class
(with the obvious default implementation) ensures that a `MapF` with the
right keys will be built once and then reused.
This mostly affects x86. Previously, we threw away the write of the return
address to the stack when identifying calls for macaw-x86. This was partly for
hygiene and partly to support the "addresses written to memory are function
pointers" heuristic. Treating the return address as a potential function
pointer breaks function identification, so that is important.
The problem comes in the translation of macaw into crucible - we never write the
return address to the stack, but returns still read the return address from the
stack. If it wasn't written in the first place, this leads to a read
from (potentially) uninitialized memory, which causes errors in the symbolic
simulator. There are two solutions:
1. Make returns not read from the stack
2. Keep the write of the return address to the stack
Solution 1 is a problem, as we have a data dependency on the read. Eliding it
breaks Crucible generation later and produces an invalid CFG.
Solution 2 works well. The implementation is actually simple. We can keep
identifyCall the same for x86 and just construct the basic block not from the
return value but from the original list of statements (unaltered). We do need
to have identifyCall still give us the reduced statement list, which we use for
identifying possible function pointers written onto the stack (but not the
return address, which we do not want to treat as a function pointer).
This update renames many of the declarations exported by
Data.Macaw.Memory so that we have more consistent names.
The majority of the existing names are now exported with DEPRECATION
warnings. Some of the symbol declarations that were not used by the
Memory datatype have been moved to other modules.
The minor version of macaw-base has been incremented.
This should cut down on the number of proxies/explicit type arguments
needed when dealing with these types.
Awkwardly, ArchTermStmt isn't injective, because PPC32 and PPC64 happen
to use exactly the same type. We could add an argument to that type and
then all the families could be injective.
The pretty-printer for Stmts takes a pretty-printer function as an
argument. This used when a Stmt stores an offset from the beginning of a
block can, but we don't have information about that block internally in
the Stmt.
An ArchState Stmt stores an ArchMemAddr, which is independent of the
block it's in. Previously we were treating the ArchMemAddr as an offset
and passing it to the pretty-printer function for offsets; in practice
this means most of them were printed as values about twice as big as
they were supposed to be.
This patch adds initial support for relocations in Macaw code
discovery, and adds other refactoring.
* It introduces a SymbolValue constructor to represent references to
symbols within Macaw.
* The various cases for x86 mov are made explicit after the flexdis refactor
broke the previous code. We should now support segment register movs and
give better error messages when seeing mov with control or debug registers.
* The generic exception operation is replaced with Hlt and UD2 terminal
x86-specific statements.
* CodeAddrReason is split into FunctionExploreReason and BlockExploreReason to
clarify whether a function or block was discovered.
* The Macaw pretty printer is changed to use write_mem in place of pointer syntax.
* Various other refactoring is made to clarify code.
ArchMemAddr is easier to use than ArchAddrWord in downstream clients, and is
probably more faithful in the case where we want to support shared libraries
and/or object files.
The new statement is called `ArchState`, and has two fields: an address and a
map. The address is the address of the instruction it is standing in for. The
map contains a mapping from the *machine registers* that the instruction updated
to the *macaw values* that were assigned to those locations.
This is useful metadata for debugging, but is also required to do some types of
architecture-independent analysis (where we can still reason about machine
register contents).
Instead of inline analysis of whether the instruction pointer has been
updated to contain the ReturnAddr symbolic value, defer the
determination of the call return to the (previously defined but
unused) architecture-specific handling. This allows architectures
like ARM that perform modifications on the values loaded to the
instruction pointer (e.g. clearing lower bits) to provide their own
recognition of a return operation.
Also modifies the signature of identifyReturn to return a Sequence of
statements to match the identifyCall type signature.
Replaces the previously unused identifyX86Return with the inline
detection of IP == ReturnAddr.
This reverts commit 6a0a59e298.
Do not want to add BVUrem because divide by zero can cause an
exception and this will greatly increase the complexity for analysis.