This extends `Data.Macaw.Symbolic.Testing` in `macaw-symbolic` to be able to
handle binaries that depend on shared libraries. This is fully functional for
the x86-64 and AArch32 symbolic backends, and I have added test cases to the
respective repos demonstrating that it works. (The PowerPC backend is not yet
supported. At a minimum, this is blocked on GaloisInc/elf-edit#35.)
To implement this, I also needed to add some additional infrastructure to
`macaw-base` (I put this infrastructure here as it doesn't depend on any
Crucible-specific functionality):
* `Data.Macaw.Memory.ElfLoader.DynamicDependencies`: a basic ELF dynamic
loader that performs a breadth-first search over all `DT_NEEDED` entries
that an ELF binary depends on (both directly and indirectly).
* `Data.Macaw.Memory.ElfLoader.PLTStubs`: a collection of heuristics for
detecting the addresses of PLT stubs in a dynamically linked binary.
It is worth noting that shared libraries are rife with nuance and subtlety,
and the way `macaw` models shared libraries is not 100% accurate. I have
written a length `Note [Shared libraries]` in `Data.Macaw.Symbolic.Testing`
to describe where corners had to be cut.
Fixes#318.
It turns out that we have to be more conservative with tail call identification,
as incorrectly identifying a block as the target of a tail call (instead of a
branch) can cause other branch classifiers to fail if that block is the target
of another jump.
Ultimately, we will need to give up some tail call recognition (since they are
in general indistinguishable from jumps), and instead only identify known call
targets as tail call candidates.
With additional global analysis we could do better.
Fixes#294
GHC 9.2 adds `-Wincomplete-uni-patterns` to `-Wall`, which uncovers a slew of
previously unnoticed warnings in `macaw`. This patch fixes them, mostly by
adding explicit fall-through cases.
The core of macaw cannot represent conditional calls because the existing block terminators are not sufficiently expressive and it doesn't support creating synthetic blocks to represent control flow not directly tied to machine addresses.
To work around this, we introduce ARM-specific block terminators for conditional calls and plumb them through up to macaw-aarch32-symbolic.
Fixes#288
The tail call classifier came after the jump classifier, which was a problem because it is less strict than the tail call classifier, meaning it would always fire. This commit moves direct jump to be the last classifier applied, giving the others a chance.
Includes a test case in the ARM backend.
This requires some updates to some of the expected test results, as a few blocks are now classified as tail calls that were
plain jumps before. They really could be considered either. I think it would be nice if these could be classified as jumps instead, but the reason they are flagged as tail calls is mostly down to the fact that their surrounding context is so simple that either interpretation works.
Correcting this would require some heuristics based on additional analysis passes.
The test harness for macaw symbolic required a few changes because the new detection of some jumps as tail calls introduces new calls into the symbolic test suites. However, the symbolic testing harness did not support calls before. Adding support required a bit of plumbing, including a more extensive code discovery pass.
Fixes#285
When a user overrides a system call on an architecture that supports returning two values from a system call and they provide a context containing the result of the system call in the form
```
empty :> v0 :> v1
```
macaw will perform the register assignment
```
r0 := v1
r1 := v0
```
This change reverses this behavior so that the assignment becomes
```
r0 := v0
r1 := v1
```
This brings the expected ordering of the result context in agreement
with the left-to-right ordering of the argument context:
```
empty :> arg1 :> arg2 :> ...
```
The return address gets masked and has the low-bit set in an obtuse way due to
the semantics. This threw off the call detection.
This change matches against the quirky pattern.
* Add support for standalone PIEs
This changeset adds support for standalone position independent
executables (PIEs) that do not make use of procedure linkage tables. It
does so by adding relative address support to `populateSegmentChunk` and
adding an additional simplification rule for Aarch32.
This covers part of the work for #234.
* Remove NoOp + replace mult with left shift
This mostly deals with the splitting of the old `sym` type into
two: one for dealing with expression creation, and a new simulator
backend type for dealing with control-flow and assertions.
This contains a variety of fixes needed to make the packages in the `macaw`
repo compile with GHC 9.0:
* GHC 9.0 implements simplified subsumption (see
[here](https://gitlab.haskell.org/ghc/ghc/-/wikis/migration/9.0?version_id=5fcd0a50e0872efb3c38a32db140506da8310d87#simplified-subsumption)).
In most cases, adapting to this is a matter of manually eta expanding
definitions, such as in `base:Data.Macaw.Analysis.RegisterUse`. In the case
of `macaw-x86-symbolic:Data.Macaw.X86.Crucible`, the type signature of
`evalExt` had to be made more specific to adapt to the loss of contravariance
when typechecking `(->)`.
* GHC's constraint solver now solves constraints in each top-level group
sooner (see
[here](https://gitlab.haskell.org/ghc/ghc/-/wikis/migration/9.0?version_id=5fcd0a50e0872efb3c38a32db140506da8310d87#the-order-of-th-splices-is-more-important)).
This affects `macaw-aarch32` and `macaw-symbolic`, as they separate top-level
groups with `$(return [])` Template Haskell splices. The previous locations
of these splices made it so that the TH-generated instances in that package
were not available to any code before the splice, resulting in type errors
when compiled with GHC 9.0.
To overcome this, I rearranged the TH-generated instances so that they appear
before the top-level groups that make use of them.
* GHC 9.0 now enables `-Wstar-is-type` in `-Wall`, so this patch replaces some
uses of `*` with `Data.Kind.Type`. `Data.Kind` requires the use of GHC 8.0 or
later, so this patch also updates thes lower bounds on `base` to `>= 4.9` in
the appropriate `.cabal` files. (I'm fairly certain that this requirement was
already present implicity, but better to be explicit about it.)
* The `asl-translator`, `crucible`, and `semmc` submodules were updated to
allow them to build with GHC 9.0. The `llvm-pretty` and
`llvm-pretty-bc-parser` submodules were also bumped to accommodate unrelated
changes in `crucible` that were brought in.
* The upper version bounds on `doctest` in `macaw-symbolic`'s test suite were
raised to allow it to build with GHC 9.0.
Add support for abstract interpretation of division operations in the ARM
backend (when the operands to division are concretely known).
This commit also adds extended documentation on the semantics of these
operations.
The concrete evaluation eliminates constant division operations. The abstract
cases are probably obsolete in light of that, but are still interesting...
These changes were motivated by insufficient simplification around some of the syscall/errno handling code in musl
The old formulation (with system calls as block terminators) proved to be
impossible to implement properly. Handlers for syntax overrides have very
limited types (`IO`, rather than `OverrideSim`), which made symbolic branching
and reusing overrides impossible.
This change replaces the system call block terminator with an arch-specific
function that is translated into a function handle lookup (which is then
dispatched to with a call).
Unfortunately, this refactoring required combining the AArch32 simplification
module with the architecture extension definitions, due to the new translation
relying on the simplifier instance.
Adds support in macaw-aarch32 for conditional returns. These are not supported in core macaw, and are thus architecture-specific block terminators.
This required changes to the type of arch-specific block terminators. Before, `ArchTermStmt` was only parameterized by a state thread (`ids`). This meant that they could not contain macaw (or crucible) values. Some work on. AArch32 requires being able to store condition values in arch terminators (to support conditional returns). This change modifies the `ArchTermStmt` to enable this, which requires a bit of plumbing through various definitions and some extra instances.
In support of actually using this, it also became necessary to plumb fallthrough block labels through the architecture-specific terminator translation in macaw-symbolic.
Note that this change was overdue, as the PowerPC backend was storing macaw values in a way that would have rendered them unusable in the macaw-ppc-symbolic translation, had any interpretation been provided. These new changes will enable a handler to be written for the conditional PowerPC trap instructions.
PowerPC, x86, and ARM have been updated.
Improves the macaw-aarch32 tests. There is now a command line option to save the generated macaw IR for each
discovered function to /tmp. Note that this reuses some infrastructure from the macaw-symbolic tests. This
shared functionality should be extracted into a macaw-testing library.
This change makes the block classifier heuristic part of the `ArchitectureInfo`
structure. This enables clients and architecture backends to customize the
block classification heuristics. This is most useful for architectures that
have complex architecture-specific block terminators that require analysis to
generate (e.g., conditional returns). It will also make macaw-refinement
simpler in the future, as the SMT-based refinement is just an additional block
classifier (but is currently implemented via a hacky side channel).
This change introduces an ancillary change, which should not be very
user-visible.
It splits the Macaw.Discovery and Macaw.Discovery.State modules to break
module import cycles in a way that enables us to expose the classifier. This
should not be user-visible, as Macaw.Discovery still exports the same
names (with one minor exception that should not appear in user code).
It also moves the definition of the `ArchBlockPrecond` type family; the few
affected places should be updated. User code should probably not be able to see
this.
It was previously assuming that addresses are absolute, which is not true for
position independent executables. Extracting the offset from the address is
sufficient for our purposes here (note that taking the offset from the
`MemSegmentOffset` would not be right, as that offset is relative to the segment
start).
We originally left this case out to catch cases where the PC was written to
directly (skipping the special `LoadWritePC` logic in the semantics). At this
point we are confident that the ARM semantics handle that correctly. The
translation of the semantics into macaw (via TH) are not entirely lazy, and the
interpretation will call this function indirectly (via `get_gpr_uf`); returning
Nothing in the `r15` case caused that part of the translation to fail. The
resulting value is never actually used (because the ARM semantics have special
behavior when reading from `r15`), but the error was too eager and caused a
crash.
This change just lets that code continue.
Some important simplifications for classification were failing to fire because
other simplifications fired first, short circuiting the search. It turns out
that more than one rule may apply at any given step (and it is important to
apply all of the rules that can be applied). This commit modifies the
simplifier to apply rules until saturation.
aarch32: Support mixed ARM/Thumb1 binaries
This updates the aarch32 backend to decode Thumb instructions and generate the Thumb semantics. The major implementation change is to use the `ArchBlockPrecond` feature of macaw to track the Thumb state (`PSTATE_T`) across block boundaries.
The ARM code discovery decides whether or not a function entry point should be decoded as Thumb by examining the low bit of the function address. If the low bit is set, it is a Thumb entry point. This has the slightly odd effect of causing macaw to say that the function is at the address with the low bit set, which is not technically true. This is documented in the README, but not obvious on inspection. Most use cases should not care, and can in any case account for it. In the future, it should be possible to fix this (though it will require some changes to the core of macaw).
aarch32-symbolic: Implement most of the remaining macaw-aarch32-symbolic bits
It should be usable now, modulo some execution-time semantics for the floating
point operations. There will be a separate ticket covering the changes required
for them (some refactoring of how they are handled during translation is required).
* update to bv-sized branch of what4 and other things
* removed parameterized-utils submodule completely
* Updates submodules
* Fixes macaw-symbolic w.r.t. crucible-llvm changes
Co-authored-by: Ben Selfridge <ben@000548-benselfridge.local>
* Fix block size accounting in the disassembler
The value in the early failure combinator is used as the *block size* in the
resulting macaw block. The code was actually using the offset from the
beginning of the segment, which is wrong. This produced very large blocks that
didn't reflect the results of code discovery and led to decode errors later in
the pipeline.
* Do not throw an error if concreteIte has a symbolic argument
The `concreteIte` combinator turns formula conditionals with concrete operands
into Haskell-level conditional execution. It would fail because we believed
that there were no cases that could fail to satisfy that condition. That
assumption was not true - we need to fall back to generating a mux when we have
a symbolic condition.
Under GHC8.4, a let binding is independent of the surrounding context,
so the let statements encountered errors related to type matching on
synthesized internal type parameters that could not be identified as
the same due to rigid skolem type binding inside the let.
Improve the TH codegen for macaw-semmc
This change lazily translates as much as possible. It also generates somewhat more compact code. This change also finishes implementing primitives for the aarch32 backend. Complementing the aarch32 changes, the macaw-semmc interface has been modified to allow macaw-aarch32 to avoid a redundant serialize-deserialize round.
Co-authored-by: Kevin Quick <kquick@galois.com>
These packages replace the old macaw-arm (which has been removed). The only
change to the core macaw is to introduce a `Lift` instance for the Endianness
data type, which is used in macaw-semmc.
The macaw-aarch32 package uses the official ARM semantics (via the
asl-translator package). In its current state, macaw-aarch32 seems to handle
the common idioms of simple ARM binaries. Position independent executables have
not been tested yet. The semantics and disassemblers for Thumb are present, but
not integrated into code discovery at this time. There are some tests in
macaw-aarch32. Compile times are longer than necessarily desired.
macaw-aarch32 can be compiled in two modes: lite mode (cabal flag -fasl-lite),
which uses a restricted set of instructions for testing, and takes less time to
compile. The full instruction set is the default, though there are a few
undefined functions that are not yet handled for the full set, mostly relating
to floating point operations.
The macaw-aarch32-symbolic package is currently a stub, but is implemented to
provide a few necessary instances.
We were reading partially updated values that were committed to the register
state out-of-order, yielding some bad results.
This commit takes a snapshot of the register state before executing each
instruction and only reads register values from the snapshot.