macaw/semmc-macaw.org
2017-09-29 09:37:45 -07:00

268 lines
13 KiB
Org Mode

* Overview
The high level goal is to write and/or generate architecture-specific backends
for /macaw/ based on the semantics discovered by /semmc/. In particular, we
are interested in making /macaw-ppc/ and /macaw-arm/. We will hand-write some
of the code, but we will generate as much as possible automatically. We will
read in the semantics files generated by /semmc/ and use Template
Haskell [fn:template-haskell] to generate a function that transforms machine
states according to the learned semantics.
We will implement a base package (/macaw-semmc/) that provides shared
infrastructure for all of our backends; this will include the Template Haskell
function to create a state transformer function from learned semantics files.
* Code dependencies and related packages
- macaw (binary code discovery)
- macaw-x86 (x86_64 backend for macaw)
- semmc (semantics learning and code synthesis)
- semmc-ppc (PowerPC backend for synthesis)
- dismantle-tablegen (disassembler infrastructure)
- dismantle-ppc (PowerPC disassembler)
- crucible (interface to SMT solvers)
- parameterized-utils (utilities for working with parameterized types)
** Semantics background
The /semmc/ library is designed to learn semantics for machine code
instructions. Its output, for each Instruction Set Architecture (ISA), is a
directory of files where each file contains a formula corresponding to the
semantics for an opcode in the ISA. For example, the ~ADDI.sem~ file
contains the semantics for the add immediate instruction in PowerPC.
There are functions in /semmc/ for dealing with this representation.
Formulas are loaded into a data type called ~ParameterizedFormula~, which
contains formula fragments based on the ~SimpleBuilder~ representation of /crucible/. This can be thought of as a convenient representation of SMT
formulas.
* Modules of note
- macaw: ~Data.Macaw.Architecture.Info~
This contains the machine-specific interface that must be implemented for
each backend to /macaw/: ~ArchitectureInfo~. There are many details, but
the main workhorse is ~disassembleFn~, which disassembles bytes into blocks
(sequences of statements with no branches).
- macaw: ~Data.Macaw.CFG.Core~
This defines some key types for the translation we will have to do:
- ~Stmt~: statements that comprise basic blocks (a three-address code style
representation).
- ~Value~: Values that can live in registers or memory, represented using an
expression language defined in /macaw/ (see ~App~ and ~Expr~). Most
values are bitvectors of various lengths.
- ~ArchFn~, ~ArchReg~, ~ArchStmt~, which are for representing
architecture-specific behavior that can't be represented with the ~Stmt~
type. These are type families that are instantiated for each backend.
- ~RegState~, which is a map from registers to ~Value~. The register type
is a parameter and is architecture-specific (e.g., ~X86Reg~). While this
is basically a map (parameterized map from /parameterized-utils/), it has
an additional invariant where it is always full (i.e., it has an entry for
every register).
Note that our goal is to translate machine instructions into one or more /macaw/ statements (the ~Stmt~ type). We will arrange these statements into
basic blocks (linear sequences of blocks with no branches). The bridge
between statements and the expression language is through the ~AssignStmt~
constructor of ~Stmt~, which establishes an assignment (similarly the ~WriteMem~ statement). An assignment defines a new virtual register in /macaw/ IR (via the ~Assignment~ type). The ~Assignment~ names the virtual
register it defines through the ~assignId~ field. The ~assignRhs~ contains
expressions through the ~EvalApp~ constructor (~App~ being the expression
language). The ~ReadMem~ constructor corresponds to reads from memory.
- macaw: ~Data.Macaw.CFG.App~
This module defines the expression language that is referenced by the ~Value~ type.
- macaw-x86: ~Data.Macaw.X86~
This module contains the /macaw/ backend for x86_64: ~x86_64_linux_info~.
The most important function in this definition is probably ~disassembleBlockFromAbsState~, which disassembles instructions into basic
blocks.
This module also contains implementations of the two important interfaces in /macaw-x86/: ~Semantics~ and ~IsValue~. We won't need the classes, but the
underlying ~X86Generator~ Monad is instructive, as is the representation of
expressions.
- macaw-x86: ~Data.Macaw.X86.X86Reg~
This module defines a representation of all of the parts of the machine
state for X86. Each backend will have something analogous. Note that the
definition of ~X86Reg~ is a GADT [fn:GADTs] (despite the unusual definition
style). This is important, as 1) /macaw/ expects the register type to have
a type parameter, and 2) the extra size guarantees are somewhat useful.
Note that the strange form of the declaration is most likely historical.
Before GHC 8.2, haddock could not parse documentation comments on GADT
constructors.
- semmc: ~SemMC.Formula.Load~
Load learned formulas from disk into a map from opcodes to formulas.
- crucible: ~Lang.Crucible.Solver.SimpleBuilder~
This module defines a different ~App~ type that is the expression language
for our parameterized formulas (i.e., instruction semantics). This is the
AST we'll be walking in the Template Haskell code. By and large, we only
use the bitvector operations. We also use a few uninterpreted functions to
represent floating point operations.
* Detailed approach
We will implement a /macaw/ ~ArchitectureInfo~ for each backend, starting with
PowerPC. There is a lot in this structure, so we will start by just
implementing a ~DisassembleFn~, which has the type:
#+BEGIN_SRC haskell
type DisassembleFn arch
= forall ids
. Memory (ArchAddrWidth arch)
-> NonceGenerator (ST ids) ids
-> ArchSegmentOff arch
-- ^ The offset to start reading from.
-> ArchAddrWord arch
-- ^ Maximum offset for this to read from.
-> AbsBlockState (ArchReg arch)
-- ^ Abstract state associated with address that we are disassembling
-- from.
--
-- This is used for things like the height of the x87 stack.
-> ST ids ([Block arch ids], MemWord (ArchAddrWidth arch), Maybe String)
#+END_SRC
Take the implementation of ~disassembleBlockFromAbsState~ in ~Data.Macaw.X86~.
Note that we can ignore the ~AbsBlockState~ parameter, which is only used for
x86. We also don't need to implement the entire function. We can start by
focusing on the equivalent of the ~execInstruction~ function. The surrounding
code we can most likely adapt without many changes.
The ~execInstruction~ function is defined in ~Data.Macaw.X86.Semantics~. The
signature of this function is more interesting than its implementation:
#+BEGIN_SRC haskell
execInstruction :: FullSemantics m
=> Value m (BVType 64)
-- ^ Next ip address
-> F.InstructionInstance
-> Maybe (m ())
#+END_SRC
This signature is more general than necessary: we can concretize the typeclass
constraint to a concrete ~Monad~ in the style of the ~X86Generator~ Monad. We
should create a simple Monad based on the ~State~ Monad from /mtl/ and provide
some functions on it that mirror those of the ~Semantics~ typeclass from /macaw-x86/. An example Monad declaration might be:
#+BEGIN_SRC haskell
{-# LANGUAGE GeneralizedNewtypeDeriving #-}
import Control.Monad.ST ( ST )
import qualified Control.Monad.State.Strict as St
data PreBlock ids = PreBlock { pBlockIndex :: !Word64
, pBlockAddr :: !(MemSegmentOff 64)
-- ^ Starting address of function in preblock.
, pBlockStmts :: !(Seq (Stmt X86_64 ids))
, pBlockState :: !(RegState X86Reg (Value X86_64 ids))
, pBlockApps :: !(MapF (App (Value X86_64 ids)) (Assignment X86_64 ids))
}
data GenState w s ids = GenState { assignIdGen :: !(NonceGenerator (ST s) ids)
, blockSeq :: !(BlockSeq ids)
, blockState :: !(PreBlock ids)
, genAddr :: !(MemSegmentOff w)
}
newtype MCGenerator w s ids a = MCGenerator { runGen :: St.StateT (GenState w s ids) (ST s) a }
deriving (Monad,
Functor,
Applicative,
St.MonadState (GenState w s ids))
#+END_SRC
The ~PreBlock~ type is the key: it is the block *currently* being constructed
(at any given time). It has a ~RegState~, which is one of the key things we
will be modifying. Many of the combinators relating to the ~X86Generator~ in /macaw-x86/ are defined in service of updating this state as machine code
instructions are encountered. It is a ~PreBlock~ because it isn't yet a
block. It becomes a block once we encounter a terminator instruction (e.g., a
jump of some kind). At that point, we add it to the underlying collection of
blocks.
We will need many of the helpers in the ~Data.Macaw.X86~ module that operate
on the ~X86Generator~ Monad. It may also be helpful to have an additional
component to the Monad to signal errors (e.g, ~Control.Monad.Except.ExceptT~).
We need the base of the Monad transformer stack to be ~ST~ so that we can
allocate nonces.
Since we are specializing our ~execInstruction~ to this Monad, its type will
look something like:
#+BEGIN_SRC haskell
execInstruction :: Value PPC.PPC ids (BVType w)
-- ^ Next ip address
-> PPC.Instruction
-- ^ An instruction from Dismantle
-> Maybe (MCGenerator w s ids ())
#+END_SRC
Think of this as the action that we take given an instruction and the value of
the instruction pointer (IP) when that instruction is executed. We pass in
the instruction pointer to accommodate IP-relative addressing (i.e., addresses
that are computed relative to the address of the instruction computing the
address). ~execInstruction~ returns a ~Maybe~ in case the instruction is
invalid. That is not especially likely given our encoding, but it is possible.
As an example of what an implementation of this function might look like is:
#+BEGIN_SRC haskell
execInstruction :: Value PPC.PPC ids (BVType w)
-- ^ Next ip address
-> PPC.Instruction
-- ^ An instruction from Dismantle
-> Maybe (MCGenerator w s ids ())
execInstruction ip (PPC.Instruction opcode operands) =
case opcode of
PPC.ADD4 ->
case operands of
(r1 :> r2 :> r3 :> Nil) -> Just $ do
v2 <- get r2
v3 <- get r3
define r1 (BVAdd v2 v3)
#+END_SRC
For appropriate definitions of ~get~ and ~define~, which read from and write
to (respectively) the ~RegState~ in the ~PreBlock~ of the ~GenState~ in the ~MCGenerator~ Monad.
** Possible task breakdown
1. Define a suitable ~MCGenerator~ state Monad and accompanying state
2. Work out some suitable helper functions (taking inspiration from /macaw-x86/, but not copying out all of the complexity) for managing the
state and accumulating basic blocks
3. Try to define a few simple cases for ~execInstruction~ to ensure that the
Monad and API is suitable and can actually be used to implement the main ~DisassembleFn~
4. Sketch out a function with the signature implied by ~DisassembleFn~,
ensuring that it is actually capable of calling the functions defined in
earlier steps
5. Start replacing the hand-written ~execInstruction~ with a Template Haskell
version
Note: we don't want to define ~execInstruction~ by hand - we just want to
learn enough to generate it from the formulas we obtain from /semmc/, and
figure out what primitives we need to support that.
* References
[fn:template-haskell] https://hackage.haskell.org/package/template-haskell
[fn:GADTs] https://downloads.haskell.org/~ghc/8.2.1/docs/html/users_guide/glasgow_exts.html#generalised-algebraic-data-types-gadts