The design of the deep learning framework in Marian is based on reverse-mode [auto-differentiation](https://en.wikipedia.org/wiki/Automatic_differentiation) (also known as backpropagation) with dynamic computation graphs.
Computation graphs allow a great deal of freedom in network architectures, and they can deal with complicated structures like conditions and loops.
Dynamic declaration, in which a new graph is created for each training instance (a single example or a batch), is also advantageous.
It allows handling of variably sized inputs, as well as cases where the graph may change depending on the results of previous steps.
Compared to static declaration, dynamic computation graphs can be expensive, as the graph has to be created and optimised repeatedly.
Marian uses careful memory management to remove overhead in computation graph construction, and supports efficient execution on both CPU and GPU.
All the numerical computations are expressed as a computation graph.
A computation graph (or graph in short) is a series of operations arranged into a graph of nodes.
To put it simply, a graph is just an arrangement of nodes that represent what you want to do with the data.
**Example 1**
Suppose you want to calculate the expression: `z=x*y+sin(x)`.
The computation graph of this expression is something like Figure 1.
![fig1](images/graph_example1.jpg "Figure 1 An example of computation graph")
*Figure 1 An example of computation graph*
In Marian, the `ExpressionGraph` class is the main implementation of a computation graph.
An `ExpressionGraph` object keeps a record of data (tensors) and all operations in a directed graph consisting of `Node` objects.
A `Node` is the basic unit of a graph. It can be an operation (e.g., `dot()`) or a tensor.
Each operation in a graph is a `NaryNodeOp` (a subclass of `Node`).
Each operation defines its forward and backward steps.
Besides operations, a `Node` can also be a constant tensor (`ConstantNode`) or a parameter tensor (`ParamNode`).
To create a graph, we use the `New<>` shortcut, which constructs the object and returns it wrapped in a `Ptr<>` (Marian's shared-pointer type), in place of regular constructors:
```cpp
// create a graph
auto graph = New<ExpressionGraph>();
```
After creating a graph, we also need to initialise the graph object with device options via `setDevice()` and reserve workspace memory via `reserveWorkspaceMB()`; otherwise the program will crash.
```cpp
// initialise graph with device options
// here we specify device no. is 0
// device type can be DeviceType::cpu or DeviceType::gpu
graph->setDevice({0, DeviceType::cpu});
// preallocate workspace memory (MB) for the graph
graph->reserveWorkspaceMB(128);
```
The _workspace memory_ is the memory available for the forward and backward steps of the training procedure.
It does not include the model parameters and the optimiser state, which are allocated outside the workspace.
Hence you cannot allocate all device memory to the workspace.
To build up a graph, Marian offers a set of shortcut functions that implement the common expression operators of a neural network, such as `affine()` (see [src/graph/expression_operators.h](file_src_graph_expression_operators.h)).
These functions construct the corresponding operation nodes in the graph and link them with other nodes.
E.g., `affine()` constructs an `AffineNodeOp` node in the graph.
Thus, building a graph turns into the simple task of defining expressions using these functions.
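For illustration, here is a minimal sketch of a single affine layer `y = x*W + b` built from these shortcut functions (`x` is an existing input node; `DIM_IN` and `DIM_OUT` are placeholder dimensions, not names from Marian):
```cpp
// a single affine layer y = x*W + b built from shortcut functions;
// DIM_IN and DIM_OUT are placeholders assumed to be defined elsewhere
auto W = graph->param("W", {DIM_IN, DIM_OUT}, inits::glorotUniform());
auto b = graph->param("b", {1, DIM_OUT}, inits::zeros());
auto y = affine(x, W, b); // adds an AffineNodeOp linked to x, W and b
```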
**Building graph of Example 1 using Marian**
The following code is used to build the graph in Example 1 with inputs `x=2` and `y=3`.
```cpp
// create and initialise a graph object
auto graph = New<ExpressionGraph>();
graph->setDevice({0, DeviceType::cpu});
graph->reserveWorkspaceMB(8);
// add input node x
auto x = graph->constant({1,1}, inits::fromValue(2));
// add input node y
auto y = graph->constant({1,1}, inits::fromValue(3));
// define expression
auto mulOp = x*y;
auto sinOp = sin(x);
auto z = mulOp + sinOp;
// You can also define this expression: auto z = x*y + sin(x);
```
For the above example, `constant()` is used to construct a constant node (a tensor) in the graph as the input.
We will give more details about this function in the next section [**Node types**](#node-types).
The operators `*` and `+` and the function `sin()` add the corresponding operation nodes (e.g., `MultNodeOp`, `PlusNodeOp` and `SinNodeOp`) to the graph.
To inspect the graph, Marian offers the `graphviz()` function, which generates the graph layout in Graphviz format for visualisation.
This visualisation might not be practical for real-size graphs due to an enormous number of nodes and layers.
You can print the graph layout on console by running the following code:
```cpp
// print the graph layout on console
std::cout<<graph->graphviz()<<std::endl;
```
**Graph visualisation of Example 1**
The resulting graph is shown in Figure 2. Here we use an online Graphviz editor [edotor](https://edotor.net/) to generate the graph (by pasting the output of `graphviz()`).
![fig2](images/example1_dot.png "Figure 2 Graph layout of Example 1")
*Figure 2 Graph layout of Example 1*
In Figure 2, there are two numbers (between the pair of parentheses) in each node.
The first number indicates the node ID, and the second number specifies whether the node is trainable (0 means no; 1 means yes).
We will cover the concept of *trainable* in [**ParamNode section**](#paramnode).
One thing to notice here is that Marian adopts dynamic computation graphs;
this means that the nodes are consumed once the forward or backward pass is performed.
Thus, we need to call the `graphviz()` function before performing the computation.
## Node types
As mentioned earlier, `Node` is the basic unit of a graph.
Each `Node` defines its forward steps in `Node::forward()` and backward steps in `Node::backward()`.
To access the resulting tensor computed in the forward pass, we can call `Node::val()`, while `Node::grad()` returns the accumulated gradient (a tensor) from the backward pass.
There are three main classes of Node in Marian: `ConstantNode`, `ParamNode` and `NaryNodeOp`.
### ConstantNode
The `ConstantNode` class is used to construct a constant node in the graph.
A constant node is actually a constant tensor whose value is immutable during training.
A `ConstantNode` instance is usually used to construct the input layer.
To construct a constant node in the graph, we can use the `constant()` function of the `ExpressionGraph` class.
We need to specify the shape and element type for the constant node.
For the shape, we can initialise a `Shape` instance in the same way as a vector, e.g., `Shape shape={2,3};` defines a 2D matrix with `dim[0]=2` and `dim[1]=3`.
The element type must be one of the values of the `Type` enumeration, which lists all data types supported by Marian, e.g., `Type::float16`.
If the type is not specified, the default type of the graph is used; this is usually `Type::float32` unless you change it with `setDefaultElementType()`.
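For example, to make the graph default to 16-bit floats (a sketch; this should be done before adding nodes to the graph):
```cpp
// change the default element type of the graph to 16-bit floats;
// nodes added afterwards use this type unless specified otherwise
graph->setDefaultElementType(Type::float16);
```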
```cpp
// construct a constant node in the graph with the default element type;
// N, NUM_FEATURES and inputData are assumed to be defined elsewhere
auto x = graph->constant({N, NUM_FEATURES}, inits::fromVector(inputData));
```
For the above example, the shape of the constant node is `{N, NUM_FEATURES}`, and the value of the constant node is initialised from a vector `inputData`.
`inits::fromVector()` returns a `NodeInitializer` which is a functor used to initialise a tensor by copying from the given vector.
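As a sketch, the element type can also be given explicitly when creating the node (assuming a Marian version where `constant()` accepts the element type as an additional argument):
```cpp
// a 2x3 constant node with an explicitly specified element type
std::vector<float> inputData = {1, 2, 3, 4, 5, 6};
auto c = graph->constant({2, 3}, inits::fromVector(inputData), Type::float32);
```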
More shortcut functions for adding operations to the graph can be found in the [src/graph/expression_operators.h](file_src_graph_expression_operators.h) file.
Once you finish building a graph by adding all the nodes, now you can perform the real computation.
### Forward pass
The forward pass refers to the process of evaluating the operations in the graph.
It traverses all nodes from the input layer (leaves) to the output layer (root).
To perform the forward pass, you can call the `forward()` function, which mainly does two things:
- allocates memory for each node (`Node::allocate()`)
- computes the new tensor for each node by performing the required operation (`Node::forward()`); the resulting tensor is stored in the node's `val_` attribute
**Forward pass of Example 1**
To run the forward pass of Example 1, you can run the following code:
```cpp
// Perform the forward pass on the nodes of the graph
graph->forward();
// get the computation result of z
std::vector<float> w;
z->val()->get(w);
std::cout<<"z="<<w[0]<<std::endl;
// The output is: z=6.9093
```
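As a sanity check: `z = 2*3 + sin(2) ≈ 6 + 0.9093 = 6.9093` (note that `sin()` operates on radians), which matches the printed output.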
### Backward pass
The backward pass refers to the process of propagating the output error backwards through the graph to compute the gradients.
It traverses all *trainable* nodes from the output layer back to the input layer.
You can call `backward()` to perform the backward pass.
The `backward()` function mainly computes the gradients using the chain rule:
- it allocates memory and initialises the gradient for each *trainable* node
- it computes the gradients according to the backward steps (`Node::backwardOps()`) of each node and stores them in the node's `adj_` attribute
- applying the chain rule, it propagates the gradients all the way back to the input layer
Marian also provides a shortcut function `backprop()`, which first performs the forward pass and then the backward pass on the nodes of the graph:
```cpp
// Perform backpropagation on the graph
graph->backprop();
// This function is equal to the following code:
/*
graph->forward();
graph->backward();
*/
```
**Backward pass of modified Example 1**
As shown in Figure 2, there is no trainable node in the graph of Example 1;
this means we cannot perform the backward pass on this graph.
To demonstrate the backward pass, we modify Example 1 by changing the constant node `x` into a parameter node (i.e., changing `constant()` to `param()`).
Here is the modification:
```cpp
// add parameter node x
auto x = graph->param("x", {1,1}, inits::fromValue(2));
```
The resulting graph is also different as displayed in Figure 3.
![fig3](images/example1_dot2.png "Figure 3 Graph layout of modified Example 1")
*Figure 3 Graph layout of modified Example 1*
To perform the backward pass of modified Example 1, you can run the following code:
```cpp
// Perform the backward pass on the trainable nodes of the graph
graph->backward();
// get the gradient of x node
std::vector<float> b;
x->grad()->get(b);
std::cout<<"dz/dx="<<b[0]<<std::endl;
// The output is: dz/dx=2.58385
```
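As a sanity check: since `z = x*y + sin(x)`, we have `dz/dx = y + cos(x) = 3 + cos(2) ≈ 2.58385`, which matches the printed output.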
### Optimiser
After the backward pass, we obtain the gradients of the leaves.
However, the job is not done yet.
To train a model, we need to update the model parameters according to the gradients.
This brings us to how we define the loss function and the optimiser for the graph.
A loss function is used to calculate the model error between the predicted value and the actual value.
The goal is to minimise this error during training.
In a graph, the loss function is also represented as a group of node(s).
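For illustration, here is a minimal sketch of a mean-squared-error loss expressed as graph nodes (`pred` and `target` are assumed to be existing expressions of equal shape; the operator names follow [src/graph/expression_operators.h](file_src_graph_expression_operators.h)):
```cpp
// mean-squared-error loss built from graph operators;
// pred and target are assumed Exprs of the same shape
auto diff = pred - target;                         // element-wise difference
auto loss = mean(flatten(diff * diff), /*ax=*/0);  // average of squared errors
```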