# Expression graphs

The design of the deep learning framework in Marian is based on reverse-mode [auto-differentiation](https://en.wikipedia.org/wiki/Automatic_differentiation) (also known as backpropagation) with dynamic computation graphs.
Computation graphs allow a great deal of freedom in network architectures, and they can deal with complicated structures like conditions and loops.
Dynamic declaration, which means that a new graph is created for each training instance (a training example or a batch), is also advantageous.
It allows handling of variably sized inputs, as well as cases where the graph may change depending on the results of previous steps.
Compared to static declaration, dynamic computation graphs can be more expensive to create and optimise.
Marian uses careful memory management to remove overhead in computation graph construction, and supports efficient execution on both CPU and GPU.
The main implementation of the computation graph is under the [src/graph](dir_src_graph) directory.

Building blocks for graphs:

- [graph construction](#graph-construction)
- [node types](#node-types)
- [graph execution](#graph-execution)

## Graph construction

What is a computation graph?
All the numerical computations are expressed as a computation graph.
A computation graph (or graph for short) is a series of operations arranged into a graph of nodes.
To put it simply, a graph is just an arrangement of nodes that represent what you want to do with the data.

**Example 1**

Suppose you want to calculate the expression: `z=x*y+sin(x)`.

The computation graph of this expression is something like Figure 1.

![fig1](images/graph_example1.jpg "Figure 1 An example of computation graph")

*Figure 1 An example of computation graph*

In Marian, the `ExpressionGraph` class is the main implementation of a computation graph.
An `ExpressionGraph` object keeps a record of data (tensors) and all operations in a directed graph consisting of `Node` objects.
A `Node` is the basic unit of a graph. It can be an operation (e.g., `dot()`) or a tensor.
Each operation in a graph is a `NaryNodeOp` (a child of the `Node` class).
Each operation defines its forward and backward steps.
Besides operations, a `Node` can also be a constant tensor (`ConstantNode`) or a parameter tensor (`ParamNode`).

To create a graph, we use the `New<>` shortcut in place of regular constructors:

```cpp
// create a graph
auto graph = New<ExpressionGraph>();
```

After creating a graph, we also need to initialise the graph object with device options via `setDevice()` and workspace memory via `reserveWorkspaceMB()`; otherwise the program will crash.

```cpp
// initialise graph with device options
// here we specify device no. 0
// device type can be DeviceType::cpu or DeviceType::gpu
graph->setDevice({0, DeviceType::cpu});
// preallocate workspace memory (MB) for the graph
graph->reserveWorkspaceMB(128);
```

The _workspace memory_ is the amount of memory available for the forward and backward steps of the training procedure.
This does not include the model size and optimizer parameters, which are allocated outside the workspace.
Hence you cannot allocate all device memory to the workspace.

To build a graph, Marian offers a set of shortcut functions that implement the common expression operators for a neural network (see [src/graph/expression_operators.h](file_src_graph_expression_operators.h)), such as `affine()`.
These functions construct the corresponding operation nodes in the graph and link them with other nodes.
E.g., `affine()` constructs an `AffineNodeOp` node in the graph.
Thus, building a graph turns into a simple task of defining expressions using those functions.

**Building graph of Example 1 using Marian**

The following code is used to build the graph in Example 1 with inputs `x=2` and `y=3`.

```cpp
// create and initialise a graph object
auto graph = New<ExpressionGraph>();
graph->setDevice({0, DeviceType::cpu});
graph->reserveWorkspaceMB(8);
// add input node x
auto x = graph->constant({1,1}, inits::fromValue(2));
// add input node y
auto y = graph->constant({1,1}, inits::fromValue(3));
// define expression
auto mulOp = x*y;
auto sinOp = sin(x);
auto z = mulOp + sinOp;
// You can also define this expression as: auto z = x*y + sin(x);
```

For the above example, `constant()` is used to construct a constant node (a tensor) in the graph as the input.
We will give more details about this function in the next section, [**Node types**](#node-types).
The operators `*` and `+` and the function `sin()` add the corresponding operation nodes (e.g., `MultNodeOp` and `SinNodeOp`) to the graph.

To check the graph, Marian offers the `graphviz()` function to generate the graph layout in Graphviz format for visualisation.
This visualisation might not be practical for real-size graphs due to an enormous number of nodes and layers.
You can print the graph layout on the console by running the following code:

```cpp
// print the graph layout on console
std::cout << graph->graphviz() << std::endl;
```

**Graph visualisation of Example 1**

The resulting graph is shown in Figure 2. Here we use an online Graphviz editor, [edotor](https://edotor.net/), to generate the graph (by pasting the output of `graphviz()`).

![fig2](images/example1_dot.png "Figure 2 Graph layout of Example 1")

*Figure 2 Graph layout of Example 1*

In Figure 2, there are two numbers (between the pair of parentheses) in each node.
The first number indicates the node ID, and the second specifies whether the node is trainable (0 means no; 1 means yes).
We will cover the concept of *trainable* in the [**ParamNode section**](#paramnode).

One thing to notice here is that Marian adopts dynamic computation graphs:
the nodes are consumed once the forward or backward pass is performed.
Thus, we need to call the `graphviz()` function before performing the computation.
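
For instance, a minimal sketch of the correct ordering (reusing the graph of Example 1):

```cpp
// capture the graph layout before executing the graph
std::string dot = graph->graphviz();
// the forward pass consumes the nodes
graph->forward();
```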
## Node types

As mentioned earlier, `Node` is the basic unit of a graph.
Each `Node` defines its forward steps in `Node::forward()` and backward steps in `Node::backward()`.
To access the resulting tensor from the forward pass, we can call `Node::val()`, while `Node::grad()` returns the accumulated gradients (a tensor) from the backward pass.
There are three main classes of `Node` in Marian: `ConstantNode`, `ParamNode` and `NaryNodeOp`.

### ConstantNode

The `ConstantNode` class is used to construct a constant node in the graph.
A constant node is actually a constant tensor whose value is immutable during training.
A `ConstantNode` instance is usually used to construct the input layer.
To construct a constant node in the graph, we can use the `constant()` function in the `ExpressionGraph` class.
We need to specify the shape and element type for the constant node.
For the shape, we can initialise a `Shape` instance in the same way as a vector.
E.g., `Shape shape={2,3};` means a 2D matrix with `dim[0]=2` and `dim[1]=3`.
The element type must be one of the values stored in the `Type` enumeration.
`Type` stores all the data types supported in Marian, e.g., `Type::float16`.
If the type is not specified, the default type of the graph is used.
The default type of the graph is usually `Type::float32` unless you change it with `setDefaultElementType()`.
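
A hedged sketch of specifying the element type explicitly (the exact overloads may vary across Marian versions):

```cpp
// construct a constant node with an explicitly given element type,
// overriding the graph's default element type
auto c = graph->constant({2, 3}, inits::zeros(), Type::float32);
```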
```cpp
// construct a constant node in the graph with the default type
auto x = graph->constant({N, NUM_FEATURES}, inits::fromVector(inputData));
```

For the above example, the shape of the constant node is `{N, NUM_FEATURES}`, and its value is initialised from the vector `inputData`.
`inits::fromVector()` returns a `NodeInitializer`, which is a functor used to initialise a tensor by copying from the given vector.
More functions used to initialise a node can be found in the [src/graph/node_initializers.h](namespace_marian__inits) file.
Marian also provides some shortcut functions to construct special constant nodes, such as `ones()` and `zeros()`:

```cpp
// construct a constant node filled with 1
auto ones = graph->ones({10,10});
// construct a constant node filled with 0
auto zeros = graph->zeros({10,10});
```

### ParamNode

`ParamNode` is used to store model parameters whose values can change during training, such as weights and biases.
In addition to the shape and the element type, we need to specify whether a `ParamNode` object is _trainable_.
If a parameter node is _trainable_, its value is tracked and updated during the training procedure.
For a `ParamNode`, the default value of `trainable_` is `true`.
We can define whether a parameter node is trainable with the `Node::setTrainable()` function.
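
For instance, a minimal sketch of freezing a parameter node (using the parameter `W1` constructed below):

```cpp
// exclude the parameter from gradient updates during training
W1->setTrainable(false);
```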

To construct a parameter node in the graph, we use the `param()` function in the `ExpressionGraph` class.
For a parameter node, we also need to specify its name.

```cpp
// construct a parameter node called W1 in the graph
auto W1 = graph->param("W1", {NUM_FEATURES, 5}, inits::uniform(-0.1f, 0.1f));
```

The parameter node `W1` has a shape of `{NUM_FEATURES, 5}` and is initialised with random numbers drawn from the uniform distribution `Uniform(-0.1, 0.1)`.

### NaryNodeOp

`NaryNodeOp` is the base class that defines the operations in a graph.
It mainly covers unary and binary operators.
Each `NaryNodeOp` defines its forward operations in `Node::forwardOps()` and backward operations in `Node::backwardOps()`.
The current version of Marian provides a set of common operations (inherited from `NaryNodeOp`) used to build a neural network,
such as `AffineNodeOp` (affine transformation), `CrossEntropyNodeOp` (cross-entropy loss function) and `TanhNodeOp` (tanh activation function).
As mentioned earlier, Marian implements a set of APIs that make it easy to add operations to the graph.
E.g., we can use `affine()` to perform an affine transformation and then `tanh()` to apply the tanh activation function to the result:

```cpp
// perform the affine transformation x*W1 + b1
// and then apply the tanh activation function
auto h = tanh(affine(x, W1, b1));
```

In the above example, `affine()` and `tanh()` add `AffineNodeOp` and `TanhNodeOp` nodes to the graph.
More shortcut functions used to add operations to the graph can be found in the [src/graph/expression_operators.h](file_src_graph_expression_operators.h) file.

## Graph execution

Once you finish building a graph by adding all the nodes, you can perform the actual computation.

### Forward pass

The forward pass refers to the calculation process.
It traverses the nodes from the input layer (leaves) to the output layer (root).
To perform the forward pass, you can call the `forward()` function. The `forward()` function mainly does two things:

- allocates memory for each node (`Node::allocate()`)
- computes the new tensor for each node by performing the required operations (`Node::forward()`); the resulting tensor is stored in the `val_` attribute of each node

**Forward pass of Example 1**

To run the forward pass of Example 1, you can run the following code:

```cpp
// perform the forward pass on the nodes of the graph
graph->forward();
// get the computation result of z
std::vector<float> w;
z->val()->get(w);
std::cout << "z=" << w[0] << std::endl;
// The output is: z=6.9093
```
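
This agrees with a manual calculation: `z = x*y + sin(x) = 2*3 + sin(2) ≈ 6 + 0.9093 = 6.9093`.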

### Backward pass

The backward pass refers to the process of propagating the output error backwards through the graph.
It traverses all *trainable* nodes from the output layer to the input layer.
You can call `backward()` to perform the backward pass.
The `backward()` function mainly computes the gradients using the chain rule:

- allocates memory and initialises gradients for each *trainable* node
- computes the gradients based on the backward steps (`Node::backwardOps()`) of each node, and stores them in the `adj_` attribute of each node
- using the chain rule, propagates all the way back to the input layer

We also provide a shortcut function `backprop()`, which first performs the forward pass and then the backward pass on the nodes of the graph:

```cpp
// perform backpropagation on the graph
graph->backprop();
// This function is equivalent to the following code:
/*
graph->forward();
graph->backward();
*/
```

**Backward pass of modified Example 1**

As shown in Figure 2, there is no trainable node in the graph of Example 1;
this means we cannot perform the backward pass on this graph.
To demonstrate the backward pass, we modify Example 1 by changing the constant node `x` to a parameter node (i.e., changing `constant()` to `param()`).
Here is the modification:

```cpp
// add parameter node x
auto x = graph->param("x", {1,1}, inits::fromValue(2));
```

The resulting graph is also different, as displayed in Figure 3.

![fig3](images/example1_dot2.png "Figure 3 Graph layout of modified Example 1")

*Figure 3 Graph layout of modified Example 1*

To perform the backward pass of modified Example 1, you can run the following code:

```cpp
// perform the backward pass on the trainable nodes of the graph
graph->backward();
// get the gradient of node x
std::vector<float> b;
x->grad()->get(b);
std::cout << "dz/dx=" << b[0] << std::endl;
// The output is: dz/dx=2.58385
```
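
This matches the analytic gradient: `dz/dx = y + cos(x) = 3 + cos(2) ≈ 2.58385`.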
### Optimiser

After the backward pass, we obtain the gradients of the leaves.
However, the job is not done yet.
To train a model, we need to update the model parameters according to the gradients.
This brings us to how we define the loss function and optimiser for the graph.

A loss function is used to calculate the model error between the predicted and the actual value.
The goal is to minimise this error during training.
In a graph, the loss function is also represented as a group of nodes.
You can use the operators provided in the [src/graph/expression_operators.h](file_src_graph_expression_operators.h) file to define the loss function.
E.g., Marian offers the `cross_entropy()` function to compute the cross-entropy loss between true labels and predicted labels.
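
A hedged sketch of its use (here `logits` and `labelIndices` are assumed placeholders for the model's output scores and the true label indices; the exact signature may vary across Marian versions):

```cpp
// compute the cross-entropy loss between predicted scores and true labels
auto ce = cross_entropy(logits, labelIndices);
```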
**Define a loss function for modified Example 1**

Suppose we know that the actual value of `z` is 6 with `y=3`, and `x` is the parameter we would like to learn from the model.
The loss function we choose here is the absolute error:

```cpp
// pass the actual value to the model
auto actual = graph->constant({1,1}, inits::fromValue(6));
// define the loss function
auto loss = abs(actual - z);
```
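
With `x=2` and `y=3`, the prediction is `z ≈ 6.9093`, so this loss evaluates to `|6 - 6.9093| ≈ 0.9093`.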

The graph then changes to Figure 4.

![fig4](images/example1_dot3.png "Figure 4 Graph layout of modified Example 1 with loss function")

*Figure 4 Graph layout of modified Example 1 with loss function*

The purpose of the optimiser is to adjust the parameters to fit the data.
In Marian, there are three built-in optimiser classes: `Sgd`, `Adagrad` and `Adam`.
`Sgd` is an optimiser based on [stochastic gradient descent](https://en.wikipedia.org/wiki/Stochastic_gradient_descent).
For each iteration, it updates a parameter `w` according to the rule `w = w - learning_rate * gradient`.
`Adagrad` implements the [Adagrad algorithm](https://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf),
an optimiser with parameter-specific learning rates, which are adapted relative to how frequently a parameter gets updated during training.
`Adam` is an implementation of the [Adam algorithm](https://arxiv.org/abs/1412.6980),
a stochastic gradient descent method based on adaptive estimation of first-order and second-order moments.
We use `Optimizer<>` to set up an optimiser with a learning rate:

```cpp
// choose an optimizer (Sgd, Adagrad, Adam) and the initial learning rate
auto opt = Optimizer<Adam>(0.01);
```

After an iteration of backpropagation, we can call the `update()` function to update the parameters:

```cpp
// update the parameters in the graph
opt->update(graph);
```

**Set up an optimiser for modified Example 1**

Continuing with Example 1, we choose `Sgd` as the optimiser and update the parameter `x`:

```cpp
// set up the Sgd optimiser with a learning rate of 0.005
auto opt = Optimizer<Sgd>(0.005);
// update the parameters
opt->update(graph);
// get the new value of x
std::vector<float> v;
x->val()->get(v);
std::cout << "x=" << v[0] << std::endl;
// The output is: x=1.98708
```
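
This follows the `Sgd` update rule: `x = x - learning_rate * gradient = 2 - 0.005 * 2.58385 ≈ 1.98708`.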
### Debugging

For debugging, we can call `debug()` to print node values. The `debug()` function has to be called prior to graph execution.
Once a node is marked for debugging, its value (the resulting tensor) and its gradient will be printed out during the forward and backward pass.
It is also recommended to turn on the Marian logger by calling `createLoggers()` for more information.

**Debugging for modified Example 1**

Suppose we want to check the results of node `x` during the computation. We can call `debug()` to mark node `x` for debugging.

```cpp
// mark node x for debugging with the logging message "Parameter x"
debug(x, "Parameter x");
```

The output with `createLoggers()` is shown as follows:

```
[2021-02-16 15:10:51] [memory] Reserving 256 B, device gpu0
[2021-02-16 15:10:51] Debug: Parameter x op=param
[2021-02-16 15:10:51] shape=1x1 size=1 type=float32 device=gpu0 ptr=140505547538432 bytes=256
min: 2.00000000 max: 2.00000000 l2-norm: 2.00000000
[[ 2.00000000 ]]

[2021-02-16 15:10:51] [memory] Reserving 256 B, device gpu0
[2021-02-16 15:10:51] Debug Grad: Parameter x op=param
[2021-02-16 15:10:51] shape=1x1 size=1 type=float32 device=gpu0 ptr=140505547538944 bytes=256
min: 2.58385324 max: 2.58385324 l2-norm: 2.58385324
[[ 2.58385324 ]]
```

### More advanced

Going into more detail about graph execution: a graph keeps track of all its `Node` objects in the `nodesForward_` and `nodesBackward_` lists.
`nodesForward_` contains all nodes used for the forward pass, and `nodesBackward_` contains all trainable nodes used for the backward pass.
All the tensor objects of a graph are stored in its `tensors_` attribute.
`tensors_` is a shared pointer holding memory and nodes for the graph.
Since each `Node` can result in new tensors, this attribute is used to allocate memory for new tensors during the forward and backward pass.
The `tensors_` attribute gets cleared before a new graph is built.
Another important attribute of `ExpressionGraph` is `paramsByElementType_`.
This attribute holds the memory and nodes that correspond to the graph parameters.
You can call the `params()` function of a graph to get all the parameter objects:

```cpp
// return the Parameters object related to the graph
// The Parameters object holds the whole set of the parameter nodes.
graph->params();
```
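
Related to this, a hedged sketch of looking up a single parameter by name (assuming a parameter `"W1"` was created earlier via `graph->param()`, and that the lookup returns a null expression when the name is unknown):

```cpp
// look up a single parameter node by its name
auto W1 = graph->get("W1");
if(W1)
  std::cout << W1->shape() << std::endl;
```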

Marian also provides APIs to support the mechanism of gradient checkpointing.
This method trades compute for memory: during the backward pass, it reruns a forward-pass segment for each checkpoint segment.
Currently, Marian only supports setting checkpoint nodes manually by calling `Node::markCheckpoint()` or `checkpoint()`.
To enable the gradient-checkpointing mode for a graph, we use `setCheckpointing()`:

```cpp
// enable gradient-checkpointing for a graph
graph->setCheckpointing(true);
```
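
A minimal sketch of marking a checkpoint manually (reusing the `affine`/`tanh` expression from the `NaryNodeOp` section; where to place checkpoints is a user decision):

```cpp
// mark the output of the affine+tanh segment as a checkpoint;
// the segment up to h is recomputed during the backward pass
auto h = checkpoint(tanh(affine(x, W1, b1)));
```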

We can also save and load the parameters of a graph in Marian.
We can call `save()` to save all the parameters of the graph into a file (`.npz` or `.bin` format).
The `load()` function can load all model parameters into the graph (from an array of `io::Items`, a file, or a buffer).

```cpp
// specify the filename
std::string filename = "my_model.npz";
// save all the parameters into a file
graph->save(filename);
// load model parameters from a file
graph->load(filename);
```