mirror of
https://github.com/marian-nmt/marian.git
synced 2024-12-11 09:54:22 +03:00
534 lines
14 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "slide"
|
|
}
|
|
},
|
|
"source": [
|
|
"# Encoder-Decoder implementation based on DL4MT"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"import numpy as np\n",
|
|
"model = np.load(\"model_hal.npz\")\n",
|
|
"\n",
|
|
"for matrix in model:\n",
|
|
"    print(matrix, model[matrix].shape)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "slide"
|
|
}
|
|
},
|
|
"source": [
|
|
"# Encoder"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 17,
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Wemb: (30000, 512)\n",
|
|
"encoder_U (1024, 2048)\n",
|
|
"encoder_W (512, 2048)\n",
|
|
"encoder_r_Wx (512, 1024)\n",
|
|
"encoder_bx (1024,)\n",
|
|
"encoder_b (2048,)\n",
|
|
"encoder_r_bx (1024,)\n",
|
|
"encoder_r_U (1024, 2048)\n",
|
|
"encoder_r_b (2048,)\n",
|
|
"encoder_r_W (512, 2048)\n",
|
|
"encoder_Ux (1024, 1024)\n",
|
|
"encoder_Wx (512, 1024)\n",
|
|
"encoder_r_Ux (1024, 1024)\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"print ('Wemb:', model['Wemb'].shape)\n",
|
|
"for matrix in model:\n",
|
|
"    if matrix.startswith(\"encoder\"):\n",
|
|
"        print(matrix, model[matrix].shape)\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"## Common\n",
|
|
"\n",
|
|
"* $\\overline{E}$ - `Wemb` - source word embeddings, common for both directions, size ${K_x \\times m}$, where $K_x = 30000$ and $m = 512$\n",
|
|
"* $m$: embedding size (e.g. 512)\n",
|
|
"* $n$: internal state size (e.g. 1024)\n",
|
|
"\n",
|
|
"## Forward pass\n",
|
|
"\n",
|
|
"* $\\overrightarrow{W}_x$ - `encoder_Wx`, size $m \\times n$\n",
|
|
"* $\\overrightarrow{U}_x$ - `encoder_Ux`, size $n \\times n$\n",
|
|
"* $\\overrightarrow{b}_x$ - `encoder_bx`, size $n$\n",
|
|
"* $\\overrightarrow{W}$ - `encoder_W`, size $m \\times 2n$\n",
|
|
"* $\\overrightarrow{U}$ - `encoder_U`, size $n\\times 2n$\n",
|
|
"* $\\overrightarrow{b}$ - `encoder_b`, size $2n$\n",
|
|
"\n",
|
|
"## Backward pass\n",
|
|
"Analogously, with `_r_` as an interfix."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"## Computation\n",
|
|
"\n",
|
|
"Differences compared with the Bahdanau model:\n",
|
|
"* The bias terms are placed differently.\n",
|
|
"* The reset $r_i$ and update $u_i$ vectors are computed together (they are concatenated).\n",
|
|
"\n",
|
|
"$$\n",
|
|
"\\renewcommand{\\ora}[1]{\\overrightarrow{#1}}\n",
|
|
"\\renewcommand{\\ola}[1]{\\overleftarrow{#1}}\n",
|
|
"\\ora{h}_i = \\left\\{\n",
|
|
"\\begin{array}{ll}\n",
|
|
" \\ora{u}_i \\circ \\ora{h}_{i-1} + (1- \\ora{u}_i) \\circ \\ora{\\underline{h}}_i & \\mathrm{, if ~ } i > 0 \\\\\n",
|
|
"0 & \\mathrm{, if ~} i = 0 \n",
|
|
"\\end{array}\n",
|
|
"\\right.\n",
|
|
"$$\n",
|
|
"\n",
|
|
"where \n",
|
|
"\n",
|
|
"$$\n",
|
|
"\\begin{eqnarray}\n",
|
|
"\\left[\n",
|
|
"\\begin{array}{c}\n",
|
|
"\\ora{r}_i \\\\\n",
|
|
"\\ora{u}_i\\end{array}\n",
|
|
"\\right] &=& \\sigma\\left(\\ora{h}_{i-1}\\ora{U} + \\overline{E}_i\\ora{W} + \\ora{b} \\right)\\\\\n",
|
|
"\\ora{\\underline{h}}_i &=& \\tanh\\left((\\ora{h}_{i-1}\\ora{U_x}) \\circ \\ora{r}_i+ \\overline{E}_i\\ora{W_x} + \\ora{b_x} \\right)\\\\\n",
|
|
"\\end{eqnarray}\n",
|
|
"$$\n",
|
|
"\n",
|
|
"The backward pass is similar. The pass over the words is reversed, but the implementation stays the same.\n",
|
|
"\n",
|
|
"For every word, the $\\ora{h}_i$ and $\\ola{h}_i$ are concatenated into $h_i$:\n",
|
|
"\n",
|
|
"$$\n",
|
|
"h_i = \\left[\n",
|
|
"\\begin{array}{c}\n",
|
|
"\\ora{h}_i \\\\\n",
|
|
"\\ola{h}_i\n",
|
|
"\\end{array}\n",
|
|
"\\right]\n",
|
|
"$$\n",
|
|
"\n",
|
|
"and the context matrix $c$ over all $T_x$ source words:\n",
|
|
"$$\n",
|
|
"c = \\left[ h_1, \\ldots, h_{T_x}\\right]\n",
|
|
"$$"
|
|
]
|
|
},
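The encoder equations above can be sketched in NumPy as follows. This is a minimal illustration, not the Marian or Nematus implementation: the helper names `gru_pass` and `random_params` are hypothetical, the weights are small random stand-ins rather than the matrices stored in `model_hal.npz`, and the sizes `T`, `m`, `n` are toy values. The gate order `[r; u]` follows the concatenation shown in the formula.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_pass(E, U, W, b, Ux, Wx, bx):
    """Run one encoder direction over the embeddings E, shape (T, m)."""
    n = Ux.shape[0]
    h = np.zeros(n)                      # h_0 = 0
    states = []
    for e in E:
        # Reset and update gates are computed together and then split.
        gates = sigmoid(h.dot(U) + e.dot(W) + b)          # shape (2n,)
        r, u = gates[:n], gates[n:]
        h_cand = np.tanh(h.dot(Ux) * r + e.dot(Wx) + bx)  # candidate state
        h = u * h + (1.0 - u) * h_cand
        states.append(h)
    return np.stack(states)              # shape (T, n)

def random_params(rng, m, n):
    """Hypothetical stand-ins with the shapes listed for the encoder."""
    return (rng.normal(scale=0.1, size=(n, 2 * n)),  # U
            rng.normal(scale=0.1, size=(m, 2 * n)),  # W
            np.zeros(2 * n),                         # b
            rng.normal(scale=0.1, size=(n, n)),      # Ux
            rng.normal(scale=0.1, size=(m, n)),      # Wx
            np.zeros(n))                             # bx

rng = np.random.RandomState(0)
T, m, n = 5, 4, 3                        # toy sizes, not 512 / 1024
E = rng.normal(size=(T, m))              # embeddings of one source sentence

fwd = gru_pass(E, *random_params(rng, m, n))
# The backward pass reads the words in reverse; its states are flipped back
# so that row i of both matrices refers to the same source word.
bwd = gru_pass(E[::-1], *random_params(rng, m, n))[::-1]

c = np.concatenate([fwd, bwd], axis=1)   # context matrix, shape (T, 2n)
print(c.shape)
```

The same `gru_pass` function serves both directions; only the word order and the parameter set change, which mirrors the remark that the implementation stays the same.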
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "slide"
|
|
}
|
|
},
|
|
"source": [
|
|
"# Decoder "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 11,
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"decoder_U (1024, 2048)\n",
|
|
"decoder_W (512, 2048)\n",
|
|
"decoder_b (2048,)\n",
|
|
"decoder_Wc (2048, 2048)\n",
|
|
"decoder_b_att (2048,)\n",
|
|
"decoder_bx_nl (1024,)\n",
|
|
"decoder_Wcx (2048, 1024)\n",
|
|
"decoder_Ux (1024, 1024)\n",
|
|
"decoder_bx (1024,)\n",
|
|
"decoder_Wc_att (2048, 2048)\n",
|
|
"decoder_U_att (2048, 1)\n",
|
|
"decoder_c_tt (1,)\n",
|
|
"decoder_U_nl (1024, 2048)\n",
|
|
"decoder_W_comb_att (1024, 2048)\n",
|
|
"decoder_b_nl (2048,)\n",
|
|
"decoder_Wx (512, 1024)\n",
|
|
"decoder_Ux_nl (1024, 1024)\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"for matrix in model:\n",
|
|
"    if matrix.startswith(\"decoder\"):\n",
|
|
"        print(matrix, model[matrix].shape)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"# Decoder RNN"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "fragment"
|
|
}
|
|
},
|
|
"source": [
|
|
"* $E_t$ - `Wemb_dec` - target language embeddings, size: ${K_y \\times m}$, where $K_y = 30000$ and $m = 512$\n",
|
|
"* $m$: embedding size (e.g. 512)\n",
|
|
"* $n$: internal state size (e.g. 1024)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"## Initialising decoder RNN"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "fragment"
|
|
}
|
|
},
|
|
"source": [
|
|
"In the Bahdanau Model, the last $h_i$ was taken to compute an initial state for the decoder.\n",
|
|
"\n",
|
|
"This model takes the mean $h$ of all $h_i$.\n",
|
|
"\n",
|
|
"$$\n",
|
|
"\\qquad s_0 = \\tanh\\left(hW_I + b_I\\right)\n",
|
|
"$$\n",
|
|
"\n",
|
|
"* $W_I$ - `ff_state_W` - size ${2n \\times n}$\n",
|
|
"* $b_I$ - `ff_state_b` - size ${n}$"
|
|
]
|
|
},
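As a rough illustration of the initialisation, the snippet below averages the rows of a context matrix and applies the feed-forward layer. It is a sketch under assumptions: `c`, `W_I` and `b_I` are random stand-ins with the shapes given for `ff_state_W` and `ff_state_b`, and the sizes are toy values.

```python
import numpy as np

rng = np.random.RandomState(0)
T, n = 5, 3                                    # toy sizes

c = rng.uniform(-1, 1, size=(T, 2 * n))        # stand-in for the encoder states h_i
W_I = rng.normal(scale=0.1, size=(2 * n, n))   # same shape as ff_state_W
b_I = np.zeros(n)                              # same shape as ff_state_b

h_mean = c.mean(axis=0)                        # mean of all h_i, shape (2n,)
s0 = np.tanh(h_mean.dot(W_I) + b_I)            # initial decoder state, shape (n,)
print(s0.shape)
```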
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"## Computing a new RNN decoder state"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "fragment"
|
|
}
|
|
},
|
|
"source": [
|
|
"Let's define $E_i$ as the embedding vector of the word $y_i$. In other words, $E_i = Ey_i$.\n",
|
|
"\n",
|
|
"\n",
|
|
" * $E$ - `Wemb_dec` - target word embeddings, size: ${K_y \\times m}$, where $K_y = 30000$ and $m = 512$\n",
|
|
"\n",
|
|
"The computation of the next state is divided into two steps: computing an intermediate state, which is passed to the attention model, and computing the final state from the attention model's output."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"## Computing the hidden state (First GRU layer?)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "fragment"
|
|
}
|
|
},
|
|
"source": [
|
|
"$$\n",
|
|
"\\begin{eqnarray}\n",
|
|
"\\left[\n",
|
|
"\\begin{array}{c}\n",
|
|
"\\ora{r}_i^h \\\\\n",
|
|
"\\ora{u}_i^h\\end{array}\n",
|
|
"\\right] &=& \\sigma \\left( s_{i-1}U + E_{i-1}W + b \\right) \\\\\n",
|
|
"\\\\\n",
|
|
"\\overline{s}_i &=& \\tanh \\left( (s_{i-1}U_x) \\circ r_i^h + E_{i-1}W_x + b_x \\right) \\\\\n",
|
|
"\\\\\n",
|
|
"s_i &=& u_i^h \\circ s_{i-1} + (1- u_i^h) \\circ \\overline{s}_i \\\\\n",
|
|
"\\end{eqnarray}\n",
|
|
"$$\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "fragment"
|
|
}
|
|
},
|
|
"source": [
|
|
"**Computing reset and update vectors:**\n",
|
|
" * $ s_{i-1}$ - the previous decoder state, size: $n$\n",
|
|
" * $ E_{i-1}$ - the embedding of the word $y_{i-1}$, size: $m$\n",
|
|
" * $U$ - `decoder_U` - a matrix for a state, size: ${n \\times 2n}$\n",
|
|
" * $W$ - `decoder_W` - a matrix for a word embedding, size: ${m \\times 2n}$\n",
|
|
" * $b$ - `decoder_b` - a bias vector, size: ${2n}$\n",
|
|
"\n",
|
|
"**Computing a new hidden state:**\n",
|
|
" * $U_x$ - `decoder_Ux` - a matrix for the previous decoder state, size: ${n \\times n}$\n",
|
|
" * $W_x$ - `decoder_Wx` - a matrix for a word embedding, size: ${m \\times n}$\n",
|
|
" * $b_x$ - `decoder_bx` - a bias vector, size: ${n}$"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"## Attention model (or Alignment model)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "fragment"
|
|
}
|
|
},
|
|
"source": [
|
|
"$$\n",
|
|
"c_i = \\sum_{j=1}^{T_x} \\alpha_{ij}h_j\n",
|
|
"$$\n",
|
|
"\n",
|
|
"(or, if $h$ is a state matrix for an entire batch, $c = Ah$ where $A = \\left[\\alpha_{ij}\\right]$)\n",
|
|
"\n",
|
|
"where\n",
|
|
"\n",
|
|
"$$ \n",
|
|
"\\begin{eqnarray}\n",
|
|
"\\alpha_{ij} &=& \\frac{\\exp(e_{ij})}{\\sum_{k=1}^{T_x}\\exp(e_{ik})}\n",
|
|
"\\end{eqnarray}\n",
|
|
"$$\n",
|
|
"\n",
|
|
"$$\n",
|
|
"\\begin{eqnarray}\n",
|
|
"e_{ij} &=& v_\\alpha^T \\tanh\\left(s_{i} W_{\\alpha} + b_{\\alpha} + h_jU_{\\alpha}\\right) + c_{\\alpha}\n",
|
|
"\\end{eqnarray}\n",
|
|
"$$\n",
|
|
"\n",
|
|
"When doing batch computation, the sum in the last step involves rather complicated broadcasting to 3D tensors to get matching shapes.\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "fragment"
|
|
}
|
|
},
|
|
"source": [
|
|
"* $W_{\\alpha}$ - `decoder_W_comb_att` - size: ${n \\times 2n}$,\n",
|
|
"* $b_{\\alpha}$ - `decoder_b_att` - size: $2n$,\n",
|
|
"* $U_{\\alpha}$ - `decoder_Wc_att` - size: ${2n \\times 2n}$,\n",
|
|
"* $v_{\\alpha}$ - `decoder_U_att` - size: ${2n}$,\n",
|
|
"* $c_{\\alpha}$ - `decoder_c_tt` - a scalar (normalisation constant (?)),"
|
|
]
|
|
},
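For a single sentence (no batch dimension), the attention step can be sketched as below. The function name `attention` and all weights are hypothetical stand-ins; `v_a` is kept as a flat vector of length $2n$ instead of the stored $(2n, 1)$ matrix, and the sizes are toy values.

```python
import numpy as np

def attention(s, h, W_a, b_a, U_a, v_a, c_a):
    """s: decoder state (n,); h: encoder states (T_x, 2n)."""
    # s.dot(W_a) + b_a has shape (2n,) and is broadcast over the T_x rows.
    hidden = np.tanh(s.dot(W_a) + b_a + h.dot(U_a))   # (T_x, 2n)
    e = hidden.dot(v_a) + c_a                         # scores e_ij, (T_x,)
    e = e - e.max()                                   # numerical stability
    alpha = np.exp(e) / np.exp(e).sum()               # softmax over source words
    c = alpha.dot(h)                                  # weighted sum, (2n,)
    return alpha, c

rng = np.random.RandomState(0)
T_x, n = 5, 3                                         # toy sizes
h = rng.uniform(-1, 1, size=(T_x, 2 * n))
s = rng.uniform(-1, 1, size=n)
W_a = rng.normal(scale=0.1, size=(n, 2 * n))          # shape of decoder_W_comb_att
b_a = np.zeros(2 * n)                                 # shape of decoder_b_att
U_a = rng.normal(scale=0.1, size=(2 * n, 2 * n))      # shape of decoder_Wc_att
v_a = rng.normal(scale=0.1, size=2 * n)               # flattened decoder_U_att

alpha, c_i = attention(s, h, W_a, b_a, U_a, v_a, c_a=0.0)
alpha_shift, _ = attention(s, h, W_a, b_a, U_a, v_a, c_a=5.0)
print(alpha.sum(), c_i.shape)
```

In this sketch `alpha` and `alpha_shift` agree, because a softmax is unchanged when the same constant is added to every score. Under the formula as written, the scalar $c_{\alpha}$ therefore does not affect the attention weights.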
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"## Computing the final decoder state (Second GRU layer?)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "fragment"
|
|
}
|
|
},
|
|
"source": [
|
|
"Note the different placement of the bias $b_x$ in the computation of the intermediate state $\\tilde{z}_i$: it is added before the reset gate is applied. This is an oddity in the way Nematus implements GRUs.\n",
|
|
"\n",
|
|
"$$\n",
|
|
"\\begin{eqnarray}\n",
|
|
"\\left[\n",
|
|
"\\begin{array}{c}\n",
|
|
"\\ora{r}_i^f \\\\\n",
|
|
"\\ora{u}_i^f\\end{array}\n",
|
|
"\\right] &=& \\sigma \\left( s_iU + c_iW + b \\right) \\\\\n",
|
|
"\\\\\n",
|
|
"\\tilde{z}_i &=& \\tanh\\left( (s_iU_x + b_x) \\circ r_i^f + c_iW_x \\right) \\\\\n",
|
|
"\\\\\n",
|
|
"z_i &=& u_i^f \\circ s_i + (1 - u_i^f) \\circ \\tilde{z}_i\n",
|
|
"\\end{eqnarray}\n",
|
|
"$$"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "fragment"
|
|
}
|
|
},
|
|
"source": [
|
|
"**Computing the reset and update vectors:**\n",
|
|
" * $U$ - `decoder_U_nl` - matrix for the state, size: ${n \\times 2n}$\n",
|
|
" * $W$ - `decoder_Wc` - matrix for the context vector, size: ${2n \\times 2n}$\n",
|
|
" * $b$ - `decoder_b_nl` - bias vector, size: ${ 2n}$\n",
|
|
"\n",
|
|
"**Computing the next state**\n",
|
|
" * $U_x$ - `decoder_Ux_nl` - matrix for a middle state, size: ${n \\times n}$\n",
|
|
" * $W_x$ - `decoder_Wcx` - matrix for the context vector, size: ${2n \\times n}$\n",
|
|
" * $b_x$ - `decoder_bx_nl` - bias vector, size: ${n}$"
|
|
]
|
|
},
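The second step can be sketched as follows, with the bias $b_x$ added inside the term that is multiplied by the reset gate, as in the equations above. The function name `gru2_step` and the random weights are assumptions for illustration; the shapes follow the `decoder_*_nl`, `decoder_Wc` and `decoder_Wcx` entries printed earlier, scaled down to toy sizes.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru2_step(s, c, U, W, b, Ux, Wx, bx):
    """s: intermediate state (n,); c: attention context vector (2n,)."""
    n = s.shape[0]
    gates = sigmoid(s.dot(U) + c.dot(W) + b)   # [r; u], shape (2n,)
    r, u = gates[:n], gates[n:]
    # Here the bias bx sits inside the product with the reset gate,
    # unlike in the encoder and in the first decoder step.
    z_cand = np.tanh((s.dot(Ux) + bx) * r + c.dot(Wx))
    return u * s + (1.0 - u) * z_cand

rng = np.random.RandomState(0)
n = 3                                          # toy state size
s = rng.uniform(-1, 1, size=n)
c = rng.uniform(-1, 1, size=2 * n)

U = rng.normal(scale=0.1, size=(n, 2 * n))       # shape of decoder_U_nl
W = rng.normal(scale=0.1, size=(2 * n, 2 * n))   # shape of decoder_Wc
b = np.zeros(2 * n)                              # shape of decoder_b_nl
Ux = rng.normal(scale=0.1, size=(n, n))          # shape of decoder_Ux_nl
Wx = rng.normal(scale=0.1, size=(2 * n, n))      # shape of decoder_Wcx
bx = rng.normal(scale=0.1, size=n)               # shape of decoder_bx_nl

z = gru2_step(s, c, U, W, b, Ux, Wx, bx)
print(z.shape)
```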
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"## ReadOut or Deep Output or just getting probabilities over words"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "fragment"
|
|
}
|
|
},
|
|
"source": [
|
|
"$$\n",
|
|
"\\begin{eqnarray}\n",
|
|
" t_i &=&\\tanh \\left( \\left( z_iW_1 + b_1 \\right) + \\left( E_{i-1} W_2 + b_2 \\right) + \\left( c_iW_3 + b_3 \\right) \\right) \\\\\n",
|
|
"\\\\\n",
|
|
"p(y_i|z_{i},y_{i-1},c_i) &=& \\textrm{softmax} \\left( t_iW_4 + b_4 \\right) \\\\\n",
|
|
"\\end{eqnarray}\n",
|
|
"$$\n",
|
|
"\n",
|
|
"* $W_1$ - `ff_logit_lstm_W` - size: ${n} \\times {m}$ \n",
|
|
"* $b_1$ - `ff_logit_lstm_b` - size: ${m} $\n",
|
|
"* $W_2$ - `ff_logit_prev_W` - size: ${m} \\times {m}$ \n",
|
|
"* $b_2$ - `ff_logit_prev_b` - size: ${m} $\n",
|
|
"* $W_3$ - `ff_logit_ctx_W` - size: ${2n} \\times {m}$ \n",
|
|
"* $b_3$ - `ff_logit_ctx_b` - size: ${m} $\n",
|
|
"* $W_4$ - `ff_logit_W` - size: ${m} \\times K_y$ \n",
|
|
"* $b_4$ - `ff_logit_b` - size: $K_y $"
|
|
]
|
|
},
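The read-out layer can be illustrated with the sketch below. The function names `softmax` and `readout`, the random weights, and the toy sizes are assumptions. Only the shapes mirror the `ff_logit_*` parameters listed above.

```python
import numpy as np

def softmax(x):
    x = x - x.max()                    # numerical stability
    ex = np.exp(x)
    return ex / ex.sum()

def readout(z, E_prev, c, W1, b1, W2, b2, W3, b3, W4, b4):
    """Combine state, previous embedding and context, then score all words."""
    t = np.tanh((z.dot(W1) + b1) + (E_prev.dot(W2) + b2) + (c.dot(W3) + b3))
    return softmax(t.dot(W4) + b4)     # distribution over K_y target words

rng = np.random.RandomState(0)
n, m, K_y = 3, 4, 7                    # toy sizes, not 1024 / 512 / 30000

z = rng.uniform(-1, 1, size=n)         # final decoder state z_i
E_prev = rng.normal(size=m)            # embedding of the previous word
c = rng.uniform(-1, 1, size=2 * n)     # attention context c_i

W1, b1 = rng.normal(scale=0.1, size=(n, m)), np.zeros(m)        # ff_logit_lstm_*
W2, b2 = rng.normal(scale=0.1, size=(m, m)), np.zeros(m)        # ff_logit_prev_*
W3, b3 = rng.normal(scale=0.1, size=(2 * n, m)), np.zeros(m)    # ff_logit_ctx_*
W4, b4 = rng.normal(scale=0.1, size=(m, K_y)), np.zeros(K_y)    # ff_logit_*

p = readout(z, E_prev, c, W1, b1, W2, b2, W3, b3, W4, b4)
print(p.shape, p.sum())
```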
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [],
|
|
"source": []
|
|
}
|
|
],
|
|
"metadata": {
|
|
"celltoolbar": "Slideshow",
|
|
"kernelspec": {
|
|
"display_name": "Python 2",
|
|
"language": "python",
|
|
"name": "python2"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 2
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython2",
|
|
"version": "2.7.12"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 0
|
|
}
|