marian/notebooks/dl4mt.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Encoder-Decoder implementation based on DL4MT"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "from __future__ import print_function  # same print() output on Python 2 and 3\n",
    "import numpy as np\n",
    "model = np.load(\"model_hal.npz\")\n",
    "\n",
    "for matrix in model:\n",
    "    print(matrix, model[matrix].shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Encoder"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Wemb: (30000, 512)\n",
      "encoder_U (1024, 2048)\n",
      "encoder_W (512, 2048)\n",
      "encoder_r_Wx (512, 1024)\n",
      "encoder_bx (1024,)\n",
      "encoder_b (2048,)\n",
      "encoder_r_bx (1024,)\n",
      "encoder_r_U (1024, 2048)\n",
      "encoder_r_b (2048,)\n",
      "encoder_r_W (512, 2048)\n",
      "encoder_Ux (1024, 1024)\n",
      "encoder_Wx (512, 1024)\n",
      "encoder_r_Ux (1024, 1024)\n"
     ]
    }
   ],
   "source": [
    "print ('Wemb:', model['Wemb'].shape)\n",
    "for matrix in model:\n",
    "    if matrix.startswith(\"encoder\"):\n",
    "        print(matrix, model[matrix].shape)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## Common\n",
    "\n",
    "* $\\overline{E}$ - `Wemb` - source word embeddings, shared by both directions, size ${K_x \\times m}$, where $K_x = 30000$ and $m = 512$\n",
    "* $m$: embedding size (e.g. 512)\n",
    "* $n$: internal state size (e.g. 1024)\n",
    "\n",
    "## Forward pass\n",
    "\n",
    "* $\\overrightarrow{W}_x$ - `encoder_Wx`, size $m \\times n$\n",
    "* $\\overrightarrow{U}_x$ - `encoder_Ux`, size $n \\times n$\n",
    "* $\\overrightarrow{b}_x$ - `encoder_bx`, size $n$\n",
    "* $\\overrightarrow{W}$ - `encoder_W`, size $m \\times 2n$\n",
    "* $\\overrightarrow{U}$ - `encoder_U`, size $n\\times 2n$\n",
    "* $\\overrightarrow{b}$ - `encoder_b`, size $2n$\n",
    "\n",
    "## Backward pass\n",
    "Analogous, with `_r_` inserted into the parameter names (e.g. `encoder_r_Wx`)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## Computation\n",
    "\n",
    "Differences compared with the Bahdanau model:\n",
    "* The bias is applied in a different place.\n",
    "* The reset $r_i$ and update $u_i$ vectors are computed together (they are concatenated).\n",
    "\n",
    "$$\n",
    "\\renewcommand{\\ora}[1]{\\overrightarrow{#1}}\n",
    "\\renewcommand{\\ola}[1]{\\overleftarrow{#1}}\n",
    "\\ora{h}_i = \\left\\{\n",
    "\\begin{array}{ll}\n",
    "    \\ora{u}_i \\circ \\ora{h}_{i-1} + (1- \\ora{u}_i) \\circ \\ora{\\underline{h}}_i & \\mathrm{, if ~ } i > 0 \\\\\n",
    "0 & \\mathrm{, if  ~} i = 0 \n",
    "\\end{array}\n",
    "\\right.\n",
    "$$\n",
    "\n",
    "where\n",
    "\n",
    "$$\n",
    "\\begin{eqnarray}\n",
    "\\left[\n",
    "\\begin{array}{c}\n",
    "\\ora{r}_i \\\\\n",
    "\\ora{u}_i\\end{array}\n",
    "\\right] &=& \\sigma\\left(\\ora{h}_{i-1}\\ora{U} + \\overline{E}_i\\ora{W} + \\ora{b} \\right)\\\\\n",
    "\\ora{\\underline{h}}_i &=& \\tanh\\left((\\ora{h}_{i-1}\\ora{U_x})  \\circ  \\ora{r}_i+ \\overline{E}_i\\ora{W_x} + \\ora{b_x} \\right)\\\\\n",
    "\\end{eqnarray}\n",
    "$$\n",
    "\n",
    "The backward pass is similar. The pass over the words is reversed, but the implementation stays the same.\n",
    "\n",
    "For every word, $\\ora{h}_i$ and $\\ola{h}_i$ are concatenated into $h_i$:\n",
    "\n",
    "$$\n",
    "h_i = \\left[\n",
    "\\begin{array}{c}\n",
    "\\ora{h}_i \\\\\n",
    "\\ola{h}_i\n",
    "\\end{array}\n",
    "\\right]\n",
    "$$\n",
    "\n",
    "and the $h_i$ for all $T_x$ source words form the context matrix $c$:\n",
    "\n",
    "$$\n",
    "c = \\left[ h_1, \\ldots, h_{T_x}\\right]\n",
    "$$"
   ]
  },
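  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "The forward-pass equations above can be sketched in NumPy as follows. This is a minimal illustration, not the actual implementation: `sigmoid`, `gru_step`, `encode_forward` and the tiny random parameters are hypothetical stand-ins for the `encoder_*` matrices, and the reset gate is assumed to occupy the first half of the stacked gate vector, as in the equation for $[r_i; u_i]$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def sigmoid(x):\n",
    "    return 1.0 / (1.0 + np.exp(-x))\n",
    "\n",
    "def gru_step(h_prev, x, W, U, b, Wx, Ux, bx):\n",
    "    # One GRU step following the encoder equations (row-vector convention).\n",
    "    n = h_prev.shape[-1]\n",
    "    gates = sigmoid(h_prev.dot(U) + x.dot(W) + b)  # stacked [r; u], size 2n\n",
    "    r, u = gates[:n], gates[n:]                    # assumption: reset gate comes first\n",
    "    h_cand = np.tanh(h_prev.dot(Ux) * r + x.dot(Wx) + bx)\n",
    "    return u * h_prev + (1.0 - u) * h_cand\n",
    "\n",
    "def encode_forward(embeddings, params):\n",
    "    # embeddings: (T_x, m); returns the forward states, shape (T_x, n).\n",
    "    h = np.zeros(params['Ux'].shape[0])  # the state before the first word is zero\n",
    "    states = []\n",
    "    for x in embeddings:\n",
    "        h = gru_step(h, x, **params)\n",
    "        states.append(h)\n",
    "    return np.array(states)\n",
    "\n",
    "# Toy sizes with random weights, for shape checking only.\n",
    "rng = np.random.RandomState(0)\n",
    "m, n, T_x = 4, 3, 5\n",
    "params = dict(W=rng.randn(m, 2 * n), U=rng.randn(n, 2 * n), b=np.zeros(2 * n),\n",
    "              Wx=rng.randn(m, n), Ux=rng.randn(n, n), bx=np.zeros(n))\n",
    "states = encode_forward(rng.randn(T_x, m), params)\n",
    "print(states.shape)"
   ]
  },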
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Decoder "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "decoder_U (1024, 2048)\n",
      "decoder_W (512, 2048)\n",
      "decoder_b (2048,)\n",
      "decoder_Wc (2048, 2048)\n",
      "decoder_b_att (2048,)\n",
      "decoder_bx_nl (1024,)\n",
      "decoder_Wcx (2048, 1024)\n",
      "decoder_Ux (1024, 1024)\n",
      "decoder_bx (1024,)\n",
      "decoder_Wc_att (2048, 2048)\n",
      "decoder_U_att (2048, 1)\n",
      "decoder_c_tt (1,)\n",
      "decoder_U_nl (1024, 2048)\n",
      "decoder_W_comb_att (1024, 2048)\n",
      "decoder_b_nl (2048,)\n",
      "decoder_Wx (512, 1024)\n",
      "decoder_Ux_nl (1024, 1024)\n"
     ]
    }
   ],
   "source": [
    "for matrix in model:\n",
    "    if matrix.startswith(\"decoder\"):\n",
    "        print(matrix, model[matrix].shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "# Decoder RNN"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "* $E_t$ - `Wemb_dec` - target language embeddings, size: ${K_y \\times m}$, where $K_y = 30000$ and $m = 512$\n",
    "* $m$: embedding size (e.g. 512)\n",
    "* $n$: internal state size (e.g. 1024)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## Initialising decoder RNN"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "In the Bahdanau model, a single encoder state (the last $h_i$) is used to compute the initial decoder state.\n",
    "\n",
    "This model instead uses the mean $h$ of all $h_i$:\n",
    "\n",
    "$$\n",
    "\\qquad s_0 = \\tanh\\left(hW_I + b_I\\right)\n",
    "$$\n",
    "\n",
    "* $W_I$ - `ff_state_W` - size ${2n \\times n}$\n",
    "* $b_I$ - `ff_state_b` - size ${n}$"
   ]
  },
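  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "As a rough sketch of the initialisation above, assuming the context matrix is stored with one row per source word: `init_decoder_state` is a hypothetical helper, and the random arrays merely stand in for `ff_state_W` and `ff_state_b`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def init_decoder_state(context, W_I, b_I):\n",
    "    # context: (T_x, 2n) matrix of concatenated encoder states h_i.\n",
    "    h_mean = context.mean(axis=0)          # mean over source positions, size 2n\n",
    "    return np.tanh(h_mean.dot(W_I) + b_I)  # s_0, size n\n",
    "\n",
    "# Toy sizes with random weights, for shape checking only.\n",
    "rng = np.random.RandomState(0)\n",
    "n, T_x = 3, 5\n",
    "s0 = init_decoder_state(rng.randn(T_x, 2 * n), rng.randn(2 * n, n), np.zeros(n))\n",
    "print(s0.shape)"
   ]
  },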
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## Computing a new RNN decoder state"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "Let us define $E_i$ as the embedding vector of the word $y_i$; in other words, $E_i = Ey_i$.\n",
    "\n",
    "\n",
    " * $E$ - `Wemb_dec` - target word embeddings, size: ${K_y \\times m}$, where $K_y = 30000$ and $m = 512$\n",
    "\n",
    "The computation of the next state is divided into two steps: computing an intermediate state, which is passed to the attention model, and then computing the final state from the attention model's output."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## Computing the hidden state (First GRU layer?)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "$$\n",
    "\\begin{eqnarray}\n",
    "\\left[\n",
    "\\begin{array}{c}\n",
    "r_i^h \\\\\n",
    "u_i^h\\end{array}\n",
    "\\right] &=& \\sigma \\left( s_{i-1}U + E_{i-1}W + b \\right) \\\\\n",
    "\\\\\n",
    "\\overline{s}_i &=& \\tanh \\left( (s_{i-1}U_x) \\circ r_i^h +  E_{i-1}W_x + b_x \\right) \\\\\n",
    "\\\\\n",
    "s_i &=& u_i^h \\circ s_{i-1} + (1- u_i^h) \\circ \\overline{s}_i \\\\\n",
    "\\end{eqnarray}\n",
    "$$\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "**Computing reset and update vectors:**\n",
    " * $ s_{i-1}$ - the previous decoder state, size: $n$\n",
    " * $ E_{i-1}$ - the embedding of the previous word $y_{i-1}$, size: $m$\n",
    " * $U$ - `decoder_U` - a matrix for a state, size: ${n \\times 2n}$\n",
    " * $W$ - `decoder_W` - a matrix for a word embedding, size: ${m \\times 2n}$\n",
    " * $b$ - `decoder_b` - a bias vector, size: ${2n}$\n",
    "\n",
    "**Computing a new hidden state:**\n",
    " * $U_x$ - `decoder_Ux` - a matrix for the previous decoder state, size: ${n \\times n}$\n",
    " * $W_x$ - `decoder_Wx` - a matrix for a word embedding, size: ${m \\times n}$\n",
    " * $b_x$ - `decoder_bx` - a bias vector, size: ${n}$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## Attention model (or Alignment model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "$$\n",
    "c_i = \\sum_{j=1}^{T_x} \\alpha_{ij}h_j\n",
    "$$\n",
    "\n",
    "(or, if $h$ is a state matrix for an entire batch, $c = Ah$, where $A = \\left[\\alpha_{ij}\\right]$)\n",
    "\n",
    "where\n",
    "\n",
    "$$ \n",
    "\\begin{eqnarray}\n",
    "\\alpha_{ij} &=& \\frac{\\exp(e_{ij})}{\\sum_{k=1}^{T_x}\\exp(e_{ik})}\n",
    "\\end{eqnarray}\n",
    "$$\n",
    "\n",
    "$$\n",
    "\\begin{eqnarray}\n",
    "e_{ij} &=& v_\\alpha^T \\tanh\\left(s_{i} W_{\\alpha} + b_{\\alpha} + h_jU_{\\alpha}\\right) + c_{\\alpha}\n",
    "\\end{eqnarray}\n",
    "$$\n",
    "\n",
    "In batch computation, the sum inside the $\\tanh$ requires broadcasting to 3D tensors so that the shapes match: $s_iW_{\\alpha}$ has one row per sentence, whereas $h_jU_{\\alpha}$ has one row per source position and sentence.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "* $W_{\\alpha}$ - `decoder_W_comb_att` - size: ${n \\times 2n}$,\n",
    "* $b_{\\alpha}$ - `decoder_b_att` - size: $2n$,\n",
    "* $U_{\\alpha}$ - `decoder_Wc_att` - size: ${2n \\times 2n}$,\n",
    "* $v_{\\alpha}$ - `decoder_U_att` - size: ${2n \\times 1}$,\n",
    "* $c_{\\alpha}$ - `decoder_c_tt` - a scalar bias added to every $e_{ij}$; being constant over $j$, it cancels out in the softmax."
   ]
  },
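  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "For a single sentence and a single decoder step, the attention equations can be sketched as below. This is only an illustration under assumptions: `attention` is a hypothetical helper, the random arrays stand in for the `decoder_*_att` parameters, and subtracting the maximum energy is a standard numerical-stability trick that is not part of the formulas above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def attention(s, context, W_a, b_a, U_a, v_a, c_a):\n",
    "    # s: (n,) intermediate decoder state; context: (T_x, 2n) encoder states.\n",
    "    hidden = np.tanh(s.dot(W_a) + b_a + context.dot(U_a))  # (T_x, 2n); s term broadcast over j\n",
    "    e = hidden.dot(v_a).ravel() + c_a                      # energies e_ij, shape (T_x,)\n",
    "    e = e - e.max()                                        # stability only; softmax is unchanged\n",
    "    alpha = np.exp(e) / np.exp(e).sum()                    # attention weights, sum to 1\n",
    "    return alpha.dot(context), alpha                       # context vector c_i (2n,), weights\n",
    "\n",
    "# Toy sizes with random weights, for shape checking only.\n",
    "rng = np.random.RandomState(0)\n",
    "n, T_x = 3, 5\n",
    "context = rng.randn(T_x, 2 * n)\n",
    "c_i, alpha = attention(rng.randn(n), context, rng.randn(n, 2 * n), np.zeros(2 * n),\n",
    "                       rng.randn(2 * n, 2 * n), rng.randn(2 * n, 1), np.zeros(1))\n",
    "print(c_i.shape)\n",
    "print(alpha.sum())"
   ]
  },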
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## Computing the final decoder state (Second GRU layer?)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "Note the different placement of the bias when computing the candidate state $\\tilde{z}_i$: here $b_x$ is added before the multiplication by the reset gate $r_i^f$. This is an oddity in the way Nematus implements GRUs.\n",
    "\n",
    "$$\n",
    "\\begin{eqnarray}\n",
    "\\left[\n",
    "\\begin{array}{c}\n",
    "r_i^f \\\\\n",
    "u_i^f\\end{array}\n",
    "\\right] &=& \\sigma \\left( s_iU + c_iW + b  \\right) \\\\\n",
    "\\\\\n",
    "\\tilde{z}_i &=& \\tanh\\left( (s_iU_x + b_x) \\circ r_i^f + c_iW_x \\right) \\\\\n",
    "\\\\\n",
    "z_i &=& u_i^f \\circ s_i + (1 - u_i^f) \\circ \\tilde{z}_i\n",
    "\\end{eqnarray}\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "**Computing the reset and update vectors:**\n",
    " * $U$ - `decoder_U_nl` - matrix for the state, size: ${n \\times 2n}$\n",
    " * $W$ - `decoder_Wc` - matrix for the context vector, size: ${2n \\times 2n}$\n",
    " * $b$ - `decoder_b_nl` - bias vector, size: ${2n}$\n",
    "\n",
    "**Computing the next state:**\n",
    " * $U_x$ - `decoder_Ux_nl` - matrix for the intermediate state, size: ${n \\times n}$\n",
    " * $W_x$ - `decoder_Wcx` - matrix for the context vector, size: ${2n \\times n}$\n",
    " * $b_x$ - `decoder_bx_nl` - bias vector, size: ${n}$"
   ]
  },
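  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "The second step can be sketched the same way. The only intended difference from a standard GRU step is that the bias `bx` is added before the reset gate is applied. `sigmoid` and `gru_step_nl` are hypothetical names, the random arrays stand in for the `decoder_*_nl`, `decoder_Wc` and `decoder_Wcx` parameters, and the reset gate is assumed to be the first half of the stacked gate vector."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def sigmoid(x):\n",
    "    return 1.0 / (1.0 + np.exp(-x))\n",
    "\n",
    "def gru_step_nl(s, c, W, U, b, Wx, Ux, bx):\n",
    "    # s: (n,) intermediate state; c: (2n,) context vector from the attention model.\n",
    "    n = s.shape[-1]\n",
    "    gates = sigmoid(s.dot(U) + c.dot(W) + b)             # stacked [r; u], size 2n\n",
    "    r, u = gates[:n], gates[n:]                          # assumption: reset gate comes first\n",
    "    z_cand = np.tanh((s.dot(Ux) + bx) * r + c.dot(Wx))   # bias inside the gated term\n",
    "    return u * s + (1.0 - u) * z_cand                    # final decoder state z_i\n",
    "\n",
    "# Toy sizes with random weights, for shape checking only.\n",
    "rng = np.random.RandomState(0)\n",
    "n = 3\n",
    "z = gru_step_nl(rng.randn(n), rng.randn(2 * n),\n",
    "                W=rng.randn(2 * n, 2 * n), U=rng.randn(n, 2 * n), b=np.zeros(2 * n),\n",
    "                Wx=rng.randn(2 * n, n), Ux=rng.randn(n, n), bx=np.zeros(n))\n",
    "print(z.shape)"
   ]
  },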
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## ReadOut or Deep Output or just getting probabilities over words"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "$$\n",
    "\\begin{eqnarray}\n",
    " t_i &=&\\tanh \\left( \\left( z_iW_1 + b_1 \\right)  + \\left( E_{i-1} W_2 + b_2 \\right) + \\left( c_iW_3 + b_3 \\right) \\right) \\\\\n",
    "\\\\\n",
    "p(y_i|z_{i},y_{i-1},c_i) &=& \\textrm{softmax} \\left(  t_iW_4 + b_4 \\right) \\\\\n",
    "\\end{eqnarray}\n",
    "$$\n",
    "\n",
    "* $W_1$ - `ff_logit_lstm_W` - size: ${n} \\times {m}$ \n",
    "* $b_1$ - `ff_logit_lstm_b` - size: ${m} $\n",
    "* $W_2$ - `ff_logit_prev_W` - size: ${m} \\times {m}$ \n",
    "* $b_2$ - `ff_logit_prev_b` - size: ${m} $\n",
    "* $W_3$ - `ff_logit_ctx_W` - size: ${2n} \\times {m}$ \n",
    "* $b_3$ - `ff_logit_ctx_b` - size: ${m} $\n",
    "* $W_4$ - `ff_logit_W` - size: ${m} \\times K_y$ \n",
    "* $b_4$ - `ff_logit_b` - size: $K_y $"
   ]
  },
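  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "A minimal sketch of the output layer for one decoder step, under the same assumptions as before: `readout` is a hypothetical helper, the random arrays stand in for the `ff_logit_*` parameters, and the maximum logit is subtracted purely for numerical stability."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def readout(z, E_prev, c, W1, b1, W2, b2, W3, b3, W4, b4):\n",
    "    # z: (n,) decoder state; E_prev: (m,) previous word embedding; c: (2n,) context vector.\n",
    "    t = np.tanh(z.dot(W1) + b1 + E_prev.dot(W2) + b2 + c.dot(W3) + b3)  # size m\n",
    "    logits = t.dot(W4) + b4                                             # size K_y\n",
    "    logits = logits - logits.max()                                      # stability only\n",
    "    p = np.exp(logits)\n",
    "    return p / p.sum()                                                  # softmax over target words\n",
    "\n",
    "# Toy sizes with random weights, for shape checking only.\n",
    "rng = np.random.RandomState(0)\n",
    "m, n, K_y = 4, 3, 7\n",
    "probs = readout(rng.randn(n), rng.randn(m), rng.randn(2 * n),\n",
    "                rng.randn(n, m), np.zeros(m), rng.randn(m, m), np.zeros(m),\n",
    "                rng.randn(2 * n, m), np.zeros(m), rng.randn(m, K_y), np.zeros(K_y))\n",
    "print(probs.shape)\n",
    "print(probs.sum())"
   ]
  },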
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}