mosesdecoder/contrib/sigtest-filter/README.txt

Re-implementation of Johnson et al. (2007)'s phrasetable filtering strategy.

This implementation relies on Joy Zhang's SALM Suffix Array toolkit. It is
available here:

  http://projectile.sv.cmu.edu/research/public/tools/salm/salm.htm
  
--Chris Dyer <redpony@umd.edu>

BUILD INSTRUCTIONS
---------------------------------

1. Download and build SALM.

2. make SALMDIR=/path/to/SALM


USAGE INSTRUCTIONS
---------------------------------

1. Using SALM/Bin/Linux/Index/IndexSA.O32, create a suffix array index
   for each of the source and target sides of your training bitext.

2. cat phrase-table.txt | ./filter-pt -e TARG.suffix -f SOURCE.suffix \
    -l <FILTER-VALUE>

   FILTER-VALUE is the -log prob threshold described in Johnson et al.
     (2007).  It may be 'a+e', 'a-e', or a positive real value.  'a+e'
     is a good setting: it filters out <1,1,1> phrase pairs (source,
     target, and joint counts all equal to 1).  I also recommend using
     -n 30, which filters out all but the top 30 phrase pairs, sorted
     by P(e|f).  This was used in the paper.

3. Run with no options to see more use-cases.
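
The threshold in step 2 can be illustrated with a small Python sketch of
the significance test from Johnson et al. (2007). This is not the code
used by filter-pt. It assumes the one-sided Fisher exact test over
sentence-level counts, natural logarithms, and an arbitrary epsilon of
0.01. The function names are hypothetical. Under these assumptions a
<1,1,1> pair has p-value 1/N, so alpha = log(N), and 'a+e' means a
threshold just above alpha.

```python
import math


def log_choose(a, b):
    """log of the binomial coefficient C(a, b), via log-gamma."""
    return math.lgamma(a + 1) - math.lgamma(b + 1) - math.lgamma(a - b + 1)


def log_hypergeom(k, c_s, c_t, n):
    """log P(joint count = k) if source and target occurred independently."""
    return log_choose(c_s, k) + log_choose(n - c_s, c_t - k) - log_choose(n, c_t)


def neg_log_pvalue(c_st, c_s, c_t, n):
    """-log of the one-sided Fisher exact p-value, P(joint count >= c_st).

    c_st: sentence pairs containing both phrases
    c_s:  sentences containing the source phrase
    c_t:  sentences containing the target phrase
    n:    sentence pairs in the bitext
    """
    logs = [log_hypergeom(k, c_s, c_t, n)
            for k in range(c_st, min(c_s, c_t) + 1)]
    top = max(logs)  # log-sum-exp keeps tiny probabilities from underflowing
    return -(top + math.log(sum(math.exp(x - top) for x in logs)))


def keep_pair(c_st, c_s, c_t, n, threshold):
    """Keep a phrase pair only if its -log p-value exceeds the threshold."""
    return neg_log_pvalue(c_st, c_s, c_t, n) > threshold


if __name__ == "__main__":
    n = 1000                 # hypothetical bitext size
    alpha = math.log(n)      # -log p-value of a <1,1,1> pair
    epsilon = 0.01           # assumed value, for illustration only
    print(keep_pair(1, 1, 1, n, alpha + epsilon))  # 'a+e': pair is dropped
    print(keep_pair(1, 1, 1, n, alpha - epsilon))  # 'a-e': pair is kept
```

With 'a-e' the <1,1,1> pairs sit just above the threshold and survive.
With 'a+e' they fall just below it and are removed.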


REFERENCES
---------------------------------

H. Johnson, J. Martin, G. Foster and R. Kuhn. (2007) Improving Translation
  Quality by Discarding Most of the Phrasetable. In Proceedings of the 2007
  Joint Conference on Empirical Methods in Natural Language Processing and
  Computational Natural Language Learning (EMNLP-CoNLL), pp. 967-975.