merge with master

Marcin Junczys-Dowmunt 2014-07-15 15:28:52 +02:00
commit b5391b52d2
208 changed files with 9746 additions and 7085 deletions

View File

@ -1,158 +1,3 @@
PRELIMINARIES
Please see the Moses website for instructions on how to compile and run Moses:
http://www.statmt.org/moses/?n=Development.GetStarted
Moses is primarily targeted at gcc on UNIX.
Moses requires gcc, Boost >= 1.36, and zlib, including the headers that some
distributions package separately (i.e. -dev or -devel packages). Boost source is
available at http://boost.org .
There are several optional dependencies:
GIZA++ from http://code.google.com/p/giza-pp/ is used to align words in the parallel corpus during training.
Moses server requires xmlrpc-c with abyss-server. Source is available from
http://xmlrpc-c.sourceforge.net/.
The scripts support building ARPA format language models with SRILM or IRSTLM.
To apply models inside the decoder, you can use SRILM, IRSTLM, or KenLM. The
ARPA format is exchangeable, so that e.g. you can build a model with SRILM and
run the decoder with IRSTLM or KenLM.
If you want to use SRILM, you will need to download its source and build it.
SRILM can be downloaded from
http://www.speech.sri.com/projects/srilm/download.html .
On x86_64, the default machine type is broken. Edit sbin/machine-type, find
this code
else if (`uname -m` == x86_64) then
set MACHINE_TYPE = i686
and change it to
else if (`uname -m` == x86_64) then
set MACHINE_TYPE = i686-m64
You may have to chmod +w sbin/machine-type first.
If you want to use IRSTLM, you will need to download its source and build it.
IRSTLM can be downloaded from either the SourceForge website
http://sourceforge.net/projects/irstlm
or the official IRSTLM website
http://hlt.fbk.eu/en/irstlm
KenLM is included with Moses.
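For instance, once Moses is built, KenLM's bundled lmplz and build_binary tools
can produce an ARPA model and a binarized version of it. A minimal sketch, where
corpus.en is a placeholder for your tokenized training text:
lmplz -o 5 < corpus.en > lm.arpa
build_binary lm.arpa lm.binlm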
--------------------------------------------------------------------------
ADVICE ON INSTALLING EXTERNAL LIBRARIES
Generally, if you have trouble installing external libraries, you should get
support directly from the library's maintainers:
Boost: http://www.boost.org/doc/libs/release/more/getting_started/unix-variants.html
IRSTLM: https://list.fbk.eu/sympa/subscribe/user-irstlm
SRILM: http://www.speech.sri.com/projects/srilm/#srilm-user
However, here's some general advice on installing software (for bash users):
#Determine where you want to install packages
PREFIX=$HOME/usr
#If your system has lib64 directories, lib64 should be used AND NOT lib
if [ -d /lib64 ]; then
LIBDIR=$PREFIX/lib64
else
LIBDIR=$PREFIX/lib
fi
#If you're installing to a non-standard path, tell programs where to find things:
export PATH=$PREFIX/bin${PATH:+:$PATH}
export LD_LIBRARY_PATH=$LIBDIR${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
export LIBRARY_PATH=$LIBDIR${LIBRARY_PATH:+:$LIBRARY_PATH}
export CPATH=$PREFIX/include${CPATH:+:$CPATH}
Add all the above code to your .bashrc or .bash_login as appropriate. Then
you're ready to install packages in non-standard paths:
#For autotools packages e.g. xmlrpc-c and zlib
./configure --prefix=$PREFIX --libdir=$LIBDIR [other options here]
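#then build and install with the usual autotools steps:
make -j4
make install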
#tcmalloc is a malloc implementation optimised for threaded performance. To see how it
#improves Moses performance, read
# http://www.mail-archive.com/moses-support@mit.edu/msg07303.html
#It is part of gperftools, which can be downloaded from
# https://code.google.com/p/gperftools/downloads/list
#configure with this:
./configure --prefix=$PREFIX --libdir=$LIBDIR --enable-shared --enable-static --enable-minimal
#For bzip2:
wget http://www.bzip.org/1.0.6/bzip2-1.0.6.tar.gz
tar xzvf bzip2-1.0.6.tar.gz
cd bzip2-1.0.6/
#Compile and install libbz2.a (static library)
make
make install PREFIX=$PREFIX
mkdir -p $LIBDIR
#Note this may be the same file; you can ignore the error
mv $PREFIX/lib/libbz2.a $LIBDIR 2>/dev/null
#Compile and install libbz2.so (dynamic library)
make clean
make -f Makefile-libbz2_so
cp libbz2.so.* $LIBDIR
ln -sf libbz2.so.1.0 $LIBDIR/libbz2.so
#For Boost:
./bootstrap.sh
./b2 --prefix=$PWD --libdir=$PWD/lib64 --layout=tagged link=static,shared threading=multi,single install || echo FAILURE
This will put the header files and library files in the current directory, rather than in the system directories.
For most Linux systems, you should replace
link=static,shared
with
link=static
so that only static libraries are created. This minimises headaches when linking with Moses.
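That is, the static-only invocation becomes:
./b2 --prefix=$PWD --libdir=$PWD/lib64 --layout=tagged link=static threading=multi,single install || echo FAILURE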
To link Moses to your version of boost,
./bjam --with-boost=[boost/path]
Alternatively, you can run
./b2 --prefix=/usr/ --libdir=/usr/lib
to install Boost in the system directories. However, this may override the Boost version built into your OS and cause problems, so it is not recommended.
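To build Moses against the Boost you built in place, point bjam at the checkout
directory (the path below is a placeholder for wherever you unpacked Boost):
./bjam --with-boost=$HOME/boost_1_55_0 -j8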
--------------------------------------------------------------------------
BUILDING
Building consists of running
./bjam [options]
Common options are:
--with-srilm=/path/to/srilm to compile the decoder with SRILM support
--with-irstlm=/path/to/irstlm to compile the decoder with IRSTLM support
-jN where N is the number of CPUs
--with-macports=/path/to/macports use MacPorts on Mac OS X.
If you leave out /path/to/macports, bjam will use /opt/local as the default.
You don't have to use --with-boost with --with-macports, as it is set implicitly.
Also note that using --with-macports automatically triggers "using darwin".
Binaries will appear in dist/bin.
You can clean up data from previous builds using
./bjam --clean
For further documentation, run
./bjam --help
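For example, a build with both optional LM toolkits and 8 parallel jobs might look
like this (the SRILM and IRSTLM paths are placeholders for wherever you built them):
./bjam --with-srilm=$HOME/srilm --with-irstlm=$HOME/irstlm -j8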
--------------------------------------------------------------------------
ALTERNATIVE WAYS TO BUILD ON UNIX AND OTHER PLATFORMS
Microsoft Windows
-----------------
Moses is primarily targeted at gcc on UNIX. Windows users should
install using Cygwin. Outdated instructions can be found here:
http://ssli.ee.washington.edu/people/amittai/Moses-on-Win7.pdf .
Binaries for all external libraries needed can be downloaded from
http://www.statmt.org/moses/?n=Moses.LibrariesUsed
Only the decoder is developed and tested under Windows. There are
difficulties using the training scripts under Windows, even with
Cygwin, but it can be done.

View File

@ -1,2 +1,12 @@
exe merge-sorted : util/merge-sorted.cc $(TOP)/moses/generic//generic $(TOP)//boost_iostreams ;
external-lib bzip2 ;
external-lib zlib ;
exe merge-sorted :
util/merge-sorted.cc
$(TOP)/moses/TranslationModel/UG/mm//mm
$(TOP)/moses/TranslationModel/UG/generic//generic
$(TOP)//boost_iostreams
$(TOP)//boost_program_options
;
install $(PREFIX)/bin : merge-sorted ;

2
contrib/m4m/Makefile Normal file
View File

@ -0,0 +1,2 @@
merge-sorted:
g++ -O3 -I ../.. util/merge-sorted.cc ../../moses/TranslationModel/UG/generic/file_io/ug_stream.cpp -o $@ -lboost_iostreams -lboost_program_options -lbz2 -lz

View File

@ -64,6 +64,6 @@ LMODELS :=
LMODEL_ENTRIES :=
endef
clear-locks:
@rm -rf `find -L -type d -name '*.lock'`
clear-locks: | $(shell find -L -type d -name '*.lock')
rm -rf $|

View File

@ -2,7 +2,7 @@
moses.threads ?= 4
moses.flags += -threads ${moses.threads}
moses.flags += -v 0 -t -text-type "test"
moses.flags += -v 0 -t -text-type "test" -fd '${FACTORSEP}'
%.multi-bleu: | %.cleaned
$(lock)
@ -35,19 +35,16 @@ moses.ini ?=
# $1: output base name
# $2: system to be evaluated
# $3: evaluation input
# $4: evaluation input type
# $5: evaluation reference
define bleu_eval
EVALUATIONS += $1
.INTERMEDIATE: $3 $5
$1: moses.ini := $2
$1: moses.input := $3
$1: moses.inputtype := $4
$1: bleu.ref := $5
$1: moses.inputtype := $(call guess-inputtype,$3)
$1: bleu.ref := $$(shell echo $(patsubst %.${L1},%.${L2},$3) | perl -pe 's?/cfn[^/]+/?/cased/?')
$1.moses-out: | $2 $3
$1.multi-bleu: | $5
$1.multi-bleu: | $(call reffiles,$3,$(dir $(patsubst %/,%,$(dir $3))))
$1: | $1.multi-bleu
endef
@ -63,7 +60,8 @@ $(foreach system,${SYSTEMS},\
$(foreach tuneset,${tune.sets},\
$(foreach evalset,${eval.sets},\
$(foreach run,$(shell seq ${tune.runs}),\
$(eval $(call bleu_eval,${system}/eval/$(notdir ${tuneset})/${run}/$(notdir ${evalset}),\
$(eval $(call bleu_eval,\
${system}/eval/$(notdir ${tuneset})/${run}/$(notdir ${evalset}),\
${system}/tuned/$(notdir ${tuneset})/${run}/moses.ini,\
${evalset}.${L1},${moses.inputtype.plaintext},${evalset}.${L2}))))))

View File

@ -18,6 +18,7 @@ TUNED_SYSTEMS :=
DTABLES :=
PTABLES :=
LMODELS :=
INPUT_FEATURES ?=
export MY_EXPERIMENT :=

View File

@ -3,7 +3,7 @@
# default parameters
kenlm.order ?= 5
kenlm.memory ?= 30%
kenlm.memory ?= 10%
kenlm.type ?= probing
kenlm.lazy ?= 1
kenlm.factor ?= 0

View File

@ -5,10 +5,10 @@
m4mdir := $(patsubst %modules/,%,\
$(dir $(word $(words $(MAKEFILE_LIST)),\
$(MAKEFILE_LIST))))
# $(info M4MDIR is ${m4mdir})
# m4m modules to be included
M4M_MODULES := aux init
#M4M_MODULES += directory-structure
M4M_MODULES += tools moses-parameters prepare-corpus
M4M_MODULES += mgiza fastalign mmbitext phrase-table moses-ini
M4M_MODULES += tune-moses eval-system kenlm

View File

@ -16,7 +16,6 @@ mkcls_args = -n10 -c50
%/mgiza.cfg: m4=3
%/mgiza.cfg: nodumps=0
%/mgiza.cfg: onlyaldumps=0
%/mgiza.cfg: model1dumpfrequency=0
%/mgiza.cfg: model4smoothfactor=0.4
%/mgiza.cfg: nsmooth=4
%/mgiza.cfg: NCPUS=8
@ -62,7 +61,7 @@ $(gizout)/${L1}-${L2}.symal.gz: | $(gizout)/${L1}-${L2}.A3.final.gz
$(gizout)/${L1}-${L2}.symal.gz: | $(gizout)/${L2}-${L1}.A3.final.gz $(giza2bal.pl)
$(lock)
$(giza2bal.pl) -d 'gunzip -c ${A3fwd}' -i 'gunzip -c ${A3bwd}' \
| $(symal) $(symal_args) | gzip > $@_ && mv $@_ $@
| $(symal) $(symal_args) | perl -pe 's/^.*{##}\s+//' | gzip > $@_ && mv $@_ $@
$(unlock)
# merge alignments produced by mgiza
@ -93,8 +92,8 @@ mkcls_cmd = $(call stream,$(1),$(2)) | ${mkcls} $(mkcls_args) -p/dev/stdin -V$(3
# mkcls: -p: input data
# mkcls: -V: word classes (output)
$(giztmp)/%.vcb.classes: | $(gizaln.in)
$(giztmp)/%.vcb.classes: ${mkcls}
$(giztmp)/%.vcb.classes: | $(gizaln.in)
@echo CREATING $@
$(lock)
mkdir -p $(@D)
@ -107,9 +106,12 @@ $(giztmp)/${L1}.vcb: | $(giztmp)/${L1}-${L2}.snt
$(giztmp)/${L1}.vcb: | $(giztmp)/${L1}-${L2}.snt
$(giztmp)/${L2}.vcb: | $(giztmp)/${L1}-${L2}.snt
$(giztmp)/${L2}-${L1}.snt: | $(giztmp)/${L1}-${L2}.snt
$(giztmp)/${L1}-${L2}.snt: L1files = $(addprefix $(pll-clean), .${L1}.gz)
$(giztmp)/${L1}-${L2}.snt: L2files = $(addprefix $(pll-clean), .${L2}.gz)
$(giztmp)/${L1}-${L2}.snt: | $(giztmp) $(addprefix $(pll-clean), .${L1}.gz .${L2}.gz)
#$(info $(addprefix $(pll-clean), .${L1}.gz))
$(giztmp)/${L1}-${L2}.snt: L1files = $(addsuffix .${L1}.gz, $(pll-clean))
$(giztmp)/${L1}-${L2}.snt: L2files = $(addsuffix .${L2}.gz, $(pll-clean))
$(giztmp)/${L1}-${L2}.snt: | $(giztmp)
$(giztmp)/${L1}-${L2}.snt: | $(addsuffix .${L1}.gz, $(pll-clean))
$(giztmp)/${L1}-${L2}.snt: | $(addsuffix .${L2}.gz, $(pll-clean))
$(lock)
$(plain2snt) \
<(ls $(L1files) | xargs zcat -f) \
@ -167,8 +169,12 @@ $(giztmp)/${L2}-${L1}/mgiza.cfg: | \
@echo "mh ${mh}" >> $@
@echo "m3 ${m3}" >> $@
@echo "m4 ${m4}" >> $@
@echo "t1 ${m1}" >> $@
@echo "t2 ${m2}" >> $@
@echo "th ${mh}" >> $@
@echo "t3 ${m3}" >> $@
@echo "t4 ${m4}" >> $@
@echo "o ${ODIR}" >> $@
@echo "model1dumpfrequency ${model1dumpfrequency}" >> $@
@echo "model4smoothfactor ${model4smoothfactor}" >> $@
@echo "onlyaldumps ${onlyaldumps}" >> $@
@echo "nodumps ${nodumps}" >> $@

View File

@ -5,12 +5,12 @@
define mmap_ttrack
.INTERMEDIATE += $(strip $1).txt.gz
$2/$(notdir $1).mct: | $2/$(notdir $1).sfa
$2/$(notdir $1).mct: | $2/$(notdir $1).tdx
$2/$(notdir $1).tdx: | $2/$(notdir $1).sfa
$2/$(notdir $1).sfa: | $(strip $1).txt.gz
$$(lock)
zcat -f $$< | ${MOSES_BIN}/mtt-build -i -o $$@.lock/$$(basename $${@F})
zcat -f $(strip $1).txt.gz \
| ${MOSES_BIN}/mtt-build -i -o $$@.lock/$$(basename $${@F})
mv $$@.lock/$$(basename $${@F}).tdx $${@D}
mv $$@.lock/$$(basename $${@F}).sfa $${@D}
mv $$@.lock/$$(basename $${@F}).mct $${@D}
@ -24,12 +24,21 @@ define mmap_bitext
$(call mmap_ttrack,$1${L1},$2)
$(call mmap_ttrack,$1${L2},$2)
$2/$(notdir $1)${L1}-${L2}.mam: SYMAL = $(strip $1)${L1}-${L2}.symal.gz
$2/$(notdir $1)${L1}-${L2}.mam: | $(strip $1)${L1}-${L2}.symal.gz
$$(lock)
zcat -f $$< | ${MOSES_BIN}/symal2mam $$@_ && mv $$@_ $$@
zcat -f $${SYMAL} | ${MOSES_BIN}/symal2mam $$@_ && mv $$@_ $$@
$$(unlock)
.INTERMEDIATE += $(strip $1)${L1}-${L2}.symal.gz
$2/$(notdir $1)${L1}-${L2}.lex: | $2/$(notdir $1)${L1}.mct
$2/$(notdir $1)${L1}-${L2}.lex: | $2/$(notdir $1)${L2}.mct
$2/$(notdir $1)${L1}-${L2}.lex: | $2/$(notdir $1)${L1}-${L2}.mam
$$(lock)
${MOSES_BIN}/mmlex-build $2/$(notdir $1) ${L1} ${L2} \
-o $$@.lock/$${@F} -c $$@.lock/$$(basename $${@F}).coc
mv $$@.lock/$${@F} $${@D}
mv $$@.lock/$$(basename $${@F}).coc $${@D}
$$(unlock)
endef

View File

@ -13,7 +13,6 @@ moses.ini_ttable-limit = 20
moses.ini_distortion-limit = 6
moses.ini_v = 0
weight_vector = perl -ne \
'm/name=([^; ]+)/;\
print "$$1=";\
@ -24,35 +23,37 @@ define create_moses_ini
$(strip $1)/moses.ini.0: ${PTABLES} ${DTABLES} ${LMODELS} ${MOSES_INI_PREREQ}
$$(lock)
echo '[input-factors]' > $$@_
echo '[input-factors]' > $$@_
echo '$${moses.ini_input-factors}' >> $$@_
echo >> $$@_
echo '[search-algorithm]' >> $$@_
echo >> $$@_
echo '[search-algorithm]' >> $$@_
echo '$${moses.ini_search-algorithm}' >> $$@_
echo >> $$@_
echo '[stack]' >> $$@_
echo >> $$@_
echo '[stack]' >> $$@_
echo '$${moses.ini_stack}' >> $$@_
echo >> $$@_
echo '[cube-pruning-pop-limit]' >> $$@_
echo >> $$@_
echo '[cube-pruning-pop-limit]' >> $$@_
echo '$${moses.ini_cube-pruning-pop-limit}' >> $$@_
echo >> $$@_
echo '[mapping]' >> $$@_
echo >> $$@_
echo '[mapping]' >> $$@_
echo '$${moses.ini_mapping}' >> $$@_
echo >> $$@_
echo '[distortion-limit]' >> $$@_
echo >> $$@_
echo '[distortion-limit]' >> $$@_
echo '$${moses.ini_distortion-limit}' >> $$@_
echo >> $$@_
echo '[v]' >> $$@_
echo >> $$@_
echo '[v]' >> $$@_
echo '$${moses.ini_v}' >> $$@_
echo >> $$@_
echo '[feature]' >> $$@_
$$(foreach f,${STANDARD_FEATURES},echo $$f >> $$@_;)
$$(foreach pt,${PTABLE_ENTRIES},echo "$$(subst ;, ,$${pt})" >> $$@_;)
$$(foreach dt,${DTABLE_ENTRIES},echo "$$(subst ;, ,$${dt})" >> $$@_;)
$$(foreach lm,${LMODEL_ENTRIES},echo "$$(subst ;, ,$${lm})" >> $$@_;)
echo >> $$@_
echo '[weight]' >> $$@_
$$(foreach x,$(STANDARD_FEATURES),echo "$$x0= 1.0" >> $$@_;)
echo >> $$@_
echo '[feature]' >> $$@_
$$(foreach f,${STANDARD_FEATURES},echo $$f >> $$@_;)
$$(foreach i,${INPUT_FEATURES},echo "$$(subst ;, ,$${i})" >> $$@_;)
$$(foreach pt,${PTABLE_ENTRIES},echo "$$(subst ;, ,$${pt})" >> $$@_;)
$$(foreach dt,${DTABLE_ENTRIES},echo "$$(subst ;, ,$${dt})" >> $$@_;)
$$(foreach lm,${LMODEL_ENTRIES},echo "$$(subst ;, ,$${lm})" >> $$@_;)
echo >> $$@_
echo '[weight]' >> $$@_
$$(foreach x,$(STANDARD_FEATURES),echo "$$x0= 1.0" >> $$@_;)
$$(foreach i,${INPUT_FEATURES},echo '$$i' | $${weight_vector} >> $$@_;)
$$(foreach x,${PTABLE_ENTRIES},echo '$$x' | $${weight_vector} >> $$@_;)
$$(foreach x,${DTABLE_ENTRIES},echo '$$x' | $${weight_vector} >> $$@_;)
$$(foreach x,${LMODEL_ENTRIES},echo '$$x' | $${weight_vector} >> $$@_;)

View File

@ -1,7 +1,7 @@
# -*- makefile -*-
casing1 = truecase
casing2 = truecase
word-alignment = fast
casing1 ?= truecase
casing2 ?= truecase
word-alignment ?= fast
moses.threads = $(shell parallel --number-of-cores)
# numerical constants for moses
@ -17,12 +17,12 @@ lmodel = model/lm/${L2}/kenlm
lexdm_specs = wbe-mslr-bidirectional-fe-allff
lexdm = model/dm/bin/${L1}-${L2}/${dflt_lexdmodel_specs}
ptable.max-phrase-length = 7
ptable.smoothing = --GoodTuring
ptable.source-factors = 0
ptable.target-factors = 0
ptable.num-features = 5
ptable.implemetation = 1
ptable.max-phrase-length ?= 7
ptable.smoothing ?= --GoodTuring
ptable.source-factors ?= 0
ptable.target-factors ?= 0
ptable.num-features ?= 5
ptable.implemetation ?= 1
# reminder: implementation types:
# 0 - text
# 1 - binary via processPhraseTable
@ -50,6 +50,8 @@ dmodel.description = $(addprefix ${dmodel.type}-${dmodel.orientation}-,\
distortion-limit = 6
# DEFAULT TUNING PARAMETERS
mert.nbest = 100
mert.extra-flags = --no-filter-phrase-table
mert.decoder-flags = -threads ${moses.threads}
FACTORSEP ?= \n
mert.nbest = 100
mert.extra-flags ?=
mert.extra-flags += --no-filter-phrase-table
mert.decoder-flags = -threads ${moses.threads} -fd '${FACTORSEP}'

View File

@ -23,37 +23,15 @@ ${moses.extract-phrases} ${moses.extract} $(1:.aln.gz=) ${L1} ${L2} \
endef
#################################################################################
# create_phrase_table: add rules to create a standard phrase table
# ADD RULES TO CREATE A STANDARD PHRASE TABLE FROM
# $(pll.txt1),$(pll.txt2),$(pll.aln) that are specified as target-specific
# variables like this:
# $1.txt.gz: pll.txt1 = ...
# $1.txt.gz: pll.txt2 = ...
# $1.txt.gz: pll.aln = ...
# This function is normally called indirectly via $(eval $(call add_bin_pt,...))
#
# Note: this section should be improved:
# - split into shards
# - create bash file with jobs
# - run batch file in parallel
#--------------------------------------------------------------------------------
define create_phrase_table
# $1: stem of phrase extractions
# $2: L1 text
# $3: L2 text
# $4: symal file
# normally, $2 ... $4 are default values ${pll.txt1} ${pll.txt2} ${pll.aln}
define extract_phrases
SHARDS = $$(foreach x, $${L1} $${L2} aln, $1.shards/$$x-DONE)
.SECONDARY: $1.txt.gz
.SECONDARY: $1.${L2}-given-${L1}.lex.gz
.SECONDARY: $1.${L1}-given-${L2}.lex.gz
.INTERMEDIATE: $1.txt.gz
.INTERMEDIATE: $$(SHARDS)
.INTERMEDIATE: $1.tmp/fwd.scored.gz
.INTERMEDIATE: $1.tmp/bwd/scoring.done
.INTERMEDIATE: $1.${L2}-given-${L1}.lex.gz
.INTERMEDIATE: $1.${L1}-given-${L2}.lex.gz
.INTERMEDIATE: $1.shards/${L1}-DONE
.INTERMEDIATE: $1.shards/${L2}-DONE
.INTERMEDIATE: $1.shards/aln-DONE
.INTERMEDIATE: $1.shards/extract.batch
.INTERMEDIATE: $1.shards/extract.done
$1.shards/${L1}-DONE: $(if $2,$2,$$(pll.txt1))
$$(lock)
@ -87,10 +65,32 @@ $1.shards/extract.batch: $$(SHARDS)
$1.shards/extract.done: $1.shards/extract.batch
$$(lock)
${parallel} -j$(shell echo $$((${NUMCORES}/1))) < $1.shards/extract.batch
${parallel} -j$(shell echo $$((${NUMCORES}/4))) < $1.shards/extract.batch
touch $$@
$$(unlock)
endef
#################################################################################
# create_phrase_table: add rules to create a standard phrase table
# ADD RULES TO CREATE A STANDARD PHRASE TABLE FROM
# $(pll.txt1),$(pll.txt2),$(pll.aln) that are specified as target-specific
# variables like this:
# $1.txt.gz: pll.txt1 = ...
# $1.txt.gz: pll.txt2 = ...
# $1.txt.gz: pll.aln = ...
# This function is normally called indirectly via $(eval $(call add_bin_pt,...))
#
# Note: this section should be improved:
# - split into shards
# - create bash file with jobs
# - run batch file in parallel
#--------------------------------------------------------------------------------
define create_phrase_table
$(call extract_phrases,$1,$${pll.txt1},$${pll.txt2},$${pll.aln})
ptable: $1.txt.gz
$1.txt.gz: | ${merge-sorted}
$1.txt.gz: | ${MOSES_BIN}/consolidate
@ -99,10 +99,10 @@ $1.txt.gz: | $1.tmp/bwd/scoring.done
$$(lock)
${MOSES_BIN}/consolidate \
<(zcat -f $1.tmp/fwd.scored.gz) \
<(${merge-sorted} $1.tmp/bwd/scored.*.gz) \
/dev/stdout \
<(${merge-sorted} $1.tmp/bwd/scored.*.gz) /dev/stdout \
$(if $(ptable.smoothing), $(ptable.smoothing) $1.tmp/fwd.coc) \
| gzip > $$@_ && mv $$@_ $$@
| gzip > $$@_
mv $$@_ $$@
$$(unlock)
$1.tmp/fwd.scored.gz: | $(merge-sorted)
@ -119,7 +119,7 @@ $1.tmp/bwd/scoring.done: | $1.${L2}-given-${L1}.lex.gz
$$(lock)
$(merge-sorted) $1.shards/*.bwd.gz \
| ${moses.score-phrases} ${MOSES_BIN}/score - $1.${L2}-given-${L1}.lex.gz \
$${@D}/scored "$(ptable.smoothing)" --Inverse && touch $$@
$${@D}/scored $(ptable.smoothing) --Inverse && touch $$@
$$(unlock)
# reminder: $2,$3,$4 = L1text, L2text, alignment
@ -206,7 +206,6 @@ endef
define create_lexical_reordering_table
mystem := $(strip $1).$(strip $2)
.INTERMEDIATE: $${mystem}.gz
$${mystem}.gz: dmshards = $$(shell ls $3.shards/*.dst.gz 2>/dev/null)
$${mystem}.gz: dm.type=$(word 1,$(subst -, ,$2))
$${mystem}.gz: dm.orient=$(word 2,$(subst -, ,$2))
@ -271,6 +270,12 @@ PTABLES += $(strip $4) $(strip $5) $(strip $6)
endef
#################################################################################
# $1: input factor
# $2: output factor
# $3: num-features
# $4: path to mmapped data
# $5: path to text data
# $6: path and basename of dynamic data
define add_mmsapt
$(call mmap_bitext,$(strip $5),$(strip $4))
@ -282,8 +287,10 @@ MY_ENTRY += output-factor=$(strip $2)
MY_ENTRY += num-features=$(strip $3)
MY_ENTRY += base=$(abspath $4)/ L1=${L1} L2=${L2}
PTABLE_ENTRIES += $$(subst $$(space),;,$${MY_ENTRY})
MOSES_INI_PREREQ += $(addprefix $(strip $4),${L1}.mct ${L1}.tdx ${L1}.sfa)
MOSES_INI_PREREQ += $(addprefix $(strip $4),${L2}.mct ${L2}.tdx ${L2}.sfa)
MOSES_INI_PREREQ += $(strip $4)${L1}-${L2}.mam
MOSES_INI_PREREQ += $(addprefix $(strip $4)/${L1},.mct .tdx .sfa)
MOSES_INI_PREREQ += $(addprefix $(strip $4)/${L2},.mct .tdx .sfa)
MOSES_INI_PREREQ += $(strip $4)/${L1}-${L2}.mam
MOSES_INI_PREREQ += $(strip $4)/${L1}-${L2}.lex
endef

View File

@ -9,17 +9,26 @@
max-sentence-length ?= 80
casing.${L1} ?= truecase
casing.${L2} ?= truecase
MAX_NUM_REFS ?= 4
# tok-mno: monolingual resources
# tok-pll: parallel resources
trn.tok-mno = $(addprefix ${WDIR}/crp/trn/mno/tok/, $(notdir $(wildcard ${WDIR}/crp/trn/mno/raw/*.$1.gz)))
trn.tok-pll = $(addprefix ${WDIR}/crp/trn/pll/tok/, $(notdir $(wildcard ${WDIR}/crp/trn/pll/raw/*.$1.gz)))
trn.raw-mno = $(notdir $(wildcard ${WDIR}/crp/trn/mno/raw/*.$1.gz))
trn.tok-mno = $(addprefix ${WDIR}/crp/trn/mno/tok/, $(call trn.raw-mno,$1))
trn.cased-mno = $(addprefix ${WDIR}/crp/trn/mno/cased/, $(call trn.raw-mno,$1))
trn.raw-pll = $(notdir $(wildcard ${WDIR}/crp/trn/pll/raw/*.$1.gz))
trn.tok-pll = $(addprefix ${WDIR}/crp/trn/pll/tok/, $(call trn.raw-pll,$1))
trn.cased-pll = $(addprefix ${WDIR}/crp/trn/pll/cased/, $(call trn.raw-pll,$1))
define tokenize
$1/tok/%.$2.gz: $1/raw/%.$2.gz
$2/tok/%.$3.gz: | $2/raw/%.$3.gz
$$(lock)
zcat $$< | ${parallel} --pipe -k ${tokenize.$2} | gzip > $$@_
zcat $$(word 1,$$|) | ${pre-tokenize.$1} \
| ${parallel} -j4 --pipe -k ${tokenize.$1} \
| gzip > $$@_
mv $$@_ $$@
$$(unlock)
@ -30,36 +39,36 @@ endef
###########################################################################
define truecase
$1/cased/%.$2.gz: caser = ${run-truecaser}
$1/cased/%.$2.gz: caser += -model ${WDIR}/aux/truecasing-model.$2
$1/cased/%.$2.gz: $1/tok/%.$2.gz ${WDIR}/aux/truecasing-model.$2
$2/cased/%.$3.gz: caser = ${run-truecaser}
$2/cased/%.$3.gz: caser += -model ${WDIR}/aux/truecasing-model.$1
$2/cased/%.$3.gz: | $2/tok/%.$3.gz ${WDIR}/aux/truecasing-model.$1
$$(lock)
zcat $$< | ${parallel} --pipe -k $${caser} | gzip > $$@_
zcat $$(word 1, $$|) | ${parallel} --pipe -k $${caser} | gzip > $$@_
mv $$@_ $$@
$$(unlock)
$1/cased/%.$2: caser = ${run-truecaser}
$1/cased/%.$2: caser += -model ${WDIR}/aux/truecasing-model.$2
$1/cased/%.$2: $1/tok/%.$2.gz ${WDIR}/aux/truecasing-model.$2
$2/cased/%.$3: | $2/cased/%.$3.gz
$$(lock)
zcat $$< | ${parallel} --pipe -k $${caser} > $$@_
gzip -d < $$(word 1, $$|) > $$@_
mv $$@_ $$@
$$(unlock)
endef
define lowercase
$1/cased/%.$2.gz: caser = ${run-lowercaser}
$1/cased/%.$2.gz: | $1/tok/%.$2.gz
$2/cased/%.$3.gz: caser = ${run-lowercaser}
$2/cased/%.$3.gz: | $2/tok/%.$3.gz
$$(lock)
zcat $$| | ${parallel} --pipe -k $${caser} | gzip > $$@_
zcat $$| | ${parallel} -j4 --pipe -k $${caser} | gzip > $$@_
mv $$@_ $$@
$$(unlock)
$1/cased/%.$2: caser = ${run-lowercaser}
$1/cased/%.$2: | $1/tok/%.$2.gz
$2/cased/%.$3: | $2/cased/%.$3.gz
$$(lock)
zcat $$| | ${parallel} --pipe -k $${caser} > $$@_
gzip -d < $$(word 1, $$|) > $$@_
mv $$@_ $$@
$$(unlock)
endef
define skipcasing
@ -83,9 +92,12 @@ pll-ready: $(foreach l,${L1} ${L2}, $(addsuffix .$l.gz,${pll-clean}))
define clean_corpus
.INTERMEDIATE: $1/clean/$2.${L1}.gz
.INTERMEDIATE: $1/clean/$2.${L2}.gz
.INTERMEDIATE: $1/clean/$2.clean.log
# .INTERMEDIATE: $1/clean/$2.${L1}.gz
# .INTERMEDIATE: $1/clean/$2.${L2}.gz
# .INTERMEDIATE: $1/clean/$2.clean.log
# .SECONDARY: $1/clean/$2.${L1}.gz
# .SECONDARY: $1/clean/$2.${L2}.gz
# .SECONDARY: $1/clean/$2.clean.log
$1/clean/$2.${L2}.gz: | $1/clean/$2.clean.log
$$(lock)
gzip < $$(@D)/_$2.${L2} > $$@_ && rm $$(@D)/_$2.${L2}
@ -110,17 +122,26 @@ endef
############################################################################
# Truecasing models #
############################################################################
.INTERMEDIATE: $(call trn.tok-mno,${L1}) $(call trn.tok-pll,${L1})
.INTERMEDIATE: $(call trn.tok-mno,${L2}) $(call trn.tok-pll,${L2})
${WDIR}/aux/truecasing-model.${L1}: | $(call trn.tok-mno,${L1}) $(call trn.tok-pll,${L1})
# .INTERMEDIATE: $(call trn.tok-mno,${L1}) $(call trn.tok-pll,${L1})
# .INTERMEDIATE: $(call trn.tok-mno,${L2}) $(call trn.tok-pll,${L2})
# .SECONDARY: $(call trn.tok-mno,${L1}) $(call trn.tok-pll,${L1})
# .SECONDARY: $(call trn.tok-mno,${L2}) $(call trn.tok-pll,${L2})
#${WDIR}/aux/truecasing-model.${L1}: | $(call trn.tok-mno,${L1}) $(call trn.tok-pll,${L1})
${WDIR}/aux/truecasing-model.${L1}: | $(call trn.tok-mno,${L1})
$(lock)
$(if $|,,$(error Can't find training data for $@!))#'
${train-truecaser} -model $@_ -corpus <(echo $| | xargs zcat -f)
test -s $@_ || (echo "Truecasing model $@ is empty!" && exit 1)
mv $@_ $@
$(unlock)
${WDIR}/aux/truecasing-model.${L2}: | $(call trn.tok-mno,${L2}) $(call trn.tok-pll,${L2})
#${WDIR}/aux/truecasing-model.${L2}: | $(call trn.tok-mno,${L2}) $(call trn.tok-pll,${L2})
${WDIR}/aux/truecasing-model.${L2}: | $(call trn.tok-mno,${L2})
$(lock)
$(if $|,,$(error Can't find training data for $@!))#'
${train-truecaser} -model $@_ -corpus <(echo $| | xargs zcat -f)
test -s $@_ || (echo "Truecasing model $@ is empty!" && exit 1)
mv $@_ $@
$(unlock)
@ -129,18 +150,24 @@ ${WDIR}/aux/truecasing-model.${L2}: | $(call trn.tok-mno,${L2}) $(call trn.tok-p
# Generate rules #
############################################################################
all_data_dirs := $(addprefix ${WDIR}/crp/,trn/mno trn/pll dev tst)
all_data_dirs := $(addprefix ${WDIR}/crp/,trn/mno trn/pll dev tst dev+tst)
# add rules for tokenization and casing
snippet := $(foreach d,$(all_data_dirs),$(foreach l,${L1} ${L2},\
$(call tokenize,$d,$l)$(call ${casing.$l},$d,$l)))
snippet := $(foreach d,$(all_data_dirs),\
$(call tokenize,${L1},$d,${L1})$(call ${casing.${L1}},${L1},$d,${L1}))
snippet += $(foreach d,$(all_data_dirs),\
$(foreach l,${L2} $(addprefix ${L2},$(shell seq 0 ${MAX_NUM_REFS})),\
$(call tokenize,${L2},$d,$l)$(call ${casing.${L2}},${L2},$d,$l)))
MY_EXPERIMENT += $(snippet)
#$(info $(snippet))
$(eval $(snippet))
# add rules for cleaning parallel data prior to word alignment
snippet := $(foreach s,${pllshards},$(call clean_corpus,${WDIR}/crp/trn/pll,$s))
MY_EXPERIMENT += $(snippet)
#$(info $(snippet))
$(eval $(snippet))

View File

@ -5,9 +5,9 @@
# MOSES_ROOT: root directory of the distribution
# MOSES_BIN: where compiled binaries are kept
# MGIZA_ROOT: root directory of the mgiza installation
MOSES_ROOT ?= ${HOME}/code/moses/master/mosesdecoder
MOSES_BIN ?= ${HOME}/bin
MGIZA_ROOT ?= ${HOME}/tools/mgiza
MOSES_ROOT ?= ${HOME}/accept/exp/journal-paper/moses
MOSES_BIN ?= ${MOSES_ROOT}/bin
MGIZA_ROOT ?= ${MOSES_ROOT}
# default location (unless specified otherwise above)
MOSES_BIN ?= ${MOSES_ROOT}/bin
@ -19,12 +19,15 @@ M4M_SCRIPTS ?= ${m4mdir}scripts
# default locations of scripts and executables
# utilities
parallel ?= $(shell which parallel)
parallel ?= $(shell which parallel) --gnu
$(if ${parallel},,$(error GNU parallel utility not found!))
# corpus preprocessing
tokenize.${L1} ?= ${MOSES_SCRIPTS}/tokenizer/tokenizer.perl -q -a -l ${L1}
tokenize.${L2} ?= ${MOSES_SCRIPTS}/tokenizer/tokenizer.perl -q -a -l ${L2}
pre-tokenize.${L1} ?= ${MOSES_SCRIPTS}/tokenizer/pre-tokenizer.perl -l ${L1}
pre-tokenize.${L2} ?= ${MOSES_SCRIPTS}/tokenizer/pre-tokenizer.perl -l ${L2}
tokenize.${L1} ?= ${MOSES_SCRIPTS}/tokenizer/tokenizer.perl -q -a -l ${L1} -no-escape
tokenize.${L2} ?= ${MOSES_SCRIPTS}/tokenizer/tokenizer.perl -q -a -l ${L2} -no-escape
train-truecaser ?= ${MOSES_SCRIPTS}/recaser/train-truecaser.perl
run-truecaser ?= ${MOSES_SCRIPTS}/recaser/truecase.perl
run-detruecaser ?= ${MOSES_SCRIPTS}/recaser/detruecase.perl

View File

@ -6,29 +6,62 @@
untuned_model ?= model/moses.ini.0
tune.dir ?= ${basedir}/tune
# FUNCTIONS FOR COMPUTING REFERENCE FILE DEPENDENCIES
# AND INPUT TYPE FROM INPUT FILE PATH FOR TUNING AND EVAL
# get basenames (with path) of all files belonging
# to a particular set (e.g. dev / tst)
get_set = $(addprefix $(patsubst %/,%,$1)/,\
$(shell find -L $(patsubst %/,%,$(dir $1)) -regex '.*${L1}\(.gz\)?'\
| perl -pe 's/.*\/(.*?).${L1}(\.gz)?$$/$$1/' | sort | uniq))
# $1: moses input file
# ->: base name of corresponding reference files
refbase = $(notdir $(patsubst %.${L1},%.${L2},%,$(patsubst %.gz,%,$1)))
# $1: moses input file
# $2: root of directory tree for search
# ->: list of full paths to reference files
reffiles = $(addprefix $(patsubst %/,%,$2)/cased/,\
$(shell find -L $2 -regex '.*$(call refbase,$1)[0-9]*\(.gz\)?'\
| perl -pe 's/.*\/(.*?)(\.gz)?$$/$$1/' | sort | uniq))
# $1: moses input file
# ->: 0 for plain text, 1 for confusion network
guess-inputtype = $(if $(findstring /cfn,$1),1,0)
############################################################################
# TUNE SYSTEM
#
# $1: untuned moses.ini
# $2: tuned moses.ini
# $3: moses input
# $4: reference
# $5: input type
# $2: tuned moses.ini
# $3: moses input (ref files and input type are computed automatically)
# ->: Makefile snippet for tuning system on input file given
#
define tune_system
TUNED_SYSTEMS += $(strip $2)
.INTERMEDIATE: $1
tune.reffiles = $$(call reffiles,$3,$(dir $(patsubst %/,%,$(dir $3))))
#.INTERMEDIATE: $1
$(strip $2): $${PTABLES} $${DTABLES} $${LMODELS} $${MOSES_INI_PREREQ}
$(strip $2): mert.wdir = $(dir $(abspath $2))tmp
$(strip $2): tune.src = $3
$(strip $2): tune.ref = $4
$(strip $2): | $1 $3 $4
$(strip $2): mert.wdir = $(dir $(abspath $2))tmp
$(strip $2): tune.src = $3
$(strip $2): tune.ref = $$(shell echo $(patsubst %.${L1},%.${L2},$3) | perl -pe 's?/cfn[^/]+/?/cased/?')
$(strip $2): tune.itype = $$(call guess-inputtype,$3)
$(strip $2): | $1 $3 $${tune.reffiles}
$(strip $2):
$$(lock)
$$(info REFFILES = $${tune.reffiles})
mkdir -p $${mert.wdir}
rm -f $${mert.wdir}/*
${mert} ${mert.extra-flags} --nbest ${mert.nbest} --mertdir ${MOSES_BIN} \
--rootdir ${MOSES_ROOT}/scripts --working-dir $${mert.wdir} \
--decoder-flags "${mert.decoder-flags} -inputtype $5" \
$${tune.src} $${tune.ref} ${moses} $1
$(if $(findstring -continue,${mert.extra-flags}),,rm -f $${mert.wdir}/*)
${mert} ${mert.extra-flags} \
--nbest ${mert.nbest} \
--mertdir ${MOSES_BIN} \
--rootdir ${MOSES_SCRIPTS} \
--working-dir $${mert.wdir} \
--decoder-flags "$${mert.decoder-flags}" \
--inputtype $${tune.itype} \
$${tune.src} $${tune.ref} $${moses} $1
${apply-weights} $1 $${mert.wdir}/moses.ini $$@_ && mv $$@_ $$@
$$(unlock)

View File

@ -25,7 +25,7 @@ trap 'cleanup' 0
export LC_ALL=C
if [[ "$inv" == "--Inverse" ]] ; then
parallel < $obase.$$ -j10 --pipe --blocksize 250M "sort -S 10G | gzip > $obase.{#}.gz" &
parallel --gnu < $obase.$$ -j10 --pipe --blocksize 250M "sort -S 10G | gzip > $obase.{#}.gz" &
else
gzip < $obase.$$ > $obase.scored.gz_ &
fi

View File

@ -1,2 +1,12 @@
exe merge-sorted : merge-sorted.cc ../../../moses/generic/file_io/ug_stream.o ;
external-lib bzip2 ;
external-lib zlib ;
exe merge-sorted :
merge-sorted.cc
$(TOP)/moses/TranslationModel/UG/mm//mm
$(TOP)/moses/TranslationModel/UG/generic//generic
$(TOP)//boost_iostreams
$(TOP)//boost_program_options
;

View File

@ -5,7 +5,7 @@
#include <algorithm>
#include <string>
#include <vector>
#include "../../../moses/generic/file_io/ug_stream.h"
#include "moses/TranslationModel/UG/generic/file_io/ug_stream.h"
using namespace std;
using namespace ugdiss;
using namespace boost::iostreams;

View File

@ -541,6 +541,11 @@
<type>1</type>
<locationURI>PARENT-3-PROJECT_LOC/moses/PDTAimp.h</locationURI>
</link>
<link>
<name>PP</name>
<type>2</type>
<locationURI>virtual:/virtual</locationURI>
</link>
<link>
<name>Parameter.cpp</name>
<type>1</type>
@ -1176,6 +1181,16 @@
<type>1</type>
<locationURI>PARENT-3-PROJECT_LOC/moses/FF/MaxSpanFreeNonTermSource.h</locationURI>
</link>
<link>
<name>FF/NieceTerminal.cpp</name>
<type>1</type>
<locationURI>PARENT-3-PROJECT_LOC/moses/FF/NieceTerminal.cpp</locationURI>
</link>
<link>
<name>FF/NieceTerminal.h</name>
<type>1</type>
<locationURI>PARENT-3-PROJECT_LOC/moses/FF/NieceTerminal.h</locationURI>
</link>
<link>
<name>FF/OSM-Feature</name>
<type>2</type>
@ -1591,6 +1606,26 @@
<type>1</type>
<locationURI>PARENT-3-PROJECT_LOC/moses/LM/backward.arpa</locationURI>
</link>
<link>
<name>PP/Factory.cpp</name>
<type>1</type>
<locationURI>PARENT-3-PROJECT_LOC/moses/PP/Factory.cpp</locationURI>
</link>
<link>
<name>PP/Factory.h</name>
<type>1</type>
<locationURI>PARENT-3-PROJECT_LOC/moses/PP/Factory.h</locationURI>
</link>
<link>
<name>PP/PhraseProperty.h</name>
<type>1</type>
<locationURI>PARENT-3-PROJECT_LOC/moses/PP/PhraseProperty.h</locationURI>
</link>
<link>
<name>PP/TreeStructurePhraseProperty.h</name>
<type>1</type>
<locationURI>PARENT-3-PROJECT_LOC/moses/PP/TreeStructurePhraseProperty.h</locationURI>
</link>
<link>
<name>TranslationModel/BilingualDynSuffixArray.cpp</name>
<type>1</type>

View File

@ -89,8 +89,7 @@ for line in sys.stdin:
If you want to add your own changes, you will have to recompile the Cython code.
1. Compile the cython code using Cython 0.17.1
1. Compile the cython code:
python setup.py build_ext -i --cython

View File

@ -16,14 +16,14 @@ cdef extern from 'PhraseDictionaryTree.h' namespace 'Moses':
Scores fvalues
cdef cppclass PhraseDictionaryTree:
PhraseDictionaryTree(unsigned nscores)
PhraseDictionaryTree()
void NeedAlignmentInfo(bint value)
void PrintWordAlignment(bint value)
bint PrintWordAlignment()
int Read(string& path)
void GetTargetCandidates(vector[string]& fs,
void GetTargetCandidates(vector[string]& fs,
vector[StringTgtCand]& rv)
void GetTargetCandidates(vector[string]& fs,
void GetTargetCandidates(vector[string]& fs,
vector[StringTgtCand]& rv,
vector[string]& wa)

File diff suppressed because it is too large

View File

@ -179,7 +179,7 @@ cdef class QueryResult(list):
def __init__(self, source, targets = []):
super(QueryResult, self).__init__(targets)
self.source = source
cdef class DictionaryTree(object):
@ -222,10 +222,10 @@ cdef class PhraseDictionaryTree(DictionaryTree):
raise ValueError, "'%s' doesn't seem a valid binary table." % path
self.path = path
self.tableLimit = tableLimit
self.nscores = nscores
self.nscores = nscores #used to be passed to PhraseDictionaryTree, not used now
self.wa = wa
self.delimiters = delimiters
self.tree = new cdictree.PhraseDictionaryTree(nscores)
self.tree = new cdictree.PhraseDictionaryTree()
self.tree.NeedAlignmentInfo(wa)
self.tree.Read(path)
@ -248,7 +248,7 @@ cdef class PhraseDictionaryTree(DictionaryTree):
and os.path.isfile(stem + ".binphr.srcvoc") \
and os.path.isfile(stem + ".binphr.tgtdata") \
and os.path.isfile(stem + ".binphr.tgtvoc")
cdef TargetProduction getTargetProduction(self, cdictree.StringTgtCand& cand, wa = None, converter = None):
"""Converts a StringTgtCandidate (c++ object) and possibly a word-alignment info (string) to a TargetProduction (python object)."""
cdef list words = [cand.tokens[i].c_str() for i in xrange(cand.tokens.size())]
@ -284,9 +284,9 @@ cdef class PhraseDictionaryTree(DictionaryTree):
results.sort(cmp=cmp, key=key)
if self.tableLimit > 0:
return QueryResult(source, results[0:self.tableLimit])
else:
else:
return results
cdef class OnDiskWrapper(DictionaryTree):
cdef condiskpt.OnDiskWrapper *wrapper
@ -300,7 +300,7 @@ cdef class OnDiskWrapper(DictionaryTree):
self.wrapper = new condiskpt.OnDiskWrapper()
self.wrapper.BeginLoad(string(path))
self.finder = new condiskpt.OnDiskQuery(self.wrapper[0])
@classmethod
def canLoad(cls, stem, bint wa = False):
return os.path.isfile(stem + "/Misc.dat") \
@ -345,7 +345,7 @@ cdef class OnDiskWrapper(DictionaryTree):
if cmp:
results.sort(cmp=cmp, key=key)
return results
def load(path, nscores, limit):
"""Finds out the correct implementation depending on the content of 'path' and returns the appropriate dictionary tree."""
if PhraseDictionaryTree.canLoad(path, False):

View File

@ -279,6 +279,12 @@ public:
manager.ProcessSentence();
const ChartHypothesis *hypo = manager.GetBestHypothesis();
outputChartHypo(out,hypo);
if (addGraphInfo) {
const size_t translationId = tinput.GetTranslationId();
std::ostringstream sgstream;
manager.GetSearchGraph(translationId,sgstream);
retData.insert(pair<string, xmlrpc_c::value>("sg", xmlrpc_c::value_string(sgstream.str())));
}
} else {
Sentence sentence;
const vector<FactorType> &inputFactorOrder =
@ -310,7 +316,7 @@ public:
retData.insert(pair<string, xmlrpc_c::value_array>("word-align", alignments));
}
if(addGraphInfo) {
if (addGraphInfo) {
insertGraphInfo(manager,retData);
(const_cast<StaticData&>(staticData)).SetOutputSearchGraph(false);
}

View File

@ -109,12 +109,12 @@ class Moses():
exit(1)
scores = scores[:self.number_of_features]
model_probabilities = map(float,scores)
model_probabilities = list(map(float,scores))
phrase_probabilities = self.phrase_pairs[src][target][0]
if mode == 'counts' and not priority == 2: #priority 2 is MAP
try:
counts = map(float,line[4].split())
counts = list(map(float,line[4].split()))
try:
target_count,src_count,joint_count = counts
joint_count_e2f = joint_count
@ -171,7 +171,7 @@ class Moses():
src = line[0]
target = line[1]
model_probabilities = map(float,line[2].split())
model_probabilities = list(map(float,line[2].split()))
reordering_probabilities = self.reordering_pairs[src][target]
try:
@ -212,7 +212,7 @@ class Moses():
line = line.rstrip().split(b' ||| ')
if line[-1].endswith(b' |||'):
line[-1] = line[-1][:-4]
line.append('')
line.append(b'')
if increment != line[0]:
stack[i] = line
@ -341,7 +341,7 @@ class Moses():
textual_f2e = [[t,[]] for t in target_list]
for pair in alignment.split(b' '):
s,t = pair.split('-')
s,t = pair.split(b'-')
s,t = int(s),int(t)
textual_e2f[s][1].append(target_list[t])
@ -349,11 +349,11 @@ class Moses():
for s,t in textual_e2f:
if not t:
t.append('NULL')
t.append(b'NULL')
for s,t in textual_f2e:
if not t:
t.append('NULL')
t.append(b'NULL')
#tupelize so we can use the value as dictionary keys
for i in range(len(textual_e2f)):
@ -374,7 +374,7 @@ class Moses():
# if one feature value is 0 (either because of loglinear interpolation or rounding to 0), don't write it to phrasetable
# (phrase pair will end up with probability zero in log-linear model anyway)
if 0 in features:
return ''
return b''
# information specific to Moses model: alignment info and comment section with target and source counts
additional_entries = self.phrase_pairs[src][target][1]
@ -394,7 +394,7 @@ class Moses():
features = b' '.join([b'%.6g' %(f) for f in features])
if flags['add_origin_features']:
origin_features = map(lambda x: 2.718**bool(x),self.phrase_pairs[src][target][0][0]) # 1 if phrase pair doesn't occur in model, 2.718 if it does
origin_features = list(map(lambda x: 2.718**bool(x),self.phrase_pairs[src][target][0][0])) # 1 if phrase pair doesn't occur in model, 2.718 if it does
origin_features = b' '.join([b'%.4f' %(f) for f in origin_features]) + ' '
else:
origin_features = b''
@ -445,7 +445,7 @@ class Moses():
# if one feature value is 0 (either because of loglinear interpolation or rounding to 0), don't write it to reordering table
# (phrase pair will end up with probability zero in log-linear model anyway)
if 0 in features:
return ''
return b''
features = b' '.join([b'%.6g' %(f) for f in features])
@ -699,7 +699,7 @@ class Moses_Alignment():
line = line.split(b' ||| ')
if line[-1].endswith(b' |||'):
line[-1] = line[-1][:-4]
line.append('')
line.append(b'')
src = line[0]
target = line[1]
@ -1030,21 +1030,21 @@ def redistribute_probability_mass(weights,src,target,interface,flags,mode='inter
if flags['normalize_s_given_t'] == 's':
# set weight to 0 for all models where target phrase is unseen (p(s|t)
new_weights[i_e2f] = map(mul,interface.phrase_source[src],weights[i_e2f])
new_weights[i_e2f] = list(map(mul,interface.phrase_source[src],weights[i_e2f]))
if flags['normalize-lexical_weights']:
new_weights[i_e2f_lex] = map(mul,interface.phrase_source[src],weights[i_e2f_lex])
new_weights[i_e2f_lex] = list(map(mul,interface.phrase_source[src],weights[i_e2f_lex]))
elif flags['normalize_s_given_t'] == 't':
# set weight to 0 for all models where target phrase is unseen (p(s|t)
new_weights[i_e2f] = map(mul,interface.phrase_target[target],weights[i_e2f])
new_weights[i_e2f] = list(map(mul,interface.phrase_target[target],weights[i_e2f]))
if flags['normalize-lexical_weights']:
new_weights[i_e2f_lex] = map(mul,interface.phrase_target[target],weights[i_e2f_lex])
new_weights[i_e2f_lex] = list(map(mul,interface.phrase_target[target],weights[i_e2f_lex]))
# set weight to 0 for all models where source phrase is unseen (p(t|s)
new_weights[i_f2e] = map(mul,interface.phrase_source[src],weights[i_f2e])
new_weights[i_f2e] = list(map(mul,interface.phrase_source[src],weights[i_f2e]))
if flags['normalize-lexical_weights']:
new_weights[i_f2e_lex] = map(mul,interface.phrase_source[src],weights[i_f2e_lex])
new_weights[i_f2e_lex] = list(map(mul,interface.phrase_source[src],weights[i_f2e_lex]))
return normalize_weights(new_weights,mode,flags)
@ -1095,7 +1095,7 @@ def score_loglinear(weights,src,target,interface,flags,cache=False):
for idx,prob in enumerate(model_values):
try:
scores.append(exp(dot_product(map(log,prob),weights[idx])))
scores.append(exp(dot_product(list(map(log,prob)),weights[idx])))
except ValueError:
scores.append(0)
@ -1265,6 +1265,8 @@ def handle_file(filename,action,fileobj=None,mode='r'):
if mode == 'r':
mode = 'rb'
elif mode == 'w':
mode = 'wb'
if mode == 'rb' and not filename == '-' and not os.path.exists(filename):
if os.path.exists(filename+'.gz'):
@ -1281,7 +1283,7 @@ def handle_file(filename,action,fileobj=None,mode='r'):
if filename.endswith('.gz'):
fileobj = gzip.open(filename,mode)
elif filename == '-' and mode == 'w':
elif filename == '-' and mode == 'wb':
fileobj = sys.stdout
else:

View File

@ -13,7 +13,16 @@ update-if-changed $(ORDER-LOG) $(max-order) ;
max-order += <dependency>$(ORDER-LOG) ;
fakelib kenlm : [ glob *.cc : *main.cc *test.cc ] ../util//kenutil : <include>.. $(max-order) : : <include>.. $(max-order) ;
wrappers = ;
local with-nplm = [ option.get "with-nplm" ] ;
if $(with-nplm) {
lib neuralLM : : <search>$(with-nplm)/src ;
obj nplm.o : wrappers/nplm.cc : <include>.. <include>$(with-nplm)/src <cxxflags>-fopenmp ;
alias nplm : nplm.o neuralLM ..//boost_thread : : : <cxxflags>-fopenmp <linkflags>-fopenmp <define>WITH_NPLM <library>..//boost_thread ;
wrappers += nplm ;
}
fakelib kenlm : $(wrappers) [ glob *.cc : *main.cc *test.cc ] ../util//kenutil : <include>.. $(max-order) : : <include>.. $(max-order) ;
import testing ;

View File

@ -10,17 +10,19 @@
* Currently only used for next pointers.
*/
#ifndef LM_BHIKSHA__
#define LM_BHIKSHA__
#include <stdint.h>
#include <assert.h>
#ifndef LM_BHIKSHA_H
#define LM_BHIKSHA_H
#include "lm/model_type.hh"
#include "lm/trie.hh"
#include "util/bit_packing.hh"
#include "util/sorted_uniform.hh"
#include <algorithm>
#include <stdint.h>
#include <assert.h>
namespace lm {
namespace ngram {
struct Config;
@ -73,15 +75,24 @@ class ArrayBhiksha {
ArrayBhiksha(void *base, uint64_t max_offset, uint64_t max_value, const Config &config);
void ReadNext(const void *base, uint64_t bit_offset, uint64_t index, uint8_t total_bits, NodeRange &out) const {
const uint64_t *begin_it = util::BinaryBelow(util::IdentityAccessor<uint64_t>(), offset_begin_, offset_end_, index);
// Some assertions are commented out because they are expensive.
// assert(*offset_begin_ == 0);
// std::upper_bound returns the first element that is greater. Want the
// last element that is <= to the index.
const uint64_t *begin_it = std::upper_bound(offset_begin_, offset_end_, index) - 1;
// Since *offset_begin_ == 0, the position should be in range.
// assert(begin_it >= offset_begin_);
const uint64_t *end_it;
for (end_it = begin_it; (end_it < offset_end_) && (*end_it <= index + 1); ++end_it) {}
for (end_it = begin_it + 1; (end_it < offset_end_) && (*end_it <= index + 1); ++end_it) {}
// assert(end_it == std::upper_bound(offset_begin_, offset_end_, index + 1));
--end_it;
// assert(end_it >= begin_it);
out.begin = ((begin_it - offset_begin_) << next_inline_.bits) |
util::ReadInt57(base, bit_offset, next_inline_.bits, next_inline_.mask);
out.end = ((end_it - offset_begin_) << next_inline_.bits) |
util::ReadInt57(base, bit_offset + total_bits, next_inline_.bits, next_inline_.mask);
//assert(out.end >= out.begin);
// If this fails, consider rebuilding your model using KenLM after 1e333d786b748555e8f368d2bbba29a016c98052
assert(out.end >= out.begin);
}
void WriteNext(void *base, uint64_t bit_offset, uint64_t index, uint64_t value) {
@ -109,4 +120,4 @@ class ArrayBhiksha {
} // namespace ngram
} // namespace lm
#endif // LM_BHIKSHA__
#endif // LM_BHIKSHA_H

View File

@ -149,7 +149,7 @@ void BinaryFormat::InitializeBinary(int fd, ModelType model_type, unsigned int s
void BinaryFormat::ReadForConfig(void *to, std::size_t amount, uint64_t offset_excluding_header) const {
assert(header_size_ != kInvalidSize);
util::PReadOrThrow(file_.get(), to, amount, offset_excluding_header + header_size_);
util::ErsatzPRead(file_.get(), to, amount, offset_excluding_header + header_size_);
}
void *BinaryFormat::LoadBinary(std::size_t size) {

View File

@ -1,5 +1,5 @@
#ifndef LM_BINARY_FORMAT__
#define LM_BINARY_FORMAT__
#ifndef LM_BINARY_FORMAT_H
#define LM_BINARY_FORMAT_H
#include "lm/config.hh"
#include "lm/model_type.hh"
@ -103,4 +103,4 @@ bool IsBinaryFormat(int fd);
} // namespace ngram
} // namespace lm
#endif // LM_BINARY_FORMAT__
#endif // LM_BINARY_FORMAT_H

View File

@ -1,5 +1,5 @@
#ifndef LM_BLANK__
#define LM_BLANK__
#ifndef LM_BLANK_H
#define LM_BLANK_H
#include <limits>
@ -40,4 +40,4 @@ inline bool HasExtension(const float &backoff) {
} // namespace ngram
} // namespace lm
#endif // LM_BLANK__
#endif // LM_BLANK_H

View File

@ -1,8 +1,9 @@
#include "lm/builder/adjust_counts.hh"
#include "lm/builder/multi_stream.hh"
#include "lm/builder/ngram_stream.hh"
#include "util/stream/timer.hh"
#include <algorithm>
#include <iostream>
namespace lm { namespace builder {
@ -10,19 +11,19 @@ BadDiscountException::BadDiscountException() throw() {}
BadDiscountException::~BadDiscountException() throw() {}
namespace {
// Return last word in full that is different.
// Return last word in full that is different.
const WordIndex* FindDifference(const NGram &full, const NGram &lower_last) {
const WordIndex *cur_word = full.end() - 1;
const WordIndex *pre_word = lower_last.end() - 1;
// Find last difference.
// Find last difference.
for (; pre_word >= lower_last.begin() && *pre_word == *cur_word; --cur_word, --pre_word) {}
return cur_word;
}
class StatCollector {
public:
StatCollector(std::size_t order, std::vector<uint64_t> &counts, std::vector<Discount> &discounts)
: orders_(order), full_(orders_.back()), counts_(counts), discounts_(discounts) {
StatCollector(std::size_t order, std::vector<uint64_t> &counts, std::vector<uint64_t> &counts_pruned, std::vector<Discount> &discounts)
: orders_(order), full_(orders_.back()), counts_(counts), counts_pruned_(counts_pruned), discounts_(discounts) {
memset(&orders_[0], 0, sizeof(OrderStat) * order);
}
@ -30,10 +31,12 @@ class StatCollector {
void CalculateDiscounts() {
counts_.resize(orders_.size());
counts_pruned_.resize(orders_.size());
discounts_.resize(orders_.size());
for (std::size_t i = 0; i < orders_.size(); ++i) {
const OrderStat &s = orders_[i];
counts_[i] = s.count;
counts_pruned_[i] = s.count_pruned;
for (unsigned j = 1; j < 4; ++j) {
// TODO: Specialize error message for j == 3, meaning 3+
@ -52,14 +55,18 @@ class StatCollector {
}
}
void Add(std::size_t order_minus_1, uint64_t count) {
void Add(std::size_t order_minus_1, uint64_t count, bool pruned = false) {
OrderStat &stat = orders_[order_minus_1];
++stat.count;
if (!pruned)
++stat.count_pruned;
if (count < 5) ++stat.n[count];
}
void AddFull(uint64_t count) {
void AddFull(uint64_t count, bool pruned = false) {
++full_.count;
if (!pruned)
++full_.count_pruned;
if (count < 5) ++full_.n[count];
}
@ -68,24 +75,27 @@ class StatCollector {
// n_1 in equation 26 of Chen and Goodman etc
uint64_t n[5];
uint64_t count;
uint64_t count_pruned;
};
std::vector<OrderStat> orders_;
OrderStat &full_;
std::vector<uint64_t> &counts_;
std::vector<uint64_t> &counts_pruned_;
std::vector<Discount> &discounts_;
};
// Reads all entries in order like NGramStream does.
// Reads all entries in order like NGramStream does.
// But deletes any entries that have <s> in the 1st (not 0th) position on the
// way out by putting other entries in their place. This disrupts the sort
// order but we don't care because the data is going to be sorted again.
// order but we don't care because the data is going to be sorted again.
class CollapseStream {
public:
CollapseStream(const util::stream::ChainPosition &position) :
CollapseStream(const util::stream::ChainPosition &position, uint64_t prune_threshold) :
current_(NULL, NGram::OrderFromSize(position.GetChain().EntrySize())),
block_(position) {
prune_threshold_(prune_threshold),
block_(position) {
StartBlock();
}
@ -96,10 +106,18 @@ class CollapseStream {
CollapseStream &operator++() {
assert(block_);
if (current_.begin()[1] == kBOS && current_.Base() < copy_from_) {
memcpy(current_.Base(), copy_from_, current_.TotalSize());
UpdateCopyFrom();
// Mark highest order n-grams for later pruning
if(current_.Count() <= prune_threshold_) {
current_.Mark();
}
}
current_.NextInMemory();
uint8_t *block_base = static_cast<uint8_t*>(block_->Get());
if (current_.Base() == block_base + block_->ValidSize()) {
@ -107,6 +125,12 @@ class CollapseStream {
++block_;
StartBlock();
}
// Mark highest order n-grams for later pruning
if(current_.Count() <= prune_threshold_) {
current_.Mark();
}
return *this;
}
@ -119,9 +143,15 @@ class CollapseStream {
current_.ReBase(block_->Get());
copy_from_ = static_cast<uint8_t*>(block_->Get()) + block_->ValidSize();
UpdateCopyFrom();
// Mark highest order n-grams for later pruning
if(current_.Count() <= prune_threshold_) {
current_.Mark();
}
}
// Find last without bos.
// Find last without bos.
void UpdateCopyFrom() {
for (copy_from_ -= current_.TotalSize(); copy_from_ >= current_.Base(); copy_from_ -= current_.TotalSize()) {
if (NGram(copy_from_, current_.Order()).begin()[1] != kBOS) break;
@ -132,79 +162,103 @@ class CollapseStream {
// Goes backwards in the block
uint8_t *copy_from_;
uint64_t prune_threshold_;
util::stream::Link block_;
};
} // namespace
void AdjustCounts::Run(const ChainPositions &positions) {
void AdjustCounts::Run(const util::stream::ChainPositions &positions) {
UTIL_TIMER("(%w s) Adjusted counts\n");
const std::size_t order = positions.size();
StatCollector stats(order, counts_, discounts_);
StatCollector stats(order, counts_, counts_pruned_, discounts_);
if (order == 1) {
// Only unigrams. Just collect stats.
for (NGramStream full(positions[0]); full; ++full)
stats.AddFull(full->Count());
stats.CalculateDiscounts();
return;
}
NGramStreams streams;
streams.Init(positions, positions.size() - 1);
CollapseStream full(positions[positions.size() - 1]);
CollapseStream full(positions[positions.size() - 1], prune_thresholds_.back());
// Initialization: <unk> has count 0 and so does <s>.
// Initialization: <unk> has count 0 and so does <s>.
NGramStream *lower_valid = streams.begin();
streams[0]->Count() = 0;
*streams[0]->begin() = kUNK;
stats.Add(0, 0);
(++streams[0])->Count() = 0;
*streams[0]->begin() = kBOS;
// not in stats because it will get put in later.
// not in stats because it will get put in later.
std::vector<uint64_t> lower_counts(positions.size(), 0);
// iterate over full (the stream of the highest order ngrams)
for (; full; ++full) {
for (; full; ++full) {
const WordIndex *different = FindDifference(*full, **lower_valid);
std::size_t same = full->end() - 1 - different;
// Increment the adjusted count.
// Increment the adjusted count.
if (same) ++streams[same - 1]->Count();
// Output all the valid ones that changed.
// Output all the valid ones that changed.
for (; lower_valid >= &streams[same]; --lower_valid) {
stats.Add(lower_valid - streams.begin(), (*lower_valid)->Count());
// mjd: review this!
uint64_t order = (*lower_valid)->Order();
uint64_t realCount = lower_counts[order - 1];
if(order > 1 && prune_thresholds_[order - 1] && realCount <= prune_thresholds_[order - 1])
(*lower_valid)->Mark();
stats.Add(lower_valid - streams.begin(), (*lower_valid)->UnmarkedCount(), (*lower_valid)->IsMarked());
++*lower_valid;
}
// Count the true occurrences of lower-order n-grams
for (std::size_t i = 0; i < lower_counts.size(); ++i) {
if (i >= same) {
lower_counts[i] = 0;
}
lower_counts[i] += full->UnmarkedCount();
}
// This is here because bos is also const WordIndex *, so copy gets
// consistent argument types.
// consistent argument types.
const WordIndex *full_end = full->end();
// Initialize and mark as valid up to bos.
// Initialize and mark as valid up to bos.
const WordIndex *bos;
for (bos = different; (bos > full->begin()) && (*bos != kBOS); --bos) {
++lower_valid;
std::copy(bos, full_end, (*lower_valid)->begin());
(*lower_valid)->Count() = 1;
}
// Now bos indicates where <s> is or is the 0th word of full.
// Now bos indicates where <s> is or is the 0th word of full.
if (bos != full->begin()) {
// There is an <s> beyond the 0th word.
// There is an <s> beyond the 0th word.
NGramStream &to = *++lower_valid;
std::copy(bos, full_end, to->begin());
to->Count() = full->Count();
// mjd: what is this doing?
to->Count() = full->UnmarkedCount();
} else {
stats.AddFull(full->Count());
stats.AddFull(full->UnmarkedCount(), full->IsMarked());
}
assert(lower_valid >= &streams[0]);
}
// Output everything valid.
for (NGramStream *s = streams.begin(); s <= lower_valid; ++s) {
stats.Add(s - streams.begin(), (*s)->Count());
if((*s)->Count() <= prune_thresholds_[(*s)->Order() - 1])
(*s)->Mark();
stats.Add(s - streams.begin(), (*s)->UnmarkedCount(), (*s)->IsMarked());
++*s;
}
// Poison everyone! Except the N-grams which were already poisoned by the input.
// Poison everyone! Except the N-grams which were already poisoned by the input.
for (NGramStream *s = streams.begin(); s != streams.end(); ++s)
s->Poison();

View File

@ -1,5 +1,5 @@
#ifndef LM_BUILDER_ADJUST_COUNTS__
#define LM_BUILDER_ADJUST_COUNTS__
#ifndef LM_BUILDER_ADJUST_COUNTS_H
#define LM_BUILDER_ADJUST_COUNTS_H
#include "lm/builder/discount.hh"
#include "util/exception.hh"
@ -8,11 +8,11 @@
#include <stdint.h>
namespace util { namespace stream { class ChainPositions; } }
namespace lm {
namespace builder {
class ChainPositions;
class BadDiscountException : public util::Exception {
public:
BadDiscountException() throw();
@ -27,18 +27,21 @@ class BadDiscountException : public util::Exception {
*/
class AdjustCounts {
public:
AdjustCounts(std::vector<uint64_t> &counts, std::vector<Discount> &discounts)
: counts_(counts), discounts_(discounts) {}
AdjustCounts(std::vector<uint64_t> &counts, std::vector<uint64_t> &counts_pruned, std::vector<Discount> &discounts, std::vector<uint64_t> &prune_thresholds)
: counts_(counts), counts_pruned_(counts_pruned), discounts_(discounts), prune_thresholds_(prune_thresholds)
{}
void Run(const ChainPositions &positions);
void Run(const util::stream::ChainPositions &positions);
private:
std::vector<uint64_t> &counts_;
std::vector<uint64_t> &counts_pruned_;
std::vector<Discount> &discounts_;
std::vector<uint64_t> &prune_thresholds_;
};
} // namespace builder
} // namespace lm
#endif // LM_BUILDER_ADJUST_COUNTS__
#endif // LM_BUILDER_ADJUST_COUNTS_H

View File

@ -1,6 +1,6 @@
#include "lm/builder/adjust_counts.hh"
#include "lm/builder/multi_stream.hh"
#include "lm/builder/ngram_stream.hh"
#include "util/scoped.hh"
#include <boost/thread/thread.hpp>
@ -61,19 +61,21 @@ BOOST_AUTO_TEST_CASE(Simple) {
util::stream::ChainConfig config;
config.total_memory = 100;
config.block_count = 1;
Chains chains(4);
util::stream::Chains chains(4);
for (unsigned i = 0; i < 4; ++i) {
config.entry_size = NGram::TotalSize(i + 1);
chains.push_back(config);
}
chains[3] >> WriteInput();
ChainPositions for_adjust(chains);
util::stream::ChainPositions for_adjust(chains);
for (unsigned i = 0; i < 4; ++i) {
chains[i] >> boost::ref(outputs[i]);
}
chains >> util::stream::kRecycle;
BOOST_CHECK_THROW(AdjustCounts(counts, discount).Run(for_adjust), BadDiscountException);
std::vector<uint64_t> counts_pruned(4);
std::vector<uint64_t> prune_thresholds(4);
BOOST_CHECK_THROW(AdjustCounts(counts, counts_pruned, discount, prune_thresholds).Run(for_adjust), BadDiscountException);
}
BOOST_REQUIRE_EQUAL(4UL, counts.size());
BOOST_CHECK_EQUAL(4UL, counts[0]);

View File

@ -2,6 +2,7 @@
#include "lm/builder/ngram.hh"
#include "lm/lm_exception.hh"
#include "lm/vocab.hh"
#include "lm/word_index.hh"
#include "util/fake_ofstream.hh"
#include "util/file.hh"
@ -37,60 +38,6 @@ struct VocabEntry {
};
#pragma pack(pop)
const float kProbingMultiplier = 1.5;
class VocabHandout {
public:
static std::size_t MemUsage(WordIndex initial_guess) {
if (initial_guess < 2) initial_guess = 2;
return util::CheckOverflow(Table::Size(initial_guess, kProbingMultiplier));
}
explicit VocabHandout(int fd, WordIndex initial_guess) :
table_backing_(util::CallocOrThrow(MemUsage(initial_guess))),
table_(table_backing_.get(), MemUsage(initial_guess)),
double_cutoff_(std::max<std::size_t>(initial_guess * 1.1, 1)),
word_list_(fd) {
Lookup("<unk>"); // Force 0
Lookup("<s>"); // Force 1
Lookup("</s>"); // Force 2
}
WordIndex Lookup(const StringPiece &word) {
VocabEntry entry;
entry.key = util::MurmurHashNative(word.data(), word.size());
entry.value = table_.SizeNoSerialization();
Table::MutableIterator it;
if (table_.FindOrInsert(entry, it))
return it->value;
word_list_ << word << '\0';
UTIL_THROW_IF(Size() >= std::numeric_limits<lm::WordIndex>::max(), VocabLoadException, "Too many vocabulary words. Change WordIndex to uint64_t in lm/word_index.hh.");
if (Size() >= double_cutoff_) {
table_backing_.call_realloc(table_.DoubleTo());
table_.Double(table_backing_.get());
double_cutoff_ *= 2;
}
return entry.value;
}
WordIndex Size() const {
return table_.SizeNoSerialization();
}
private:
// TODO: factor out a resizable probing hash table.
// TODO: use mremap on linux to get all zeros on resizes.
util::scoped_malloc table_backing_;
typedef util::ProbingHashTable<VocabEntry, util::IdentityHash> Table;
Table table_;
std::size_t double_cutoff_;
util::FakeOFStream word_list_;
};
class DedupeHash : public std::unary_function<const WordIndex *, bool> {
public:
explicit DedupeHash(std::size_t order) : size_(order * sizeof(WordIndex)) {}
@ -127,6 +74,10 @@ struct DedupeEntry {
}
};
// TODO: don't have this here, should be with probing hash table defaults?
const float kProbingMultiplier = 1.5;
typedef util::ProbingHashTable<DedupeEntry, DedupeHash, DedupeEquals> Dedupe;
class Writer {
@ -220,37 +171,50 @@ float CorpusCount::DedupeMultiplier(std::size_t order) {
}
std::size_t CorpusCount::VocabUsage(std::size_t vocab_estimate) {
return VocabHandout::MemUsage(vocab_estimate);
return ngram::GrowableVocab<ngram::WriteUniqueWords>::MemUsage(vocab_estimate);
}
CorpusCount::CorpusCount(util::FilePiece &from, int vocab_write, uint64_t &token_count, WordIndex &type_count, std::size_t entries_per_block)
CorpusCount::CorpusCount(util::FilePiece &from, int vocab_write, uint64_t &token_count, WordIndex &type_count, std::size_t entries_per_block, WarningAction disallowed_symbol)
: from_(from), vocab_write_(vocab_write), token_count_(token_count), type_count_(type_count),
dedupe_mem_size_(Dedupe::Size(entries_per_block, kProbingMultiplier)),
dedupe_mem_(util::MallocOrThrow(dedupe_mem_size_)) {
dedupe_mem_(util::MallocOrThrow(dedupe_mem_size_)),
disallowed_symbol_action_(disallowed_symbol) {
}
void CorpusCount::Run(const util::stream::ChainPosition &position) {
UTIL_TIMER("(%w s) Counted n-grams\n");
namespace {
void ComplainDisallowed(StringPiece word, WarningAction &action) {
switch (action) {
case SILENT:
return;
case COMPLAIN:
std::cerr << "Warning: " << word << " appears in the input. All instances of <s>, </s>, and <unk> will be interpreted as whitespace." << std::endl;
action = SILENT;
return;
case THROW_UP:
UTIL_THROW(FormatLoadException, "Special word " << word << " is not allowed in the corpus. I plan to support models containing <unk> in the future. Pass --skip_symbols to convert these symbols to whitespace.");
}
}
} // namespace
VocabHandout vocab(vocab_write_, type_count_);
void CorpusCount::Run(const util::stream::ChainPosition &position) {
ngram::GrowableVocab<ngram::WriteUniqueWords> vocab(type_count_, vocab_write_);
token_count_ = 0;
type_count_ = 0;
const WordIndex end_sentence = vocab.Lookup("</s>");
const WordIndex end_sentence = vocab.FindOrInsert("</s>");
Writer writer(NGram::OrderFromSize(position.GetChain().EntrySize()), position, dedupe_mem_.get(), dedupe_mem_size_);
uint64_t count = 0;
bool delimiters[256];
memset(delimiters, 0, sizeof(delimiters));
const char kDelimiterSet[] = "\0\t\n\r ";
for (const char *i = kDelimiterSet; i < kDelimiterSet + sizeof(kDelimiterSet); ++i) {
delimiters[static_cast<unsigned char>(*i)] = true;
}
util::BoolCharacter::Build("\0\t\n\r ", delimiters);
try {
while(true) {
StringPiece line(from_.ReadLine());
writer.StartSentence();
for (util::TokenIter<util::BoolCharacter, true> w(line, delimiters); w; ++w) {
WordIndex word = vocab.Lookup(*w);
UTIL_THROW_IF(word <= 2, FormatLoadException, "Special word " << *w << " is not allowed in the corpus. I plan to support models containing <unk> in the future.");
WordIndex word = vocab.FindOrInsert(*w);
if (word <= 2) {
ComplainDisallowed(*w, disallowed_symbol_action_);
continue;
}
writer.Append(word);
++count;
}


@ -1,6 +1,7 @@
#ifndef LM_BUILDER_CORPUS_COUNT__
#define LM_BUILDER_CORPUS_COUNT__
#ifndef LM_BUILDER_CORPUS_COUNT_H
#define LM_BUILDER_CORPUS_COUNT_H
#include "lm/lm_exception.hh"
#include "lm/word_index.hh"
#include "util/scoped.hh"
@ -28,7 +29,7 @@ class CorpusCount {
// token_count: out.
// type_count aka vocabulary size. Initialize to an estimate. It is set to the exact value.
CorpusCount(util::FilePiece &from, int vocab_write, uint64_t &token_count, WordIndex &type_count, std::size_t entries_per_block);
CorpusCount(util::FilePiece &from, int vocab_write, uint64_t &token_count, WordIndex &type_count, std::size_t entries_per_block, WarningAction disallowed_symbol);
void Run(const util::stream::ChainPosition &position);
@ -40,8 +41,10 @@ class CorpusCount {
std::size_t dedupe_mem_size_;
util::scoped_malloc dedupe_mem_;
WarningAction disallowed_symbol_action_;
};
} // namespace builder
} // namespace lm
#endif // LM_BUILDER_CORPUS_COUNT__
#endif // LM_BUILDER_CORPUS_COUNT_H


@ -45,7 +45,7 @@ BOOST_AUTO_TEST_CASE(Short) {
NGramStream stream;
uint64_t token_count;
WordIndex type_count = 10;
CorpusCount counter(input_piece, vocab.get(), token_count, type_count, chain.BlockSize() / chain.EntrySize());
CorpusCount counter(input_piece, vocab.get(), token_count, type_count, chain.BlockSize() / chain.EntrySize(), SILENT);
chain >> boost::ref(counter) >> stream >> util::stream::kRecycle;
const char *v[] = {"<unk>", "<s>", "</s>", "looking", "on", "a", "little", "more", "loin", "foo", "bar"};


@ -1,5 +1,5 @@
#ifndef BUILDER_DISCOUNT__
#define BUILDER_DISCOUNT__
#ifndef LM_BUILDER_DISCOUNT_H
#define LM_BUILDER_DISCOUNT_H
#include <algorithm>
@ -23,4 +23,4 @@ struct Discount {
} // namespace builder
} // namespace lm
#endif // BUILDER_DISCOUNT__
#endif // LM_BUILDER_DISCOUNT_H

lm/builder/hash_gamma.hh (new file)

@ -0,0 +1,19 @@
#ifndef LM_BUILDER_HASH_GAMMA__
#define LM_BUILDER_HASH_GAMMA__
#include <stdint.h>
namespace lm { namespace builder {
#pragma pack(push)
#pragma pack(4)
struct HashGamma {
uint64_t hash_value;
float gamma;
};
#pragma pack(pop)
}} // namespaces
#endif // LM_BUILDER_HASH_GAMMA__


@ -1,5 +1,5 @@
#ifndef LM_BUILDER_HEADER_INFO__
#define LM_BUILDER_HEADER_INFO__
#ifndef LM_BUILDER_HEADER_INFO_H
#define LM_BUILDER_HEADER_INFO_H
#include <string>
#include <stdint.h>


@ -3,6 +3,8 @@
#include "lm/builder/discount.hh"
#include "lm/builder/ngram_stream.hh"
#include "lm/builder/sort.hh"
#include "lm/builder/hash_gamma.hh"
#include "util/murmur_hash.hh"
#include "util/file.hh"
#include "util/stream/chain.hh"
#include "util/stream/io.hh"
@ -14,55 +16,179 @@ namespace lm { namespace builder {
namespace {
struct BufferEntry {
// Gamma from page 20 of Chen and Goodman.
float gamma;
// \sum_w a(c w) for all w.
float denominator;
};
// Extract an array of gamma from an array of BufferEntry.
struct HashBufferEntry : public BufferEntry {
// Hash value of ngram. Used to join contexts with backoffs.
uint64_t hash_value;
};
// Reads all entries in order like NGramStream does.
// But deletes any entries whose CutoffCount is at or below the pruning threshold.
class PruneNGramStream {
public:
PruneNGramStream(const util::stream::ChainPosition &position) :
current_(NULL, NGram::OrderFromSize(position.GetChain().EntrySize())),
dest_(NULL, NGram::OrderFromSize(position.GetChain().EntrySize())),
currentCount_(0),
block_(position)
{
StartBlock();
}
NGram &operator*() { return current_; }
NGram *operator->() { return &current_; }
operator bool() const {
return block_;
}
PruneNGramStream &operator++() {
assert(block_);
if (current_.Order() > 1) {
if(currentCount_ > 0) {
if(dest_.Base() < current_.Base()) {
memcpy(dest_.Base(), current_.Base(), current_.TotalSize());
}
dest_.NextInMemory();
}
} else {
dest_.NextInMemory();
}
current_.NextInMemory();
uint8_t *block_base = static_cast<uint8_t*>(block_->Get());
if (current_.Base() == block_base + block_->ValidSize()) {
block_->SetValidSize(dest_.Base() - block_base);
++block_;
StartBlock();
}
currentCount_ = current_.CutoffCount();
return *this;
}
private:
void StartBlock() {
for (; ; ++block_) {
if (!block_) return;
if (block_->ValidSize()) break;
}
current_.ReBase(block_->Get());
currentCount_ = current_.CutoffCount();
dest_.ReBase(block_->Get());
}
NGram current_; // input iterator
NGram dest_; // output iterator
uint64_t currentCount_;
util::stream::Link block_;
};
// Extract an array of HashedGamma from an array of BufferEntry.
class OnlyGamma {
public:
OnlyGamma(bool pruning) : pruning_(pruning) {}
void Run(const util::stream::ChainPosition &position) {
for (util::stream::Link block_it(position); block_it; ++block_it) {
float *out = static_cast<float*>(block_it->Get());
const float *in = out;
const float *end = static_cast<const float*>(block_it->ValidEnd());
for (out += 1, in += 2; in < end; out += 1, in += 2) {
*out = *in;
if(pruning_) {
const HashBufferEntry *in = static_cast<const HashBufferEntry*>(block_it->Get());
const HashBufferEntry *end = static_cast<const HashBufferEntry*>(block_it->ValidEnd());
// Just make it point to the beginning of the stream so it can be overwritten
// with HashGamma values. Do not attempt to interpret the values until they are set below.
HashGamma *out = static_cast<HashGamma*>(block_it->Get());
for (; in < end; out += 1, in += 1) {
// Buffer the values first; otherwise they might be overwritten too early.
float gamma_buf = in->gamma;
uint64_t hash_buf = in->hash_value;
out->gamma = gamma_buf;
out->hash_value = hash_buf;
}
block_it->SetValidSize((block_it->ValidSize() * sizeof(HashGamma)) / sizeof(HashBufferEntry));
}
else {
float *out = static_cast<float*>(block_it->Get());
const float *in = out;
const float *end = static_cast<const float*>(block_it->ValidEnd());
for (out += 1, in += 2; in < end; out += 1, in += 2) {
*out = *in;
}
block_it->SetValidSize(block_it->ValidSize() / 2);
}
block_it->SetValidSize(block_it->ValidSize() / 2);
}
}
private:
bool pruning_;
};
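To make the branching above concrete: in the unpruned case OnlyGamma compacts interleaved {gamma, denominator} pairs down to gamma-only floats in place and halves the valid size. A minimal standalone sketch of that compaction (an illustration, not code from the patch):

#include <cassert>
#include <cstddef>

// Compact interleaved {gamma, denominator} pairs to gamma-only floats, in place,
// and return the new number of floats. A caller would shrink the block's valid
// size accordingly, as SetValidSize(ValidSize() / 2) does above.
std::size_t CompactGammaOnly(float *data, std::size_t len) {
  assert(len % 2 == 0);            // data holds whole pairs
  float *out = data;
  for (const float *in = data; in < data + len; in += 2, ++out) {
    *out = *in;                    // keep gamma (first member), drop denominator
  }
  return len / 2;
}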
class AddRight {
public:
AddRight(const Discount &discount, const util::stream::ChainPosition &input)
: discount_(discount), input_(input) {}
AddRight(const Discount &discount, const util::stream::ChainPosition &input, bool pruning)
: discount_(discount), input_(input), pruning_(pruning) {}
void Run(const util::stream::ChainPosition &output) {
NGramStream in(input_);
util::stream::Stream out(output);
std::vector<WordIndex> previous(in->Order() - 1);
// Silly windows requires this workaround to just get an invalid pointer when empty.
void *const previous_raw = previous.empty() ? NULL : static_cast<void*>(&previous[0]);
const std::size_t size = sizeof(WordIndex) * previous.size();
for(; in; ++out) {
memcpy(&previous[0], in->begin(), size);
memcpy(previous_raw, in->begin(), size);
uint64_t denominator = 0;
uint64_t normalizer = 0;
uint64_t counts[4];
memset(counts, 0, sizeof(counts));
do {
denominator += in->Count();
++counts[std::min(in->Count(), static_cast<uint64_t>(3))];
} while (++in && !memcmp(&previous[0], in->begin(), size));
denominator += in->UnmarkedCount();
// Collect unused probability mass from pruning.
// Becomes 0 for unpruned ngrams.
normalizer += in->UnmarkedCount() - in->CutoffCount();
// Chen & Goodman do not mention counting based on cutoffs, but the backoff
// would otherwise become larger than 1, so the cutoff counts are probably
// needed here. Without pruning this is just the normal count.
if(in->CutoffCount() > 0)
++counts[std::min(in->CutoffCount(), static_cast<uint64_t>(3))];
} while (++in && !memcmp(previous_raw, in->begin(), size));
BufferEntry &entry = *reinterpret_cast<BufferEntry*>(out.Get());
entry.denominator = static_cast<float>(denominator);
entry.gamma = 0.0;
for (unsigned i = 1; i <= 3; ++i) {
entry.gamma += discount_.Get(i) * static_cast<float>(counts[i]);
}
// Makes model sum to 1 with pruning (I hope).
entry.gamma += normalizer;
entry.gamma /= entry.denominator;
if(pruning_) {
// If pruning is enabled the stream actually contains HashBufferEntry, see InitialProbabilities(...),
// so add a hash value that identifies the current ngram.
static_cast<HashBufferEntry*>(&entry)->hash_value = util::MurmurHashNative(previous_raw, size);
}
}
out.Poison();
}
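For reference, my reading of the arithmetic in Run() above: for each context c, the buffered backoff weight is

  gamma(c) = (D_1 N_1(c) + D_2 N_2(c) + D_{3+} N_{3+}(c) + pruned_mass(c)) / \sum_w a(c w)

where D_i are the discounts, N_i(c) counts the successors of c with cutoff count i, pruned_mass(c) is the mass freed by pruning (the accumulated UnmarkedCount() - CutoffCount()), and the denominator sums the unmarked counts. Without pruning, pruned_mass(c) is 0 and this reduces to the usual Chen & Goodman backoff.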
@ -70,6 +196,7 @@ class AddRight {
private:
const Discount &discount_;
const util::stream::ChainPosition input_;
bool pruning_;
};
class MergeRight {
@ -82,7 +209,7 @@ class MergeRight {
void Run(const util::stream::ChainPosition &primary) {
util::stream::Stream summed(from_adder_);
NGramStream grams(primary);
PruneNGramStream grams(primary);
// Without interpolation, the interpolation weight goes to <unk>.
if (grams->Order() == 1 && !interpolate_unigrams_) {
@ -97,15 +224,16 @@ class MergeRight {
++summed;
return;
}
std::vector<WordIndex> previous(grams->Order() - 1);
const std::size_t size = sizeof(WordIndex) * previous.size();
for (; grams; ++summed) {
memcpy(&previous[0], grams->begin(), size);
const BufferEntry &sums = *static_cast<const BufferEntry*>(summed.Get());
do {
Payload &pay = grams->Value();
pay.uninterp.prob = discount_.Apply(pay.count) / sums.denominator;
pay.uninterp.prob = discount_.Apply(grams->UnmarkedCount()) / sums.denominator;
pay.uninterp.gamma = sums.gamma;
} while (++grams && !memcmp(&previous[0], grams->begin(), size));
}
@ -119,17 +247,29 @@ class MergeRight {
} // namespace
void InitialProbabilities(const InitialProbabilitiesConfig &config, const std::vector<Discount> &discounts, Chains &primary, Chains &second_in, Chains &gamma_out) {
util::stream::ChainConfig gamma_config = config.adder_out;
gamma_config.entry_size = sizeof(BufferEntry);
void InitialProbabilities(
const InitialProbabilitiesConfig &config,
const std::vector<Discount> &discounts,
util::stream::Chains &primary,
util::stream::Chains &second_in,
util::stream::Chains &gamma_out,
const std::vector<uint64_t> &prune_thresholds) {
for (size_t i = 0; i < primary.size(); ++i) {
util::stream::ChainConfig gamma_config = config.adder_out;
if(prune_thresholds[i] > 0)
gamma_config.entry_size = sizeof(HashBufferEntry);
else
gamma_config.entry_size = sizeof(BufferEntry);
util::stream::ChainPosition second(second_in[i].Add());
second_in[i] >> util::stream::kRecycle;
gamma_out.push_back(gamma_config);
gamma_out[i] >> AddRight(discounts[i], second);
gamma_out[i] >> AddRight(discounts[i], second, prune_thresholds[i] > 0);
primary[i] >> MergeRight(config.interpolate_unigrams, gamma_out[i].Add(), discounts[i]);
// Don't bother with the OnlyGamma thread for something to discard.
if (i) gamma_out[i] >> OnlyGamma();
if (i) gamma_out[i] >> OnlyGamma(prune_thresholds[i] > 0);
}
}


@ -1,14 +1,15 @@
#ifndef LM_BUILDER_INITIAL_PROBABILITIES__
#define LM_BUILDER_INITIAL_PROBABILITIES__
#ifndef LM_BUILDER_INITIAL_PROBABILITIES_H
#define LM_BUILDER_INITIAL_PROBABILITIES_H
#include "lm/builder/discount.hh"
#include "util/stream/config.hh"
#include <vector>
namespace util { namespace stream { class Chains; } }
namespace lm {
namespace builder {
class Chains;
struct InitialProbabilitiesConfig {
// These should be small buffers to keep the adder from getting too far ahead
@ -26,9 +27,15 @@ struct InitialProbabilitiesConfig {
* The values are bare floats and should be buffered for interpolation to
* use.
*/
void InitialProbabilities(const InitialProbabilitiesConfig &config, const std::vector<Discount> &discounts, Chains &primary, Chains &second_in, Chains &gamma_out);
void InitialProbabilities(
const InitialProbabilitiesConfig &config,
const std::vector<Discount> &discounts,
util::stream::Chains &primary,
util::stream::Chains &second_in,
util::stream::Chains &gamma_out,
const std::vector<uint64_t> &prune_thresholds);
} // namespace builder
} // namespace lm
#endif // LM_BUILDER_INITIAL_PROBABILITIES__
#endif // LM_BUILDER_INITIAL_PROBABILITIES_H


@ -1,9 +1,12 @@
#include "lm/builder/interpolate.hh"
#include "lm/builder/hash_gamma.hh"
#include "lm/builder/joint_order.hh"
#include "lm/builder/multi_stream.hh"
#include "lm/builder/ngram_stream.hh"
#include "lm/builder/sort.hh"
#include "lm/lm_exception.hh"
#include "util/fixed_array.hh"
#include "util/murmur_hash.hh"
#include <assert.h>
@ -12,7 +15,8 @@ namespace {
class Callback {
public:
Callback(float uniform_prob, const ChainPositions &backoffs) : backoffs_(backoffs.size()), probs_(backoffs.size() + 2) {
Callback(float uniform_prob, const util::stream::ChainPositions &backoffs, const std::vector<uint64_t> &prune_thresholds)
: backoffs_(backoffs.size()), probs_(backoffs.size() + 2), prune_thresholds_(prune_thresholds) {
probs_[0] = uniform_prob;
for (std::size_t i = 0; i < backoffs.size(); ++i) {
backoffs_.push_back(backoffs[i]);
@ -33,12 +37,37 @@ class Callback {
pay.complete.prob = pay.uninterp.prob + pay.uninterp.gamma * probs_[order_minus_1];
probs_[order_minus_1 + 1] = pay.complete.prob;
pay.complete.prob = log10(pay.complete.prob);
// TODO: this is a hack to skip n-grams that don't appear as context. Pruning will require some different handling.
if (order_minus_1 < backoffs_.size() && *(gram.end() - 1) != kUNK && *(gram.end() - 1) != kEOS) {
pay.complete.backoff = log10(*static_cast<const float*>(backoffs_[order_minus_1].Get()));
++backoffs_[order_minus_1];
// This skips over ngrams if backoffs have been exhausted.
if(!backoffs_[order_minus_1]) {
pay.complete.backoff = 0.0;
return;
}
if(prune_thresholds_[order_minus_1 + 1] > 0) {
// Compute the hash value for the current context.
uint64_t current_hash = util::MurmurHashNative(gram.begin(), gram.Order() * sizeof(WordIndex));
const HashGamma *hashed_backoff = static_cast<const HashGamma*>(backoffs_[order_minus_1].Get());
while(backoffs_[order_minus_1] && current_hash != hashed_backoff->hash_value) {
hashed_backoff = static_cast<const HashGamma*>(backoffs_[order_minus_1].Get());
++backoffs_[order_minus_1];
}
if(current_hash == hashed_backoff->hash_value) {
pay.complete.backoff = log10(hashed_backoff->gamma);
++backoffs_[order_minus_1];
} else {
// Has been pruned away so it is not a context anymore
pay.complete.backoff = 0.0;
}
} else {
pay.complete.backoff = log10(*static_cast<const float*>(backoffs_[order_minus_1].Get()));
++backoffs_[order_minus_1];
}
} else {
// Not a context.
pay.complete.backoff = 0.0;
}
}
@ -46,19 +75,22 @@ class Callback {
void Exit(unsigned, const NGram &) const {}
private:
FixedArray<util::stream::Stream> backoffs_;
util::FixedArray<util::stream::Stream> backoffs_;
std::vector<float> probs_;
const std::vector<uint64_t>& prune_thresholds_;
};
} // namespace
Interpolate::Interpolate(uint64_t unigram_count, const ChainPositions &backoffs)
: uniform_prob_(1.0 / static_cast<float>(unigram_count - 1)), backoffs_(backoffs) {}
Interpolate::Interpolate(uint64_t vocab_size, const util::stream::ChainPositions &backoffs, const std::vector<uint64_t>& prune_thresholds)
: uniform_prob_(1.0 / static_cast<float>(vocab_size)), // Includes <unk> but excludes <s>.
backoffs_(backoffs),
prune_thresholds_(prune_thresholds) {}
// perform order-wise interpolation
void Interpolate::Run(const ChainPositions &positions) {
void Interpolate::Run(const util::stream::ChainPositions &positions) {
assert(positions.size() == backoffs_.size() + 1);
Callback callback(uniform_prob_, backoffs_);
Callback callback(uniform_prob_, backoffs_, prune_thresholds_);
JointOrder<Callback, SuffixOrder>(positions, callback);
}


@ -1,10 +1,12 @@
#ifndef LM_BUILDER_INTERPOLATE__
#define LM_BUILDER_INTERPOLATE__
#ifndef LM_BUILDER_INTERPOLATE_H
#define LM_BUILDER_INTERPOLATE_H
#include "util/stream/multi_stream.hh"
#include <vector>
#include <stdint.h>
#include "lm/builder/multi_stream.hh"
namespace lm { namespace builder {
/* Interpolate step.
@ -14,14 +16,17 @@ namespace lm { namespace builder {
*/
class Interpolate {
public:
explicit Interpolate(uint64_t unigram_count, const ChainPositions &backoffs);
// Normally vocab_size is the unigram count-1 (since p(<s>) = 0) but might
// be larger when the user specifies a consistent vocabulary size.
explicit Interpolate(uint64_t vocab_size, const util::stream::ChainPositions &backoffs, const std::vector<uint64_t> &prune_thresholds);
void Run(const ChainPositions &positions);
void Run(const util::stream::ChainPositions &positions);
private:
float uniform_prob_;
ChainPositions backoffs_;
util::stream::ChainPositions backoffs_;
const std::vector<uint64_t> prune_thresholds_;
};
}} // namespaces
#endif // LM_BUILDER_INTERPOLATE__
#endif // LM_BUILDER_INTERPOLATE_H


@ -1,14 +1,14 @@
#ifndef LM_BUILDER_JOINT_ORDER__
#define LM_BUILDER_JOINT_ORDER__
#ifndef LM_BUILDER_JOINT_ORDER_H
#define LM_BUILDER_JOINT_ORDER_H
#include "lm/builder/multi_stream.hh"
#include "lm/builder/ngram_stream.hh"
#include "lm/lm_exception.hh"
#include <string.h>
namespace lm { namespace builder {
template <class Callback, class Compare> void JointOrder(const ChainPositions &positions, Callback &callback) {
template <class Callback, class Compare> void JointOrder(const util::stream::ChainPositions &positions, Callback &callback) {
// Allow matching to reference streams[-1].
NGramStreams streams_with_dummy;
streams_with_dummy.InitWithDummy(positions);
@ -40,4 +40,4 @@ template <class Callback, class Compare> void JointOrder(const ChainPositions &p
}} // namespaces
#endif // LM_BUILDER_JOINT_ORDER__
#endif // LM_BUILDER_JOINT_ORDER_H


@ -1,4 +1,5 @@
#include "lm/builder/pipeline.hh"
#include "lm/lm_exception.hh"
#include "util/file.hh"
#include "util/file_piece.hh"
#include "util/usage.hh"
@ -7,6 +8,7 @@
#include <boost/program_options.hpp>
#include <boost/version.hpp>
#include <vector>
namespace {
class SizeNotify {
@ -25,6 +27,46 @@ boost::program_options::typed_value<std::string> *SizeOption(std::size_t &to, co
return boost::program_options::value<std::string>()->notifier(SizeNotify(to))->default_value(default_value);
}
// Parse and validate pruning thresholds, then return a vector of threshold
// counts, one for each n-gram order.
std::vector<uint64_t> ParsePruning(const std::vector<std::string> &param, std::size_t order) {
// convert to vector of integers
std::vector<uint64_t> prune_thresholds;
prune_thresholds.reserve(order);
std::cerr << "Pruning ";
for (std::vector<std::string>::const_iterator it(param.begin()); it != param.end(); ++it) {
try {
prune_thresholds.push_back(boost::lexical_cast<uint64_t>(*it));
} catch(const boost::bad_lexical_cast &) {
UTIL_THROW(util::Exception, "Bad pruning threshold " << *it);
}
}
// Fill with zeros by default.
if (prune_thresholds.empty()) {
prune_thresholds.resize(order, 0);
return prune_thresholds;
}
// Validate the pruning thresholds if specified:
// throw if more thresholds were given than the model has n-gram orders.
UTIL_THROW_IF(prune_thresholds.size() > order, util::Exception, "You specified pruning thresholds for orders 1 through " << prune_thresholds.size() << " but the model only has order " << order);
// threshold for unigram can only be 0 (no pruning)
UTIL_THROW_IF(prune_thresholds[0] != 0, util::Exception, "Unigram pruning is not implemented, so the first pruning threshold must be 0.");
// Check that the thresholds are in non-decreasing order.
uint64_t lower_threshold = 0;
for (std::vector<uint64_t>::iterator it = prune_thresholds.begin(); it != prune_thresholds.end(); ++it) {
UTIL_THROW_IF(lower_threshold > *it, util::Exception, "Pruning thresholds should be in non-decreasing order. Otherwise substrings would be removed, which is bad for query-time data structures.");
lower_threshold = *it;
}
// Pad to all orders using the last value.
prune_thresholds.resize(order, prune_thresholds.back());
return prune_thresholds;
}
} // namespace
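As a worked example of the validation and padding above (option names as defined in this file; the binary name is assumed to be the usual lmplz built from this main): with --order 5 and --prune 0 0 1, the parsed vector {0, 0, 1} is padded with its last value to {0, 0, 1, 1, 1}, so trigrams and longer n-grams with adjusted count <= 1 are dropped, e.g. lmplz -o 5 --prune 0 0 1 <text >model.arpa. By contrast, --prune 1 0 0 is rejected twice over: the unigram threshold must be 0 and the sequence must be non-decreasing.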
int main(int argc, char *argv[]) {
@ -34,25 +76,30 @@ int main(int argc, char *argv[]) {
lm::builder::PipelineConfig pipeline;
std::string text, arpa;
std::vector<std::string> pruning;
options.add_options()
("help", po::bool_switch(), "Show this help message")
("help,h", po::bool_switch(), "Show this help message")
("order,o", po::value<std::size_t>(&pipeline.order)
#if BOOST_VERSION >= 104200
->required()
#endif
, "Order of the model")
("interpolate_unigrams", po::bool_switch(&pipeline.initial_probs.interpolate_unigrams), "Interpolate the unigrams (default: emulate SRILM by not interpolating)")
("skip_symbols", po::bool_switch(), "Treat <s>, </s>, and <unk> as whitespace instead of throwing an exception")
("temp_prefix,T", po::value<std::string>(&pipeline.sort.temp_prefix)->default_value("/tmp/lm"), "Temporary file prefix")
("memory,S", SizeOption(pipeline.sort.total_memory, util::GuessPhysicalMemory() ? "80%" : "1G"), "Sorting memory")
("minimum_block", SizeOption(pipeline.minimum_block, "8K"), "Minimum block size to allow")
("sort_block", SizeOption(pipeline.sort.buffer_size, "64M"), "Size of IO operations for sort (determines arity)")
("vocab_estimate", po::value<lm::WordIndex>(&pipeline.vocab_estimate)->default_value(1000000), "Assume this vocabulary size for purposes of calculating memory in step 1 (corpus count) and pre-sizing the hash table")
("block_count", po::value<std::size_t>(&pipeline.block_count)->default_value(2), "Block count (per order)")
("vocab_file", po::value<std::string>(&pipeline.vocab_file)->default_value(""), "Location to write vocabulary file")
("vocab_estimate", po::value<lm::WordIndex>(&pipeline.vocab_estimate)->default_value(1000000), "Assume this vocabulary size for purposes of calculating memory in step 1 (corpus count) and pre-sizing the hash table")
("vocab_file", po::value<std::string>(&pipeline.vocab_file)->default_value(""), "Location to write a file containing the unique vocabulary strings delimited by null bytes")
("vocab_pad", po::value<uint64_t>(&pipeline.vocab_size_for_unk)->default_value(0), "If the vocabulary is smaller than this value, pad with <unk> to reach this size. Requires --interpolate_unigrams")
("verbose_header", po::bool_switch(&pipeline.verbose_header), "Add a verbose header to the ARPA file that includes information such as token count, smoothing type, etc.")
("text", po::value<std::string>(&text), "Read text from a file instead of stdin")
("arpa", po::value<std::string>(&arpa), "Write ARPA to a file instead of stdout");
("arpa", po::value<std::string>(&arpa), "Write ARPA to a file instead of stdout")
("prune", po::value<std::vector<std::string> >(&pruning)->multitoken(), "Prune n-grams with count less than or equal to the given threshold. Specify one value for each order i.e. 0 0 1 to prune singleton trigrams and above. The sequence of values must be non-decreasing and the last value applies to any remaining orders. Unigram pruning is not implemented, so the first value must be zero. Default is to not prune, which is equivalent to --prune 0.");
po::variables_map vm;
po::store(po::parse_command_line(argc, argv, options), vm);
@ -95,6 +142,20 @@ int main(int argc, char *argv[]) {
}
#endif
if (pipeline.vocab_size_for_unk && !pipeline.initial_probs.interpolate_unigrams) {
std::cerr << "--vocab_pad requires --interpolate_unigrams" << std::endl;
return 1;
}
if (vm["skip_symbols"].as<bool>()) {
pipeline.disallowed_symbol_action = lm::COMPLAIN;
} else {
pipeline.disallowed_symbol_action = lm::THROW_UP;
}
// parse pruning thresholds. These depend on order, so it is not done as a notifier.
pipeline.prune_thresholds = ParsePruning(pruning, pipeline.order);
util::NormalizeTempPrefix(pipeline.sort.temp_prefix);
lm::builder::InitialProbabilitiesConfig &initial = pipeline.initial_probs;


@ -1,180 +0,0 @@
#ifndef LM_BUILDER_MULTI_STREAM__
#define LM_BUILDER_MULTI_STREAM__
#include "lm/builder/ngram_stream.hh"
#include "util/scoped.hh"
#include "util/stream/chain.hh"
#include <cstddef>
#include <new>
#include <assert.h>
#include <stdlib.h>
namespace lm { namespace builder {
template <class T> class FixedArray {
public:
explicit FixedArray(std::size_t count) {
Init(count);
}
FixedArray() : newed_end_(NULL) {}
void Init(std::size_t count) {
assert(!block_.get());
block_.reset(malloc(sizeof(T) * count));
if (!block_.get()) throw std::bad_alloc();
newed_end_ = begin();
}
FixedArray(const FixedArray &from) {
std::size_t size = from.newed_end_ - static_cast<const T*>(from.block_.get());
Init(size);
for (std::size_t i = 0; i < size; ++i) {
new(end()) T(from[i]);
Constructed();
}
}
~FixedArray() { clear(); }
T *begin() { return static_cast<T*>(block_.get()); }
const T *begin() const { return static_cast<const T*>(block_.get()); }
// Always call Constructed after successful completion of new.
T *end() { return newed_end_; }
const T *end() const { return newed_end_; }
T &back() { return *(end() - 1); }
const T &back() const { return *(end() - 1); }
std::size_t size() const { return end() - begin(); }
bool empty() const { return begin() == end(); }
T &operator[](std::size_t i) { return begin()[i]; }
const T &operator[](std::size_t i) const { return begin()[i]; }
template <class C> void push_back(const C &c) {
new (end()) T(c);
Constructed();
}
void clear() {
for (T *i = begin(); i != end(); ++i)
i->~T();
newed_end_ = begin();
}
protected:
void Constructed() {
++newed_end_;
}
private:
util::scoped_malloc block_;
T *newed_end_;
};
class Chains;
class ChainPositions : public FixedArray<util::stream::ChainPosition> {
public:
ChainPositions() {}
void Init(Chains &chains);
explicit ChainPositions(Chains &chains) {
Init(chains);
}
};
class Chains : public FixedArray<util::stream::Chain> {
private:
template <class T, void (T::*ptr)(const ChainPositions &) = &T::Run> struct CheckForRun {
typedef Chains type;
};
public:
explicit Chains(std::size_t limit) : FixedArray<util::stream::Chain>(limit) {}
template <class Worker> typename CheckForRun<Worker>::type &operator>>(const Worker &worker) {
threads_.push_back(new util::stream::Thread(ChainPositions(*this), worker));
return *this;
}
template <class Worker> typename CheckForRun<Worker>::type &operator>>(const boost::reference_wrapper<Worker> &worker) {
threads_.push_back(new util::stream::Thread(ChainPositions(*this), worker));
return *this;
}
Chains &operator>>(const util::stream::Recycler &recycler) {
for (util::stream::Chain *i = begin(); i != end(); ++i)
*i >> recycler;
return *this;
}
void Wait(bool release_memory = true) {
threads_.clear();
for (util::stream::Chain *i = begin(); i != end(); ++i) {
i->Wait(release_memory);
}
}
private:
boost::ptr_vector<util::stream::Thread> threads_;
Chains(const Chains &);
void operator=(const Chains &);
};
inline void ChainPositions::Init(Chains &chains) {
FixedArray<util::stream::ChainPosition>::Init(chains.size());
for (util::stream::Chain *i = chains.begin(); i != chains.end(); ++i) {
new (end()) util::stream::ChainPosition(i->Add()); Constructed();
}
}
inline Chains &operator>>(Chains &chains, ChainPositions &positions) {
positions.Init(chains);
return chains;
}
class NGramStreams : public FixedArray<NGramStream> {
public:
NGramStreams() {}
// This puts a dummy NGramStream at the beginning (useful to algorithms that need to reference something at the beginning).
void InitWithDummy(const ChainPositions &positions) {
FixedArray<NGramStream>::Init(positions.size() + 1);
new (end()) NGramStream(); Constructed();
for (const util::stream::ChainPosition *i = positions.begin(); i != positions.end(); ++i) {
push_back(*i);
}
}
// Limit restricts to positions[0,limit)
void Init(const ChainPositions &positions, std::size_t limit) {
FixedArray<NGramStream>::Init(limit);
for (const util::stream::ChainPosition *i = positions.begin(); i != positions.begin() + limit; ++i) {
push_back(*i);
}
}
void Init(const ChainPositions &positions) {
Init(positions, positions.size());
}
NGramStreams(const ChainPositions &positions) {
Init(positions);
}
};
inline Chains &operator>>(Chains &chains, NGramStreams &streams) {
ChainPositions positions;
chains >> positions;
streams.Init(positions);
return chains;
}
}} // namespaces
#endif // LM_BUILDER_MULTI_STREAM__


@ -1,5 +1,5 @@
#ifndef LM_BUILDER_NGRAM__
#define LM_BUILDER_NGRAM__
#ifndef LM_BUILDER_NGRAM_H
#define LM_BUILDER_NGRAM_H
#include "lm/weights.hh"
#include "lm/word_index.hh"
@ -26,7 +26,7 @@ union Payload {
class NGram {
public:
NGram(void *begin, std::size_t order)
: begin_(static_cast<WordIndex*>(begin)), end_(begin_ + order) {}
const uint8_t *Base() const { return reinterpret_cast<const uint8_t*>(begin_); }
@ -38,12 +38,12 @@ class NGram {
end_ = begin_ + difference;
}
// Would do operator++ but that can get confusing for a stream.
void NextInMemory() {
ReBase(&Value() + 1);
}
// Lower-case in deference to STL.
const WordIndex *begin() const { return begin_; }
WordIndex *begin() { return begin_; }
const WordIndex *end() const { return end_; }
@ -61,7 +61,7 @@ class NGram {
return order * sizeof(WordIndex) + sizeof(Payload);
}
std::size_t TotalSize() const {
// Compiler should optimize this.
return TotalSize(Order());
}
static std::size_t OrderFromSize(std::size_t size) {
@ -69,6 +69,31 @@ class NGram {
assert(size == TotalSize(ret));
return ret;
}
// Manipulate the most significant bit of the count to signal that this n-gram can be pruned.
/*mjd**********************************************************************/
bool IsMarked() const {
return Value().count >> (sizeof(Value().count) * 8 - 1);
}
void Mark() {
Value().count |= (1ul << (sizeof(Value().count) * 8 - 1));
}
void Unmark() {
Value().count &= ~(1ul << (sizeof(Value().count) * 8 - 1));
}
uint64_t UnmarkedCount() const {
return Value().count & ~(1ul << (sizeof(Value().count) * 8 - 1));
}
uint64_t CutoffCount() const {
return IsMarked() ? 0 : UnmarkedCount();
}
/*mjd**********************************************************************/
private:
WordIndex *begin_, *end_;
@ -81,4 +106,4 @@ const WordIndex kEOS = 2;
} // namespace builder
} // namespace lm
#endif // LM_BUILDER_NGRAM__
#endif // LM_BUILDER_NGRAM_H
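The Mark()/UnmarkedCount()/CutoffCount() methods added above pack a prune flag into the top bit of the 64-bit count. A minimal standalone sketch of that bit trick (an illustration, not code from the patch):

#include <cassert>
#include <stdint.h>

int main() {
  const uint64_t kMsb = 1ull << 63;
  uint64_t count = 5;                        // raw adjusted count
  count |= kMsb;                             // Mark(): flag the n-gram as prunable
  bool marked = (count >> 63) != 0;          // IsMarked()
  uint64_t unmarked = count & ~kMsb;         // UnmarkedCount() still yields 5
  uint64_t cutoff = marked ? 0 : unmarked;   // CutoffCount() becomes 0 once marked
  assert(marked && unmarked == 5 && cutoff == 0);
  return 0;
}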


@ -1,8 +1,9 @@
#ifndef LM_BUILDER_NGRAM_STREAM__
#define LM_BUILDER_NGRAM_STREAM__
#ifndef LM_BUILDER_NGRAM_STREAM_H
#define LM_BUILDER_NGRAM_STREAM_H
#include "lm/builder/ngram.hh"
#include "util/stream/chain.hh"
#include "util/stream/multi_stream.hh"
#include "util/stream/stream.hh"
#include <cstddef>
@ -51,5 +52,7 @@ inline util::stream::Chain &operator>>(util::stream::Chain &chain, NGramStream &
return chain;
}
typedef util::stream::GenericStreams<NGramStream> NGramStreams;
}} // namespaces
#endif // LM_BUILDER_NGRAM_STREAM__
#endif // LM_BUILDER_NGRAM_STREAM_H


@ -2,6 +2,7 @@
#include "lm/builder/adjust_counts.hh"
#include "lm/builder/corpus_count.hh"
#include "lm/builder/hash_gamma.hh"
#include "lm/builder/initial_probabilities.hh"
#include "lm/builder/interpolate.hh"
#include "lm/builder/print.hh"
@ -20,10 +21,13 @@
namespace lm { namespace builder {
namespace {
void PrintStatistics(const std::vector<uint64_t> &counts, const std::vector<Discount> &discounts) {
void PrintStatistics(const std::vector<uint64_t> &counts, const std::vector<uint64_t> &counts_pruned, const std::vector<Discount> &discounts) {
std::cerr << "Statistics:\n";
for (size_t i = 0; i < counts.size(); ++i) {
std::cerr << (i + 1) << ' ' << counts[i];
std::cerr << (i + 1) << ' ' << counts_pruned[i];
if(counts[i] != counts_pruned[i])
std::cerr << "/" << counts[i];
for (size_t d = 1; d <= 3; ++d)
std::cerr << " D" << d << (d == 3 ? "+=" : "=") << discounts[i].amount[d];
std::cerr << '\n';
@ -39,7 +43,7 @@ class Master {
const PipelineConfig &Config() const { return config_; }
Chains &MutableChains() { return chains_; }
util::stream::Chains &MutableChains() { return chains_; }
template <class T> Master &operator>>(const T &worker) {
chains_ >> worker;
@ -64,7 +68,7 @@ class Master {
}
// For initial probabilities, but this is generic.
void SortAndReadTwice(const std::vector<uint64_t> &counts, Sorts<ContextOrder> &sorts, Chains &second, util::stream::ChainConfig second_config) {
void SortAndReadTwice(const std::vector<uint64_t> &counts, Sorts<ContextOrder> &sorts, util::stream::Chains &second, util::stream::ChainConfig second_config) {
// Do merge first before allocating chain memory.
for (std::size_t i = 1; i < config_.order; ++i) {
sorts[i - 1].Merge(0);
@ -198,9 +202,9 @@ class Master {
PipelineConfig config_;
Chains chains_;
util::stream::Chains chains_;
// Often only unigrams, but sometimes all orders.
FixedArray<util::stream::FileBuffer> files_;
util::FixedArray<util::stream::FileBuffer> files_;
};
void CountText(int text_file /* input */, int vocab_file /* output */, Master &master, uint64_t &token_count, std::string &text_file_name) {
@ -221,7 +225,7 @@ void CountText(int text_file /* input */, int vocab_file /* output */, Master &m
WordIndex type_count = config.vocab_estimate;
util::FilePiece text(text_file, NULL, &std::cerr);
text_file_name = text.FileName();
CorpusCount counter(text, vocab_file, token_count, type_count, chain.BlockSize() / chain.EntrySize());
CorpusCount counter(text, vocab_file, token_count, type_count, chain.BlockSize() / chain.EntrySize(), config.disallowed_symbol_action);
chain >> boost::ref(counter);
util::stream::Sort<SuffixOrder, AddCombiner> sorter(chain, config.sort, SuffixOrder(config.order), AddCombiner());
@ -231,21 +235,22 @@ void CountText(int text_file /* input */, int vocab_file /* output */, Master &m
master.InitForAdjust(sorter, type_count);
}
void InitialProbabilities(const std::vector<uint64_t> &counts, const std::vector<Discount> &discounts, Master &master, Sorts<SuffixOrder> &primary, FixedArray<util::stream::FileBuffer> &gammas) {
void InitialProbabilities(const std::vector<uint64_t> &counts, const std::vector<uint64_t> &counts_pruned, const std::vector<Discount> &discounts, Master &master, Sorts<SuffixOrder> &primary,
util::FixedArray<util::stream::FileBuffer> &gammas, const std::vector<uint64_t> &prune_thresholds) {
const PipelineConfig &config = master.Config();
Chains second(config.order);
util::stream::Chains second(config.order);
{
Sorts<ContextOrder> sorts;
master.SetupSorts(sorts);
PrintStatistics(counts, discounts);
lm::ngram::ShowSizes(counts);
PrintStatistics(counts, counts_pruned, discounts);
lm::ngram::ShowSizes(counts_pruned);
std::cerr << "=== 3/5 Calculating and sorting initial probabilities ===" << std::endl;
master.SortAndReadTwice(counts, sorts, second, config.initial_probs.adder_in);
master.SortAndReadTwice(counts_pruned, sorts, second, config.initial_probs.adder_in);
}
Chains gamma_chains(config.order);
InitialProbabilities(config.initial_probs, discounts, master.MutableChains(), second, gamma_chains);
util::stream::Chains gamma_chains(config.order);
InitialProbabilities(config.initial_probs, discounts, master.MutableChains(), second, gamma_chains, prune_thresholds);
// Don't care about gamma for 0.
gamma_chains[0] >> util::stream::kRecycle;
gammas.Init(config.order - 1);
@ -257,19 +262,25 @@ void InitialProbabilities(const std::vector<uint64_t> &counts, const std::vector
master.SetupSorts(primary);
}
void InterpolateProbabilities(const std::vector<uint64_t> &counts, Master &master, Sorts<SuffixOrder> &primary, FixedArray<util::stream::FileBuffer> &gammas) {
void InterpolateProbabilities(const std::vector<uint64_t> &counts, Master &master, Sorts<SuffixOrder> &primary, util::FixedArray<util::stream::FileBuffer> &gammas) {
std::cerr << "=== 4/5 Calculating and writing order-interpolated probabilities ===" << std::endl;
const PipelineConfig &config = master.Config();
master.MaximumLazyInput(counts, primary);
Chains gamma_chains(config.order - 1);
util::stream::ChainConfig read_backoffs(config.read_backoffs);
read_backoffs.entry_size = sizeof(float);
util::stream::Chains gamma_chains(config.order - 1);
for (std::size_t i = 0; i < config.order - 1; ++i) {
util::stream::ChainConfig read_backoffs(config.read_backoffs);
// Add 1 because here we are skipping unigrams
if(config.prune_thresholds[i + 1] > 0)
read_backoffs.entry_size = sizeof(HashGamma);
else
read_backoffs.entry_size = sizeof(float);
gamma_chains.push_back(read_backoffs);
gamma_chains.back() >> gammas[i].Source();
}
master >> Interpolate(counts[0], ChainPositions(gamma_chains));
master >> Interpolate(std::max(master.Config().vocab_size_for_unk, counts[0] - 1 /* <s> is not included */), util::stream::ChainPositions(gamma_chains), config.prune_thresholds);
gamma_chains >> util::stream::kRecycle;
master.BufferFinal(counts);
}
@ -301,21 +312,22 @@ void Pipeline(PipelineConfig config, int text_file, int out_arpa) {
CountText(text_file, vocab_file.get(), master, token_count, text_file_name);
std::vector<uint64_t> counts;
std::vector<uint64_t> counts_pruned;
std::vector<Discount> discounts;
master >> AdjustCounts(counts, discounts);
master >> AdjustCounts(counts, counts_pruned, discounts, config.prune_thresholds);
{
FixedArray<util::stream::FileBuffer> gammas;
util::FixedArray<util::stream::FileBuffer> gammas;
Sorts<SuffixOrder> primary;
InitialProbabilities(counts, discounts, master, primary, gammas);
InterpolateProbabilities(counts, master, primary, gammas);
InitialProbabilities(counts, counts_pruned, discounts, master, primary, gammas, config.prune_thresholds);
InterpolateProbabilities(counts_pruned, master, primary, gammas);
}
std::cerr << "=== 5/5 Writing ARPA model ===" << std::endl;
VocabReconstitute vocab(vocab_file.get());
UTIL_THROW_IF(vocab.Size() != counts[0], util::Exception, "Vocab words don't match up. Is there a null byte in the input?");
HeaderInfo header_info(text_file_name, token_count);
master >> PrintARPA(vocab, counts, (config.verbose_header ? &header_info : NULL), out_arpa) >> util::stream::kRecycle;
master >> PrintARPA(vocab, counts_pruned, (config.verbose_header ? &header_info : NULL), out_arpa) >> util::stream::kRecycle;
master.MutableChains().Wait(true);
}


@ -1,8 +1,9 @@
#ifndef LM_BUILDER_PIPELINE__
#define LM_BUILDER_PIPELINE__
#ifndef LM_BUILDER_PIPELINE_H
#define LM_BUILDER_PIPELINE_H
#include "lm/builder/initial_probabilities.hh"
#include "lm/builder/header_info.hh"
#include "lm/lm_exception.hh"
#include "lm/word_index.hh"
#include "util/stream/config.hh"
#include "util/file_piece.hh"
@ -30,6 +31,28 @@ struct PipelineConfig {
// Number of blocks to use. This will be overridden to 1 if everything fits.
std::size_t block_count;
// n-gram count thresholds for pruning. A value of 0 means no pruning for
// the corresponding n-gram order.
std::vector<uint64_t> prune_thresholds; //mjd
/* Computing the perplexity of LMs with different vocabularies is hard. For
* example, the lowest perplexity is attained by a unigram model that
* predicts p(<unk>) = 1 and has no other vocabulary. Also, linearly
* interpolated models will sum to more than 1 because <unk> is duplicated
* (SRI just pretends p(<unk>) = 0 for these purposes, which makes it sum to
* 1 but comes with its own problems). This option will make the vocabulary
* a particular size by replicating <unk> multiple times for purposes of
* computing vocabulary size. It has no effect if the actual vocabulary is
* larger. This parameter serves the same purpose as IRSTLM's "dub".
*/
uint64_t vocab_size_for_unk;
/* What to do the first time <s>, </s>, or <unk> appears in the input. If
* this is anything but THROW_UP, then the symbol will always be treated as
* whitespace.
*/
WarningAction disallowed_symbol_action;
const std::string &TempPrefix() const { return sort.temp_prefix; }
std::size_t TotalMemory() const { return sort.total_memory; }
};
@ -38,4 +61,4 @@ struct PipelineConfig {
void Pipeline(PipelineConfig config, int text_file, int out_arpa);
}} // namespaces
#endif // LM_BUILDER_PIPELINE__
#endif // LM_BUILDER_PIPELINE_H


@ -42,14 +42,14 @@ PrintARPA::PrintARPA(const VocabReconstitute &vocab, const std::vector<uint64_t>
util::WriteOrThrow(out_fd, as_string.data(), as_string.size());
}
void PrintARPA::Run(const ChainPositions &positions) {
void PrintARPA::Run(const util::stream::ChainPositions &positions) {
util::scoped_fd closer(out_fd_);
UTIL_TIMER("(%w s) Wrote ARPA file\n");
util::FakeOFStream out(out_fd_);
for (unsigned order = 1; order <= positions.size(); ++order) {
out << "\\" << order << "-grams:" << '\n';
for (NGramStream stream(positions[order - 1]); stream; ++stream) {
// Correcting for numerical precision issues. Take that IRST.
out << std::min(0.0f, stream->Value().complete.prob) << '\t' << vocab_.Lookup(*stream->begin());
for (const WordIndex *i = stream->begin() + 1; i != stream->end(); ++i) {
out << ' ' << vocab_.Lookup(*i);
@ -58,6 +58,7 @@ void PrintARPA::Run(const ChainPositions &positions) {
if (backoff != 0.0)
out << '\t' << backoff;
out << '\n';
}
out << '\n';
}


@ -1,8 +1,8 @@
#ifndef LM_BUILDER_PRINT__
#define LM_BUILDER_PRINT__
#ifndef LM_BUILDER_PRINT_H
#define LM_BUILDER_PRINT_H
#include "lm/builder/ngram.hh"
#include "lm/builder/multi_stream.hh"
#include "lm/builder/ngram_stream.hh"
#include "lm/builder/header_info.hh"
#include "util/file.hh"
#include "util/mmap.hh"
@ -59,7 +59,7 @@ template <class V> class Print {
public:
explicit Print(const VocabReconstitute &vocab, std::ostream &to) : vocab_(vocab), to_(to) {}
void Run(const ChainPositions &chains) {
void Run(const util::stream::ChainPositions &chains) {
NGramStreams streams(chains);
for (NGramStream *s = streams.begin(); s != streams.end(); ++s) {
DumpStream(*s);
@ -92,7 +92,7 @@ class PrintARPA {
// Takes ownership of out_fd upon Run().
explicit PrintARPA(const VocabReconstitute &vocab, const std::vector<uint64_t> &counts, const HeaderInfo* header_info, int out_fd);
void Run(const ChainPositions &positions);
void Run(const util::stream::ChainPositions &positions);
private:
const VocabReconstitute &vocab_;
@ -100,4 +100,4 @@ class PrintARPA {
};
}} // namespaces
#endif // LM_BUILDER_PRINT__
#endif // LM_BUILDER_PRINT_H


@ -1,7 +1,7 @@
#ifndef LM_BUILDER_SORT__
#define LM_BUILDER_SORT__
#ifndef LM_BUILDER_SORT_H
#define LM_BUILDER_SORT_H
#include "lm/builder/multi_stream.hh"
#include "lm/builder/ngram_stream.hh"
#include "lm/builder/ngram.hh"
#include "lm/word_index.hh"
#include "util/stream/sort.hh"
@ -14,24 +14,71 @@
namespace lm {
namespace builder {
/**
* Abstract parent class for defining custom n-gram comparators.
*/
template <class Child> class Comparator : public std::binary_function<const void *, const void *, bool> {
public:
/**
* Constructs a comparator capable of comparing two n-grams.
*
* @param order Number of words in each n-gram
*/
explicit Comparator(std::size_t order) : order_(order) {}
/**
* Applies the comparator using the Compare method that must be defined in any class that inherits from this class.
*
* @param lhs A pointer to the n-gram on the left-hand side of the comparison
* @param rhs A pointer to the n-gram on the right-hand side of the comparison
*
* @see ContextOrder::Compare
* @see PrefixOrder::Compare
* @see SuffixOrder::Compare
*/
inline bool operator()(const void *lhs, const void *rhs) const {
return static_cast<const Child*>(this)->Compare(static_cast<const WordIndex*>(lhs), static_cast<const WordIndex*>(rhs));
}
/** Gets the n-gram order defined for this comparator. */
std::size_t Order() const { return order_; }
protected:
std::size_t order_;
};
/**
* N-gram comparator that compares n-grams according to their reverse (suffix) order.
*
* This comparator compares n-grams lexicographically, one word at a time,
* beginning with the last word of each n-gram and ending with the first word of each n-gram.
*
* Some examples of n-gram comparisons as defined by this comparator:
* - a b c == a b c
* - a b c < a b d
* - a b c > a d b
* - a b c > a b b
* - a b c > x a c
* - a b c < x y z
*/
class SuffixOrder : public Comparator<SuffixOrder> {
public:
/**
* Constructs a comparator capable of comparing two n-grams.
*
* @param order Number of words in each n-gram
*/
explicit SuffixOrder(std::size_t order) : Comparator<SuffixOrder>(order) {}
/**
* Compares two n-grams lexicographically, one word at a time,
* beginning with the last word of each n-gram and ending with the first word of each n-gram.
*
* @param lhs A pointer to the n-gram on the left-hand side of the comparison
* @param rhs A pointer to the n-gram on the right-hand side of the comparison
*/
inline bool Compare(const WordIndex *lhs, const WordIndex *rhs) const {
for (std::size_t i = order_ - 1; i != 0; --i) {
if (lhs[i] != rhs[i])
@ -43,10 +90,40 @@ class SuffixOrder : public Comparator<SuffixOrder> {
static const unsigned kMatchOffset = 1;
};
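To make the suffix ordering concrete, a small standalone sketch follows (assumptions: trigrams stored as flat arrays and a local WordIndex typedef; it does not reuse the class above, which needs the surrounding headers):

#include <cassert>
#include <cstddef>
#include <stdint.h>

typedef uint32_t WordIndex;

// Mirrors SuffixOrder::Compare: compare word by word, last word first.
bool SuffixLess(const WordIndex *lhs, const WordIndex *rhs, std::size_t order) {
  for (std::size_t i = order; i-- > 0;) {
    if (lhs[i] != rhs[i]) return lhs[i] < rhs[i];
  }
  return false;
}

int main() {
  const WordIndex abc[] = {1, 2, 3}, abd[] = {1, 2, 4}, xac[] = {9, 1, 3};
  assert(SuffixLess(abc, abd, 3));  // "a b c" < "a b d": last words differ
  assert(SuffixLess(xac, abc, 3));  // "x a c" < "a b c": last words tie, middle decides
  return 0;
}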
/**
* N-gram comparator that compares n-grams according to the reverse (suffix) order of the n-gram context.
*
* This comparator compares n-grams lexicographically, one word at a time,
* beginning with the penultimate word of each n-gram and ending with the first word of each n-gram;
* finally, this comparator compares the last word of each n-gram.
*
* Some examples of n-gram comparisons as defined by this comparator:
* - a b c == a b c
* - a b c < a b d
* - a b c < a d b
* - a b c > a b b
* - a b c > x a c
* - a b c < x y z
*/
class ContextOrder : public Comparator<ContextOrder> {
public:
/**
* Constructs a comparator capable of comparing two n-grams.
*
* @param order Number of words in each n-gram
*/
explicit ContextOrder(std::size_t order) : Comparator<ContextOrder>(order) {}
/**
* Compares two n-grams lexicographically, one word at a time,
* beginning with the penultimate word of each n-gram and ending with the first word of each n-gram;
* finally, this comparator compares the last word of each n-gram.
*
* @param lhs A pointer to the n-gram on the left-hand side of the comparison
* @param rhs A pointer to the n-gram on the right-hand side of the comparison
*/
inline bool Compare(const WordIndex *lhs, const WordIndex *rhs) const {
for (int i = order_ - 2; i >= 0; --i) {
if (lhs[i] != rhs[i])
@ -56,10 +133,37 @@ class ContextOrder : public Comparator<ContextOrder> {
}
};
/**
* N-gram comparator that compares n-grams according to their natural (prefix) order.
*
* This comparator compares n-grams lexicographically, one word at a time,
* beginning with the first word of each n-gram and ending with the last word of each n-gram.
*
* Some examples of n-gram comparisons as defined by this comparator:
* - a b c == a b c
* - a b c < a b d
* - a b c < a d b
* - a b c > a b b
* - a b c < x a c
* - a b c < x y z
*/
class PrefixOrder : public Comparator<PrefixOrder> {
public:
/**
* Constructs a comparator capable of comparing two n-grams.
*
* @param order Number of words in each n-gram
*/
explicit PrefixOrder(std::size_t order) : Comparator<PrefixOrder>(order) {}
/**
* Compares two n-grams lexicographically, one word at a time,
* beginning with the first word of each n-gram and ending with the last word of each n-gram.
*
* @param lhs A pointer to the n-gram on the left-hand side of the comparison
* @param rhs A pointer to the n-gram on the right-hand side of the comparison
*/
inline bool Compare(const WordIndex *lhs, const WordIndex *rhs) const {
for (std::size_t i = 0; i < order_; ++i) {
if (lhs[i] != rhs[i])
@ -84,15 +188,52 @@ struct AddCombiner {
};
// The combiner is only used on a single chain, so I didn't bother to allow
// that template.
template <class Compare> class Sorts : public FixedArray<util::stream::Sort<Compare> > {
/**
* Represents an @ref util::FixedArray "array" capable of storing @ref util::stream::Sort "Sort" objects.
*
* In the anticipated use case, an instance of this class will maintain one @ref util::stream::Sort "Sort" object
* for each n-gram order (ranging from 1 up to the maximum n-gram order being processed).
* Used in this manner, it enables the n-grams of each order to be sorted in parallel.
*
* @tparam Compare An @ref Comparator "ngram comparator" to use during sorting.
*/
template <class Compare> class Sorts : public util::FixedArray<util::stream::Sort<Compare> > {
private:
typedef util::stream::Sort<Compare> S;
typedef FixedArray<S> P;
typedef util::FixedArray<S> P;
public:
/**
* Constructs, but does not initialize.
*
* @ref util::FixedArray::Init() "Init" must be called before use.
*
* @see util::FixedArray::Init()
*/
Sorts() {}
/**
* Constructs an @ref util::FixedArray "array" capable of storing a fixed number of @ref util::stream::Sort "Sort" objects.
*
* @param number The maximum number of @ref util::stream::Sort "sorters" that can be held by this @ref util::FixedArray "array"
* @see util::FixedArray::FixedArray()
*/
explicit Sorts(std::size_t number) : util::FixedArray<util::stream::Sort<Compare> >(number) {}
/**
* Constructs a new @ref util::stream::Sort "Sort" object which is stored in this @ref util::FixedArray "array".
*
* The new @ref util::stream::Sort "Sort" object is constructed using the provided @ref util::stream::SortConfig "SortConfig" and @ref Comparator "ngram comparator";
* once constructed, a new worker @ref util::stream::Thread "thread" (owned by the @ref util::stream::Chain "chain") will sort the n-gram data stored
* in the @ref util::stream::Block "blocks" of the provided @ref util::stream::Chain "chain".
*
* @see util::stream::Sort::Sort()
* @see util::stream::Chain::operator>>()
*/
void push_back(util::stream::Chain &chain, const util::stream::SortConfig &config, const Compare &compare) {
new (P::end()) S(chain, config, compare);
new (P::end()) S(chain, config, compare); // use "placement new" syntax to initialize S in an already-allocated memory location
P::Constructed();
}
};
@ -100,4 +241,4 @@ template <class Compare> class Sorts : public FixedArray<util::stream::Sort<Comp
} // namespace builder
} // namespace lm
#endif // LM_BUILDER_SORT__
#endif // LM_BUILDER_SORT_H


@ -1,5 +1,5 @@
#ifndef LM_CONFIG__
#define LM_CONFIG__
#ifndef LM_CONFIG_H
#define LM_CONFIG_H
#include "lm/lm_exception.hh"
#include "util/mmap.hh"
@ -120,4 +120,4 @@ struct Config {
} /* namespace ngram */ } /* namespace lm */
#endif // LM_CONFIG__
#endif // LM_CONFIG_H


@ -1,5 +1,5 @@
#ifndef LM_ENUMERATE_VOCAB__
#define LM_ENUMERATE_VOCAB__
#ifndef LM_ENUMERATE_VOCAB_H
#define LM_ENUMERATE_VOCAB_H
#include "lm/word_index.hh"
#include "util/string_piece.hh"
@ -24,5 +24,5 @@ class EnumerateVocab {
} // namespace lm
#endif // LM_ENUMERATE_VOCAB__
#endif // LM_ENUMERATE_VOCAB_H


@ -1,5 +1,5 @@
#ifndef LM_FACADE__
#define LM_FACADE__
#ifndef LM_FACADE_H
#define LM_FACADE_H
#include "lm/virtual_interface.hh"
#include "util/string_piece.hh"
@ -70,4 +70,4 @@ template <class Child, class StateT, class VocabularyT> class ModelFacade : publ
} // namespace base
} // namespace lm
#endif // LM_FACADE__
#endif // LM_FACADE_H


@ -1,5 +1,5 @@
#ifndef LM_FILTER_ARPA_IO__
#define LM_FILTER_ARPA_IO__
#ifndef LM_FILTER_ARPA_IO_H
#define LM_FILTER_ARPA_IO_H
/* Input and output for ARPA format language model files.
*/
#include "lm/read_arpa.hh"
@ -111,4 +111,4 @@ template <class Output> void ReadARPA(util::FilePiece &in_lm, Output &out) {
} // namespace lm
#endif // LM_FILTER_ARPA_IO__
#endif // LM_FILTER_ARPA_IO_H


@ -1,5 +1,5 @@
#ifndef LM_FILTER_COUNT_IO__
#define LM_FILTER_COUNT_IO__
#ifndef LM_FILTER_COUNT_IO_H
#define LM_FILTER_COUNT_IO_H
#include <fstream>
#include <iostream>
@ -86,4 +86,4 @@ template <class Output> void ReadCount(util::FilePiece &in_file, Output &out) {
} // namespace lm
#endif // LM_FILTER_COUNT_IO__
#endif // LM_FILTER_COUNT_IO_H


@ -1,5 +1,5 @@
#ifndef LM_FILTER_FORMAT_H__
#define LM_FILTER_FORMAT_H__
#ifndef LM_FILTER_FORMAT_H
#define LM_FILTER_FORMAT_H
#include "lm/filter/arpa_io.hh"
#include "lm/filter/count_io.hh"
@ -247,4 +247,4 @@ class MultipleOutputBuffer {
} // namespace lm
#endif // LM_FILTER_FORMAT_H__
#endif // LM_FILTER_FORMAT_H


@ -1,5 +1,5 @@
#ifndef LM_FILTER_PHRASE_H__
#define LM_FILTER_PHRASE_H__
#ifndef LM_FILTER_PHRASE_H
#define LM_FILTER_PHRASE_H
#include "util/murmur_hash.hh"
#include "util/string_piece.hh"
@ -165,4 +165,4 @@ class Multiple : public detail::ConditionCommon {
} // namespace phrase
} // namespace lm
#endif // LM_FILTER_PHRASE_H__
#endif // LM_FILTER_PHRASE_H

View File

@ -1,5 +1,5 @@
#ifndef LM_FILTER_THREAD_H__
#define LM_FILTER_THREAD_H__
#ifndef LM_FILTER_THREAD_H
#define LM_FILTER_THREAD_H
#include "util/thread_pool.hh"
@ -164,4 +164,4 @@ template <class Filter, class OutputBuffer, class RealOutput> class Controller :
} // namespace lm
#endif // LM_FILTER_THREAD_H__
#endif // LM_FILTER_THREAD_H

View File

@ -1,5 +1,5 @@
#ifndef LM_FILTER_VOCAB_H__
#define LM_FILTER_VOCAB_H__
#ifndef LM_FILTER_VOCAB_H
#define LM_FILTER_VOCAB_H
// Vocabulary-based filters for language models.
@ -130,4 +130,4 @@ class Multiple {
} // namespace vocab
} // namespace lm
#endif // LM_FILTER_VOCAB_H__
#endif // LM_FILTER_VOCAB_H

View File

@ -1,5 +1,5 @@
#ifndef LM_FILTER_WRAPPER_H__
#define LM_FILTER_WRAPPER_H__
#ifndef LM_FILTER_WRAPPER_H
#define LM_FILTER_WRAPPER_H
#include "util/string_piece.hh"
@ -53,4 +53,4 @@ template <class FilterT> class ContextFilter {
} // namespace lm
#endif // LM_FILTER_WRAPPER_H__
#endif // LM_FILTER_WRAPPER_H

View File

@ -35,8 +35,8 @@
* phrase, even if hypotheses are generated left-to-right.
*/
#ifndef LM_LEFT__
#define LM_LEFT__
#ifndef LM_LEFT_H
#define LM_LEFT_H
#include "lm/max_order.hh"
#include "lm/state.hh"
@ -213,4 +213,4 @@ template <class M> class RuleScore {
} // namespace ngram
} // namespace lm
#endif // LM_LEFT__
#endif // LM_LEFT_H

View File

@ -1,5 +1,5 @@
#ifndef LM_LM_EXCEPTION__
#define LM_LM_EXCEPTION__
#ifndef LM_LM_EXCEPTION_H
#define LM_LM_EXCEPTION_H
// Named to avoid conflict with util/exception.hh.

View File

@ -1,9 +1,13 @@
/* IF YOUR BUILD SYSTEM PASSES -DKENLM_MAX_ORDER, THEN CHANGE THE BUILD SYSTEM.
#ifndef LM_MAX_ORDER_H
#define LM_MAX_ORDER_H
/* IF YOUR BUILD SYSTEM PASSES -DKENLM_MAX_ORDER, THEN CHANGE THE BUILD SYSTEM.
* If not, this is the default maximum order.
* Having this limit means that State can be
* (kMaxOrder - 1) * sizeof(float) bytes instead of
* sizeof(float*) + (kMaxOrder - 1) * sizeof(float) + malloc overhead
*/
#ifndef KENLM_ORDER_MESSAGE
#define KENLM_ORDER_MESSAGE "If your build system supports changing KENLM_MAX_ORDER, change it there and recompile. In the KenLM tarball or Moses, use e.g. `bjam --max-kenlm-order=6 -a'. Otherwise, edit lm/max_order.hh."
#define KENLM_ORDER_MESSAGE "If your build system supports changing KENLM_MAX_ORDER_H, change it there and recompile. In the KenLM tarball or Moses, use e.g. `bjam --max-kenlm-order=6 -a'. Otherwise, edit lm/max_order.hh."
#endif
#endif // LM_MAX_ORDER_H
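The space saving described in the comment is easy to check: with a fixed compile-time maximum order, a state can hold its backoffs inline, whereas an unbounded order would force a pointer plus a separate heap allocation. A small sketch under the assumption kMaxOrder = 6 (the real value comes from KENLM_MAX_ORDER at build time):

#include <cstdio>

const unsigned kMaxOrder = 6;  // assumed here; set via KENLM_MAX_ORDER in the real build

struct InlineBackoffs {        // what a fixed maximum order allows
  float backoff[kMaxOrder - 1];
};
struct HeapBackoffs {          // what an unbounded order would require
  float *backoff;              // plus a separate allocation of (order - 1) floats
};

int main() {
  std::printf("inline: %zu bytes, pointer alone: %zu bytes (plus malloc overhead)\n",
              sizeof(InlineBackoffs), sizeof(HeapBackoffs));
  return 0;
}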

View File

@ -1,5 +1,5 @@
#ifndef LM_MODEL__
#define LM_MODEL__
#ifndef LM_MODEL_H
#define LM_MODEL_H
#include "lm/bhiksha.hh"
#include "lm/binary_format.hh"
@ -153,4 +153,4 @@ base::Model *LoadVirtual(const char *file_name, const Config &config = Config(),
} // namespace ngram
} // namespace lm
#endif // LM_MODEL__
#endif // LM_MODEL_H

View File

@ -1,5 +1,5 @@
#ifndef LM_MODEL_TYPE__
#define LM_MODEL_TYPE__
#ifndef LM_MODEL_TYPE_H
#define LM_MODEL_TYPE_H
namespace lm {
namespace ngram {
@ -20,4 +20,4 @@ const static ModelType kArrayAdd = static_cast<ModelType>(ARRAY_TRIE - TRIE);
} // namespace ngram
} // namespace lm
#endif // LM_MODEL_TYPE__
#endif // LM_MODEL_TYPE_H

View File

@ -1,8 +1,9 @@
#ifndef LM_NGRAM_QUERY__
#define LM_NGRAM_QUERY__
#ifndef LM_NGRAM_QUERY_H
#define LM_NGRAM_QUERY_H
#include "lm/enumerate_vocab.hh"
#include "lm/model.hh"
#include "util/file_piece.hh"
#include "util/usage.hh"
#include <cstdlib>
@ -16,64 +17,94 @@
namespace lm {
namespace ngram {
template <class Model> void Query(const Model &model, bool sentence_context, std::istream &in_stream, std::ostream &out_stream) {
struct BasicPrint {
void Word(StringPiece, WordIndex, const FullScoreReturn &) const {}
void Line(uint64_t oov, float total) const {
std::cout << "Total: " << total << " OOV: " << oov << '\n';
}
void Summary(double, double, uint64_t, uint64_t) {}
};
struct FullPrint : public BasicPrint {
void Word(StringPiece surface, WordIndex vocab, const FullScoreReturn &ret) const {
std::cout << surface << '=' << vocab << ' ' << static_cast<unsigned int>(ret.ngram_length) << ' ' << ret.prob << '\t';
}
void Summary(double ppl_including_oov, double ppl_excluding_oov, uint64_t corpus_oov, uint64_t corpus_tokens) {
std::cout <<
"Perplexity including OOVs:\t" << ppl_including_oov << "\n"
"Perplexity excluding OOVs:\t" << ppl_excluding_oov << "\n"
"OOVs:\t" << corpus_oov << "\n"
"Tokenss:\t" << corpus_tokens << '\n'
;
}
};
template <class Model, class Printer> void Query(const Model &model, bool sentence_context) {
Printer printer;
typename Model::State state, out;
lm::FullScoreReturn ret;
std::string word;
StringPiece word;
util::FilePiece in(0);
double corpus_total = 0.0;
double corpus_total_oov_only = 0.0;
uint64_t corpus_oov = 0;
uint64_t corpus_tokens = 0;
while (in_stream) {
while (true) {
state = sentence_context ? model.BeginSentenceState() : model.NullContextState();
float total = 0.0;
bool got = false;
uint64_t oov = 0;
while (in_stream >> word) {
got = true;
while (in.ReadWordSameLine(word)) {
lm::WordIndex vocab = model.GetVocabulary().Index(word);
if (vocab == 0) ++oov;
ret = model.FullScore(state, vocab, out);
if (vocab == model.GetVocabulary().NotFound()) {
++oov;
corpus_total_oov_only += ret.prob;
}
total += ret.prob;
out_stream << word << '=' << vocab << ' ' << static_cast<unsigned int>(ret.ngram_length) << ' ' << ret.prob << '\t';
printer.Word(word, vocab, ret);
++corpus_tokens;
state = out;
char c;
while (true) {
c = in_stream.get();
if (!in_stream) break;
if (c == '\n') break;
if (!isspace(c)) {
in_stream.unget();
break;
}
}
if (c == '\n') break;
}
if (!got && !in_stream) break;
// If people don't have a newline after their last query, this won't add a </s>.
// Sue me.
try {
UTIL_THROW_IF('\n' != in.get(), util::Exception, "FilePiece is confused.");
} catch (const util::EndOfFileException &e) { break; }
if (sentence_context) {
ret = model.FullScore(state, model.GetVocabulary().EndSentence(), out);
total += ret.prob;
++corpus_tokens;
out_stream << "</s>=" << model.GetVocabulary().EndSentence() << ' ' << static_cast<unsigned int>(ret.ngram_length) << ' ' << ret.prob << '\t';
printer.Word("</s>", model.GetVocabulary().EndSentence(), ret);
}
out_stream << "Total: " << total << " OOV: " << oov << '\n';
printer.Line(oov, total);
corpus_total += total;
corpus_oov += oov;
}
out_stream << "Perplexity " << pow(10.0, -(corpus_total / static_cast<double>(corpus_tokens))) << std::endl;
printer.Summary(
pow(10.0, -(corpus_total / static_cast<double>(corpus_tokens))), // PPL including OOVs
pow(10.0, -((corpus_total - corpus_total_oov_only) / static_cast<double>(corpus_tokens - corpus_oov))), // PPL excluding OOVs
corpus_oov,
corpus_tokens);
}
template <class M> void Query(const char *file, bool sentence_context, std::istream &in_stream, std::ostream &out_stream) {
Config config;
M model(file, config);
Query(model, sentence_context, in_stream, out_stream);
template <class Model> void Query(const char *file, const Config &config, bool sentence_context, bool show_words) {
Model model(file, config);
if (show_words) {
Query<Model, FullPrint>(model, sentence_context);
} else {
Query<Model, BasicPrint>(model, sentence_context);
}
}
} // namespace ngram
} // namespace lm
#endif // LM_NGRAM_QUERY__
#endif // LM_NGRAM_QUERY_H
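Query is now parameterized on a printer policy: any type exposing Word(), Line() and Summary() with the shapes used by BasicPrint and FullPrint can be supplied. A hedged sketch of a hypothetical third printer, not part of this commit, that only reports per-sentence totals on stderr:

#include <stdint.h>
#include <iostream>
#include "lm/model.hh"
#include "lm/ngram_query.hh"

// Hypothetical printer satisfying the informal interface used by Query above.
struct StderrTotalsPrint {
  void Word(StringPiece, lm::WordIndex, const lm::FullScoreReturn &) const {}
  void Line(uint64_t oov, float total) const {
    std::cerr << "sentence log10 prob " << total << " (oov " << oov << ")\n";
  }
  void Summary(double, double, uint64_t, uint64_t) {}
};

int main() {
  lm::ngram::Config config;
  lm::ngram::ProbingModel model("file.arpa", config);  // illustrative model path
  // Reads queries from stdin, one sentence per line, wrapped in <s> ... </s>.
  lm::ngram::Query<lm::ngram::ProbingModel, StderrTotalsPrint>(model, true);
  return 0;
}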

View File

@ -1,5 +1,5 @@
#ifndef LM_PARTIAL__
#define LM_PARTIAL__
#ifndef LM_PARTIAL_H
#define LM_PARTIAL_H
#include "lm/return.hh"
#include "lm/state.hh"
@ -164,4 +164,4 @@ template <class Model> float Subsume(const Model &model, Left &first_left, const
} // namespace ngram
} // namespace lm
#endif // LM_PARTIAL__
#endif // LM_PARTIAL_H

View File

@ -1,5 +1,5 @@
#ifndef LM_QUANTIZE_H__
#define LM_QUANTIZE_H__
#ifndef LM_QUANTIZE_H
#define LM_QUANTIZE_H
#include "lm/blank.hh"
#include "lm/config.hh"
@ -230,4 +230,4 @@ class SeparatelyQuantize {
} // namespace ngram
} // namespace lm
#endif // LM_QUANTIZE_H__
#endif // LM_QUANTIZE_H

View File

@ -1,4 +1,5 @@
#include "lm/ngram_query.hh"
#include "util/getopt.hh"
#ifdef WITH_NPLM
#include "lm/wrappers/nplm.hh"
@ -7,47 +8,76 @@
#include <stdlib.h>
void Usage(const char *name) {
std::cerr << "KenLM was compiled with maximum order " << KENLM_MAX_ORDER << "." << std::endl;
std::cerr << "Usage: " << name << " [-n] lm_file" << std::endl;
std::cerr << "Input is wrapped in <s> and </s> unless -n is passed." << std::endl;
std::cerr <<
"KenLM was compiled with maximum order " << KENLM_MAX_ORDER << ".\n"
"Usage: " << name << " [-n] [-s] lm_file\n"
"-n: Do not wrap the input in <s> and </s>.\n"
"-s: Sentence totals only.\n"
"-l lazy|populate|read|parallel: Load lazily, with populate, or malloc+read\n"
"The default loading method is populate on Linux and read on others.\n";
exit(1);
}
int main(int argc, char *argv[]) {
if (argc == 1 || (argc == 2 && !strcmp(argv[1], "--help")))
Usage(argv[0]);
lm::ngram::Config config;
bool sentence_context = true;
const char *file = NULL;
for (char **arg = argv + 1; arg != argv + argc; ++arg) {
if (!strcmp(*arg, "-n")) {
sentence_context = false;
} else if (!strcmp(*arg, "-h") || !strcmp(*arg, "--help") || file) {
Usage(argv[0]);
} else {
file = *arg;
bool show_words = true;
int opt;
while ((opt = getopt(argc, argv, "hnsl:")) != -1) {
switch (opt) {
case 'n':
sentence_context = false;
break;
case 's':
show_words = false;
break;
case 'l':
if (!strcmp(optarg, "lazy")) {
config.load_method = util::LAZY;
} else if (!strcmp(optarg, "populate")) {
config.load_method = util::POPULATE_OR_READ;
} else if (!strcmp(optarg, "read")) {
config.load_method = util::READ;
} else if (!strcmp(optarg, "parallel")) {
config.load_method = util::PARALLEL_READ;
} else {
Usage(argv[0]);
}
break;
case 'h':
default:
Usage(argv[0]);
}
}
if (!file) Usage(argv[0]);
if (optind + 1 != argc)
Usage(argv[0]);
const char *file = argv[optind];
try {
using namespace lm::ngram;
ModelType model_type;
if (RecognizeBinary(file, model_type)) {
switch(model_type) {
case PROBING:
Query<lm::ngram::ProbingModel>(file, sentence_context, std::cin, std::cout);
Query<lm::ngram::ProbingModel>(file, config, sentence_context, show_words);
break;
case REST_PROBING:
Query<lm::ngram::RestProbingModel>(file, sentence_context, std::cin, std::cout);
Query<lm::ngram::RestProbingModel>(file, config, sentence_context, show_words);
break;
case TRIE:
Query<TrieModel>(file, sentence_context, std::cin, std::cout);
Query<TrieModel>(file, config, sentence_context, show_words);
break;
case QUANT_TRIE:
Query<QuantTrieModel>(file, sentence_context, std::cin, std::cout);
Query<QuantTrieModel>(file, config, sentence_context, show_words);
break;
case ARRAY_TRIE:
Query<ArrayTrieModel>(file, sentence_context, std::cin, std::cout);
Query<ArrayTrieModel>(file, config, sentence_context, show_words);
break;
case QUANT_ARRAY_TRIE:
Query<QuantArrayTrieModel>(file, sentence_context, std::cin, std::cout);
Query<QuantArrayTrieModel>(file, config, sentence_context, show_words);
break;
default:
std::cerr << "Unrecognized kenlm model type " << model_type << std::endl;
@ -56,12 +86,15 @@ int main(int argc, char *argv[]) {
#ifdef WITH_NPLM
} else if (lm::np::Model::Recognize(file)) {
lm::np::Model model(file);
Query(model, sentence_context, std::cin, std::cout);
if (show_words) {
Query<lm::np::Model, lm::ngram::FullPrint>(model, sentence_context);
} else {
Query<lm::np::Model, lm::ngram::BasicPrint>(model, sentence_context);
}
#endif
} else {
Query<ProbingModel>(file, sentence_context, std::cin, std::cout);
Query<ProbingModel>(file, config, sentence_context, show_words);
}
std::cerr << "Total time including destruction:\n";
util::PrintUsage(std::cerr);
} catch (const std::exception &e) {
std::cerr << e.what() << std::endl;

View File

@ -1,5 +1,5 @@
#ifndef LM_READ_ARPA__
#define LM_READ_ARPA__
#ifndef LM_READ_ARPA_H
#define LM_READ_ARPA_H
#include "lm/lm_exception.hh"
#include "lm/word_index.hh"
@ -28,7 +28,7 @@ void ReadEnd(util::FilePiece &in);
extern const bool kARPASpaces[256];
// Positive log probability warning.
class PositiveProbWarn {
public:
PositiveProbWarn() : action_(THROW_UP) {}
@ -41,24 +41,29 @@ class PositiveProbWarn {
WarningAction action_;
};
template <class Voc, class Weights> void Read1Gram(util::FilePiece &f, Voc &vocab, Weights *unigrams, PositiveProbWarn &warn) {
template <class Weights> StringPiece Read1Gram(util::FilePiece &f, Weights &weights, PositiveProbWarn &warn) {
try {
float prob = f.ReadFloat();
if (prob > 0.0) {
warn.Warn(prob);
prob = 0.0;
weights.prob = f.ReadFloat();
if (weights.prob > 0.0) {
warn.Warn(weights.prob);
weights.prob = 0.0;
}
if (f.get() != '\t') UTIL_THROW(FormatLoadException, "Expected tab after probability");
Weights &value = unigrams[vocab.Insert(f.ReadDelimited(kARPASpaces))];
value.prob = prob;
ReadBackoff(f, value);
UTIL_THROW_IF(f.get() != '\t', FormatLoadException, "Expected tab after probability");
StringPiece ret(f.ReadDelimited(kARPASpaces));
ReadBackoff(f, weights);
return ret;
} catch(util::Exception &e) {
e << " in the 1-gram at byte " << f.Offset();
throw;
}
}
// Return true if a positive log probability came out.
template <class Voc, class Weights> void Read1Gram(util::FilePiece &f, Voc &vocab, Weights *unigrams, PositiveProbWarn &warn) {
Weights temp;
WordIndex word = vocab.Insert(Read1Gram(f, temp, warn));
unigrams[word] = temp;
}
template <class Voc, class Weights> void Read1Grams(util::FilePiece &f, std::size_t count, Voc &vocab, Weights *unigrams, PositiveProbWarn &warn) {
ReadNGramHeader(f, 1);
for (std::size_t i = 0; i < count; ++i) {
@ -67,16 +72,16 @@ template <class Voc, class Weights> void Read1Grams(util::FilePiece &f, std::siz
vocab.FinishedLoading(unigrams);
}
// Return true if a positive log probability came out.
template <class Voc, class Weights> void ReadNGram(util::FilePiece &f, const unsigned char n, const Voc &vocab, WordIndex *const reverse_indices, Weights &weights, PositiveProbWarn &warn) {
// Read ngram, write vocab ids to indices_out.
template <class Voc, class Weights, class Iterator> void ReadNGram(util::FilePiece &f, const unsigned char n, const Voc &vocab, Iterator indices_out, Weights &weights, PositiveProbWarn &warn) {
try {
weights.prob = f.ReadFloat();
if (weights.prob > 0.0) {
warn.Warn(weights.prob);
weights.prob = 0.0;
}
for (WordIndex *vocab_out = reverse_indices + n - 1; vocab_out >= reverse_indices; --vocab_out) {
*vocab_out = vocab.Index(f.ReadDelimited(kARPASpaces));
for (unsigned char i = 0; i < n; ++i, ++indices_out) {
*indices_out = vocab.Index(f.ReadDelimited(kARPASpaces));
}
ReadBackoff(f, weights);
} catch(util::Exception &e) {
@ -87,4 +92,4 @@ template <class Voc, class Weights> void ReadNGram(util::FilePiece &f, const uns
} // namespace lm
#endif // LM_READ_ARPA__
#endif // LM_READ_ARPA_H
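ReadNGram now writes vocabulary ids through a caller-supplied output iterator instead of a raw reversed-pointer walk, so callers choose the storage order: passing a std::reverse_iterator (as the hashed and trie builders below do) stores the ids back-to-front with no extra copy. A stand-alone sketch of just that trick, with illustrative names rather than KenLM calls:

#include <cstdio>
#include <iterator>

typedef unsigned int WordIndex;

// Stand-in for ReadNGram's inner loop: writes n ids through any output iterator.
template <class Iterator> void WriteIds(const WordIndex *src, unsigned char n, Iterator out) {
  for (unsigned char i = 0; i < n; ++i, ++out) *out = src[i];
}

int main() {
  const WordIndex parsed[3] = {7, 8, 9};   // ids as they appear left-to-right in the ARPA line
  WordIndex reversed[3];
  // Same loop, reversed storage: exactly what passing vocab_ids.rbegin() achieves above.
  WriteIds(parsed, 3, std::reverse_iterator<WordIndex*>(reversed + 3));
  std::printf("%u %u %u\n", reversed[0], reversed[1], reversed[2]);  // prints 9 8 7
  return 0;
}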

View File

@ -1,5 +1,5 @@
#ifndef LM_RETURN__
#define LM_RETURN__
#ifndef LM_RETURN_H
#define LM_RETURN_H
#include <stdint.h>
@ -39,4 +39,4 @@ struct FullScoreReturn {
};
} // namespace lm
#endif // LM_RETURN__
#endif // LM_RETURN_H

View File

@ -178,7 +178,7 @@ template <class Build, class Activate, class Store> void ReadNGrams(
typename Store::Entry entry;
std::vector<typename Value::Weights *> between;
for (size_t i = 0; i < count; ++i) {
ReadNGram(f, n, vocab, &*vocab_ids.begin(), entry.value, warn);
ReadNGram(f, n, vocab, vocab_ids.rbegin(), entry.value, warn);
build.SetRest(&*vocab_ids.begin(), n, entry.value);
keys[0] = detail::CombineWordHash(static_cast<uint64_t>(vocab_ids.front()), vocab_ids[1]);

View File

@ -1,5 +1,5 @@
#ifndef LM_SEARCH_HASHED__
#define LM_SEARCH_HASHED__
#ifndef LM_SEARCH_HASHED_H
#define LM_SEARCH_HASHED_H
#include "lm/model_type.hh"
#include "lm/config.hh"
@ -189,4 +189,4 @@ template <class Value> class HashedSearch {
} // namespace ngram
} // namespace lm
#endif // LM_SEARCH_HASHED__
#endif // LM_SEARCH_HASHED_H

View File

@ -561,6 +561,7 @@ template <class Quant, class Bhiksha> uint8_t *TrieSearch<Quant, Bhiksha>::Setup
}
// Crazy backwards thing so we initialize using pointers to ones that have already been initialized
for (unsigned char i = counts.size() - 1; i >= 2; --i) {
// use "placement new" syntax to initalize Middle in an already-allocated memory location
new (middle_begin_ + i - 2) Middle(
middle_starts[i-2],
quant_.MiddleBits(config),

View File

@ -1,5 +1,5 @@
#ifndef LM_SEARCH_TRIE__
#define LM_SEARCH_TRIE__
#ifndef LM_SEARCH_TRIE_H
#define LM_SEARCH_TRIE_H
#include "lm/config.hh"
#include "lm/model_type.hh"
@ -127,4 +127,4 @@ template <class Quant, class Bhiksha> class TrieSearch {
} // namespace ngram
} // namespace lm
#endif // LM_SEARCH_TRIE__
#endif // LM_SEARCH_TRIE_H

View File

@ -1,5 +1,5 @@
#ifndef LM_SIZES__
#define LM_SIZES__
#ifndef LM_SIZES_H
#define LM_SIZES_H
#include <vector>
@ -14,4 +14,4 @@ void ShowSizes(const std::vector<uint64_t> &counts);
void ShowSizes(const char *file, const lm::ngram::Config &config);
}} // namespaces
#endif // LM_SIZES__
#endif // LM_SIZES_H

View File

@ -1,5 +1,5 @@
#ifndef LM_STATE__
#define LM_STATE__
#ifndef LM_STATE_H
#define LM_STATE_H
#include "lm/max_order.hh"
#include "lm/word_index.hh"
@ -122,4 +122,4 @@ inline uint64_t hash_value(const ChartState &state) {
} // namespace ngram
} // namespace lm
#endif // LM_STATE__
#endif // LM_STATE_H

View File

@ -99,8 +99,11 @@ template <class Bhiksha> util::BitAddress BitPackedMiddle<Bhiksha>::Find(WordInd
}
template <class Bhiksha> void BitPackedMiddle<Bhiksha>::FinishedLoading(uint64_t next_end, const Config &config) {
uint64_t last_next_write = (insert_index_ + 1) * total_bits_ - bhiksha_.InlineBits();
bhiksha_.WriteNext(base_, last_next_write, insert_index_ + 1, next_end);
// Write at insert_index ...
uint64_t last_next_write = insert_index_ * total_bits_ +
// at the offset where the next pointers are stored.
(total_bits_ - bhiksha_.InlineBits());
bhiksha_.WriteNext(base_, last_next_write, insert_index_, next_end);
bhiksha_.FinishedLoading(config);
}

View File

@ -1,5 +1,5 @@
#ifndef LM_TRIE__
#define LM_TRIE__
#ifndef LM_TRIE_H
#define LM_TRIE_H
#include "lm/weights.hh"
#include "lm/word_index.hh"
@ -143,4 +143,4 @@ class BitPackedLongest : public BitPacked {
} // namespace ngram
} // namespace lm
#endif // LM_TRIE__
#endif // LM_TRIE_H

View File

@ -16,6 +16,7 @@
#include <cstdio>
#include <cstdlib>
#include <deque>
#include <iterator>
#include <limits>
#include <vector>
@ -248,11 +249,13 @@ void SortedFiles::ConvertToSorted(util::FilePiece &f, const SortedVocabulary &vo
uint8_t *out_end = out + std::min(count - done, batch_size) * entry_size;
if (order == counts.size()) {
for (; out != out_end; out += entry_size) {
ReadNGram(f, order, vocab, reinterpret_cast<WordIndex*>(out), *reinterpret_cast<Prob*>(out + words_size), warn);
std::reverse_iterator<WordIndex*> it(reinterpret_cast<WordIndex*>(out) + order);
ReadNGram(f, order, vocab, it, *reinterpret_cast<Prob*>(out + words_size), warn);
}
} else {
for (; out != out_end; out += entry_size) {
ReadNGram(f, order, vocab, reinterpret_cast<WordIndex*>(out), *reinterpret_cast<ProbBackoff*>(out + words_size), warn);
std::reverse_iterator<WordIndex*> it(reinterpret_cast<WordIndex*>(out) + order);
ReadNGram(f, order, vocab, it, *reinterpret_cast<ProbBackoff*>(out + words_size), warn);
}
}
// Sort full records by full n-gram.

View File

@ -1,7 +1,7 @@
// Step of trie builder: create sorted files.
#ifndef LM_TRIE_SORT__
#define LM_TRIE_SORT__
#ifndef LM_TRIE_SORT_H
#define LM_TRIE_SORT_H
#include "lm/max_order.hh"
#include "lm/word_index.hh"
@ -111,4 +111,4 @@ class SortedFiles {
} // namespace ngram
} // namespace lm
#endif // LM_TRIE_SORT__
#endif // LM_TRIE_SORT_H

View File

@ -1,5 +1,5 @@
#ifndef LM_VALUE__
#define LM_VALUE__
#ifndef LM_VALUE_H
#define LM_VALUE_H
#include "lm/model_type.hh"
#include "lm/value_build.hh"
@ -154,4 +154,4 @@ struct RestValue {
} // namespace ngram
} // namespace lm
#endif // LM_VALUE__
#endif // LM_VALUE_H

View File

@ -1,5 +1,5 @@
#ifndef LM_VALUE_BUILD__
#define LM_VALUE_BUILD__
#ifndef LM_VALUE_BUILD_H
#define LM_VALUE_BUILD_H
#include "lm/weights.hh"
#include "lm/word_index.hh"
@ -94,4 +94,4 @@ template <class Model> class LowerRestBuild {
} // namespace ngram
} // namespace lm
#endif // LM_VALUE_BUILD__
#endif // LM_VALUE_BUILD_H

View File

@ -1,5 +1,5 @@
#ifndef LM_VIRTUAL_INTERFACE__
#define LM_VIRTUAL_INTERFACE__
#ifndef LM_VIRTUAL_INTERFACE_H
#define LM_VIRTUAL_INTERFACE_H
#include "lm/return.hh"
#include "lm/word_index.hh"
@ -157,4 +157,4 @@ class Model {
} // namespace base
} // namespace lm
#endif // LM_VIRTUAL_INTERFACE__
#endif // LM_VIRTUAL_INTERFACE_H

View File

@ -170,11 +170,15 @@ struct ProbingVocabularyHeader {
ProbingVocabulary::ProbingVocabulary() : enumerate_(NULL) {}
uint64_t ProbingVocabulary::Size(uint64_t entries, const Config &config) {
return ALIGN8(sizeof(detail::ProbingVocabularyHeader)) + Lookup::Size(entries, config.probing_multiplier);
uint64_t ProbingVocabulary::Size(uint64_t entries, float probing_multiplier) {
return ALIGN8(sizeof(detail::ProbingVocabularyHeader)) + Lookup::Size(entries, probing_multiplier);
}
void ProbingVocabulary::SetupMemory(void *start, std::size_t allocated, std::size_t /*entries*/, const Config &/*config*/) {
uint64_t ProbingVocabulary::Size(uint64_t entries, const Config &config) {
return Size(entries, config.probing_multiplier);
}
void ProbingVocabulary::SetupMemory(void *start, std::size_t allocated) {
header_ = static_cast<detail::ProbingVocabularyHeader*>(start);
lookup_ = Lookup(static_cast<uint8_t*>(start) + ALIGN8(sizeof(detail::ProbingVocabularyHeader)), allocated);
bound_ = 1;
@ -201,12 +205,12 @@ WordIndex ProbingVocabulary::Insert(const StringPiece &str) {
return 0;
} else {
if (enumerate_) enumerate_->Add(bound_, str);
lookup_.Insert(ProbingVocabuaryEntry::Make(hashed, bound_));
lookup_.Insert(ProbingVocabularyEntry::Make(hashed, bound_));
return bound_++;
}
}
void ProbingVocabulary::InternalFinishedLoading() {
void ProbingVocabulary::FinishedLoading() {
lookup_.FinishedInserting();
header_->bound = bound_;
header_->version = kProbingVocabularyVersion;

View File

@ -1,9 +1,11 @@
#ifndef LM_VOCAB__
#define LM_VOCAB__
#ifndef LM_VOCAB_H
#define LM_VOCAB_H
#include "lm/enumerate_vocab.hh"
#include "lm/lm_exception.hh"
#include "lm/virtual_interface.hh"
#include "util/fake_ofstream.hh"
#include "util/murmur_hash.hh"
#include "util/pool.hh"
#include "util/probing_hash_table.hh"
#include "util/sorted_uniform.hh"
@ -104,17 +106,16 @@ class SortedVocabulary : public base::Vocabulary {
#pragma pack(push)
#pragma pack(4)
struct ProbingVocabuaryEntry {
struct ProbingVocabularyEntry {
uint64_t key;
WordIndex value;
typedef uint64_t Key;
uint64_t GetKey() const {
return key;
}
uint64_t GetKey() const { return key; }
void SetKey(uint64_t to) { key = to; }
static ProbingVocabuaryEntry Make(uint64_t key, WordIndex value) {
ProbingVocabuaryEntry ret;
static ProbingVocabularyEntry Make(uint64_t key, WordIndex value) {
ProbingVocabularyEntry ret;
ret.key = key;
ret.value = value;
return ret;
@ -132,13 +133,18 @@ class ProbingVocabulary : public base::Vocabulary {
return lookup_.Find(detail::HashForVocab(str), i) ? i->value : 0;
}
static uint64_t Size(uint64_t entries, float probing_multiplier);
// This just unwraps Config to get the probing_multiplier.
static uint64_t Size(uint64_t entries, const Config &config);
// Vocab words are [0, Bound()).
WordIndex Bound() const { return bound_; }
// Everything else is for populating. I'm too lazy to hide and friend these, but you'll only get a const reference anyway.
void SetupMemory(void *start, std::size_t allocated, std::size_t entries, const Config &config);
void SetupMemory(void *start, std::size_t allocated);
void SetupMemory(void *start, std::size_t allocated, std::size_t /*entries*/, const Config &/*config*/) {
SetupMemory(start, allocated);
}
void Relocate(void *new_start);
@ -147,8 +153,9 @@ class ProbingVocabulary : public base::Vocabulary {
WordIndex Insert(const StringPiece &str);
template <class Weights> void FinishedLoading(Weights * /*reorder_vocab*/) {
InternalFinishedLoading();
FinishedLoading();
}
void FinishedLoading();
std::size_t UnkCountChangePadding() const { return 0; }
@ -157,9 +164,7 @@ class ProbingVocabulary : public base::Vocabulary {
void LoadedBinary(bool have_words, int fd, EnumerateVocab *to, uint64_t offset);
private:
void InternalFinishedLoading();
typedef util::ProbingHashTable<ProbingVocabuaryEntry, util::IdentityHash> Lookup;
typedef util::ProbingHashTable<ProbingVocabularyEntry, util::IdentityHash> Lookup;
Lookup lookup_;
@ -181,7 +186,64 @@ template <class Vocab> void CheckSpecials(const Config &config, const Vocab &voc
if (vocab.EndSentence() == vocab.NotFound()) MissingSentenceMarker(config, "</s>");
}
class WriteUniqueWords {
public:
explicit WriteUniqueWords(int fd) : word_list_(fd) {}
void operator()(const StringPiece &word) {
word_list_ << word << '\0';
}
private:
util::FakeOFStream word_list_;
};
class NoOpUniqueWords {
public:
NoOpUniqueWords() {}
void operator()(const StringPiece &word) {}
};
template <class NewWordAction = NoOpUniqueWords> class GrowableVocab {
public:
static std::size_t MemUsage(WordIndex content) {
return Lookup::MemUsage(content > 2 ? content : 2);
}
// Does not take ownership of new_word_construct
template <class NewWordConstruct> GrowableVocab(WordIndex initial_size, const NewWordConstruct &new_word_construct = NewWordAction())
: lookup_(initial_size), new_word_(new_word_construct) {
FindOrInsert("<unk>"); // Force 0
FindOrInsert("<s>"); // Force 1
FindOrInsert("</s>"); // Force 2
}
WordIndex Index(const StringPiece &str) const {
Lookup::ConstIterator i;
return lookup_.Find(detail::HashForVocab(str), i) ? i->value : 0;
}
WordIndex FindOrInsert(const StringPiece &word) {
ProbingVocabularyEntry entry = ProbingVocabularyEntry::Make(util::MurmurHashNative(word.data(), word.size()), Size());
Lookup::MutableIterator it;
if (!lookup_.FindOrInsert(entry, it)) {
new_word_(word);
UTIL_THROW_IF(Size() >= std::numeric_limits<lm::WordIndex>::max(), VocabLoadException, "Too many vocabulary words. Change WordIndex to uint64_t in lm/word_index.hh");
}
return it->value;
}
WordIndex Size() const { return lookup_.Size(); }
private:
typedef util::AutoProbing<ProbingVocabularyEntry, util::IdentityHash> Lookup;
Lookup lookup_;
NewWordAction new_word_;
};
} // namespace ngram
} // namespace lm
#endif // LM_VOCAB__
#endif // LM_VOCAB_H
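A hedged usage sketch of the new GrowableVocab: the default policy (NoOpUniqueWords) ignores new surface forms, while WriteUniqueWords dumps each unique word, NUL-terminated, to a file descriptor. This assumes only the interfaces declared above; the output path is illustrative.

#include <iostream>
#include "lm/vocab.hh"
#include "util/file.hh"

int main() {
  // Default policy: no side effect when a new word is inserted.
  lm::ngram::NoOpUniqueWords no_op;
  lm::ngram::GrowableVocab<> vocab(64 /*initial size hint*/, no_op);
  lm::WordIndex first = vocab.FindOrInsert("moses");
  lm::WordIndex again = vocab.FindOrInsert("moses");   // second lookup returns the same id
  std::cout << first << ' ' << again << ' ' << vocab.Size() << '\n';

  // Alternative policy: record each unique word as it is first seen.
  util::scoped_fd words(util::CreateOrThrow("words.txt"));  // illustrative output path
  lm::ngram::GrowableVocab<lm::ngram::WriteUniqueWords> logged(64, words.get());
  logged.FindOrInsert("example");
  return 0;
}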

View File

@ -1,5 +1,5 @@
#ifndef LM_WEIGHTS__
#define LM_WEIGHTS__
#ifndef LM_WEIGHTS_H
#define LM_WEIGHTS_H
// Weights for n-grams. Probability and possibly a backoff.
@ -19,4 +19,4 @@ struct RestWeights {
};
} // namespace lm
#endif // LM_WEIGHTS__
#endif // LM_WEIGHTS_H

View File

@ -1,6 +1,6 @@
// Separate header because this is used often.
#ifndef LM_WORD_INDEX__
#define LM_WORD_INDEX__
#ifndef LM_WORD_INDEX_H
#define LM_WORD_INDEX_H
#include <limits.h>

View File

@ -519,7 +519,7 @@ int main(int argc, char** argv)
}
// get reference to feature functions
const vector<FeatureFunction*> &featureFunctions = FeatureFunction::GetFeatureFunctions();
// const vector<FeatureFunction*> &featureFunctions = FeatureFunction::GetFeatureFunctions();
ScoreComponentCollection initialWeights = decoder->getWeights();
if (add2lm != 0) {
@ -665,7 +665,7 @@ int main(int argc, char** argv)
}
// number of weight dumps this epoch
size_t weightMixingThisEpoch = 0;
// size_t weightMixingThisEpoch = 0;
size_t weightEpochDump = 0;
size_t shardPosition = 0;

View File

@ -183,9 +183,9 @@ size_t MiraOptimiser::updateWeightsHopeFear(
// iterate over input sentences (1 (online) or more (batch))
for (size_t i = 0; i < featureValuesHope.size(); ++i) {
if (updatePosition != -1) {
if (i < updatePosition)
if (int(i) < updatePosition)
continue;
else if (i > updatePosition)
else if (int(i) > updatePosition)
break;
}

View File

@ -87,9 +87,15 @@ int main(int argc, char** argv)
c_mask.push_back(0);
}
Phrase e( 0),f(0),c(0);
e.CreateFromString(Output, e_mask, query_e, "|", NULL);
f.CreateFromString(Input, f_mask, query_f, "|", NULL);
c.CreateFromString(Input, c_mask, query_c,"|", NULL);
// e.CreateFromString(Output, e_mask, query_e, "|", NULL);
// f.CreateFromString(Input, f_mask, query_f, "|", NULL);
// c.CreateFromString(Input, c_mask, query_c,"|", NULL);
// Phrase.CreateFromString() calls Word.CreateFromString(), which gets
// the factor delimiter from StaticData, so it should not be hardcoded
// here. [UG], thus:
e.CreateFromString(Output, e_mask, query_e, NULL);
f.CreateFromString(Input, f_mask, query_f, NULL);
c.CreateFromString(Input, c_mask, query_c, NULL);
LexicalReorderingTable* table;
if(FileExists(inFilePath+".binlexr.idx")) {
std::cerr << "Loading binary table...\n";

View File

@ -55,7 +55,10 @@ int main(int argc, char **argv)
std::vector<float> weight(nscores, 0);
Parameter *parameter = new Parameter();
const_cast<std::vector<std::string>&>(parameter->GetParam("factor-delimiter")).resize(1, "||dummy_string||");
// const_cast<std::vector<std::string>&>(parameter->GetParam("factor-delimiter")).resize(1, "||dummy_string||");
// UG: I assume "||dummy_string||" means: I'm not using factored data;
// This is now expressed by setting the factor delimiter to the empty string
const_cast<std::vector<std::string>&>(parameter->GetParam("factor-delimiter")).resize(1, "");
const_cast<std::vector<std::string>&>(parameter->GetParam("input-factors")).resize(1, "0");
const_cast<std::vector<std::string>&>(parameter->GetParam("verbose")).resize(1, "0");
//const_cast<std::vector<std::string>&>(parameter->GetParam("weight-w")).resize(1, "0");

View File

@ -48,6 +48,7 @@ POSSIBILITY OF SUCH DAMAGE.
#include "moses/FF/StatefulFeatureFunction.h"
#include "moses/FF/StatelessFeatureFunction.h"
#include "moses/FF/TreeStructureFeature.h"
#include "moses/PP/TreeStructurePhraseProperty.h"
#include "util/exception.hh"
using namespace std;
@ -410,17 +411,15 @@ void IOWrapper::OutputTreeFragmentsTranslationOptions(std::ostream &out, Applica
if (hypo != NULL) {
OutputTranslationOption(out, applicationContext, hypo, sentence, translationId);
const std::string key = "Tree";
std::string value;
bool hasProperty;
const TargetPhrase &currTarPhr = hypo->GetCurrTargetPhrase();
currTarPhr.GetProperty(key, value, hasProperty);
boost::shared_ptr<PhraseProperty> property;
out << " ||| ";
if (hasProperty)
out << " " << value;
else
if (currTarPhr.GetProperty("Tree", property)) {
out << " " << property->GetValueString();
} else {
out << " " << "noTreeInfo";
}
out << std::endl;
}
@ -439,17 +438,15 @@ void IOWrapper::OutputTreeFragmentsTranslationOptions(std::ostream &out, Applica
if (applied != NULL) {
OutputTranslationOption(out, applicationContext, applied, sentence, translationId);
const std::string key = "Tree";
std::string value;
bool hasProperty;
const TargetPhrase &currTarPhr = *static_cast<const TargetPhrase*>(applied->GetNote().vp);
currTarPhr.GetProperty(key, value, hasProperty);
boost::shared_ptr<PhraseProperty> property;
out << " ||| ";
if (hasProperty)
out << " " << value;
else
if (currTarPhr.GetProperty("Tree", property)) {
out << " " << property->GetValueString();
} else {
out << " " << "noTreeInfo";
}
out << std::endl;
}

View File

@ -161,7 +161,9 @@ ChartKBestExtractor::FindOrCreateVertex(const ChartHypothesis &h)
bestEdge.tail[i] = FindOrCreateVertex(*prevHypo);
}
boost::shared_ptr<Derivation> bestDerivation(new Derivation(bestEdge));
std::pair<DerivationSet::iterator, bool> q =
#ifndef NDEBUG
std::pair<DerivationSet::iterator, bool> q =
#endif
m_derivations.insert(bestDerivation);
assert(q.second);
sp->kBestList.push_back(bestDerivation);

View File

@ -192,12 +192,7 @@ void ChartTranslationOptionList::Evaluate(const InputType &input, const InputPat
}
size_t newSize = m_size - numDiscard;
if (numDiscard) {
cerr << "LIST numDiscard=" << numDiscard << " newSize=" << newSize << endl;
}
m_size = newSize;
}
void ChartTranslationOptionList::SwapTranslationOptions(size_t a, size_t b)

View File

@ -79,10 +79,6 @@ void ChartTranslationOptions::Evaluate(const InputType &input, const InputPath &
}
size_t newSize = m_collection.size() - numDiscard;
if (numDiscard) {
cerr << "numDiscard=" << numDiscard << " newSize=" << newSize << endl;
}
m_collection.resize(newSize);
}

Some files were not shown because too many files have changed in this diff.