260 Commits

Author SHA1 Message Date
Jeroen Vermeulen
09c982c1de Remove bad initialization.
Setting lastLine[0] when lastLine is empty probably doesn't do anything, but
in C++11 it is definitely undefined behavior.  The value wasn't used anyway!
2015-05-01 18:42:04 +07:00
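A minimal sketch of the hazard the commit above removes, assuming lastLine is a std::string (its declaration is not shown in the log). In C++11, operator[] at index size() returns a reference to a null character, and modifying that object is undefined. The helper name setFirstChar is hypothetical and only illustrates guarding the write.

```cpp
#include <string>

// Hypothetical helper (not from the Moses source): overwrite the first
// character only when the string actually has one.
// Returns true if the string was modified.
bool setFirstChar(std::string &s, char c)
{
  if (s.empty()) {
    // s[0] would refer to the terminating null here; writing to it
    // is undefined in C++11, so do nothing instead.
    return false;
  }
  s[0] = c;
  return true;
}
```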
Jeroen Vermeulen
eca5824100 Remove trailing whitespace in C++ files. 2015-04-30 12:05:11 +07:00
Jeroen Vermeulen
616b589da3 Fix a bunch of compiler warnings.
Warnings are useful, but only if there are few!
2015-04-29 21:18:51 +07:00
Jeroen Vermeulen
fc810e363e Remove conflicting definition of isNonTerminal.
This only affects configurations where inline functions become regular,
non-weak symbols, leading to link conflicts.  The extra definition was not
used anywhere.

The removed definition was probably less efficient.  However the only
functional difference was that it returned false for the empty nonterminal,
i.e. "[]".
2015-04-22 10:43:15 +07:00
Jeroen Vermeulen
32722ab5b1 Support tokenize(const std::string &) as well.
Convenience wrapper: the actual function takes a const char[], but many of
the call sites want to pass a string and have to call its c_str() first.
2015-04-22 10:35:18 +07:00
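The convenience wrapper described above might look roughly like this. It is a sketch under assumptions: the real tokenize() signature, return type, and splitting rules are not shown in the log, so the whitespace-splitting body below is a stand-in.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Assumed shape of the existing function: split on whitespace.
// The actual Moses implementation may behave differently.
std::vector<std::string> tokenize(const char input[])
{
  std::vector<std::string> tokens;
  std::istringstream stream(input);
  std::string token;
  while (stream >> token) {
    tokens.push_back(token);
  }
  return tokens;
}

// Convenience wrapper: forwards to the const char[] version so that
// call sites holding a std::string need not call c_str() themselves.
inline std::vector<std::string> tokenize(const std::string &input)
{
  return tokenize(input.c_str());
}
```

A call site can then write `const std::vector<std::string> words = tokenize(line);` for a std::string line, while string literals still resolve to the const char[] overload.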
Jeroen Vermeulen
b2d821a141 Unify tokenize() into util, and unit-test it.
The duplicate definition works fine in environments where the inline
definition becomes a weak symbol in the object file, but if it gets
generated as a regular definition, the duplicate definition causes link
problems.

In most call sites the return value could easily be made const, which
gives both the reader and the compiler a bit more certainty about the code's
intentions.  In theory this may help performance, but it's mainly for clarity.

The comments are based on reverse-engineering, and the unit tests are based
on the comments.  It's possible that some of what's in there is not essential,
in which case, don't feel bad about changing it!

I left a third identical definition in place, though I updated it with my
changes to avoid creeping divergence, and noted the duplication in a comment.
It would be nice to get rid of this definition as well, but it'd introduce
headers from the main Moses tree into biconcor, which may be against policy.
2015-04-22 09:59:05 +07:00
Matthias Huck
633e7be8f0 integer overflows in Good-Turing discounting 2015-03-30 17:42:55 +01:00
Jeroen Vermeulen
c634f6ee5b Remove some unused variables.
This silences a few more compiler warnings.
2015-03-30 10:26:39 +07:00
Jeroen Vermeulen
789a2e2bc3 Fix some compile warnings (gcc 4.9.2).
Mostly signed/unsigned comparisons and reordered member
initializations; also a few unused variables.

There are more, but if I chip away at them for a while, who knows, it
may catch on and warnings may eventually become socially stigmatizing.
:)
2015-03-29 18:10:51 +07:00
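For readers unfamiliar with the two warning classes named above, here is a small illustration. The class Span and the function sum are hypothetical and not taken from the Moses source; they only show the usual fixes for gcc's -Wreorder and -Wsign-compare.

```cpp
#include <cstddef>
#include <vector>

// Members are initialized in declaration order regardless of how the
// initializer list is written, so the list follows the same order.
class Span
{
public:
  Span(std::size_t start, std::size_t length)
    : m_start(start), m_end(start + length) {}

  std::size_t size() const { return m_end - m_start; }

private:
  std::size_t m_start;
  std::size_t m_end;
};

// An unsigned index type matches values.size(); declaring the index as
// int would compare signed with unsigned and trigger the warning.
int sum(const std::vector<int> &values)
{
  int total = 0;
  for (std::size_t i = 0; i < values.size(); ++i) {
    total += values[i];
  }
  return total;
}
```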
Jeroen Vermeulen
9852a0c2ff Modernize "C" includes in phrase-extract.
This is one of those little chores in managing a long-lived C++
project: standard C headers like stdio.h and math.h now have their own
place in the C++ standard as cstdio and cmath respectively, and so on.
In this branch the #include names are updated for the phrase-extract/
subdirectory; more branches to follow.

C++11 adds cstdint, but to support compilation with the previous
standard, that change is left for later.
2015-03-28 19:56:20 +07:00
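As an illustration of the header change described above (the functions hypotenuse and report are made up for this sketch), code that previously included the C headers would switch to the C++ names and can refer to the library functions through namespace std:

```cpp
#include <cmath>   // was: #include <math.h>
#include <cstdio>  // was: #include <stdio.h>

// Hypothetical example function using the std::-qualified math routine.
double hypotenuse(double a, double b)
{
  return std::sqrt(a * a + b * b);
}

// Prints the result with three decimals via the std::-qualified printf.
void report(double a, double b)
{
  std::printf("%.3f\n", hypotenuse(a, b));
}
```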
Matthias Huck
534a894c0b glue rules with stripped BitPar labels 2015-03-10 22:02:21 +00:00
Matthias Huck
01bed83cf9 GHKM extraction: option to strip non-terminal labels from BitPar syntactic parses right during extraction (i.e., remove any suffix starting with a hyphen from the label) 2015-03-10 21:25:32 +00:00
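A rough sketch of the label stripping described above, not the actual GHKM code. The function name stripLabelSuffix is hypothetical, and leaving labels that begin with a hyphen untouched is an assumption made here so that a label is never reduced to an empty string.

```cpp
#include <string>

// Hypothetical helper: drop everything from the first hyphen onward,
// e.g. a label like "NP-SB" becomes "NP".
// Assumption: a label starting with a hyphen is returned unchanged.
std::string stripLabelSuffix(const std::string &label)
{
  const std::string::size_type pos = label.find('-');
  if (pos == std::string::npos || pos == 0) {
    return label;
  }
  return label.substr(0, pos);
}
```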
Hieu Hoang
ad73919979 merge with private branch 2015-03-10 15:28:45 +00:00
Phil Williams
9e88f794e6 Add phrase-extract/postprocess-egret-forests
This performs some minor transformations to Egret forests: escaping of
Moses special characters; removal of "^g" suffixes from constituent labels;
and marking of slash/hyphen split points (using @ characters).
2015-03-10 13:51:30 +00:00
Matthias Huck
25f5470216 GHKM: write target parts-of-speech as a factor 2015-03-09 21:54:03 +00:00
Matthias Huck
524ed4406e pragma once 2015-03-09 21:44:54 +00:00
Matthias Huck
559077f6f8 some moderate modifications in phrase-extract/score-main.cpp
(e.g., use Moses::Scan<>() rather than atof()/atoi())
2015-03-09 18:49:32 +00:00
Matthias Huck
973fd98052 conservative update of some old code in phrase-extract/consolidate-main.cpp 2015-03-09 18:47:28 +00:00
Matthias Huck
0c79e19ff9 consolidate properties: fixing bug from commit b08d3ed 2015-03-09 18:44:02 +00:00
Hieu Hoang
b08d3ed0fe merge with private branch. Add --Count arg 2015-03-09 00:47:51 +00:00
Matthias Huck
99b8f65fb1 GHKM: POS factor in glue rules: target side only 2015-03-06 16:47:44 +00:00
Matthias Huck
aa077ab66c GHKM extraction / consolidate: write most frequent POS sequence from property to factor (for usage with a POS LM) 2015-03-05 22:25:32 +00:00
Matthias Huck
773a16b5fd POS property in glue rules 2015-03-04 23:05:45 +00:00
Matthias Huck
638e9c3f60 POS property: map tags to indices in consolidate 2015-03-04 22:48:34 +00:00
Matthias Huck
06e87d851e GHKM: extract POS phrase property (from preterminals in the syntactic parse tree) 2015-03-04 21:40:56 +00:00
Phil Williams
0346fbb138 filter-rule-table: stopgap (non-) filtering for T2S/SCFG 2015-02-23 11:27:20 +00:00
Hieu Hoang
32de075022 beautify 2015-02-19 12:27:23 +00:00
Phil Williams
e1d60211a4 filter-rule-table: comments + minor clean-up. 2015-02-11 12:03:27 +00:00
Phil Williams
02f5ada680 filter-rule-table: support for "hierarchical" and "s2t" model types
Output should match filter-rule-table.py, but filtering is faster.  Some rough
timings:

             filter-rule-table.py   filter-rule-table
  System A         0h 13m                0h 04m
  System B        18h 03m                0h 51m

System A is WMT14, en-de, string-to-tree (32M rules, 3,000 test sentences)
System B is WMT14, cs-en, string-to-tree (293M rules, 13,071 test sentences)
2015-02-10 15:11:10 +00:00
Hieu Hoang
70e8eb54ce Using boost for prefix/suffix checks /Jeroen Vermeulen 2015-02-05 16:23:47 +00:00
Philipp Koehn
f69c1dab02 more efficient default recaser training 2015-02-04 09:18:09 +00:00
Phil Williams
6b9da6c585 filter-rule-table: merge changes from t2s branch (still WIP) 2015-02-03 11:33:10 +00:00
Matthias Huck
9987beb453 SoftSourceSyntacticConstraintsFeature: Now for both non-terminals (as before) _and_ terminals.
Also added score components based on relative frequency.
(TODO: logprobs right now; are plain probabilities better?)
2015-01-23 18:41:18 +00:00
Matthias Huck
b50c197313 forgot to check this in some time ago 2015-01-20 21:41:41 +00:00
Matthias Huck
a6c09e57d0 domain features in GHKM extraction 2015-01-20 21:36:55 +00:00
Hieu Hoang
b50b3164fa beautify 2015-01-15 11:18:39 +00:00
Hieu Hoang
6289b39fd8 update extract-mixed-syntax 2015-01-15 09:53:57 +00:00
Hieu Hoang
6d61db28fa use astyle 2.01. It's on Edinburgh server and doesn't screw up enum 2015-01-14 19:21:11 +00:00
Hieu Hoang
05ead45e71 beautify 2015-01-14 11:07:42 +00:00
Matthias Huck
168118d252 PhraseOrientationFeature efficiency improvement 2015-01-09 14:03:18 +00:00
Phil Williams
7cc75a0fa1 score-stsg: add --TreeScore option 2014-12-30 18:57:23 +00:00
Philipp Koehn
831f947874 long overdue feature: do not produce very low scoring translation table entries that are never used and just gum up the works 2014-12-21 01:14:42 +00:00
Nicola Bertoldi
e4eb201c52 merged master into dynamic-models and solved conflicts 2014-12-13 12:52:47 +01:00
Phil Williams
b9a382aa78 Add filter-rule-table
This will eventually replace filter-rule-table.py.  At the moment
it can only filter rule tables where the source-side is a STSG
fragment and when the test sentences have parse trees.
2014-12-07 14:56:48 +00:00
Phil Williams
60e56efc6b phrase-extract: add syntax-common sub-library
And remove some (near-)duplicate code from pcfg-common and score-stsg.
2014-12-07 14:27:51 +00:00
Phil Williams
a2708b8431 relax-parse: fix hang
SyntaxTree::Parse() would enter a *very* long loop due to an uninitialized
member variable.
2014-12-07 12:56:41 +00:00
Hieu Hoang
4b10c59bea add OutputSearchGraphHypergraph() to API framework. Move m_source to BaseManager 2014-12-05 21:33:59 +00:00
Matthias Huck
7a299de66b avoid necessity of masking "{{" in the data 2014-12-04 15:54:05 +00:00
Matthias Huck
24a8a6a511 PhraseOrientationFeature 2014-12-03 20:04:26 +00:00
Hieu Hoang
49a2ff1faa Merge branch 'merge-cmd' 2014-12-02 19:09:34 +00:00