288 Commits

Author SHA1 Message Date
MosesAdmin
47c793ca46 daily automatic beautifier 2015-06-10 00:00:40 +01:00
Phil Williams
fa51da28c5 moses/phrase-extract refactoring
Final commit in this round of refactoring (which started with commit
2f735998...).  The main changes are:

  - a general storage mechanism for attribute/value pairs in XML-style
    tree / lattice input.  E.g. the "pcfg-score" and "semantic-role"
    attributes in:

     <tree label="PRP" pcfg-score="1.0" semantic-role="AGENT"> I </tree>

  - consolidation of the various near-duplicate Tree / XmlTreeParser classes
    that have accumulated over the years (my fault)

  - miscellaneous de-crufting
2015-06-09 16:50:27 +01:00
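The attribute/value storage described in this commit can be pictured with a small sketch. The `TreeNode` struct and the `GetAttribute` and `MakeExampleNode` helpers below are hypothetical names chosen for illustration, not the classes in moses/phrase-extract; the sketch only assumes that each node keeps its XML attributes in a string-to-string map.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical node type: a label plus arbitrary attribute/value pairs.
struct TreeNode {
  std::string label;
  std::map<std::string, std::string> attributes;
};

// Returns the value stored under `name`, or `fallback` if it is absent.
std::string GetAttribute(const TreeNode &node, const std::string &name,
                         const std::string &fallback) {
  assert(!name.empty());  // an empty attribute name is a caller error here
  std::map<std::string, std::string>::const_iterator it =
      node.attributes.find(name);
  return it == node.attributes.end() ? fallback : it->second;
}

// Builds the node from the commit message's example:
//   <tree label="PRP" pcfg-score="1.0" semantic-role="AGENT"> I </tree>
TreeNode MakeExampleNode() {
  TreeNode node;
  node.label = "PRP";
  node.attributes["pcfg-score"] = "1.0";
  node.attributes["semantic-role"] = "AGENT";
  return node;
}
```

Storing values as strings keeps the container generic; a caller that needs a number, such as the PCFG score, would convert it at the point of use.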
Phil Williams
c6a3d8e54a Ongoing moses/phrase-extract refactoring 2015-06-04 16:54:31 +01:00
Phil Williams
f6ddc45224 Ongoing moses/phrase-extract refactoring 2015-06-04 14:36:39 +01:00
MosesAdmin
5696a59ae4 daily automatic beautifier 2015-06-04 13:41:46 +01:00
Phil Williams
8653bd8159 Ongoing moses/phrase-extract refactoring 2015-06-03 14:20:00 +01:00
Phil Williams
9097fd8965 Ongoing moses/phrase-extract refactoring 2015-06-03 14:09:49 +01:00
Phil Williams
ed321791a7 Ongoing moses/phrase-extract refactoring 2015-06-03 11:10:45 +01:00
Phil Williams
5e09d3dc71 Ongoing moses/phrase-extract refactoring 2015-06-03 10:33:46 +01:00
Phil Williams
2e21f051f2 Ongoing moses/phrase-extract refactoring 2015-06-03 10:05:36 +01:00
Phil Williams
6bea23357c Ongoing moses/phrase-extract refactoring 2015-06-03 09:28:38 +01:00
Phil Williams
2f04d4a56e Ongoing moses/phrase-extract refactoring 2015-06-02 15:23:41 +01:00
Phil Williams
5ece895ab4 Ongoing moses/phrase-extract refactoring 2015-06-02 14:00:56 +01:00
Phil Williams
0c61970ac7 Ongoing moses/phrase-extract refactoring 2015-06-02 13:56:03 +01:00
Phil Williams
d3fb4a8002 Ongoing moses/phrase-extract refactoring 2015-06-02 10:16:42 +01:00
Jeroen Vermeulen
0981d23705 Lint-fixing binge. 2015-06-02 16:02:39 +07:00
Phil Williams
8a9505d72f Ongoing moses/phrase-extract refactoring 2015-06-01 16:54:12 +01:00
Phil Williams
f37415a259 Ongoing moses/phrase-extract refactoring 2015-06-01 16:40:35 +01:00
Phil Williams
f61091e38d Ongoing moses/phrase-extract refactoring 2015-06-01 14:23:25 +01:00
Phil Williams
bf42fa058c Add LeafIterator and ConstLeafIterator to MosesTraining::Syntax::Tree 2015-06-01 11:01:00 +01:00
Phil Williams
f3ccd68bee Add ConstPreOrderIterator to MosesTraining::Syntax::Tree 2015-06-01 10:35:50 +01:00
Phil Williams
c754aef37a Oops. Fix compile error. 2015-06-01 08:45:04 +01:00
Phil Williams
985e7bbfc3 Ongoing moses/phrase-extract refactoring 2015-05-29 20:57:25 +01:00
Phil Williams
2f735998ca Rename MosesTraining::SyntaxTree to MosesTraining::SyntaxNodeCollection
This is the first step in a small-scale refactoring effort that will touch a
lot of the syntax-related code in moses/phrase-extract.  The end goals are:

  - a storage mechanism for general attribute/value pairs in XML-style
    tree / lattice input.  E.g. the "pcfg-score" and "semantic-role"
    attributes in:

     <tree label="PRP" pcfg-score="1.0" semantic-role="AGENT"> I </tree>

  - consolidation of the various near-duplicate Tree / XmlTreeParser classes
    that have accumulated over the years (my fault)

  - general de-crufting
2015-05-29 18:46:02 +01:00
Jeroen Vermeulen
ea9b097aba OutputFileStream: accept ‘-’ for “stdout”.
This is a common convention: when a program gets a dash as the path of a
file that it should write, it writes to standard output instead.

Enhances portability to systems that don't have /dev/stdout.
2015-05-26 15:06:04 +07:00
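The dash convention from this commit can be sketched as follows. `SelectOutput` is a hypothetical helper written for illustration and is not the OutputFileStream interface; it only shows the idea of mapping the path "-" to standard output.

```cpp
#include <cassert>
#include <fstream>
#include <iostream>
#include <string>

// Hypothetical helper: returns the stream to write to. A path of "-" selects
// standard output; any other path is opened in `file`, which the caller owns.
std::ostream &SelectOutput(const std::string &path, std::ofstream &file) {
  assert(!path.empty());  // an empty path is a caller error in this sketch
  if (path == "-") {
    return std::cout;
  }
  file.open(path.c_str());
  return file;
}
```

A caller would declare a `std::ofstream`, pass it along with the user-supplied path, and write to the returned reference without needing to know which destination was chosen.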
Jeroen Vermeulen
a25193cc5d Fix a lot of lint, mostly trailing whitespace.
This is lint reported by the new lint-checking functionality in beautify.py.
(We can change to a different lint checker if we have a better one, but it
would probably still flag these same problems.)

Lint checking can help a lot, but only if we get the lint under control.
2015-05-17 20:04:04 +07:00
Hieu Hoang
39139e7a64 beautify. 2015-05-15 18:09:38 +01:00
Hieu Hoang
cc8c6b7b10 beautify 2015-05-02 11:45:24 +01:00
Jeroen Vermeulen
09c982c1de Remove bad initialization.
Setting lastLine[0] when lastLine is empty probably doesn't do anything, but
in C++11 it is definitely undefined.  The value wasn't used anyway!
2015-05-01 18:42:04 +07:00
Jeroen Vermeulen
eca5824100 Remove trailing whitespace in C++ files. 2015-04-30 12:05:11 +07:00
Jeroen Vermeulen
616b589da3 Fix a bunch of compiler warnings.
Warnings are useful, but only if there are few!
2015-04-29 21:18:51 +07:00
Jeroen Vermeulen
fc810e363e Remove conflicting definition of isNonTerminal.
This only affects configurations where inline functions become regular,
non-weak symbols, leading to link conflicts.  The extra definition was not
used anywhere.

The removed definition was probably less efficient.  However the only
functional difference was that it returned false for the empty nonterminal,
i.e. "[]".
2015-04-22 10:43:15 +07:00
Jeroen Vermeulen
32722ab5b1 Support tokenize(const std::string &) as well.
Convenience wrapper: the actual function takes a const char[], but many of
the call sites want to pass a string and have to call its c_str() first.
2015-04-22 10:35:18 +07:00
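The convenience wrapper described here amounts to a forwarding overload. The sketch below lives in a made-up `sketch` namespace; the splitting rule (spaces and tabs, with runs of separators skipped) is an assumption for illustration and may differ from the real tokenize() in util.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

namespace sketch {

// Splits `input` on spaces and tabs, skipping runs of separators.
// The splitting rule is an assumption of this sketch.
inline std::vector<std::string> tokenize(const char input[]) {
  assert(input != NULL);
  std::vector<std::string> tokens;
  std::string current;
  for (const char *p = input; *p != '\0'; ++p) {
    if (*p == ' ' || *p == '\t') {
      if (!current.empty()) {
        tokens.push_back(current);
        current.clear();
      }
    } else {
      current += *p;
    }
  }
  if (!current.empty()) tokens.push_back(current);
  return tokens;
}

// Convenience overload: call sites holding a std::string no longer need
// to call c_str() themselves.
inline std::vector<std::string> tokenize(const std::string &input) {
  return tokenize(input.c_str());
}

}  // namespace sketch
```

Because a string literal converts to `const char *` more directly than to `std::string`, existing calls with literals still resolve to the original function.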
Jeroen Vermeulen
b2d821a141 Unify tokenize() into util, and unit-test it.
The duplicate definition works fine in environments where the inline
definition becomes a weak symbol in the object file, but if it gets
generated as a regular definition, the duplicate definition causes link
problems.

In most call sites the return value could easily be made const, which
gives both the reader and the compiler a bit more certainty about the code's
intentions.  In theory this may help performance, but it's mainly for clarity.

The comments are based on reverse-engineering, and the unit tests are based
on the comments.  It's possible that some of what's in there is not essential,
in which case, don't feel bad about changing it!

I left a third identical definition in place, though I updated it with my
changes to avoid creeping divergence, and noted the duplication in a comment.
It would be nice to get rid of this definition as well, but it'd introduce
headers from the main Moses tree into biconcor, which may be against policy.
2015-04-22 09:59:05 +07:00
Matthias Huck
633e7be8f0 integer overflows in Good-Turing discounting 2015-03-30 17:42:55 +01:00
Jeroen Vermeulen
c634f6ee5b Remove some unused variables.
This silences a few more compiler warnings.
2015-03-30 10:26:39 +07:00
Jeroen Vermeulen
789a2e2bc3 Fix some compile warnings (gcc 4.9.2).
Mostly signed/unsigned comparisons and reordered member
initializations; also a few unused variables.

There are more, but if I chip away at them for a while, who knows, it
may catch on and warnings may eventually become socially stigmatizing.
:)
2015-03-29 18:10:51 +07:00
Jeroen Vermeulen
9852a0c2ff Modernize "C" includes in phrase-extract.
This is one of those little chores in managing a long-lived C++
project: standard C headers like stdio.h and math.h now have their own
place in the C++ standard as cstdio, cmath, and so on.  In this
branch the #include names are updated for the phrase-extract/
subdirectory; more branches to follow.

C++11 adds cstdint, but to support compilation with the previous
standard, that change is left for later.
2015-03-28 19:56:20 +07:00
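As an illustration of the include change this commit describes, the sketch below uses the C++ header names and qualifies the library calls with `std::`. `FormatRoot` is a hypothetical function that exists only to exercise both headers; it is not taken from phrase-extract.

```cpp
#include <cassert>
#include <cmath>   // was: #include <math.h>
#include <cstdio>  // was: #include <stdio.h>
#include <string>

// Hypothetical function: formats the square root of `x` with up to three
// significant digits. "%.3g" keeps the output short enough for the buffer.
std::string FormatRoot(double x) {
  assert(x >= 0.0);  // negative input is out of scope for this sketch
  char buffer[64];
  std::sprintf(buffer, "%.3g", std::sqrt(x));
  return buffer;
}
```

Both headers are available in C++98, so this form compiles with or without C++11, which matches the commit's note that cstdint is deferred.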
Matthias Huck
534a894c0b glue rules with stripped BitPar labels 2015-03-10 22:02:21 +00:00
Matthias Huck
01bed83cf9 GHKM extraction: option to strip non-terminal labels from BitPar syntactic parses right during extraction (i.e., remove any suffix starting with a hyphen from the label) 2015-03-10 21:25:32 +00:00
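The label stripping mentioned in this commit can be sketched as below. `StripLabelSuffix` is a hypothetical name and the example label is illustrative; how the real option treats labels that begin with a hyphen is not stated in the commit message, so leaving them unchanged is an assumption of this sketch.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: drops everything from the first hyphen onwards, so
// "NP-SB" becomes "NP". Labels without a hyphen, or starting with one, are
// returned unchanged (the latter is an assumption, see the lead-in).
std::string StripLabelSuffix(const std::string &label) {
  const std::string::size_type pos = label.find('-');
  if (pos == std::string::npos || pos == 0) {
    return label;
  }
  return label.substr(0, pos);
}
```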
Hieu Hoang
ad73919979 merge with private branch 2015-03-10 15:28:45 +00:00
Phil Williams
9e88f794e6 Add phrase-extract/postprocess-egret-forests
This performs some minor transformations to Egret forests: escaping of
Moses special characters; removal of "^g" suffixes from constituent labels;
and marking of slash/hyphen split points (using @ characters).
2015-03-10 13:51:30 +00:00
Matthias Huck
25f5470216 GHKM: write target parts-of-speech as a factor 2015-03-09 21:54:03 +00:00
Matthias Huck
524ed4406e pragma once 2015-03-09 21:44:54 +00:00
Matthias Huck
559077f6f8 some moderate modifications in phrase-extract/score-main.cpp
(e.g., use Moses::Scan<>() rather than atof()/atoi())
2015-03-09 18:49:32 +00:00
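One reason to prefer a typed, stream-based conversion over atof()/atoi() is that a single template covers every numeric type and can report failure. The sketch below is a generic illustration with a hypothetical `TryParse` helper; it is not the definition of Moses::Scan<>(), whose error handling is not described in this log.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper: parses `text` into `out` using stream extraction.
// Returns false if no value of type T could be read from the start of `text`.
template <typename T>
bool TryParse(const std::string &text, T &out) {
  std::istringstream stream(text);
  return !(stream >> out).fail();
}
```

With atoi(), input such as "abc" silently yields 0; here the caller can tell a failed parse apart from a genuine zero.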
Matthias Huck
973fd98052 conservative update of some old code in phrase-extract/consolidate-main.cpp 2015-03-09 18:47:28 +00:00
Matthias Huck
0c79e19ff9 consolidate properties: fixing bug from commit b08d3ed 2015-03-09 18:44:02 +00:00
Hieu Hoang
b08d3ed0fe merge with private branch. Add --Count arg 2015-03-09 00:47:51 +00:00
Matthias Huck
99b8f65fb1 GHKM: POS factor in glue rules: target side only 2015-03-06 16:47:44 +00:00
Matthias Huck
aa077ab66c GHKM extraction / consolidate: write most frequent POS sequence from property to factor (for usage with a POS LM) 2015-03-05 22:25:32 +00:00