mirror of
https://github.com/moses-smt/mosesdecoder.git
synced 2024-12-26 21:42:19 +03:00
Merge branch 'master' of ssh://github.com/moses-smt/mosesdecoder
This commit is contained in:
commit
89a9c410c9
1
Jamroot
@ -145,6 +145,7 @@ build-projects lm util phrase-extract search moses moses/LM mert moses-cmd moses
if [ option.get "with-mm" : : "yes" ]
{
  alias mm :
    moses/TranslationModel/UG//lookup_mmsapt
    moses/TranslationModel/UG/mm//mtt-build
    moses/TranslationModel/UG/mm//mtt-dump
    moses/TranslationModel/UG/mm//symal2mam
122
contrib/moses-speedtest/README.md
Normal file
@ -0,0 +1,122 @@
# Moses speedtesting framework

### Description

This is an automatic test framework designed to track day-to-day performance changes in Moses.

### Set up

#### Set up a Moses repo
Set up a Moses repo and build it with the desired configuration.
```bash
git clone https://github.com/moses-smt/mosesdecoder.git
cd mosesdecoder
./bjam -j10 --with-cmph=/usr/include/
```
You need to build Moses first, so that the testsuite knows which command to use when rebuilding against newer revisions.

#### Create a parent directory.
Create a parent directory to hold **runtests.py**, the related scripts, and the configuration file.
This should also be the location of the TEST_DIR and TEST_LOG_DIR as explained in the next section.

#### Set up a global configuration file.
You need a configuration file for the testsuite. A sample configuration file is provided in **testsuite\_config**
<pre>
MOSES_REPO_PATH: /home/moses-speedtest/moses-standard/mosesdecoder
DROP_CACHES_COMM: sys_drop_caches 3
TEST_DIR: /home/moses-speedtest/phrase_tables/tests
TEST_LOG_DIR: /home/moses-speedtest/phrase_tables/testlogs
BASEBRANCH: RELEASE-2.1.1
</pre>

The _MOSES\_REPO\_PATH_ is the place where you have set up and built Moses.
The _DROP\_CACHES\_COMM_ is the command that will be used to drop caches. It should run without prompting for root credentials.
_TEST\_DIR_ is the directory where all the tests will reside.
_TEST\_LOG\_DIR_ is the directory where the performance logs will be gathered. It should be created before running the testsuite for the first time.
_BASEBRANCH_ is the branch against which all new tests will be compared. It should normally be set to the latest stable Moses release.
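
As a rough illustration of the file format above, the following sketch reads `KEY: value` lines into a dictionary and ignores `#` comments. The function name `read_suite_config` and the sample text are hypothetical; the testsuite itself does its parsing in `parse_testconfig` inside **runtests.py**, which may differ in detail.

```python
def read_suite_config(text):
    """Parse 'KEY: value' lines into a dict, skipping comments and blanks."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split('#', 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        key, _, value = line.partition(':')  # split at the first colon only
        settings[key.strip()] = value.strip()
    return settings


# Hypothetical sample, modelled on the configuration shown above.
SAMPLE = """\
MOSES_REPO_PATH: /home/moses-speedtest/moses-standard/mosesdecoder
DROP_CACHES_COMM: sys_drop_caches 3  # hypothetical trailing comment
BASEBRANCH: RELEASE-2.1.1
"""

config = read_suite_config(SAMPLE)
print(config["DROP_CACHES_COMM"])
```

Splitting at the first colon only keeps values that themselves contain colons intact.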

### Creating tests

To create a test, go into the TEST_DIR and create a new folder. The folder name is used as the name of the test.
Inside that folder, place a configuration file named **config**. The naming is mandatory.
An example of such a configuration file is **test\_config**

<pre>
Command: moses -f ... -i fff #Looks for the command in the /bin directory of the repo specified in the testsuite_config
LDPRE: ldpreloads #Comma separated LD_LIBRARY_PATH:/,
Variants: vanilla, cached, ldpre #Can't have cached without ldpre or vanilla
</pre>

The _Command:_ line specifies the executable (which is looked up in the /bin directory of the repo) and any arguments necessary. Before running the test, the script changes into the current test directory, so you can use relative paths.
The _LDPRE:_ line specifies the libraries, if any, that tests should be run with via LD\_PRELOAD.
The _Variants:_ line specifies which types of tests to run. This particular line will run the following tests:
1. A vanilla test, meaning just the command after _Command_ will be issued.
2. A vanilla cached test, meaning that after the vanilla test, the test is run again without dropping caches in order to benchmark performance on a cached filesystem.
3. A test of the form LD_PRELOAD ldpreloads moses -f command, run once for each library in the comma-separated LDPRE list.
4. A cached version of all LD_PRELOAD tests.
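
The list above can be sketched as a small function that enumerates the runs implied by a _Variants:_ line. The function name `expand_variants` and the example names `mytest` and `libfoo.so` are hypothetical; the log-name suffixes mirror the ones used by `execute_tests` in **runtests.py**.

```python
def expand_variants(test_name, variants, ldpre_libs):
    """Return the log name of every run implied by a Variants line."""
    runs = []
    cached = 'cached' in variants
    if 'vanilla' in variants:
        runs.append(test_name + '_vanilla')
        if cached:  # cached runs only follow an uncached run
            runs.append(test_name + '_vanilla_cached')
    if 'ldpre' in variants:
        for lib in ldpre_libs:  # one run per library to preload
            runs.append(test_name + '_ldpre_' + lib)
            if cached:
                runs.append(test_name + '_ldpre_' + lib + '_cached')
    return runs


print(expand_variants('mytest', ['vanilla', 'cached', 'ldpre'], ['libfoo.so']))
```

Note that `cached` on its own produces no runs, which matches the comment in the sample configuration.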

### Running tests.
Running the tests is done through the **runtests.py** script.

#### Running all tests.
To run all tests against the base branch and the latest revision (generating new base-branch test data if it is missing), run:
```bash
python3 runtests.py -c testsuite_config
```

#### Running specific tests.
The script allows the user to manually run a particular test or to test against a specific branch or revision:
<pre>
moses-speedtest@crom:~/phrase_tables$ python3 runtests.py --help
usage: runtests.py [-h] -c CONFIGFILE [-s SINGLETESTDIR] [-r REVISION]
                   [-b BRANCH]

A python based speedtest suite for moses.

optional arguments:
  -h, --help            show this help message and exit
  -c CONFIGFILE, --configfile CONFIGFILE
                        Specify test config file
  -s SINGLETESTDIR, --singletest SINGLETESTDIR
                        Single test name directory. Specify directory name,
                        not full path!
  -r REVISION, --revision REVISION
                        Specify a specific revision for the test.
  -b BRANCH, --branch BRANCH
                        Specify a branch for the test.
</pre>

### Generating HTML report.
To generate a summary of the test results, use the **html\_gen.py** script. It writes a file named *index.html* to the current working directory.
```bash
python3 html_gen.py testsuite_config
```
You should use the generated file with the **style.css** file provided in the html directory.

### Command line regression testing.
Alternatively, you can check for regressions from the command line using the **check\_for\_regression.py** script:
```bash
python3 check_for_regression.py TESTLOGS_DIRECTORY
```
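
The classification performed by the regression check can be sketched roughly as follows. This is an assumption-laden illustration: the `Result` class lives in `testsuite_common`, which is not shown here, so the formula in `percent_change` (percent units, positive meaning faster) and both function names are hypothetical. The 5% default threshold is the one used by **check\_for\_regression.py**, which also accepts a different value as an optional second argument.

```python
def percent_change(previous, current):
    """Assumed convention: positive when the run got faster."""
    return (previous - current) / previous * 100.0


def classify(previous, current, threshold=5.0):
    """Label a pair of timings using a symmetric percentage threshold."""
    change = percent_change(previous, current)
    if change < -threshold:
        return 'regression'
    if change > threshold:
        return 'improvement'
    return 'unchanged'


print(classify(100.0, 110.0))  # 10% slower
print(classify(100.0, 90.0))   # 10% faster
print(classify(100.0, 102.0))  # within the threshold
```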

The results of all tests are also logged inside the specified TESTLOGS directory, so you can manually check them for additional information such as date, time, revision, and branch.
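
For manual inspection, a log line can be split into its fields as in the sketch below. The field layout follows what `write_log` in **runtests.py** appends (date, time, revision, then labelled values); the names `LogEntry` and `parse_log_line` and the sample line are hypothetical, and the testsuite's own parser, `processLogLine` in `testsuite_common`, is not shown here and may behave differently.

```python
from collections import namedtuple

LogEntry = namedtuple(
    'LogEntry', 'date time revision testname real user system branch')


def parse_log_line(line):
    """Split one whitespace-separated log line into named fields."""
    fields = line.split()
    # Odd positions from index 3 on hold labels such as "Testname:".
    return LogEntry(
        date=fields[0], time=fields[1], revision=fields[2],
        testname=fields[4], real=float(fields[6]),
        user=float(fields[8]), system=float(fields[10]),
        branch=fields[12])


# Hypothetical sample line in the layout written by write_log.
SAMPLE = ("01.02.2014 03:04:05 abc123 Testname: mytest_vanilla "
          "RealTime: 12.5 UserTime: 11.0 SystemTime: 1.2 Branch: master")
entry = parse_log_line(SAMPLE)
print(entry.revision, entry.real, entry.branch)
```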

### Create a cron job:
Create a cron job to run the tests daily and generate an HTML report. An example *cronjob* is available.
```bash
#!/bin/sh
cd /home/moses-speedtest/phrase_tables

python3 runtests.py -c testsuite_config #Run the tests.
python3 html_gen.py testsuite_config #Generate html

cp index.html /fs/thor4/html/www/speed-test/ #Update the html
```

Place the script in _/etc/cron.daily_ for daily testing.

###### Author
Nikolay Bogoychev, 2014

###### License
This software is licensed under the LGPL.
63
contrib/moses-speedtest/check_for_regression.py
Normal file
@ -0,0 +1,63 @@
"""Checks if any of the latest tests has performed considerably different than
the previous ones. Takes the log directory as an argument."""
import os
import sys
from testsuite_common import Result, processLogLine, bcolors, getLastTwoLines

LOGDIR = sys.argv[1] #Get the log directory as an argument
PERCENTAGE = 5 #Default value for how much a test should change
if len(sys.argv) == 3:
    #Default is 5%, but a different value can be given as a second
    #command line parameter
    PERCENTAGE = float(sys.argv[2])

def printResults(regressed, better, unchanged, firsttime):
    """Pretty print the results in different colours"""
    if regressed != []:
        for item in regressed:
            print(bcolors.RED + "REGRESSION! " + item.testname + " Was: "\
            + str(item.previous) + " Is: " + str(item.current) + " Change: "\
            + str(abs(item.percentage)) + "%. Revision: " + item.revision\
            + bcolors.ENDC)
        print('\n')
    if unchanged != []:
        for item in unchanged:
            print(bcolors.BLUE + "UNCHANGED: " + item.testname + " Revision: " +\
            item.revision + bcolors.ENDC)
        print('\n')
    if better != []:
        for item in better:
            print(bcolors.GREEN + "IMPROVEMENT! " + item.testname + " Was: "\
            + str(item.previous) + " Is: " + str(item.current) + " Change: "\
            + str(abs(item.percentage)) + "%. Revision: " + item.revision\
            + bcolors.ENDC)
    if firsttime != []:
        for item in firsttime:
            print(bcolors.PURPLE + "First time test! " + item.testname +\
            " Took: " + str(item.real) + " seconds. Revision: " +\
            item.revision + bcolors.ENDC)


all_files = os.listdir(LOGDIR)
regressed = []
better = []
unchanged = []
firsttime = []

#Go through all log files and sort the tests into regressed, improved,
#unchanged and first-time results.
for logfile in all_files:
    (line1, line2) = getLastTwoLines(logfile, LOGDIR)
    log1 = processLogLine(line1)
    if line2 == '\n': # Empty line, only one test ever run
        firsttime.append(log1)
        continue
    log2 = processLogLine(line2)
    res = Result(log1.testname, log1.real, log2.real, log2.revision,\
    log2.branch, log1.revision, log1.branch)
    if res.percentage < -PERCENTAGE:
        regressed.append(res)
    elif res.percentage > PERCENTAGE: #Same attribute as the regression check
        better.append(res)
    else:
        unchanged.append(res)

printResults(regressed, better, unchanged, firsttime)
7
contrib/moses-speedtest/cronjob
Normal file
@ -0,0 +1,7 @@
#!/bin/sh
cd /home/moses-speedtest/phrase_tables

python3 runtests.py -c testsuite_config #Run the tests.
python3 html_gen.py testsuite_config #Generate html

cp index.html /fs/thor4/html/www/speed-test/ #Update the html
5
contrib/moses-speedtest/helpers/README.md
Normal file
@ -0,0 +1,5 @@
### Helpers

This directory contains a Python script that gives you the equivalent of:
```echo 3 > /proc/sys/vm/drop_caches```
You need to set it up so that it is executed with root access without needing a password, so that the tests can be automated.
22
contrib/moses-speedtest/helpers/sys_drop_caches.py
Normal file
@ -0,0 +1,22 @@
#!/usr/bin/spython
from sys import argv, stderr, exit
from os import linesep as ls
procfile = "/proc/sys/vm/drop_caches"
options = ["1","2","3"]
flush_type = None
try:
    flush_type = argv[1][0:1]
    if not flush_type in options:
        #Call syntax works on both Python 2 and Python 3
        raise IndexError("not in options")
    with open(procfile, "w") as f:
        f.write("%s%s" % (flush_type,ls))
    exit(0)
except IndexError:
    stderr.write("Argument %s required.%s" % (options, ls))
except IOError:
    stderr.write("Error writing to file.%s" % ls)
except Exception: #StandardError does not exist on Python 3
    stderr.write("Unknown Error.%s" % ls)

exit(1)
5
contrib/moses-speedtest/html/README.md
Normal file
@ -0,0 +1,5 @@
### HTML files.

_index.html_ is a sample file generated by this testsuite.

_style.css_ should be placed in the same directory as _index.html_ in order to visualize the test results in a browser.
32
contrib/moses-speedtest/html/index.html
Normal file
File diff suppressed because one or more lines are too long
21
contrib/moses-speedtest/html/style.css
Normal file
@ -0,0 +1,21 @@
table,th,td
{
  border:1px solid black;
  border-collapse:collapse
}

tr:nth-child(odd) {
  background-color: Gainsboro;
}

.better {
  color: Green;
}

.worse {
  color: Red;
}

.unchanged {
  color: SkyBlue;
}
192
contrib/moses-speedtest/html_gen.py
Normal file
@ -0,0 +1,192 @@
"""Generates HTML page containing the testresults"""
from testsuite_common import Result, processLogLine, getLastTwoLines
from runtests import parse_testconfig
import os
import sys

from datetime import datetime, timedelta

HTML_HEADING = """<html>
<head>
<title>Moses speed testing</title>
<link rel="stylesheet" type="text/css" href="style.css"></head><body>"""
HTML_ENDING = "</table></body></html>\n"

TABLE_HEADING = """<table><tr class="heading">
<th>Date</th>
<th>Time</th>
<th>Testname</th>
<th>Revision</th>
<th>Branch</th>
<th>Time</th>
<th>Prevtime</th>
<th>Prevrev</th>
<th>Change (%)</th>
<th>Time (Basebranch)</th>
<th>Change (%, Basebranch)</th>
<th>Time (Days -2)</th>
<th>Change (%, Days -2)</th>
<th>Time (Days -3)</th>
<th>Change (%, Days -3)</th>
<th>Time (Days -4)</th>
<th>Change (%, Days -4)</th>
<th>Time (Days -5)</th>
<th>Change (%, Days -5)</th>
<th>Time (Days -6)</th>
<th>Change (%, Days -6)</th>
<th>Time (Days -7)</th>
<th>Change (%, Days -7)</th>
<th>Time (Days -14)</th>
<th>Change (%, Days -14)</th>
<th>Time (Years -1)</th>
<th>Change (%, Years -1)</th>
</tr>"""

def get_prev_days(date, numdays):
    """Gets the date numdays days before the given date so that we can
    search for that test in the log file"""
    date_obj = datetime.strptime(date, '%d.%m.%Y').date()
    past_date = date_obj - timedelta(days=numdays)
    return past_date.strftime('%d.%m.%Y')

def gather_necessary_lines(logfile, date):
    """Gathers the necessary lines corresponding to past dates
    and parses them if they exist"""
    #Get a dictionary of dates
    dates = {}
    dates[get_prev_days(date, 2)] = ('-2', None)
    dates[get_prev_days(date, 3)] = ('-3', None)
    dates[get_prev_days(date, 4)] = ('-4', None)
    dates[get_prev_days(date, 5)] = ('-5', None)
    dates[get_prev_days(date, 6)] = ('-6', None)
    dates[get_prev_days(date, 7)] = ('-7', None)
    dates[get_prev_days(date, 14)] = ('-14', None)
    dates[get_prev_days(date, 365)] = ('-365', None)

    openfile = open(logfile, 'r')
    for line in openfile:
        if line.split()[0] in dates.keys():
            day = dates[line.split()[0]][0]
            dates[line.split()[0]] = (day, processLogLine(line))
    openfile.close()
    return dates

def append_date_to_table(resline):
    """Appends past dates to the html"""
    cur_html = '<td>' + str(resline.current) + '</td>'

    if resline.percentage > 0.05: #If we have improvement of more than 5%
        cur_html = cur_html + '<td class="better">' + str(resline.percentage) + '</td>'
    elif resline.percentage < -0.05: #We have a regression of more than 5%
        cur_html = cur_html + '<td class="worse">' + str(resline.percentage) + '</td>'
    else:
        cur_html = cur_html + '<td class="unchanged">' + str(resline.percentage) + '</td>'
    return cur_html

def compare_rev(filename, rev1, rev2, branch1=False, branch2=False):
    """Compare the test results of two lines. We can specify either a
    revision or a branch for comparison. The first rev should be the
    base version and the second revision should be the later version"""

    #In the log file the index of the revision is 2 but the index of
    #the branch is 12. Alternate those depending on whether we are looking
    #for a specific revision or branch.
    firstidx = 2
    secondidx = 2
    if branch1 == True:
        firstidx = 12
    if branch2 == True:
        secondidx = 12

    rev1line = ''
    rev2line = ''
    resfile = open(filename, 'r')
    for line in resfile:
        if rev1 == line.split()[firstidx]:
            rev1line = line
        elif rev2 == line.split()[secondidx]:
            rev2line = line
        if rev1line != '' and rev2line != '':
            break
    resfile.close()
    if rev1line == '':
        raise ValueError('Revision ' + rev1 + " was not found!")
    if rev2line == '':
        raise ValueError('Revision ' + rev2 + " was not found!")

    logLine1 = processLogLine(rev1line)
    logLine2 = processLogLine(rev2line)
    res = Result(logLine1.testname, logLine1.real, logLine2.real,\
    logLine2.revision, logLine2.branch, logLine1.revision, logLine1.branch)

    return res

def produce_html(path, global_config):
    """Produces html file for the report."""
    html = '' #The table HTML
    for filenam in os.listdir(global_config.testlogs):
        #Generate html for the newest two lines
        #Get the lines from the log file
        (ll1, ll2) = getLastTwoLines(filenam, global_config.testlogs)
        logLine1 = processLogLine(ll1)
        logLine2 = processLogLine(ll2)

        #Generate html
        res1 = Result(logLine1.testname, logLine1.real, logLine2.real,\
        logLine2.revision, logLine2.branch, logLine1.revision, logLine1.branch)
        html = html + '<tr><td>' + logLine2.date + '</td><td>' + logLine2.time + '</td><td>' +\
        res1.testname + '</td><td>' + res1.revision[:10] + '</td><td>' + res1.branch + '</td><td>' +\
        str(res1.current) + '</td><td>' + str(res1.previous) + '</td><td>' + res1.prevrev[:10] + '</td>'

        #Add fancy colours depending on the change
        if res1.percentage > 0.05: #If we have improvement of more than 5%
            html = html + '<td class="better">' + str(res1.percentage) + '</td>'
        elif res1.percentage < -0.05: #We have a regression of more than 5%
            html = html + '<td class="worse">' + str(res1.percentage) + '</td>'
        else:
            html = html + '<td class="unchanged">' + str(res1.percentage) + '</td>'

        #Get comparison against the base version
        filenam = global_config.testlogs + '/' + filenam #Get proper directory
        res2 = compare_rev(filenam, global_config.basebranch, res1.revision, branch1=True)
        html = html + '<td>' + str(res2.previous) + '</td>'

        #Add fancy colours depending on the change
        if res2.percentage > 0.05: #If we have improvement of more than 5%
            html = html + '<td class="better">' + str(res2.percentage) + '</td>'
        elif res2.percentage < -0.05: #We have a regression of more than 5%
            html = html + '<td class="worse">' + str(res2.percentage) + '</td>'
        else:
            html = html + '<td class="unchanged">' + str(res2.percentage) + '</td>'

        #Add comparisons against earlier dates, if log entries for them exist
        past_dates = list(range(2, 8))
        past_dates.append(14)
        past_dates.append(365) # Get the 1 year ago day
        linesdict = gather_necessary_lines(filenam, logLine2.date)

        for days in past_dates:
            act_date = get_prev_days(logLine2.date, days)
            if linesdict[act_date][1] is not None:
                logline_date = linesdict[act_date][1]
                restemp = Result(logline_date.testname, logline_date.real, logLine2.real,\
                logLine2.revision, logLine2.branch, logline_date.revision, logline_date.branch)
                html = html + append_date_to_table(restemp)
            else:
                html = html + '<td>N/A</td><td>N/A</td>'



        html = html + '</tr>' #End row

    #Write out the file
    basebranch_info = '<text><b>Basebranch:</b> ' + res2.prevbranch + ' <b>Revision:</b> ' +\
    res2.prevrev + '</text>'
    writeoutstr = HTML_HEADING + basebranch_info + TABLE_HEADING + html + HTML_ENDING
    writefile = open(path, 'w')
    writefile.write(writeoutstr)
    writefile.close()

if __name__ == '__main__':
    CONFIG = parse_testconfig(sys.argv[1])
    produce_html('index.html', CONFIG)
293
contrib/moses-speedtest/runtests.py
Normal file
@ -0,0 +1,293 @@
"""Given a config file, runs tests"""
import os
import subprocess
import time
from argparse import ArgumentParser
from testsuite_common import processLogLine

def parse_cmd():
    """Parse the command line arguments"""
    description = "A python based speedtest suite for moses."
    parser = ArgumentParser(description=description)
    parser.add_argument("-c", "--configfile", action="store",\
                        dest="configfile", required=True,\
                        help="Specify test config file")
    parser.add_argument("-s", "--singletest", action="store",\
                        dest="singletestdir", default=None,\
                        help="Single test name directory. Specify directory name,\
                        not full path!")
    parser.add_argument("-r", "--revision", action="store",\
                        dest="revision", default=None,\
                        help="Specify a specific revision for the test.")
    parser.add_argument("-b", "--branch", action="store",\
                        dest="branch", default=None,\
                        help="Specify a branch for the test.")

    arguments = parser.parse_args()
    return arguments

def repoinit(testconfig):
    """Determines revision and sets up the repo."""
    revision = ''
    #Update the repo
    os.chdir(testconfig.repo)
    #Checkout specific branch, else maintain main branch
    if testconfig.branch != 'master':
        subprocess.call(['git', 'checkout', testconfig.branch])
        rev, _ = subprocess.Popen(['git', 'rev-parse', 'HEAD'],\
        stdout=subprocess.PIPE, stderr=subprocess.PIPE).communicate()
        revision = str(rev).replace("\\n'", '').replace("b'", '')
    else:
        subprocess.call(['git checkout master'], shell=True)

    #Check a specific revision. Else checkout master.
    if testconfig.revision:
        subprocess.call(['git', 'checkout', testconfig.revision])
        revision = testconfig.revision
    elif testconfig.branch == 'master':
        subprocess.call(['git pull'], shell=True)
        rev, _ = subprocess.Popen(['git rev-parse HEAD'], stdout=subprocess.PIPE,\
        stderr=subprocess.PIPE, shell=True).communicate()
        revision = str(rev).replace("\\n'", '').replace("b'", '')

    return revision

class Configuration:
    """A simple class to hold all of the configuration constants"""
    def __init__(self, repo, drop_caches, tests, testlogs, basebranch, baserev):
        self.repo = repo
        self.drop_caches = drop_caches
        self.tests = tests
        self.testlogs = testlogs
        self.basebranch = basebranch
        self.baserev = baserev
        self.singletest = None
        self.revision = None
        self.branch = 'master' # Default branch

    def additional_args(self, singletest, revision, branch):
        """Additional configuration from command line arguments"""
        self.singletest = singletest
        if revision is not None:
            self.revision = revision
        if branch is not None:
            self.branch = branch

    def set_revision(self, revision):
        """Sets the current revision that is being tested"""
        self.revision = revision


class Test:
    """A simple class to contain all information about tests"""
    def __init__(self, name, command, ldopts, permutations):
        self.name = name
        self.command = command
        self.ldopts = ldopts.replace(' ', '').split(',') #Not tested yet
        self.permutations = permutations

def parse_configfile(conffile, testdir, moses_repo):
    """Parses the config file"""
    command, ldopts = '', ''
    permutations = []
    fileopen = open(conffile, 'r')
    for line in fileopen:
        line = line.split('#')[0] # Discard comments
        if line == '' or line == '\n':
            continue # Discard lines with comments only and empty lines
        opt, args = line.split(' ', 1) # Get arguments

        if opt == 'Command:':
            command = args.replace('\n', '')
            command = moses_repo + '/bin/' + command
        elif opt == 'LDPRE:':
            ldopts = args.replace('\n', '')
        elif opt == 'Variants:':
            permutations = args.replace('\n', '').replace(' ', '').split(',')
        else:
            raise ValueError('Unrecognized option ' + opt)
    #We use the testdir as the name.
    testcase = Test(testdir, command, ldopts, permutations)
    fileopen.close()
    return testcase

def parse_testconfig(conffile):
    """Parses the config file for the whole testsuite."""
    repo_path, drop_caches, tests_dir, testlog_dir = '', '', '', ''
    basebranch, baserev = '', ''
    fileopen = open(conffile, 'r')
    for line in fileopen:
        line = line.split('#')[0] # Discard comments
        if line == '' or line == '\n':
            continue # Discard lines with comments only and empty lines
        opt, args = line.split(' ', 1) # Get arguments
        if opt == 'MOSES_REPO_PATH:':
            repo_path = args.replace('\n', '')
        elif opt == 'DROP_CACHES_COMM:':
            drop_caches = args.replace('\n', '')
        elif opt == 'TEST_DIR:':
            tests_dir = args.replace('\n', '')
        elif opt == 'TEST_LOG_DIR:':
            testlog_dir = args.replace('\n', '')
        elif opt == 'BASEBRANCH:':
            basebranch = args.replace('\n', '')
        elif opt == 'BASEREV:':
            baserev = args.replace('\n', '')
        else:
            raise ValueError('Unrecognized option ' + opt)
    config = Configuration(repo_path, drop_caches, tests_dir, testlog_dir,\
    basebranch, baserev)
    fileopen.close()
    return config

def get_config():
    """Builds the config object with all necessary attributes"""
    args = parse_cmd()
    config = parse_testconfig(args.configfile)
    config.additional_args(args.singletestdir, args.revision, args.branch)
    revision = repoinit(config)
    config.set_revision(revision)
    return config

def check_for_basever(testlogfile, basebranch):
    """Checks if the base revision is present in the testlogs"""
    #Use a context manager so the file is closed on every return path
    with open(testlogfile, 'r') as filetoopen:
        for line in filetoopen:
            templine = processLogLine(line)
            if templine.branch == basebranch:
                return True
    return False

def split_time(filename):
    """Splits the output of the time function into separate parts.
    We will write time to file, because many programs output to
    stderr which makes it difficult to get only the exact results we need."""
    timefile = open(filename, 'r')
    realtime = float(timefile.readline().replace('\n', '').split()[1])
    usertime = float(timefile.readline().replace('\n', '').split()[1])
    systime = float(timefile.readline().replace('\n', '').split()[1])
    timefile.close()

    return (realtime, usertime, systime)


def write_log(time_file, logname, config):
    """Writes to a logfile"""
    log_write = open(config.testlogs + '/' + logname, 'a') # Open logfile
    date_run = time.strftime("%d.%m.%Y %H:%M:%S") # Get the time of the test
    realtime, usertime, systime = split_time(time_file) # Get the times in a nice form

    # Append everything to a log file.
    writestr = date_run + " " + config.revision + " Testname: " + logname +\
    " RealTime: " + str(realtime) + " UserTime: " + str(usertime) +\
    " SystemTime: " + str(systime) + " Branch: " + config.branch +'\n'
    log_write.write(writestr)
    log_write.close()


def execute_tests(testcase, cur_directory, config):
    """Executes timed tests based on the config file"""
    #Figure out the order of which tests must be executed.
    #Change to the current test directory
    os.chdir(config.tests + '/' + cur_directory)
    #Clear caches
    subprocess.call(['sync'], shell=True)
    subprocess.call([config.drop_caches], shell=True)
    #Perform vanilla test and if a cached test exists - as well
    print(testcase.name)
    if 'vanilla' in testcase.permutations:
        print(testcase.command)
        subprocess.Popen(['time -p -o /tmp/time_moses_tests ' + testcase.command], stdout=None,\
        stderr=subprocess.PIPE, shell=True).communicate()
        write_log('/tmp/time_moses_tests', testcase.name + '_vanilla', config)
        if 'cached' in testcase.permutations:
            subprocess.Popen(['time -p -o /tmp/time_moses_tests ' + testcase.command], stdout=None,\
            stderr=None, shell=True).communicate()
            write_log('/tmp/time_moses_tests', testcase.name + '_vanilla_cached', config)

    #Now perform LD_PRELOAD tests
    if 'ldpre' in testcase.permutations:
        for opt in testcase.ldopts:
            #Clear caches
            subprocess.call(['sync'], shell=True)
            subprocess.call([config.drop_caches], shell=True)

            #test
            #LD_PRELOAD must be set as VAR=value for the shell to treat it
            #as an environment assignment rather than a command name
            subprocess.Popen(['LD_PRELOAD=' + opt + ' time -p -o /tmp/time_moses_tests ' + testcase.command], stdout=None,\
            stderr=None, shell=True).communicate()
            write_log('/tmp/time_moses_tests', testcase.name + '_ldpre_' + opt, config)
            if 'cached' in testcase.permutations:
                subprocess.Popen(['LD_PRELOAD=' + opt + ' time -p -o /tmp/time_moses_tests ' + testcase.command], stdout=None,\
                stderr=None, shell=True).communicate()
                write_log('/tmp/time_moses_tests', testcase.name + '_ldpre_' +opt +'_cached', config)
|
||||
|
||||
# Go through all the test directories and execute the tests
if __name__ == '__main__':
    CONFIG = get_config()
    ALL_DIR = os.listdir(CONFIG.tests)

    #We should first check if any of the tests is run for the first time.
    #If some of them are run for the first time we should first get their
    #time with the base version (usually the previous release)
    FIRSTTIME = []
    TESTLOGS = []
    #Strip the variant suffixes from the log filenames
    for listline in os.listdir(CONFIG.testlogs):
        listline = listline.replace('_vanilla', '')
        listline = listline.replace('_cached', '')
        listline = listline.replace('_ldpre', '')
        TESTLOGS.append(listline)
    for directory in ALL_DIR:
        if directory not in TESTLOGS:
            FIRSTTIME.append(directory)

    #Sometimes, even though we have the log files, we will need to rerun them
    #against a base version, because we require a different base version (for
    #example when a new version of Moses is released). Therefore we should
    #check if the version of Moses that we have as a base version is in all
    #of the log files.

    for logfile in os.listdir(CONFIG.testlogs):
        logfile_name = CONFIG.testlogs + '/' + logfile
        if not check_for_basever(logfile_name, CONFIG.basebranch):
            logfile = logfile.replace('_vanilla', '')
            logfile = logfile.replace('_cached', '')
            logfile = logfile.replace('_ldpre', '')
            FIRSTTIME.append(logfile)
    FIRSTTIME = list(set(FIRSTTIME)) #Deduplicate

    if FIRSTTIME != []:
        #Create a new configuration for base version tests:
        BASECONFIG = Configuration(CONFIG.repo, CONFIG.drop_caches,\
            CONFIG.tests, CONFIG.testlogs, CONFIG.basebranch,\
            CONFIG.baserev)
        BASECONFIG.additional_args(None, CONFIG.baserev, CONFIG.basebranch)
        #Set up the repository and get its revision:
        REVISION = repoinit(BASECONFIG)
        BASECONFIG.set_revision(REVISION)
        #Build
        os.chdir(BASECONFIG.repo)
        subprocess.call(['./previous.sh'], shell=True)

        #Perform tests
        for directory in FIRSTTIME:
            cur_testcase = parse_configfile(BASECONFIG.tests + '/' + directory +\
                '/config', directory, BASECONFIG.repo)
            execute_tests(cur_testcase, directory, BASECONFIG)

        #Reset the repository back to the normal configuration
        repoinit(CONFIG)

    #Build Moses
    os.chdir(CONFIG.repo)
    subprocess.call(['./previous.sh'], shell=True)

    if CONFIG.singletest:
        TESTCASE = parse_configfile(CONFIG.tests + '/' +\
            CONFIG.singletest + '/config', CONFIG.singletest, CONFIG.repo)
        execute_tests(TESTCASE, CONFIG.singletest, CONFIG)
    else:
        for directory in ALL_DIR:
            cur_testcase = parse_configfile(CONFIG.tests + '/' + directory +\
                '/config', directory, CONFIG.repo)
            execute_tests(cur_testcase, directory, CONFIG)
|
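The first-run detection in the main block above can be sketched as a small standalone function. This is only an illustration, not code from the commit: `strip_variant`, `find_first_time_tests` and the `has_base_version` callback are hypothetical names, and the callback stands in for `check_for_basever`, whose body is not shown in this part of the diff.

```python
# Illustrative sketch only; the names below are assumptions, not part of runtests.py.
VARIANT_SUFFIXES = ('_vanilla', '_cached', '_ldpre')


def strip_variant(log_name):
    """Remove every variant suffix from a log file name."""
    for suffix in VARIANT_SUFFIXES:
        log_name = log_name.replace(suffix, '')
    return log_name


def find_first_time_tests(test_dirs, log_files, has_base_version):
    """Return the sorted, deduplicated tests that still need a baseline run.

    A test needs a baseline run when it has no log file at all, or when one
    of its log files lacks an entry for the base version.
    """
    logged = {strip_variant(name) for name in log_files}
    pending = {name for name in test_dirs if name not in logged}
    pending.update(strip_variant(name) for name in log_files
                   if not has_base_version(name))
    return sorted(pending)


# Example: 'newstest' has no log yet; 'europarl_cached' lacks a base entry.
needs_baseline = find_first_time_tests(
    ['europarl', 'newstest'],
    ['europarl_vanilla', 'europarl_cached'],
    lambda name: name == 'europarl_vanilla')
print(needs_baseline)
```

Using a set from the start removes the need for the separate deduplication step that the script performs with `list(set(FIRSTTIME))`.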
22
contrib/moses-speedtest/sys_drop_caches.py
Normal file
@ -0,0 +1,22 @@
|
||||
#!/usr/bin/spython
from sys import argv, stderr, exit
from os import linesep as ls
procfile = "/proc/sys/vm/drop_caches"
options = ["1","2","3"]
flush_type = None
try:
    flush_type = argv[1][0:1]
    if not flush_type in options:
        raise IndexError, "not in options"
    with open(procfile, "w") as f:
        f.write("%s%s" % (flush_type,ls))
    exit(0)
except IndexError, e:
    stderr.write("Argument %s required.%s" % (options, ls))
except IOError, e:
    stderr.write("Error writing to file.%s" % ls)
except StandardError, e:
    stderr.write("Unknown Error.%s" % ls)

exit(1)
|
||||
|
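The script above uses Python 2 syntax (`raise IndexError, "..."`, `except IOError, e`). A rough Python 3 sketch of the same idea follows; it is not part of the commit. The function names and the `procfile` parameter are assumptions introduced so that the validation and the write can be exercised without root privileges.

```python
# Illustrative Python 3 sketch; function names are assumptions, not from the commit.
import os
import sys

PROCFILE = "/proc/sys/vm/drop_caches"
OPTIONS = ("1", "2", "3")


def parse_flush_type(args):
    """Return the first character of the first argument if it is a valid option."""
    if not args or args[0][:1] not in OPTIONS:
        raise ValueError("Argument %s required." % list(OPTIONS))
    return args[0][:1]


def drop_caches(flush_type, procfile=PROCFILE):
    """Write the flush type to the proc file (needs root for the real path)."""
    with open(procfile, "w") as handle:
        handle.write(flush_type + "\n")


def main(argv):
    """Return 0 on success and 1 on any handled error, like the original script."""
    try:
        drop_caches(parse_flush_type(argv[1:]))
    except ValueError as err:
        sys.stderr.write("%s\n" % err)
        return 1
    except OSError:
        sys.stderr.write("Error writing to file.\n")
        return 1
    return 0
```

A caller would typically end the script with `sys.exit(main(sys.argv))`.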
3
contrib/moses-speedtest/test_config
Normal file
@ -0,0 +1,3 @@
|
||||
Command: moses -f ... -i fff #Looks for the command in the /bin directory of the repo specified in the testsuite_config
|
||||
LDPRE: ldpreloads #Comma separated LD_LIBRARY_PATH:/,
|
||||
Variants: vanilla, cached, ldpre #Can't have cached without ldpre or vanilla
|
54
contrib/moses-speedtest/testsuite_common.py
Normal file
@ -0,0 +1,54 @@
"""Common functions of the testsuite"""
import os
#Colour constants
class bcolors:
    PURPLE = '\033[95m'
    BLUE = '\033[94m'
    GREEN = '\033[92m'
    YELLOW = '\033[93m'
    RED = '\033[91m'
    ENDC = '\033[0m'

class LogLine:
    """A class to contain a logfile line"""
    def __init__(self, date, time, revision, testname, real, user, system, branch):
        self.date = date
        self.time = time
        self.revision = revision
        self.testname = testname
        self.real = real
        self.system = system
        self.user = user
        self.branch = branch

class Result:
    """A class to contain results of benchmarking"""
    def __init__(self, testname, previous, current, revision, branch, prevrev, prevbranch):
        self.testname = testname
        self.previous = previous
        self.current = current
        self.change = previous - current
        self.revision = revision
        self.branch = branch
        self.prevbranch = prevbranch
        self.prevrev = prevrev
        #Produce a percentage with fewer digits
        self.percentage = float(format(1 - current/previous, '.4f'))

def processLogLine(logline):
    """Parses the log line into a nice datastructure"""
    logline = logline.split()
    log = LogLine(logline[0], logline[1], logline[2], logline[4],\
        float(logline[6]), float(logline[8]), float(logline[10]), logline[12])
    return log

def getLastTwoLines(filename, logdir):
    """Just a call to tail to get the diff between the last two runs"""
    try:
        line1, line2 = os.popen("tail -n2 " + logdir + '/' + filename)
    except ValueError: #Check for new tests
        tempfile = open(logdir + '/' + filename)
        line1 = tempfile.readline()
        tempfile.close()
        return (line1, '\n')
    return (line1, line2)
|
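`getLastTwoLines` above shells out to `tail -n2` and falls back to reading the first line when the log holds fewer than two lines. A pure-Python variant might look like the sketch below. It is an assumption-laden alternative rather than the commit's code: `get_last_two_lines` is a hypothetical name, and a bounded `collections.deque` replaces the external `tail` call.

```python
# Illustrative sketch; get_last_two_lines is a hypothetical replacement name.
import os
from collections import deque


def get_last_two_lines(filename, logdir):
    """Return the last two lines of a log file as a tuple.

    If the file holds fewer than two lines, the second element is a bare
    newline, mirroring the fallback used for new tests.
    """
    with open(os.path.join(logdir, filename)) as handle:
        tail = deque(handle, maxlen=2)  # keeps only the final two lines
    if len(tail) == 2:
        return tail[0], tail[1]
    first = tail[0] if tail else ''
    return first, '\n'
```

This avoids spawning a process and does not depend on `tail` being installed, at the cost of reading through the whole file once.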
5
contrib/moses-speedtest/testsuite_config
Normal file
@ -0,0 +1,5 @@
|
||||
MOSES_REPO_PATH: /home/moses-speedtest/moses-standard/mosesdecoder
|
||||
DROP_CACHES_COMM: sys_drop_caches 3
|
||||
TEST_DIR: /home/moses-speedtest/phrase_tables/tests
|
||||
TEST_LOG_DIR: /home/moses-speedtest/phrase_tables/testlogs
|
||||
BASEBRANCH: RELEASE-2.1.1
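The configuration file above is a list of `KEY: value` lines. How the test suite actually parses it is not shown in this part of the diff, so the following is only a minimal sketch under that assumption; `parse_testsuite_config` is a hypothetical helper, and treating `#` as a comment marker is borrowed from the sample `test_config` file.

```python
# Illustrative sketch; the parser name and comment handling are assumptions.
def parse_testsuite_config(text):
    """Parse 'KEY: value' lines into a dict, skipping blanks and '#' comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.split('#', 1)[0].strip()
        if not line:
            continue
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(':')
        if not sep:
            raise ValueError("expected 'KEY: value', got: " + raw_line)
        settings[key.strip()] = value.strip()
    return settings


SAMPLE = """\
MOSES_REPO_PATH: /home/moses-speedtest/moses-standard/mosesdecoder
DROP_CACHES_COMM: sys_drop_caches 3
TEST_DIR: /home/moses-speedtest/phrase_tables/tests
TEST_LOG_DIR: /home/moses-speedtest/phrase_tables/testlogs
BASEBRANCH: RELEASE-2.1.1
"""
config = parse_testsuite_config(SAMPLE)
```

Note that this sketch would truncate any value containing a literal `#`, which may or may not match the real parser.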
|
132
contrib/other-builds/consolidate/.cproject
Normal file
@ -0,0 +1,132 @@
|
||||
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
|
||||
<?fileVersion 4.0.0?><cproject storage_type_id="org.eclipse.cdt.core.XmlProjectDescriptionStorage">
|
||||
<storageModule moduleId="org.eclipse.cdt.core.settings">
|
||||
<cconfiguration id="cdt.managedbuild.config.gnu.cross.exe.debug.1847651686">
|
||||
<storageModule buildSystemId="org.eclipse.cdt.managedbuilder.core.configurationDataProvider" id="cdt.managedbuild.config.gnu.cross.exe.debug.1847651686" moduleId="org.eclipse.cdt.core.settings" name="Debug">
|
||||
<externalSettings/>
|
||||
<extensions>
|
||||
<extension id="org.eclipse.cdt.core.GmakeErrorParser" point="org.eclipse.cdt.core.ErrorParser"/>
|
||||
<extension id="org.eclipse.cdt.core.CWDLocator" point="org.eclipse.cdt.core.ErrorParser"/>
|
||||
<extension id="org.eclipse.cdt.core.GCCErrorParser" point="org.eclipse.cdt.core.ErrorParser"/>
|
||||
<extension id="org.eclipse.cdt.core.GASErrorParser" point="org.eclipse.cdt.core.ErrorParser"/>
|
||||
<extension id="org.eclipse.cdt.core.GLDErrorParser" point="org.eclipse.cdt.core.ErrorParser"/>
|
||||
<extension id="org.eclipse.cdt.core.ELF" point="org.eclipse.cdt.core.BinaryParser"/>
|
||||
</extensions>
|
||||
</storageModule>
|
||||
<storageModule moduleId="cdtBuildSystem" version="4.0.0">
|
||||
<configuration artifactName="${ProjName}" buildArtefactType="org.eclipse.cdt.build.core.buildArtefactType.exe" buildProperties="org.eclipse.cdt.build.core.buildType=org.eclipse.cdt.build.core.buildType.debug,org.eclipse.cdt.build.core.buildArtefactType=org.eclipse.cdt.build.core.buildArtefactType.exe" cleanCommand="rm -rf" description="" id="cdt.managedbuild.config.gnu.cross.exe.debug.1847651686" name="Debug" parent="cdt.managedbuild.config.gnu.cross.exe.debug">
|
||||
<folderInfo id="cdt.managedbuild.config.gnu.cross.exe.debug.1847651686." name="/" resourcePath="">
|
||||
<toolChain id="cdt.managedbuild.toolchain.gnu.cross.exe.debug.1312813804" name="Cross GCC" superClass="cdt.managedbuild.toolchain.gnu.cross.exe.debug">
|
||||
<targetPlatform archList="all" binaryParser="org.eclipse.cdt.core.ELF" id="cdt.managedbuild.targetPlatform.gnu.cross.1457158442" isAbstract="false" osList="all" superClass="cdt.managedbuild.targetPlatform.gnu.cross"/>
|
||||
<builder buildPath="${workspace_loc:/consolidate}/Debug" id="cdt.managedbuild.builder.gnu.cross.401817170" keepEnvironmentInBuildfile="false" managedBuildOn="true" name="Gnu Make Builder" superClass="cdt.managedbuild.builder.gnu.cross"/>
|
||||
<tool id="cdt.managedbuild.tool.gnu.cross.c.compiler.584773180" name="Cross GCC Compiler" superClass="cdt.managedbuild.tool.gnu.cross.c.compiler">
|
||||
<option defaultValue="gnu.c.optimization.level.none" id="gnu.c.compiler.option.optimization.level.548826159" name="Optimization Level" superClass="gnu.c.compiler.option.optimization.level" valueType="enumerated"/>
|
||||
<option id="gnu.c.compiler.option.debugging.level.69309976" name="Debug Level" superClass="gnu.c.compiler.option.debugging.level" value="gnu.c.debugging.level.max" valueType="enumerated"/>
|
||||
<inputType id="cdt.managedbuild.tool.gnu.c.compiler.input.1869389417" superClass="cdt.managedbuild.tool.gnu.c.compiler.input"/>
|
||||
</tool>
|
||||
<tool id="cdt.managedbuild.tool.gnu.cross.cpp.compiler.1684035985" name="Cross G++ Compiler" superClass="cdt.managedbuild.tool.gnu.cross.cpp.compiler">
|
||||
<option id="gnu.cpp.compiler.option.optimization.level.1978964587" name="Optimization Level" superClass="gnu.cpp.compiler.option.optimization.level" value="gnu.cpp.compiler.optimization.level.none" valueType="enumerated"/>
|
||||
<option id="gnu.cpp.compiler.option.debugging.level.1174628687" name="Debug Level" superClass="gnu.cpp.compiler.option.debugging.level" value="gnu.cpp.compiler.debugging.level.max" valueType="enumerated"/>
|
||||
<option id="gnu.cpp.compiler.option.include.paths.1899244069" name="Include paths (-I)" superClass="gnu.cpp.compiler.option.include.paths" valueType="includePath">
|
||||
<listOptionValue builtIn="false" value="&quot;${workspace_loc}/../../boost/include&quot;"/>
|
||||
</option>
|
||||
<inputType id="cdt.managedbuild.tool.gnu.cpp.compiler.input.1369007077" superClass="cdt.managedbuild.tool.gnu.cpp.compiler.input"/>
|
||||
</tool>
|
||||
<tool id="cdt.managedbuild.tool.gnu.cross.c.linker.988122551" name="Cross GCC Linker" superClass="cdt.managedbuild.tool.gnu.cross.c.linker"/>
|
||||
<tool id="cdt.managedbuild.tool.gnu.cross.cpp.linker.580092188" name="Cross G++ Linker" superClass="cdt.managedbuild.tool.gnu.cross.cpp.linker">
|
||||
<option id="gnu.cpp.link.option.libs.1224797947" name="Libraries (-l)" superClass="gnu.cpp.link.option.libs" valueType="libs">
|
||||
<listOptionValue builtIn="false" value="z"/>
|
||||
<listOptionValue builtIn="false" value="boost_iostreams-mt"/>
|
||||
</option>
|
||||
<option id="gnu.cpp.link.option.paths.845281969" superClass="gnu.cpp.link.option.paths" valueType="libPaths">
|
||||
<listOptionValue builtIn="false" value="&quot;${workspace_loc:}/../../boost/lib64&quot;"/>
|
||||
</option>
|
||||
<inputType id="cdt.managedbuild.tool.gnu.cpp.linker.input.1562981657" superClass="cdt.managedbuild.tool.gnu.cpp.linker.input">
|
||||
<additionalInput kind="additionalinputdependency" paths="$(USER_OBJS)"/>
|
||||
<additionalInput kind="additionalinput" paths="$(LIBS)"/>
|
||||
</inputType>
|
||||
</tool>
|
||||
<tool id="cdt.managedbuild.tool.gnu.cross.archiver.1813579853" name="Cross GCC Archiver" superClass="cdt.managedbuild.tool.gnu.cross.archiver"/>
|
||||
<tool id="cdt.managedbuild.tool.gnu.cross.assembler.660034723" name="Cross GCC Assembler" superClass="cdt.managedbuild.tool.gnu.cross.assembler">
|
||||
<inputType id="cdt.managedbuild.tool.gnu.assembler.input.2016181080" superClass="cdt.managedbuild.tool.gnu.assembler.input"/>
|
||||
</tool>
|
||||
</toolChain>
|
||||
</folderInfo>
|
||||
</configuration>
|
||||
</storageModule>
|
||||
<storageModule moduleId="org.eclipse.cdt.core.externalSettings"/>
|
||||
</cconfiguration>
|
||||
<cconfiguration id="cdt.managedbuild.config.gnu.cross.exe.release.1197533473">
|
||||
<storageModule buildSystemId="org.eclipse.cdt.managedbuilder.core.configurationDataProvider" id="cdt.managedbuild.config.gnu.cross.exe.release.1197533473" moduleId="org.eclipse.cdt.core.settings" name="Release">
|
||||
<externalSettings/>
|
||||
<extensions>
|
||||
<extension id="org.eclipse.cdt.core.GmakeErrorParser" point="org.eclipse.cdt.core.ErrorParser"/>
|
||||
<extension id="org.eclipse.cdt.core.CWDLocator" point="org.eclipse.cdt.core.ErrorParser"/>
|
||||
<extension id="org.eclipse.cdt.core.GCCErrorParser" point="org.eclipse.cdt.core.ErrorParser"/>
|
||||
<extension id="org.eclipse.cdt.core.GASErrorParser" point="org.eclipse.cdt.core.ErrorParser"/>
|
||||
<extension id="org.eclipse.cdt.core.GLDErrorParser" point="org.eclipse.cdt.core.ErrorParser"/>
|
||||
<extension id="org.eclipse.cdt.core.ELF" point="org.eclipse.cdt.core.BinaryParser"/>
|
||||
</extensions>
|
||||
</storageModule>
|
||||
<storageModule moduleId="cdtBuildSystem" version="4.0.0">
|
||||
<configuration artifactName="${ProjName}" buildArtefactType="org.eclipse.cdt.build.core.buildArtefactType.exe" buildProperties="org.eclipse.cdt.build.core.buildType=org.eclipse.cdt.build.core.buildType.release,org.eclipse.cdt.build.core.buildArtefactType=org.eclipse.cdt.build.core.buildArtefactType.exe" cleanCommand="rm -rf" description="" id="cdt.managedbuild.config.gnu.cross.exe.release.1197533473" name="Release" parent="cdt.managedbuild.config.gnu.cross.exe.release">
|
||||
<folderInfo id="cdt.managedbuild.config.gnu.cross.exe.release.1197533473." name="/" resourcePath="">
|
||||
<toolChain id="cdt.managedbuild.toolchain.gnu.cross.exe.release.1193312581" name="Cross GCC" superClass="cdt.managedbuild.toolchain.gnu.cross.exe.release">
|
||||
<targetPlatform archList="all" binaryParser="org.eclipse.cdt.core.ELF" id="cdt.managedbuild.targetPlatform.gnu.cross.1614674218" isAbstract="false" osList="all" superClass="cdt.managedbuild.targetPlatform.gnu.cross"/>
|
||||
<builder buildPath="${workspace_loc:/consolidate}/Release" id="cdt.managedbuild.builder.gnu.cross.1921548268" keepEnvironmentInBuildfile="false" managedBuildOn="true" name="Gnu Make Builder" superClass="cdt.managedbuild.builder.gnu.cross"/>
|
||||
<tool id="cdt.managedbuild.tool.gnu.cross.c.compiler.1402792534" name="Cross GCC Compiler" superClass="cdt.managedbuild.tool.gnu.cross.c.compiler">
|
||||
<option defaultValue="gnu.c.optimization.level.most" id="gnu.c.compiler.option.optimization.level.172258714" name="Optimization Level" superClass="gnu.c.compiler.option.optimization.level" valueType="enumerated"/>
|
||||
<option id="gnu.c.compiler.option.debugging.level.949623548" name="Debug Level" superClass="gnu.c.compiler.option.debugging.level" value="gnu.c.debugging.level.none" valueType="enumerated"/>
|
||||
<inputType id="cdt.managedbuild.tool.gnu.c.compiler.input.1960225725" superClass="cdt.managedbuild.tool.gnu.c.compiler.input"/>
|
||||
</tool>
|
||||
<tool id="cdt.managedbuild.tool.gnu.cross.cpp.compiler.1697856596" name="Cross G++ Compiler" superClass="cdt.managedbuild.tool.gnu.cross.cpp.compiler">
|
||||
<option id="gnu.cpp.compiler.option.optimization.level.1575999400" name="Optimization Level" superClass="gnu.cpp.compiler.option.optimization.level" value="gnu.cpp.compiler.optimization.level.most" valueType="enumerated"/>
|
||||
<option id="gnu.cpp.compiler.option.debugging.level.732263649" name="Debug Level" superClass="gnu.cpp.compiler.option.debugging.level" value="gnu.cpp.compiler.debugging.level.none" valueType="enumerated"/>
|
||||
<inputType id="cdt.managedbuild.tool.gnu.cpp.compiler.input.1685852561" superClass="cdt.managedbuild.tool.gnu.cpp.compiler.input"/>
|
||||
</tool>
|
||||
<tool id="cdt.managedbuild.tool.gnu.cross.c.linker.1332869586" name="Cross GCC Linker" superClass="cdt.managedbuild.tool.gnu.cross.c.linker"/>
|
||||
<tool id="cdt.managedbuild.tool.gnu.cross.cpp.linker.484647585" name="Cross G++ Linker" superClass="cdt.managedbuild.tool.gnu.cross.cpp.linker">
|
||||
<inputType id="cdt.managedbuild.tool.gnu.cpp.linker.input.2140954002" superClass="cdt.managedbuild.tool.gnu.cpp.linker.input">
|
||||
<additionalInput kind="additionalinputdependency" paths="$(USER_OBJS)"/>
|
||||
<additionalInput kind="additionalinput" paths="$(LIBS)"/>
|
||||
</inputType>
|
||||
</tool>
|
||||
<tool id="cdt.managedbuild.tool.gnu.cross.archiver.620666274" name="Cross GCC Archiver" superClass="cdt.managedbuild.tool.gnu.cross.archiver"/>
|
||||
<tool id="cdt.managedbuild.tool.gnu.cross.assembler.1478840357" name="Cross GCC Assembler" superClass="cdt.managedbuild.tool.gnu.cross.assembler">
|
||||
<inputType id="cdt.managedbuild.tool.gnu.assembler.input.412043972" superClass="cdt.managedbuild.tool.gnu.assembler.input"/>
|
||||
</tool>
|
||||
</toolChain>
|
||||
</folderInfo>
|
||||
</configuration>
|
||||
</storageModule>
|
||||
<storageModule moduleId="org.eclipse.cdt.core.externalSettings"/>
|
||||
</cconfiguration>
|
||||
</storageModule>
|
||||
<storageModule moduleId="cdtBuildSystem" version="4.0.0">
|
||||
<project id="consolidate.cdt.managedbuild.target.gnu.cross.exe.1166003694" name="Executable" projectType="cdt.managedbuild.target.gnu.cross.exe"/>
|
||||
</storageModule>
|
||||
<storageModule moduleId="scannerConfiguration">
|
||||
<autodiscovery enabled="true" problemReportingEnabled="true" selectedProfileId=""/>
|
||||
<scannerConfigBuildInfo instanceId="cdt.managedbuild.config.gnu.cross.exe.debug.1847651686;cdt.managedbuild.config.gnu.cross.exe.debug.1847651686.;cdt.managedbuild.tool.gnu.cross.c.compiler.584773180;cdt.managedbuild.tool.gnu.c.compiler.input.1869389417">
|
||||
<autodiscovery enabled="true" problemReportingEnabled="true" selectedProfileId="org.eclipse.cdt.managedbuilder.core.GCCManagedMakePerProjectProfileC"/>
|
||||
</scannerConfigBuildInfo>
|
||||
<scannerConfigBuildInfo instanceId="cdt.managedbuild.config.gnu.cross.exe.release.1197533473;cdt.managedbuild.config.gnu.cross.exe.release.1197533473.;cdt.managedbuild.tool.gnu.cross.cpp.compiler.1697856596;cdt.managedbuild.tool.gnu.cpp.compiler.input.1685852561">
|
||||
<autodiscovery enabled="true" problemReportingEnabled="true" selectedProfileId="org.eclipse.cdt.managedbuilder.core.GCCManagedMakePerProjectProfileCPP"/>
|
||||
</scannerConfigBuildInfo>
|
||||
<scannerConfigBuildInfo instanceId="cdt.managedbuild.config.gnu.cross.exe.debug.1847651686;cdt.managedbuild.config.gnu.cross.exe.debug.1847651686.;cdt.managedbuild.tool.gnu.cross.cpp.compiler.1684035985;cdt.managedbuild.tool.gnu.cpp.compiler.input.1369007077">
|
||||
<autodiscovery enabled="true" problemReportingEnabled="true" selectedProfileId="org.eclipse.cdt.managedbuilder.core.GCCManagedMakePerProjectProfileCPP"/>
|
||||
</scannerConfigBuildInfo>
|
||||
<scannerConfigBuildInfo instanceId="cdt.managedbuild.config.gnu.cross.exe.release.1197533473;cdt.managedbuild.config.gnu.cross.exe.release.1197533473.;cdt.managedbuild.tool.gnu.cross.c.compiler.1402792534;cdt.managedbuild.tool.gnu.c.compiler.input.1960225725">
|
||||
<autodiscovery enabled="true" problemReportingEnabled="true" selectedProfileId="org.eclipse.cdt.managedbuilder.core.GCCManagedMakePerProjectProfileC"/>
|
||||
</scannerConfigBuildInfo>
|
||||
</storageModule>
|
||||
<storageModule moduleId="org.eclipse.cdt.core.LanguageSettingsProviders"/>
|
||||
<storageModule moduleId="refreshScope" versionNumber="2">
|
||||
<configuration configurationName="Release">
|
||||
<resource resourceType="PROJECT" workspacePath="/consolidate"/>
|
||||
</configuration>
|
||||
<configuration configurationName="Debug">
|
||||
<resource resourceType="PROJECT" workspacePath="/consolidate"/>
|
||||
</configuration>
|
||||
</storageModule>
|
||||
</cproject>
|
64
contrib/other-builds/consolidate/.project
Normal file
@ -0,0 +1,64 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<projectDescription>
|
||||
<name>consolidate</name>
|
||||
<comment></comment>
|
||||
<projects>
|
||||
</projects>
|
||||
<buildSpec>
|
||||
<buildCommand>
|
||||
<name>org.eclipse.cdt.managedbuilder.core.genmakebuilder</name>
|
||||
<triggers>clean,full,incremental,</triggers>
|
||||
<arguments>
|
||||
</arguments>
|
||||
</buildCommand>
|
||||
<buildCommand>
|
||||
<name>org.eclipse.cdt.managedbuilder.core.ScannerConfigBuilder</name>
|
||||
<triggers>full,incremental,</triggers>
|
||||
<arguments>
|
||||
</arguments>
|
||||
</buildCommand>
|
||||
</buildSpec>
|
||||
<natures>
|
||||
<nature>org.eclipse.cdt.core.cnature</nature>
|
||||
<nature>org.eclipse.cdt.core.ccnature</nature>
|
||||
<nature>org.eclipse.cdt.managedbuilder.core.managedBuildNature</nature>
|
||||
<nature>org.eclipse.cdt.managedbuilder.core.ScannerConfigNature</nature>
|
||||
</natures>
|
||||
<linkedResources>
|
||||
<link>
|
||||
<name>InputFileStream.cpp</name>
|
||||
<type>1</type>
|
||||
<locationURI>PARENT-3-PROJECT_LOC/phrase-extract/InputFileStream.cpp</locationURI>
|
||||
</link>
|
||||
<link>
|
||||
<name>InputFileStream.h</name>
|
||||
<type>1</type>
|
||||
<locationURI>PARENT-3-PROJECT_LOC/phrase-extract/InputFileStream.h</locationURI>
|
||||
</link>
|
||||
<link>
|
||||
<name>OutputFileStream.cpp</name>
|
||||
<type>1</type>
|
||||
<locationURI>PARENT-3-PROJECT_LOC/phrase-extract/OutputFileStream.cpp</locationURI>
|
||||
</link>
|
||||
<link>
|
||||
<name>OutputFileStream.h</name>
|
||||
<type>1</type>
|
||||
<locationURI>PARENT-3-PROJECT_LOC/phrase-extract/OutputFileStream.h</locationURI>
|
||||
</link>
|
||||
<link>
|
||||
<name>consolidate-main.cpp</name>
|
||||
<type>1</type>
|
||||
<locationURI>PARENT-3-PROJECT_LOC/phrase-extract/consolidate-main.cpp</locationURI>
|
||||
</link>
|
||||
<link>
|
||||
<name>tables-core.cpp</name>
|
||||
<type>1</type>
|
||||
<locationURI>PARENT-3-PROJECT_LOC/phrase-extract/tables-core.cpp</locationURI>
|
||||
</link>
|
||||
<link>
|
||||
<name>tables-core.h</name>
|
||||
<type>1</type>
|
||||
<locationURI>PARENT-3-PROJECT_LOC/phrase-extract/tables-core.h</locationURI>
|
||||
</link>
|
||||
</linkedResources>
|
||||
</projectDescription>
|
@ -42,9 +42,11 @@
|
||||
</option>
|
||||
<option id="gnu.cpp.link.option.libs.585257079" name="Libraries (-l)" superClass="gnu.cpp.link.option.libs" valueType="libs">
|
||||
<listOptionValue builtIn="false" value="mert_lib"/>
|
||||
<listOptionValue builtIn="false" value="boost_system-mt"/>
|
||||
<listOptionValue builtIn="false" value="util"/>
|
||||
<listOptionValue builtIn="false" value="boost_system-mt"/>
|
||||
<listOptionValue builtIn="false" value="boost_thread-mt"/>
|
||||
<listOptionValue builtIn="false" value="z"/>
|
||||
<listOptionValue builtIn="false" value="pthread"/>
|
||||
</option>
|
||||
<inputType id="cdt.managedbuild.tool.gnu.cpp.linker.input.656319745" superClass="cdt.managedbuild.tool.gnu.cpp.linker.input">
|
||||
<additionalInput kind="additionalinputdependency" paths="$(USER_OBJS)"/>
|
||||
|
@ -4,6 +4,7 @@
|
||||
<comment></comment>
|
||||
<projects>
|
||||
<project>mert_lib</project>
|
||||
<project>util</project>
|
||||
</projects>
|
||||
<buildSpec>
|
||||
<buildCommand>
|
||||
|
@ -125,7 +125,7 @@ void ChartManager::ProcessSentence()
|
||||
*/
|
||||
void ChartManager::AddXmlChartOptions()
|
||||
{
|
||||
const StaticData &staticData = StaticData::Instance();
|
||||
// const StaticData &staticData = StaticData::Instance();
|
||||
|
||||
const std::vector <ChartTranslationOptions*> xmlChartOptionsList = m_source.GetXmlChartTranslationOptions();
|
||||
IFVERBOSE(2) {
|
||||
|
@ -142,7 +142,7 @@ namespace Moses
|
||||
{
|
||||
Clear();
|
||||
|
||||
const StaticData &staticData = StaticData::Instance();
|
||||
// const StaticData &staticData = StaticData::Instance();
|
||||
const InputFeature &inputFeature = InputFeature::Instance();
|
||||
size_t numInputScores = inputFeature.GetNumInputScores();
|
||||
size_t numRealWordCount = inputFeature.GetNumRealWordsInInput();
|
||||
|
@ -85,7 +85,7 @@ size_t InputPath::GetTotalRuleSize() const
|
||||
size_t ret = 0;
|
||||
std::map<const PhraseDictionary*, std::pair<const TargetPhraseCollection*, const void*> >::const_iterator iter;
|
||||
for (iter = m_targetPhrases.begin(); iter != m_targetPhrases.end(); ++iter) {
|
||||
const PhraseDictionary *pt = iter->first;
|
||||
// const PhraseDictionary *pt = iter->first;
|
||||
const TargetPhraseCollection *tpColl = iter->second.first;
|
||||
|
||||
if (tpColl) {
|
||||
|
@ -15,7 +15,7 @@ public:
|
||||
|
||||
virtual void ProcessValue() {};
|
||||
|
||||
const std::string &GetValueString() { return m_value; };
|
||||
const std::string &GetValueString() const { return m_value; };
|
||||
|
||||
protected:
|
||||
|
||||
|
@ -47,8 +47,8 @@ class WordsRange;
|
||||
class Phrase
|
||||
{
|
||||
friend std::ostream& operator<<(std::ostream&, const Phrase&);
|
||||
private:
|
||||
|
||||
// private:
|
||||
protected:
|
||||
std::vector<Word> m_words;
|
||||
|
||||
public:
|
||||
|
@ -494,7 +494,8 @@ bool StaticData::LoadData(Parameter *parameter)
|
||||
}
|
||||
m_xmlBrackets.first= brackets[0];
|
||||
m_xmlBrackets.second=brackets[1];
|
||||
cerr << "XML tags opening and closing brackets for XML input are: " << m_xmlBrackets.first << " and " << m_xmlBrackets.second << endl;
|
||||
VERBOSE(1,"XML tags opening and closing brackets for XML input are: "
|
||||
<< m_xmlBrackets.first << " and " << m_xmlBrackets.second << endl);
|
||||
}
|
||||
|
||||
if (m_parameter->GetParam("placeholder-factor").size() > 0) {
|
||||
@ -511,7 +512,7 @@ bool StaticData::LoadData(Parameter *parameter)
|
||||
const vector<string> &features = m_parameter->GetParam("feature");
|
||||
for (size_t i = 0; i < features.size(); ++i) {
|
||||
const string &line = Trim(features[i]);
|
||||
cerr << "line=" << line << endl;
|
||||
VERBOSE(1,"line=" << line << endl);
|
||||
if (line.empty())
|
||||
continue;
|
||||
|
||||
@ -535,7 +536,9 @@ bool StaticData::LoadData(Parameter *parameter)
|
||||
NoCache();
|
||||
OverrideFeatures();
|
||||
|
||||
LoadFeatureFunctions();
|
||||
if (!m_parameter->isParamSpecified("show-weights")) {
|
||||
LoadFeatureFunctions();
|
||||
}
|
||||
|
||||
if (!LoadDecodeGraphs()) return false;
|
||||
|
||||
@ -640,7 +643,8 @@ void StaticData::LoadNonTerminals()
|
||||
"Incorrect unknown LHS format: " << line);
|
||||
UnknownLHSEntry entry(tokens[0], Scan<float>(tokens[1]));
|
||||
m_unknownLHS.push_back(entry);
|
||||
const Factor *targetFactor = factorCollection.AddFactor(Output, 0, tokens[0], true);
|
||||
// const Factor *targetFactor =
|
||||
factorCollection.AddFactor(Output, 0, tokens[0], true);
|
||||
}
|
||||
|
||||
}
|
||||
@ -734,7 +738,7 @@ bool StaticData::LoadDecodeGraphs()
|
||||
DecodeGraph *decodeGraph;
|
||||
if (IsChart()) {
|
||||
size_t maxChartSpan = (decodeGraphInd < maxChartSpans.size()) ? maxChartSpans[decodeGraphInd] : DEFAULT_MAX_CHART_SPAN;
|
||||
cerr << "max-chart-span: " << maxChartSpans[decodeGraphInd] << endl;
|
||||
VERBOSE(1,"max-chart-span: " << maxChartSpans[decodeGraphInd] << endl);
|
||||
decodeGraph = new DecodeGraph(m_decodeGraphs.size(), maxChartSpan);
|
||||
} else {
|
||||
decodeGraph = new DecodeGraph(m_decodeGraphs.size());
|
||||
@ -866,7 +870,7 @@ void StaticData::SetExecPath(const std::string &path)
|
||||
if (pos != string::npos) {
|
||||
m_binPath = path.substr(0, pos);
|
||||
}
|
||||
cerr << m_binPath << endl;
|
||||
VERBOSE(1,m_binPath << endl);
|
||||
}
|
||||
|
||||
const string &StaticData::GetBinDirectory() const
|
||||
@ -920,7 +924,8 @@ void StaticData::LoadFeatureFunctions()
|
||||
FeatureFunction *ff = *iter;
|
||||
bool doLoad = true;
|
||||
|
||||
if (PhraseDictionary *ffCast = dynamic_cast<PhraseDictionary*>(ff)) {
|
||||
// if (PhraseDictionary *ffCast = dynamic_cast<PhraseDictionary*>(ff)) {
|
||||
if (dynamic_cast<PhraseDictionary*>(ff)) {
|
||||
doLoad = false;
|
||||
}
|
||||
|
||||
@ -964,7 +969,7 @@ bool StaticData::CheckWeights() const
|
||||
set<string>::iterator iter;
|
||||
for (iter = weightNames.begin(); iter != weightNames.end(); ) {
|
||||
string fname = (*iter).substr(0, (*iter).find("_"));
|
||||
cerr << fname << "\n";
|
||||
VERBOSE(1,fname << "\n");
|
||||
if (featureNames.find(fname) != featureNames.end()) {
|
||||
weightNames.erase(iter++);
|
||||
}
|
||||
@ -1039,7 +1044,7 @@ bool StaticData::LoadAlternateWeightSettings()
|
||||
vector<string> tokens = Tokenize(weightSpecification[i]);
|
||||
vector<string> args = Tokenize(tokens[0], "=");
|
||||
currentId = args[1];
|
||||
cerr << "alternate weight setting " << currentId << endl;
|
||||
VERBOSE(1,"alternate weight setting " << currentId << endl);
|
||||
UTIL_THROW_IF2(m_weightSetting.find(currentId) != m_weightSetting.end(),
|
||||
"Duplicate alternate weight id: " << currentId);
|
||||
m_weightSetting[ currentId ] = new ScoreComponentCollection;
|
||||
|
@ -44,6 +44,12 @@ public:
|
||||
typedef CollType::iterator iterator;
|
||||
typedef CollType::const_iterator const_iterator;
|
||||
|
||||
TargetPhrase const*
|
||||
operator[](size_t const i) const
|
||||
{
|
||||
return m_collection.at(i);
|
||||
}
|
||||
|
||||
iterator begin() {
|
||||
return m_collection.begin();
|
||||
}
|
||||
|
@ -17,12 +17,8 @@ License along with this library; if not, write to the Free Software
|
||||
Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
|
||||
***********************************************************************/
|
||||
#include "util/exception.hh"
|
||||
|
||||
#include "moses/TranslationModel/PhraseDictionaryMultiModelCounts.h"
|
||||
|
||||
#define LINE_MAX_LENGTH 100000
|
||||
#include "phrase-extract/SafeGetline.h" // for SAFE_GETLINE()
|
||||
|
||||
using namespace std;
|
||||
|
||||
template<typename T>
|
||||
@ -461,16 +457,14 @@ void PhraseDictionaryMultiModelCounts::LoadLexicalTable( string &fileName, lexic
|
||||
}
|
||||
istream *inFileP = &inFile;
|
||||
|
||||
char line[LINE_MAX_LENGTH];
|
||||
|
||||
int i=0;
|
||||
while(true) {
|
||||
string line;
|
||||
|
||||
while(getline(*inFileP, line)) {
|
||||
i++;
|
||||
if (i%100000 == 0) cerr << "." << flush;
|
||||
SAFE_GETLINE((*inFileP), line, LINE_MAX_LENGTH, '\n', __FILE__);
|
||||
if (inFileP->eof()) break;
|
||||
|
||||
vector<string> token = tokenize( line );
|
||||
vector<string> token = tokenize( line.c_str() );
|
||||
if (token.size() != 4) {
|
||||
cerr << "line " << i << " in " << fileName
|
||||
<< " has wrong number of tokens, skipping:\n"
|
||||
|
@ -9,6 +9,17 @@ $(TOP)/moses/TranslationModel/UG//mmsapt
|
||||
$(TOP)/util//kenutil
|
||||
;
|
||||
|
||||
exe lookup_mmsapt :
|
||||
lookup_mmsapt.cc
|
||||
$(TOP)/moses//moses
|
||||
$(TOP)/moses/TranslationModel/UG/generic//generic
|
||||
$(TOP)//boost_iostreams
|
||||
$(TOP)//boost_program_options
|
||||
$(TOP)/moses/TranslationModel/UG/mm//mm
|
||||
$(TOP)/moses/TranslationModel/UG//mmsapt
|
||||
$(TOP)/util//kenutil
|
||||
;
|
||||
|
||||
install $(PREFIX)/bin : try-align ;
|
||||
|
||||
fakelib mmsapt : [ glob *.cpp mmsapt*.cc ] ;
|
||||
|
76
moses/TranslationModel/UG/lookup_mmsapt.cc
Normal file
@ -0,0 +1,76 @@
|
||||
#include "mmsapt.h"
|
||||
#include <boost/foreach.hpp>
|
||||
#include <boost/tokenizer.hpp>
|
||||
#include <boost/shared_ptr.hpp>
|
||||
#include <algorithm>
|
||||
#include <iostream>
|
||||
|
||||
using namespace Moses;
|
||||
using namespace bitext;
|
||||
using namespace std;
|
||||
using namespace boost;
|
||||
|
||||
vector<FactorType> fo(1,FactorType(0));
|
||||
|
||||
class SimplePhrase : public Moses::Phrase
|
||||
{
|
||||
vector<FactorType> const m_fo; // factor order
|
||||
public:
|
||||
SimplePhrase(): m_fo(1,FactorType(0)) {}
|
||||
|
||||
void init(string const& s)
|
||||
{
|
||||
istringstream buf(s); string w;
|
||||
while (buf >> w)
|
||||
{
|
||||
Word wrd;
|
||||
this->AddWord().CreateFromString(Input,m_fo,StringPiece(w),false,false);
|
||||
}
|
||||
}
|
||||
};
|
||||
|
||||
class TargetPhraseIndexSorter
|
||||
{
|
||||
TargetPhraseCollection const& my_tpc;
|
||||
CompareTargetPhrase cmp;
|
||||
public:
|
||||
TargetPhraseIndexSorter(TargetPhraseCollection const& tpc) : my_tpc(tpc) {}
|
||||
bool operator()(size_t a, size_t b) const
|
||||
{
|
||||
return cmp(*my_tpc[a], *my_tpc[b]);
|
||||
}
|
||||
};
|
||||
|
||||
int main(int argc, char* argv[])
|
||||
{
|
||||
Parameter params;
|
||||
if (!params.LoadParam(argc,argv) || !StaticData::LoadDataStatic(&params, argv[0]))
|
||||
exit(1);
|
||||
|
||||
Mmsapt* PT;
|
||||
BOOST_FOREACH(PhraseDictionary* pd, PhraseDictionary::GetColl())
|
||||
if ((PT = dynamic_cast<Mmsapt*>(pd))) break;
|
||||
|
||||
string line;
|
||||
while (getline(cin,line))
|
||||
{
|
||||
SimplePhrase p; p.init(line);
|
||||
cout << p << endl;
|
||||
TargetPhraseCollection const* trg = PT->GetTargetPhraseCollectionLEGACY(p);
|
||||
if (!trg) continue;
|
||||
vector<size_t> order(trg->GetSize());
|
||||
for (size_t i = 0; i < order.size(); ++i) order[i] = i;
|
||||
sort(order.begin(),order.end(),TargetPhraseIndexSorter(*trg));
|
||||
size_t k = 0;
|
||||
BOOST_FOREACH(size_t i, order)
|
||||
{
|
||||
Phrase const& phr = static_cast<Phrase const&>(*(*trg)[i]);
|
||||
cout << setw(3) << ++k << " " << phr << endl;
|
||||
}
|
||||
PT->Release(trg);
|
||||
}
|
||||
exit(0);
|
||||
}
|
||||
|
||||
|
||||
|
@ -131,7 +131,7 @@ interpret_args(int ac, char* av[])
|
||||
o.add_options()
|
||||
("help,h", "print this message")
|
||||
("source,s",po::value<string>(&swrd),"source word")
|
||||
("target,t",po::value<string>(&swrd),"target word")
|
||||
("target,t",po::value<string>(&twrd),"target word")
|
||||
;
|
||||
|
||||
h.add_options()
|
||||
|
@ -318,10 +318,10 @@ namespace Moses {
|
||||
assert(pp.sample1);
|
||||
assert(pp.joint);
|
||||
assert(pp.raw2);
|
||||
(*dest)[i] = log(pp.raw1);
|
||||
(*dest)[++i] = log(pp.sample1);
|
||||
(*dest)[++i] = log(pp.joint);
|
||||
(*dest)[++i] = log(pp.raw2);
|
||||
(*dest)[i] = -log(pp.raw1);
|
||||
(*dest)[++i] = -log(pp.sample1);
|
||||
(*dest)[++i] = +log(pp.joint);
|
||||
(*dest)[++i] = -log(pp.raw2);
|
||||
}
|
||||
};
|
||||
|
||||
@ -590,8 +590,9 @@ namespace Moses {
|
||||
static ThreadSafeCounter active;
|
||||
boost::mutex lock;
|
||||
friend class agenda;
|
||||
boost::taus88 rnd; // every job has its own pseudo random generator
|
||||
double rnddenom; // denominator for scaling random sampling
|
||||
boost::taus88 rnd; // every job has its own pseudo random generator
|
||||
double rnddenom; // denominator for scaling random sampling
|
||||
size_t min_diverse; // minimum number of distinct translations
|
||||
public:
|
||||
size_t workers; // how many workers are working on this job?
|
||||
sptr<TSA<Token> const> root; // root of the underlying suffix array
|
||||
@ -644,34 +645,47 @@ namespace Moses {
|
||||
step(uint64_t & sid, uint64_t & offset)
|
||||
{
|
||||
boost::lock_guard<boost::mutex> jguard(lock);
|
||||
if ((max_samples == 0) && (next < stop))
|
||||
bool ret = (max_samples == 0) && (next < stop);
|
||||
if (ret)
|
||||
{
|
||||
next = root->readSid(next,stop,sid);
|
||||
next = root->readOffset(next,stop,offset);
|
||||
boost::lock_guard<boost::mutex> sguard(stats->lock);
|
||||
if (stats->raw_cnt == ctr) ++stats->raw_cnt;
|
||||
stats->sample_cnt++;
|
||||
return true;
|
||||
}
|
||||
else
|
||||
{
|
||||
while (next < stop && stats->good < max_samples)
|
||||
while (next < stop && (stats->good < max_samples ||
|
||||
stats->trg.size() < min_diverse))
|
||||
{
|
||||
next = root->readSid(next,stop,sid);
|
||||
next = root->readOffset(next,stop,offset);
|
||||
{
|
||||
boost::lock_guard<boost::mutex> sguard(stats->lock);
|
||||
{ // brackets required for lock scoping; see sguard immediately below
|
||||
boost::lock_guard<boost::mutex> sguard(stats->lock);
|
||||
if (stats->raw_cnt == ctr) ++stats->raw_cnt;
|
||||
size_t rnum = (stats->raw_cnt - ctr++)*(rnd()/(rnd.max()+1.));
|
||||
size_t scalefac = (stats->raw_cnt - ctr++);
|
||||
size_t rnum = scalefac*(rnd()/(rnd.max()+1.));
|
||||
#if 0
|
||||
cerr << rnum << "/" << scalefac << " vs. "
|
||||
<< max_samples - stats->good << " ("
|
||||
<< max_samples << " - " << stats->good << ")"
|
||||
<< endl;
|
||||
#endif
|
||||
if (rnum < max_samples - stats->good)
|
||||
{
|
||||
stats->sample_cnt++;
|
||||
return true;
|
||||
ret = true;
|
||||
break;
|
||||
}
|
||||
}
|
||||
}
|
||||
return false;
|
||||
}
|
||||
|
||||
// boost::lock_guard<boost::mutex> sguard(stats->lock);
|
||||
// abuse of lock for clean output to cerr
|
||||
// cerr << stats->sample_cnt++;
|
||||
return ret;
|
||||
}
|
||||
|
||||
template<typename Token>
|
||||
@ -713,6 +727,13 @@ namespace Moses {
|
||||
worker::
|
||||
operator()()
|
||||
{
|
||||
// things to do:
|
||||
// - have each worker maintain their own pstats object and merge results at the end;
|
||||
// - ensure the minimum size of samples considered by a non-locked counter that is only
|
||||
// ever incremented -- who cares if we look at more samples than required, as long
|
||||
// as we look at at least the minimum required
|
||||
// This way, we can reduce the number of lock / unlock operations we need to do during
|
||||
// sampling.
|
||||
size_t s1=0, s2=0, e1=0, e2=0;
|
||||
uint64_t sid=0, offset=0; // of the source phrase
|
||||
while(sptr<job> j = ag.get_job())
|
||||
@ -812,6 +833,7 @@ namespace Moses {
|
||||
sptr<TSA<Token> > const& r, size_t maxsmpl, bool isfwd)
|
||||
: rnd(0)
|
||||
, rnddenom(rnd.max() + 1.)
|
||||
, min_diverse(10)
|
||||
, workers(0)
|
||||
, root(r)
|
||||
, next(m.lower_bound(-1))
|
||||
|
@ -122,16 +122,16 @@ namespace Moses
|
||||
if (m != param.end())
|
||||
withPbwd = m->second != "0";
|
||||
|
||||
m_default_sample_size = m != param.end() ? atoi(m->second.c_str()) : 1000;
|
||||
|
||||
m = param.find("workers");
|
||||
m_workers = m != param.end() ? atoi(m->second.c_str()) : 8;
|
||||
m_workers = min(m_workers,24UL);
|
||||
|
||||
m = param.find("limit");
|
||||
if (m != param.end()) m_tableLimit = atoi(m->second.c_str());
|
||||
|
||||
m = param.find("cache-size");
|
||||
m_history.reserve(m != param.end()
|
||||
? max(1000,atoi(m->second.c_str()))
|
||||
: 10000);
|
||||
m_history.reserve(m != param.end()?max(1000,atoi(m->second.c_str())):10000);
|
||||
// in plain language: cache size is at least 1000, and 10,000 by default
|
||||
|
||||
this->m_numScoreComponents = atoi(param["num-features"].c_str());
|
||||
|
||||
@ -196,8 +196,8 @@ namespace Moses
|
||||
// currently always active by default; may (should) change later
|
||||
num_feats = calc_lex.init(num_feats, bname + L1 + "-" + L2 + ".lex");
|
||||
|
||||
if (this->m_numScoreComponents%2) // a bit of a hack, for backwards compatibility
|
||||
num_feats = apply_pp.init(num_feats);
|
||||
// if (this->m_numScoreComponents%2) // a bit of a hack, for backwards compatibility
|
||||
// num_feats = apply_pp.init(num_feats);
|
||||
|
||||
if (num_feats < this->m_numScoreComponents)
|
||||
{
|
||||
@ -283,8 +283,8 @@ namespace Moses
|
||||
{
|
||||
PhrasePair pp;
|
||||
pp.init(pid1, stats, this->m_numScoreComponents);
|
||||
if (this->m_numScoreComponents%2)
|
||||
apply_pp(bt,pp);
|
||||
// if (this->m_numScoreComponents%2)
|
||||
// apply_pp(bt,pp);
|
||||
pstats::trg_map_t::const_iterator t;
|
||||
for (t = stats.trg.begin(); t != stats.trg.end(); ++t)
|
||||
{
|
||||
@ -318,8 +318,8 @@ namespace Moses
|
||||
pp.init(pid1b, *statsb, this->m_numScoreComponents);
|
||||
else return false; // throw "no stats for pooling available!";
|
||||
|
||||
if (this->m_numScoreComponents%2)
|
||||
apply_pp(bta,pp);
|
||||
// if (this->m_numScoreComponents%2)
|
||||
// apply_pp(bta,pp);
|
||||
pstats::trg_map_t::const_iterator b;
|
||||
pstats::trg_map_t::iterator a;
|
||||
if (statsb)
|
||||
@ -368,6 +368,13 @@ namespace Moses
|
||||
}
|
||||
else
|
||||
pp.update(a->first,a->second);
|
||||
#if 0
|
||||
// jstats const& j = a->second;
|
||||
cerr << bta.T1->pid2str(bta.V1.get(),pp.p1) << " ::: "
|
||||
<< bta.T2->pid2str(bta.V2.get(),pp.p2) << endl;
|
||||
cerr << pp.raw1 << " " << pp.sample1 << " " << pp.good1 << " "
|
||||
<< pp.joint << " " << pp.raw2 << endl;
|
||||
#endif
|
||||
|
||||
UTIL_THROW_IF2(pp.raw2 == 0,
|
||||
"OOPS"
|
||||
@ -376,12 +383,6 @@ namespace Moses
|
||||
<< pp.raw1 << " " << pp.sample1 << " "
|
||||
<< pp.good1 << " " << pp.joint << " "
|
||||
<< pp.raw2);
|
||||
#if 0
|
||||
jstats const& j = a->second;
|
||||
cerr << bta.T1->pid2str(bta.V1.get(),pp.p1) << " ::: "
|
||||
<< bta.T2->pid2str(bta.V2.get(),pp.p2) << endl;
|
||||
cerr << j.rcnt() << " " << j.cnt2() << " " << j.wcnt() << endl;
|
||||
#endif
|
||||
calc_lex(bta,pp);
|
||||
if (withPfwd) calc_pfwd_fix(bta,pp);
|
||||
if (withPbwd) calc_pbwd_fix(bta,pp);
|
||||
@ -415,8 +416,8 @@ namespace Moses
|
||||
if (statsb)
|
||||
{
|
||||
pool.init(pid1b,*statsb,0);
|
||||
if (this->m_numScoreComponents%2)
|
||||
apply_pp(btb,ppdyn);
|
||||
// if (this->m_numScoreComponents%2)
|
||||
// apply_pp(btb,ppdyn);
|
||||
for (b = statsb->trg.begin(); b != statsb->trg.end(); ++b)
|
||||
{
|
||||
ppdyn.update(b->first,b->second);
|
||||
@ -456,8 +457,8 @@ namespace Moses
|
||||
if (statsa)
|
||||
{
|
||||
pool.init(pid1a,*statsa,0);
|
||||
if (this->m_numScoreComponents%2)
|
||||
apply_pp(bta,ppfix);
|
||||
// if (this->m_numScoreComponents%2)
|
||||
// apply_pp(bta,ppfix);
|
||||
for (a = statsa->trg.begin(); a != statsa->trg.end(); ++a)
|
||||
{
|
||||
if (!a->second.valid()) continue; // done above
|
||||
@ -662,7 +663,7 @@ namespace Moses
|
||||
|| combine_pstats(src, mfix.getPid(),sfix.get(),btfix,
|
||||
mdyn.getPid(),sdyn.get(),*dyn,ret))
|
||||
{
|
||||
ret->NthElement(m_tableLimit);
|
||||
if (m_tableLimit) ret->Prune(true,m_tableLimit);
|
||||
#if 0
|
||||
sort(ret->begin(), ret->end(), CompareTargetPhrase());
|
||||
cout << "SOURCE PHRASE: " << src << endl;
|
||||
@ -683,6 +684,14 @@ namespace Moses
|
||||
return encache(ret);
|
||||
}
|
||||
|
||||
size_t
|
||||
Mmsapt::
|
||||
SetTableLimit(size_t limit)
|
||||
{
|
||||
std::swap(m_tableLimit,limit);
|
||||
return limit;
|
||||
}
|
||||
|
||||
void
|
||||
Mmsapt::
|
||||
CleanUpAfterSentenceProcessing(const InputType& source)
|
||||
|
@ -71,7 +71,7 @@ namespace Moses
|
||||
PScorePfwd<Token> calc_pfwd_fix, calc_pfwd_dyn;
|
||||
PScorePbwd<Token> calc_pbwd_fix, calc_pbwd_dyn;
|
||||
PScoreLex<Token> calc_lex; // this one I'd like to see as an external ff eventually
|
||||
PScorePP<Token> apply_pp; // apply phrase penalty
|
||||
// PScorePP<Token> apply_pp; // apply phrase penalty
|
||||
PScoreLogCounts<Token> add_logcounts_fix;
|
||||
PScoreLogCounts<Token> add_logcounts_dyn;
|
||||
void init(string const& line);
|
||||
@ -168,6 +168,9 @@ namespace Moses
|
||||
void
|
||||
Load();
|
||||
|
||||
// returns the prior table limit
|
||||
size_t SetTableLimit(size_t limit);
|
||||
|
||||
#ifndef NO_MOSES
|
||||
TargetPhraseCollection const*
|
||||
GetTargetPhraseCollectionLEGACY(const Phrase& src) const;
|
||||
|
@ -413,11 +413,9 @@ void FuzzyMatchWrapper::load_corpus( const std::string &fileName, vector< vector
|
||||
|
||||
istream *fileStreamP = &fileStream;
|
||||
|
||||
char line[LINE_MAX_LENGTH];
|
||||
while(true) {
|
||||
SAFE_GETLINE((*fileStreamP), line, LINE_MAX_LENGTH, '\n');
|
||||
if (fileStreamP->eof()) break;
|
||||
corpus.push_back( GetVocabulary().Tokenize( line ) );
|
||||
string line;
|
||||
while(getline(*fileStreamP, line)) {
|
||||
corpus.push_back( GetVocabulary().Tokenize( line.c_str() ) );
|
||||
}
|
||||
}
|
||||
|
||||
@ -436,12 +434,9 @@ void FuzzyMatchWrapper::load_target(const std::string &fileName, vector< vector<
|
||||
WORD_ID delimiter = GetVocabulary().StoreIfNew("|||");
|
||||
|
||||
int lineNum = 0;
|
||||
char line[LINE_MAX_LENGTH];
|
||||
while(true) {
|
||||
SAFE_GETLINE((*fileStreamP), line, LINE_MAX_LENGTH, '\n');
|
||||
if (fileStreamP->eof()) break;
|
||||
|
||||
vector<WORD_ID> toks = GetVocabulary().Tokenize( line );
|
||||
string line;
|
||||
while(getline(*fileStreamP, line)) {
|
||||
vector<WORD_ID> toks = GetVocabulary().Tokenize( line.c_str() );
|
||||
|
||||
corpus.push_back(vector< SentenceAlignment >());
|
||||
vector< SentenceAlignment > &vec = corpus.back();
|
||||
@ -493,11 +488,8 @@ void FuzzyMatchWrapper::load_alignment(const std::string &fileName, vector< vect
|
||||
string delimiter = "|||";
|
||||
|
||||
int lineNum = 0;
|
||||
char line[LINE_MAX_LENGTH];
|
||||
while(true) {
|
||||
SAFE_GETLINE((*fileStreamP), line, LINE_MAX_LENGTH, '\n');
|
||||
if (fileStreamP->eof()) break;
|
||||
|
||||
string line;
|
||||
while(getline(*fileStreamP, line)) {
|
||||
vector< SentenceAlignment > &vec = corpus[lineNum];
|
||||
size_t targetInd = 0;
|
||||
SentenceAlignment *sentence = &vec[targetInd];
|
||||
|
@ -14,17 +14,16 @@ SuffixArray::SuffixArray( string fileName )
|
||||
m_endOfSentence = m_vcb.StoreIfNew( "<s>" );
|
||||
|
||||
ifstream extractFile;
|
||||
char line[LINE_MAX_LENGTH];
|
||||
|
||||
// count the number of words first;
|
||||
extractFile.open(fileName.c_str());
|
||||
istream *fileP = &extractFile;
|
||||
m_size = 0;
|
||||
size_t sentenceCount = 0;
|
||||
while(!fileP->eof()) {
|
||||
SAFE_GETLINE((*fileP), line, LINE_MAX_LENGTH, '\n');
|
||||
if (fileP->eof()) break;
|
||||
vector< WORD_ID > words = m_vcb.Tokenize( line );
|
||||
string line;
|
||||
while(getline(*fileP, line)) {
|
||||
|
||||
vector< WORD_ID > words = m_vcb.Tokenize( line.c_str() );
|
||||
m_size += words.size() + 1;
|
||||
sentenceCount++;
|
||||
}
|
||||
@ -43,10 +42,8 @@ SuffixArray::SuffixArray( string fileName )
|
||||
int sentenceId = 0;
|
||||
extractFile.open(fileName.c_str());
|
||||
fileP = &extractFile;
|
||||
while(!fileP->eof()) {
|
||||
SAFE_GETLINE((*fileP), line, LINE_MAX_LENGTH, '\n');
|
||||
if (fileP->eof()) break;
|
||||
vector< WORD_ID > words = m_vcb.Tokenize( line );
|
||||
while(getline(*fileP, line)) {
|
||||
vector< WORD_ID > words = m_vcb.Tokenize( line.c_str() );
|
||||
|
||||
// add to corpus vector
|
||||
corpus.push_back(words);
|
||||
|
@ -17,20 +17,6 @@
|
||||
|
||||
namespace tmmt
|
||||
{
|
||||
|
||||
#define MAX_LENGTH 10000
|
||||
|
||||
#define SAFE_GETLINE(_IS, _LINE, _SIZE, _DELIM) { \
|
||||
_IS.getline(_LINE, _SIZE, _DELIM); \
|
||||
if(_IS.fail() && !_IS.bad() && !_IS.eof()) _IS.clear(); \
|
||||
if (_IS.gcount() == _SIZE-1) { \
|
||||
cerr << "Line too long! Buffer overflow. Delete lines >=" \
|
||||
<< _SIZE << " chars or raise MAX_LENGTH in phrase-extract/tables-core.cpp" \
|
||||
<< endl; \
|
||||
exit(1); \
|
||||
} \
|
||||
}
|
||||
|
||||
typedef std::string WORD;
|
||||
typedef unsigned int WORD_ID;
|
||||
|
||||
|
@ -2,9 +2,6 @@
|
||||
#include "ExtractionPhrasePair.h"
|
||||
#include "tables-core.h"
|
||||
#include "InputFileStream.h"
|
||||
#include "SafeGetline.h"
|
||||
|
||||
#define TABLE_LINE_MAX_LENGTH 1000
|
||||
|
||||
using namespace std;
|
||||
|
||||
@ -16,12 +13,11 @@ void Domain::load( const std::string &domainFileName )
|
||||
{
|
||||
Moses::InputFileStream fileS( domainFileName );
|
||||
istream *fileP = &fileS;
|
||||
while(true) {
|
||||
char line[TABLE_LINE_MAX_LENGTH];
|
||||
SAFE_GETLINE((*fileP), line, TABLE_LINE_MAX_LENGTH, '\n', __FILE__);
|
||||
if (fileP->eof()) break;
|
||||
|
||||
string line;
|
||||
while(getline(*fileP, line)) {
|
||||
// read
|
||||
vector< string > domainSpecLine = tokenize( line );
|
||||
vector< string > domainSpecLine = tokenize( line.c_str() );
|
||||
int lineNumber;
|
||||
if (domainSpecLine.size() != 2 ||
|
||||
! sscanf(domainSpecLine[0].c_str(), "%d", &lineNumber)) {
|
||||
|
@ -19,7 +19,6 @@
|
||||
|
||||
#include <sstream>
|
||||
#include "ExtractionPhrasePair.h"
|
||||
#include "SafeGetline.h"
|
||||
#include "tables-core.h"
|
||||
#include "score.h"
|
||||
#include "moses/Util.h"
|
||||
|
@ -1,35 +0,0 @@
|
||||
/***********************************************************************
|
||||
Moses - factored phrase-based language decoder
|
||||
Copyright (C) 2010 University of Edinburgh
|
||||
|
||||
This library is free software; you can redistribute it and/or
|
||||
modify it under the terms of the GNU Lesser General Public
|
||||
License as published by the Free Software Foundation; either
|
||||
version 2.1 of the License, or (at your option) any later version.
|
||||
|
||||
This library is distributed in the hope that it will be useful,
|
||||
but WITHOUT ANY WARRANTY; without even the implied warranty of
|
||||
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
|
||||
Lesser General Public License for more details.
|
||||
|
||||
You should have received a copy of the GNU Lesser General Public
|
||||
License along with this library; if not, write to the Free Software
|
||||
Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
|
||||
***********************************************************************/
|
||||
|
||||
#pragma once
|
||||
#ifndef SAFE_GETLINE_INCLUDED_
|
||||
#define SAFE_GETLINE_INCLUDED_
|
||||
|
||||
#define SAFE_GETLINE(_IS, _LINE, _SIZE, _DELIM, _FILE) { \
|
||||
_IS.getline(_LINE, _SIZE, _DELIM); \
|
||||
if(_IS.fail() && !_IS.bad() && !_IS.eof()) _IS.clear(); \
|
||||
if (_IS.gcount() == _SIZE-1) { \
|
||||
cerr << "Line too long! Buffer overflow. Delete lines >=" \
|
||||
<< _SIZE << " chars or raise LINE_MAX_LENGTH in " << _FILE \
|
||||
<< endl; \
|
||||
exit(1); \
|
||||
} \
|
||||
}
|
||||
|
||||
#endif
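Note on the change above: throughout this commit the fixed-size `SAFE_GETLINE` macro is replaced by `std::getline` reading into a `std::string`, which removes the line-length limit and the "Line too long" abort. The following is a minimal sketch of that pattern, not code from the commit; the helper name `readLines` is hypothetical and exists only for illustration.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of Moses): collects every line of a stream.
// std::getline grows the string as needed, so no LINE_MAX_LENGTH is required,
// and the loop condition turns false once no further line can be extracted.
std::vector<std::string> readLines(std::istream& in)
{
  std::vector<std::string> lines;
  std::string line;
  while (std::getline(in, line)) {
    lines.push_back(line);
  }
  return lines;
}
```

Callers that still need a C string, such as the `Tokenize( line.c_str() )` calls in this diff, can pass `line.c_str()`. One behavioral difference to keep in mind: the old loops broke out as soon as `eof()` was set after the read, so a final line without a trailing newline was skipped, whereas the `while (getline(...))` form still processes it.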
|
@ -54,7 +54,11 @@ bool SentenceAlignment::processSourceSentence(const char * sourceString, int, bo
|
||||
return true;
|
||||
}
|
||||
|
||||
bool SentenceAlignment::create( char targetString[], char sourceString[], char alignmentString[], char weightString[], int sentenceID, bool boundaryRules)
|
||||
bool SentenceAlignment::create(const char targetString[],
|
||||
const char sourceString[],
|
||||
const char alignmentString[],
|
||||
const char weightString[],
|
||||
int sentenceID, bool boundaryRules)
|
||||
{
|
||||
using namespace std;
|
||||
this->sentenceID = sentenceID;
|
||||
|
@ -43,8 +43,11 @@ public:
|
||||
|
||||
virtual bool processSourceSentence(const char *, int, bool boundaryRules);
|
||||
|
||||
bool create(char targetString[], char sourceString[],
|
||||
char alignmentString[], char weightString[], int sentenceID, bool boundaryRules);
|
||||
bool create(const char targetString[],
|
||||
const char sourceString[],
|
||||
const char alignmentString[],
|
||||
const char weightString[],
|
||||
int sentenceID, bool boundaryRules);
|
||||
|
||||
void invertAlignment();
|
||||
|
||||
|
@ -26,16 +26,9 @@
|
||||
#include "InputFileStream.h"
|
||||
#include "OutputFileStream.h"
|
||||
|
||||
#include "SafeGetline.h"
|
||||
|
||||
#define LINE_MAX_LENGTH 10000
|
||||
|
||||
using namespace std;
|
||||
|
||||
char line[LINE_MAX_LENGTH];
|
||||
|
||||
|
||||
vector< string > splitLine()
|
||||
vector< string > splitLine(const char *line)
|
||||
{
|
||||
vector< string > item;
|
||||
int start=0;
|
||||
@ -61,14 +54,15 @@ bool getLine( istream &fileP, vector< string > &item )
|
||||
{
|
||||
if (fileP.eof())
|
||||
return false;
|
||||
|
||||
SAFE_GETLINE((fileP), line, LINE_MAX_LENGTH, '\n', __FILE__);
|
||||
if (fileP.eof())
|
||||
|
||||
string line;
|
||||
if (getline(fileP, line)) {
|
||||
item = splitLine(line.c_str());
|
||||
return false;
|
||||
|
||||
item = splitLine();
|
||||
|
||||
return true;
|
||||
}
|
||||
else {
|
||||
return false;
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
|
@ -26,12 +26,9 @@
|
||||
#include <cstring>
|
||||
|
||||
#include "tables-core.h"
|
||||
#include "SafeGetline.h"
|
||||
#include "InputFileStream.h"
|
||||
#include "OutputFileStream.h"
|
||||
|
||||
#define LINE_MAX_LENGTH 10000
|
||||
|
||||
using namespace std;
|
||||
|
||||
bool hierarchicalFlag = false;
|
||||
@ -46,12 +43,11 @@ inline float maybeLogProb( float a )
|
||||
return logProbFlag ? log(a) : a;
|
||||
}
|
||||
|
||||
char line[LINE_MAX_LENGTH];
|
||||
void processFiles( char*, char*, char*, char* );
|
||||
void loadCountOfCounts( char* );
|
||||
void breakdownCoreAndSparse( string combined, string &core, string &sparse );
|
||||
bool getLine( istream &fileP, vector< string > &item );
|
||||
vector< string > splitLine();
|
||||
vector< string > splitLine(const char *line);
|
||||
vector< int > countBin;
|
||||
bool sparseCountBinFeatureFlag = false;
|
||||
|
||||
@ -140,14 +136,13 @@ void loadCountOfCounts( char* fileNameCountOfCounts )
|
||||
istream &fileP = fileCountOfCounts;
|
||||
|
||||
countOfCounts.push_back(0.0);
|
||||
while(1) {
|
||||
if (fileP.eof()) break;
|
||||
SAFE_GETLINE((fileP), line, LINE_MAX_LENGTH, '\n', __FILE__);
|
||||
if (fileP.eof()) break;
|
||||
|
||||
string line;
|
||||
while (getline(fileP, line)) {
|
||||
if (totalCount < 0)
|
||||
totalCount = atof(line); // total number of distinct phrase pairs
|
||||
totalCount = atof(line.c_str()); // total number of distinct phrase pairs
|
||||
else
|
||||
countOfCounts.push_back( atof(line) );
|
||||
countOfCounts.push_back( atof(line.c_str()) );
|
||||
}
|
||||
fileCountOfCounts.Close();
|
||||
|
||||
@ -370,16 +365,16 @@ bool getLine( istream &fileP, vector< string > &item )
|
||||
if (fileP.eof())
|
||||
return false;
|
||||
|
||||
SAFE_GETLINE((fileP), line, LINE_MAX_LENGTH, '\n', __FILE__);
|
||||
if (fileP.eof())
|
||||
string line;
|
||||
if (!getline(fileP, line))
|
||||
return false;
|
||||
|
||||
item = splitLine();
|
||||
item = splitLine(line.c_str());
|
||||
|
||||
return true;
|
||||
}
|
||||
|
||||
vector< string > splitLine()
|
||||
vector< string > splitLine(const char *line)
|
||||
{
|
||||
vector< string > item;
|
||||
int start=0;
|
||||
|
@ -27,23 +27,19 @@
|
||||
#include <cstring>
|
||||
|
||||
#include "tables-core.h"
|
||||
#include "SafeGetline.h"
|
||||
#include "InputFileStream.h"
|
||||
|
||||
#define LINE_MAX_LENGTH 10000
|
||||
|
||||
using namespace std;
|
||||
|
||||
bool hierarchicalFlag = false;
|
||||
bool onlyDirectFlag = false;
|
||||
bool phraseCountFlag = true;
|
||||
bool logProbFlag = false;
|
||||
char line[LINE_MAX_LENGTH];
|
||||
|
||||
void processFiles( char*, char*, char* );
|
||||
bool getLine( istream &fileP, vector< string > &item );
|
||||
string reverseAlignment(const string &alignments);
|
||||
vector< string > splitLine();
|
||||
vector< string > splitLine(const char *lin);
|
||||
|
||||
inline void Tokenize(std::vector<std::string> &output
|
||||
, const std::string& str
|
||||
@ -190,17 +186,18 @@ bool getLine( istream &fileP, vector< string > &item )
|
||||
{
|
||||
if (fileP.eof())
|
||||
return false;
|
||||
|
||||
SAFE_GETLINE((fileP), line, LINE_MAX_LENGTH, '\n', __FILE__);
|
||||
if (fileP.eof())
|
||||
|
||||
string line;
|
||||
if (getline(fileP, line)) {
|
||||
item = splitLine(line.c_str());
|
||||
return false;
|
||||
|
||||
item = splitLine();
|
||||
|
||||
return true;
|
||||
}
|
||||
else {
|
||||
return false;
|
||||
}
|
||||
}
|
||||
|
||||
vector< string > splitLine()
|
||||
vector< string > splitLine(const char *line)
|
||||
{
|
||||
vector< string > item;
|
||||
bool betweenWords = true;
|
||||
|
@ -19,7 +19,6 @@
|
||||
#include <set>
|
||||
#include <vector>
|
||||
|
||||
#include "SafeGetline.h"
|
||||
#include "SentenceAlignment.h"
|
||||
#include "tables-core.h"
|
||||
#include "InputFileStream.h"
|
||||
@ -32,10 +31,6 @@ using namespace MosesTraining;
|
||||
namespace MosesTraining
|
||||
{
|
||||
|
||||
|
||||
const long int LINE_MAX_LENGTH = 500000 ;
|
||||
|
||||
|
||||
// HPhraseVertex represents a point in the alignment matrix
|
||||
typedef pair <int, int> HPhraseVertex;
|
||||
|
||||
@ -277,20 +272,18 @@ int main(int argc, char* argv[])
|
||||
|
||||
int i = sentenceOffset;
|
||||
|
||||
while(true) {
|
||||
string englishString, foreignString, alignmentString, weightString;
|
||||
|
||||
while(getline(*eFileP, englishString)) {
|
||||
i++;
|
||||
if (i%10000 == 0) cerr << "." << flush;
|
||||
char englishString[LINE_MAX_LENGTH];
|
||||
char foreignString[LINE_MAX_LENGTH];
|
||||
char alignmentString[LINE_MAX_LENGTH];
|
||||
char weightString[LINE_MAX_LENGTH];
|
||||
SAFE_GETLINE((*eFileP), englishString, LINE_MAX_LENGTH, '\n', __FILE__);
|
||||
if (eFileP->eof()) break;
|
||||
SAFE_GETLINE((*fFileP), foreignString, LINE_MAX_LENGTH, '\n', __FILE__);
|
||||
SAFE_GETLINE((*aFileP), alignmentString, LINE_MAX_LENGTH, '\n', __FILE__);
|
||||
|
||||
getline(*fFileP, foreignString);
|
||||
getline(*aFileP, alignmentString);
|
||||
if (iwFileP) {
|
||||
SAFE_GETLINE((*iwFileP), weightString, LINE_MAX_LENGTH, '\n', __FILE__);
|
||||
getline(*iwFileP, weightString);
|
||||
}
|
||||
|
||||
SentenceAlignment sentence;
|
||||
// cout << "read in: " << englishString << " & " << foreignString << " & " << alignmentString << endl;
|
||||
//az: output src, tgt, and alignment line
|
||||
@ -300,7 +293,11 @@ int main(int argc, char* argv[])
|
||||
cout << "LOG: ALT: " << alignmentString << endl;
|
||||
cout << "LOG: PHRASES_BEGIN:" << endl;
|
||||
}
|
||||
if (sentence.create( englishString, foreignString, alignmentString, weightString, i, false)) {
|
||||
if (sentence.create( englishString.c_str(),
|
||||
foreignString.c_str(),
|
||||
alignmentString.c_str(),
|
||||
weightString.c_str(),
|
||||
i, false)) {
|
||||
if (options.placeholders.size()) {
|
||||
sentence.invertAlignment();
|
||||
}
|
||||
|
@ -19,7 +19,6 @@
|
||||
#include <set>
|
||||
#include <vector>
|
||||
|
||||
#include "SafeGetline.h"
|
||||
#include "SentenceAlignment.h"
|
||||
#include "tables-core.h"
|
||||
#include "InputFileStream.h"
|
||||
@ -32,10 +31,6 @@ using namespace MosesTraining;
|
||||
namespace MosesTraining
|
||||
{
|
||||
|
||||
|
||||
const long int LINE_MAX_LENGTH = 500000 ;
|
||||
|
||||
|
||||
// HPhraseVertex represents a point in the alignment matrix
|
||||
typedef pair <int, int> HPhraseVertex;
|
||||
|
||||
@ -246,20 +241,20 @@ int main(int argc, char* argv[])
|
||||
|
||||
int i = sentenceOffset;
|
||||
|
||||
while(true) {
|
||||
string englishString, foreignString, alignmentString, weightString;
|
||||
|
||||
while(getline(*eFileP, englishString)) {
|
||||
i++;
|
||||
if (i%10000 == 0) cerr << "." << flush;
|
||||
char englishString[LINE_MAX_LENGTH];
|
||||
char foreignString[LINE_MAX_LENGTH];
|
||||
char alignmentString[LINE_MAX_LENGTH];
|
||||
char weightString[LINE_MAX_LENGTH];
|
||||
SAFE_GETLINE((*eFileP), englishString, LINE_MAX_LENGTH, '\n', __FILE__);
|
||||
if (eFileP->eof()) break;
|
||||
SAFE_GETLINE((*fFileP), foreignString, LINE_MAX_LENGTH, '\n', __FILE__);
|
||||
SAFE_GETLINE((*aFileP), alignmentString, LINE_MAX_LENGTH, '\n', __FILE__);
|
||||
|
||||
getline(*fFileP, foreignString);
|
||||
getline(*aFileP, alignmentString);
|
||||
if (iwFileP) {
|
||||
SAFE_GETLINE((*iwFileP), weightString, LINE_MAX_LENGTH, '\n', __FILE__);
|
||||
getline(*iwFileP, weightString);
|
||||
}
|
||||
|
||||
if (i%10000 == 0) cerr << "." << flush;
|
||||
|
||||
SentenceAlignment sentence;
|
||||
// cout << "read in: " << englishString << " & " << foreignString << " & " << alignmentString << endl;
|
||||
//az: output src, tgt, and alignment line
|
||||
@ -269,7 +264,7 @@ int main(int argc, char* argv[])
|
||||
cout << "LOG: ALT: " << alignmentString << endl;
|
||||
cout << "LOG: PHRASES_BEGIN:" << endl;
|
||||
}
|
||||
if (sentence.create( englishString, foreignString, alignmentString, weightString, i, false)) {
|
||||
if (sentence.create( englishString.c_str(), foreignString.c_str(), alignmentString.c_str(), weightString.c_str(), i, false)) {
|
||||
ExtractTask *task = new ExtractTask(i-1, sentence, options, extractFileOrientation);
|
||||
task->Run();
|
||||
delete task;
|
||||
|
@ -39,7 +39,6 @@
|
||||
#include "Hole.h"
|
||||
#include "HoleCollection.h"
|
||||
#include "RuleExist.h"
|
||||
#include "SafeGetline.h"
|
||||
#include "SentenceAlignmentWithSyntax.h"
|
||||
#include "SyntaxTree.h"
|
||||
#include "tables-core.h"
|
||||
@ -47,8 +46,6 @@
|
||||
#include "InputFileStream.h"
|
||||
#include "OutputFileStream.h"
|
||||
|
||||
#define LINE_MAX_LENGTH 500000
|
||||
|
||||
using namespace std;
|
||||
using namespace MosesTraining;
|
||||
|
||||
@ -326,17 +323,15 @@ int main(int argc, char* argv[])
|
||||
|
||||
// loop through all sentence pairs
|
||||
size_t i=sentenceOffset;
|
||||
while(true) {
|
||||
i++;
|
||||
if (i%1000 == 0) cerr << i << " " << flush;
|
||||
string targetString, sourceString, alignmentString;
|
||||
|
||||
char targetString[LINE_MAX_LENGTH];
|
||||
char sourceString[LINE_MAX_LENGTH];
|
||||
char alignmentString[LINE_MAX_LENGTH];
|
||||
SAFE_GETLINE((*tFileP), targetString, LINE_MAX_LENGTH, '\n', __FILE__);
|
||||
if (tFileP->eof()) break;
|
||||
SAFE_GETLINE((*sFileP), sourceString, LINE_MAX_LENGTH, '\n', __FILE__);
|
||||
SAFE_GETLINE((*aFileP), alignmentString, LINE_MAX_LENGTH, '\n', __FILE__);
|
||||
while(getline(*tFileP, targetString)) {
|
||||
i++;
|
||||
|
||||
getline(*sFileP, sourceString);
|
||||
getline(*aFileP, alignmentString);
|
||||
|
||||
if (i%1000 == 0) cerr << i << " " << flush;
|
||||
|
||||
SentenceAlignmentWithSyntax sentence
|
||||
(targetLabelCollection, sourceLabelCollection,
|
||||
@ -349,7 +344,7 @@ int main(int argc, char* argv[])
|
||||
cout << "LOG: PHRASES_BEGIN:" << endl;
|
||||
}
|
||||
|
||||
if (sentence.create(targetString, sourceString, alignmentString,"", i, options.boundaryRules)) {
|
||||
if (sentence.create(targetString.c_str(), sourceString.c_str(), alignmentString.c_str(),"", i, options.boundaryRules)) {
|
||||
if (options.unknownWordLabelFlag) {
|
||||
collectWordLabelCounts(sentence);
|
||||
}
|
||||
|
@ -20,8 +20,6 @@
|
||||
***********************************************************************/
|
||||
|
||||
#include "relax-parse.h"
|
||||
|
||||
#include "SafeGetline.h"
|
||||
#include "tables-core.h"
|
||||
|
||||
using namespace std;
|
||||
@ -33,17 +31,13 @@ int main(int argc, char* argv[])
|
||||
|
||||
// loop through all sentences
|
||||
int i=0;
|
||||
char inBuffer[LINE_MAX_LENGTH];
|
||||
while(true) {
|
||||
string inBuffer;
|
||||
while(getline(cin, inBuffer)) {
|
||||
i++;
|
||||
if (i%1000 == 0) cerr << "." << flush;
|
||||
if (i%10000 == 0) cerr << ":" << flush;
|
||||
if (i%100000 == 0) cerr << "!" << flush;
|
||||
|
||||
// get line from stdin
|
||||
SAFE_GETLINE( cin, inBuffer, LINE_MAX_LENGTH, '\n', __FILE__);
|
||||
if (cin.eof()) break;
|
||||
|
||||
// process into syntax tree representation
|
||||
string inBufferString = string( inBuffer );
|
||||
set< string > labelCollection; // set of labels, not used
|
||||
|
@ -29,7 +29,6 @@
|
||||
#include <vector>
|
||||
#include <algorithm>
|
||||
|
||||
#include "SafeGetline.h"
|
||||
#include "ScoreFeature.h"
|
||||
#include "tables-core.h"
|
||||
#include "ExtractionPhrasePair.h"
|
||||
@ -40,8 +39,6 @@
|
||||
using namespace std;
|
||||
using namespace MosesTraining;
|
||||
|
||||
#define LINE_MAX_LENGTH 100000
|
||||
|
||||
namespace MosesTraining
|
||||
{
|
||||
LexicalTable lexTable;
|
||||
@ -232,7 +229,7 @@ int main(int argc, char* argv[])
|
||||
}
|
||||
|
||||
// loop through all extracted phrase translations
|
||||
char line[LINE_MAX_LENGTH], lastLine[LINE_MAX_LENGTH];
|
||||
string line, lastLine;
|
||||
lastLine[0] = '\0';
|
||||
ExtractionPhrasePair *phrasePair = NULL;
|
||||
std::vector< ExtractionPhrasePair* > phrasePairsWithSameSource;
|
||||
@ -245,8 +242,8 @@ int main(int argc, char* argv[])
|
||||
float tmpCount=0.0f, tmpPcfgSum=0.0f;
|
||||
|
||||
int i=0;
|
||||
SAFE_GETLINE( (extractFileP), line, LINE_MAX_LENGTH, '\n', __FILE__ );
|
||||
if ( !extractFileP.eof() ) {
|
||||
// TODO why read only the 1st line?
|
||||
if ( getline(extractFileP, line)) {
|
||||
++i;
|
||||
tmpPhraseSource = new PHRASE();
|
||||
tmpPhraseTarget = new PHRASE();
|
||||
@ -265,23 +262,21 @@ int main(int argc, char* argv[])
|
||||
if ( hierarchicalFlag ) {
|
||||
phrasePairsWithSameSourceAndTarget.push_back( phrasePair );
|
||||
}
|
||||
strcpy( lastLine, line );
|
||||
SAFE_GETLINE( (extractFileP), line, LINE_MAX_LENGTH, '\n', __FILE__ );
|
||||
lastLine = line;
|
||||
}
|
||||
|
||||
while ( !extractFileP.eof() ) {
|
||||
while ( getline(extractFileP, line) ) {
|
||||
|
||||
if ( ++i % 100000 == 0 ) {
|
||||
std::cerr << "." << std::flush;
|
||||
}
|
||||
|
||||
// identical to last line? just add count
|
||||
if (strcmp(line,lastLine) == 0) {
|
||||
if (line == lastLine) {
|
||||
phrasePair->IncrementPrevious(tmpCount,tmpPcfgSum);
|
||||
SAFE_GETLINE((extractFileP), line, LINE_MAX_LENGTH, '\n', __FILE__);
|
||||
continue;
|
||||
} else {
|
||||
strcpy( lastLine, line );
|
||||
lastLine = line;
|
||||
}
|
||||
|
||||
tmpPhraseSource = new PHRASE();
|
||||
@ -359,8 +354,6 @@ int main(int argc, char* argv[])
|
||||
}
|
||||
}
|
||||
|
||||
SAFE_GETLINE((extractFileP), line, LINE_MAX_LENGTH, '\n', __FILE__);
|
||||
|
||||
}
|
||||
|
||||
processPhrasePairs( phrasePairsWithSameSource, *phraseTableFile, featureManager, maybeLogProb );
|
||||
@ -750,11 +743,9 @@ void loadFunctionWords( const string &fileName )
|
||||
}
|
||||
istream *inFileP = &inFile;
|
||||
|
||||
char line[LINE_MAX_LENGTH];
|
||||
while(true) {
|
||||
SAFE_GETLINE((*inFileP), line, LINE_MAX_LENGTH, '\n', __FILE__);
|
||||
if (inFileP->eof()) break;
|
||||
std::vector<string> token = tokenize( line );
|
||||
string line;
|
||||
while(getline(*inFileP, line)) {
|
||||
std::vector<string> token = tokenize( line.c_str() );
|
||||
if (token.size() > 0)
|
||||
functionWordList.insert( token[0] );
|
||||
}
|
||||
@ -799,16 +790,13 @@ void LexicalTable::load( const string &fileName )
|
||||
}
|
||||
istream *inFileP = &inFile;
|
||||
|
||||
char line[LINE_MAX_LENGTH];
|
||||
|
||||
string line;
|
||||
int i=0;
|
||||
while(true) {
|
||||
while(getline(*inFileP, line)) {
|
||||
i++;
|
||||
if (i%100000 == 0) std::cerr << "." << flush;
|
||||
SAFE_GETLINE((*inFileP), line, LINE_MAX_LENGTH, '\n', __FILE__);
|
||||
if (inFileP->eof()) break;
|
||||
|
||||
std::vector<string> token = tokenize( line );
|
||||
std::vector<string> token = tokenize( line.c_str() );
|
||||
if (token.size() != 3) {
|
||||
std::cerr << "line " << i << " in " << fileName
|
||||
<< " has wrong number of tokens, skipping:" << std::endl
|
||||
|
@ -12,15 +12,12 @@
|
||||
#include <time.h>
|
||||
|
||||
#include "AlignmentPhrase.h"
|
||||
#include "SafeGetline.h"
|
||||
#include "tables-core.h"
|
||||
#include "InputFileStream.h"
|
||||
|
||||
using namespace std;
|
||||
using namespace MosesTraining;
|
||||
|
||||
#define LINE_MAX_LENGTH 10000
|
||||
|
||||
namespace MosesTraining
|
||||
{
|
||||
|
||||
@ -31,7 +28,7 @@ public:
|
||||
vector< vector<size_t> > alignedToE;
|
||||
vector< vector<size_t> > alignedToF;
|
||||
|
||||
bool create( char*, int );
|
||||
bool create( const char*, int );
|
||||
void clear();
|
||||
bool equals( const PhraseAlignment& );
|
||||
};
|
||||
@ -106,16 +103,14 @@ int main(int argc, char* argv[])
|
||||
vector< PhraseAlignment > phrasePairsWithSameF;
|
||||
int i=0;
|
||||
int fileCount = 0;
|
||||
while(true) {
|
||||
|
||||
string line;
|
||||
while(getline(extractFileP, line)) {
|
||||
if (extractFileP.eof()) break;
|
||||
if (++i % 100000 == 0) cerr << "." << flush;
|
||||
char line[LINE_MAX_LENGTH];
|
||||
SAFE_GETLINE((extractFileP), line, LINE_MAX_LENGTH, '\n', __FILE__);
|
||||
// if (fileCount>0)
|
||||
if (extractFileP.eof())
|
||||
break;
|
||||
|
||||
PhraseAlignment phrasePair;
|
||||
bool isPhrasePair = phrasePair.create( line, i );
|
||||
bool isPhrasePair = phrasePair.create( line.c_str(), i );
|
||||
if (lastForeign >= 0 && lastForeign != phrasePair.foreign) {
|
||||
processPhrasePairs( phrasePairsWithSameF );
|
||||
for(size_t j=0; j<phrasePairsWithSameF.size(); j++)
|
||||
@ -124,7 +119,7 @@ int main(int argc, char* argv[])
|
||||
phraseTableE.clear();
|
||||
phraseTableF.clear();
|
||||
phrasePair.clear(); // process line again, since phrase tables flushed
|
||||
phrasePair.create( line, i );
|
||||
phrasePair.create( line.c_str(), i );
|
||||
phrasePairBase = 0;
|
||||
}
|
||||
lastForeign = phrasePair.foreign;
|
||||
@ -242,7 +237,7 @@ void processPhrasePairs( vector< PhraseAlignment > &phrasePair )
|
||||
}
|
||||
}
|
||||
|
||||
bool PhraseAlignment::create( char line[], int lineID )
|
||||
bool PhraseAlignment::create(const char line[], int lineID )
|
||||
{
|
||||
vector< string > token = tokenize( line );
|
||||
int item = 1;
|
||||
@ -321,16 +316,14 @@ void LexicalTable::load( const string &filePath )
|
||||
}
|
||||
istream *inFileP = &inFile;
|
||||
|
||||
char line[LINE_MAX_LENGTH];
|
||||
string line;
|
||||
|
||||
int i=0;
|
||||
while(true) {
|
||||
while(getline(*inFileP, line)) {
|
||||
i++;
|
||||
if (i%100000 == 0) cerr << "." << flush;
|
||||
SAFE_GETLINE((*inFileP), line, LINE_MAX_LENGTH, '\n', __FILE__);
|
||||
if (inFileP->eof()) break;
|
||||
|
||||
vector<string> token = tokenize( line );
|
||||
vector<string> token = tokenize( line.c_str() );
|
||||
if (token.size() != 3) {
|
||||
cerr << "line " << i << " in " << filePath << " has wrong number of tokens, skipping:\n" <<
|
||||
token.size() << " " << token[0] << " " << line << endl;
|
||||
|
188
scripts/training/wrappers/conll2mosesxml.py
Executable file
@ -0,0 +1,188 @@
|
||||
#!/usr/bin/python
|
||||
# -*- coding: utf-8 -*-
|
||||
# Author: Rico Sennrich
|
||||
|
||||
# takes a file in the CoNLL dependency format (from the CoNLL-X shared task on dependency parsing; http://ilk.uvt.nl/conll/#dataformat )
|
||||
# and produces Moses XML format. Note that the structure is built based on fields 9 and 10 (projective HEAD and RELATION),
|
||||
# which not all parsers produce.
|
||||
|
||||
# usage: conll2mosesxml.py [--brackets] < input_file > output_file
|
||||
|
||||
from __future__ import print_function, unicode_literals
|
||||
import sys
|
||||
import re
|
||||
import codecs
|
||||
from collections import namedtuple,defaultdict
|
||||
from lxml import etree as ET
|
||||
|
||||
|
||||
Word = namedtuple('Word', ['pos','word','lemma','tag','head','func', 'proj_head', 'proj_func'])
|
||||
|
||||
def main(output_format='xml'):
|
||||
sentence = []
|
||||
|
||||
for line in sys.stdin:
|
||||
|
||||
# process sentence
|
||||
if line == "\n":
|
||||
sentence.insert(0,[])
|
||||
if is_projective(sentence):
|
||||
write(sentence,output_format)
|
||||
else:
|
||||
sys.stderr.write(' '.join(w.word for w in sentence[1:]) + '\n')
|
||||
sys.stdout.write('\n')
|
||||
sentence = []
|
||||
continue
|
||||
|
||||
try:
|
||||
pos, word, lemma, tag, tag2, morph, head, func, proj_head, proj_func = line.split()
|
||||
except ValueError: # word may be unicode whitespace
|
||||
pos, word, lemma, tag, tag2, morph, head, func, proj_head, proj_func = re.split(' *\t*',line.strip())
|
||||
|
||||
word = escape_special_chars(word)
|
||||
lemma = escape_special_chars(lemma)
|
||||
|
||||
if proj_head == '_':
|
||||
proj_head = head
|
||||
proj_func = func
|
||||
|
||||
sentence.append(Word(int(pos), word, lemma, tag2,int(head), func, int(proj_head), proj_func))
|
||||
|
||||
|
||||
# this script performs the same escaping as escape-special-chars.perl in Moses.
|
||||
# most of it is done in function write(), but quotation marks need to be processed first
|
||||
def escape_special_chars(line):
|
||||
|
||||
line = line.replace('\'','&apos;') # xml
|
||||
line = line.replace('"','&quot;') # xml
|
||||
|
||||
return line
|
||||
|
||||
|
||||
# make a check if structure is projective
|
||||
def is_projective(sentence):
|
||||
dominates = defaultdict(set)
|
||||
for i,w in enumerate(sentence):
|
||||
dominates[i].add(i)
|
||||
if not i:
|
||||
continue
|
||||
head = int(w.proj_head)
|
||||
while head != 0:
|
||||
if i in dominates[head]:
|
||||
break
|
||||
dominates[head].add(i)
|
||||
head = int(sentence[head].proj_head)
|
||||
|
||||
for i in dominates:
|
||||
dependents = dominates[i]
|
||||
if max(dependents) - min(dependents) != len(dependents)-1:
|
||||
sys.stderr.write("error: non-projective structure.\n")
|
||||
return False
|
||||
return True
|
||||
|
||||
|
||||
def write(sentence, output_format='xml'):
|
||||
|
||||
if output_format == 'xml':
|
||||
tree = create_subtree(0,sentence)
|
||||
out = ET.tostring(tree, encoding = 'UTF-8').decode('UTF-8')
|
||||
|
||||
if output_format == 'brackets':
|
||||
out = create_brackets(0,sentence)
|
||||
|
||||
out = out.replace('|','&#124;') # factor separator
|
||||
out = out.replace('[','&#91;') # syntax non-terminal
|
||||
out = out.replace(']','&#93;') # syntax non-terminal
|
||||
|
||||
out = out.replace('&amp;apos;','&apos;') # lxml is buggy if input is escaped
|
||||
out = out.replace('&amp;quot;','&quot;') # lxml is buggy if input is escaped
|
||||
|
||||
print(out)
|
||||
|
||||
# write node in Moses XML format
|
||||
def create_subtree(position, sentence):
|
||||
|
||||
element = ET.Element('tree')
|
||||
|
||||
if position:
|
||||
element.set('label', sentence[position].proj_func)
|
||||
else:
|
||||
element.set('label', 'sent')
|
||||
|
||||
for i in range(1,position):
|
||||
if sentence[i].proj_head == position:
|
||||
element.append(create_subtree(i, sentence))
|
||||
|
||||
if position:
|
||||
|
||||
if preterminals:
|
||||
head = ET.Element('tree')
|
||||
head.set('label', sentence[position].tag)
|
||||
head.text = sentence[position].word
|
||||
element.append(head)
|
||||
|
||||
else:
|
||||
if len(element):
|
||||
element[-1].tail = sentence[position].word
|
||||
else:
|
||||
element.text = sentence[position].word
|
||||
|
||||
for i in range(position, len(sentence)):
|
||||
if i and sentence[i].proj_head == position:
|
||||
element.append(create_subtree(i, sentence))
|
||||
|
||||
return element
|
||||
|
||||
|
||||
# write node in bracket format (Penn treebank style)
|
||||
def create_brackets(position, sentence):
|
||||
|
||||
if position:
|
||||
element = "( " + sentence[position].proj_func + ' '
|
||||
else:
|
||||
element = "( sent "
|
||||
|
||||
for i in range(1,position):
|
||||
if sentence[i].proj_head == position:
|
||||
element += create_brackets(i, sentence)
|
||||
|
||||
if position:
|
||||
word = sentence[position].word
|
||||
if word == ')':
|
||||
word = 'RBR'
|
||||
elif word == '(':
|
||||
word = 'LBR'
|
||||
|
||||
tag = sentence[position].tag
|
||||
if tag == '$(':
|
||||
tag = '$BR'
|
||||
|
||||
if preterminals:
|
||||
element += '( ' + tag + ' ' + word + ' ) '
|
||||
else:
|
||||
element += word + ' ) '
|
||||
|
||||
for i in range(position, len(sentence)):
|
||||
if i and sentence[i].proj_head == position:
|
||||
element += create_brackets(i, sentence)
|
||||
|
||||
if preterminals or not position:
|
||||
element += ') '
|
||||
|
||||
return element
|
||||
|
||||
if __name__ == '__main__':
|
||||
if sys.version_info < (3,0,0):
|
||||
sys.stdin = codecs.getreader('UTF-8')(sys.stdin)
|
||||
sys.stdout = codecs.getwriter('UTF-8')(sys.stdout)
|
||||
sys.stderr = codecs.getwriter('UTF-8')(sys.stderr)
|
||||
|
||||
if '--no_preterminals' in sys.argv:
|
||||
preterminals = False
|
||||
else:
|
||||
preterminals = True
|
||||
|
||||
if '--brackets' in sys.argv:
|
||||
main('brackets')
|
||||
else:
|
||||
main('xml')
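The projectivity test in conll2mosesxml.py can be hard to follow in diff form. The sketch below restates the same idea on a plain list of head indices: collect, for every token, the set of tokens it dominates, then require each set to cover a contiguous span. The function signature and the `heads` list are assumptions made for illustration; the script itself works on `Word` tuples with a dummy entry at position 0.

```python
from collections import defaultdict


def is_projective(heads):
    """Return True if the dependency tree has no crossing arcs.

    heads[i] is the 1-based head of token i+1; 0 marks the root.
    (Hypothetical input layout, chosen for this sketch.)
    """
    # dominates[h] holds h itself plus every token found in its subtree.
    dominates = defaultdict(set)
    for tok in range(1, len(heads) + 1):
        dominates[tok].add(tok)
        head = heads[tok - 1]
        # Walk up towards the root, registering tok under each ancestor.
        while head != 0:
            if tok in dominates[head]:
                break  # already registered; also stops on cyclic input
            dominates[head].add(tok)
            head = heads[head - 1]
    # A subtree is contiguous iff (max - min) equals (size - 1).
    return all(max(s) - min(s) == len(s) - 1 for s in dominates.values())


if __name__ == "__main__":
    print(is_projective([2, 0, 2]))     # tokens 1 and 3 both attach to 2
    print(is_projective([3, 4, 0, 3]))  # arc 4 -> 2 crosses the root at 3
```

In the second call, token 4 dominates tokens 2 and 4 but not token 3 between them, so its span has a gap and the tree is rejected.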
|
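The escaping in conll2mosesxml.py happens in three stages: quotation marks are turned into entities before the tree is built, Moses-specific characters are replaced after serialization, and the double escaping introduced by the XML serializer is undone last. The following is a minimal sketch of that order, assuming the standard library's `xml.etree.ElementTree` as a stand-in for `lxml`; the helper names `escape_quotes` and `serialize` are hypothetical and do not appear in the script.

```python
import xml.etree.ElementTree as ET


def escape_quotes(text):
    # Stage 1: pre-escape quotation marks, as escape_special_chars() does.
    return text.replace("'", "&apos;").replace('"', "&quot;")


def serialize(label, word):
    # Build a single <tree> node (hypothetical helper for this sketch).
    node = ET.Element("tree")
    node.set("label", label)
    node.text = escape_quotes(word)
    # The serializer escapes '&' in text, so "&apos;" becomes "&amp;apos;".
    out = ET.tostring(node, encoding="unicode")
    # Stage 2: escape characters that are special to Moses.
    out = out.replace("|", "&#124;")  # factor separator
    out = out.replace("[", "&#91;")   # syntax non-terminal
    out = out.replace("]", "&#93;")   # syntax non-terminal
    # Stage 3: undo the double escaping from serialization.
    out = out.replace("&amp;apos;", "&apos;")
    out = out.replace("&amp;quot;", "&quot;")
    return out


if __name__ == "__main__":
    print(serialize("NN", "it's"))
    print(serialize("NN", "a|b"))
```

Stage 2 runs after serialization on purpose: entities inserted at that point contain `&` characters that the serializer never sees, so they need no later repair.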