Merge branch 'master' of ssh://github.com/moses-smt/mosesdecoder

phikoehn 2014-06-11 13:44:22 +01:00
commit 89a9c410c9
51 changed files with 1541 additions and 291 deletions

@ -145,6 +145,7 @@ build-projects lm util phrase-extract search moses moses/LM mert moses-cmd moses
if [ option.get "with-mm" : : "yes" ]
{
alias mm :
+moses/TranslationModel/UG//lookup_mmsapt
moses/TranslationModel/UG/mm//mtt-build
moses/TranslationModel/UG/mm//mtt-dump
moses/TranslationModel/UG/mm//symal2mam

@ -0,0 +1,122 @@
# Moses speedtesting framework
### Description
This is an automated test framework for tracking day-to-day performance changes in Moses.
### Set up
#### Set up a Moses repo
Set up a Moses repo and build it with the desired configuration.
```bash
git clone https://github.com/moses-smt/mosesdecoder.git
cd mosesdecoder
./bjam -j10 --with-cmph=/usr/include/
```
You need to build Moses first, so that the testsuite knows what command you want it to use when rebuilding against newer revisions.
#### Create a parent directory.
Create a parent directory to hold **runtests.py**, the related scripts, and the configuration file.
This should also be the location of the TEST_DIR and TEST_LOG_DIR as explained in the next section.
#### Set up a global configuration file.
You need a configuration file for the testsuite. A sample configuration file is provided in **testsuite\_config**:
<pre>
MOSES_REPO_PATH: /home/moses-speedtest/moses-standard/mosesdecoder
DROP_CACHES_COMM: sys_drop_caches 3
TEST_DIR: /home/moses-speedtest/phrase_tables/tests
TEST_LOG_DIR: /home/moses-speedtest/phrase_tables/testlogs
BASEBRANCH: RELEASE-2.1.1
</pre>
_MOSES\_REPO\_PATH_ is the place where you have set up and built Moses.
_DROP\_CACHES\_COMM_ is the command that will be used to drop caches. It should run without needing root access.
_TEST\_DIR_ is the directory where all the tests will reside.
_TEST\_LOG\_DIR_ is the directory where the performance logs will be gathered. It should be created before running the testsuite for the first time.
_BASEBRANCH_ is the branch against which all new tests will be compared. It should normally be set to be the latest Moses stable release.
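
To make the file format concrete, here is a minimal sketch of reading such a configuration. The function name `read_suite_config` is hypothetical and this is not the parser used by **runtests.py**; it only assumes the `KEY: value` layout and the `#` comment convention shown in the sample files.

```python
def read_suite_config(text):
    """Parse 'KEY: value' lines into a dict, ignoring '#' comments."""
    # Only the five options described in this README are accepted here.
    known = {"MOSES_REPO_PATH", "DROP_CACHES_COMM", "TEST_DIR",
             "TEST_LOG_DIR", "BASEBRANCH"}
    config = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        key, sep, value = line.partition(":")
        if not sep or key not in known:
            raise ValueError("Unrecognized option " + key)
        config[key] = value.strip()
    return config


SAMPLE = """\
MOSES_REPO_PATH: /home/moses-speedtest/moses-standard/mosesdecoder
DROP_CACHES_COMM: sys_drop_caches 3
TEST_DIR: /home/moses-speedtest/phrase_tables/tests
TEST_LOG_DIR: /home/moses-speedtest/phrase_tables/testlogs
BASEBRANCH: RELEASE-2.1.1
"""

config = read_suite_config(SAMPLE)
print(config["BASEBRANCH"])
```

Rejecting unknown keys early makes a misspelled option fail loudly instead of silently falling back to an empty value.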
### Creating tests
To create a test, go into TEST_DIR and create a new folder. The folder name is used as the name of the test.
Inside that folder, place a configuration file named **config**. The name is mandatory.
An example of such a configuration file is **test\_config**:
<pre>
Command: moses -f ... -i fff #Looks for the command in the /bin directory of the repo specified in the testsuite_config
LDPRE: ldpreloads #Comma separated LD_LIBRARY_PATH:/,
Variants: vanilla, cached, ldpre #Can't have cached without ldpre or vanilla
</pre>
The _Command:_ line specifies the executable (which is looked up in the /bin directory of the repo) and any necessary arguments. Before running the test, the script changes into the test's directory, so you can use relative paths.
The _LDPRE:_ line lists the libraries, comma separated, that the LD\_PRELOAD tests should preload.
The _Variants:_ line specifies which types of test to run. This particular line will run the following tests:
1. A vanilla test, meaning that just the command after _Command:_ is issued.
2. A vanilla cached test, meaning that after the vanilla test the command is run again without dropping caches, in order to benchmark performance on a cached filesystem.
3. An LD\_PRELOAD test, which runs the command with LD\_PRELOAD set, once for each library in the comma-separated _LDPRE:_ list.
4. A cached version of each LD\_PRELOAD test.
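
The four cases above can be sketched as a small function that lists the runs a _Variants:_ line triggers. This is an illustration only: `planned_runs`, the test name `mytest` and the library names are hypothetical, and the sketch assumes a cached run always follows its uncached counterpart. The suffixes mirror the log names that **runtests.py** writes.

```python
def planned_runs(test_name, variants, ldpre_libs):
    """Return the log names of the runs a Variants: line would trigger."""
    runs = []
    if "vanilla" in variants:
        runs.append(test_name + "_vanilla")
        if "cached" in variants:
            # Re-run without dropping caches after the vanilla test.
            runs.append(test_name + "_vanilla_cached")
    if "ldpre" in variants:
        for lib in ldpre_libs:
            runs.append(test_name + "_ldpre_" + lib)
            if "cached" in variants:
                runs.append(test_name + "_ldpre_" + lib + "_cached")
    return runs


runs = planned_runs("mytest", ["vanilla", "cached", "ldpre"],
                    ["libA.so", "libB.so"])
for name in runs:
    print(name)
```

With two preload libraries and all three variants enabled, a single test directory therefore produces six timed runs.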
### Running tests.
Running the tests is done through the **runtests.py** script.
#### Running all tests.
To run all tests against the base branch and the latest revision (generating new base-branch test data if it is missing), run:
```bash
python3 runtests.py -c testsuite_config
```
#### Running specific tests.
The script allows the user to manually run a particular test or to test against a specific branch or revision:
<pre>
moses-speedtest@crom:~/phrase_tables$ python3 runtests.py --help
usage: runtests.py [-h] -c CONFIGFILE [-s SINGLETESTDIR] [-r REVISION]
[-b BRANCH]
A python based speedtest suite for moses.
optional arguments:
-h, --help show this help message and exit
-c CONFIGFILE, --configfile CONFIGFILE
Specify test config file
-s SINGLETESTDIR, --singletest SINGLETESTDIR
Single test name directory. Specify directory name,
not full path!
-r REVISION, --revision REVISION
Specify a specific revision for the test.
-b BRANCH, --branch BRANCH
Specify a branch for the test.
</pre>
### Generating HTML report.
To generate a summary of the test results, use the **html\_gen.py** script. It writes a file named *index.html* to the current working directory.
```bash
python3 html_gen.py testsuite_config
```
You should use the generated file with the **style.css** file provided in the html directory.
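
The report colours each change using the `better`, `worse` and `unchanged` classes from **style.css**. A minimal sketch of that classification follows, assuming the change is expressed as the fraction `1 - current/previous` and that 5% is the cut-off; the function name `change_class` is hypothetical.

```python
def change_class(previous, current, threshold=0.05):
    """Return (fraction, css_class) for a timing change.

    A positive fraction means the current run was faster.
    """
    fraction = round(1 - current / previous, 4)
    if fraction > threshold:
        return fraction, "better"
    if fraction < -threshold:
        return fraction, "worse"
    return fraction, "unchanged"


for prev, cur in [(100.0, 90.0), (100.0, 110.0), (100.0, 98.0)]:
    fraction, css = change_class(prev, cur)
    print('<td class="%s">%s</td>' % (css, fraction))
```

Keeping the threshold as a fraction avoids mixing it up with a value given in percent.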
### Command line regression testing.
Alternatively, you can check for regressions from the command line using the **check\_for\_regression.py** script:
```bash
python3 check_for_regression.py TESTLOGS_DIRECTORY
```
The results of all tests are also logged inside the specified TESTLOGS directory, so you can check them manually for additional information such as date, time, revision, and branch.
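
For manual inspection it helps to know the layout of a log line. The sketch below assumes the whitespace-separated format that **runtests.py** writes (date, time, revision, then labelled test name, timings and branch). The names `LogEntry` and `parse_log_line` are hypothetical, and the sample line is made up for illustration.

```python
from collections import namedtuple

LogEntry = namedtuple(
    "LogEntry", "date time revision testname real user system branch")


def parse_log_line(line):
    """Split one log line into its fields; labels sit at odd positions."""
    fields = line.split()
    return LogEntry(fields[0], fields[1], fields[2], fields[4],
                    float(fields[6]), float(fields[8]), float(fields[10]),
                    fields[12])


# Made-up example line, not real benchmark data.
SAMPLE_LINE = ("11.06.2014 13:44:22 abc1234 Testname: mytest_vanilla "
               "RealTime: 12.5 UserTime: 10.0 SystemTime: 1.5 Branch: master")

entry = parse_log_line(SAMPLE_LINE)
print(entry.testname, entry.real, entry.branch)
```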
### Create a cron job:
Create a cron job to run the tests daily and generate an HTML report. An example *cronjob* script is available:
```bash
#!/bin/sh
cd /home/moses-speedtest/phrase_tables
python3 runtests.py -c testsuite_config #Run the tests.
python3 html_gen.py testsuite_config #Generate html
cp index.html /fs/thor4/html/www/speed-test/ #Update the html
```
Place the script in _/etc/cron.daily_ for daily testing.
###### Author
Nikolay Bogoychev, 2014
###### License
This software is licensed under the LGPL.

@ -0,0 +1,63 @@
"""Checks if any of the latest tests has performed considerably differently
from the previous ones. Takes the log directory as an argument."""
import os
import sys
from testsuite_common import Result, processLogLine, bcolors, getLastTwoLines

LOGDIR = sys.argv[1] #Get the log directory as an argument
PERCENTAGE = 5 #Default value (in percent) for how much a test should change
if len(sys.argv) == 3:
    #Default is 5%, but a different value can be given as a second
    #command line parameter
    PERCENTAGE = float(sys.argv[2])

def printResults(regressed, better, unchanged, firsttime):
    """Pretty print the results in different colours"""
    #Result.percentage is a fraction, so scale it by 100 before printing '%'
    if regressed != []:
        for item in regressed:
            print(bcolors.RED + "REGRESSION! " + item.testname + " Was: "\
            + str(item.previous) + " Is: " + str(item.current) + " Change: "\
            + str(round(abs(item.percentage) * 100, 2)) + "%. Revision: "\
            + item.revision + bcolors.ENDC)
        print('\n')
    if unchanged != []:
        for item in unchanged:
            print(bcolors.BLUE + "UNCHANGED: " + item.testname + " Revision: " +\
            item.revision + bcolors.ENDC)
        print('\n')
    if better != []:
        for item in better:
            print(bcolors.GREEN + "IMPROVEMENT! " + item.testname + " Was: "\
            + str(item.previous) + " Is: " + str(item.current) + " Change: "\
            + str(round(abs(item.percentage) * 100, 2)) + "%. Revision: "\
            + item.revision + bcolors.ENDC)
    if firsttime != []:
        for item in firsttime:
            print(bcolors.PURPLE + "First time test! " + item.testname +\
            " Took: " + str(item.real) + " seconds. Revision: " +\
            item.revision + bcolors.ENDC)

all_files = os.listdir(LOGDIR)
regressed = []
better = []
unchanged = []
firsttime = []
#Go through all log files and find which tests have performed better.
for logfile in all_files:
    (line1, line2) = getLastTwoLines(logfile, LOGDIR)
    log1 = processLogLine(line1)
    if line2 == '\n': # Empty line, only one test ever run
        firsttime.append(log1)
        continue
    log2 = processLogLine(line2)
    res = Result(log1.testname, log1.real, log2.real, log2.revision,\
    log2.branch, log1.revision, log1.branch)
    #Result.percentage is a fraction (0.05 means 5%), so convert it to
    #percent before comparing it against the PERCENTAGE threshold.
    change_percent = res.percentage * 100
    if change_percent < -PERCENTAGE:
        regressed.append(res)
    elif change_percent > PERCENTAGE:
        better.append(res)
    else:
        unchanged.append(res)
printResults(regressed, better, unchanged, firsttime)

@ -0,0 +1,7 @@
#!/bin/sh
cd /home/moses-speedtest/phrase_tables
python3 runtests.py -c testsuite_config #Run the tests.
python3 html_gen.py testsuite_config #Generate html
cp index.html /fs/thor4/html/www/speed-test/ #Update the html

@ -0,0 +1,5 @@
###Helpers
This is a Python script that gives you the equivalent of:
```echo 3 > /proc/sys/vm/drop_caches```
You need to set it up so it is executed with root access without needing a password so that the tests can be automated.
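
As a rough sketch of what the helper does (not the script itself), the function below validates the flush level and writes it to the kernel file. The name `write_drop_caches` and the `procfile` parameter are hypothetical. The demonstration writes to a temporary file, because writing to the real `/proc/sys/vm/drop_caches` requires root.

```python
import os
import tempfile


def write_drop_caches(level, procfile="/proc/sys/vm/drop_caches"):
    """Write a flush level (1, 2 or 3) to the drop_caches file."""
    if str(level) not in ("1", "2", "3"):
        raise ValueError("level must be 1, 2 or 3")
    with open(procfile, "w") as handle:
        handle.write(str(level) + "\n")


# Demonstrate against a temporary file instead of the real /proc entry.
with tempfile.TemporaryDirectory() as tmpdir:
    target = os.path.join(tmpdir, "drop_caches")
    write_drop_caches(3, procfile=target)
    with open(target) as handle:
        written = handle.read()

print(repr(written))
```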

@ -0,0 +1,22 @@
#!/usr/bin/spython
from sys import argv, stderr, exit
from os import linesep as ls
procfile = "/proc/sys/vm/drop_caches"
options = ["1","2","3"]
flush_type = None
try:
    flush_type = argv[1][0:1]
    if not flush_type in options:
        raise IndexError, "not in options"
    with open(procfile, "w") as f:
        f.write("%s%s" % (flush_type,ls))
    exit(0)
except IndexError, e:
    stderr.write("Argument %s required.%s" % (options, ls))
except IOError, e:
    stderr.write("Error writing to file.%s" % ls)
except StandardError, e:
    stderr.write("Unknown Error.%s" % ls)
exit(1)

@ -0,0 +1,5 @@
###HTML files.
_index.html_ is a sample file generated by this testsuite.
_style.css_ should be placed in the same directory as _index.html_ in order to visualize the test results in a browser.

File diff suppressed because one or more lines are too long

@ -0,0 +1,21 @@
table,th,td
{
    border:1px solid black;
    border-collapse:collapse
}
tr:nth-child(odd) {
    background-color: Gainsboro;
}
.better {
    color: Green;
}
.worse {
    color: Red;
}
.unchanged {
    color: SkyBlue;
}

@ -0,0 +1,192 @@
"""Generates HTML page containing the testresults"""
from testsuite_common import Result, processLogLine, getLastTwoLines
from runtests import parse_testconfig
import os
import sys
from datetime import datetime, timedelta
HTML_HEADING = """<html>
<head>
<title>Moses speed testing</title>
<link rel="stylesheet" type="text/css" href="style.css"></head><body>"""
HTML_ENDING = "</table></body></html>\n"
TABLE_HEADING = """<table><tr class="heading">
<th>Date</th>
<th>Time</th>
<th>Testname</th>
<th>Revision</th>
<th>Branch</th>
<th>Time</th>
<th>Prevtime</th>
<th>Prevrev</th>
<th>Change (%)</th>
<th>Time (Basebranch)</th>
<th>Change (%, Basebranch)</th>
<th>Time (Days -2)</th>
<th>Change (%, Days -2)</th>
<th>Time (Days -3)</th>
<th>Change (%, Days -3)</th>
<th>Time (Days -4)</th>
<th>Change (%, Days -4)</th>
<th>Time (Days -5)</th>
<th>Change (%, Days -5)</th>
<th>Time (Days -6)</th>
<th>Change (%, Days -6)</th>
<th>Time (Days -7)</th>
<th>Change (%, Days -7)</th>
<th>Time (Days -14)</th>
<th>Change (%, Days -14)</th>
<th>Time (Years -1)</th>
<th>Change (%, Years -1)</th>
</tr>"""
def get_prev_days(date, numdays):
    """Gets the date numdays days before the given date so that we can
    search for that test run in the log file"""
    date_obj = datetime.strptime(date, '%d.%m.%Y').date()
    past_date = date_obj - timedelta(days=numdays)
    return past_date.strftime('%d.%m.%Y')
def gather_necessary_lines(logfile, date):
"""Gathers the necessary lines corresponding to past dates
and parses them if they exist"""
#Get a dictionary of dates
dates = {}
dates[get_prev_days(date, 2)] = ('-2', None)
dates[get_prev_days(date, 3)] = ('-3', None)
dates[get_prev_days(date, 4)] = ('-4', None)
dates[get_prev_days(date, 5)] = ('-5', None)
dates[get_prev_days(date, 6)] = ('-6', None)
dates[get_prev_days(date, 7)] = ('-7', None)
dates[get_prev_days(date, 14)] = ('-14', None)
dates[get_prev_days(date, 365)] = ('-365', None)
openfile = open(logfile, 'r')
for line in openfile:
if line.split()[0] in dates.keys():
day = dates[line.split()[0]][0]
dates[line.split()[0]] = (day, processLogLine(line))
openfile.close()
return dates
def append_date_to_table(resline):
"""Appends past dates to the html"""
cur_html = '<td>' + str(resline.current) + '</td>'
if resline.percentage > 0.05: #If we have improvement of more than 5%
cur_html = cur_html + '<td class="better">' + str(resline.percentage) + '</td>'
elif resline.percentage < -0.05: #We have a regression of more than 5%
cur_html = cur_html + '<td class="worse">' + str(resline.percentage) + '</td>'
else:
cur_html = cur_html + '<td class="unchanged">' + str(resline.percentage) + '</td>'
return cur_html
def compare_rev(filename, rev1, rev2, branch1=False, branch2=False):
"""Compare the test results of two lines. We can specify either a
revision or a branch for comparison. The first rev should be the
base version and the second revision should be the later version"""
#In the log file the index of the revision is 2 but the index of
#the branch is 12. Alternate those depending on whether we are looking
#for a specific revision or branch.
firstidx = 2
secondidx = 2
if branch1 == True:
firstidx = 12
if branch2 == True:
secondidx = 12
rev1line = ''
rev2line = ''
resfile = open(filename, 'r')
for line in resfile:
if rev1 == line.split()[firstidx]:
rev1line = line
elif rev2 == line.split()[secondidx]:
rev2line = line
if rev1line != '' and rev2line != '':
break
resfile.close()
if rev1line == '':
raise ValueError('Revision ' + rev1 + " was not found!")
if rev2line == '':
raise ValueError('Revision ' + rev2 + " was not found!")
logLine1 = processLogLine(rev1line)
logLine2 = processLogLine(rev2line)
res = Result(logLine1.testname, logLine1.real, logLine2.real,\
logLine2.revision, logLine2.branch, logLine1.revision, logLine1.branch)
return res
def produce_html(path, global_config):
"""Produces html file for the report."""
html = '' #The table HTML
for filenam in os.listdir(global_config.testlogs):
#Generate html for the newest two lines
#Get the lines from the log file
(ll1, ll2) = getLastTwoLines(filenam, global_config.testlogs)
logLine1 = processLogLine(ll1)
logLine2 = processLogLine(ll2)
#Generate html
res1 = Result(logLine1.testname, logLine1.real, logLine2.real,\
logLine2.revision, logLine2.branch, logLine1.revision, logLine1.branch)
html = html + '<tr><td>' + logLine2.date + '</td><td>' + logLine2.time + '</td><td>' +\
res1.testname + '</td><td>' + res1.revision[:10] + '</td><td>' + res1.branch + '</td><td>' +\
str(res1.current) + '</td><td>' + str(res1.previous) + '</td><td>' + res1.prevrev[:10] + '</td>'
#Add fancy colours depending on the change
if res1.percentage > 0.05: #If we have improvement of more than 5%
html = html + '<td class="better">' + str(res1.percentage) + '</td>'
elif res1.percentage < -0.05: #We have a regression of more than 5%
html = html + '<td class="worse">' + str(res1.percentage) + '</td>'
else:
html = html + '<td class="unchanged">' + str(res1.percentage) + '</td>'
#Get comparison against the base version
filenam = global_config.testlogs + '/' + filenam #Get proper directory
res2 = compare_rev(filenam, global_config.basebranch, res1.revision, branch1=True)
html = html + '<td>' + str(res2.previous) + '</td>'
#Add fancy colours depending on the change
if res2.percentage > 0.05: #If we have improvement of more than 5%
html = html + '<td class="better">' + str(res2.percentage) + '</td>'
elif res2.percentage < -0.05: #We have a regression of more than 5%
html = html + '<td class="worse">' + str(res2.percentage) + '</td>'
else:
html = html + '<td class="unchanged">' + str(res2.percentage) + '</td>'
#Add extra dates comparison dating from the beginning of time if they exist
past_dates = list(range(2, 8))
past_dates.append(14)
past_dates.append(365) # Get the 1 year ago day
linesdict = gather_necessary_lines(filenam, logLine2.date)
for days in past_dates:
act_date = get_prev_days(logLine2.date, days)
if linesdict[act_date][1] is not None:
logline_date = linesdict[act_date][1]
restemp = Result(logline_date.testname, logline_date.real, logLine2.real,\
logLine2.revision, logLine2.branch, logline_date.revision, logline_date.branch)
html = html + append_date_to_table(restemp)
else:
html = html + '<td>N/A</td><td>N/A</td>'
html = html + '</tr>' #End row
#Write out the file
basebranch_info = '<text><b>Basebranch:</b> ' + res2.prevbranch + ' <b>Revision:</b> ' +\
res2.prevrev + '</text>'
writeoutstr = HTML_HEADING + basebranch_info + TABLE_HEADING + html + HTML_ENDING
writefile = open(path, 'w')
writefile.write(writeoutstr)
writefile.close()
if __name__ == '__main__':
CONFIG = parse_testconfig(sys.argv[1])
produce_html('index.html', CONFIG)

@ -0,0 +1,293 @@
"""Given a config file, runs tests"""
import os
import subprocess
import time
from argparse import ArgumentParser
from testsuite_common import processLogLine
def parse_cmd():
"""Parse the command line arguments"""
description = "A python based speedtest suite for moses."
parser = ArgumentParser(description=description)
parser.add_argument("-c", "--configfile", action="store",\
dest="configfile", required=True,\
help="Specify test config file")
parser.add_argument("-s", "--singletest", action="store",\
dest="singletestdir", default=None,\
help="Single test name directory. Specify directory name,\
not full path!")
parser.add_argument("-r", "--revision", action="store",\
dest="revision", default=None,\
help="Specify a specific revision for the test.")
parser.add_argument("-b", "--branch", action="store",\
dest="branch", default=None,\
help="Specify a branch for the test.")
arguments = parser.parse_args()
return arguments
def repoinit(testconfig):
"""Determines revision and sets up the repo."""
revision = ''
#Update the repo
os.chdir(testconfig.repo)
#Checkout specific branch, else maintain main branch
if testconfig.branch != 'master':
subprocess.call(['git', 'checkout', testconfig.branch])
rev, _ = subprocess.Popen(['git', 'rev-parse', 'HEAD'],\
stdout=subprocess.PIPE, stderr=subprocess.PIPE).communicate()
revision = str(rev).replace("\\n'", '').replace("b'", '')
else:
subprocess.call(['git checkout master'], shell=True)
#Check a specific revision. Else checkout master.
if testconfig.revision:
subprocess.call(['git', 'checkout', testconfig.revision])
revision = testconfig.revision
elif testconfig.branch == 'master':
subprocess.call(['git pull'], shell=True)
rev, _ = subprocess.Popen(['git rev-parse HEAD'], stdout=subprocess.PIPE,\
stderr=subprocess.PIPE, shell=True).communicate()
revision = str(rev).replace("\\n'", '').replace("b'", '')
return revision
class Configuration:
"""A simple class to hold all of the configuration constants"""
def __init__(self, repo, drop_caches, tests, testlogs, basebranch, baserev):
self.repo = repo
self.drop_caches = drop_caches
self.tests = tests
self.testlogs = testlogs
self.basebranch = basebranch
self.baserev = baserev
self.singletest = None
self.revision = None
self.branch = 'master' # Default branch
def additional_args(self, singletest, revision, branch):
"""Additional configuration from command line arguments"""
self.singletest = singletest
if revision is not None:
self.revision = revision
if branch is not None:
self.branch = branch
def set_revision(self, revision):
"""Sets the current revision that is being tested"""
self.revision = revision
class Test:
"""A simple class to contain all information about tests"""
def __init__(self, name, command, ldopts, permutations):
self.name = name
self.command = command
self.ldopts = ldopts.replace(' ', '').split(',') #Not tested yet
self.permutations = permutations
def parse_configfile(conffile, testdir, moses_repo):
"""Parses the config file"""
command, ldopts = '', ''
permutations = []
fileopen = open(conffile, 'r')
for line in fileopen:
line = line.split('#')[0] # Discard comments
if line == '' or line == '\n':
continue # Discard lines with comments only and empty lines
opt, args = line.split(' ', 1) # Get arguments
if opt == 'Command:':
command = args.replace('\n', '')
command = moses_repo + '/bin/' + command
elif opt == 'LDPRE:':
ldopts = args.replace('\n', '')
elif opt == 'Variants:':
permutations = args.replace('\n', '').replace(' ', '').split(',')
else:
raise ValueError('Unrecognized option ' + opt)
#We use the testdir as the name.
testcase = Test(testdir, command, ldopts, permutations)
fileopen.close()
return testcase
def parse_testconfig(conffile):
"""Parses the config file for the whole testsuite."""
repo_path, drop_caches, tests_dir, testlog_dir = '', '', '', ''
basebranch, baserev = '', ''
fileopen = open(conffile, 'r')
for line in fileopen:
line = line.split('#')[0] # Discard comments
if line == '' or line == '\n':
continue # Discard lines with comments only and empty lines
opt, args = line.split(' ', 1) # Get arguments
if opt == 'MOSES_REPO_PATH:':
repo_path = args.replace('\n', '')
elif opt == 'DROP_CACHES_COMM:':
drop_caches = args.replace('\n', '')
elif opt == 'TEST_DIR:':
tests_dir = args.replace('\n', '')
elif opt == 'TEST_LOG_DIR:':
testlog_dir = args.replace('\n', '')
elif opt == 'BASEBRANCH:':
basebranch = args.replace('\n', '')
elif opt == 'BASEREV:':
baserev = args.replace('\n', '')
else:
raise ValueError('Unrecognized option ' + opt)
config = Configuration(repo_path, drop_caches, tests_dir, testlog_dir,\
basebranch, baserev)
fileopen.close()
return config
def get_config():
"""Builds the config object with all necessary attributes"""
args = parse_cmd()
config = parse_testconfig(args.configfile)
config.additional_args(args.singletestdir, args.revision, args.branch)
revision = repoinit(config)
config.set_revision(revision)
return config
def check_for_basever(testlogfile, basebranch):
    """Checks if the base revision is present in the testlogs"""
    #Use a context manager so the file is closed on every return path
    with open(testlogfile, 'r') as filetoopen:
        for line in filetoopen:
            templine = processLogLine(line)
            if templine.branch == basebranch:
                return True
    return False
def split_time(filename):
    """Splits the output of the time function into separate parts.
    We will write time to file, because many programs output to
    stderr which makes it difficult to get only the exact results we need."""
    timefile = open(filename, 'r')
    realtime = float(timefile.readline().replace('\n', '').split()[1])
    usertime = float(timefile.readline().replace('\n', '').split()[1])
    systime = float(timefile.readline().replace('\n', '').split()[1])
    timefile.close()
    return (realtime, usertime, systime)
def write_log(time_file, logname, config):
"""Writes to a logfile"""
log_write = open(config.testlogs + '/' + logname, 'a') # Open logfile
date_run = time.strftime("%d.%m.%Y %H:%M:%S") # Get the time of the test
realtime, usertime, systime = split_time(time_file) # Get the times in a nice form
# Append everything to a log file.
writestr = date_run + " " + config.revision + " Testname: " + logname +\
" RealTime: " + str(realtime) + " UserTime: " + str(usertime) +\
" SystemTime: " + str(systime) + " Branch: " + config.branch +'\n'
log_write.write(writestr)
log_write.close()
def execute_tests(testcase, cur_directory, config):
    """Executes timed tests based on the config file"""
    #Figure out the order in which tests must be executed.
    #Change to the current test directory
    os.chdir(config.tests + '/' + cur_directory)
    #Clear caches
    subprocess.call(['sync'], shell=True)
    subprocess.call([config.drop_caches], shell=True)
    #Perform the vanilla test and, if requested, its cached variant as well
    print(testcase.name)
    if 'vanilla' in testcase.permutations:
        print(testcase.command)
        subprocess.Popen(['time -p -o /tmp/time_moses_tests ' + testcase.command], stdout=None,\
            stderr=subprocess.PIPE, shell=True).communicate()
        write_log('/tmp/time_moses_tests', testcase.name + '_vanilla', config)
        if 'cached' in testcase.permutations:
            subprocess.Popen(['time -p -o /tmp/time_moses_tests ' + testcase.command], stdout=None,\
                stderr=None, shell=True).communicate()
            write_log('/tmp/time_moses_tests', testcase.name + '_vanilla_cached', config)
    #Now perform LD_PRELOAD tests
    if 'ldpre' in testcase.permutations:
        for opt in testcase.ldopts:
            #Clear caches
            subprocess.call(['sync'], shell=True)
            subprocess.call([config.drop_caches], shell=True)
            #Test. The shell needs LD_PRELOAD=<library> (with '=') to set the
            #variable for the command instead of running 'LD_PRELOAD' itself.
            subprocess.Popen(['LD_PRELOAD=' + opt + ' time -p -o /tmp/time_moses_tests ' + testcase.command], stdout=None,\
                stderr=None, shell=True).communicate()
            write_log('/tmp/time_moses_tests', testcase.name + '_ldpre_' + opt, config)
            if 'cached' in testcase.permutations:
                subprocess.Popen(['LD_PRELOAD=' + opt + ' time -p -o /tmp/time_moses_tests ' + testcase.command], stdout=None,\
                    stderr=None, shell=True).communicate()
                write_log('/tmp/time_moses_tests', testcase.name + '_ldpre_' + opt + '_cached', config)
# Go through all the test directories and executes tests
if __name__ == '__main__':
CONFIG = get_config()
ALL_DIR = os.listdir(CONFIG.tests)
#We should first check if any of the tests is run for the first time.
#If some of them are run for the first time we should first get their
#time with the base version (usually the previous release)
FIRSTTIME = []
TESTLOGS = []
#Strip filenames of test underscores
for listline in os.listdir(CONFIG.testlogs):
listline = listline.replace('_vanilla', '')
listline = listline.replace('_cached', '')
listline = listline.replace('_ldpre', '')
TESTLOGS.append(listline)
for directory in ALL_DIR:
if directory not in TESTLOGS:
FIRSTTIME.append(directory)
#Sometimes even though we have the log files, we will need to rerun them
#Against a base version, because we require a different baseversion (for
#example when a new version of Moses is released.) Therefore we should
#Check if the version of Moses that we have as a base version is in all
#of the log files.
for logfile in os.listdir(CONFIG.testlogs):
logfile_name = CONFIG.testlogs + '/' + logfile
if not check_for_basever(logfile_name, CONFIG.basebranch):
logfile = logfile.replace('_vanilla', '')
logfile = logfile.replace('_cached', '')
logfile = logfile.replace('_ldpre', '')
FIRSTTIME.append(logfile)
FIRSTTIME = list(set(FIRSTTIME)) #Deduplicate
if FIRSTTIME != []:
#Create a new configuration for base version tests:
BASECONFIG = Configuration(CONFIG.repo, CONFIG.drop_caches,\
CONFIG.tests, CONFIG.testlogs, CONFIG.basebranch,\
CONFIG.baserev)
BASECONFIG.additional_args(None, CONFIG.baserev, CONFIG.basebranch)
#Set up the repository and get its revision:
REVISION = repoinit(BASECONFIG)
BASECONFIG.set_revision(REVISION)
#Build
os.chdir(BASECONFIG.repo)
subprocess.call(['./previous.sh'], shell=True)
#Perform tests
for directory in FIRSTTIME:
cur_testcase = parse_configfile(BASECONFIG.tests + '/' + directory +\
'/config', directory, BASECONFIG.repo)
execute_tests(cur_testcase, directory, BASECONFIG)
#Reset back the repository to the normal configuration
repoinit(CONFIG)
#Builds moses
os.chdir(CONFIG.repo)
subprocess.call(['./previous.sh'], shell=True)
if CONFIG.singletest:
TESTCASE = parse_configfile(CONFIG.tests + '/' +\
CONFIG.singletest + '/config', CONFIG.singletest, CONFIG.repo)
execute_tests(TESTCASE, CONFIG.singletest, CONFIG)
else:
for directory in ALL_DIR:
cur_testcase = parse_configfile(CONFIG.tests + '/' + directory +\
'/config', directory, CONFIG.repo)
execute_tests(cur_testcase, directory, CONFIG)

@ -0,0 +1,22 @@
#!/usr/bin/spython
from sys import argv, stderr, exit
from os import linesep as ls
procfile = "/proc/sys/vm/drop_caches"
options = ["1","2","3"]
flush_type = None
try:
    flush_type = argv[1][0:1]
    if not flush_type in options:
        raise IndexError, "not in options"
    with open(procfile, "w") as f:
        f.write("%s%s" % (flush_type,ls))
    exit(0)
except IndexError, e:
    stderr.write("Argument %s required.%s" % (options, ls))
except IOError, e:
    stderr.write("Error writing to file.%s" % ls)
except StandardError, e:
    stderr.write("Unknown Error.%s" % ls)
exit(1)

@ -0,0 +1,3 @@
Command: moses -f ... -i fff #Looks for the command in the /bin directory of the repo specified in the testsuite_config
LDPRE: ldpreloads #Comma separated LD_LIBRARY_PATH:/,
Variants: vanilla, cached, ldpre #Can't have cached without ldpre or vanilla

@ -0,0 +1,54 @@
"""Common functions of the testsuite"""
import os

#Colour constants
class bcolors:
    PURPLE = '\033[95m'
    BLUE = '\033[94m'
    GREEN = '\033[92m'
    YELLOW = '\033[93m'
    RED = '\033[91m'
    ENDC = '\033[0m'

class LogLine:
    """A class to contain logfile line"""
    def __init__(self, date, time, revision, testname, real, user, system, branch):
        self.date = date
        self.time = time
        self.revision = revision
        self.testname = testname
        self.real = real
        self.system = system
        self.user = user
        self.branch = branch

class Result:
    """A class to contain results of benchmarking"""
    def __init__(self, testname, previous, current, revision, branch, prevrev, prevbranch):
        self.testname = testname
        self.previous = previous
        self.current = current
        self.change = previous - current
        self.revision = revision
        self.branch = branch
        self.prevbranch = prevbranch
        self.prevrev = prevrev
        #Produce a percentage with fewer digits. This is a fraction:
        #0.05 means the current run is 5% faster than the previous one.
        self.percentage = float(format(1 - current/previous, '.4f'))

def processLogLine(logline):
    """Parses the log line into a nice datastructure"""
    logline = logline.split()
    log = LogLine(logline[0], logline[1], logline[2], logline[4],\
    float(logline[6]), float(logline[8]), float(logline[10]), logline[12])
    return log

def getLastTwoLines(filename, logdir):
    """Just a call to tail to get the diff between the last two runs"""
    try:
        line1, line2 = os.popen("tail -n2 " + logdir + '/' + filename)
    except ValueError: #Check for new tests
        tempfile = open(logdir + '/' + filename)
        line1 = tempfile.readline()
        tempfile.close()
        return (line1, '\n')
    return (line1, line2)

@ -0,0 +1,5 @@
MOSES_REPO_PATH: /home/moses-speedtest/moses-standard/mosesdecoder
DROP_CACHES_COMM: sys_drop_caches 3
TEST_DIR: /home/moses-speedtest/phrase_tables/tests
TEST_LOG_DIR: /home/moses-speedtest/phrase_tables/testlogs
BASEBRANCH: RELEASE-2.1.1

@ -0,0 +1,132 @@
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?fileVersion 4.0.0?><cproject storage_type_id="org.eclipse.cdt.core.XmlProjectDescriptionStorage">
<storageModule moduleId="org.eclipse.cdt.core.settings">
<cconfiguration id="cdt.managedbuild.config.gnu.cross.exe.debug.1847651686">
<storageModule buildSystemId="org.eclipse.cdt.managedbuilder.core.configurationDataProvider" id="cdt.managedbuild.config.gnu.cross.exe.debug.1847651686" moduleId="org.eclipse.cdt.core.settings" name="Debug">
<externalSettings/>
<extensions>
<extension id="org.eclipse.cdt.core.GmakeErrorParser" point="org.eclipse.cdt.core.ErrorParser"/>
<extension id="org.eclipse.cdt.core.CWDLocator" point="org.eclipse.cdt.core.ErrorParser"/>
<extension id="org.eclipse.cdt.core.GCCErrorParser" point="org.eclipse.cdt.core.ErrorParser"/>
<extension id="org.eclipse.cdt.core.GASErrorParser" point="org.eclipse.cdt.core.ErrorParser"/>
<extension id="org.eclipse.cdt.core.GLDErrorParser" point="org.eclipse.cdt.core.ErrorParser"/>
<extension id="org.eclipse.cdt.core.ELF" point="org.eclipse.cdt.core.BinaryParser"/>
</extensions>
</storageModule>
<storageModule moduleId="cdtBuildSystem" version="4.0.0">
<configuration artifactName="${ProjName}" buildArtefactType="org.eclipse.cdt.build.core.buildArtefactType.exe" buildProperties="org.eclipse.cdt.build.core.buildType=org.eclipse.cdt.build.core.buildType.debug,org.eclipse.cdt.build.core.buildArtefactType=org.eclipse.cdt.build.core.buildArtefactType.exe" cleanCommand="rm -rf" description="" id="cdt.managedbuild.config.gnu.cross.exe.debug.1847651686" name="Debug" parent="cdt.managedbuild.config.gnu.cross.exe.debug">
<folderInfo id="cdt.managedbuild.config.gnu.cross.exe.debug.1847651686." name="/" resourcePath="">
<toolChain id="cdt.managedbuild.toolchain.gnu.cross.exe.debug.1312813804" name="Cross GCC" superClass="cdt.managedbuild.toolchain.gnu.cross.exe.debug">
<targetPlatform archList="all" binaryParser="org.eclipse.cdt.core.ELF" id="cdt.managedbuild.targetPlatform.gnu.cross.1457158442" isAbstract="false" osList="all" superClass="cdt.managedbuild.targetPlatform.gnu.cross"/>
<builder buildPath="${workspace_loc:/consolidate}/Debug" id="cdt.managedbuild.builder.gnu.cross.401817170" keepEnvironmentInBuildfile="false" managedBuildOn="true" name="Gnu Make Builder" superClass="cdt.managedbuild.builder.gnu.cross"/>
<tool id="cdt.managedbuild.tool.gnu.cross.c.compiler.584773180" name="Cross GCC Compiler" superClass="cdt.managedbuild.tool.gnu.cross.c.compiler">
<option defaultValue="gnu.c.optimization.level.none" id="gnu.c.compiler.option.optimization.level.548826159" name="Optimization Level" superClass="gnu.c.compiler.option.optimization.level" valueType="enumerated"/>
<option id="gnu.c.compiler.option.debugging.level.69309976" name="Debug Level" superClass="gnu.c.compiler.option.debugging.level" value="gnu.c.debugging.level.max" valueType="enumerated"/>
<inputType id="cdt.managedbuild.tool.gnu.c.compiler.input.1869389417" superClass="cdt.managedbuild.tool.gnu.c.compiler.input"/>
</tool>
<tool id="cdt.managedbuild.tool.gnu.cross.cpp.compiler.1684035985" name="Cross G++ Compiler" superClass="cdt.managedbuild.tool.gnu.cross.cpp.compiler">
<option id="gnu.cpp.compiler.option.optimization.level.1978964587" name="Optimization Level" superClass="gnu.cpp.compiler.option.optimization.level" value="gnu.cpp.compiler.optimization.level.none" valueType="enumerated"/>
<option id="gnu.cpp.compiler.option.debugging.level.1174628687" name="Debug Level" superClass="gnu.cpp.compiler.option.debugging.level" value="gnu.cpp.compiler.debugging.level.max" valueType="enumerated"/>
<option id="gnu.cpp.compiler.option.include.paths.1899244069" name="Include paths (-I)" superClass="gnu.cpp.compiler.option.include.paths" valueType="includePath">
<listOptionValue builtIn="false" value="&quot;${workspace_loc}/../../boost/include&quot;"/>
</option>
<inputType id="cdt.managedbuild.tool.gnu.cpp.compiler.input.1369007077" superClass="cdt.managedbuild.tool.gnu.cpp.compiler.input"/>
</tool>
<tool id="cdt.managedbuild.tool.gnu.cross.c.linker.988122551" name="Cross GCC Linker" superClass="cdt.managedbuild.tool.gnu.cross.c.linker"/>
<tool id="cdt.managedbuild.tool.gnu.cross.cpp.linker.580092188" name="Cross G++ Linker" superClass="cdt.managedbuild.tool.gnu.cross.cpp.linker">
<option id="gnu.cpp.link.option.libs.1224797947" name="Libraries (-l)" superClass="gnu.cpp.link.option.libs" valueType="libs">
<listOptionValue builtIn="false" value="z"/>
<listOptionValue builtIn="false" value="boost_iostreams-mt"/>
</option>
<option id="gnu.cpp.link.option.paths.845281969" superClass="gnu.cpp.link.option.paths" valueType="libPaths">
<listOptionValue builtIn="false" value="&quot;${workspace_loc:}/../../boost/lib64&quot;"/>
</option>
<inputType id="cdt.managedbuild.tool.gnu.cpp.linker.input.1562981657" superClass="cdt.managedbuild.tool.gnu.cpp.linker.input">
<additionalInput kind="additionalinputdependency" paths="$(USER_OBJS)"/>
<additionalInput kind="additionalinput" paths="$(LIBS)"/>
</inputType>
</tool>
<tool id="cdt.managedbuild.tool.gnu.cross.archiver.1813579853" name="Cross GCC Archiver" superClass="cdt.managedbuild.tool.gnu.cross.archiver"/>
<tool id="cdt.managedbuild.tool.gnu.cross.assembler.660034723" name="Cross GCC Assembler" superClass="cdt.managedbuild.tool.gnu.cross.assembler">
<inputType id="cdt.managedbuild.tool.gnu.assembler.input.2016181080" superClass="cdt.managedbuild.tool.gnu.assembler.input"/>
</tool>
</toolChain>
</folderInfo>
</configuration>
</storageModule>
<storageModule moduleId="org.eclipse.cdt.core.externalSettings"/>
</cconfiguration>
<cconfiguration id="cdt.managedbuild.config.gnu.cross.exe.release.1197533473">
<storageModule buildSystemId="org.eclipse.cdt.managedbuilder.core.configurationDataProvider" id="cdt.managedbuild.config.gnu.cross.exe.release.1197533473" moduleId="org.eclipse.cdt.core.settings" name="Release">
<externalSettings/>
<extensions>
<extension id="org.eclipse.cdt.core.GmakeErrorParser" point="org.eclipse.cdt.core.ErrorParser"/>
<extension id="org.eclipse.cdt.core.CWDLocator" point="org.eclipse.cdt.core.ErrorParser"/>
<extension id="org.eclipse.cdt.core.GCCErrorParser" point="org.eclipse.cdt.core.ErrorParser"/>
<extension id="org.eclipse.cdt.core.GASErrorParser" point="org.eclipse.cdt.core.ErrorParser"/>
<extension id="org.eclipse.cdt.core.GLDErrorParser" point="org.eclipse.cdt.core.ErrorParser"/>
<extension id="org.eclipse.cdt.core.ELF" point="org.eclipse.cdt.core.BinaryParser"/>
</extensions>
</storageModule>
<storageModule moduleId="cdtBuildSystem" version="4.0.0">
<configuration artifactName="${ProjName}" buildArtefactType="org.eclipse.cdt.build.core.buildArtefactType.exe" buildProperties="org.eclipse.cdt.build.core.buildType=org.eclipse.cdt.build.core.buildType.release,org.eclipse.cdt.build.core.buildArtefactType=org.eclipse.cdt.build.core.buildArtefactType.exe" cleanCommand="rm -rf" description="" id="cdt.managedbuild.config.gnu.cross.exe.release.1197533473" name="Release" parent="cdt.managedbuild.config.gnu.cross.exe.release">
<folderInfo id="cdt.managedbuild.config.gnu.cross.exe.release.1197533473." name="/" resourcePath="">
<toolChain id="cdt.managedbuild.toolchain.gnu.cross.exe.release.1193312581" name="Cross GCC" superClass="cdt.managedbuild.toolchain.gnu.cross.exe.release">
<targetPlatform archList="all" binaryParser="org.eclipse.cdt.core.ELF" id="cdt.managedbuild.targetPlatform.gnu.cross.1614674218" isAbstract="false" osList="all" superClass="cdt.managedbuild.targetPlatform.gnu.cross"/>
<builder buildPath="${workspace_loc:/consolidate}/Release" id="cdt.managedbuild.builder.gnu.cross.1921548268" keepEnvironmentInBuildfile="false" managedBuildOn="true" name="Gnu Make Builder" superClass="cdt.managedbuild.builder.gnu.cross"/>
<tool id="cdt.managedbuild.tool.gnu.cross.c.compiler.1402792534" name="Cross GCC Compiler" superClass="cdt.managedbuild.tool.gnu.cross.c.compiler">
<option defaultValue="gnu.c.optimization.level.most" id="gnu.c.compiler.option.optimization.level.172258714" name="Optimization Level" superClass="gnu.c.compiler.option.optimization.level" valueType="enumerated"/>
<option id="gnu.c.compiler.option.debugging.level.949623548" name="Debug Level" superClass="gnu.c.compiler.option.debugging.level" value="gnu.c.debugging.level.none" valueType="enumerated"/>
<inputType id="cdt.managedbuild.tool.gnu.c.compiler.input.1960225725" superClass="cdt.managedbuild.tool.gnu.c.compiler.input"/>
</tool>
<tool id="cdt.managedbuild.tool.gnu.cross.cpp.compiler.1697856596" name="Cross G++ Compiler" superClass="cdt.managedbuild.tool.gnu.cross.cpp.compiler">
<option id="gnu.cpp.compiler.option.optimization.level.1575999400" name="Optimization Level" superClass="gnu.cpp.compiler.option.optimization.level" value="gnu.cpp.compiler.optimization.level.most" valueType="enumerated"/>
<option id="gnu.cpp.compiler.option.debugging.level.732263649" name="Debug Level" superClass="gnu.cpp.compiler.option.debugging.level" value="gnu.cpp.compiler.debugging.level.none" valueType="enumerated"/>
<inputType id="cdt.managedbuild.tool.gnu.cpp.compiler.input.1685852561" superClass="cdt.managedbuild.tool.gnu.cpp.compiler.input"/>
</tool>
<tool id="cdt.managedbuild.tool.gnu.cross.c.linker.1332869586" name="Cross GCC Linker" superClass="cdt.managedbuild.tool.gnu.cross.c.linker"/>
<tool id="cdt.managedbuild.tool.gnu.cross.cpp.linker.484647585" name="Cross G++ Linker" superClass="cdt.managedbuild.tool.gnu.cross.cpp.linker">
<inputType id="cdt.managedbuild.tool.gnu.cpp.linker.input.2140954002" superClass="cdt.managedbuild.tool.gnu.cpp.linker.input">
<additionalInput kind="additionalinputdependency" paths="$(USER_OBJS)"/>
<additionalInput kind="additionalinput" paths="$(LIBS)"/>
</inputType>
</tool>
<tool id="cdt.managedbuild.tool.gnu.cross.archiver.620666274" name="Cross GCC Archiver" superClass="cdt.managedbuild.tool.gnu.cross.archiver"/>
<tool id="cdt.managedbuild.tool.gnu.cross.assembler.1478840357" name="Cross GCC Assembler" superClass="cdt.managedbuild.tool.gnu.cross.assembler">
<inputType id="cdt.managedbuild.tool.gnu.assembler.input.412043972" superClass="cdt.managedbuild.tool.gnu.assembler.input"/>
</tool>
</toolChain>
</folderInfo>
</configuration>
</storageModule>
<storageModule moduleId="org.eclipse.cdt.core.externalSettings"/>
</cconfiguration>
</storageModule>
<storageModule moduleId="cdtBuildSystem" version="4.0.0">
<project id="consolidate.cdt.managedbuild.target.gnu.cross.exe.1166003694" name="Executable" projectType="cdt.managedbuild.target.gnu.cross.exe"/>
</storageModule>
<storageModule moduleId="scannerConfiguration">
<autodiscovery enabled="true" problemReportingEnabled="true" selectedProfileId=""/>
<scannerConfigBuildInfo instanceId="cdt.managedbuild.config.gnu.cross.exe.debug.1847651686;cdt.managedbuild.config.gnu.cross.exe.debug.1847651686.;cdt.managedbuild.tool.gnu.cross.c.compiler.584773180;cdt.managedbuild.tool.gnu.c.compiler.input.1869389417">
<autodiscovery enabled="true" problemReportingEnabled="true" selectedProfileId="org.eclipse.cdt.managedbuilder.core.GCCManagedMakePerProjectProfileC"/>
</scannerConfigBuildInfo>
<scannerConfigBuildInfo instanceId="cdt.managedbuild.config.gnu.cross.exe.release.1197533473;cdt.managedbuild.config.gnu.cross.exe.release.1197533473.;cdt.managedbuild.tool.gnu.cross.cpp.compiler.1697856596;cdt.managedbuild.tool.gnu.cpp.compiler.input.1685852561">
<autodiscovery enabled="true" problemReportingEnabled="true" selectedProfileId="org.eclipse.cdt.managedbuilder.core.GCCManagedMakePerProjectProfileCPP"/>
</scannerConfigBuildInfo>
<scannerConfigBuildInfo instanceId="cdt.managedbuild.config.gnu.cross.exe.debug.1847651686;cdt.managedbuild.config.gnu.cross.exe.debug.1847651686.;cdt.managedbuild.tool.gnu.cross.cpp.compiler.1684035985;cdt.managedbuild.tool.gnu.cpp.compiler.input.1369007077">
<autodiscovery enabled="true" problemReportingEnabled="true" selectedProfileId="org.eclipse.cdt.managedbuilder.core.GCCManagedMakePerProjectProfileCPP"/>
</scannerConfigBuildInfo>
<scannerConfigBuildInfo instanceId="cdt.managedbuild.config.gnu.cross.exe.release.1197533473;cdt.managedbuild.config.gnu.cross.exe.release.1197533473.;cdt.managedbuild.tool.gnu.cross.c.compiler.1402792534;cdt.managedbuild.tool.gnu.c.compiler.input.1960225725">
<autodiscovery enabled="true" problemReportingEnabled="true" selectedProfileId="org.eclipse.cdt.managedbuilder.core.GCCManagedMakePerProjectProfileC"/>
</scannerConfigBuildInfo>
</storageModule>
<storageModule moduleId="org.eclipse.cdt.core.LanguageSettingsProviders"/>
<storageModule moduleId="refreshScope" versionNumber="2">
<configuration configurationName="Release">
<resource resourceType="PROJECT" workspacePath="/consolidate"/>
</configuration>
<configuration configurationName="Debug">
<resource resourceType="PROJECT" workspacePath="/consolidate"/>
</configuration>
</storageModule>
</cproject>


@ -0,0 +1,64 @@
<?xml version="1.0" encoding="UTF-8"?>
<projectDescription>
<name>consolidate</name>
<comment></comment>
<projects>
</projects>
<buildSpec>
<buildCommand>
<name>org.eclipse.cdt.managedbuilder.core.genmakebuilder</name>
<triggers>clean,full,incremental,</triggers>
<arguments>
</arguments>
</buildCommand>
<buildCommand>
<name>org.eclipse.cdt.managedbuilder.core.ScannerConfigBuilder</name>
<triggers>full,incremental,</triggers>
<arguments>
</arguments>
</buildCommand>
</buildSpec>
<natures>
<nature>org.eclipse.cdt.core.cnature</nature>
<nature>org.eclipse.cdt.core.ccnature</nature>
<nature>org.eclipse.cdt.managedbuilder.core.managedBuildNature</nature>
<nature>org.eclipse.cdt.managedbuilder.core.ScannerConfigNature</nature>
</natures>
<linkedResources>
<link>
<name>InputFileStream.cpp</name>
<type>1</type>
<locationURI>PARENT-3-PROJECT_LOC/phrase-extract/InputFileStream.cpp</locationURI>
</link>
<link>
<name>InputFileStream.h</name>
<type>1</type>
<locationURI>PARENT-3-PROJECT_LOC/phrase-extract/InputFileStream.h</locationURI>
</link>
<link>
<name>OutputFileStream.cpp</name>
<type>1</type>
<locationURI>PARENT-3-PROJECT_LOC/phrase-extract/OutputFileStream.cpp</locationURI>
</link>
<link>
<name>OutputFileStream.h</name>
<type>1</type>
<locationURI>PARENT-3-PROJECT_LOC/phrase-extract/OutputFileStream.h</locationURI>
</link>
<link>
<name>consolidate-main.cpp</name>
<type>1</type>
<locationURI>PARENT-3-PROJECT_LOC/phrase-extract/consolidate-main.cpp</locationURI>
</link>
<link>
<name>tables-core.cpp</name>
<type>1</type>
<locationURI>PARENT-3-PROJECT_LOC/phrase-extract/tables-core.cpp</locationURI>
</link>
<link>
<name>tables-core.h</name>
<type>1</type>
<locationURI>PARENT-3-PROJECT_LOC/phrase-extract/tables-core.h</locationURI>
</link>
</linkedResources>
</projectDescription>


@ -42,9 +42,11 @@
</option> </option>
<option id="gnu.cpp.link.option.libs.585257079" name="Libraries (-l)" superClass="gnu.cpp.link.option.libs" valueType="libs"> <option id="gnu.cpp.link.option.libs.585257079" name="Libraries (-l)" superClass="gnu.cpp.link.option.libs" valueType="libs">
<listOptionValue builtIn="false" value="mert_lib"/> <listOptionValue builtIn="false" value="mert_lib"/>
<listOptionValue builtIn="false" value="boost_system-mt"/>
<listOptionValue builtIn="false" value="util"/> <listOptionValue builtIn="false" value="util"/>
<listOptionValue builtIn="false" value="boost_system-mt"/>
<listOptionValue builtIn="false" value="boost_thread-mt"/>
<listOptionValue builtIn="false" value="z"/> <listOptionValue builtIn="false" value="z"/>
<listOptionValue builtIn="false" value="pthread"/>
</option> </option>
<inputType id="cdt.managedbuild.tool.gnu.cpp.linker.input.656319745" superClass="cdt.managedbuild.tool.gnu.cpp.linker.input"> <inputType id="cdt.managedbuild.tool.gnu.cpp.linker.input.656319745" superClass="cdt.managedbuild.tool.gnu.cpp.linker.input">
<additionalInput kind="additionalinputdependency" paths="$(USER_OBJS)"/> <additionalInput kind="additionalinputdependency" paths="$(USER_OBJS)"/>


@ -4,6 +4,7 @@
<comment></comment> <comment></comment>
<projects> <projects>
<project>mert_lib</project> <project>mert_lib</project>
<project>util</project>
</projects> </projects>
<buildSpec> <buildSpec>
<buildCommand> <buildCommand>


@ -125,7 +125,7 @@ void ChartManager::ProcessSentence()
*/ */
void ChartManager::AddXmlChartOptions() void ChartManager::AddXmlChartOptions()
{ {
const StaticData &staticData = StaticData::Instance(); // const StaticData &staticData = StaticData::Instance();
const std::vector <ChartTranslationOptions*> xmlChartOptionsList = m_source.GetXmlChartTranslationOptions(); const std::vector <ChartTranslationOptions*> xmlChartOptionsList = m_source.GetXmlChartTranslationOptions();
IFVERBOSE(2) { IFVERBOSE(2) {


@ -142,7 +142,7 @@ namespace Moses
{ {
Clear(); Clear();
const StaticData &staticData = StaticData::Instance(); // const StaticData &staticData = StaticData::Instance();
const InputFeature &inputFeature = InputFeature::Instance(); const InputFeature &inputFeature = InputFeature::Instance();
size_t numInputScores = inputFeature.GetNumInputScores(); size_t numInputScores = inputFeature.GetNumInputScores();
size_t numRealWordCount = inputFeature.GetNumRealWordsInInput(); size_t numRealWordCount = inputFeature.GetNumRealWordsInInput();


@ -85,7 +85,7 @@ size_t InputPath::GetTotalRuleSize() const
size_t ret = 0; size_t ret = 0;
std::map<const PhraseDictionary*, std::pair<const TargetPhraseCollection*, const void*> >::const_iterator iter; std::map<const PhraseDictionary*, std::pair<const TargetPhraseCollection*, const void*> >::const_iterator iter;
for (iter = m_targetPhrases.begin(); iter != m_targetPhrases.end(); ++iter) { for (iter = m_targetPhrases.begin(); iter != m_targetPhrases.end(); ++iter) {
const PhraseDictionary *pt = iter->first; // const PhraseDictionary *pt = iter->first;
const TargetPhraseCollection *tpColl = iter->second.first; const TargetPhraseCollection *tpColl = iter->second.first;
if (tpColl) { if (tpColl) {


@ -15,7 +15,7 @@ public:
virtual void ProcessValue() {}; virtual void ProcessValue() {};
const std::string &GetValueString() { return m_value; }; const std::string &GetValueString() const { return m_value; };
protected: protected:


@ -47,8 +47,8 @@ class WordsRange;
class Phrase class Phrase
{ {
friend std::ostream& operator<<(std::ostream&, const Phrase&); friend std::ostream& operator<<(std::ostream&, const Phrase&);
private: // private:
protected:
std::vector<Word> m_words; std::vector<Word> m_words;
public: public:


@ -494,7 +494,8 @@ bool StaticData::LoadData(Parameter *parameter)
} }
m_xmlBrackets.first= brackets[0]; m_xmlBrackets.first= brackets[0];
m_xmlBrackets.second=brackets[1]; m_xmlBrackets.second=brackets[1];
cerr << "XML tags opening and closing brackets for XML input are: " << m_xmlBrackets.first << " and " << m_xmlBrackets.second << endl; VERBOSE(1,"XML tags opening and closing brackets for XML input are: "
<< m_xmlBrackets.first << " and " << m_xmlBrackets.second << endl);
} }
if (m_parameter->GetParam("placeholder-factor").size() > 0) { if (m_parameter->GetParam("placeholder-factor").size() > 0) {
@ -511,7 +512,7 @@ bool StaticData::LoadData(Parameter *parameter)
const vector<string> &features = m_parameter->GetParam("feature"); const vector<string> &features = m_parameter->GetParam("feature");
for (size_t i = 0; i < features.size(); ++i) { for (size_t i = 0; i < features.size(); ++i) {
const string &line = Trim(features[i]); const string &line = Trim(features[i]);
cerr << "line=" << line << endl; VERBOSE(1,"line=" << line << endl);
if (line.empty()) if (line.empty())
continue; continue;
@ -535,7 +536,9 @@ bool StaticData::LoadData(Parameter *parameter)
NoCache(); NoCache();
OverrideFeatures(); OverrideFeatures();
if (!m_parameter->isParamSpecified("show-weights")) {
LoadFeatureFunctions(); LoadFeatureFunctions();
}
if (!LoadDecodeGraphs()) return false; if (!LoadDecodeGraphs()) return false;
@ -640,7 +643,8 @@ void StaticData::LoadNonTerminals()
"Incorrect unknown LHS format: " << line); "Incorrect unknown LHS format: " << line);
UnknownLHSEntry entry(tokens[0], Scan<float>(tokens[1])); UnknownLHSEntry entry(tokens[0], Scan<float>(tokens[1]));
m_unknownLHS.push_back(entry); m_unknownLHS.push_back(entry);
const Factor *targetFactor = factorCollection.AddFactor(Output, 0, tokens[0], true); // const Factor *targetFactor =
factorCollection.AddFactor(Output, 0, tokens[0], true);
} }
} }
@ -734,7 +738,7 @@ bool StaticData::LoadDecodeGraphs()
DecodeGraph *decodeGraph; DecodeGraph *decodeGraph;
if (IsChart()) { if (IsChart()) {
size_t maxChartSpan = (decodeGraphInd < maxChartSpans.size()) ? maxChartSpans[decodeGraphInd] : DEFAULT_MAX_CHART_SPAN; size_t maxChartSpan = (decodeGraphInd < maxChartSpans.size()) ? maxChartSpans[decodeGraphInd] : DEFAULT_MAX_CHART_SPAN;
cerr << "max-chart-span: " << maxChartSpans[decodeGraphInd] << endl; VERBOSE(1,"max-chart-span: " << maxChartSpans[decodeGraphInd] << endl);
decodeGraph = new DecodeGraph(m_decodeGraphs.size(), maxChartSpan); decodeGraph = new DecodeGraph(m_decodeGraphs.size(), maxChartSpan);
} else { } else {
decodeGraph = new DecodeGraph(m_decodeGraphs.size()); decodeGraph = new DecodeGraph(m_decodeGraphs.size());
@ -866,7 +870,7 @@ void StaticData::SetExecPath(const std::string &path)
if (pos != string::npos) { if (pos != string::npos) {
m_binPath = path.substr(0, pos); m_binPath = path.substr(0, pos);
} }
cerr << m_binPath << endl; VERBOSE(1,m_binPath << endl);
} }
const string &StaticData::GetBinDirectory() const const string &StaticData::GetBinDirectory() const
@ -920,7 +924,8 @@ void StaticData::LoadFeatureFunctions()
FeatureFunction *ff = *iter; FeatureFunction *ff = *iter;
bool doLoad = true; bool doLoad = true;
if (PhraseDictionary *ffCast = dynamic_cast<PhraseDictionary*>(ff)) { // if (PhraseDictionary *ffCast = dynamic_cast<PhraseDictionary*>(ff)) {
if (dynamic_cast<PhraseDictionary*>(ff)) {
doLoad = false; doLoad = false;
} }
@ -964,7 +969,7 @@ bool StaticData::CheckWeights() const
set<string>::iterator iter; set<string>::iterator iter;
for (iter = weightNames.begin(); iter != weightNames.end(); ) { for (iter = weightNames.begin(); iter != weightNames.end(); ) {
string fname = (*iter).substr(0, (*iter).find("_")); string fname = (*iter).substr(0, (*iter).find("_"));
cerr << fname << "\n"; VERBOSE(1,fname << "\n");
if (featureNames.find(fname) != featureNames.end()) { if (featureNames.find(fname) != featureNames.end()) {
weightNames.erase(iter++); weightNames.erase(iter++);
} }
@ -1039,7 +1044,7 @@ bool StaticData::LoadAlternateWeightSettings()
vector<string> tokens = Tokenize(weightSpecification[i]); vector<string> tokens = Tokenize(weightSpecification[i]);
vector<string> args = Tokenize(tokens[0], "="); vector<string> args = Tokenize(tokens[0], "=");
currentId = args[1]; currentId = args[1];
cerr << "alternate weight setting " << currentId << endl; VERBOSE(1,"alternate weight setting " << currentId << endl);
UTIL_THROW_IF2(m_weightSetting.find(currentId) != m_weightSetting.end(), UTIL_THROW_IF2(m_weightSetting.find(currentId) != m_weightSetting.end(),
"Duplicate alternate weight id: " << currentId); "Duplicate alternate weight id: " << currentId);
m_weightSetting[ currentId ] = new ScoreComponentCollection; m_weightSetting[ currentId ] = new ScoreComponentCollection;


@ -44,6 +44,12 @@ public:
typedef CollType::iterator iterator; typedef CollType::iterator iterator;
typedef CollType::const_iterator const_iterator; typedef CollType::const_iterator const_iterator;
TargetPhrase const*
operator[](size_t const i) const
{
return m_collection.at(i);
}
iterator begin() { iterator begin() {
return m_collection.begin(); return m_collection.begin();
} }


@ -17,12 +17,8 @@ License along with this library; if not, write to the Free Software
Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
***********************************************************************/ ***********************************************************************/
#include "util/exception.hh" #include "util/exception.hh"
#include "moses/TranslationModel/PhraseDictionaryMultiModelCounts.h" #include "moses/TranslationModel/PhraseDictionaryMultiModelCounts.h"
#define LINE_MAX_LENGTH 100000
#include "phrase-extract/SafeGetline.h" // for SAFE_GETLINE()
using namespace std; using namespace std;
template<typename T> template<typename T>
@ -461,16 +457,14 @@ void PhraseDictionaryMultiModelCounts::LoadLexicalTable( string &fileName, lexic
} }
istream *inFileP = &inFile; istream *inFileP = &inFile;
char line[LINE_MAX_LENGTH];
int i=0; int i=0;
while(true) { string line;
while(getline(*inFileP, line)) {
i++; i++;
if (i%100000 == 0) cerr << "." << flush; if (i%100000 == 0) cerr << "." << flush;
SAFE_GETLINE((*inFileP), line, LINE_MAX_LENGTH, '\n', __FILE__);
if (inFileP->eof()) break;
vector<string> token = tokenize( line ); vector<string> token = tokenize( line.c_str() );
if (token.size() != 4) { if (token.size() != 4) {
cerr << "line " << i << " in " << fileName cerr << "line " << i << " in " << fileName
<< " has wrong number of tokens, skipping:\n" << " has wrong number of tokens, skipping:\n"

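The hunk above swaps a fixed-size `char` buffer and `SAFE_GETLINE` for `std::getline` on a `std::string`. A minimal standalone sketch of that reading pattern follows; `split_ws` and `count_well_formed` are hypothetical names used only for illustration and are not part of Moses, and whitespace splitting is an assumption about what `tokenize` does.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split a line on whitespace.
std::vector<std::string> split_ws(const std::string& line) {
  std::istringstream buf(line);
  std::vector<std::string> out;
  std::string tok;
  while (buf >> tok) out.push_back(tok);
  return out;
}

// Count the lines that have exactly `expected` fields.
// std::getline grows the string as needed, so no maximum line length
// has to be chosen, and a final line without a trailing newline is
// still delivered to the loop body.
std::size_t count_well_formed(std::istream& in, std::size_t expected) {
  assert(expected > 0);
  std::size_t good = 0;
  std::string line;
  while (std::getline(in, line)) {
    if (split_ws(line).size() == expected) ++good;
  }
  return good;
}
```

Using the stream state returned by `std::getline` as the loop condition also removes the need for a separate `eof()` check inside the loop.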

@ -9,6 +9,17 @@ $(TOP)/moses/TranslationModel/UG//mmsapt
$(TOP)/util//kenutil $(TOP)/util//kenutil
; ;
exe lookup_mmsapt :
lookup_mmsapt.cc
$(TOP)/moses//moses
$(TOP)/moses/TranslationModel/UG/generic//generic
$(TOP)//boost_iostreams
$(TOP)//boost_program_options
$(TOP)/moses/TranslationModel/UG/mm//mm
$(TOP)/moses/TranslationModel/UG//mmsapt
$(TOP)/util//kenutil
;
install $(PREFIX)/bin : try-align ; install $(PREFIX)/bin : try-align ;
fakelib mmsapt : [ glob *.cpp mmsapt*.cc ] ; fakelib mmsapt : [ glob *.cpp mmsapt*.cc ] ;


@ -0,0 +1,76 @@
#include "mmsapt.h"
#include <boost/foreach.hpp>
#include <boost/tokenizer.hpp>
#include <boost/shared_ptr.hpp>
#include <algorithm>
#include <iomanip>   // setw
#include <iostream>
#include <sstream>   // istringstream
#include <string>
#include <vector>
using namespace Moses;
using namespace bitext;
using namespace std;
using namespace boost;
vector<FactorType> fo(1,FactorType(0));
class SimplePhrase : public Moses::Phrase
{
vector<FactorType> const m_fo; // factor order
public:
SimplePhrase(): m_fo(1,FactorType(0)) {}
void init(string const& s)
{
istringstream buf(s); string w;
while (buf >> w)
{
// append one word per whitespace-separated token
this->AddWord().CreateFromString(Input,m_fo,StringPiece(w),false,false);
}
}
};
class TargetPhraseIndexSorter
{
TargetPhraseCollection const& my_tpc;
CompareTargetPhrase cmp;
public:
TargetPhraseIndexSorter(TargetPhraseCollection const& tpc) : my_tpc(tpc) {}
bool operator()(size_t a, size_t b) const
{
return cmp(*my_tpc[a], *my_tpc[b]);
}
};
int main(int argc, char* argv[])
{
Parameter params;
if (!params.LoadParam(argc,argv) || !StaticData::LoadDataStatic(&params, argv[0]))
exit(1);
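Mmsapt* PT = NULL;
BOOST_FOREACH(PhraseDictionary* pd, PhraseDictionary::GetColl())
if ((PT = dynamic_cast<Mmsapt*>(pd))) break;
if (!PT)
{
// without this check, PT would be dereferenced below even if no Mmsapt was loaded
cerr << "No Mmsapt phrase table found in the configuration." << endl;
exit(1);
}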
string line;
while (getline(cin,line))
{
SimplePhrase p; p.init(line);
cout << p << endl;
TargetPhraseCollection const* trg = PT->GetTargetPhraseCollectionLEGACY(p);
if (!trg) continue;
vector<size_t> order(trg->GetSize());
for (size_t i = 0; i < order.size(); ++i) order[i] = i;
sort(order.begin(),order.end(),TargetPhraseIndexSorter(*trg));
size_t k = 0;
BOOST_FOREACH(size_t i, order)
{
Phrase const& phr = static_cast<Phrase const&>(*(*trg)[i]);
cout << setw(3) << ++k << " " << phr << endl;
}
PT->Release(trg);
}
exit(0);
}
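
The `TargetPhraseIndexSorter` above sorts a vector of indices rather than the collection itself, so the underlying container is left untouched. A minimal standalone sketch of the same idea on plain scores is given here; `IndexSorter` and `sorted_order` are hypothetical names, and the "higher score first" ordering is an assumption for illustration, not a statement about what `CompareTargetPhrase` does.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Comparator over indices: orders two positions by the values they refer to.
class IndexSorter {
  const std::vector<double>& scores_;
public:
  explicit IndexSorter(const std::vector<double>& s) : scores_(s) {}
  bool operator()(std::size_t a, std::size_t b) const {
    assert(a < scores_.size() && b < scores_.size());
    return scores_[a] > scores_[b];  // assumed ordering: higher score first
  }
};

// Return the indices 0..n-1 arranged so that scores[order[0]] is the best.
std::vector<std::size_t> sorted_order(const std::vector<double>& scores) {
  std::vector<std::size_t> order(scores.size());
  for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
  std::sort(order.begin(), order.end(), IndexSorter(scores));
  return order;
}
```

Sorting indices is useful when the collection is `const` or owned elsewhere, as with the `TargetPhraseCollection const*` returned by the phrase table here.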


@ -131,7 +131,7 @@ interpret_args(int ac, char* av[])
o.add_options() o.add_options()
("help,h", "print this message") ("help,h", "print this message")
("source,s",po::value<string>(&swrd),"source word") ("source,s",po::value<string>(&swrd),"source word")
("target,t",po::value<string>(&swrd),"target word") ("target,t",po::value<string>(&twrd),"target word")
; ;
h.add_options() h.add_options()


@ -318,10 +318,10 @@ namespace Moses {
assert(pp.sample1); assert(pp.sample1);
assert(pp.joint); assert(pp.joint);
assert(pp.raw2); assert(pp.raw2);
(*dest)[i] = log(pp.raw1); (*dest)[i] = -log(pp.raw1);
(*dest)[++i] = log(pp.sample1); (*dest)[++i] = -log(pp.sample1);
(*dest)[++i] = log(pp.joint); (*dest)[++i] = +log(pp.joint);
(*dest)[++i] = log(pp.raw2); (*dest)[++i] = -log(pp.raw2);
} }
}; };
@ -592,6 +592,7 @@ namespace Moses {
friend class agenda; friend class agenda;
boost::taus88 rnd; // every job has its own pseudo random generator boost::taus88 rnd; // every job has its own pseudo random generator
double rnddenom; // denominator for scaling random sampling double rnddenom; // denominator for scaling random sampling
size_t min_diverse; // minimum number of distinct translations
public: public:
size_t workers; // how many workers are working on this job? size_t workers; // how many workers are working on this job?
sptr<TSA<Token> const> root; // root of the underlying suffix array sptr<TSA<Token> const> root; // root of the underlying suffix array
@ -644,34 +645,47 @@ namespace Moses {
step(uint64_t & sid, uint64_t & offset) step(uint64_t & sid, uint64_t & offset)
{ {
boost::lock_guard<boost::mutex> jguard(lock); boost::lock_guard<boost::mutex> jguard(lock);
if ((max_samples == 0) && (next < stop)) bool ret = (max_samples == 0) && (next < stop);
if (ret)
{ {
next = root->readSid(next,stop,sid); next = root->readSid(next,stop,sid);
next = root->readOffset(next,stop,offset); next = root->readOffset(next,stop,offset);
boost::lock_guard<boost::mutex> sguard(stats->lock); boost::lock_guard<boost::mutex> sguard(stats->lock);
if (stats->raw_cnt == ctr) ++stats->raw_cnt; if (stats->raw_cnt == ctr) ++stats->raw_cnt;
stats->sample_cnt++; stats->sample_cnt++;
return true;
} }
else else
{ {
while (next < stop && stats->good < max_samples) while (next < stop && (stats->good < max_samples ||
stats->trg.size() < min_diverse))
{ {
next = root->readSid(next,stop,sid); next = root->readSid(next,stop,sid);
next = root->readOffset(next,stop,offset); next = root->readOffset(next,stop,offset);
{ { // brackets required for lock scoping; see sguard immediately below
boost::lock_guard<boost::mutex> sguard(stats->lock); boost::lock_guard<boost::mutex> sguard(stats->lock);
if (stats->raw_cnt == ctr) ++stats->raw_cnt; if (stats->raw_cnt == ctr) ++stats->raw_cnt;
size_t rnum = (stats->raw_cnt - ctr++)*(rnd()/(rnd.max()+1.)); size_t scalefac = (stats->raw_cnt - ctr++);
size_t rnum = scalefac*(rnd()/(rnd.max()+1.));
#if 0
cerr << rnum << "/" << scalefac << " vs. "
<< max_samples - stats->good << " ("
<< max_samples << " - " << stats->good << ")"
<< endl;
#endif
if (rnum < max_samples - stats->good) if (rnum < max_samples - stats->good)
{ {
stats->sample_cnt++; stats->sample_cnt++;
return true; ret = true;
break;
} }
} }
} }
return false;
} }
// boost::lock_guard<boost::mutex> sguard(stats->lock);
// abuse of lock for clean output to cerr
// cerr << stats->sample_cnt++;
return ret;
} }
template<typename Token> template<typename Token>
@ -713,6 +727,13 @@ namespace Moses {
worker:: worker::
operator()() operator()()
{ {
// things to do:
// - have each worker maintain their own pstats object and merge results at the end;
// - ensure the minimum size of samples considered by a non-locked counter that is only
// ever incremented -- who cares if we look at more samples than required, as long
// as we look at at least the minimum required
// This way, we can reduce the number of lock / unlock operations we need to do during
// sampling.
size_t s1=0, s2=0, e1=0, e2=0; size_t s1=0, s2=0, e1=0, e2=0;
uint64_t sid=0, offset=0; // of the source phrase uint64_t sid=0, offset=0; // of the source phrase
while(sptr<job> j = ag.get_job()) while(sptr<job> j = ag.get_job())
@ -812,6 +833,7 @@ namespace Moses {
sptr<TSA<Token> > const& r, size_t maxsmpl, bool isfwd) sptr<TSA<Token> > const& r, size_t maxsmpl, bool isfwd)
: rnd(0) : rnd(0)
, rnddenom(rnd.max() + 1.) , rnddenom(rnd.max() + 1.)
, min_diverse(10)
, workers(0) , workers(0)
, root(r) , root(r)
, next(m.lower_bound(-1)) , next(m.lower_bound(-1))
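
The acceptance test in `step()` above draws `rnum` uniformly from the number of candidates still to be seen and accepts when it falls below the number of samples still wanted. A minimal single-threaded sketch of that rule in isolation follows. It assumes the total number of candidates is known up front and that every accepted candidate counts as good; it leaves out the locking and the `min_diverse` condition added in this commit. `select_sample` is a hypothetical name, and `std::mt19937` stands in for `boost::taus88`.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Pick `wanted` positions out of `total` in a single pass.
// Each candidate is accepted with probability needed / remaining.
std::vector<std::size_t> select_sample(std::size_t total,
                                       std::size_t wanted,
                                       unsigned seed) {
  assert(wanted <= total);
  std::mt19937 rng(seed);
  std::vector<std::size_t> picked;
  for (std::size_t i = 0; i < total && picked.size() < wanted; ++i) {
    std::size_t remaining = total - i;             // candidates not yet seen
    std::size_t needed = wanted - picked.size();   // samples still missing
    // Same scaling as in step(): a uniform draw in [0, remaining).
    std::size_t rnum = static_cast<std::size_t>(
        remaining * (rng() / (rng.max() + 1.0)));
    if (rnum < needed) picked.push_back(i);
  }
  return picked;
}
```

Under these assumptions the loop returns exactly `wanted` positions, because a candidate is always accepted once `needed` equals `remaining`.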


@ -122,16 +122,16 @@ namespace Moses
if (m != param.end()) if (m != param.end())
withPbwd = m->second != "0"; withPbwd = m->second != "0";
m_default_sample_size = m != param.end() ? atoi(m->second.c_str()) : 1000;
m = param.find("workers"); m = param.find("workers");
m_workers = m != param.end() ? atoi(m->second.c_str()) : 8; m_workers = m != param.end() ? atoi(m->second.c_str()) : 8;
m_workers = min(m_workers,24UL); m_workers = min(m_workers,24UL);
m = param.find("limit");
if (m != param.end()) m_tableLimit = atoi(m->second.c_str());
m = param.find("cache-size"); m = param.find("cache-size");
m_history.reserve(m != param.end() m_history.reserve(m != param.end()?max(1000,atoi(m->second.c_str())):10000);
? max(1000,atoi(m->second.c_str())) // in plain language: cache size is at least 1000, and 10,000 by default
: 10000);
this->m_numScoreComponents = atoi(param["num-features"].c_str()); this->m_numScoreComponents = atoi(param["num-features"].c_str());
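
The `cache-size` handling above reads: at least 1000 entries, and 10,000 when the parameter is absent. A minimal standalone sketch of that rule is shown here; `cache_size` is a hypothetical free function, and the plain `std::map` stands in for the feature's parameter map.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <map>
#include <string>

// Resolve the cache size from a key/value parameter map.
std::size_t cache_size(const std::map<std::string, std::string>& param) {
  std::map<std::string, std::string>::const_iterator m =
      param.find("cache-size");
  if (m == param.end()) return 10000;  // default when the key is unset
  int requested = std::atoi(m->second.c_str());
  // Clamp from below so that small or unparsable values still give 1000.
  std::size_t size = static_cast<std::size_t>(std::max(1000, requested));
  assert(size >= 1000);
  return size;
}
```

Because `atoi` returns 0 for text it cannot parse, a malformed value ends up at the 1000 floor rather than at the 10,000 default.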
@ -196,8 +196,8 @@ namespace Moses
// currently always active by default; may (should) change later // currently always active by default; may (should) change later
num_feats = calc_lex.init(num_feats, bname + L1 + "-" + L2 + ".lex"); num_feats = calc_lex.init(num_feats, bname + L1 + "-" + L2 + ".lex");
if (this->m_numScoreComponents%2) // a bit of a hack, for backwards compatibility // if (this->m_numScoreComponents%2) // a bit of a hack, for backwards compatibility
num_feats = apply_pp.init(num_feats); // num_feats = apply_pp.init(num_feats);
if (num_feats < this->m_numScoreComponents) if (num_feats < this->m_numScoreComponents)
{ {
@ -283,8 +283,8 @@ namespace Moses
{ {
PhrasePair pp; PhrasePair pp;
pp.init(pid1, stats, this->m_numScoreComponents); pp.init(pid1, stats, this->m_numScoreComponents);
if (this->m_numScoreComponents%2) // if (this->m_numScoreComponents%2)
apply_pp(bt,pp); // apply_pp(bt,pp);
pstats::trg_map_t::const_iterator t; pstats::trg_map_t::const_iterator t;
for (t = stats.trg.begin(); t != stats.trg.end(); ++t) for (t = stats.trg.begin(); t != stats.trg.end(); ++t)
{ {
@ -318,8 +318,8 @@ namespace Moses
pp.init(pid1b, *statsb, this->m_numScoreComponents); pp.init(pid1b, *statsb, this->m_numScoreComponents);
else return false; // throw "no stats for pooling available!"; else return false; // throw "no stats for pooling available!";
if (this->m_numScoreComponents%2) // if (this->m_numScoreComponents%2)
apply_pp(bta,pp); // apply_pp(bta,pp);
pstats::trg_map_t::const_iterator b; pstats::trg_map_t::const_iterator b;
pstats::trg_map_t::iterator a; pstats::trg_map_t::iterator a;
if (statsb) if (statsb)
@ -368,6 +368,13 @@ namespace Moses
} }
else else
pp.update(a->first,a->second); pp.update(a->first,a->second);
#if 0
// jstats const& j = a->second;
cerr << bta.T1->pid2str(bta.V1.get(),pp.p1) << " ::: "
<< bta.T2->pid2str(bta.V2.get(),pp.p2) << endl;
cerr << pp.raw1 << " " << pp.sample1 << " " << pp.good1 << " "
<< pp.joint << " " << pp.raw2 << endl;
#endif
UTIL_THROW_IF2(pp.raw2 == 0, UTIL_THROW_IF2(pp.raw2 == 0,
"OOPS" "OOPS"
@ -376,12 +383,6 @@ namespace Moses
<< pp.raw1 << " " << pp.sample1 << " " << pp.raw1 << " " << pp.sample1 << " "
<< pp.good1 << " " << pp.joint << " " << pp.good1 << " " << pp.joint << " "
<< pp.raw2); << pp.raw2);
#if 0
jstats const& j = a->second;
cerr << bta.T1->pid2str(bta.V1.get(),pp.p1) << " ::: "
<< bta.T2->pid2str(bta.V2.get(),pp.p2) << endl;
cerr << j.rcnt() << " " << j.cnt2() << " " << j.wcnt() << endl;
#endif
calc_lex(bta,pp); calc_lex(bta,pp);
if (withPfwd) calc_pfwd_fix(bta,pp); if (withPfwd) calc_pfwd_fix(bta,pp);
if (withPbwd) calc_pbwd_fix(bta,pp); if (withPbwd) calc_pbwd_fix(bta,pp);
@ -415,8 +416,8 @@ namespace Moses
if (statsb) if (statsb)
{ {
pool.init(pid1b,*statsb,0); pool.init(pid1b,*statsb,0);
if (this->m_numScoreComponents%2) // if (this->m_numScoreComponents%2)
apply_pp(btb,ppdyn); // apply_pp(btb,ppdyn);
for (b = statsb->trg.begin(); b != statsb->trg.end(); ++b) for (b = statsb->trg.begin(); b != statsb->trg.end(); ++b)
{ {
ppdyn.update(b->first,b->second); ppdyn.update(b->first,b->second);
@ -456,8 +457,8 @@ namespace Moses
if (statsa) if (statsa)
{ {
pool.init(pid1a,*statsa,0); pool.init(pid1a,*statsa,0);
if (this->m_numScoreComponents%2) // if (this->m_numScoreComponents%2)
apply_pp(bta,ppfix); // apply_pp(bta,ppfix);
for (a = statsa->trg.begin(); a != statsa->trg.end(); ++a) for (a = statsa->trg.begin(); a != statsa->trg.end(); ++a)
{ {
if (!a->second.valid()) continue; // done above if (!a->second.valid()) continue; // done above
@ -662,7 +663,7 @@ namespace Moses
|| combine_pstats(src, mfix.getPid(),sfix.get(),btfix, || combine_pstats(src, mfix.getPid(),sfix.get(),btfix,
mdyn.getPid(),sdyn.get(),*dyn,ret)) mdyn.getPid(),sdyn.get(),*dyn,ret))
{ {
ret->NthElement(m_tableLimit); if (m_tableLimit) ret->Prune(true,m_tableLimit);
#if 0 #if 0
sort(ret->begin(), ret->end(), CompareTargetPhrase()); sort(ret->begin(), ret->end(), CompareTargetPhrase());
cout << "SOURCE PHRASE: " << src << endl; cout << "SOURCE PHRASE: " << src << endl;
@ -683,6 +684,14 @@ namespace Moses
return encache(ret); return encache(ret);
} }
size_t
Mmsapt::
SetTableLimit(size_t limit)
{
std::swap(m_tableLimit,limit);
return limit;
}
void void
Mmsapt:: Mmsapt::
CleanUpAfterSentenceProcessing(const InputType& source) CleanUpAfterSentenceProcessing(const InputType& source)
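
Note on `SetTableLimit`: the new setter swaps in the new limit and returns the previous one, so a caller can change the limit temporarily and restore it afterwards. The sketch below illustrates only that pattern. `LimitHolder` and `GetTableLimit` are hypothetical names invented for this example and are not part of Mmsapt.

```cpp
#include <algorithm>  // std::swap (pre-C++11)
#include <cassert>
#include <cstddef>
#include <utility>    // std::swap (C++11 and later)

// Hypothetical stand-in for a class that owns a table limit.
// It is NOT Mmsapt; it only mirrors the swap-and-return idiom.
class LimitHolder
{
public:
  explicit LimitHolder(std::size_t limit) : m_tableLimit(limit) {}

  // Installs the new limit and returns the one that was in effect before.
  std::size_t SetTableLimit(std::size_t limit) {
    std::swap(m_tableLimit, limit);
    return limit;
  }

  // Hypothetical accessor, added here only so the effect can be observed.
  std::size_t GetTableLimit() const {
    return m_tableLimit;
  }

private:
  std::size_t m_tableLimit;
};
```

A caller would keep the returned value and pass it back to `SetTableLimit` once the temporary setting is no longer needed.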

View File

@ -71,7 +71,7 @@ namespace Moses
PScorePfwd<Token> calc_pfwd_fix, calc_pfwd_dyn; PScorePfwd<Token> calc_pfwd_fix, calc_pfwd_dyn;
PScorePbwd<Token> calc_pbwd_fix, calc_pbwd_dyn; PScorePbwd<Token> calc_pbwd_fix, calc_pbwd_dyn;
PScoreLex<Token> calc_lex; // this one I'd like to see as an external ff eventually PScoreLex<Token> calc_lex; // this one I'd like to see as an external ff eventually
PScorePP<Token> apply_pp; // apply phrase penalty // PScorePP<Token> apply_pp; // apply phrase penalty
PScoreLogCounts<Token> add_logcounts_fix; PScoreLogCounts<Token> add_logcounts_fix;
PScoreLogCounts<Token> add_logcounts_dyn; PScoreLogCounts<Token> add_logcounts_dyn;
void init(string const& line); void init(string const& line);
@ -168,6 +168,9 @@ namespace Moses
void void
Load(); Load();
// returns the prior table limit
size_t SetTableLimit(size_t limit);
#ifndef NO_MOSES #ifndef NO_MOSES
TargetPhraseCollection const* TargetPhraseCollection const*
GetTargetPhraseCollectionLEGACY(const Phrase& src) const; GetTargetPhraseCollectionLEGACY(const Phrase& src) const;

View File

@ -413,11 +413,9 @@ void FuzzyMatchWrapper::load_corpus( const std::string &fileName, vector< vector
istream *fileStreamP = &fileStream; istream *fileStreamP = &fileStream;
char line[LINE_MAX_LENGTH]; string line;
while(true) { while(getline(*fileStreamP, line)) {
SAFE_GETLINE((*fileStreamP), line, LINE_MAX_LENGTH, '\n'); corpus.push_back( GetVocabulary().Tokenize( line.c_str() ) );
if (fileStreamP->eof()) break;
corpus.push_back( GetVocabulary().Tokenize( line ) );
} }
} }
@ -436,12 +434,9 @@ void FuzzyMatchWrapper::load_target(const std::string &fileName, vector< vector<
WORD_ID delimiter = GetVocabulary().StoreIfNew("|||"); WORD_ID delimiter = GetVocabulary().StoreIfNew("|||");
int lineNum = 0; int lineNum = 0;
char line[LINE_MAX_LENGTH]; string line;
while(true) { while(getline(*fileStreamP, line)) {
SAFE_GETLINE((*fileStreamP), line, LINE_MAX_LENGTH, '\n'); vector<WORD_ID> toks = GetVocabulary().Tokenize( line.c_str() );
if (fileStreamP->eof()) break;
vector<WORD_ID> toks = GetVocabulary().Tokenize( line );
corpus.push_back(vector< SentenceAlignment >()); corpus.push_back(vector< SentenceAlignment >());
vector< SentenceAlignment > &vec = corpus.back(); vector< SentenceAlignment > &vec = corpus.back();
@ -493,11 +488,8 @@ void FuzzyMatchWrapper::load_alignment(const std::string &fileName, vector< vect
string delimiter = "|||"; string delimiter = "|||";
int lineNum = 0; int lineNum = 0;
char line[LINE_MAX_LENGTH]; string line;
while(true) { while(getline(*fileStreamP, line)) {
SAFE_GETLINE((*fileStreamP), line, LINE_MAX_LENGTH, '\n');
if (fileStreamP->eof()) break;
vector< SentenceAlignment > &vec = corpus[lineNum]; vector< SentenceAlignment > &vec = corpus[lineNum];
size_t targetInd = 0; size_t targetInd = 0;
SentenceAlignment *sentence = &vec[targetInd]; SentenceAlignment *sentence = &vec[targetInd];

View File

@ -14,17 +14,16 @@ SuffixArray::SuffixArray( string fileName )
m_endOfSentence = m_vcb.StoreIfNew( "<s>" ); m_endOfSentence = m_vcb.StoreIfNew( "<s>" );
ifstream extractFile; ifstream extractFile;
char line[LINE_MAX_LENGTH];
// count the number of words first; // count the number of words first;
extractFile.open(fileName.c_str()); extractFile.open(fileName.c_str());
istream *fileP = &extractFile; istream *fileP = &extractFile;
m_size = 0; m_size = 0;
size_t sentenceCount = 0; size_t sentenceCount = 0;
while(!fileP->eof()) { string line;
SAFE_GETLINE((*fileP), line, LINE_MAX_LENGTH, '\n'); while(getline(*fileP, line)) {
if (fileP->eof()) break;
vector< WORD_ID > words = m_vcb.Tokenize( line ); vector< WORD_ID > words = m_vcb.Tokenize( line.c_str() );
m_size += words.size() + 1; m_size += words.size() + 1;
sentenceCount++; sentenceCount++;
} }
@ -43,10 +42,8 @@ SuffixArray::SuffixArray( string fileName )
int sentenceId = 0; int sentenceId = 0;
extractFile.open(fileName.c_str()); extractFile.open(fileName.c_str());
fileP = &extractFile; fileP = &extractFile;
while(!fileP->eof()) { while(getline(*fileP, line)) {
SAFE_GETLINE((*fileP), line, LINE_MAX_LENGTH, '\n'); vector< WORD_ID > words = m_vcb.Tokenize( line.c_str() );
if (fileP->eof()) break;
vector< WORD_ID > words = m_vcb.Tokenize( line );
// add to corpus vector // add to corpus vector
corpus.push_back(words); corpus.push_back(words);

View File

@ -17,20 +17,6 @@
namespace tmmt namespace tmmt
{ {
#define MAX_LENGTH 10000
#define SAFE_GETLINE(_IS, _LINE, _SIZE, _DELIM) { \
_IS.getline(_LINE, _SIZE, _DELIM); \
if(_IS.fail() && !_IS.bad() && !_IS.eof()) _IS.clear(); \
if (_IS.gcount() == _SIZE-1) { \
cerr << "Line too long! Buffer overflow. Delete lines >=" \
<< _SIZE << " chars or raise MAX_LENGTH in phrase-extract/tables-core.cpp" \
<< endl; \
exit(1); \
} \
}
typedef std::string WORD; typedef std::string WORD;
typedef unsigned int WORD_ID; typedef unsigned int WORD_ID;

View File

@ -2,9 +2,6 @@
#include "ExtractionPhrasePair.h" #include "ExtractionPhrasePair.h"
#include "tables-core.h" #include "tables-core.h"
#include "InputFileStream.h" #include "InputFileStream.h"
#include "SafeGetline.h"
#define TABLE_LINE_MAX_LENGTH 1000
using namespace std; using namespace std;
@ -16,12 +13,11 @@ void Domain::load( const std::string &domainFileName )
{ {
Moses::InputFileStream fileS( domainFileName ); Moses::InputFileStream fileS( domainFileName );
istream *fileP = &fileS; istream *fileP = &fileS;
while(true) {
char line[TABLE_LINE_MAX_LENGTH]; string line;
SAFE_GETLINE((*fileP), line, TABLE_LINE_MAX_LENGTH, '\n', __FILE__); while(getline(*fileP, line)) {
if (fileP->eof()) break;
// read // read
vector< string > domainSpecLine = tokenize( line ); vector< string > domainSpecLine = tokenize( line.c_str() );
int lineNumber; int lineNumber;
if (domainSpecLine.size() != 2 || if (domainSpecLine.size() != 2 ||
! sscanf(domainSpecLine[0].c_str(), "%d", &lineNumber)) { ! sscanf(domainSpecLine[0].c_str(), "%d", &lineNumber)) {

View File

@ -19,7 +19,6 @@
#include <sstream> #include <sstream>
#include "ExtractionPhrasePair.h" #include "ExtractionPhrasePair.h"
#include "SafeGetline.h"
#include "tables-core.h" #include "tables-core.h"
#include "score.h" #include "score.h"
#include "moses/Util.h" #include "moses/Util.h"

View File

@ -1,35 +0,0 @@
/***********************************************************************
Moses - factored phrase-based language decoder
Copyright (C) 2010 University of Edinburgh
This library is free software; you can redistribute it and/or
modify it under the terms of the GNU Lesser General Public
License as published by the Free Software Foundation; either
version 2.1 of the License, or (at your option) any later version.
This library is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
Lesser General Public License for more details.
You should have received a copy of the GNU Lesser General Public
License along with this library; if not, write to the Free Software
Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
***********************************************************************/
#pragma once
#ifndef SAFE_GETLINE_INCLUDED_
#define SAFE_GETLINE_INCLUDED_
#define SAFE_GETLINE(_IS, _LINE, _SIZE, _DELIM, _FILE) { \
_IS.getline(_LINE, _SIZE, _DELIM); \
if(_IS.fail() && !_IS.bad() && !_IS.eof()) _IS.clear(); \
if (_IS.gcount() == _SIZE-1) { \
cerr << "Line too long! Buffer overflow. Delete lines >=" \
<< _SIZE << " chars or raise LINE_MAX_LENGTH in " << _FILE \
<< endl; \
exit(1); \
} \
}
#endif
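
Note on the removed `SAFE_GETLINE` macro: throughout this commit the fixed-size `char` buffer plus explicit `eof()` check is replaced by `std::getline` on a `std::string`, used directly as the loop condition. A minimal sketch of that pattern follows. `readLines` is a hypothetical helper written for this note and does not exist in Moses.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not in Moses): collect every line of a stream.
std::vector<std::string> readLines(std::istream &in)
{
  std::vector<std::string> lines;
  std::string line;  // grows as needed, so there is no maximum line length
  // std::getline returns the stream, which tests false once nothing more
  // could be extracted, so no separate eof() check is required.
  while (std::getline(in, line)) {
    lines.push_back(line);
  }
  return lines;
}
```

With this loop shape a final line without a trailing newline is still delivered, and each line should be read exactly once per iteration, in the loop condition only.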

View File

@ -54,7 +54,11 @@ bool SentenceAlignment::processSourceSentence(const char * sourceString, int, bo
return true; return true;
} }
bool SentenceAlignment::create( char targetString[], char sourceString[], char alignmentString[], char weightString[], int sentenceID, bool boundaryRules) bool SentenceAlignment::create(const char targetString[],
const char sourceString[],
const char alignmentString[],
const char weightString[],
int sentenceID, bool boundaryRules)
{ {
using namespace std; using namespace std;
this->sentenceID = sentenceID; this->sentenceID = sentenceID;

View File

@ -43,8 +43,11 @@ public:
virtual bool processSourceSentence(const char *, int, bool boundaryRules); virtual bool processSourceSentence(const char *, int, bool boundaryRules);
bool create(char targetString[], char sourceString[], bool create(const char targetString[],
char alignmentString[], char weightString[], int sentenceID, bool boundaryRules); const char sourceString[],
const char alignmentString[],
const char weightString[],
int sentenceID, bool boundaryRules);
void invertAlignment(); void invertAlignment();

View File

@ -26,16 +26,9 @@
#include "InputFileStream.h" #include "InputFileStream.h"
#include "OutputFileStream.h" #include "OutputFileStream.h"
#include "SafeGetline.h"
#define LINE_MAX_LENGTH 10000
using namespace std; using namespace std;
char line[LINE_MAX_LENGTH]; vector< string > splitLine(const char *line)
vector< string > splitLine()
{ {
vector< string > item; vector< string > item;
int start=0; int start=0;
@ -62,13 +55,14 @@ bool getLine( istream &fileP, vector< string > &item )
if (fileP.eof()) if (fileP.eof())
return false; return false;
SAFE_GETLINE((fileP), line, LINE_MAX_LENGTH, '\n', __FILE__); string line;
if (fileP.eof()) if (getline(fileP, line)) {
item = splitLine(line.c_str());
return false; return true;
}
item = splitLine(); else {
return false;
return true; }
} }

View File

@ -26,12 +26,9 @@
#include <cstring> #include <cstring>
#include "tables-core.h" #include "tables-core.h"
#include "SafeGetline.h"
#include "InputFileStream.h" #include "InputFileStream.h"
#include "OutputFileStream.h" #include "OutputFileStream.h"
#define LINE_MAX_LENGTH 10000
using namespace std; using namespace std;
bool hierarchicalFlag = false; bool hierarchicalFlag = false;
@ -46,12 +43,11 @@ inline float maybeLogProb( float a )
return logProbFlag ? log(a) : a; return logProbFlag ? log(a) : a;
} }
char line[LINE_MAX_LENGTH];
void processFiles( char*, char*, char*, char* ); void processFiles( char*, char*, char*, char* );
void loadCountOfCounts( char* ); void loadCountOfCounts( char* );
void breakdownCoreAndSparse( string combined, string &core, string &sparse ); void breakdownCoreAndSparse( string combined, string &core, string &sparse );
bool getLine( istream &fileP, vector< string > &item ); bool getLine( istream &fileP, vector< string > &item );
vector< string > splitLine(); vector< string > splitLine(const char *line);
vector< int > countBin; vector< int > countBin;
bool sparseCountBinFeatureFlag = false; bool sparseCountBinFeatureFlag = false;
@ -140,14 +136,13 @@ void loadCountOfCounts( char* fileNameCountOfCounts )
istream &fileP = fileCountOfCounts; istream &fileP = fileCountOfCounts;
countOfCounts.push_back(0.0); countOfCounts.push_back(0.0);
while(1) {
if (fileP.eof()) break; string line;
SAFE_GETLINE((fileP), line, LINE_MAX_LENGTH, '\n', __FILE__); while (getline(fileP, line)) {
if (fileP.eof()) break;
if (totalCount < 0) if (totalCount < 0)
totalCount = atof(line); // total number of distinct phrase pairs totalCount = atof(line.c_str()); // total number of distinct phrase pairs
else else
countOfCounts.push_back( atof(line) ); countOfCounts.push_back( atof(line.c_str()) );
} }
fileCountOfCounts.Close(); fileCountOfCounts.Close();
@ -370,16 +365,16 @@ bool getLine( istream &fileP, vector< string > &item )
if (fileP.eof()) if (fileP.eof())
return false; return false;
SAFE_GETLINE((fileP), line, LINE_MAX_LENGTH, '\n', __FILE__); string line;
if (fileP.eof()) if (!getline(fileP, line))
return false; return false;
item = splitLine(); item = splitLine(line.c_str());
return true; return true;
} }
vector< string > splitLine() vector< string > splitLine(const char *line)
{ {
vector< string > item; vector< string > item;
int start=0; int start=0;

View File

@ -27,23 +27,19 @@
#include <cstring> #include <cstring>
#include "tables-core.h" #include "tables-core.h"
#include "SafeGetline.h"
#include "InputFileStream.h" #include "InputFileStream.h"
#define LINE_MAX_LENGTH 10000
using namespace std; using namespace std;
bool hierarchicalFlag = false; bool hierarchicalFlag = false;
bool onlyDirectFlag = false; bool onlyDirectFlag = false;
bool phraseCountFlag = true; bool phraseCountFlag = true;
bool logProbFlag = false; bool logProbFlag = false;
char line[LINE_MAX_LENGTH];
void processFiles( char*, char*, char* ); void processFiles( char*, char*, char* );
bool getLine( istream &fileP, vector< string > &item ); bool getLine( istream &fileP, vector< string > &item );
string reverseAlignment(const string &alignments); string reverseAlignment(const string &alignments);
vector< string > splitLine(); vector< string > splitLine(const char *line);
inline void Tokenize(std::vector<std::string> &output inline void Tokenize(std::vector<std::string> &output
, const std::string& str , const std::string& str
@ -191,16 +187,17 @@ bool getLine( istream &fileP, vector< string > &item )
if (fileP.eof()) if (fileP.eof())
return false; return false;
SAFE_GETLINE((fileP), line, LINE_MAX_LENGTH, '\n', __FILE__); string line;
if (fileP.eof()) if (getline(fileP, line)) {
item = splitLine(line.c_str());
return false; return true;
}
item = splitLine(); else {
return false;
return true; }
} }
vector< string > splitLine() vector< string > splitLine(const char *line)
{ {
vector< string > item; vector< string > item;
bool betweenWords = true; bool betweenWords = true;

View File

@ -19,7 +19,6 @@
#include <set> #include <set>
#include <vector> #include <vector>
#include "SafeGetline.h"
#include "SentenceAlignment.h" #include "SentenceAlignment.h"
#include "tables-core.h" #include "tables-core.h"
#include "InputFileStream.h" #include "InputFileStream.h"
@ -32,10 +31,6 @@ using namespace MosesTraining;
namespace MosesTraining namespace MosesTraining
{ {
const long int LINE_MAX_LENGTH = 500000 ;
// HPhraseVertex represents a point in the alignment matrix // HPhraseVertex represents a point in the alignment matrix
typedef pair <int, int> HPhraseVertex; typedef pair <int, int> HPhraseVertex;
@ -277,20 +272,18 @@ int main(int argc, char* argv[])
int i = sentenceOffset; int i = sentenceOffset;
while(true) { string englishString, foreignString, alignmentString, weightString;
while(getline(*eFileP, englishString)) {
i++; i++;
if (i%10000 == 0) cerr << "." << flush; if (i%10000 == 0) cerr << "." << flush;
char englishString[LINE_MAX_LENGTH];
char foreignString[LINE_MAX_LENGTH]; getline(*fFileP, foreignString);
char alignmentString[LINE_MAX_LENGTH]; getline(*aFileP, alignmentString);
char weightString[LINE_MAX_LENGTH];
SAFE_GETLINE((*eFileP), englishString, LINE_MAX_LENGTH, '\n', __FILE__);
if (eFileP->eof()) break;
SAFE_GETLINE((*fFileP), foreignString, LINE_MAX_LENGTH, '\n', __FILE__);
SAFE_GETLINE((*aFileP), alignmentString, LINE_MAX_LENGTH, '\n', __FILE__);
if (iwFileP) { if (iwFileP) {
SAFE_GETLINE((*iwFileP), weightString, LINE_MAX_LENGTH, '\n', __FILE__); getline(*iwFileP, weightString);
} }
SentenceAlignment sentence; SentenceAlignment sentence;
// cout << "read in: " << englishString << " & " << foreignString << " & " << alignmentString << endl; // cout << "read in: " << englishString << " & " << foreignString << " & " << alignmentString << endl;
//az: output src, tgt, and alignment line //az: output src, tgt, and alignment line
@ -300,7 +293,11 @@ int main(int argc, char* argv[])
cout << "LOG: ALT: " << alignmentString << endl; cout << "LOG: ALT: " << alignmentString << endl;
cout << "LOG: PHRASES_BEGIN:" << endl; cout << "LOG: PHRASES_BEGIN:" << endl;
} }
if (sentence.create( englishString, foreignString, alignmentString, weightString, i, false)) { if (sentence.create( englishString.c_str(),
foreignString.c_str(),
alignmentString.c_str(),
weightString.c_str(),
i, false)) {
if (options.placeholders.size()) { if (options.placeholders.size()) {
sentence.invertAlignment(); sentence.invertAlignment();
} }

View File

@ -19,7 +19,6 @@
#include <set> #include <set>
#include <vector> #include <vector>
#include "SafeGetline.h"
#include "SentenceAlignment.h" #include "SentenceAlignment.h"
#include "tables-core.h" #include "tables-core.h"
#include "InputFileStream.h" #include "InputFileStream.h"
@ -32,10 +31,6 @@ using namespace MosesTraining;
namespace MosesTraining namespace MosesTraining
{ {
const long int LINE_MAX_LENGTH = 500000 ;
// HPhraseVertex represents a point in the alignment matrix // HPhraseVertex represents a point in the alignment matrix
typedef pair <int, int> HPhraseVertex; typedef pair <int, int> HPhraseVertex;
@ -246,20 +241,20 @@ int main(int argc, char* argv[])
int i = sentenceOffset; int i = sentenceOffset;
while(true) { string englishString, foreignString, alignmentString, weightString;
while(getline(*eFileP, englishString)) {
i++; i++;
if (i%10000 == 0) cerr << "." << flush;
char englishString[LINE_MAX_LENGTH];
char foreignString[LINE_MAX_LENGTH]; getline(*fFileP, foreignString);
char alignmentString[LINE_MAX_LENGTH]; getline(*aFileP, alignmentString);
char weightString[LINE_MAX_LENGTH];
SAFE_GETLINE((*eFileP), englishString, LINE_MAX_LENGTH, '\n', __FILE__);
if (eFileP->eof()) break;
SAFE_GETLINE((*fFileP), foreignString, LINE_MAX_LENGTH, '\n', __FILE__);
SAFE_GETLINE((*aFileP), alignmentString, LINE_MAX_LENGTH, '\n', __FILE__);
if (iwFileP) { if (iwFileP) {
SAFE_GETLINE((*iwFileP), weightString, LINE_MAX_LENGTH, '\n', __FILE__); getline(*iwFileP, weightString);
} }
if (i%10000 == 0) cerr << "." << flush;
SentenceAlignment sentence; SentenceAlignment sentence;
// cout << "read in: " << englishString << " & " << foreignString << " & " << alignmentString << endl; // cout << "read in: " << englishString << " & " << foreignString << " & " << alignmentString << endl;
//az: output src, tgt, and alignment line //az: output src, tgt, and alignment line
@ -269,7 +264,7 @@ int main(int argc, char* argv[])
cout << "LOG: ALT: " << alignmentString << endl; cout << "LOG: ALT: " << alignmentString << endl;
cout << "LOG: PHRASES_BEGIN:" << endl; cout << "LOG: PHRASES_BEGIN:" << endl;
} }
if (sentence.create( englishString, foreignString, alignmentString, weightString, i, false)) { if (sentence.create( englishString.c_str(), foreignString.c_str(), alignmentString.c_str(), weightString.c_str(), i, false)) {
ExtractTask *task = new ExtractTask(i-1, sentence, options, extractFileOrientation); ExtractTask *task = new ExtractTask(i-1, sentence, options, extractFileOrientation);
task->Run(); task->Run();
delete task; delete task;

View File

@ -39,7 +39,6 @@
#include "Hole.h" #include "Hole.h"
#include "HoleCollection.h" #include "HoleCollection.h"
#include "RuleExist.h" #include "RuleExist.h"
#include "SafeGetline.h"
#include "SentenceAlignmentWithSyntax.h" #include "SentenceAlignmentWithSyntax.h"
#include "SyntaxTree.h" #include "SyntaxTree.h"
#include "tables-core.h" #include "tables-core.h"
@ -47,8 +46,6 @@
#include "InputFileStream.h" #include "InputFileStream.h"
#include "OutputFileStream.h" #include "OutputFileStream.h"
#define LINE_MAX_LENGTH 500000
using namespace std; using namespace std;
using namespace MosesTraining; using namespace MosesTraining;
@ -326,17 +323,15 @@ int main(int argc, char* argv[])
// loop through all sentence pairs // loop through all sentence pairs
size_t i=sentenceOffset; size_t i=sentenceOffset;
while(true) { string targetString, sourceString, alignmentString;
i++;
if (i%1000 == 0) cerr << i << " " << flush;
char targetString[LINE_MAX_LENGTH]; while(getline(*tFileP, targetString)) {
char sourceString[LINE_MAX_LENGTH]; i++;
char alignmentString[LINE_MAX_LENGTH];
SAFE_GETLINE((*tFileP), targetString, LINE_MAX_LENGTH, '\n', __FILE__); getline(*sFileP, sourceString);
if (tFileP->eof()) break; getline(*aFileP, alignmentString);
SAFE_GETLINE((*sFileP), sourceString, LINE_MAX_LENGTH, '\n', __FILE__);
SAFE_GETLINE((*aFileP), alignmentString, LINE_MAX_LENGTH, '\n', __FILE__); if (i%1000 == 0) cerr << i << " " << flush;
SentenceAlignmentWithSyntax sentence SentenceAlignmentWithSyntax sentence
(targetLabelCollection, sourceLabelCollection, (targetLabelCollection, sourceLabelCollection,
@ -349,7 +344,7 @@ int main(int argc, char* argv[])
cout << "LOG: PHRASES_BEGIN:" << endl; cout << "LOG: PHRASES_BEGIN:" << endl;
} }
if (sentence.create(targetString, sourceString, alignmentString,"", i, options.boundaryRules)) { if (sentence.create(targetString.c_str(), sourceString.c_str(), alignmentString.c_str(),"", i, options.boundaryRules)) {
if (options.unknownWordLabelFlag) { if (options.unknownWordLabelFlag) {
collectWordLabelCounts(sentence); collectWordLabelCounts(sentence);
} }

View File

@ -20,8 +20,6 @@
***********************************************************************/ ***********************************************************************/
#include "relax-parse.h" #include "relax-parse.h"
#include "SafeGetline.h"
#include "tables-core.h" #include "tables-core.h"
using namespace std; using namespace std;
@ -33,17 +31,13 @@ int main(int argc, char* argv[])
// loop through all sentences // loop through all sentences
int i=0; int i=0;
char inBuffer[LINE_MAX_LENGTH]; string inBuffer;
while(true) { while(getline(cin, inBuffer)) {
i++; i++;
if (i%1000 == 0) cerr << "." << flush; if (i%1000 == 0) cerr << "." << flush;
if (i%10000 == 0) cerr << ":" << flush; if (i%10000 == 0) cerr << ":" << flush;
if (i%100000 == 0) cerr << "!" << flush; if (i%100000 == 0) cerr << "!" << flush;
// get line from stdin
SAFE_GETLINE( cin, inBuffer, LINE_MAX_LENGTH, '\n', __FILE__);
if (cin.eof()) break;
// process into syntax tree representation // process into syntax tree representation
string inBufferString = string( inBuffer ); string inBufferString = string( inBuffer );
set< string > labelCollection; // set of labels, not used set< string > labelCollection; // set of labels, not used

View File

@ -29,7 +29,6 @@
#include <vector> #include <vector>
#include <algorithm> #include <algorithm>
#include "SafeGetline.h"
#include "ScoreFeature.h" #include "ScoreFeature.h"
#include "tables-core.h" #include "tables-core.h"
#include "ExtractionPhrasePair.h" #include "ExtractionPhrasePair.h"
@ -40,8 +39,6 @@
using namespace std; using namespace std;
using namespace MosesTraining; using namespace MosesTraining;
#define LINE_MAX_LENGTH 100000
namespace MosesTraining namespace MosesTraining
{ {
LexicalTable lexTable; LexicalTable lexTable;
@ -232,7 +229,7 @@ int main(int argc, char* argv[])
} }
// loop through all extracted phrase translations // loop through all extracted phrase translations
char line[LINE_MAX_LENGTH], lastLine[LINE_MAX_LENGTH]; string line, lastLine;
lastLine[0] = '\0'; lastLine[0] = '\0';
ExtractionPhrasePair *phrasePair = NULL; ExtractionPhrasePair *phrasePair = NULL;
std::vector< ExtractionPhrasePair* > phrasePairsWithSameSource; std::vector< ExtractionPhrasePair* > phrasePairsWithSameSource;
@ -245,8 +242,8 @@ int main(int argc, char* argv[])
float tmpCount=0.0f, tmpPcfgSum=0.0f; float tmpCount=0.0f, tmpPcfgSum=0.0f;
int i=0; int i=0;
SAFE_GETLINE( (extractFileP), line, LINE_MAX_LENGTH, '\n', __FILE__ ); // TODO why read only the 1st line?
if ( !extractFileP.eof() ) { if ( getline(extractFileP, line)) {
++i; ++i;
tmpPhraseSource = new PHRASE(); tmpPhraseSource = new PHRASE();
tmpPhraseTarget = new PHRASE(); tmpPhraseTarget = new PHRASE();
@ -265,23 +262,21 @@ int main(int argc, char* argv[])
if ( hierarchicalFlag ) { if ( hierarchicalFlag ) {
phrasePairsWithSameSourceAndTarget.push_back( phrasePair ); phrasePairsWithSameSourceAndTarget.push_back( phrasePair );
} }
strcpy( lastLine, line ); lastLine = line;
SAFE_GETLINE( (extractFileP), line, LINE_MAX_LENGTH, '\n', __FILE__ );
} }
while ( !extractFileP.eof() ) { while ( getline(extractFileP, line) ) {
if ( ++i % 100000 == 0 ) { if ( ++i % 100000 == 0 ) {
std::cerr << "." << std::flush; std::cerr << "." << std::flush;
} }
// identical to last line? just add count // identical to last line? just add count
if (strcmp(line,lastLine) == 0) { if (line == lastLine) {
phrasePair->IncrementPrevious(tmpCount,tmpPcfgSum); phrasePair->IncrementPrevious(tmpCount,tmpPcfgSum);
SAFE_GETLINE((extractFileP), line, LINE_MAX_LENGTH, '\n', __FILE__);
continue; continue;
} else { } else {
strcpy( lastLine, line ); lastLine = line;
} }
tmpPhraseSource = new PHRASE(); tmpPhraseSource = new PHRASE();
@ -359,8 +354,6 @@ int main(int argc, char* argv[])
} }
} }
SAFE_GETLINE((extractFileP), line, LINE_MAX_LENGTH, '\n', __FILE__);
} }
processPhrasePairs( phrasePairsWithSameSource, *phraseTableFile, featureManager, maybeLogProb ); processPhrasePairs( phrasePairsWithSameSource, *phraseTableFile, featureManager, maybeLogProb );
@ -750,11 +743,9 @@ void loadFunctionWords( const string &fileName )
} }
istream *inFileP = &inFile; istream *inFileP = &inFile;
char line[LINE_MAX_LENGTH]; string line;
while(true) { while(getline(*inFileP, line)) {
SAFE_GETLINE((*inFileP), line, LINE_MAX_LENGTH, '\n', __FILE__); std::vector<string> token = tokenize( line.c_str() );
if (inFileP->eof()) break;
std::vector<string> token = tokenize( line );
if (token.size() > 0) if (token.size() > 0)
functionWordList.insert( token[0] ); functionWordList.insert( token[0] );
} }
@ -799,16 +790,13 @@ void LexicalTable::load( const string &fileName )
} }
istream *inFileP = &inFile; istream *inFileP = &inFile;
char line[LINE_MAX_LENGTH]; string line;
int i=0; int i=0;
while(true) { while(getline(*inFileP, line)) {
i++; i++;
if (i%100000 == 0) std::cerr << "." << flush; if (i%100000 == 0) std::cerr << "." << flush;
SAFE_GETLINE((*inFileP), line, LINE_MAX_LENGTH, '\n', __FILE__);
if (inFileP->eof()) break;
std::vector<string> token = tokenize( line ); std::vector<string> token = tokenize( line.c_str() );
if (token.size() != 3) { if (token.size() != 3) {
std::cerr << "line " << i << " in " << fileName std::cerr << "line " << i << " in " << fileName
<< " has wrong number of tokens, skipping:" << std::endl << " has wrong number of tokens, skipping:" << std::endl

View File

@ -12,15 +12,12 @@
#include <time.h> #include <time.h>
#include "AlignmentPhrase.h" #include "AlignmentPhrase.h"
#include "SafeGetline.h"
#include "tables-core.h" #include "tables-core.h"
#include "InputFileStream.h" #include "InputFileStream.h"
using namespace std; using namespace std;
using namespace MosesTraining; using namespace MosesTraining;
#define LINE_MAX_LENGTH 10000
namespace MosesTraining namespace MosesTraining
{ {
@ -31,7 +28,7 @@ public:
vector< vector<size_t> > alignedToE; vector< vector<size_t> > alignedToE;
vector< vector<size_t> > alignedToF; vector< vector<size_t> > alignedToF;
bool create( char*, int ); bool create( const char*, int );
void clear(); void clear();
bool equals( const PhraseAlignment& ); bool equals( const PhraseAlignment& );
}; };
@ -106,16 +103,14 @@ int main(int argc, char* argv[])
vector< PhraseAlignment > phrasePairsWithSameF; vector< PhraseAlignment > phrasePairsWithSameF;
int i=0; int i=0;
int fileCount = 0; int fileCount = 0;
while(true) {
string line;
while(getline(extractFileP, line)) {
if (extractFileP.eof()) break; if (extractFileP.eof()) break;
if (++i % 100000 == 0) cerr << "." << flush; if (++i % 100000 == 0) cerr << "." << flush;
char line[LINE_MAX_LENGTH];
SAFE_GETLINE((extractFileP), line, LINE_MAX_LENGTH, '\n', __FILE__);
// if (fileCount>0)
if (extractFileP.eof())
break;
PhraseAlignment phrasePair; PhraseAlignment phrasePair;
bool isPhrasePair = phrasePair.create( line, i ); bool isPhrasePair = phrasePair.create( line.c_str(), i );
if (lastForeign >= 0 && lastForeign != phrasePair.foreign) { if (lastForeign >= 0 && lastForeign != phrasePair.foreign) {
processPhrasePairs( phrasePairsWithSameF ); processPhrasePairs( phrasePairsWithSameF );
for(size_t j=0; j<phrasePairsWithSameF.size(); j++) for(size_t j=0; j<phrasePairsWithSameF.size(); j++)
@ -124,7 +119,7 @@ int main(int argc, char* argv[])
phraseTableE.clear(); phraseTableE.clear();
phraseTableF.clear(); phraseTableF.clear();
phrasePair.clear(); // process line again, since phrase tables flushed phrasePair.clear(); // process line again, since phrase tables flushed
phrasePair.create( line, i ); phrasePair.create( line.c_str(), i );
phrasePairBase = 0; phrasePairBase = 0;
} }
lastForeign = phrasePair.foreign; lastForeign = phrasePair.foreign;
@ -242,7 +237,7 @@ void processPhrasePairs( vector< PhraseAlignment > &phrasePair )
} }
} }
bool PhraseAlignment::create( char line[], int lineID ) bool PhraseAlignment::create(const char line[], int lineID )
{ {
vector< string > token = tokenize( line ); vector< string > token = tokenize( line );
int item = 1; int item = 1;
@ -321,16 +316,14 @@ void LexicalTable::load( const string &filePath )
} }
istream *inFileP = &inFile; istream *inFileP = &inFile;
char line[LINE_MAX_LENGTH]; string line;
int i=0; int i=0;
while(true) { while(getline(*inFileP, line)) {
i++; i++;
if (i%100000 == 0) cerr << "." << flush; if (i%100000 == 0) cerr << "." << flush;
SAFE_GETLINE((*inFileP), line, LINE_MAX_LENGTH, '\n', __FILE__);
if (inFileP->eof()) break;
vector<string> token = tokenize( line ); vector<string> token = tokenize( line.c_str() );
if (token.size() != 3) { if (token.size() != 3) {
cerr << "line " << i << " in " << filePath << " has wrong number of tokens, skipping:\n" << cerr << "line " << i << " in " << filePath << " has wrong number of tokens, skipping:\n" <<
token.size() << " " << token[0] << " " << line << endl; token.size() << " " << token[0] << " " << line << endl;

View File

@ -0,0 +1,188 @@
#!/usr/bin/python
# -*- coding: utf-8 -*-
# Author: Rico Sennrich
# takes a file in the CoNLL dependency format (from the CoNLL-X shared task on dependency parsing; http://ilk.uvt.nl/conll/#dataformat )
# and produces Moses XML format. Note that the structure is built based on fields 9 and 10 (projective HEAD and RELATION),
# which not all parsers produce.
# usage: conll2mosesxml.py [--brackets] [--no_preterminals] < input_file > output_file
from __future__ import print_function, unicode_literals
import sys
import re
import codecs
from collections import namedtuple,defaultdict
from lxml import etree as ET
Word = namedtuple('Word', ['pos','word','lemma','tag','head','func', 'proj_head', 'proj_func'])
def main(output_format='xml'):
    sentence = []
    for line in sys.stdin:
        # process sentence
        if line == "\n":
            sentence.insert(0,[])
            if is_projective(sentence):
                write(sentence,output_format)
            else:
                sys.stderr.write(' '.join(w.word for w in sentence[1:]) + '\n')
                sys.stdout.write('\n')
            sentence = []
            continue
        try:
            pos, word, lemma, tag, tag2, morph, head, func, proj_head, proj_func = line.split()
        except ValueError: # word may be unicode whitespace
            # split on non-empty runs of spaces/tabs only; a pattern that can match
            # the empty string splits between every character on Python 3.7+
            pos, word, lemma, tag, tag2, morph, head, func, proj_head, proj_func = re.split('[ \t]+',line.strip())
        word = escape_special_chars(word)
        lemma = escape_special_chars(lemma)
        if proj_head == '_':
            proj_head = head
            proj_func = func
        sentence.append(Word(int(pos), word, lemma, tag2,int(head), func, int(proj_head), proj_func))
# this script performs the same escaping as escape-special-chars.perl in Moses.
# most of it is done in function write(), but quotation marks need to be processed first
def escape_special_chars(line):
    line = line.replace('\'','&apos;') # xml
    line = line.replace('"','&quot;') # xml
    return line
# make a check if structure is projective
def is_projective(sentence):
    dominates = defaultdict(set)
    for i,w in enumerate(sentence):
        dominates[i].add(i)
        if not i:
            continue
        head = int(w.proj_head)
        while head != 0:
            if i in dominates[head]:
                break
            dominates[head].add(i)
            head = int(sentence[head].proj_head)
    for i in dominates:
        dependents = dominates[i]
        if max(dependents) - min(dependents) != len(dependents)-1:
            sys.stderr.write("error: non-projective structure.\n")
            return False
    return True
def write(sentence, output_format='xml'):
    if output_format == 'xml':
        tree = create_subtree(0,sentence)
        out = ET.tostring(tree, encoding = 'UTF-8').decode('UTF-8')
    if output_format == 'brackets':
        out = create_brackets(0,sentence)
    out = out.replace('|','&#124;') # factor separator
    out = out.replace('[','&#91;') # syntax non-terminal
    out = out.replace(']','&#93;') # syntax non-terminal
    out = out.replace('&amp;apos;','&apos;') # lxml is buggy if input is escaped
    out = out.replace('&amp;quot;','&quot;') # lxml is buggy if input is escaped
    print(out)
# write node in Moses XML format
def create_subtree(position, sentence):
    element = ET.Element('tree')
    if position:
        element.set('label', sentence[position].proj_func)
    else:
        element.set('label', 'sent')
    for i in range(1,position):
        if sentence[i].proj_head == position:
            element.append(create_subtree(i, sentence))
    if position:
        if preterminals:
            head = ET.Element('tree')
            head.set('label', sentence[position].tag)
            head.text = sentence[position].word
            element.append(head)
        else:
            if len(element):
                element[-1].tail = sentence[position].word
            else:
                element.text = sentence[position].word
    for i in range(position, len(sentence)):
        if i and sentence[i].proj_head == position:
            element.append(create_subtree(i, sentence))
    return element
# write node in bracket format (Penn treebank style)
def create_brackets(position, sentence):
    if position:
        element = "( " + sentence[position].proj_func + ' '
    else:
        element = "( sent "
    for i in range(1,position):
        if sentence[i].proj_head == position:
            element += create_brackets(i, sentence)
    if position:
        word = sentence[position].word
        if word == ')':
            word = 'RBR'
        elif word == '(':
            word = 'LBR'
        tag = sentence[position].tag
        if tag == '$(':
            tag = '$BR'
        if preterminals:
            element += '( ' + tag + ' ' + word + ' ) '
        else:
            element += word + ' ) '
    for i in range(position, len(sentence)):
        if i and sentence[i].proj_head == position:
            element += create_brackets(i, sentence)
    if preterminals or not position:
        element += ') '
    return element
if __name__ == '__main__':
    if sys.version_info < (3,0,0):
        sys.stdin = codecs.getreader('UTF-8')(sys.stdin)
        sys.stdout = codecs.getwriter('UTF-8')(sys.stdout)
        sys.stderr = codecs.getwriter('UTF-8')(sys.stderr)
    if '--no_preterminals' in sys.argv:
        preterminals = False
    else:
        preterminals = True
    if '--brackets' in sys.argv:
        main('brackets')
    else:
        main('xml')
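
Note on the projectivity check in conll2mosesxml.py: the script only converts sentences in which every head dominates a contiguous span of token positions. The sketch below isolates that idea so it can be tried without the rest of the script. It is an illustration under assumptions, not code from this commit: the name `spans_are_contiguous` and the plain list-of-heads input are hypothetical simplifications of the `Word` tuples the script uses.

```python
from collections import defaultdict

def spans_are_contiguous(heads):
    """Return True if the dependency structure is projective.

    heads[i] is the 1-based position of the head of token i, with 0
    meaning "attached to the root". heads[0] is an unused placeholder so
    that list indices line up with token positions (hypothetical input
    layout, chosen for this sketch only).
    """
    dominates = defaultdict(set)
    for i in range(1, len(heads)):
        dominates[i].add(i)              # every token dominates itself
        head = heads[i]
        while head != 0:                 # walk up towards the root
            if i in dominates[head]:     # already recorded; also guards against cycles
                break
            dominates[head].add(i)
            head = heads[head]
    # A set of positions is contiguous when its width equals its size minus one.
    return all(max(d) - min(d) == len(d) - 1 for d in dominates.values())

# Token 2 is the root; tokens 1 and 3 attach to it.
print(spans_are_contiguous([0, 2, 0, 2]))
# Token 2 attaches to token 4 across token 3, which token 4 does not dominate.
print(spans_are_contiguous([0, 3, 4, 0, 3]))
```

In the second call, token 4 dominates positions 2 and 4 but not 3, so its span has a gap and the structure is rejected, which is the case the script reports as "non-projective structure".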