The code uses two mechanisms for generating random numbers: srand()/rand(),
which is not thread-safe, and srandom()/random(), which is POSIX-specific.
Here I add a util/random.cc module that centralizes these calls, and unifies
some common usage patterns. If the implementation is not good enough, we can
now change it in a single place.
To keep things simple, this uses the portable srand()/rand() but protects them
with a lock to avoid concurrency problems.
The hard part was to keep the regression tests passing: they rely on fixed
sequences of random numbers, so a small code change could break them very
thoroughly. Util::rand(), for wide types like size_t, calls std::rand() not
once but twice. This behaviour was generalized into utils::wide_rand() and
friends.
Windows does not have popen()/pclose(), so FileHandler.cpp #define's them to
_popen()/_pclose(). But MinGW has similar macros built into <cstdio>, leading
to warnings. So skip the workaround on MinGW.
The duplicate definition works fine in environments where the inline
definition becomes a weak symbol in the object file, but if it gets
generated as a regular definition, the duplicate definition causes link
problems.
In most call sites the return value could easily be made const, which
gives both the reader and the compiler a bit more certainty about the code's
intentions. In theory this may help performance, but it's mainly for clarity.
The comments are based on reverse-engineering, and the unit tests are based
on the comments. It's possible that some of what's in there is not essential,
in which case, don't feel bad about changing it!
I left a third identical definition in place, though I updated it with my
changes to avoid creeping divergence, and noted the duplication in a comment.
It would be nice to get rid of this definition as well, but it'd introduce
headers from the main Moses tree into biconcor, which may be against policy.
I'm adding these because boost::filesystem::unique_path introduces
encoding issues: on Windows the path is in wchar_t, breaking use of
those strings in various places! Encoding the strings is just too
much work.
It's still possible that the current temp_file implementation won't
build on Windows (it uses POSIX mkstemp() and close()) but that can
be fixed underneath the API.
The MmapAllocator header made use of sys/mman.h and mmap(), which are
Unix-specific. But util has a wrapper which also works on Windows.
This also fixes the error handling: when mmap() failed, the old code would
return an invalid (but non-NULL!) pointer — leading to a crash. The wrapper
will throw an exception with a helpful error message.
The bcopy() function is POSIX-specific and deprecated. The recommended
replacement (at least for non-overlapping source and destination ranges)
is memcpy(), which is in the standard C library.
Note that the source and destination parameters are in a different order
between these two functions.
Compiling with clang++ at the default warning/error levels produces
some interesting warnings. Here's a pair of fixes for the simplest
instances:
moses/TranslationModel/RuleTable/PhraseDictionaryFuzzyMatch.cpp:133:7:
warning: comparison of array 'path' equal to a null pointer is always
false [-Wtautological-pointer-compare]
if (path == NULL) {
^~~~ ~~~~
(The code unnecessarily checks that an automatic variable has a
non-null address).
moses/TranslationModel/DynSAInclude/onlineRLM.h:305:20:
warning: unsequenced modification and access to 'den_val' [-Wunsequenced]
if(((den_val = query(&ngram[len - num_fnd], num_fnd - 1)) > 0) &&
^
(The code tries to cram too much into an "if" condition.)
- Bug fixes and conceptual improvements in biased sampling. The sampling now
tries to stick to the bias, even when an unsuitable corpus dominates
the occurrences.