locale.getpreferredencoding() was buggy in OS X for Python <2.7.
Since we no longer support Python <2.7, we no longer need this
workaround.
This essentially reverts 4a8b821a69fb.
Otherwise tolocal() and fromlocal() wouldn't work on Python 3. tolocal() still
can't make a valid localstr object because localstr inherits from str, but it
can return some object without raising exceptions.
Since Py3 bytes() behaves much more like bytearray() than like Py2 str(), we
can't simply do s/str/bytes/g. I have no good idea how to handle the str/bytes
divergence.
Python 3 doesn't have xrange(). Instead, range() on Python 3
produces its values lazily, like xrange() does on Python 2.
The benefits of xrange() over range() are when there are very
large ranges that are too expensive to pre-allocate. The code
here is only creating <128 values, so the benefits of xrange()
should be negligible.
With this patch, encoding.py imports safely on Python 3.
unichr() doesn't exist in Python 3. chr() is the equivalent there.
Unfortunately, we can't use chr() outright because on Python 2 it only
accepts values smaller than 256.
Also, Python 3 returns an int when accessing a character of a
bytes type (s[x]). So, we have to ord() the values in the assert
statement.
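A minimal sketch of the two differences described above. The helper name
firstbyte() is hypothetical and only used for illustration; the actual patch
may be structured differently.

```python
import sys

# Python 3 has no unichr(); chr() covers the full Unicode range there.
if sys.version_info[0] >= 3:
    unichr = chr

def firstbyte(s):
    """Return the integer value of the first byte of a bytes object.

    Indexing bytes gives an int on Python 3 but a 1-character str on
    Python 2, so ord() is only applied when it is needed.
    """
    c = s[0]
    return c if isinstance(c, int) else ord(c)
```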
This is necessary for hgweb to embed JSON data in HTML. It must be possible
to embed JSON data in a non-UTF-8 HTML page so long as the page encoding is
compatible with ASCII.
According to RFC 7159, a non-BMP character is represented as a UTF-16
surrogate pair. This function first splits an input string into an array of
UTF-16 code units.
https://tools.ietf.org/html/rfc7159.html#section-7
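To illustrate the surrogate pair splitting, here is a rough Python 3 sketch.
The names utf16units() and utf16escape() are made up for this example, and it
leaves out the quoting of '"', '\' and control characters that a real JSON
escaper needs.

```python
def utf16units(u):
    """Split a text string into a list of UTF-16 code units (ints)."""
    raw = u.encode('utf-16-be', 'surrogatepass')
    return [(raw[i] << 8) | raw[i + 1] for i in range(0, len(raw), 2)]

def utf16escape(u):
    """Emit each code unit from 0x7f upward as a \\uXXXX escape.

    Code units below 0x7f are passed through unchanged; this sketch does
    not quote '"', '\\' or control characters.
    """
    return ''.join(chr(c) if c < 0x7f else '\\u%04x' % c
                   for c in utf16units(u))
```

For example, U+1F600 is split into the pair d83d/de00, so
utf16escape('a\U0001f600') gives 'a\ud83d\ude00' as plain ASCII text.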
This makes jsonescape() a thread-safe function, which is necessary for hgweb.
The initialization stuff isn't that slow:
$ python -m timeit -n1000 -s 'from mercurial import encoding as x' 'reload(x)'
original: 1000 loops, best of 3: 158 usec per loop
this patch: 1000 loops, best of 3: 214 usec per loop
compared to loading the commands module:
$ python -m timeit -n1000 -s 'from mercurial import commands as x' 'reload(x)'
1000 loops, best of 3: 1.11 msec per loop
This is slightly faster and makes it convenient to implement paranoid escaping.
$ python -m timeit \
-s 'from mercurial import encoding; data = str(bytearray(xrange(128)))' \
'encoding.jsonescape(data)'
original: 100000 loops, best of 3: 15.1 usec per loop
this patch: 100000 loops, best of 3: 13.7 usec per loop
RFC 7159 does not state that U+007F must be escaped, but it is widely
considered a control character. As '\x7f' is invisible on a terminal, and
Python's json.dumps() escapes '\x7f', let's do the same.
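The json.dumps() behavior referred to above can be checked directly (shown
here on Python 3 with the default ensure_ascii=True):

```python
import json

# DEL (0x7f) is emitted as a \u escape, just like the C0 control characters.
escaped_del = json.dumps('\x7f')
escaped_us = json.dumps('\x1f')
```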
Default builds of Python have a Unicode type that isn't actually full
Unicode but UTF-16, which encodes non-BMP codepoints to a pair of BMP
codepoints with surrogate escaping. Since our UTF-8b hack escaping
uses a plane that overlaps with the UTF-16 escaping system, this gets
extra complicated. In addition, unichr() for codepoints greater than
U+FFFF may not work either.
This changes the code to reuse getutf8char to walk the byte string, so we
only rely on Python for unpacking our U+DCxx characters.
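Walking a byte string one UTF-8 character at a time could look roughly like
the Python 3 sketch below. This is not the getutf8char implementation; the
names utf8charlen() and iterutf8chars() are hypothetical, and 'surrogatepass'
is used so that encoded U+DCxx characters are accepted.

```python
def utf8charlen(lead):
    """Return the sequence length implied by a UTF-8 lead byte (an int)."""
    if lead < 0x80:
        return 1
    if 0xc0 <= lead < 0xe0:
        return 2
    if 0xe0 <= lead < 0xf0:
        return 3
    if 0xf0 <= lead < 0xf8:
        return 4
    raise ValueError('invalid UTF-8 lead byte: 0x%02x' % lead)

def iterutf8chars(s):
    """Yield each character of the bytes object s as its own bytes slice."""
    pos = 0
    while pos < len(s):
        n = utf8charlen(s[pos])
        c = s[pos:pos + n]
        # Validate the slice; raises UnicodeDecodeError if it is malformed
        # or truncated. 'surrogatepass' lets encoded U+DCxx through.
        c.decode('utf-8', 'surrogatepass')
        yield c
        pos += n
```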
In Python 2, '\u' == '\\u'. However, in Python 3, '\u' results in:
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in
position 0-1: truncated \uXXXX escape
The minor change in this patch allows Python 3 to ast parse encoding.py.
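The difference can be reproduced on Python 3 with ast.parse(). In this small
sketch, parses() is a helper written for the example only.

```python
import ast

bad = "x = '\\u'"     # source text seen by the parser: x = '\u'
good = "x = '\\\\u'"  # source text seen by the parser: x = '\\u'

def parses(source):
    """Return True if source can be parsed, False on SyntaxError."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False
```

On Python 3, parses(bad) is False and parses(good) is True.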
This is the final missing piece in fully round-tripping random byte
strings through UTF-8b. While this issue means that UTF-8 <-> UTF-8b
isn't fully bijective, we don't expect to ever see U+DCxx codepoints
in "real" UTF-8 data, so it should remain bijective in practice.
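As an analogue only (this is not the encoding.py code), Python 3's
'surrogateescape' error handler applies the same idea of mapping undecodable
bytes to U+DCxx code points, which makes the round trip easy to try out. The
function names below are invented for the example.

```python
def escapebytes(raw):
    """Decode bytes as UTF-8, mapping each invalid byte 0xNN to U+DCNN."""
    return raw.decode('utf-8', 'surrogateescape')

def unescapebytes(text):
    """Reverse escapebytes(), turning U+DCNN back into the byte 0xNN."""
    return text.encode('utf-8', 'surrogateescape')
```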
Python 2.6 introduced the "except type as instance" syntax, replacing
the "except type, instance" syntax that came before. Python 3 dropped
support for the latter syntax. Since we no longer support Python 2.4 or
2.5, we have no need to continue supporting the "except type, instance".
This patch mass rewrites the exception syntax to be Python 2.6+ and
Python 3 compatible.
This patch was produced by running `2to3 -f except -w -n .`.
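For reference, the rewritten form looks like this. describe() is a made-up
function for the example.

```python
def describe(text):
    """Report whether text parses as an integer."""
    try:
        return 'ok: %d' % int(text)
    except ValueError as inst:  # valid on Python 2.6+ and Python 3
        # The old spelling, "except ValueError, inst:", is a SyntaxError
        # on Python 3.
        return 'error: %s' % inst
```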
For C code we don't want to pay the cost of calling into a Python function for
the common case of ASCII filenames. However, while on most POSIX platforms we
normalize filenames by lowercasing them, on Windows we uppercase them. We
define an enum here indicating the direction in which filenames should be
normalized. Some platforms (notably Cygwin) have more complicated
normalization behavior -- we add a case for that too.
In upcoming patches we'll also define a fallback function that is called if the
string has non-ASCII bytes.
This enum will be replicated in the C code to make foldmaps. There's
unfortunately no nice way to avoid that -- we can't have encoding import
parsers because of import cycles. One way might be to have parsers import
encoding, but accessing Python modules from C code is just awkward.
The name 'normcasespecs' was chosen to indicate that this is merely an integer
that specifies a behavior, not a function. The name was pluralized since in
upcoming patches we'll introduce 'normcasespec' which will be one of these
values.
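A sketch of what such an enum and its use might look like in Python 3. The
integer values and the helper asciinormcase() are assumptions made for this
example; only the name 'normcasespecs' comes from the description above.

```python
class normcasespecs(object):
    # The concrete values are illustrative; they only need to be distinct
    # integers that C code can mirror.
    lower = -1  # fold ASCII filenames to lowercase (most POSIX platforms)
    upper = 1   # fold ASCII filenames to uppercase (Windows)
    other = 0   # more complicated behavior (e.g. Cygwin)

def asciinormcase(name, spec, fallback):
    """Normalize the bytes filename 'name' according to 'spec'.

    'fallback' is called for non-ASCII names and for normcasespecs.other.
    """
    try:
        name.decode('ascii')
    except UnicodeDecodeError:
        return fallback(name)
    if spec == normcasespecs.lower:
        return name.lower()
    if spec == normcasespecs.upper:
        return name.upper()
    return fallback(name)
```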
According to Apple Technote 1150 (unavailable from Apple as far as I
can tell, but archived in several places online), HFS+ ignores sixteen
specific unicode runes when doing path normalization. We need to
handle those cases, so this function lets us efficiently strip the
offending characters from a UTF-8 encoded string (which is the only
way it seems to matter on OS X.)
39fbe33f95fa brought "asciilower" and "import parsers" into
"encoding.py".
This works fine with the C implementation of the "parsers" module, but
not with the pure Python implementation, because the latter causes the
cyclic dependency below and aborts execution:
util => i18n => encoding => parsers => util
This patch delays importing the "parsers" module until it is really
needed, to avoid the cyclic dependency around "parsers" in pure Python
builds.
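The general shape of a delayed import is sketched below. The standard
'codecs' module stands in for "parsers" so the example is runnable on its
own, and lazylower() is a hypothetical name.

```python
def lazylower(s):
    """Lowercase an ASCII bytes string, importing the helper on first use."""
    # Importing inside the function defers the import until the first call,
    # so merely importing the enclosing module cannot start an import cycle.
    # 'codecs' is a stand-in for the real "parsers" module.
    import codecs
    codecs.decode(s, 'ascii')  # raises UnicodeDecodeError on non-ASCII input
    return s.lower()
```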
This benefits, among other things, the case collision auditor.
On a Linux system with a large real-world repo where all filenames are ASCII,
hg perfcca:
before: wall 0.260157 comb 0.270000 user 0.230000 sys 0.040000 (best of 38)
after: wall 0.164616 comb 0.160000 user 0.160000 sys 0.000000 (best of 54)
We need a way to efficiently lowercase ASCII strings. For example, 'hg status'
needs to build up the fold map -- a map from a canonical case (for OS X,
lowercase) to the actual case of each file and directory in the dirstate.
The current way we do that is to try decoding to ASCII and then calling
lower() on the string, labeled 'orig' below:
str.decode('ascii')
return str.lower()
This is pretty inefficient, and it turns out we can do much better.
I also tested out a condition-based approach, labeled 'cond' below:
(c >= 'A' && c <= 'Z') ? (c + ('a' - 'A')) : c
'cond' turned out to be slower in all cases. A 256-byte lookup table with
invalid values for everything past 127 performed similarly, but this was less
verbose.
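The lookup-table idea is sketched here in Python 3 to show the logic only.
The measured implementation is C, and this Python version says nothing about
its performance. The names are chosen for the example.

```python
# 128-entry table mapping each ASCII byte to its lowercase form.
_LOWERTABLE = bytes(c + 32 if 0x41 <= c <= 0x5a else c for c in range(128))

def tablelower(s):
    """Lowercase a bytes string; raise ValueError on any non-ASCII byte."""
    out = bytearray(len(s))
    for i, c in enumerate(s):
        if c > 127:
            raise ValueError('non-ASCII byte 0x%02x at offset %d' % (c, i))
        out[i] = _LOWERTABLE[c]
    return bytes(out)
```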
On OS X 10.9 with LLVM version 6.0 (clang-600.0.51), the asciilower function
was run against two corpuses.
Corpus 1 (list of files from real-world repo, > 100k files):
orig: wall 0.428567 comb 0.430000 user 0.430000 sys 0.000000 (best of 24)
cond: wall 0.077204 comb 0.070000 user 0.070000 sys 0.000000 (best of 100)
lookup: wall 0.060714 comb 0.060000 user 0.060000 sys 0.000000 (best of 100)
Corpus 2 (mozilla-central, 113k files):
orig: wall 0.238406 comb 0.240000 user 0.240000 sys 0.000000 (best of 42)
cond: wall 0.040779 comb 0.040000 user 0.040000 sys 0.000000 (best of 100)
lookup: wall 0.037623 comb 0.040000 user 0.040000 sys 0.000000 (best of 100)
On a Linux server-class machine with GCC 4.4.6 20120305 (Red Hat 4.4.6-4):
Corpus 1 (real-world repo, > 100k files):
orig: wall 0.260899 comb 0.260000 user 0.260000 sys 0.000000 (best of 38)
cond: wall 0.054818 comb 0.060000 user 0.060000 sys 0.000000 (best of 100)
lookup: wall 0.048489 comb 0.050000 user 0.050000 sys 0.000000 (best of 100)
Corpus 2 (mozilla-central, 113k files):
orig: wall 0.153082 comb 0.150000 user 0.150000 sys 0.000000 (best of 65)
cond: wall 0.031007 comb 0.040000 user 0.040000 sys 0.000000 (best of 100)
lookup: wall 0.028793 comb 0.030000 user 0.030000 sys 0.000000 (best of 100)
SSE instructions might help even more, but I didn't experiment with those.
The newly added 'trim' correctly trims a string containing multi-byte
characters to at most the specified number of columns: direct slicing of
a byte sequence should be replaced with 'encoding.trim', because slicing
may split the string in the middle of a multi-byte sequence.
Slicing a unicode sequence ('uslice') and concatenation with an ellipsis
('concat') are defined as functions, to make enhancement in a subsequent
patch easier.
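A simplified Python 3 sketch of column-aware trimming follows. It is not the
'encoding.trim' implementation: the names, the signature and the use of
unicodedata.east_asian_width() to estimate column width are assumptions, and
ellipsis handling is left out.

```python
import unicodedata

def ucolwidth(u):
    """Return the number of terminal columns u occupies (wide chars use 2)."""
    return sum(2 if unicodedata.east_asian_width(c) in ('W', 'F') else 1
               for c in u)

def trimcolumns(s, width, encoding='utf-8'):
    """Trim the encoded bytes s to at most 'width' columns.

    The cut always falls on a character boundary, unlike s[:width].
    """
    u = s.decode(encoding)
    while ucolwidth(u) > width:
        u = u[:-1]
    return u.encode(encoding)
```

For three wide characters encoded in UTF-8 (9 bytes, 6 columns), s[:4] would
cut the second character in half, while trimcolumns(s, 4) keeps exactly two
whole characters.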
This copies the performance hack from encoding.lower (e7a5733d533f).
The case-folding logic that kicks in on case-insensitive filesystems
hits encoding.upper hard: with a repository with 75k files, the
timings went from
hg perfstatus
! wall 3.156000 comb 3.156250 user 1.625000 sys 1.531250 (best of 3)
to
hg perfstatus
! wall 2.390000 comb 2.390625 user 1.078125 sys 1.312500 (best of 5)
This is a 24% decrease. For comparison, Mercurial 2.0 gives:
hg perfstatus
! wall 2.172000 comb 2.171875 user 0.984375 sys 1.187500 (best of 5)
so we're only 10% slower than before we added the extra case-folding
logic.
The same decrease is seen when executing 'hg status' as normal, where
we go from:
hg status --time
time: real 4.322 secs (user 2.219+0.000 sys 2.094+0.000)
to
hg status --time
time: real 3.307 secs (user 1.750+0.000 sys 1.547+0.000)
When calling encode on a str, the string is first decoded using the
default encoding and then encoded. So
s.encode('ascii') == s.decode().encode('ascii')
We don't care about the encode step here -- we're just after the
UnicodeDecodeError raised by decode if it finds a non-ASCII character.
Calling decode directly is also marginally faster since it saves the
construction of the extra str object.
If the default python encoding was changed from ascii, the attempt to
encode as ascii before lower() could throw a UnicodeEncodeError.
Catch UnicodeError instead to prevent an unhandled exception.
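The resulting fast path can be sketched as follows on Python 3, where byte
strings only have decode(). fastlower() and its 'encoding' parameter are
hypothetical. UnicodeError is the common base class of UnicodeDecodeError
and UnicodeEncodeError, so catching it covers both cases.

```python
def fastlower(s, encoding='utf-8'):
    """Lowercase the bytes s, using a cheap path for pure-ASCII input."""
    try:
        s.decode('ascii')  # only used to detect non-ASCII bytes
    except UnicodeError:
        # Slow path: go through a text string in the given encoding.
        return s.decode(encoding).lower().encode(encoding)
    return s.lower()
```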