Commit Graph

90 Commits

Author SHA1 Message Date
Yuya Nishihara
0284e70ec0 py3: use 'surrogatepass' error handler to process U+DCxx transparently
It's disallowed by default on Python 3.

https://docs.python.org/3/library/codecs.html#error-handlers
2017-09-16 22:55:48 +09:00
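
The handler named in the commit above can be tried in isolation. This is a minimal sketch assuming Python 3 and the standard `utf-8` codec; the variable names are illustrative and not taken from Mercurial.

```python
# U+DC80 is a lone low surrogate. Strict UTF-8 encoding rejects it on Python 3.
lone = '\udc80'

try:
    lone.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# 'surrogatepass' lets the surrogate through as an ordinary 3-byte sequence,
# and the same handler is needed to decode it again.
encoded = lone.encode('utf-8', 'surrogatepass')
decoded = encoded.decode('utf-8', 'surrogatepass')
```

Here `strict_ok` ends up false, while the `surrogatepass` round trip returns the original one-character string.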
Yuya Nishihara
f726056d04 py3: wrap bytes in encoding.from/toutf8b() with bytestr 2017-09-03 15:54:29 +09:00
Augie Fackler
7fd201a134 encoding: ensure getutf8char always returns a bytestr, never an int 2017-09-15 19:43:32 -04:00
Yuya Nishihara
dcc07e5503 doctest: use print_function and convert bytes to unicode where needed 2017-09-03 14:56:31 +09:00
Yuya Nishihara
21687d1395 doctest: do not embed non-ascii characters in docstring
Since the outer docstring is parsed as a unicode string on Python 3, we have to
either double-escape or construct the non-ASCII string from an ASCII string.
2017-09-03 15:47:17 +09:00
Yuya Nishihara
c7d6fe4d91 doctest: pass encoding name as system string 2017-09-03 15:42:27 +09:00
Yuya Nishihara
a71f259bd2 doctest: bulk-replace string literals with b'' for Python 3
Our code transformer can't rewrite string literals in docstrings, and I
don't want to make the transformer more complex.
2017-09-03 14:32:11 +09:00
Yuya Nishihara
a5ae36fcb1 encoding: add fast path of from/toutf8b() for ASCII strings
See the previous patch for why.

The added test does not seem to make much sense, because ASCII strings never
contain "\xed" and are always valid UTF-8.

  (with mercurial repo)
  $ export HGRCPATH=/dev/null HGPLAIN=
  $ hg log --time --config experimental.stabilization=all -Tjson > /dev/null

  (original)
  time: real 6.830 secs (user 6.740+0.000 sys 0.080+0.000)
  time: real 6.690 secs (user 6.650+0.000 sys 0.040+0.000)
  time: real 6.700 secs (user 6.640+0.000 sys 0.060+0.000)

  (fast jsonescape)
  time: real 5.630 secs (user 5.550+0.000 sys 0.070+0.000)
  time: real 5.700 secs (user 5.650+0.000 sys 0.050+0.000)
  time: real 5.690 secs (user 5.640+0.000 sys 0.050+0.000)

  (this patch)
  time: real 5.190 secs (user 5.120+0.000 sys 0.070+0.000)
  time: real 5.230 secs (user 5.170+0.000 sys 0.050+0.000)
  time: real 5.220 secs (user 5.150+0.000 sys 0.070+0.000)
2017-04-23 13:08:58 +09:00
Yuya Nishihara
42ccee312b encoding: add fast path of from/tolocal() for ASCII strings
This is a micro-optimization, but it seems worthwhile since to/fromlocal() is
called many times and isasciistr() is cheap and simple.

We boldly assume that any non-ASCII characters have at least one 8-bit byte.
This isn't true for some email character sets (e.g. ISO-2022-JP and UTF-7),
but I believe no such encodings are used as a platform default. Shift_JIS,
a major crap, is okay as it should have a leading byte in 0x80-0xff range.

  (with mercurial repo)
  $ export HGRCPATH=/dev/null HGPLAIN=
  $ hg log --time --config experimental.stabilization=all > /dev/null

  (original)
  time: real 7.460 secs (user 7.420+0.000 sys 0.030+0.000)
  time: real 7.670 secs (user 7.590+0.000 sys 0.080+0.000)
  time: real 7.560 secs (user 7.510+0.000 sys 0.040+0.000)

  (this patch)
  time: real 7.340 secs (user 7.260+0.000 sys 0.060+0.000)
  time: real 7.260 secs (user 7.210+0.000 sys 0.030+0.000)
  time: real 7.310 secs (user 7.260+0.000 sys 0.060+0.000)
2017-04-23 13:06:23 +09:00
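
The assumption described in the commit above can be checked with a small sketch. It assumes Python 3; `isasciistr` here is a pure-Python stand-in for the function the commit mentions, and `tolocal_sketch` with its `local_encoding` parameter is hypothetical, not Mercurial's actual `tolocal()`.

```python
def isasciistr(s):
    """Return True if the byte string contains only 7-bit bytes."""
    try:
        s.decode('ascii')
        return True
    except UnicodeDecodeError:
        return False


def tolocal_sketch(s, local_encoding='shift_jis'):
    """Convert UTF-8 bytes to a local encoding, skipping work for ASCII."""
    if isasciistr(s):
        # Fast path: assumed to be identical in UTF-8 and the local encoding.
        return s
    return s.decode('utf-8').encode(local_encoding)


text = '\u65e5\u672c'                 # two non-ASCII (Japanese) characters
iso2022 = text.encode('iso2022_jp')   # 7-bit encoding: escape sequences only
sjis = text.encode('shift_jis')       # lead bytes fall in the 0x80-0xff range
```

ISO-2022-JP output passes the ASCII test even though it encodes non-ASCII text, which is why the fast path would be unsafe there; Shift_JIS output does not pass it, so the fast path is never taken by mistake.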
Yuya Nishihara
a22ffac20b encoding: add function to test if a str consists of ASCII characters
Most strings are ASCII. Let's optimize for it.

Using uint64_t is slightly faster than uint32_t on a 64-bit system, but the
difference isn't huge.
2017-04-23 12:59:42 +09:00
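
The word-at-a-time idea behind the commit above can be sketched in Python. The C code itself is not shown in this log, so this only illustrates the masking technique under that assumption; the function name is hypothetical, and in Python this is not expected to be faster than a plain loop.

```python
import struct

# One bit set at the top of each of the 8 bytes in a 64-bit word.
_HIGH_BITS = 0x8080808080808080


def isascii_wordwise(s):
    """Test 8 bytes per step by masking the high bit of every byte."""
    whole = len(s) - len(s) % 8
    for offset in range(0, whole, 8):
        (word,) = struct.unpack_from('<Q', s, offset)
        if word & _HIGH_BITS:
            return False
    # Check the remaining tail (fewer than 8 bytes) one byte at a time.
    return all(b < 0x80 for b in s[whole:])
```

A byte is non-ASCII exactly when its top bit is set, so a single AND against the mask covers eight bytes at once.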
Yuya Nishihara
569f77ac30 encoding: add fast path of jsonescape() (issue5533)
This isn't highly optimized as it copies characters one by one, but it seems
reasonably simple and fast enough.

  (with mercurial repo)
  $ export HGRCPATH=/dev/null HGPLAIN=
  $ hg log --time --config experimental.stabilization=all -Tjson > /dev/null

  (original)
  time: real 6.830 secs (user 6.740+0.000 sys 0.080+0.000)
  time: real 6.690 secs (user 6.650+0.000 sys 0.040+0.000)
  time: real 6.700 secs (user 6.640+0.000 sys 0.060+0.000)

  (this patch)
  time: real 5.630 secs (user 5.550+0.000 sys 0.070+0.000)
  time: real 5.700 secs (user 5.650+0.000 sys 0.050+0.000)
  time: real 5.690 secs (user 5.640+0.000 sys 0.050+0.000)
2017-04-23 14:47:52 +09:00
Yuya Nishihara
961b46e864 encoding: extract stub for fast JSON escape
This moves JSON character maps to pure/charencode.py because they will be
used only when the fast-path fails.
2017-04-23 16:10:51 +09:00
Yuya Nishihara
9837558150 py3: make encoding.strio() an identity function on Python 2
It's the convention the other encoding.str*() functions follow. To make things
simple, this also drops kwargs from the strio() constructor.
2017-08-16 13:50:11 +09:00
Augie Fackler
a80f148d0c py3: introduce a wrapper for __builtins__.{raw_,}input()
In order to make this work, we have to wrap the io streams in a
TextIOWrapper so that __builtins__.input() can do unicode IO on Python
3. We can't just restore the original (unicode) sys.std* because we
might be running a cmdserver, and if we blindly restore sys.* to the
original values then we end up breaking the cmdserver. Sadly,
TextIOWrapper tries to close the underlying stream during its __del__,
so we have to make a subclass to prevent that.

If you see errors like:

TypeError: a bytes-like object is required, not 'str'

on an input() or print() call on Python 3, the substitution of
sys.std* is probably the root cause.

A previous version of this change tried to put the bytesinput() method
in pycompat - it turns out we need to do some encoding handling, so we
have to be in a higher layer that's allowed to use
mercurial.encoding.encoding. As a result, this is in util for now,
with the TextIOWrapper subclass hiding in encoding.py. I'm not sure of
a better place for the time being.

Differential Revision: https://phab.mercurial-scm.org/D299
2017-07-24 14:38:40 -04:00
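
The TextIOWrapper workaround described in the commit above can be sketched as follows. This assumes CPython 3; the class name is hypothetical and the real subclass in encoding.py may differ in detail.

```python
import gc
import io


class NonClosingTextIO(io.TextIOWrapper):
    """TextIOWrapper that leaves the wrapped stream open when collected."""

    def __del__(self):
        # Intentionally empty: the inherited finalizer would close the
        # underlying buffer, which must stay usable after the wrapper dies.
        pass


raw = io.BytesIO(b'yes\n')
wrapper = NonClosingTextIO(raw, encoding='utf-8')
line = wrapper.readline()   # unicode IO on top of a bytes stream

del wrapper
gc.collect()
still_open = not raw.closed
```

After the wrapper is collected, `raw` can still be read or written, which is the property a long-running command server would need for its own streams.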
Yuya Nishihara
3b5c5b1b96 py3: change encoding.localstr to a subclass of bytes, not str 2017-08-14 15:50:40 +09:00
Yuya Nishihara
854edbfe8b encoding: drop circular import by proxying through '<policy>.charencode'
I decided not to split charencode.c into a new C extension module because it
would duplicate binary code unnecessarily.
2017-07-31 23:13:47 +09:00
Yuya Nishihara
0c45446525 py3: add utility to forward __str__() to __bytes__()
It calls unifromlocal() instead of sysstr() because __bytes__() may contain
locale-dependent values such as paths.
2017-06-24 13:48:04 +09:00
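
The forwarding utility from the commit above might look roughly like this. It is a sketch assuming Python 3: the helper name, the `LOCAL_ENCODING` constant, and the example class are hypothetical, and a plain `decode()` stands in for `unifromlocal()`.

```python
LOCAL_ENCODING = 'utf-8'  # hypothetical stand-in for the local encoding


def forward_str_to_bytes(bytes_method):
    """Build a __str__ that decodes whatever __bytes__ returns."""
    def __str__(self):
        return bytes_method(self).decode(LOCAL_ENCODING)
    return __str__


class RepoPath(object):
    def __init__(self, raw):
        self._raw = raw

    def __bytes__(self):
        return self._raw

    # Reuse the bytes representation instead of writing a second method.
    __str__ = forward_str_to_bytes(__bytes__)


path = RepoPath('donn\u00e9es'.encode('utf-8'))
```

Decoding with the local encoding rather than ASCII matters because the bytes may hold locale-dependent values such as paths, as the commit message notes.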
Yuya Nishihara
97a0e41896 encoding: make sure "wide" variable never be referenced from other modules
Better to not expose (maybe-) unicode objects.
2017-05-29 21:57:51 +09:00
Augie Fackler
0d4ce4ae60 encoding: make wide character class list a sysstr
That's what east_asian_width returns, so just match it.
2017-05-28 13:27:29 -04:00
Yuya Nishihara
4563e16232 parsers: switch to policy importer
# no-check-commit
2016-08-13 12:23:56 +09:00
Yuya Nishihara
34b243ac82 encoding: use i.startswith() instead of i[0] to eliminate py2/3 divergence 2017-05-16 23:36:38 +09:00
Martin von Zweigbergk
c3406ac3db cleanup: use set literals
We no longer support Python 2.6, so we can now use set literals.
2017-02-10 16:56:29 -08:00
Gregory Szorc
4c11e54e6e encoding: remove workaround for locale.getpreferredencoding()
locale.getpreferredencoding() was buggy in OS X for Python <2.7.
Since we no longer support Python <2.7, we no longer need this
workaround.

This essentially reverts 4a8b821a69fb.
2017-05-13 11:20:51 -07:00
Yuya Nishihara
e4989d80e7 check-code: ignore re-exports of os.environ in encoding.py
These are valid uses of os.environ.
2017-05-01 17:23:48 +09:00
Pulkit Goyal
ea140325b7 py3: use pycompat.bytechr instead of chr 2017-05-03 15:37:51 +05:30
Yuya Nishihara
236d81fcc8 pycompat: introduce identity function as a compat stub
I was sometimes too lazy to use 'str' instead of 'lambda a: a'. Let's add
a named function for that purpose.
2017-03-29 21:13:55 +09:00
Yuya Nishihara
af7f25fdb3 encoding: add converter between native str and byte string
This kind of encoding conversion is unavoidable on Python 3.
2017-03-13 09:12:56 -07:00
Yuya Nishihara
dcade16cf7 encoding: factor out unicode variants of from/tolocal()
Unfortunately, these functions will be commonly used on Python 3.
2017-03-13 09:11:08 -07:00
Pulkit Goyal
ab21746511 py3: make sure encoding.encoding is a bytes variable
encoding.encoding is a unicode string when locale.getpreferredencoding() is used
to get the preferred encoding. This patch fixes that.
2016-12-17 23:55:25 +05:30
Yuya Nishihara
52ffc6a5bd py3: provide encoding.environ which is a dict of bytes
This can't be moved to pycompat.py since we need encoding.tolocal() to
build a bytes dict from the unicode os.environ.
2016-09-28 20:05:34 +09:00
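
Building a bytes dict from a unicode environment, as the commit above describes, can be sketched like this. It assumes Python 3; the function name is hypothetical, and a plain `encode()` with a fixed encoding stands in for `encoding.tolocal()`.

```python
def bytes_environ(env, encoding='utf-8'):
    """Return a copy of a str-keyed mapping with bytes keys and values."""
    return {
        key.encode(encoding, 'surrogateescape'):
            value.encode(encoding, 'surrogateescape')
        for key, value in env.items()
    }


# A fixed sample mapping is used here; os.environ could be passed instead.
sample = bytes_environ({'HGPLAIN': '', 'LANG': 'ja_JP.UTF-8'})
```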
Yuya Nishihara
d71e06adf5 py3: convert encoding name and mode to str
Otherwise tolocal() and fromlocal() wouldn't work on Python 3. Still, tolocal()
can't make a valid localstr object because localstr inherits from str, but it
can return some object without raising exceptions.

Since Py3 bytes() behaves much more like bytearray() than like str() of Py2, we
can't simply do s/str/bytes/g. I have no good idea how to handle the str/bytes
divergence.
2016-09-28 20:39:06 +09:00
Yuya Nishihara
7c03e0d6ba pycompat: provide 'ispy3' constant
We compare version_info in several places, which seems enough to justify
defining a constant.
2016-09-28 20:01:23 +09:00
Gregory Szorc
fcadf5c68a encoding: use range() instead of xrange()
Python 3 doesn't have xrange(). Instead, range() on Python 3
is lazy, like xrange() is on Python 2.

The benefits of xrange() over range() are when there are very
large ranges that are too expensive to pre-allocate. The code
here is only creating <128 values, so the benefits of xrange()
should be negligible.

With this patch, encoding.py imports safely on Python 3.
2016-03-11 21:27:26 -08:00
Gregory Szorc
10727db849 encoding: make HFS+ ignore code Python 3 compatible
unichr() doesn't exist in Python 3. chr() is the equivalent there.
Unfortunately, we can't use chr() outright because Python 2 only
accepts values smaller than 256.

Also, Python 3 returns an int when accessing a character of a
bytes type (s[x]). So, we have to ord() the values in the assert
statement.
2016-03-11 21:23:34 -08:00
Yuya Nishihara
8e19f2cb3f encoding: backport paranoid escaping from templatefilters.jsonescape()
This was introduced by e1e8de66f2e1. It is required to embed JSON data in
an HTML page. Convince yourself here:

http://escape.alf.nu/1
2015-12-27 19:58:11 +09:00
Yuya Nishihara
89ded2c8a7 encoding: add option to escape non-ascii characters in JSON
This is necessary for hgweb to embed JSON data in HTML. It must be possible
to embed JSON data in a non-UTF-8 HTML page as long as the page encoding is
compatible with ASCII.

According to RFC 7159, a non-BMP character is represented as a UTF-16 surrogate
pair. This function first splits an input string into an array of UTF-16
code units.

https://tools.ietf.org/html/rfc7159.html#section-7
2015-12-27 19:28:34 +09:00
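
The surrogate-pair rule cited in the commit above can be worked through in a short sketch. It assumes Python 3; the function name is hypothetical and this is not the code from the commit.

```python
import json


def utf16_escape(codepoint):
    """Return the JSON \\uXXXX escape(s) for one Unicode code point."""
    if codepoint < 0x10000:
        return '\\u%04x' % codepoint
    # Non-BMP: split the 20 remaining bits into a high and a low surrogate.
    offset = codepoint - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return '\\u%04x\\u%04x' % (high, low)


bmp = utf16_escape(0x00E9)       # a BMP character needs one escape
nonbmp = utf16_escape(0x1F600)   # a non-BMP character needs a pair
```

The result is pure ASCII, so it can sit inside a page in any ASCII-compatible encoding, and the standard `json` module produces the same escapes with its default settings.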
Yuya Nishihara
2b5d7daa88 encoding: initialize jsonmap when module is loaded
This makes jsonescape() a thread-safe function, which is necessary for hgweb.
The initialization stuff isn't that slow:

  $ python -m timeit -n1000 -s 'from mercurial import encoding as x' 'reload(x)'
  original:   1000 loops, best of 3: 158 usec per loop
  this patch: 1000 loops, best of 3: 214 usec per loop

compared to loading the commands module:

  $ python -m timeit -n1000 -s 'from mercurial import commands as x' 'reload(x)'
  1000 loops, best of 3: 1.11 msec per loop
2016-01-30 19:48:35 +09:00
Yuya Nishihara
edbbf3796b encoding: change jsonmap to a list indexed by code point
This is slightly faster and makes it convenient to implement paranoid escaping.

  $ python -m timeit \
  -s 'from mercurial import encoding; data = str(bytearray(xrange(128)))' \
  'encoding.jsonescape(data)'

  original:   100000 loops, best of 3: 15.1 usec per loop
  this patch: 100000 loops, best of 3: 13.7 usec per loop
2016-01-30 19:41:34 +09:00
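
A table indexed by code point, as in the commit above, could be built along these lines. This is a sketch assuming Python 3 and ASCII-only input; all names are hypothetical, and the choice of `<` and `>` as the extra "paranoid" characters is an assumption, since this log does not list them.

```python
import json

_NAMED = {'\b': '\\b', '\t': '\\t', '\n': '\\n', '\f': '\\f',
          '\r': '\\r', '"': '\\"', '\\': '\\\\'}


def build_jsonmap(paranoid):
    """Return a 128-entry list: table[code point] -> escaped text."""
    table = []
    for cp in range(128):
        ch = chr(cp)
        if ch in _NAMED:
            table.append(_NAMED[ch])
        elif cp < 0x20 or cp == 0x7F:      # control characters and DEL
            table.append('\\u%04x' % cp)
        elif paranoid and ch in '<>':      # assumed HTML-sensitive set
            table.append('\\u%04x' % cp)
        else:
            table.append(ch)
    return table


# Built once at import time, so lookups need no lazy initialization.
JSONMAP = build_jsonmap(False)
PARANOID_JSONMAP = build_jsonmap(True)


def jsonescape_ascii(s, paranoid=False):
    table = PARANOID_JSONMAP if paranoid else JSONMAP
    return ''.join(table[ord(c)] for c in s)
```

Indexing a list by `ord(c)` avoids hashing, and keeping two prebuilt tables makes the paranoid mode a matter of picking a different list.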
Yuya Nishihara
6727c29486 encoding: escape U+007F (DEL) character in JSON
RFC 7159 does not state that U+007F must be escaped, but it is widely
considered a control character. As '\x7f' is invisible on a terminal, and
Python's json.dumps() escapes '\x7f', let's do the same.
2016-01-16 18:30:01 +09:00
Matt Mackall
6824529f1e encoding: handle UTF-16 internal limit with fromutf8b (issue5031)
Default builds of Python have a Unicode type that isn't actually full
Unicode but UTF-16, which encodes non-BMP codepoints to a pair of BMP
codepoints with surrogate escaping. Since our UTF-8b hack escaping
uses a plane that overlaps with the UTF-16 escaping system, this gets
extra complicated. In addition, unichr() for codepoints greater than
U+FFFF may not work either.

This changes the code to reuse getutf8char to walk the byte string, so we
only rely on Python for unpacking our U+DCxx characters.
2016-01-07 14:57:57 -06:00
Gregory Szorc
3fecfb1d5a encoding: use double backslash
In Python 2, '\u' == '\\u'. However, in Python 3, '\u' results in:

SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in
position 0-1: truncated \uXXXX escape

The minor change in this patch allows Python 3 to ast parse encoding.py.
2015-12-12 23:26:12 -08:00
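
The parse failure described in the commit above can be reproduced without editing a file. This is a small sketch assuming Python 3; the helper name is illustrative.

```python
def parses(source):
    """Return True if the source text compiles on this interpreter."""
    try:
        compile(source, '<sketch>', 'exec')
        return True
    except SyntaxError:
        return False


single = "x = '\\u'"     # source text reads: x = '\u'
double = "x = '\\\\u'"   # source text reads: x = '\\u'

single_ok = parses(single)
double_ok = parses(double)
```

On Python 3 only the double-backslash form compiles, because `\u` in a str literal must be followed by four hex digits.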
Gregory Szorc
5d71f3b2b9 encoding: use absolute_import 2015-12-12 22:57:48 -05:00
Matt Mackall
00fdf56300 encoding: extend test cases for utf8b
This adds a round-trip helper and a few tests of streams that could
cause synchronization problems in the encoder.
2015-11-02 17:17:33 -06:00
Matt Mackall
85a6f7932d encoding: re-escape U+DCxx characters in toutf8b input (issue4927)
This is the final missing piece in fully round-tripping random byte
strings through UTF-8b. While this issue means that UTF-8 <-> UTF-8b
isn't fully bijective, we don't expect to ever see U+DCxx codepoints
in "real" UTF-8 data, so it should remain bijective in practice.
2015-11-05 17:30:10 -06:00
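
The round-trip property discussed in the commit above can be approximated with Python 3's standard error handlers. This is a sketch of the UTF-8b idea only, not Mercurial's toutf8b()/fromutf8b() implementation; both function names are hypothetical.

```python
def to_utf8b_sketch(raw):
    """Map arbitrary bytes to UTF-8 plus encoded U+DCxx escapes."""
    # Invalid bytes become lone surrogates U+DC80..U+DCFF, which are then
    # written out as 3-byte sequences instead of being rejected.
    text = raw.decode('utf-8', 'surrogateescape')
    return text.encode('utf-8', 'surrogatepass')


def from_utf8b_sketch(escaped):
    """Invert to_utf8b_sketch, restoring the original bytes."""
    text = escaped.decode('utf-8', 'surrogatepass')
    return text.encode('utf-8', 'surrogateescape')


latin1_name = b'caf\xe9'           # not valid UTF-8
already_escaped = b'\xed\xb3\xa9'  # bytes that look like an encoded U+DCE9
```

Valid UTF-8 passes through unchanged. Input that already contains an encoded U+DCxx sequence is escaped again byte by byte, so it still round-trips, which is the case the commit addresses.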
Matt Mackall
09bfc43ef0 encoding: use getutf8char in toutf8b
This correctly avoids the ambiguity of U+FFFD already present in the
input and similar confusion by working a character at a time.
2015-11-05 17:21:43 -06:00
Matt Mackall
cc7a93dfa3 encoding: handle non-BMP characters in fromutf8b 2015-11-05 17:11:50 -06:00
Matt Mackall
322dbe32ca encoding: add getutf8char helper
This allows us to find character boundaries in byte strings when
trying to do custom encodings.
2015-11-05 16:48:46 -06:00
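
A helper that finds character boundaries, as described in the commit above, might look like this. It is a sketch assuming Python 3, with a hypothetical name; the actual getutf8char may be structured differently.

```python
def get_utf8_char(s, pos):
    """Return the UTF-8 character starting at byte offset pos, as bytes."""
    first = s[pos]  # indexing bytes yields an int on Python 3
    if first < 0x80:
        length = 1
    elif 0xC0 <= first < 0xE0:
        length = 2
    elif 0xE0 <= first < 0xF0:
        length = 3
    elif 0xF0 <= first < 0xF8:
        length = 4
    else:
        raise ValueError('invalid UTF-8 lead byte at offset %d' % pos)
    char = s[pos:pos + length]  # slicing keeps the result a bytes object
    char.decode('utf-8')        # validate; raises UnicodeDecodeError if bad
    return char


data = 'a\u00e9\u65e5'.encode('utf-8')  # 1-, 2- and 3-byte characters
```

Returning a slice rather than an indexed element means callers always get bytes, and `UnicodeDecodeError` is a subclass of `ValueError`, so one `except ValueError` covers both failure modes.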
Gregory Szorc
5380dea2a7 global: mass rewrite to use modern exception syntax
Python 2.6 introduced the "except type as instance" syntax, replacing
the "except type, instance" syntax that came before. Python 3 dropped
support for the latter syntax. Since we no longer support Python 2.4 or
2.5, we have no need to continue supporting the "except type, instance" syntax.

This patch mass rewrites the exception syntax to be Python 2.6+ and
Python 3 compatible.

This patch was produced by running `2to3 -f except -w -n .`.
2015-06-23 22:20:08 -07:00
Siddharth Agarwal
fada28ff91 util.h: define an enum for normcase specs
These will be used in upcoming patches to efficiently create a dirstate
foldmap.
2015-04-02 19:17:32 -07:00
Siddharth Agarwal
471dc0d569 encoding.upper: factor out fallback code
This will be used as the fallback function on Windows.
2015-04-01 00:30:41 -07:00