sapling

mirror of https://github.com/facebook/sapling.git synced 2024-10-11 09:17:30 +03:00

Author	SHA1	Message	Date
Yuya Nishihara	0c45446525	py3: add utility to forward __str__() to __bytes__() It calls unifromlocal() instead of sysstr() because __bytes__() may contain locale-dependent values such as paths.	2017-06-24 13:48:04 +09:00
Yuya Nishihara	97a0e41896	encoding: make sure "wide" variable never be referenced from other modules Better to not expose (maybe-) unicode objects.	2017-05-29 21:57:51 +09:00
Augie Fackler	0d4ce4ae60	encoding: make wide character class list a sysstr That's what east_asian_width returns, so just match it.	2017-05-28 13:27:29 -04:00
Yuya Nishihara	4563e16232	parsers: switch to policy importer # no-check-commit	2016-08-13 12:23:56 +09:00
Yuya Nishihara	34b243ac82	encoding: use i.startswith() instead of i[0] to eliminate py2/3 divergence	2017-05-16 23:36:38 +09:00
Martin von Zweigbergk	c3406ac3db	cleanup: use set literals We no longer support Python 2.6, so we can now use set literals.	2017-02-10 16:56:29 -08:00
Gregory Szorc	4c11e54e6e	encoding: remove workaround for locale.getpreferredencoding() locale.getpreferredencoding() was buggy in OS X for Python <2.7. Since we no longer support Python <2.7, we no longer need this workaround. This essentially reverts 4a8b821a69fb.	2017-05-13 11:20:51 -07:00
Yuya Nishihara	e4989d80e7	check-code: ignore re-exports of os.environ in encoding.py These are valid uses of os.environ.	2017-05-01 17:23:48 +09:00
Pulkit Goyal	ea140325b7	py3: use pycompat.bytechr instead of chr	2017-05-03 15:37:51 +05:30
Yuya Nishihara	236d81fcc8	pycompat: introduce identity function as a compat stub I was sometimes too lazy to use 'str' instead of 'lambda a: a'. Let's add a named function for that purpose.	2017-03-29 21:13:55 +09:00
Yuya Nishihara	af7f25fdb3	encoding: add converter between native str and byte string This kind of encoding conversion is unavoidable on Python 3.	2017-03-13 09:12:56 -07:00
Yuya Nishihara	dcade16cf7	encoding: factor out unicode variants of from/tolocal() Unfortunately, these functions will be commonly used on Python 3.	2017-03-13 09:11:08 -07:00
Pulkit Goyal	ab21746511	py3: make sure encoding.encoding is a bytes variable encoding.encoding returns unicodes when locale.getpreferredencoding() is used to get the preferred encoding. This patch fixes that.	2016-12-17 23:55:25 +05:30
Yuya Nishihara	52ffc6a5bd	py3: provide encoding.environ which is a dict of bytes This can't be moved to pycompat.py since we need encoding.tolocal() to build bytes dict from unicode os.environ.	2016-09-28 20:05:34 +09:00
Yuya Nishihara	d71e06adf5	py3: convert encoding name and mode to str Otherwise tolocal() and fromlocal() wouldn't work on Python 3. Still tolocal() can't make a valid localstr object because localstr inherits str, but it can return some object without raising exceptions. Since Py3 bytes() behaves much like bytearray() than str() of Py2, we can't simply do s/str/bytes/g. I have no good idea to handle str/bytes divergence.	2016-09-28 20:39:06 +09:00
Yuya Nishihara	7c03e0d6ba	pycompat: provide 'ispy3' constant We compare version_info at several places, which seems enough to define a constant.	2016-09-28 20:01:23 +09:00
Gregory Szorc	fcadf5c68a	encoding: use range() instead of xrange() Python 3 doesn't have xrange(). Instead, range() on Python 3 is a generator, like xrange() is on Python 2. The benefits of xrange() over range() are when there are very large ranges that are too expensive to pre-allocate. The code here is only creating <128 values, so the benefits of xrange() should be negligible. With this patch, encoding.py imports safely on Python 3.	2016-03-11 21:27:26 -08:00
Gregory Szorc	10727db849	encoding: make HFS+ ignore code Python 3 compatible unichr() doesn't exist in Python 3. chr() is the equivalent there. Unfortunately, we can't use chr() outright because Python 2 only accepts values smaller than 256. Also, Python 3 returns an int when accessing a character of a bytes type (s[x]). So, we have to ord() the values in the assert statement.	2016-03-11 21:23:34 -08:00
Yuya Nishihara	8e19f2cb3f	encoding: backport paranoid escaping from templatefilters.jsonescape() This was introduced by e1e8de66f2e1. It is required to embed JSON data in HTML page. Convince yourself here: http://escape.alf.nu/1	2015-12-27 19:58:11 +09:00
Yuya Nishihara	89ded2c8a7	encoding: add option to escape non-ascii characters in JSON This is necessary for hgweb to embed JSON data in HTML. JSON data must be able to be embedded in non-UTF-8 HTML page so long as the page encoding is compatible with ASCII. According to RFC 7159, non-BMP character is represented as UTF-16 surrogate pair. This function first splits an input string into an array of UTF-16 code points. https://tools.ietf.org/html/rfc7159.html#section-7	2015-12-27 19:28:34 +09:00
Yuya Nishihara	2b5d7daa88	encoding: initialize jsonmap when module is loaded This makes jsonescape() a thread-safe function, which is necessary for hgweb. The initialization stuff isn't that slow: $ python -m timeit -n1000 -s 'from mercurial import encoding as x' 'reload(x)' original: 1000 loops, best of 3: 158 usec per loop this patch: 1000 loops, best of 3: 214 usec per loop compared to loading the commands module: $ python -m timeit -n1000 -s 'from mercurial import commands as x' 'reload(x)' 1000 loops, best of 3: 1.11 msec per loop	2016-01-30 19:48:35 +09:00
Yuya Nishihara	edbbf3796b	encoding: change jsonmap to a list indexed by code point This is slightly faster and convenient to implement a paranoid escaping. $ python -m timeit \ -s 'from mercurial import encoding; data = str(bytearray(xrange(128)))' \ 'encoding.jsonescape(data)' original: 100000 loops, best of 3: 15.1 usec per loop this patch: 100000 loops, best of 3: 13.7 usec per loop	2016-01-30 19:41:34 +09:00
Yuya Nishihara	6727c29486	encoding: escape U+007F (DEL) character in JSON RFC 7159 does not state that U+007F must be escaped, but it is widely considered a control character. As '\x7f' is invisible on a terminal, and Python's json.dumps() escapes '\x7f', let's do the same.	2016-01-16 18:30:01 +09:00
Matt Mackall	6824529f1e	encoding: handle UTF-16 internal limit with fromutf8b (issue5031) Default builds of Python have a Unicode type that isn't actually full Unicode but UTF-16, which encodes non-BMP codepoints to a pair of BMP codepoints with surrogate escaping. Since our UTF-8b hack escaping uses a plane that overlaps with the UTF-16 escaping system, this gets extra complicated. In addition, unichr() for codepoints greater than U+FFFF may not work either. This changes the code to reuse getutf8char to walk the byte string, so we only rely on Python for unpacking our U+DCxx characters.	2016-01-07 14:57:57 -06:00
Gregory Szorc	3fecfb1d5a	encoding: use double backslash In Python 2, '\u' == '\\u'. However, in Python 3, '\u' results in: SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape The minor change in this patch allows Python 3 to ast parse encoding.py.	2015-12-12 23:26:12 -08:00
Gregory Szorc	5d71f3b2b9	encoding: use absolute_import	2015-12-12 22:57:48 -05:00
Matt Mackall	00fdf56300	encoding: extend test cases for utf8b This adds a round-trip helper and a few tests of streams that could cause synchronization problems in the encoder.	2015-11-02 17:17:33 -06:00
Matt Mackall	85a6f7932d	encoding: re-escape U+DCxx characters in toutf8b input (issue4927) This is the final missing piece in fully round-tripping random byte strings through UTF-8b. While this issue means that UTF-8 <-> UTF-8b isn't fully bijective, we don't expect to ever see U+DCxx codepoints in "real" UTF-8 data, so it should remain bijective in practice.	2015-11-05 17:30:10 -06:00
Matt Mackall	09bfc43ef0	encoding: use getutf8char in toutf8b This correctly avoids the ambiguity of U+FFFD already present in the input and similar confusion by working a character at a time.	2015-11-05 17:21:43 -06:00
Matt Mackall	cc7a93dfa3	encoding: handle non-BMP characters in fromutf8b	2015-11-05 17:11:50 -06:00
Matt Mackall	322dbe32ca	encoding: add getutf8char helper This allows us to find character boundaries in byte strings when trying to do custom encodings.	2015-11-05 16:48:46 -06:00
Gregory Szorc	5380dea2a7	global: mass rewrite to use modern exception syntax Python 2.6 introduced the "except type as instance" syntax, replacing the "except type, instance" syntax that came before. Python 3 dropped support for the latter syntax. Since we no longer support Python 2.4 or 2.5, we have no need to continue supporting the "except type, instance". This patch mass rewrites the exception syntax to be Python 2.6+ and Python 3 compatible. This patch was produced by running `2to3 -f except -w -n .`.	2015-06-23 22:20:08 -07:00
Siddharth Agarwal	fada28ff91	util.h: define an enum for normcase specs These will be used in upcoming patches to efficiently create a dirstate foldmap.	2015-04-02 19:17:32 -07:00
Siddharth Agarwal	471dc0d569	encoding.upper: factor out fallback code This will be used as the fallback function on Windows.	2015-04-01 00:30:41 -07:00
Siddharth Agarwal	950e16d188	encoding: define an enum that specifies what normcase does to ASCII strings For C code we don't want to pay the cost of calling into a Python function for the common case of ASCII filenames. However, while on most POSIX platforms we normalize filenames by lowercasing them, on Windows we uppercase them. We define an enum here indicating the direction that filenames should be normalized as. Some platforms (notably Cygwin) have more complicated normalization behavior -- we add a case for that too. In upcoming patches we'll also define a fallback function that is called if the string has non-ASCII bytes. This enum will be replicated in the C code to make foldmaps. There's unfortunately no nice way to avoid that -- we can't have encoding import parsers because of import cycles. One way might be to have parsers import encoding, but accessing Python modules from C code is just awkward. The name 'normcasespecs' was chosen to indicate that this is merely an integer that specifies a behavior, not a function. The name was pluralized since in upcoming patches we'll introduce 'normcasespec' which will be one of these values.	2015-04-01 00:21:10 -07:00
Siddharth Agarwal	9e6d9e8c62	encoding: use parsers.asciiupper when available This is used on Windows and Cygwin, and the gains from this are expected to be similar to what was seen in 39fbe33f95fa.	2015-03-31 15:22:09 -07:00
Augie Fackler	3c9e7fcc66	encoding: add hfsignoreclean to clean out HFS-ignored characters According to Apple Technote 1150 (unavailable from Apple as far as I can tell, but archived in several places online), HFS+ ignores sixteen specific unicode runes when doing path normalization. We need to handle those cases, so this function lets us efficiently strip the offending characters from a UTF-8 encoded string (which is the only way it seems to matter on OS X.)	2014-12-16 13:06:41 -05:00
FUJIWARA Katsunori	7120dc2e96	encoding: avoid cyclic dependency around "parsers" in pure Python build 39fbe33f95fa brought "asciilower" and "import parsers" into "encoding.py". This works fine with "parsers" module in C implementation, but doesn't with one in pure Python implementation, because the latter causes cyclic dependency below and aborting execution: util => i18n => encoding => parsers => util This patch delays importing "parsers" module until it is really needed, to avoid cyclic dependency around "parsers" in pure Python build.	2014-10-17 02:07:04 +09:00
Siddharth Agarwal	8298f08c96	encoding.lower: use fast ASCII lower This benefits, among other things, the case collision auditor. On a Linux system with a large real-world repo where all filenames are ASCII, hg perfcca: before: wall 0.260157 comb 0.270000 user 0.230000 sys 0.040000 (best of 38) after: wall 0.164616 comb 0.160000 user 0.160000 sys 0.000000 (best of 54)	2014-10-03 18:45:56 -07:00
Siddharth Agarwal	e56ab5399b	parsers: add a function to efficiently lowercase ASCII strings We need a way to efficiently lowercase ASCII strings. For example, 'hg status' needs to build up the fold map -- a map from a canonical case (for OS X, lowercase) to the actual case of each file and directory in the dirstate. The current way we do that is to try decoding to ASCII and then calling lower() on the string, labeled 'orig' below: str.decode('ascii') return str.lower() This is pretty inefficient, and it turns out we can do much better. I also tested out a condition-based approach, labeled 'cond' below: (c >= 'A' && c <= 'Z') ? (c + ('a' - 'A')) : c 'cond' turned out to be slower in all cases. A 256-byte lookup table with invalid values for everything past 127 performed similarly, but this was less verbose. On OS X 10.9 with LLVM version 6.0 (clang-600.0.51), the asciilower function was run against two corpuses. Corpus 1 (list of files from real-world repo, > 100k files): orig: wall 0.428567 comb 0.430000 user 0.430000 sys 0.000000 (best of 24) cond: wall 0.077204 comb 0.070000 user 0.070000 sys 0.000000 (best of 100) lookup: wall 0.060714 comb 0.060000 user 0.060000 sys 0.000000 (best of 100) Corpus 2 (mozilla-central, 113k files): orig: wall 0.238406 comb 0.240000 user 0.240000 sys 0.000000 (best of 42) cond: wall 0.040779 comb 0.040000 user 0.040000 sys 0.000000 (best of 100) lookup: wall 0.037623 comb 0.040000 user 0.040000 sys 0.000000 (best of 100) On a Linux server-class machine with GCC 4.4.6 20120305 (Red Hat 4.4.6-4): Corpus 1 (real-world repo, > 100k files): orig: wall 0.260899 comb 0.260000 user 0.260000 sys 0.000000 (best of 38) cond: wall 0.054818 comb 0.060000 user 0.060000 sys 0.000000 (best of 100) lookup: wall 0.048489 comb 0.050000 user 0.050000 sys 0.000000 (best of 100) Corpus 2 (mozilla-central, 113k files): orig: wall 0.153082 comb 0.150000 user 0.150000 sys 0.000000 (best of 65) cond: wall 0.031007 comb 0.040000 user 0.040000 sys 0.000000 (best of 100) lookup: wall 0.028793 comb 0.030000 user 0.030000 sys 0.000000 (best of 100) SSE instructions might help even more, but I didn't experiment with those.	2014-10-03 18:42:39 -07:00
Matt Mackall	0b8e08e772	encoding: add json escaping filter This ends up here because it needs to be somewhat encoding aware.	2014-09-15 13:12:49 -05:00
Matt Mackall	00a262f2bb	encoding: handle empty string in toutf8	2014-09-15 13:12:20 -05:00
FUJIWARA Katsunori	b34bd803eb	encoding: add 'leftside' argument into 'trim' to switch trimming side	2014-07-06 02:56:41 +09:00
FUJIWARA Katsunori	71717db270	encoding: add 'trim' to trim multi-byte characters at most specified columns Newly added 'trim' is used to trim multi-byte characters at most specified columns correctly: directly slicing byte sequence should be replaced with 'encoding.trim', because the former may split at intermediate multi-byte sequence. Slicing unicode sequence ('uslice') and concatenation with ellipsis ('concat') are defined as function, to make enhancement in subsequent patch easier.	2014-07-06 02:56:41 +09:00
Mads Kiilerich	403c97887d	tests: stabilize doctest output Avoid dependencies to dict iteration order.	2013-01-15 02:59:14 +01:00
Mads Kiilerich	2f4504e446	fix trivial spelling errors	2012-08-15 22:38:42 +02:00
Martin Geisler	5b013e2061	encoding: add fast-path for ASCII uppercase. This copies the performance hack from encoding.lower (e7a5733d533f). The case-folding logic that kicks in on case-insensitive filesystems hits encoding.upper hard: with a repository with 75k files, the timings went from hg perfstatus ! wall 3.156000 comb 3.156250 user 1.625000 sys 1.531250 (best of 3) to hg perfstatus ! wall 2.390000 comb 2.390625 user 1.078125 sys 1.312500 (best of 5) This is a 24% decrease. For comparison, Mercurial 2.0 gives: hg perfstatus ! wall 2.172000 comb 2.171875 user 0.984375 sys 1.187500 (best of 5) so we're only 10% slower than before we added the extra case-folding logic. The same decrease is seen when executing 'hg status' as normal, where we go from: hg status --time time: real 4.322 secs (user 2.219+0.000 sys 2.094+0.000) to hg status --time time: real 3.307 secs (user 1.750+0.000 sys 1.547+0.000)	2012-07-23 15:55:26 -06:00
Martin Geisler	4f96956c09	encoding: use s.decode to trigger UnicodeDecodeError When calling encode on a str, the string is first decoded using the default encoding and then encoded. So s.encode('ascii') == s.decode().encode('ascii') We don't care about the encode step here -- we're just after the UnicodeDecodeError raised by decode if it finds a non-ASCII character. This way is also marginally faster since it saves the construction of the extra str object.	2012-07-23 15:55:22 -06:00
Cesar Mena	5d1ea9328c	encoding: protect against non-ascii default encoding If the default python encoding was changed from ascii, the attempt to encode as ascii before lower() could throw a UnicodeEncodeError. Catch UnicodeError instead to prevent an unhandled exception.	2012-04-22 21:27:52 -04:00
Matt Mackall	b7245bb05e	encoding: add fast-path for ASCII lowercase	2012-04-10 12:07:18 -05:00
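
Several commits above (b7245bb05e, e56ab5399b, 8298f08c96) describe an ASCII fast path for lowercasing that falls back to a full decode only when non-ASCII bytes are present. The following is a minimal Python 3 sketch of that idea, not the code from those commits: the names `ascii_lower` and `lower`, the `encoding` parameter, and the use of `bytes.translate` in place of the C lookup table are all assumptions made for illustration.

```python
# Hypothetical sketch of an ASCII fast path; names are illustrative only
# and are not Mercurial's or Sapling's actual API.

# 256-entry table: bytes for 'A'-'Z' map to 'a'-'z', every other byte
# maps to itself.
_LOWER_TABLE = bytes(b + 32 if 0x41 <= b <= 0x5A else b for b in range(256))


def ascii_lower(s: bytes) -> bytes:
    """Lowercase s via table lookup; raise ValueError on non-ASCII input."""
    if not s.isascii():
        raise ValueError("non-ASCII byte in input")
    return s.translate(_LOWER_TABLE)


def lower(s: bytes, encoding: str = "utf-8") -> bytes:
    """Try the ASCII fast path first, then fall back to a Unicode-aware path."""
    try:
        return ascii_lower(s)
    except ValueError:
        # Slow path: decode, lowercase as text, and re-encode.
        return s.decode(encoding).lower().encode(encoding)
```

The fallback only runs when the fast path rejects the input, so a repository whose filenames are all ASCII should never pay for the decode and re-encode round trip.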


74 Commits
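
Commits 71717db270 and b34bd803eb describe trimming a byte string to a number of display columns without splitting a multi-byte character, optionally keeping the right-hand end instead of the left. A rough sketch of that technique is given below. It is not the implementation from those commits: the function names, the UTF-8 default, the ASCII-only ellipsis, and the choice to count only East Asian "W" and "F" characters as two columns are assumptions.

```python
import unicodedata

# Hypothetical sketch of column-aware trimming; not the actual
# encoding.trim implementation.


def col_width(u: str) -> int:
    """Display columns of u: wide and fullwidth characters count as 2."""
    return sum(
        2 if unicodedata.east_asian_width(c) in ("W", "F") else 1 for c in u
    )


def trim(s: bytes, width: int, ellipsis: bytes = b"",
         leftside: bool = False, encoding: str = "utf-8") -> bytes:
    """Trim s to at most `width` columns, cutting only between characters."""
    u = s.decode(encoding)
    if col_width(u) <= width:
        return s
    budget = width - len(ellipsis)  # assumption: the ellipsis is ASCII
    if budget <= 0:
        return ellipsis[:max(width, 0)]
    kept, used = [], 0
    # With leftside=True the left side is trimmed, so walk from the right.
    for c in (reversed(u) if leftside else u):
        w = col_width(c)
        if used + w > budget:
            break
        kept.append(c)
        used += w
    if leftside:
        kept.reverse()
        return ellipsis + "".join(kept).encode(encoding)
    return "".join(kept).encode(encoding) + ellipsis
```

Because the cut is chosen on decoded characters and the result is re-encoded afterwards, the returned bytes always end on a character boundary, which plain byte slicing cannot guarantee.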