sapling

mirror of https://github.com/facebook/sapling.git synced 2024-10-10 16:57:49 +03:00

Author	SHA1	Message	Date
Siddharth Agarwal	950e16d188	encoding: define an enum that specifies what normcase does to ASCII strings For C code we don't want to pay the cost of calling into a Python function for the common case of ASCII filenames. However, while on most POSIX platforms we normalize filenames by lowercasing them, on Windows we uppercase them. We define an enum here indicating the direction that filenames should be normalized as. Some platforms (notably Cygwin) have more complicated normalization behavior -- we add a case for that too. In upcoming patches we'll also define a fallback function that is called if the string has non-ASCII bytes. This enum will be replicated in the C code to make foldmaps. There's unfortunately no nice way to avoid that -- we can't have encoding import parsers because of import cycles. One way might be to have parsers import encoding, but accessing Python modules from C code is just awkward. The name 'normcasespecs' was chosen to indicate that this is merely an integer that specifies a behavior, not a function. The name was pluralized since in upcoming patches we'll introduce 'normcasespec' which will be one of these values.	2015-04-01 00:21:10 -07:00
Siddharth Agarwal	9e6d9e8c62	encoding: use parsers.asciiupper when available This is used on Windows and Cygwin, and the gains from this are expected to be similar to what was seen in 39fbe33f95fa.	2015-03-31 15:22:09 -07:00
Augie Fackler	3c9e7fcc66	encoding: add hfsignoreclean to clean out HFS-ignored characters According to Apple Technote 1150 (unavailable from Apple as far as I can tell, but archived in several places online), HFS+ ignores sixteen specific unicode runes when doing path normalization. We need to handle those cases, so this function lets us efficiently strip the offending characters from a UTF-8 encoded string (which is the only way it seems to matter on OS X.)	2014-12-16 13:06:41 -05:00
FUJIWARA Katsunori	7120dc2e96	encoding: avoid cyclic dependency around "parsers" in pure Python build 39fbe33f95fa brought "asciilower" and "import parsers" into "encoding.py". This works fine with "parsers" module in C implementation, but doesn't with one in pure Python implementation, because the latter causes cyclic dependency below and aborting execution: util => i18n => encoding => parsers => util This patch delays importing "parsers" module until it is really needed, to avoid cyclic dependency around "parsers" in pure Python build.	2014-10-17 02:07:04 +09:00
Siddharth Agarwal	8298f08c96	encoding.lower: use fast ASCII lower This benefits, among other things, the case collision auditor. On a Linux system with a large real-world repo where all filenames are ASCII, hg perfcca: before: wall 0.260157 comb 0.270000 user 0.230000 sys 0.040000 (best of 38) after: wall 0.164616 comb 0.160000 user 0.160000 sys 0.000000 (best of 54)	2014-10-03 18:45:56 -07:00
Siddharth Agarwal	e56ab5399b	parsers: add a function to efficiently lowercase ASCII strings We need a way to efficiently lowercase ASCII strings. For example, 'hg status' needs to build up the fold map -- a map from a canonical case (for OS X, lowercase) to the actual case of each file and directory in the dirstate. The current way we do that is to try decoding to ASCII and then calling lower() on the string, labeled 'orig' below: str.decode('ascii') return str.lower() This is pretty inefficient, and it turns out we can do much better. I also tested out a condition-based approach, labeled 'cond' below: (c >= 'A' && c <= 'Z') ? (c + ('a' - 'A')) : c 'cond' turned out to be slower in all cases. A 256-byte lookup table with invalid values for everything past 127 performed similarly, but this was less verbose. On OS X 10.9 with LLVM version 6.0 (clang-600.0.51), the asciilower function was run against two corpuses. Corpus 1 (list of files from real-world repo, > 100k files): orig: wall 0.428567 comb 0.430000 user 0.430000 sys 0.000000 (best of 24) cond: wall 0.077204 comb 0.070000 user 0.070000 sys 0.000000 (best of 100) lookup: wall 0.060714 comb 0.060000 user 0.060000 sys 0.000000 (best of 100) Corpus 2 (mozilla-central, 113k files): orig: wall 0.238406 comb 0.240000 user 0.240000 sys 0.000000 (best of 42) cond: wall 0.040779 comb 0.040000 user 0.040000 sys 0.000000 (best of 100) lookup: wall 0.037623 comb 0.040000 user 0.040000 sys 0.000000 (best of 100) On a Linux server-class machine with GCC 4.4.6 20120305 (Red Hat 4.4.6-4): Corpus 1 (real-world repo, > 100k files): orig: wall 0.260899 comb 0.260000 user 0.260000 sys 0.000000 (best of 38) cond: wall 0.054818 comb 0.060000 user 0.060000 sys 0.000000 (best of 100) lookup: wall 0.048489 comb 0.050000 user 0.050000 sys 0.000000 (best of 100) Corpus 2 (mozilla-central, 113k files): orig: wall 0.153082 comb 0.150000 user 0.150000 sys 0.000000 (best of 65) cond: wall 0.031007 comb 0.040000 user 0.040000 sys 0.000000 (best of 100) lookup: wall 0.028793 comb 0.030000 user 0.030000 sys 0.000000 (best of 100) SSE instructions might help even more, but I didn't experiment with those.	2014-10-03 18:42:39 -07:00
Matt Mackall	0b8e08e772	encoding: add json escaping filter This ends up here because it needs to be somewhat encoding aware.	2014-09-15 13:12:49 -05:00
Matt Mackall	00a262f2bb	encoding: handle empty string in toutf8	2014-09-15 13:12:20 -05:00
FUJIWARA Katsunori	b34bd803eb	encoding: add 'leftside' argument into 'trim' to switch trimming side	2014-07-06 02:56:41 +09:00
FUJIWARA Katsunori	71717db270	encoding: add 'trim' to trim multi-byte characters at most specified columns Newly added 'trim' is used to trim multi-byte characters at most specified columns correctly: directly slicing byte sequence should be replaced with 'encoding.trim', because the former may split at intermediate multi-byte sequence. Slicing unicode sequence ('uslice') and concatenation with ellipsis ('concat') are defined as function, to make enhancement in subsequent patch easier.	2014-07-06 02:56:41 +09:00
Mads Kiilerich	403c97887d	tests: stabilize doctest output Avoid dependencies to dict iteration order.	2013-01-15 02:59:14 +01:00
Mads Kiilerich	2f4504e446	fix trivial spelling errors	2012-08-15 22:38:42 +02:00
Martin Geisler	5b013e2061	encoding: add fast-path for ASCII uppercase. This copies the performance hack from encoding.lower (e7a5733d533f). The case-folding logic that kicks in on case-insensitive filesystems hits encoding.upper hard: with a repository with 75k files, the timings went from hg perfstatus ! wall 3.156000 comb 3.156250 user 1.625000 sys 1.531250 (best of 3) to hg perfstatus ! wall 2.390000 comb 2.390625 user 1.078125 sys 1.312500 (best of 5) This is a 24% decrease. For comparison, Mercurial 2.0 gives: hg perfstatus ! wall 2.172000 comb 2.171875 user 0.984375 sys 1.187500 (best of 5) so we're only 10% slower than before we added the extra case-folding logic. The same decrease is seen when executing 'hg status' as normal, where we go from: hg status --time time: real 4.322 secs (user 2.219+0.000 sys 2.094+0.000) to hg status --time time: real 3.307 secs (user 1.750+0.000 sys 1.547+0.000)	2012-07-23 15:55:26 -06:00
Martin Geisler	4f96956c09	encoding: use s.decode to trigger UnicodeDecodeError When calling encode on a str, the string is first decoded using the default encoding and then encoded. So s.encode('ascii') == s.decode().encode('ascii') We don't care about the encode step here -- we're just after the UnicodeDecodeError raised by decode if it finds a non-ASCII character. This way is also marginally faster since it saves the construction of the extra str object.	2012-07-23 15:55:22 -06:00
Cesar Mena	5d1ea9328c	encoding: protect against non-ascii default encoding If the default python encoding was changed from ascii, the attempt to encode as ascii before lower() could throw a UnicodeEncodeError. Catch UnicodeError instead to prevent an unhandled exception.	2012-04-22 21:27:52 -04:00
Matt Mackall	b7245bb05e	encoding: add fast-path for ASCII lowercase	2012-04-10 12:07:18 -05:00
Matt Mackall	369755dc10	encoding: tune fast-path of tolocal a bit	2012-03-22 16:54:46 -05:00
Matt Mackall	feb580aa0d	encoding: introduce utf8-b helpers	2012-02-20 16:42:45 -06:00
Mads Kiilerich	c180ed9101	encoding: use hint markup for "please check your locale settings" This will also make test-encoding.t pass on windows. The test would hit some other code path that already used hint markup.	2011-12-26 15:01:06 +01:00
FUJIWARA Katsunori	fe972435d4	i18n: use encoding.lower/upper for encoding aware case folding this patch uses encoding.lower/upper for case folding, because ones of str can not fold case of non ascii characters correctly. to avoid cyclic dependency and to encapsulate logic of normcase in each platforms, this patch introduces encodinglower/encodingupper in both posix/windows specific files. this patch does not change implementation of normcase() in posix.py, because we do not know the encoding of filenames on POSIX. some "normcase()" are excluded from function wrap list in hgext/win32mbcs.py, because they become encoding aware by this patch.	2011-12-16 21:09:41 +09:00
Matt Mackall	9310b276b8	encoding: add getcols to extract substrings based on column width	2011-09-21 13:00:46 -05:00
Matt Mackall	9b84bd37fa	encoding: colwidth input is in the local encoding	2011-09-21 13:00:41 -05:00
FUJIWARA Katsunori	5b5a083f16	i18n: calculate terminal columns by width information of each characters neither number of 'bytes' in any encoding nor 'characters' is appropriate to calculate terminal columns for specified string. this patch modifies MBTextWrapper for: - overriding '_wrap_chunks()' to make it use not built-in 'len()' but 'encoding.colwidth()' for columns of string - fixing '_cutdown()' to make it use 'encoding.colwidth()' instead of local, similar but incorrect implementation this patch also modifies 'encoding.py': - dividing 'colwidth()' into 2 pieces: one for calculation columns of specified UNICODE string, and another for rest part of original one. the former is used from MBTextWrapper in 'util.py'. - preventing 'colwidth()' from evaluating HGENCODINGAMBIGUOUS configuration per each invocation: 'unicodedata.east_asian_width' checking is kept intact for reducing startup cost.	2011-08-27 04:56:12 +09:00
Augie Fackler	e16b528122	encoding: use getattr instead of hasattr	2011-07-25 15:19:43 -05:00
Matt Mackall	f865cc3f06	encoding: add an encoding-aware lower function	2011-04-30 10:57:13 -05:00
Matt Mackall	1cf3cf83b1	encoding: avoid localstr when a string can be encoded losslessly (issue2763) localstr's hash method exists to prevent bogus matching on lossy local encodings. For instance, we don't want 'caf?' to match 'café' in an ASCII locale. But when café can be losslessly encoded in the local charset, we can simply use a normal string and avoid the hashing trick. This avoids using localstr's hash method, which would prevent a match between	2011-04-15 23:45:41 -05:00
Martin Geisler	dd0f217423	encoding: fix typo in variable name The typo had no real effect, except for an unnecessary UTF-8 encoding.	2010-11-29 10:13:55 +01:00
Matt Mackall	c7059d3926	encoding: add localstr class to track UTF-8 version of transcoded strings This allows UTF-8 strings to losslessly round-trip through Mercurial	2010-11-24 15:38:52 -06:00
Matt Mackall	50b99d1a5a	encoding: default ambiguous character to narrow The current implementation of colwidth was treating 'A'mbiguous characters as wide, which was incorrect in a non-East Asian context. As per http://unicode.org/reports/tr11/#Recommendations, we should instead default to 'narrow' if we don't know better. As character width is dependent on the particular font used and we have no idea what fonts are in use, this recommendation applies. This introduces HGENCODINGAMBIGUOUS to get the old behavior back.	2010-10-27 15:35:21 -05:00
Martin Geisler	77ce66fb6a	check-code: find trailing whitespace	2010-10-20 10:13:04 +02:00
Brodie Rao	203cf2fbd9	cleanup: remove unused imports	2010-08-27 13:32:38 -04:00
Dan Villiom Podlaski Christiansen	d64d4dc9f0	encoding: improve handling of buggy getpreferredencoding() on Mac OS X Prior to version 2.7, calling locale.getpreferredencoding() would always return 'mac-roman' on Mac OS X. Previously, this was handled by a call to locale.setlocale(). Unfortunately, Python 2.6.5 and older have a bug where isspace() would incorrectly report True for 0x85 and 0xa0 after such a call. In order to fix this, we replace the previous _encodingfixup mapping to an _encodingfixers mapping. Rather than mapping encodings to their replacement, it maps them to a function returning the replacement. This allows us to provide a simplified implementation of getpreferredencoding() which extracts the expected encoding and restores the locale. This fix is based on a patch originally submitted by Martijn Pieters as well as feedback from Brodie Rao.	2010-08-14 01:30:54 +02:00
FUJIWARA Katsunori	9cce255bec	replace Python standard textwrap by MBCS sensitive one for i18n text Mercurial has problem around text wrapping/filling in MBCS encoding environment, because standard 'textwrap' module of Python can not treat it correctly. It splits byte sequence for one character into two lines. According to unicode specification, "east asian width" classifies characters into: W(ide), N(arrow), F(ull-width), H(alf-width), A(mbiguous) W/N/F/H can be always recognized as 2/1/2/1 bytes in byte sequence, but 'A' can not. Size of 'A' depends on language in which it is used. Unicode specification says: If the context(= language) cannot be established reliably they should be treated as narrow characters by default but many of class 'A' characters are full-width, at least, in Japanese environment. So, this patch treats class 'A' characters as full-width always for safety wrapping. This patch focuses only on MBCS safe-ness, not on writing/printing rule strict wrapping for each languages MBCS sensitive textwrap class is originally implemented by ITO Nobuaki <daydream.trippers@gmail.com>.	2010-06-06 17:20:10 +09:00
Matt Mackall	8d99be19f0	many, many trivial check-code fixups	2010-01-25 00:05:27 -06:00
Matt Mackall	595d66f424	Update license to GPLv2+	2010-01-19 22:20:08 -06:00
Dirkjan Ochtman	02b4677d86	encoding: fix issue with non-standard UTF-8 CTYPE on OS X	2009-10-10 12:00:43 +02:00
Simon Heimberg	09ac1e6c92	separate import lines from mercurial and general python modules	2009-04-28 17:40:46 +02:00
Martin Geisler	8e4bc1e9ad	put license and copyright info into comment blocks	2009-04-26 01:13:08 +02:00
Martin Geisler	750183bdad	updated license to be explicit about GPL version 2	2009-04-26 01:08:54 +02:00
Matt Mackall	642f4d7151	move encoding bits from util to encoding In addition to cleaning up util, this gets rid of some circular dependencies.	2009-04-03 14:51:48 -05:00
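
Commits 9cce255bec, 50b99d1a5a and 5b5a083f16 above describe computing terminal column width from each character's East Asian Width class, with 'A'mbiguous characters defaulting to narrow unless HGENCODINGAMBIGUOUS asks for the old behavior. The following is a minimal Python 3 sketch of that idea, not the code in encoding.py: the function name ucolwidth and the ambiguous_wide parameter are hypothetical stand-ins for illustration, and only unicodedata.east_asian_width is taken from the commit messages.

```python
import unicodedata


def ucolwidth(text, ambiguous_wide=False):
    """Estimate how many terminal columns a Unicode string occupies.

    Hypothetical helper for illustration only. Characters whose East
    Asian Width class is W(ide) or F(ull-width) count as two columns.
    A(mbiguous) characters count as one column by default, and as two
    when ambiguous_wide is True (standing in for the behavior the
    commits attach to HGENCODINGAMBIGUOUS).
    """
    wide_classes = "WFA" if ambiguous_wide else "WF"
    return sum(
        2 if unicodedata.east_asian_width(ch) in wide_classes else 1
        for ch in text
    )
```

Under these assumptions a plain ASCII string takes one column per character, a CJK string takes two per character, and an ambiguous character such as the plus-minus sign takes one column unless ambiguous_wide is set. Combining characters and control characters are not special-cased here.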


90 Commits