Commit Graph

38 Commits

Author SHA1 Message Date
Augie Fackler
3c9e7fcc66 encoding: add hfsignoreclean to clean out HFS-ignored characters
According to Apple Technote 1150 (unavailable from Apple as far as I
can tell, but archived in several places online), HFS+ ignores sixteen
specific unicode runes when doing path normalization. We need to
handle those cases, so this function lets us efficiently strip the
offending characters from a UTF-8 encoded string (which is the only
way it seems to matter on OS X.)
2014-12-16 13:06:41 -05:00
FUJIWARA Katsunori
7120dc2e96 encoding: avoid cyclic dependency around "parsers" in pure Python build
39fbe33f95fa brought "asciilower" and "import parsers" into
"encoding.py".

This works fine with "parsers" module in C implementation, but doesn't
with one in pure Python implementation, because the latter causes
cyclic dependency below and aborting execution:

    util => i18n => encoding => parsers => util

This patch delays importing "parsers" module until it is really
needed, to avoid cyclic dependency around "parsers" in pure Python
build.
2014-10-17 02:07:04 +09:00
Siddharth Agarwal
8298f08c96 encoding.lower: use fast ASCII lower
This benefits, among other things, the case collision auditor.

On a Linux system with a large real-world repo where all filenames are ASCII,
hg perfcca:

before: wall 0.260157 comb 0.270000 user 0.230000 sys 0.040000 (best of 38)
after:  wall 0.164616 comb 0.160000 user 0.160000 sys 0.000000 (best of 54)
2014-10-03 18:45:56 -07:00
Siddharth Agarwal
e56ab5399b parsers: add a function to efficiently lowercase ASCII strings
We need a way to efficiently lowercase ASCII strings. For example, 'hg status'
needs to build up the fold map -- a map from a canonical case (for OS X,
lowercase) to the actual case of each file and directory in the dirstate.

The current way we do that is to try decoding to ASCII and then calling
lower() on the string, labeled 'orig' below:

    str.decode('ascii')
    return str.lower()

This is pretty inefficient, and it turns out we can do much better.

I also tested out a condition-based approach, labeled 'cond' below:

    (c >= 'A' && c <= 'Z') ? (c + ('a' - 'A')) : c

'cond' turned out to be slower in all cases. A 256-byte lookup table with
invalid values for everything past 127 performed similarly, but this was less
verbose.

On OS X 10.9 with LLVM version 6.0 (clang-600.0.51), the asciilower function
was run against two corpuses.

Corpus 1 (list of files from real-world repo, > 100k files):
orig:   wall 0.428567 comb 0.430000 user 0.430000 sys 0.000000 (best of 24)
cond:   wall 0.077204 comb 0.070000 user 0.070000 sys 0.000000 (best of 100)
lookup: wall 0.060714 comb 0.060000 user 0.060000 sys 0.000000 (best of 100)

Corpus 2 (mozilla-central, 113k files):
orig:   wall 0.238406 comb 0.240000 user 0.240000 sys 0.000000 (best of 42)
cond:   wall 0.040779 comb 0.040000 user 0.040000 sys 0.000000 (best of 100)
lookup: wall 0.037623 comb 0.040000 user 0.040000 sys 0.000000 (best of 100)

On a Linux server-class machine with GCC 4.4.6 20120305 (Red Hat 4.4.6-4):

Corpus 1 (real-world repo, > 100k files):
orig:   wall 0.260899 comb 0.260000 user 0.260000 sys 0.000000 (best of 38)
cond:   wall 0.054818 comb 0.060000 user 0.060000 sys 0.000000 (best of 100)
lookup: wall 0.048489 comb 0.050000 user 0.050000 sys 0.000000 (best of 100)

Corpus 2 (mozilla-central, 113k files):
orig:   wall 0.153082 comb 0.150000 user 0.150000 sys 0.000000 (best of 65)
cond:   wall 0.031007 comb 0.040000 user 0.040000 sys 0.000000 (best of 100)
lookup: wall 0.028793 comb 0.030000 user 0.030000 sys 0.000000 (best of 100)

SSE instructions might help even more, but I didn't experiment with those.
2014-10-03 18:42:39 -07:00
Matt Mackall
0b8e08e772 encoding: add json escaping filter
This ends up here because it needs to be somewhat encoding aware.
2014-09-15 13:12:49 -05:00
Matt Mackall
00a262f2bb encoding: handle empty string in toutf8 2014-09-15 13:12:20 -05:00
FUJIWARA Katsunori
b34bd803eb encoding: add 'leftside' argument into 'trim' to switch trimming side 2014-07-06 02:56:41 +09:00
FUJIWARA Katsunori
71717db270 encoding: add 'trim' to trim multi-byte characters at most specified columns
Newly added 'trim' is used to trim multi-byte characters at most
specified columns correctly: directly slicing byte sequence should be
replaced with 'encoding.trim', because the former may split at
intermediate multi-byte sequence.

Slicing unicode sequence ('uslice') and concatenation with ellipsis
('concat') are defined as function, to make enhancement in subsequent
patch easier.
2014-07-06 02:56:41 +09:00
Mads Kiilerich
403c97887d tests: stabilize doctest output
Avoid dependencies to dict iteration order.
2013-01-15 02:59:14 +01:00
Mads Kiilerich
2f4504e446 fix trivial spelling errors 2012-08-15 22:38:42 +02:00
Martin Geisler
5b013e2061 encoding: add fast-path for ASCII uppercase.
This copies the performance hack from encoding.lower (e7a5733d533f).

The case-folding logic that kicks in on case-insensitive filesystems
hits encoding.upper hard: with a repository with 75k files, the
timings went from

  hg perfstatus
  ! wall 3.156000 comb 3.156250 user 1.625000 sys 1.531250 (best of 3)

to

  hg perfstatus
  ! wall 2.390000 comb 2.390625 user 1.078125 sys 1.312500 (best of 5)

This is a 24% decrease. For comparison, Mercurial 2.0 gives:

  hg perfstatus
  ! wall 2.172000 comb 2.171875 user 0.984375 sys 1.187500 (best of 5)

so we're only 10% slower than before we added the extra case-folding
logic.

The same decrease is seen when executing 'hg status' as normal, where
we go from:

  hg status --time
  time: real 4.322 secs (user 2.219+0.000 sys 2.094+0.000)

to

  hg status --time
  time: real 3.307 secs (user 1.750+0.000 sys 1.547+0.000)
2012-07-23 15:55:26 -06:00
Martin Geisler
4f96956c09 encoding: use s.decode to trigger UnicodeDecodeError
When calling encode on a str, the string is first decoded using the
default encoding and then encoded. So

  s.encode('ascii') == s.decode().encode('ascii')

We don't care about the encode step here -- we're just after the
UnicodeDecodeError raised by decode if it finds a non-ASCII character.

This way is also marginally faster since it saves the construction of
the extra str object.
2012-07-23 15:55:22 -06:00
Cesar Mena
5d1ea9328c encoding: protect against non-ascii default encoding
If the default python encoding was changed from ascii, the attempt to
encode as ascii before lower() could throw a UnicodeEncodeError.
Catch UnicodeError instead to prevent an unhandled exception.
2012-04-22 21:27:52 -04:00
Matt Mackall
b7245bb05e encoding: add fast-path for ASCII lowercase 2012-04-10 12:07:18 -05:00
Matt Mackall
369755dc10 encoding: tune fast-path of tolocal a bit 2012-03-22 16:54:46 -05:00
Matt Mackall
feb580aa0d encoding: introduce utf8-b helpers 2012-02-20 16:42:45 -06:00
Mads Kiilerich
c180ed9101 encoding: use hint markup for "please check your locale settings"
This will also make test-encoding.t pass on windows. The test would hit some
other code path that already used hint markup.
2011-12-26 15:01:06 +01:00
FUJIWARA Katsunori
fe972435d4 i18n: use encoding.lower/upper for encoding aware case folding
this patch uses encoding.lower/upper for case folding, because ones of
str can not fold case of non ascii characters correctly.

to avoid cyclic dependency and to encapsulate logic of normcase in
each platforms, this patch introduces encodinglower/encodingupper in
both posix/windows specific files.

this patch does not change implementation of normcase() in posix.py,
because we do not know the encoding of filenames on POSIX.

some "normcase()" are excluded from function wrap list in
hgext/win32mbcs.py, because they become encoding aware by this patch.
2011-12-16 21:09:41 +09:00
Matt Mackall
9310b276b8 encoding: add getcols to extract substrings based on column width 2011-09-21 13:00:46 -05:00
Matt Mackall
9b84bd37fa encoding: colwidth input is in the local encoding 2011-09-21 13:00:41 -05:00
FUJIWARA Katsunori
5b5a083f16 i18n: calculate terminal columns by width information of each characters
neither number of 'bytes' in any encoding nor 'characters' is
appropriate to calculate terminal columns for specified string.

this patch modifies MBTextWrapper for:

  - overriding '_wrap_chunks()' to make it use not built-in 'len()'
    but 'encoding.colwidth()' for columns of string

  - fixing '_cutdown()' to make it use 'encoding.colwidth()' instead
    of local, similar but incorrect implementation

this patch also modifies 'encoding.py':

  - dividing 'colwith()' into 2 pieces: one for calculation columns of
    specified UNICODE string, and another for rest part of original
    one. the former is used from MBTextWrapper in 'util.py'.

  - preventing 'colwidth()' from evaluating HGENCODINGAMBIGUOUS
    configuration per each invocation: 'unicodedata.east_asian_width'
    checking is kept intact for reducing startup cost.
2011-08-27 04:56:12 +09:00
Augie Fackler
e16b528122 encoding: use getattr isntead of hasattr 2011-07-25 15:19:43 -05:00
Matt Mackall
f865cc3f06 encoding: add an encoding-aware lower function 2011-04-30 10:57:13 -05:00
Matt Mackall
1cf3cf83b1 encoding: avoid localstr when a string can be encoded losslessly (issue2763)
localstr's hash method exists to prevent bogus matching on lossy local
encodings. For instance, we don't want 'caf?' to match 'café' in an
ASCII locale.

But when café can be losslessly encoded in the local charset, we can
simply use a normal string and avoid the hashing trick.


This avoids using localstr's hash method, which would prevent a match between
2011-04-15 23:45:41 -05:00
Martin Geisler
dd0f217423 encoding: fix typo in variable name
The typo had no real effect, except for an unnecessary UTF-8 encoding.
2010-11-29 10:13:55 +01:00
Matt Mackall
c7059d3926 encoding: add localstr class to track UTF-8 version of transcoded strings
This allows UTF-8 strings to losslessly round-trip through Mercurial
2010-11-24 15:38:52 -06:00
Matt Mackall
50b99d1a5a encoding: default ambiguous character to narrow
The current implementation of colwidth was treating 'A'mbiguous
characters as wide, which was incorrect in a non-East Asian context.
As per http://unicode.org/reports/tr11/#Recommendations, we should
instead default to 'narrow' if we don't know better. As character
width is dependent on the particular font used and we have no idea
what fonts are in use, this recommendation applies.

This introduces HGENCODINGAMBIGUOUS to get the old behavior back.
2010-10-27 15:35:21 -05:00
Martin Geisler
77ce66fb6a check-code: find trailing whitespace 2010-10-20 10:13:04 +02:00
Brodie Rao
203cf2fbd9 cleanup: remove unused imports 2010-08-27 13:32:38 -04:00
Dan Villiom Podlaski Christiansen
d64d4dc9f0 encoding: improve handling of buggy getpreferredencoding() on Mac OS X
Prior to version 2.7, calling locale.getpreferredencoding() would
always return 'mac-roman' on Mac OS X. Previously, this was handled by
a call to locale.setlocale(). Unfortunately, Python 2.6.5 and older
have a bug where isspace() would incorrectly report True for 0x85 and
0xa0 after such a call.

In order to fix this, we replace the previous _encodingfixup mapping
to an _encodingfixers mapping. Rather than mapping encodings to their
replacement, it maps them to a function returning the
replacement. This allows us to provide an simplified implementation of
getpreferredencoding() which extracts the expected encoding and
restores the locale.

This fix is based on a patch originally submitted by Martijn Pieters
as well as feedback from Brodie Rao.
2010-08-14 01:30:54 +02:00
FUJIWARA Katsunori
9cce255bec replace Python standard textwrap by MBCS sensitive one for i18n text
Mercurial has problem around text wrapping/filling in MBCS encoding
environment, because standard 'textwrap' module of Python can not
treat it correctly. It splits byte sequence for one character into two
lines.

According to unicode specification, "east asian width" classifies
characters into:

   W(ide), N(arrow), F(ull-width), H(alf-width), A(mbiguous)


W/N/F/H can be always recognized as 2/1/2/1 bytes in byte sequence,
but 'A' can not. Size of 'A' depends on language in which it is used.

Unicode specification says:

   If the context(= language) cannot be established reliably they
   should be treated as narrow characters by default

but many of class 'A' characters are full-width, at least, in Japanese
environment.

So, this patch treats class 'A' characters as full-width always for
safety wrapping.

This patch focuses only on MBCS safe-ness, not on writing/printing
rule strict wrapping for each languages

MBCS sensitive textwrap class is originally implemented
by ITO Nobuaki <daydream.trippers@gmail.com>.
2010-06-06 17:20:10 +09:00
Matt Mackall
8d99be19f0 many, many trivial check-code fixups 2010-01-25 00:05:27 -06:00
Matt Mackall
595d66f424 Update license to GPLv2+ 2010-01-19 22:20:08 -06:00
Dirkjan Ochtman
02b4677d86 encoding: fix issue with non-standard UTF-8 CTYPE on OS X 2009-10-10 12:00:43 +02:00
Simon Heimberg
09ac1e6c92 separate import lines from mercurial and general python modules 2009-04-28 17:40:46 +02:00
Martin Geisler
8e4bc1e9ad put license and copyright info into comment blocks 2009-04-26 01:13:08 +02:00
Martin Geisler
750183bdad updated license to be explicit about GPL version 2 2009-04-26 01:08:54 +02:00
Matt Mackall
642f4d7151 move encoding bits from util to encoding
In addition to cleaning up util, this gets rid of some circular dependencies.
2009-04-03 14:51:48 -05:00