This copies the performance hack from encoding.lower (e7a5733d533f).
The case-folding logic that kicks in on case-insensitive filesystems
hits encoding.upper hard: with a repository with 75k files, the
timings went from
hg perfstatus
! wall 3.156000 comb 3.156250 user 1.625000 sys 1.531250 (best of 3)
to
hg perfstatus
! wall 2.390000 comb 2.390625 user 1.078125 sys 1.312500 (best of 5)
This is a 24% decrease. For comparison, Mercurial 2.0 gives:
hg perfstatus
! wall 2.172000 comb 2.171875 user 0.984375 sys 1.187500 (best of 5)
so we're only 10% slower than before we added the extra case-folding
logic.
The same decrease is seen when executing 'hg status' as normal, where
we go from:
hg status --time
time: real 4.322 secs (user 2.219+0.000 sys 2.094+0.000)
to
hg status --time
time: real 3.307 secs (user 1.750+0.000 sys 1.547+0.000)
When calling encode on a str, the string is first decoded using the
default encoding and then encoded. So
s.encode('ascii') == s.decode().encode('ascii')
We don't care about the encode step here -- we're just after the
UnicodeDecodeError raised by decode if it finds a non-ASCII character.
This way is also marginally faster since it saves the construction of
the extra str object.
If the default python encoding was changed from ascii, the attempt to
encode as ascii before lower() could throw a UnicodeEncodeError.
Catch UnicodeError instead to prevent an unhandled exception.
this patch uses encoding.lower/upper for case folding, because ones of
str can not fold case of non ascii characters correctly.
to avoid cyclic dependency and to encapsulate logic of normcase in
each platforms, this patch introduces encodinglower/encodingupper in
both posix/windows specific files.
this patch does not change implementation of normcase() in posix.py,
because we do not know the encoding of filenames on POSIX.
some "normcase()" are excluded from function wrap list in
hgext/win32mbcs.py, because they become encoding aware by this patch.
neither number of 'bytes' in any encoding nor 'characters' is
appropriate to calculate terminal columns for specified string.
this patch modifies MBTextWrapper for:
- overriding '_wrap_chunks()' to make it use not built-in 'len()'
but 'encoding.colwidth()' for columns of string
- fixing '_cutdown()' to make it use 'encoding.colwidth()' instead
of local, similar but incorrect implementation
this patch also modifies 'encoding.py':
- dividing 'colwith()' into 2 pieces: one for calculation columns of
specified UNICODE string, and another for rest part of original
one. the former is used from MBTextWrapper in 'util.py'.
- preventing 'colwidth()' from evaluating HGENCODINGAMBIGUOUS
configuration per each invocation: 'unicodedata.east_asian_width'
checking is kept intact for reducing startup cost.
localstr's hash method exists to prevent bogus matching on lossy local
encodings. For instance, we don't want 'caf?' to match 'café' in an
ASCII locale.
But when café can be losslessly encoded in the local charset, we can
simply use a normal string and avoid the hashing trick.
This avoids using localstr's hash method, which would prevent a match between
The current implementation of colwidth was treating 'A'mbiguous
characters as wide, which was incorrect in a non-East Asian context.
As per http://unicode.org/reports/tr11/#Recommendations, we should
instead default to 'narrow' if we don't know better. As character
width is dependent on the particular font used and we have no idea
what fonts are in use, this recommendation applies.
This introduces HGENCODINGAMBIGUOUS to get the old behavior back.
Prior to version 2.7, calling locale.getpreferredencoding() would
always return 'mac-roman' on Mac OS X. Previously, this was handled by
a call to locale.setlocale(). Unfortunately, Python 2.6.5 and older
have a bug where isspace() would incorrectly report True for 0x85 and
0xa0 after such a call.
In order to fix this, we replace the previous _encodingfixup mapping
to an _encodingfixers mapping. Rather than mapping encodings to their
replacement, it maps them to a function returning the
replacement. This allows us to provide an simplified implementation of
getpreferredencoding() which extracts the expected encoding and
restores the locale.
This fix is based on a patch originally submitted by Martijn Pieters
as well as feedback from Brodie Rao.
Mercurial has problem around text wrapping/filling in MBCS encoding
environment, because standard 'textwrap' module of Python can not
treat it correctly. It splits byte sequence for one character into two
lines.
According to unicode specification, "east asian width" classifies
characters into:
W(ide), N(arrow), F(ull-width), H(alf-width), A(mbiguous)
W/N/F/H can be always recognized as 2/1/2/1 bytes in byte sequence,
but 'A' can not. Size of 'A' depends on language in which it is used.
Unicode specification says:
If the context(= language) cannot be established reliably they
should be treated as narrow characters by default
but many of class 'A' characters are full-width, at least, in Japanese
environment.
So, this patch treats class 'A' characters as full-width always for
safety wrapping.
This patch focuses only on MBCS safe-ness, not on writing/printing
rule strict wrapping for each languages
MBCS sensitive textwrap class is originally implemented
by ITO Nobuaki <daydream.trippers@gmail.com>.