Since 1d07d9da84a0, pycompat.bytestr() wrapped by win32mbcs returns
unicode object, if an argument is not byte-str object. And this causes
unexpected failure at colorization.
pycompat.bytestr() is used to convert from color effect "int" value to
byte-str object in color module. Wrapped pycompat.bytestr() returns
unicode object for such "int" value, because it isn't byte-str.
If this returned unicode object is used to colorize non-ASCII byte-str
in cases below, UnicodeDecodeError is raised at an operation between
them.
- colorization uses "ansi" color mode, or
Even though this isn't default on Windows, user might use this
color mode for third party pager.
- ui.write() is buffered with labeled=True
Buffering causes "ansi" color mode internally, regardless of
actual color mode. With "win32" color mode, extra escape sequences
are omitted at writing data out.
For example, with "win32" color mode, "hg status" doesn't fail for
non-ASCII filenames, but "hg log" does for non-ASCII text, because
the latter implies buffered formatter.
There are many "color effect" value lines in color.py, and making them
byte-str objects isn't suitable for fixing on stable. In addition to
it, pycompat.bytestr will be used to get byte-str object from any
types other than int, too.
To resolve this issue, this patch does:
- replace pycompat.bytestr in checkwinfilename() with newly added
hook point util._filenamebytestr, and
- make win32mbcs reverse-wrap util._filenamebytestr
(this is a replacement of 1d07d9da84a0)
This patch does two things above at same time, because separately
applying the former change adds broken revision (from point of view of
win32mbcs) to stable branch.
"_" prefix is added to "filenamebytestr", because it is win32mbcs
specific hook point.
win32mbcs wraps some functions, to prevent them from unintentionally
treating backslash (0x5c), which is used as the second or later byte
of multi bytes characters by problematic encodings, as a path
component delimiter on Windows platform.
This wrapping assumes that wrapped functions can safely accept unicode
string arguments.
Unfortunately, 440868900036 broke this assumption by introducing
pycompat.bytestr() into util.checkwinfilename() for py3 support. After
that, wrapped checkwinfilename() always fails for non-ASCII filename
at pycompat.bytestr() invocation.
This patch wraps underlying pycompat.bytestr() function to use
util.checkwinfilename() safely.
To avoid similar regression in the future, another patch series will
add smoke testing on default branch.
See the previous commit for why. Marked as API change since osutil.listdir()
seems widely used in third-party extensions.
The win32mbcs extension is updated to wrap both util. and windows. aliases.
I always read the name "checkcase(path)" as "do we need to check for
case folding at this path", but it's actually (I think) meant to be
read "check if the file system cares about case at this path". I'm
clearly not the only one confused by this as the dirstate has this
property:
def _checkcase(self):
return not util.checkcase(self._join('.hg'))
Maybe we should even inverse the function and call it fscasefolding()
since that's what all callers care about?
I've caught multiple extensions in the wild lying about being
'internal', so it's time to move the goalposts on people. Goalpost
moving will continue until third party extensions stop trying to
defeat the system.
Before this patch, "missing _() in ui message" rule overlooks
translatable message, which starts with other than alphabet.
To detect "missing _() in ui message" more exactly, this patch
improves the regexp with assumptions below.
- sequence consisting of below might precede "translatable message"
in same string token
- formatting string, which starts with '%'
- escaped character, which starts with 'b' (as replacement of '\\'), or
- characters other than '%', 'b' and 'x' (as replacement of alphabet)
- any string tokens might precede a string token, which contains
"translatable message"
This patch builds an input file, which is used to examine "missing _()
in ui message" detection, before '"$check_code" stringjoin.py' in
test-contrib-check-code.t, because this reduces amount of change churn
in subsequent patch.
This patch also applies "()" instead of "_()" on messages below to
hide false-positives:
- messages for ui.debug() or debug commands/tools
- contrib/debugshell.py
- hgext/win32mbcs.py (ui.write() is used, though)
- mercurial/commands.py
- _debugchangegroup
- debugindex
- debuglocks
- debugrevlog
- debugrevspec
- debugtemplate
- untranslatable messages
- doc/gendoc.py (ReST specific text)
- hgext/hgk.py (permission string)
- hgext/keyword.py (text written into configuration file)
- mercurial/cmdutil.py (formatting strings for JSON)
Since (b) is banned, we should do the same for (a) for consistency.
a) from mercurial import hg
from mercurial.i18n import _
b) from . import hg
from .i18n import _
The home of 'Abort' is 'error' not 'util' however, a lot of code seems to be
confused about that and gives all the credit to 'util' instead of the
hardworking 'error'. In a spirit of equity, we break the cycle of injustice and
give back to 'error' the respect it deserves. And screw that 'util' poser.
For great justice.
Extension authors (notably at companies using hg) have been
cargo-culting the `testedwith = 'internal'` bit from hg's own
extensions, which then defeats our "file bugs over here" logic in
dispatch. Let's be more aggressive about trying to give extension
authors a hint about what testedwith should say.
This changeset fix the problem to use win32mbcs with mercurial 2.3 or
later.
The problem is brought by side effect of modification of
encoding.upper() (changeset 17236:8b2442729eb3) because upper() does
not accept unicode string argument. So wrapped util.normcase() which
uses upper() will fail. In other words, upper() and lower() are
unicode incompatible.
To fix this issue, this changeset adds new wrapper for reversed
conversion (unicode to str) for lower() and upper() to use them
safely.
this patch allows win32mbcs extension to be enabled on cygwin platform
for problematic character encodings.
on recent cygwin platform, even though
"os.path.supports_unicode_filenames" is False, "os.listdir()" and
other path manipulation functions can return the result correctly
decoded in unicode for invocations with unicode arguments, if locale
is configured properly.
existing code to check "os.path.supports_unicode_filenames" is kept to
prevent win32mbcs from being enabled on unexpected platform.
this patch uses encoding.lower/upper for case folding, because ones of
str can not fold case of non ascii characters correctly.
to avoid cyclic dependency and to encapsulate logic of normcase in
each platforms, this patch introduces encodinglower/encodingupper in
both posix/windows specific files.
this patch does not change implementation of normcase() in posix.py,
because we do not know the encoding of filenames on POSIX.
some "normcase()" are excluded from function wrap list in
hgext/win32mbcs.py, because they become encoding aware by this patch.
this patch uses upper() instead of lower() or os.path.normcase() for
case folding on Windows(NTFS), because lower-ing causes problems for
some languages on it.
see below for detail about problem of lower-ing:
https://blogs.msdn.com/b/michkap/archive/2005/01/16/353873.aspx
util.checkwinfilename() and util.checkosfilename() are wrapped.
The former, used by scmutil, is not friendly for Shift_JIS bytes.
The latter is an dynamic alias of checkwinfilename() used by other
modules.
Trying as much as possible to consistently:
- use a present tense predicate followed by a direct object
- verb referring directly to the functionality provided
(ie. not "add command that does this" but simple "do that")
- keep simple and to the point, leaving details for the long help
(width is tight, possibly even more so for translations)
Thanks to timeless, Martin Geisler, Rafael Villar Burke, Dan Villiom
Podlaski Christiansen and others for the helpful suggestions.