Unicode and Python's unicode.splitlines() treat several extra legacy
ASCII codepoints as linebreaks, even though the vast bulk of computing
and Python's own str.splitlines() do not. Rather than introduce line
numbering confusion, we filter them out when highlighting.
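A rough illustration of the mismatch, in Python 3 terms (where str plays the role of Python 2's unicode and bytes the role of the old str). The helper name strip_extra_linebreaks and the choice to drop the characters outright are assumptions made for this sketch, not the extension's actual code.

```python
# Characters that str.splitlines() treats as line boundaries in Python 3,
# beyond "\n", "\r" and "\r\n".
EXTRA_LINEBREAKS = "\x0b\x0c\x1c\x1d\x1e\x85\u2028\u2029"


def strip_extra_linebreaks(text):
    """Drop the extra boundary characters (hypothetical helper)."""
    return text.translate({ord(c): None for c in EXTRA_LINEBREAKS})


sample = "one\x0ctwo\nthree\n"

# Text splitting sees the form feed as a line break; byte splitting does not.
text_lines = sample.splitlines()
byte_lines = sample.encode("utf-8").splitlines()

# After filtering, both views agree on the number of lines.
cleaned = strip_extra_linebreaks(sample)
cleaned_lines = cleaned.splitlines()
```

With the form feed removed, the highlighted output has the same number of lines as the raw file, so line numbers stay aligned.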
The check pattern only checked for whitespace between keyword and operator.
Now it also warns:
> x = f(),7
missing whitespace after ,
> x = f()+7
missing whitespace in expression
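A minimal sketch of how such warnings could be produced with regular expressions. The patterns and the check_line helper below are assumptions for illustration and are deliberately naive (they would also fire inside string literals, for example); they are not claimed to be the exact patterns used by the checker.

```python
import re

# (pattern, message) pairs; both patterns are illustrative assumptions.
CHECKS = [
    (re.compile(r"(\w|\)),\w"), "missing whitespace after ,"),
    (re.compile(r"(\w|\))[+/*\-<>]\w"), "missing whitespace in expression"),
]


def check_line(line):
    """Return the warning messages that apply to one source line."""
    return [msg for pat, msg in CHECKS if pat.search(line)]


warnings_comma = check_line("x = f(),7")
warnings_expr = check_line("x = f()+7")
warnings_clean = check_line("x = f() + 7, 8")
```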
This patch treats all files in the repository as encoded in the
locale's encoding when running pygmentize.
We can assume that most files are written in the locale's encoding,
but the current implementation treats them as UTF-8,
so there is no way to specify the encoding of files.
Current implementation, e2f3244d5179 (issue1341):
1. Convert the original `text`, which is treated as UTF-8, to the locale's encoding.
   `encoding.tolocal()` is the method that converts from internal UTF-8 to local.
   If the original `text` is not UTF-8, e.g. Japanese EUC-JP, some characters
   become garbled here.
2. pygmentize, with no UnicodeDecodeError.
This patch:
1. Convert the original `text`, which is treated as being in the locale's
   encoding, to unicode.
   Pygments prefers unicode objects to raw str. [1]_
   If the original `text` is not in the locale's encoding, some characters
   become garbled here.
2. pygmentize, also with no UnicodeDecodeError :)
3. Convert unicode back to raw str, encoded in the locale's encoding.
.. [1] http://pygments.org/docs/unicode/
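The three steps of this patch can be sketched roughly as follows, in Python 3 terms (bytes for raw str, str for unicode). The function name highlight_in_local_encoding, the render callback, and the use of the "replace" error handler are assumptions for illustration; the stand-in renderer is used here so the sketch runs without Pygments installed.

```python
import locale


def highlight_in_local_encoding(raw, render, encoding=None):
    """Decode, render, and re-encode file contents (hypothetical helper).

    raw:      file contents as bytes
    render:   callable taking and returning text, e.g. a Pygments wrapper
    encoding: defaults to the locale's preferred encoding
    """
    encoding = encoding or locale.getpreferredencoding(False)
    # 1. Treat the bytes as locale-encoded; undecodable bytes are replaced
    #    rather than raising UnicodeDecodeError.
    text = raw.decode(encoding, "replace")
    # 2. Highlight the text.
    rendered = render(text)
    # 3. Convert back to bytes in the same encoding.
    return rendered.encode(encoding, "replace")


def fake_render(text):
    # Stand-in for a real highlighter, only for this example.
    return "<span>" + text + "</span>"


eucjp_source = "日本語".encode("euc_jp")
result = highlight_in_local_encoding(eucjp_source, fake_render, "euc_jp")
```

In real use, render could wrap Pygments' highlight(code, lexer, formatter) call with a lexer and formatter of your choice.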
Trying as much as possible to consistently:
- use a present tense predicate followed by a direct object
- verb referring directly to the functionality provided
(i.e. not "add command that does this" but simply "do that")
- keep simple and to the point, leaving details for the long help
(width is tight, possibly even more so for translations)
Thanks to timeless, Martin Geisler, Rafael Villar Burke, Dan Villiom
Podlaski Christiansen and others for the helpful suggestions.
Example case:
Display file written in iso-8859-1 with current HGENCODING utf-8.
At the moment only an Error page appears because pygmentize
chokes on the replacement chars.
Alternatives:
1) Turn off highlighting and avoid UnicodeDecodeError
for files that are not in HGENCODING.
2) [this patch] use util.tolocal to display these files.
Alternative 2) seems ok, as this only concerns display and
readability.
See also: c5f1a58b8b9a, apparently put aside during refactor of
highlight.
Add test for UnicodeDecodeError with iso-8859-1 file contents.
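The failure mode in the example case can be reproduced in a few lines. This is only a sketch in Python 3 terms; the variable names and the "replace" fallback are assumptions chosen to show why a tolerant conversion keeps the page displayable, not the code of the patch itself.

```python
# File contents written in iso-8859-1, while the expected encoding is UTF-8.
data = "café".encode("iso-8859-1")

# Strict decoding as UTF-8 fails on the lone 0xE9 byte.
try:
    data.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# A tolerant decode substitutes U+FFFD, so the file can still be displayed,
# at the cost of one garbled character.
fallback = data.decode("utf-8", "replace")
```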
Highlighting doesn't make much sense for non-HTML mimetypes. This also fixes the
issue that highlight unconditionally adds a <link/> tag for its CSS to
the template's header (which is pointless in text/plain output).
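One way to express such a guard, as a hedged sketch: the function name wants_highlight and the exact comparison against "text/html" are assumptions for illustration, not the extension's real check.

```python
def wants_highlight(mimetype):
    """Return True only for HTML output (hypothetical helper).

    Any parameters after ';' (such as a charset) are ignored.
    """
    base = mimetype.split(";", 1)[0].strip().lower()
    return base == "text/html"


html_ok = wants_highlight("text/html; charset=utf-8")
plain_ok = wants_highlight("text/plain")
```

When the guard returns False, both the highlighting pass and the CSS link tag can be skipped.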