Since highlight is only relevant for servers, it seems worthwhile to
just trigger this eagerly, which avoids really weird traceback
problems caused by demandimport messing with some of the lexer plugins.
Differential Revision: https://phab.mercurial-scm.org/D1619
I tripped on some weirdness relating to _thread vs threading way down
in a dep of highlight recently. I'm not really sure why I'm only just
seeing this defect now, but experimentally this fixes the problem, and
shouldn't cause any load-time slowness for people until pygments is
actually about to be used since highlight.highlight is still lazily
loaded in the highlight/__init__.py file.
I've caught multiple extensions in the wild lying about being
'internal', so it's time to move the goalposts on people. Goalpost
moving will continue until third party extensions stop trying to
defeat the system.
This patch adds absolute_import and print_function to the files where
they are missing, and follows the modern import conventions.
When Mozilla enabled Pygments on hg.mozilla.org, we got a lot of weirdly
colorized files. Upon further investigation, the highlight extension
first attempts a filename+content based match, then falls back to a
purely content-driven detection mode in Pygments. Sounds good in theory.
Unfortunately, Pygments' content-driven detection establishes no minimum
threshold for returning a lexer. Furthermore, the detection code for
a number of languages is very liberal. For example, ActionScript 3 will
return a confidence of 0.3 (out of 1.0) if the first 1k of the file
we pass in matches the regex "\w+\s*:\s*\w"! Python matches on
"import ". It's no coincidence that a number of our extension-less files
were getting highlighted improperly.
This patch adds an option to have the highlighter not fall back to
purely content-based detection when filename+content detection failed.
This can be enabled to render unhighlighted text instead of taking the risk
that unknown file types are highlighted incorrectly. The old behavior is
still the default.
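As a rough illustration of why the content-only fallback misfires, here is a
toy detector built from the ActionScript 3 regex and 0.3 score quoted above.
The names as3_confidence, pick_lexer and guess_by_content are hypothetical;
this is not the Pygments API or the hgweb option itself, only a sketch of the
decision being made.

```python
import re

# Toy stand-in for a liberal content detector. The regex and the 0.3 score
# are the ActionScript 3 figures quoted above; everything else is hypothetical.
AS3_HINT = re.compile(r"\w+\s*:\s*\w")


def as3_confidence(text):
    # Only the start of the file is inspected (roughly the first 1k).
    return 0.3 if AS3_HINT.search(text[:1000]) else 0.0


def pick_lexer(filename_match, text, guess_by_content=True):
    """Return a lexer name, or None to leave the text unhighlighted."""
    if filename_match is not None:
        return filename_match
    if not guess_by_content:
        return None
    # No minimum threshold: any non-zero score is accepted.
    return "actionscript3" if as3_confidence(text) > 0 else None


readme = "TODO: write the release notes\n"
old_behavior = pick_lexer(None, readme)
new_behavior = pick_lexer(None, readme, guess_by_content=False)
```

With the fallback on, the extension-less note is labeled ActionScript 3; with
it off, the file is simply left unhighlighted.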
The highlight extension lacked a way to limit files by size, by extension, and/or
by any other part of the file path. A good solution would be to use a fileset,
since it can check file path, extension and size (and more) in one expression.
So this change introduces such an option, highlightfiles, which takes a fileset
and on each request decides if the requested file should be highlighted.
The default "size('<5M')" is, in a way, suggested in issue3005.
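The kind of per-request decision a fileset expresses can be sketched in plain
Python. This is not the fileset machinery; should_highlight, max_bytes and
skip_suffixes are hypothetical names, and the sketch assumes "5M" means
5 * 2**20 bytes.

```python
def should_highlight(path, size, max_bytes=5 * 2**20, skip_suffixes=()):
    """Decide for one file whether highlighting is worth attempting."""
    # Size limit, mirroring the spirit of size('<5M').
    if size >= max_bytes:
        return False
    # Optional path-based exclusion, e.g. minified or generated files.
    return not path.endswith(tuple(skip_suffixes))


small_source = should_highlight("mercurial/util.py", 120 * 1024)
huge_log = should_highlight("build/output.log", 6 * 2**20)
minified = should_highlight("static/app.min.js", 10, skip_suffixes=(".min.js",))
```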
checkfctx() limits the amount of work to just one file (subset kwarg in
fileset.matchctx()).
Monkey-patching works around issue4568, otherwise using filesets here while
running hgweb in directory mode would say, for example, "Abort: **.py not under
root", but this fix is very local and probably far from ideal. I suspect there
to be a way to fix this for the whole hgweb and resolve the issue, but I don't
know how to do it.
When the highlight extension encountered files that pygments didn't recognize,
it used to fall back to the text lexer. Also, pygments uses TextLexer for .txt
files. This lexer is a no-op by design.
On bigger files, however, doing the noop highlighting resulted in noticeable
extra CPU work and memory usage: to show a 1 MB text file, hgweb required about
0.7s more (on top of ~3.8s, Q8400) and consumed about 100 MB of RAM more (on
top of ~150 MB).
Let's just exit the function when it's clear that nothing will be highlighted.
Due to how this pygmentize function works (it modifies the template in-place),
we can just return from it and everything else will work as if highlight
extension wasn't enabled.
Due to how the colorized output from pygments was stripped of <pre> elements,
when there was an empty line at the end of a file, the highlight extension
produced incorrect markup (no closing tags from the fileline/annotateline template).
It wasn't usually noticeable, because browsers were smart enough to see where
the missing tags should've been, but in monoblue style it resulted in the last
line having twice the normal height.
Instead of awkwardly trying to strip outer <pre></pre> tags, let's make the
formatter with nowrap=True, which should do what we need in pygments since at
least 0.5 (2006-10-30).
Example from monoblue style:
Before:
<div class="source">
<div style="font-family:monospace" class="parity0">
<pre><a class="linenr" href="#l1" id="l1"> 1</a> </pre>
</div>
<div style="font-family:monospace" class="parity1">
<pre><a class="linenr" href="#l2" id="l2"> 2</a>
</div>
Now:
<div class="source">
<div style="font-family:monospace" class="parity0">
<pre><a class="linenr" href="#l1" id="l1"> 1</a> </pre>
</div>
<div style="font-family:monospace" class="parity1">
<pre><a class="linenr" href="#l2" id="l2"> 2</a> </pre>
</div>
</div>
(Notice the missing </pre></div> now in place)
One of the features of hgweb is that current position in repo history is
remembered between separate requests. That is, links from /rev/<node_hash> lead
to /file/<node_hash> or /log/<node_hash>, so it's easy to dig deep into the
history. However, such links could only use node hashes and local revision
numbers, so while staying at one exact revision is easy, staying on top of the
changes is not, because hashes presumably can't change (local revision numbers
can, but probably not in a way you'd find useful for navigating).
So while you could use 'tip' or 'default' in a url, links on that page would be
permanent. This is not always desired (think /rev/tip or /graph/stable or
/log/@) and is sometimes just confusing (e.g. /log/<not the tip hash>, when
recent history is not displayed). And if the user deliberately changed the url
to say default instead of <some node hash>, the page ignores that fact and uses
the node hash in its links, which means that navigation is, in a way, broken.
This new property, symrev, is used for storing current revision the way it was
specified, so then templates can use it in links and thus "not dereference" the
symbolic revision. It is an additional way to produce links, so not every link
needs to drop {node|short} in favor of {symrev}, many will still use node hash
(log and filelog entries, annotate lines, etc).
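A minimal sketch of the difference between the two kinds of link, assuming a
hypothetical helper rev_link (hgweb builds these links in templates, not with
a function like this):

```python
def rev_link(command, node, symrev=None):
    """Build a link that keeps the symbolic revision when one was given."""
    # Fall back to the 12-digit short hash, as {node|short} would render it.
    rev = symrev if symrev else node[:12]
    return "/%s/%s" % (command, rev)


node = "0123456789abcdef0123456789abcdef01234567"
symbolic = rev_link("log", node, symrev="tip")
permanent = rev_link("file", node)
```

The symbolic link keeps following 'tip' as history grows, while the permanent
one stays pinned to a single changeset.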
Some pages (e.g. summary, tags) always use the tip changeset for their context,
in such cases symrev is set to 'tip'. This is needed in case the pages want to
provide archive links.
highlight extension needs to be updated, since _filerevision now takes an
additional positional argument (signature "web, req, tmpl" is used by most of
webcommands.py functions).
More references to symbolic revisions and related gripes: issue2296, issue2826,
issue3594, issue3634.
Extension authors (notably at companies using hg) have been
cargo-culting the `testedwith = 'internal'` bit from hg's own
extensions, which then defeats our "file bugs over here" logic in
dispatch. Let's be more aggressive about trying to give extension
authors a hint about what testedwith should say.
Unicode and Python's unicode.splitlines() treat several extra legacy
ASCII codepoints as linebreaks, even though the vast bulk of computing
and Python's own str.splitlines() do not. Rather than introduce line
numbering confusion, we filter them out when highlighting.
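The mismatch is easy to reproduce. The sketch below uses Python 3 names, where
str plays the role of Python 2's unicode and bytes plays the role of Python 2's
str. The character list and strip_extra_breaks are illustrative assumptions,
not necessarily the exact set the extension filters.

```python
# ASCII control characters that unicode-aware splitlines() treats as line
# boundaries but byte-oriented splitlines() does not (illustrative list).
EXTRA_BREAKS = "\x0b\x0c\x1c\x1d\x1e"


def strip_extra_breaks(text):
    """Drop the extra break characters so both views agree on line count."""
    return text.translate({ord(c): None for c in EXTRA_BREAKS})


sample = "one\x0ctwo\nthree\x1cfour\n"
unicode_lines = sample.splitlines()
byte_lines = sample.encode("ascii").splitlines()
filtered_lines = strip_extra_breaks(sample).splitlines()
```

Without filtering, the two views disagree on the number of lines, so
highlighted output would no longer line up with the line numbers.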
The check pattern only checked for whitespace between keyword and operator.
Now it also warns:
> x = f(),7
missing whitespace after ,
> x = f()+7
missing whitespace in expression
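A rough sketch of how such checks can be expressed as regular expressions. The
two patterns here are approximations written for illustration and may differ
from the ones in the actual check script; they are naive and would also fire
inside string literals.

```python
import re

# (pattern, message) pairs; the patterns are illustrative approximations.
CHECKS = [
    (re.compile(r"(\w|\)),\w"), "missing whitespace after ,"),
    (re.compile(r"(\w|\))[+/*\-<>]\w"), "missing whitespace in expression"),
]


def check_line(line):
    """Return the warning messages triggered by one source line."""
    return [msg for pattern, msg in CHECKS if pattern.search(line)]


comma_warning = check_line("x = f(),7")
operator_warning = check_line("x = f()+7")
clean = check_line("x = f(), 7 + y")
```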
This patch treats all files inside the repository as encoded in the
locale's encoding when pygmentizing.
We can assume that most files are written in the locale's encoding,
but the current implementation treats them as UTF-8,
so there is no way to specify the encoding of files.
Current implementation, e2f3244d5179 (issue1341):
1. Convert original `text`, which is treated as UTF-8, to locale's encoding.
`encoding.tolocal()` is the method to convert from internal UTF-8 to local.
If original `text` is not UTF-8, e.g. Japanese EUC-JP, some characters
become garbled here.
2. pygmentize, with no UnicodeDecodeError.
This patch:
1. Convert original `text`, which is treated as locale's encoding, to unicode.
Pygments prefers unicode objects over raw str. [1]_
If original `text` is not encoded in the locale's encoding, some characters
become garbled here.
2. pygmentize, also with no UnicodeDecodeError :)
3. Convert unicode back to raw str, encoded in the locale's encoding.
.. [1] http://pygments.org/docs/unicode/
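The three steps of this patch can be sketched with the standard codecs alone.
pygmentize_roundtrip and fake_highlight are hypothetical names, and the
highlighter is a stand-in callable rather than Pygments, so this only shows the
decode/encode round trip, in Python 3 terms (bytes for raw str, str for
unicode).

```python
def pygmentize_roundtrip(raw, local_encoding, highlight):
    # 1. Treat the raw bytes as locale-encoded and decode them to unicode.
    text = raw.decode(local_encoding, "replace")
    # 2. Highlight the unicode text (stand-in callable, not Pygments).
    marked_up = highlight(text)
    # 3. Encode the result back to bytes in the locale's encoding.
    return marked_up.encode(local_encoding, "replace")


def fake_highlight(text):
    return "<span>" + text + "</span>"


raw = "日本語".encode("euc_jp")
result = pygmentize_roundtrip(raw, "euc_jp", fake_highlight)

# Treating the same bytes as UTF-8 instead garbles them.
garbled = raw.decode("utf-8", "replace")
```

When the locale's encoding matches the file, the original bytes survive the
round trip untouched; decoding them as UTF-8 yields replacement characters.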
Trying as much as possible to consistently:
- use a present tense predicate followed by a direct object
- verb referring directly to the functionality provided
(i.e. not "add command that does this" but simply "do that")
- keep simple and to the point, leaving details for the long help
(width is tight, possibly even more so for translations)
Thanks to timeless, Martin Geisler, Rafael Villar Burke, Dan Villiom
Podlaski Christiansen and others for the helpful suggestions.
Example case:
Display file written in iso-8859-1 with current HGENCODING utf-8.
At the moment only an Error page appears because pygmentize
chokes on the replacement chars.
Alternatives:
1) Turn off highlighting and avoid UnicodeDecodeError
for files that are not in HGENCODING.
2) [this patch] use util.tolocal to display these files.
Alternative 2) seems ok, as this only concerns display and
readability.
See also: c5f1a58b8b9a, apparently put aside during refactor of
highlight.
Add test for UnicodeDecodeError with iso-8859-1 file contents.
For non-html mimetypes it doesn't make much sense. This also fixes the
issue that highlight unconditionally adds a <link/> tag for its CSS to
the template's header (which is pointless in text/plain output).