…when straddling an injection boundary.
When we do this one row at a time, the controlling layer for an indent query is
the comparison (previous) row, because that's where the query starts. The batch
version should be no different.
The new spec describes the exact scenario that revealed this bug.
…as overly broad.
Turns out there are some scenarios where you'd want to know about a layer whose
extent may include a given point, even if its content ranges _don't_ include
that point. I'd have realized this if I'd done something as simple as run the
test suite. Silly me.
But `controllingLayerAtPoint` will still enforce the content range constraint
because we use it to pick a winner for indent queries, fold queries, and so on.
…to consider content ranges instead of extents.
If you want to know which layers are operative at a given point, it makes no
sense to check a layer's extent. A `LanguageLayer` could extend over the entire
buffer and still only apply highlighting to one very small range in the buffer.
Thus it makes no sense for `controllingLayerAtPoint` not to check a layer's
content ranges, nor does it make sense for the `injectionLayersAtPoint` method
that it relies on.
If I felt like there was any use in knowing that a language layer extended over
a range that included the given point but wasn't operative at that point, I'd be
hesitant to change these methods. But I don't.
This bug was discovered when I realized that the wrong indents query was being
consulted in the following HTML…
<style>
#foo {
…when I moved the cursor to the end of the `<style>` element and hit `Enter`.
…when there's no indents query to help us out.
This seems to be the default behavior for plaintext and for other language
modes.
I've got a Tree-sitter Markdown grammar locally and I noticed it wasn't
preserving indentation when I was two levels deep into a bulleted list and hit
`Enter` — I had to re-indent each time. (The Markdown grammar doesn't include an
indents query because there's practically no way to predict indentation levels
in Markdown.)
…where such a parse could fail because the source text changes in the middle of
the parse.
The change I introduced yesterday made it possible for an async parse job to get
confused because its source text has changed in between async jobs. To prevent
this, the buffer text must remain constant during the parse, but should use the
most current buffer text _after_ the parse.
…when they haven't been re-parsed yet.
Say we've got an injection layer on rows 100-150. If a user inserts a newline
on line 1, we know for certain that that doesn't affect the parsed tree of that
injection layer, except that it needs to be edited to incorporate that change
so that its node positions are correct. So we wait to re-parse it because we
don't need to, even though we know the tree is technically dirty. We only need
to re-parse it when an edit occurs within its layer extent.
But each node has a `text` property that's actually just a getter. When `text`
is read, the node looks up its text using its `startIndex` and `endIndex`. If we
called `parse` with a string originally, then it's going to do its lookups
against a string that we know to be stale. Most `#match?` predicates would be
doomed to fail, and if the display layer needed to highlight any part of that
layer, it'd miss a lot of stuff.
Luckily, web-tree-sitter envisions that your buffer might be represented by a
strange data structure, and lets you specify a callback that is used when
the parser needs to get the contents of various buffer ranges. We can have that
callback do its lookups against a string copy of the buffer that we know to be
fresh, because we're updating it on every buffer change.
This is a big deal, because the alternative is having to re-parse every
injection layer on every transaction. Now we can safely run captures against
dirty trees when we know that the changes cannot have affected the structure of
the tree.
Predicates were getting pretty chaotic. With the namespace, it's now clearer
where a predicate may be defined and what its purpose is. It also helps resolve
ambiguity — e.g., both folds and highlights can have an `endAt` predicate, so
now they're disambiguated as `fold.endAt` and `adjust.endAt`, respectively.
Namespaces:
* `highlight` for highlight-specific settings (of which there is presently only
one)
* `test` for scope tests (currently used by highlights and indents)
* `adjust` for scope adjustments (highlights)
* `indent` for indent settings
* `fold` for fold settings
Right now, only the `test` namespace will be used by more than one kind of query
file, but I could imagine that changing in the future.
For now, tests and adjustments still work without the prepended namespace, but
I imagine I'll take that out even before we ship this feature experimentally.
Much easier to make big changes like this before anyone depends on them.
This also draws a much clearer line between `#set!` predicates with special
meaning in Pulsar… and those which are being used to set arbitrary data for
later use. For instance, if you see `(#set! isOnLeftSideOfAssignment true)`, you
know it must just be arbitrary data.
…along with a temporary `core.useExperimentalModernTreeSitter` setting.
If we truly planned to keep three different language modes around indefinitely,
changing `core.useTreeSitterParsers` to an enum would make sense. But we don't,
so it'd actually just be a gigantic pain in the ass to migrate one setting to
another of a different type.
When we ship modern-tree-sitter experimentally, we'll make it opt-in via the
temporary setting. When we make it the official tree-sitter implementation and
remove the legacy node-tree-sitter version, we'll remove the temporary setting
and just change the semantics around `core.useTreeSitterParsers`.
Reverting the addition of the `core.languageParser` setting is a chore, but it
prevents a _gigantic_ future headache.
It's possible to assign a node different scopes based on its contents, but we
need a way to detect these sorts of nodes so that we know how much of the buffer
to invalidate when that node's contents change.
Previously, we were simply finding the range of the deepest node in the tree for
a given edit and invalidating its entire region, in the hopes that this
heuristic would let us hide this complexity from users. Sadly, we can't get away
with that in terms of performance; the most specific node could end up being a
_very large_ node, and it's just not worth slowing everything else down just to
handle these silly edge cases.
Instead, a grammar author can mark certain queries in `highlights.scm` with
an `invalidateOnChange` property:
```scm
((comment) @comment.block.documentation.javadoc.java
(#match? @comment.block.documentation.javadoc.java "^/\\*\\*")
(#set! final true)
(#set! invalidateOnChange true))
((comment) @comment.block.java
(#match? @comment.block.java "^/\\*")
(#set! invalidateOnChange true))
```
Here we've got one scope for a JavaDoc (`/**`) block comment and one for a
regular (`/*`) block comment. Since a change in either kind can flip the scope
of the entire comment, both should be marked with `invalidateOnChange`.
If the world were perfect — if the web-tree-sitter bindings allowed it, or if
we could pre-process query files — we could set this property automatically
whenever a node both (a) defined a `#match?` predicate and (b) began and ended
on different rows. We'd be able to use that information to keep track of these
nodes on our own, as well as detect whether an arbitrary edit will actually
trigger a scope change. In the meantime, this is an unusual enough need that
we'll put the onus on the grammar author for now.
The comment in `src/web-tree-sitter.js` and the documentation in
`vendor/web-tree-sitter` explain why this change is needed. I hope it's
temporary.
I haven't removed our `web-tree-sitter` dependency in `package.json` because
it's a useful way of declaring (and remembering) which version of tree-sitter
we've built against, and because it lets us flip a single config flag in
`src/web-tree-sitter.js` to compare the behavior of our custom version with that
of the stock version. The cost of keeping the stock version around is under 300
kilobytes.
The custom builds will come from specialty branches on our tree-sitter fork.
When we upgrade to a new version, someone should follow the directions in
`vendor/web-tree-sitter/README.md` to create a new specialty branch.
When someone wants to build a `language-foo` package and finds that
`tree-sitter-foo` needs a function we don't yet export, they should be able to
see the warning we generate in the console and file a ticket. If we're on our
game, we should be able to generate a new web-tree-sitter build and get it into
a rolling release within a week or so.
I hope web-tree-sitter someday finds a way to fix this so that we no longer need
to do this chore. See https://github.com/tree-sitter/tree-sitter/issues/949.
The `syntaxQuery` key in a grammar definition file is the only query whose name
differs from its canonical filename, and there's no reason why that should be
the case. Might as well change it now before we ship instead of going through
the pain of changing it later.
`tags.scm` is a query file that comes standard in a tree-sitter repo, and is one
that we understand well enough to be able to use to identify symbols in the
file.
This is a big deal because it gives us an alternative to the `ctags`-based
symbol navigation that `symbols-view` has been using for ages.
The plan is to refactor `symbols-view` to do nothing except consume various
providers of symbols. We'd bundle a `ctags` provider and a tree-sitter-based
provider by default. LSP-based packages could also register as providers.
…and schedule them to be redone when the tree is clean.
Async parsing means we have to account for this. We tell the display layer when
ranges need to be re-highlighted, but we can't be sure when those tasks will be
scheduled, or if the tree will be clean when we do the required `highlights.scm`
captures.
We have several options:
1. When we run the query and notice that the tree is dirty, do an off-schedule
synchronous parse.
2. When `HighlightIterator#seek` notices the tree is dirty, cancel the highlight
job and reschedule it for whenever the tree is clean.
3. When `HighlightIterator#seek` notices the tree is dirty, proceed with the
highlight job but _also_ schedule the same job to run again when the tree is
clean.
#1 is tempting, but ultimately irresponsible if we don't know how long that
synchronous parse will take.
#2 is not technically possible, because the highlighting job will proceed no
matter what we do, so if we return early from `seek`, the display layer will
interpret that to mean the entire buffer range is plain text.
#3 is what I've implemented here. Captures against a dirty tree are likely to
be at least partially inaccurate, but since we have to go through with the job
anyway, we might as well allow it to proceed, thereby minimizing the flash of
unhighlighted text.
I can't think of a spec that will catch this situation reliably, but I'll give
it a shot later on.
…by ignoring any grammars that can't affect our injections.
This applies during startup; if a grammar that could inject into an open file
is added _after_ the open file is parsed and highlighted, we must repopulate
injections just in case the new grammar is applicable. But we should not react
to non-WASMTreeSitter grammars, nor grammars that do not define an
`injectionRegex`.
* A paste that ended in a newline would incorrectly adjust the row after the
end of the pasted text.
* A paste that needed no after-paste adjustment would incorrectly alter the
undo history.
This is a big one. All the tests pass and I haven't been able to make it
explode in ages, so it's ready for testing by others. But if you manage to make
it break, you can flip the `FEATURE_ASYNC_PARSE` flag back to `false`.
This sets a time limit (currently 3ms) for parsing; if the job needs more time
than that, we return a promise and keep working on the parsing in a series of
jobs that each take no more than 3ms, yielding in between. The editor stays
responsive and the highlighting will catch up when it's able.
The system is able to account for edits that are made during an async parse,
and will keep re-parsing to handle those edits until the tree is clean.
In practice, this has been surprisingly pleasant to use, even on very large
files; and smaller files are unlikely to exceed that 3ms threshold very often.
We can play around with the exact threshold. If we make it smaller, then
theoretically the UI gets even snappier, but there's more overhead when the
parsing is divided into smaller tasks. To keep the UI snappy, we'd want to limit
the threshold to some fraction of the 16.67ms that we're given with each
UI frame, keeping in mind that there's plenty of work to do in that allotted
time _after_ the tree gets parsed.
This is _preliminary_ and is _not working correctly_ yet. It is possible to get
into a situation where the parses can't keep up and things start breaking. But
if history is any indication, even the smallest of changes could be the
difference between “hopelessly broken” and “working perfectly,” so I'm not
discouraged.
It's amazing to me that we've been able to get this far without really _needing_
async parsing. I notice just a bit of lag on `text-editor.js`, and that's a
6,000-line file. But it's inevitable: tree-sitter relies on a complete parse
of the entire source file, so it can only get slower as a file gets bigger, and
at a certain point the parsing would have to go async just to keep the editor
responsive.
The threshold for going async is important, since async introduces quite a bit
of overhead. For instance, if you set the threshold at 5ms, a parse that
would've taken 5.5ms synchronously can be expected to take at least 15ms async.
Part of this is due to the pause in between jobs, yielding to the UI; part of it
is just the overhead of having to resume the job later.
I've introduced this behind a feature flag because, like async indent, it's too
big of a feature to introduce all at once, and I can't just keep it stashed all
the time while I alternate between it and other fixes.
…via the same machinery as our post-insertion auto-indent.
The `preserveTrailingLineIndentation` option is confusingly named. Its purpose
is to say this: “Don't auto-indent every line in the pasted text. Insert the
first line at the proper indent level for the context, but otherwise preserve
the relative indentation of each line.” This is what the
`editor:paste-without-formatting` command does.
We already have a workflow where `TextEditor` asks a language mode for the
correct indent levels for a range of lines, so we can add a mode to
`WASMTreeSitterLanguageMode#suggestedIndentForBufferRows` that implements the
logic of `preserveTrailingLineIndentation` and reports indent levels
accordingly.
…unless there's an indents query to inform a suggestion. Return `null` instead,
which we'll interpret as taking no action.
Before this, the default suggestion was to match the indentation of the previous
line, but that's not always the right answer.