Commit Graph

32 Commits

Author SHA1 Message Date
Ali Mohammad Pur
5e1499d104 Everywhere: Rename {Deprecated => Byte}String
This commit un-deprecates DeprecatedString, and repurposes it as a byte
string.
As the null state has already been removed, there are no other
particularly hairy blockers in repurposing this type as a byte string
(what it _really_ is).

This commit is auto-generated:
  $ xs=$(ack -l \bDeprecatedString\b\|deprecated_string AK Userland \
    Meta Ports Ladybird Tests Kernel)
  $ perl -pie 's/\bDeprecatedString\b/ByteString/g;
    s/deprecated_string/byte_string/g' $xs
  $ clang-format --style=file -i \
    $(git diff --name-only | grep \.cpp\|\.h)
  $ gn format $(git ls-files '*.gn' '*.gni')
2023-12-17 18:25:10 +03:30
Ali Mohammad Pur
4d71f4edc4 LibRegex: Don't add the Repeat instruction size to its jump target
This was causing the calculated jump target to become invalid, leading
to possibly invalid optimisations and (more likely) crashes.
Fixes #21047.
2023-09-15 18:07:23 +03:30
Ali Mohammad Pur
4d27257c45 LibRegex: Treat backwards jumps to IP 0 as normal backwards jumps too
This shows up in something like /\d+|x/, where the `+` ends up jumping
to the start of its own alternative.
2023-08-16 22:20:24 +03:30
Ali Mohammad Pur
e689422564 LibRegex: Keep track of instruction positions for backwards tree jumps 2023-08-05 16:40:04 +02:00
Ali Mohammad Pur
4e69eb89e8 LibRegex: Generate a search tree when patterns would benefit from it
This takes the previous alternation optimisation and applies it to all
the alternation blocks instead of just the few instructions at the
start.
By generating a trie of instructions, all logically equivalent
instructions will be consolidated into a single node, allowing the
engine to avoid checking the same thing multiple times.
For instance, given the pattern /abc|ac|ab/, this optimisation would
generate the following tree:
    - a
    | - b
    | | - c
    | | | - <accept>
    | | - <accept>
    | - c
    | | - <accept>
which will attempt to match 'a' or 'b' only once, and would also limit
the number of backtrackings performed in case alternatives fails to
match.

This optimisation is currently gated behind a simple cost model that
estimates the number of instructions generated, which is pessimistic for
small patterns, though the change in performance in such patterns is not
particularly large.
2023-07-31 05:31:33 +02:00
Ali Mohammad Pur
18f4b6c670 LibRegex: Add the literal string search optimisation
This switches to using a simple string equality check if the regex
pattern is strictly a string literal.
Technically this optimisation can also be made on bounded literal
patterns like /[abc]def/ or /abc|def/ as well, but those are
significantly more complex to implement due to our bytecode-only
approach.
2023-07-31 05:31:33 +02:00
Ali Mohammad Pur
06573cd46d LibRegex: Enable the atomic rewrite optimisation for unicode properties 2023-07-14 08:59:19 +02:00
Ali Mohammad Pur
fb262de7cb LibRegex: Make append_alternation() actually skip the common blocks
Previously we started with 'left_skip' set to zero, which made it so
that no blocks were selected to be skipped.
2023-06-14 06:41:17 +02:00
Ali Mohammad Pur
b1ca2e5e39 LibRegex: Do not treat repeats followed by fallthroughs as atomic 2023-06-14 06:41:17 +02:00
Ali Mohammad Pur
7f530c0753 LibRegex: Bail out of atomic rewrite if a block doesn't contain compares
If a block jumps before performing a compare, we'd need to recursively
find the first of the jumped-to block. While this is doable, it's not
really worth spending the time as most such cases won't actually qualify
for atomic loop rewrite anyway.
Fixes an invalid rewrite when `.+` is followed by an alternation, e.g.
/.+(a|b|c)/.
2023-02-15 10:14:26 +01:00
Ali Mohammad Pur
af441bb939 LibRegex: Consider the inverse=true case when finding pattern overlap
Previously we were only checking for overlap when the range wasn't in
inverse mode, which made us miss things like /[^x]x/; this patch makes
it so we don't miss that.
2023-02-15 10:14:26 +01:00
Ben Wiederhake
8a331d4fa0 Everywhere: Move AK/Debug.h include to using files or remove 2023-01-02 20:27:20 -05:00
Ali Mohammad Pur
598dc74a76 LibRegex: Partially implement the ECMAScript unicodeSets proposal
This skips the new string unicode properties additions, along with \q{}.
2022-07-20 21:25:59 +01:00
Ali Mohammad Pur
fe46b2c141 LibRegex: Correctly track current inversion state in the optimizer
This is currently not important as we do not nest TemporaryInverse.
2022-07-10 14:26:03 +02:00
Ali Mohammad Pur
9c5febe800 LibRegex: Flush compare tables before entering a permanent inverse state 2022-07-10 14:26:03 +02:00
Ali Mohammad Pur
7d01ee63d6 LibRegex: Use proper CharRange constructor instead of bit_casting
Otherwise the range order would be inverted.
2022-07-05 07:19:13 +02:00
Ali Mohammad Pur
6e655b7f89 LibRegex: Fully interpret the Compare Op when looking for overlaps
We had a really naive and simplistic implementation, which lead to
various issues where the optimiser incorrectly rewrote the regex to use
atomic groups; this commit fixes that.
2022-07-04 23:09:53 +02:00
Ali Mohammad Pur
97a333608e LibRegex: Make codegen+optimisation for alternatives much faster
Just a little thinking outside the box, and we can now parse and
optimise a million copies of "a|" chained together in just a second :^)
2022-02-20 11:53:59 +01:00
Ali Mohammad Pur
3b0943d24c LibRegex: Correct the alternative matching order when one is empty
Previously we were compiling `/a|/` into what effectively would be
`/|a`, which is clearly incorrect.
2022-02-14 11:30:50 +01:00
Ali Mohammad Pur
6a4c8a66ae LibRegex: Only skip full instructions when optimizing alternations
It makes no sense to skip half of an instruction, so make sure to skip
only full instructions!
2022-02-09 21:02:24 +00:00
Ali Mohammad Pur
cd83325c7c LibRegex: Preserve capture groups and matches across ForkReplace
This makes the (flawed) ForkStay inserted as a loop header unnecessary,
and finally fixes LibRegex rewriting weird loops in weird ways.
2022-01-22 00:35:49 +00:00
Ali Mohammad Pur
bfe8f312f3 LibRegex: Correct jump offset to the start of the loop block
Previously we were jumping to the new end of the previous block (created
by the newly inserted ForkStay), correct the offset to jump to the
correct block as shown in the comments.
Fixes #12033.
2022-01-21 18:14:08 +03:30
Hendiadyoin1
b674de6957 LibRegex: Add some implied auto qualifiers 2021-12-21 18:17:28 -08:00
Ali Mohammad Pur
b8f03bb072 LibRegex: Make append_alternation() significantly faster
...by flattening the underlying bytecode chunks first.
Also avoid calling DisjointChunks::size() inside a loop.
This is a very significant improvement in performance, making the
compilation of a large regex with lots of alternatives take only ~100ms
instead of many minutes (I ran out of patience waiting for it) :^)
2021-12-21 22:10:07 +01:00
Ali Mohammad Pur
d2e51fafa9 LibRegex: Merge alternations based on blocks and not instructions
The instructions can have dependencies (e.g. Repeat), so only unify
equal blocks instead of consecutive instructions.
Fixes #11247.

Also adds the minimal test case(s) from that issue.
2021-12-15 19:36:45 +03:30
Ali Mohammad Pur
387df06385 LibRegex: Avoid rewriting a+ as a* as part of atomic rewriting
The initial `ForkStay` is only needed if the looping block has a
following block, if there's no following block or the following block
does not attempt to match anything, we should not insert the ForkStay,
otherwise we would be rewriting `a+` as `a*` by allowing the 'end' to be
executed.
Fixes #10952.
2021-11-18 09:09:22 +01:00
Ali Mohammad Pur
ac856cb965 LibRegex: Don't ignore empty alternatives in append_alternation()
Doing so would cause patterns like `(a|)` to not match the empty string.
2021-10-29 15:57:59 +02:00
Ali Mohammad Pur
8f722302d9 LibRegex: Use a match table for character classes
Generate a sorted, compressed series of ranges in a match table for
character classes, and use a binary search to find the matches.
This is about a 3-4x speedup for character class match performance. :^)
2021-10-03 19:16:36 +02:00
Andreas Kling
2758d99bbc LibRegex: Flatten bytecode before performing optimizations
This avoids doing DisjointChunks traversal for every bytecode access,
significantly reducing startup time for large regular expressions.
2021-09-29 18:45:26 +02:00
Ali Mohammad Pur
741886a4c4 LibRegex: Make the optimizer understand references and capture groups
Otherwise the fork in patterns like `(1+)\1` would be (incorrectly)
optimized away.
2021-09-15 15:52:28 +04:30
Ali Mohammad Pur
bf0315ff8f LibRegex: Avoid excessive Vector copy when compiling regexps
Previously we would've copied the bytecode instead of moving the chunks
around, use the fancy new DisjointChunks<T> abstraction to make that
happen automagically.
This decreases vector copies and uses of memmove() by nearly 10x :^)
2021-09-14 21:33:15 +04:30
Ali Mohammad Pur
246ab432ff LibRegex: Add a basic optimization pass
This currently tries to convert forking loops to atomic groups, and
unify the left side of alternations.
2021-09-13 14:38:53 +04:30