Commit Graph

56 Commits

Author SHA1 Message Date
Linus Groh
6e7459322d AK: Remove StringBuilder::build() in favor of to_deprecated_string()
Having an alias function that only wraps another one is silly, and
keeping the more obvious name should flush out more uses of deprecated
strings.
No behavior change.
2023-01-27 20:38:49 +00:00
Eli Youngs
87a961534f LibRegex: Prevent patterns from matching the empty string twice
Previously, if a pattern matched the empty string (e.g. ".*"), it would
match the string twice instead of once. Among other issues, this caused
a Regex replacement to duplicate its expected output, since it would
replace "both" empty matches.
2023-01-06 13:52:21 -07:00
Linus Groh
57dc179b1f Everywhere: Rename to_{string => deprecated_string}() where applicable
This will make it easier to support both string types at the same time
while we convert code, and tracking down remaining uses.

One big exception is Value::to_string() in LibJS, where the name is
dictated by the ToString AO.
2022-12-06 08:54:33 +01:00
Linus Groh
6e19ab2bbc AK+Everywhere: Rename String to DeprecatedString
We have a new, improved string type coming up in AK (OOM aware, no null
state), and while it's going to use UTF-8, the name UTF8String is a
mouthful - so let's free up the String name by renaming the existing
class.
Making the old one have an annoying name will hopefully also help with
quick adoption :^)
2022-12-06 08:54:33 +01:00
Ali Mohammad Pur
f1851346d3 LibRegex: Use a copy-on-write vector for fork state 2022-11-17 20:13:04 +03:30
Ali Mohammad Pur
cfcd6e770c LibRegex: Don't copy forked results twice 2022-11-17 20:13:04 +03:30
Tim Schumacher
8763dbcccc Everywhere: Remove a bunch of dead write-only variables
LLVM 15 now warns (and thus errors) about this, and there is really no
point in keeping them.
2022-09-16 05:39:28 +00:00
sin-ack
3f3f45580a Everywhere: Add sv suffix to strings relying on StringView(char const*)
Each of these strings would previously rely on StringView's char const*
constructor overload, which would call __builtin_strlen on the string.
Since we now have operator ""sv, we can replace these with much simpler
versions. This opens the door to being able to remove
StringView(char const*).

No functional changes.
2022-07-12 23:11:35 +02:00
Timothy Flynn
3729fd06fa LibRegex: Do not return an Optional from Regex::Matcher::execute
The code path that could return an optional no longer exists as of
commit: a962ee020a
2022-02-05 19:06:50 +03:30
Timothy Flynn
27d3de1f17 LibRegex: Do not continue searching input when the sticky bit is set
This partially reverts commit a962ee020a.

When the sticky bit is set, the global bit should basically be ignored
except by external callers who want their own special behavior. For
example, RegExp.prototype [ @@match ] will use the global flag to
accumulate consecutive matches. But on the first failure, the regex
loop should break.
2022-02-05 19:06:50 +03:30
Ali Mohammad Pur
a962ee020a LibJS+LibRegex: Don't repeat regex match in regexp_exec()
LibRegex already implements this loop in a more performant way, so all
LibJS has to do here is to return things in the right shape, and not
loop over the input string.
Previously this was a quadratic operation on string length, which lead
to crazy execution times on failing regexps - now it's nice and fast :^)

Note that a Regex test has to be updated to remove the stateful flag as
it repeats matching on multiple strings.
2022-02-05 00:09:32 +01:00
Ali Mohammad Pur
2b028f6faa LibRegex+LibJS: Avoid searching for more than one match in JS RegExps
All of JS's regular expression APIs only want a single match, so avoid
trying to produce more (which will be discarded anyway).
2022-02-05 00:09:32 +01:00
Ali Mohammad Pur
5fac41f733 LibRegex: Implement ECMA262 multiline matching without splitting lines
As ECMA262 regex allows `[^]` and literal newlines to match newlines in
the input string, we shouldn't split the input string into lines, rather
simply make boundaries and catchall patterns capable of checking for
these conditions specifically.
2022-01-26 00:53:09 +03:30
Ali Mohammad Pur
cd83325c7c LibRegex: Preserve capture groups and matches across ForkReplace
This makes the (flawed) ForkStay inserted as a loop header unnecessary,
and finally fixes LibRegex rewriting weird loops in weird ways.
2022-01-22 00:35:49 +00:00
Ali Mohammad Pur
704e0654b3 Revert "LibRegex: Implement an ECMA262 Regex quirk with negative loo..."
This partially reverts commit c11be92e23.
That commit fixes one thing and breaks many more, a next commit will
implement this quirk in a more sane way.
2022-01-22 00:35:49 +00:00
Ali Mohammad Pur
9eccd4c56e LibRegex: Allow the pattern to match the zero-length end of the string
...only if Multiline is not enabled.
Fixes #11940.
2022-01-21 18:14:08 +03:30
Ali Mohammad Pur
c11be92e23 LibRegex: Implement an ECMA262 Regex quirk with negative lookarounds
This implements the quirk defined by "Note 3" in section "Canonicalize"
(https://tc39.es/ecma262/#sec-runtime-semantics-canonicalize-ch).

Crosses off another quirk from #6042.
2022-01-21 18:14:08 +03:30
Hendiadyoin1
a2563496f5 LibRegex: Remove some else-after-returns 2021-12-21 18:17:28 -08:00
Hendiadyoin1
b674de6957 LibRegex: Add some implied auto qualifiers 2021-12-21 18:17:28 -08:00
Andreas Kling
8b1108e485 Everywhere: Pass AK::StringView by value 2021-11-11 01:27:46 +01:00
Ben Wiederhake
50698a0db4 AK: Prevent accidental misuse of BumpAllocator
In particular, we implicitly required that the caller initializes the
returned instances themselves (solved by making
UniformBumpAllocator::allocate call the constructor), and BumpAllocator
itself cannot handle classes that are not trivially deconstructible
(solved by deleting the method).

Co-authored-by: Ali Mohammad Pur <ali.mpfard@gmail.com>
2021-10-23 19:02:54 +01:00
Andreas Kling
bb6634b024 LibRegex: Don't emit signpost events for every regular expression
The time we were spending on these signposts was adding up to way too
much, so let's not do it automatically.
2021-10-02 16:53:03 +02:00
Ali Mohammad Pur
e4b1c0b8b1 LibRegex: Set a signpost on every executed regular expression 2021-09-13 14:38:53 +04:30
Ali Mohammad Pur
246ab432ff LibRegex: Add a basic optimization pass
This currently tries to convert forking loops to atomic groups, and
unify the left side of alternations.
2021-09-13 14:38:53 +04:30
Timothy Flynn
9509433e25 LibRegex: Implement and use a REPEAT operation for bytecode repetition
Currently, when we need to repeat an instruction N times, we simply add
that instruction N times in a for-loop. This doesn't scale well with
extremely large values of N, and ECMA-262 allows up to N = 2^53 - 1.

Instead, add a new REPEAT bytecode operation to defer this loop from the
parser to the runtime executor. This allows the parser to complete sans
any loops (for this instruction), and allows the executor to bail early
if the repeated bytecode fails.

Note: The templated ByteCode methods are to allow the Posix parsers to
continue using u32 because they are limited to N = 2^20.
2021-08-15 11:43:45 +01:00
Timothy Flynn
a0b72f5ad3 LibRegex: Remove (mostly) unused regex::MatchOutput
This struct holds a counter for the number of executed operations, and
vectors for matches, captures groups, and named capture groups. Each of
the vectors is unused. Remove the struct and just keep a separate
counter for the executed operations.
2021-08-15 11:43:45 +01:00
Timothy Flynn
f1ce998d73 LibRegex+LibJS: Combine named and unnamed capture groups in MatchState
Combining these into one list helps reduce the size of MatchState, and
as a result, reduces the amount of memory consumed during execution of
very large regex matches.

Doing this also allows us to remove a few regex byte code instructions:
ClearNamedCaptureGroup, SaveLeftNamedCaptureGroup, and NamedReference.
Named groups now behave the same as unnamed groups for these operations.
Note that SaveRightNamedCaptureGroup still exists to cache the matched
group name.

This also removes the recursion level from the MatchState, as it can
exist as a local variable in Matcher::execute instead.
2021-08-15 11:43:45 +01:00
Timothy Flynn
fea181bde3 LibRegex: Reduce RegexMatcher's BumpAllocator chunk size
Before the BumpAllocator OOB access issue was understood and fixed, the
chunk size was increased to 8MiB as a workaround in commit:
27d555bab0.

The issue is now resolved by: 0f1425c895.

We can reduce the chunk size to 2MiB, which has the added benefit of
reducing runtime of the RegExp.prototype.exec test.
2021-08-15 11:43:45 +01:00
Timothy Flynn
27d555bab0 LibRegex: Track string position in both code units and code points
In non-Unicode mode, the existing MatchState::string_position is tracked
in code units; in Unicode mode, it is tracked in code points.

In order for some RegexStringView operations to be performant, it is
useful for the MatchState to have a field to always track the position
in code units. This will allow RegexStringView methods (e.g. operator[])
to perform lookups based on code unit offsets, rather than needing to
iterate over the entire string to find a code point offset.
2021-08-04 11:18:24 +02:00
Ali Mohammad Pur
d5984d296f LibRegex: Make Matcher<>::match(Vector<>) take a reference to the vector
It was previously copying the entire vector every time, which is not a
nice thing to do. :^)
2021-08-02 17:22:50 +04:30
Ali Mohammad Pur
a7653e6a05 LibRegex: Use a bump-allocated linked list for fork save states
This makes it avoid the excessively high malloc() traffic.
2021-08-02 17:22:50 +04:30
Ali Mohammad Pur
5f342e4fa9 LibRegex: Make Fork{Jump,Stay} non-recursive
This makes very fork-heavy expressions (like `(aa)*`) not run out of
stack space when matching very long strings.
2021-08-02 17:22:50 +04:30
Brian Gianforcaro
18d6f9ed5c Libraries: Remove unused header includes 2021-08-01 08:10:16 +02:00
Timothy Flynn
1400e3cf58 LibRegex: Allow separately parsing patterns and creating Regex objects
Adds a static method to parse a regex pattern and return the result, and
a constructor to accept a parse result. This is to allow LibJS to parse
the pattern string of a RegExpLiteral once and hand off regex objects
any number of times thereafter.
2021-07-30 21:26:31 +01:00
Timothy Flynn
b162517065 LibRegex: Take ownership of pattern string and fix move operations
The Regex object created a copy of the pattern string anyways, so tweak
the constructor to allow callers to move() pattern strings into the
regex.

The Regex move constructor and assignment operator currently result in
memory corruption. The Regex object stores a Matcher object, which holds
a reference to the Regex object. So when the Regex object is moved, that
reference is no longer valid. To fix this, the reference stored in the
Matcher must be updated when the Regex is moved.
2021-07-30 21:26:31 +01:00
Timothy Flynn
47f6bb38a1 LibRegex: Support UTF-16 RegexStringView and improve Unicode matching
When the Unicode option is not set, regular expressions should match
based on code units; when it is set, they should match based on code
points. To do so, the regex parser must combine surrogate pairs when
the Unicode option is set. Further, RegexStringView needs to know if
the flag is set in order to return code point vs. code unit based
string lengths and substrings.
2021-07-23 23:06:57 +01:00
Ali Mohammad Pur
36bfc912fc LibRegex: Switch to east-const style 2021-07-23 21:19:21 +04:30
Ali Mohammad Pur
f364fcec5d LibRegex+Everywhere: Make LibRegex more unicode-aware
This commit makes LibRegex (mostly) capable of operating on any of
the three main string views:
- StringView for raw strings
- Utf8View for utf-8 encoded strings
- Utf32View for raw unicode strings

As a result, regexps with unicode strings should be able to properly
handle utf-8 and not stop in the middle of a code point.
A future commit will update LibJS to use the correct type of string
depending on the flags.
2021-07-18 21:10:55 +04:30
Ali Mohammad Pur
54d89609de LibRegex: Add support for the Basic POSIX regular expressions
This implements the internal regex stuff for #8506.
2021-07-10 13:33:08 +02:00
Timothy Flynn
0f0ac37b56 LibRegex: Break from execution loop when the sticky flag is set
If the sticky flag is set, the regex execution loop should break
immediately even if the execution was a failure. The specification for
several RegExp.prototype methods (e.g. exec and @@split) rely on this
behavior.
2021-07-09 19:45:55 +01:00
Gunnar Beutner
794dc368f1 LibRegex: Avoid prepending items to vectors 2021-06-14 16:09:58 +04:30
Gunnar Beutner
281f39073d LibRegex: Make get_opcode() return a reference
Previously this would return a pointer which could be null if the
requested opcode was invalid. This should never be the case though
so let's VERIFY() that instead.
2021-06-14 16:09:58 +04:30
Linus Groh
dac0554fa0 LibRegex: Replace fprintf()/printf() with warnln()/outln()/dbgln() 2021-05-31 17:43:54 +01:00
Andreas Kling
79ff1902aa LibRegex: Convert StringBuilder::appendf() => AK::Format 2021-05-07 21:12:09 +02:00
Linus Groh
a4c1860bfc LibRegex: Put to dbgln()s behind REGEX_DEBUG 2021-04-23 20:52:12 +02:00
Ali Mohammad Pur
bf9c04a3da LibRegex: Implement multiline stateful matches 2021-04-23 10:05:04 +02:00
Ali Mohammad Pur
bb40d4d5ff LibRegex: Do not attempt to find more matches when one match is needed 2021-04-23 10:05:04 +02:00
Brian Gianforcaro
1682f0b760 Everything: Move to SPDX license identifiers in all files.
SPDX License Identifiers are a more compact / standardized
way of representing file license information.

See: https://spdx.dev/resources/use/#identifiers

This was done with the `ambr` search and replace tool.

 ambr --no-parent-ignore --key-from-file --rep-from-file key.txt rep.txt *
2021-04-22 11:22:27 +02:00
AnotherTest
ade97d4094 LibRegex: Make sure there are as many group matches as actual matches
Fixes #6131.
2021-04-05 09:02:06 +02:00
AnotherTest
76f63c2980 LibRegex: Allocate entries for all capture groups in RegexResult
Not just the seen ones.
Fixes #6108.
2021-04-04 16:04:06 +02:00