Commit Graph

276 Commits

Author SHA1 Message Date
Ali Mohammad Pur
8003bde03d AK+LibRegex+LibWasm: Remove the non-const COWVector::operator[]
This was copying the vector behind our backs, let's remove it and make
the copying explicit by putting it behind COWVector::mutable_at().
This is a further 64% performance improvement on Wasm validation.
2024-03-12 17:10:47 +01:00
Ali Mohammad Pur
cefe177a56 AK+LibRegex: Move COWVector to AK
This is about to gain a new user, so move it to AK.
2024-03-12 17:10:47 +01:00
Timothy Flynn
07a27b2ec0 AK: Replace the boolean parameter of StringView::lines with a named enum 2024-03-08 14:43:33 -05:00
Timothy Flynn
aa0a6d58b2 Userland: Remove LibCore dependency from libraries that do not use it 2024-01-22 08:48:34 -05:00
Ali Mohammad Pur
e265d81277 LibRegex: Correct And/Or and inversion interplay semantics
This commit also fixes an incorrect test case from very early on, our
behaviour now matches the ECMA262 spec in this case.

Fixes #21786.
2024-01-11 11:36:09 +01:00
Ali Mohammad Pur
267040dde7 LibRegex: Error out on Eof when parsing nonempty class range elements
Fixes #22507.
2023-12-31 15:36:42 +01:00
Shannon Booth
e2e7c4d574 Everywhere: Use to_number<T> instead of to_{int,uint,float,double}
In a bunch of cases, this actually ends up simplifying the code as
to_number will handle something such as:

```
Optional<I> opt;
if constexpr (IsSigned<I>)
    opt = view.to_int<I>();
else
    opt = view.to_uint<I>();
```

For us.

The main goal here however is to have a single generic number conversion
API between all of the String classes.
2023-12-23 20:41:07 +01:00
Ali Mohammad Pur
5e1499d104 Everywhere: Rename {Deprecated => Byte}String
This commit un-deprecates DeprecatedString, and repurposes it as a byte
string.
As the null state has already been removed, there are no other
particularly hairy blockers in repurposing this type as a byte string
(what it _really_ is).

This commit is auto-generated:
  $ xs=$(ack -l \bDeprecatedString\b\|deprecated_string AK Userland \
    Meta Ports Ladybird Tests Kernel)
  $ perl -pie 's/\bDeprecatedString\b/ByteString/g;
    s/deprecated_string/byte_string/g' $xs
  $ clang-format --style=file -i \
    $(git diff --name-only | grep \.cpp\|\.h)
  $ gn format $(git ls-files '*.gn' '*.gni')
2023-12-17 18:25:10 +03:30
Timothy Flynn
e122039c99 LibRegex: Support non-ASCII case-insensitive character comparisons
Specifically, when the Unicode flag is set, use Unicode-aware case
folding to case-insensitively compare code points.
2023-11-08 12:54:26 -05:00
Ali Mohammad Pur
4d71f4edc4 LibRegex: Don't add the Repeat instruction size to its jump target
This was causing the calculated jump target to become invalid, leading
to possibly invalid optimisations and (more likely) crashes.
Fixes #21047.
2023-09-15 18:07:23 +03:30
Ali Mohammad Pur
4d27257c45 LibRegex: Treat backwards jumps to IP 0 as normal backwards jumps too
This shows up in something like /\d+|x/, where the `+` ends up jumping
to the start of its own alternative.
2023-08-16 22:20:24 +03:30
Ali Mohammad Pur
e689422564 LibRegex: Keep track of instruction positions for backwards tree jumps 2023-08-05 16:40:04 +02:00
Ali Mohammad Pur
d9ed58eb39 LibRegex: Make OpCode_Compare use a switch-case instead of if-else
This has no performance impact, the if-elses have been annoying as it's
impossible to fold the entire thing in basically any IDE.
2023-07-31 05:31:33 +02:00
Ali Mohammad Pur
4e69eb89e8 LibRegex: Generate a search tree when patterns would benefit from it
This takes the previous alternation optimisation and applies it to all
the alternation blocks instead of just the few instructions at the
start.
By generating a trie of instructions, all logically equivalent
instructions will be consolidated into a single node, allowing the
engine to avoid checking the same thing multiple times.
For instance, given the pattern /abc|ac|ab/, this optimisation would
generate the following tree:
    - a
    | - b
    | | - c
    | | | - <accept>
    | | - <accept>
    | - c
    | | - <accept>
which will attempt to match 'a' or 'b' only once, and would also limit
the number of backtrackings performed in case alternatives fails to
match.

This optimisation is currently gated behind a simple cost model that
estimates the number of instructions generated, which is pessimistic for
small patterns, though the change in performance in such patterns is not
particularly large.
2023-07-31 05:31:33 +02:00
Ali Mohammad Pur
18f4b6c670 LibRegex: Add the literal string search optimisation
This switches to using a simple string equality check if the regex
pattern is strictly a string literal.
Technically this optimisation can also be made on bounded literal
patterns like /[abc]def/ or /abc|def/ as well, but those are
significantly more complex to implement due to our bytecode-only
approach.
2023-07-31 05:31:33 +02:00
Ali Mohammad Pur
221c52c696 LibRegex: Avoid slicing a RegexStringView in non-unicode Compare ops
Getting a single code point is much faster than slicing into the string.
2023-07-31 05:31:33 +02:00
Ali Mohammad Pur
11abca421a LibRegex: Always inline get_opcode()
This function's prologue was showing up as very hot in profiles, taking
almost 9% of runtime in our Regex test suite.
2023-07-31 05:31:33 +02:00
Ali Mohammad Pur
fbab9bc330 LibRegex: Avoid pointlessly slicing a UTF-16 string for one code unit
This simple change is a nice 9% speedup on the regex test suite :^)
2023-07-14 11:22:21 +02:00
Andreas Kling
17bff999c8 LibRegex: Remove redundant VERIFY in OpCode::argument()
The bounds check here is already performed by the at() we're calling,
and removing it appears to have non-trivial performance impact.
2023-07-14 09:09:11 +02:00
Ali Mohammad Pur
4bff4219ff LibRegex: Replace the checkpoint backing store with a Vector
This makes repeated lookups and insertions much faster, yielding a
general performance improvement of about 10% in the regex test suite.
2023-07-14 08:59:19 +02:00
Ali Mohammad Pur
2d6f50932b LibRegex: Assign unique serial IDs to checkpoints
This makes the compiler assign a serial ID to each checkpoint instead of
using the IP as the identifier.
This will be used in a future commit to replace the backing store of
checkpoints with a vector.
2023-07-14 08:59:19 +02:00
Ali Mohammad Pur
06573cd46d LibRegex: Enable the atomic rewrite optimisation for unicode properties 2023-07-14 08:59:19 +02:00
Timothy Flynn
c911781c21 Everywhere: Remove needless trailing semi-colons after functions
This is a new option in clang-format-16.
2023-07-08 10:32:56 +01:00
implicitfield
7d19abda7a LibC+LibRegex: Move regex_defs.h from LibC to LibRegex
This is needed to avoid including LibC headers in Lagom builds.
2023-06-27 12:40:38 +02:00
Timothy Flynn
8b668da9d5 LibRegex: Bail parsing class set characters upon early EOF
Otherwise, we reach a skip() invocation at the end of this function,
which crashes due to EOF. Caught by test262.
2023-06-23 20:22:45 +02:00
Ali Mohammad Pur
cdec23a68c LibRegex: Treat \<ORD_CHAR> as unescaped in POSIX BRE/ERE
This is undefined according to the spec, but glibc ignores the backslash
and some applications seem to prefer this behaviour (e.g. sed).
2023-06-14 12:38:10 +02:00
Ali Mohammad Pur
fb262de7cb LibRegex: Make append_alternation() actually skip the common blocks
Previously we started with 'left_skip' set to zero, which made it so
that no blocks were selected to be skipped.
2023-06-14 06:41:17 +02:00
Ali Mohammad Pur
b1ca2e5e39 LibRegex: Do not treat repeats followed by fallthroughs as atomic 2023-06-14 06:41:17 +02:00
Ali Mohammad Pur
eba466b8e7 LibRegex: Avoid calling GenericLexer::consume() past EOF
The consume(size_t) overload consumes "at most" as many bytes as
requested, but consume() consumes exactly one byte.
This commit makes sure to avoid consuming past EOF.

Fixes #18324.
Fixes #18325.
2023-04-14 12:33:54 +02:00
Ben Wiederhake
560133a0c6 Everywhere: Remove unused DeprecatedString includes 2023-04-09 22:00:54 +02:00
Ali Mohammad Pur
6fc9f5fa28 LibRegex: Make ^ and $ accept all LineTerminators instead of just '\n'
Also adds a couple tests.
2023-03-25 15:44:05 +01:00
Andreas Kling
a504ac3e2a Everywhere: Rename equals_ignoring_case => equals_ignoring_ascii_case
Let's make it clear that these functions deal with ASCII case only.
2023-03-10 13:15:44 +01:00
Andreas Kling
21db2b7b90 Everywhere: Remove NonnullOwnPtr.h includes 2023-03-06 23:46:35 +01:00
Fausto Tommasi
a24c49c18c LibRegex: Add to_string method for RegexStringView 2023-02-17 16:32:02 +00:00
Ali Mohammad Pur
7f530c0753 LibRegex: Bail out of atomic rewrite if a block doesn't contain compares
If a block jumps before performing a compare, we'd need to recursively
find the first of the jumped-to block. While this is doable, it's not
really worth spending the time as most such cases won't actually qualify
for atomic loop rewrite anyway.
Fixes an invalid rewrite when `.+` is followed by an alternation, e.g.
/.+(a|b|c)/.
2023-02-15 10:14:26 +01:00
Ali Mohammad Pur
af441bb939 LibRegex: Consider the inverse=true case when finding pattern overlap
Previously we were only checking for overlap when the range wasn't in
inverse mode, which made us miss things like /[^x]x/; this patch makes
it so we don't miss that.
2023-02-15 10:14:26 +01:00
Ali Mohammad Pur
936a9fd759 LibRegex: Make '.' reject matching LF / LS / PS as per the ECMA262 spec
Previously we allowed it to match those, but the ECMA262 spec disallows
these (except in DotAll).
2023-02-15 10:14:26 +01:00
Linus Groh
6e7459322d AK: Remove StringBuilder::build() in favor of to_deprecated_string()
Having an alias function that only wraps another one is silly, and
keeping the more obvious name should flush out more uses of deprecated
strings.
No behavior change.
2023-01-27 20:38:49 +00:00
Sam Atkins
6c66fd5ffb LibRegex: Remove declarations for non-existent methods 2023-01-27 20:33:18 +00:00
Timothy Flynn
f3db548a3d AK+Everywhere: Rename FlyString to DeprecatedFlyString
DeprecatedFlyString relies heavily on DeprecatedString's StringImpl, so
let's rename it to A) match the name of DeprecatedString, B) write a new
FlyString class that is tied to String.
2023-01-09 23:00:24 +00:00
Timothy Flynn
d0403ec14f AK+Everywhere: Rename Utf16View::to_utf8 to to_deprecated_string
A subsequent commit will add to_utf8 back to create an AK::String.
2023-01-09 23:00:24 +00:00
Timothy Flynn
d793262beb AK+Everywhere: Make UTF-16 to UTF-8 converter fallible
This could fail to allocate the underlying storage needed to store the
UTF-8 data. Propagate this error.
2023-01-08 12:13:15 +01:00
Timothy Flynn
1edb96376b AK+Everywhere: Make UTF-8 and UTF-32 to UTF-16 converters fallible
These could fail to allocate the underlying storage needed to store the
UTF-16 data. Propagate these errors.
2023-01-08 12:13:15 +01:00
Timothy Flynn
425c168ded AK+LibJS+LibRegex: Define an alias for UTF-16 string data storage
Instead of writing out "Vector<u16, 1>" everywhere, let's have a name
for it.
2023-01-08 12:13:15 +01:00
Eli Youngs
87a961534f LibRegex: Prevent patterns from matching the empty string twice
Previously, if a pattern matched the empty string (e.g. ".*"), it would
match the string twice instead of once. Among other issues, this caused
a Regex replacement to duplicate its expected output, since it would
replace "both" empty matches.
2023-01-06 13:52:21 -07:00
Eli Youngs
5bf2cce839 LibRegex: Allow the SingleMatch flag to be used as a PosixFlag 2023-01-06 13:52:21 -07:00
Nico Weber
beee52dbdc LibRegex: Return StringView from get_error_string()
It just returns literals after all. Removes one use of DeprecatedString.
2023-01-04 20:05:52 +01:00
Nico Weber
5fdef94d53 LibRegex: Tweak get_error() function
- Return StringView instead of DeprecatedString from function
  returning only literals
- Remove redundant cast
- Remove "inline" -- the function is defined in a cpp file,
  so there's no need for the linkage implications of `inline`.
  And compilers know to inline static functions with a single
  use without it. (Normally I'd remove the `static` instead,
  but this is in an `extern "C"` block, and it doesn't matter
  enough to end that block before the helper function and
  reopen it enough after)
2023-01-04 20:04:57 +01:00
Ben Wiederhake
6fd478b6ce Everywhere: Remove unused includes of AK/Format.h
These instances were detected by searching for files that include
AK/Format.h, but don't match the regex:

\\b(CheckedFormatString|critical_dmesgln|dbgln|dbgln_if|dmesgln|FormatBu
ilder|__FormatIfSupported|FormatIfSupported|FormatParser|FormatString|Fo
rmattable|Formatter|__format_value|HasFormatter|max_format_arguments|out
|outln|set_debug_enabled|StandardFormatter|TypeErasedFormatParams|TypeEr
asedParameter|VariadicFormatParams|v_critical_dmesgln|vdbgln|vdmesgln|vf
ormat|vout|warn|warnln|warnln_if)\\b

(Without the linebreaks.)

This regex is pessimistic, so there might be more files that don't
actually use any formatting functions.

Observe that this revealed that Userland/Libraries/LibC/signal.cpp is
missing an include.

In theory, one might use LibCPP to detect things like this
automatically, but let's do this one step after another.
2023-01-02 20:27:20 -05:00
Ben Wiederhake
8a331d4fa0 Everywhere: Move AK/Debug.h include to using files or remove 2023-01-02 20:27:20 -05:00