Commit Graph

43 Commits

Author SHA1 Message Date
Timothy Flynn
484ccfadc3 LibRegex: Support property escapes of Unicode script extensions 2021-08-04 13:50:32 +01:00
Timothy Flynn
06088df729 LibRegex: Support property escapes of the Unicode script property
Note that unlike binary properties and general categories, scripts must
be specified in the non-binary (Script=Value) form.
2021-08-04 13:50:32 +01:00
Timothy Flynn
dc9f516339 LibRegex: Generate negated property escapes as a single instruction
These were previously generated as two instructions, Compare [Inverse]
and Compare [Property].
2021-08-02 21:02:09 +04:30
Timothy Flynn
4de4312827 LibRegex: Support property escapes of the form \p{Type=Value}
Before now, only binary properties could be parsed. Non-binary props are
of the form "Type=Value", where "Type" may be General_Category, Script,
or Script_Extension (or their aliases). Of these, LibUnicode currently
supports General_Category, so LibRegex can parse only that type.
2021-08-02 21:02:09 +04:30
Timothy Flynn
1e10d6d7ce LibRegex: Support property escapes of Unicode General Categories
This changes LibRegex to parse the property escape as a Variant of
Unicode Property & General Category values. A byte code instruction is
added to perform matching based on General Category values.
2021-08-02 21:02:09 +04:30
Timothy Flynn
d485cf29d7 LibRegex+LibUnicode: Begin implementing Unicode property escapes
This supports some binary property matching. It does not support any
properties not yet parsed by LibUnicode, nor does it support value
matching (such as Script_Extensions=Latin).
2021-07-30 21:26:31 +01:00
Ali Mohammad Pur
faef523567 LibRegex: Make unclosed-at-eof brace quantifiers an error
Otherwise we'd just loop trying to parse it over and over again, for
instance in `/a{/` or `/a{1,/`.
Unless we're parsing in Annex B mode, which allows `{` as a normal
ExtendedSourceCharacter.
2021-07-24 20:52:43 +04:30
Timothy Flynn
345ef6abba LibRegex: Support ECMA-262 Unicode escapes of the form "\u{code_point}"
When the Unicode flag is set, regular expressions may escape code points
by surrounding the hexadecimal code point with curly braces, e.g. \u{41}
is the character "A".

When the Unicode flag is not set, this should be considered a repetition
symbol - \u{41} is the character "u" repeated 41 times. This is left as
a TODO for now.
2021-07-23 23:06:57 +01:00
Timothy Flynn
47f6bb38a1 LibRegex: Support UTF-16 RegexStringView and improve Unicode matching
When the Unicode option is not set, regular expressions should match
based on code units; when it is set, they should match based on code
points. To do so, the regex parser must combine surrogate pairs when
the Unicode option is set. Further, RegexStringView needs to know if
the flag is set in order to return code point vs. code unit based
string lengths and substrings.
2021-07-23 23:06:57 +01:00
Ali Mohammad Pur
36bfc912fc LibRegex: Switch to east-const style 2021-07-23 21:19:21 +04:30
Ali Mohammad Pur
c8b2199251 LibRegex: Clear previous capture group contents in ECMA262 mode
ECMA262 requires that the capture groups only contain the values from
the last iteration, e.g. `((c)(a)?(b))` should _not_ contain 'a' in the
second capture group when matching "cabcb".
2021-07-23 21:19:21 +04:30
Ali Mohammad Pur
9cdea2d521 LibRegex: Consider EOF in the middle of a range an error 2021-07-13 07:04:06 +02:00
Ali Mohammad Pur
1b2728f1ed LibRegex: Don't attempt to insert invalid bytecode in {B,E}RE 2021-07-13 07:04:06 +02:00
Ali Mohammad Pur
6e35b94034 LibRegex: Implement lookaround in ERE 2021-07-13 07:04:06 +02:00
Ali Mohammad Pur
5f4e1338a1 LibRegex: Allow empty character classes in {B,E}RE 2021-07-13 07:04:06 +02:00
Ali Mohammad Pur
189922f442 LibRegex: Disallow excessively large repetition counts in {B,E}RE 2021-07-13 07:04:06 +02:00
Ali Mohammad Pur
11a8476cf4 LibRegex: Use the parser state capture group count in BRE
Otherwise the users won't know how many capture groups are in the
parsed regular expression.
2021-07-10 23:14:08 +04:30
Ali Mohammad Pur
1c584e9d80 LibRegex: Correctly parse BRE bracket expressions
Commonly, bracket expressions are in fact, enclosed in brackets.
2021-07-10 22:58:24 +04:30
Ali Mohammad Pur
54d89609de LibRegex: Add support for the Basic POSIX regular expressions
This implements the internal regex stuff for #8506.
2021-07-10 13:33:08 +02:00
Ali Mohammad Pur
addfa1e82e LibRegex: Make the bytecode transformation functions static
They were pretty confusing when compared with other non-transforming
functions.
2021-07-10 13:33:08 +02:00
Timothy Flynn
65003241e4 LibRegex: Allow dollar signs in ECMA262 named capture groups
Fixes 1 test262 test.
2021-07-06 22:33:17 +01:00
Andreas Kling
dc65f54c06 AK: Rename Vector::append(Vector) => Vector::extend(Vector)
Let's make it a bit more clear when we're appending the elements from
one vector to the end of another vector.
2021-06-12 13:24:45 +02:00
Linus Groh
939da41fa1 LibRegex: Fix compilation errors on my host machine
I have no idea *why*, but this stopped working suddenly:

    return { { .code_point = '-', .is_character_class = false } };

Fails with:

    error: could not convert ‘{{'-', false}}’ from
    ‘<brace-enclosed initializer list>’ to
    ‘AK::Optional<regex::CharClassRangeElement>

Might be related to 66f15c2 somehow, going one past that commit makes
the build work again, however reverting the commit doesn't. Not sure
what's up with that.

Consider this patch a band-aid until we can find the reason and an
actual fix...

Compiler version:
gcc (GCC) 11.1.1 20210531 (Red Hat 11.1.1-3)
2021-06-06 09:26:07 +01:00
Linus Groh
a5903ac4b6 LibRegex: Hide stray dbgln() behind REGEX_DEBUG 2021-06-02 18:31:43 +01:00
Andreas Kling
12a42edd13 Everywhere: codepoint => code point 2021-06-01 10:01:11 +02:00
Linus Groh
dac0554fa0 LibRegex: Replace fprintf()/printf() with warnln()/outln()/dbgln() 2021-05-31 17:43:54 +01:00
Gunnar Beutner
6cf59b6ae9 Everywhere: Turn #if *_DEBUG into dbgln_if/if constexpr 2021-05-01 21:25:06 +02:00
Brian Gianforcaro
1682f0b760 Everything: Move to SPDX license identifiers in all files.
SPDX License Identifiers are a more compact / standardized
way of representing file license information.

See: https://spdx.dev/resources/use/#identifiers

This was done with the `ambr` search and replace tool.

 ambr --no-parent-ignore --key-from-file --rep-from-file key.txt rep.txt *
2021-04-22 11:22:27 +02:00
AnotherTest
5a14f7ea2f LibRegex: Generate a 'Compare' op for empty character classes
Otherwise it would match zero-length strings.
Fixes #6256.
2021-04-12 08:54:58 +02:00
AnotherTest
c128b3fd91 LibRegex: Remove 'ReadDigitFollowPolicy' as it's no longer needed
Thanks to @GMTA: 1b071455b1 (r49343474)
2021-04-10 12:10:45 +02:00
AnotherTest
1b071455b1 LibRegex: Treat brace quantifiers with invalid contents as literals
Fixes #6208.
2021-04-10 09:16:03 +02:00
AnotherTest
e9279d1790 LibRegex: Allow a '?' suffix for brace quantifiers
This fixes another compat point in #6042.
2021-04-10 09:16:03 +02:00
Jelle Raaijmakers
db321db5f4 LibRegex: Parse \0 as a zero-byte instead of 0x30 ("0")
This was causing some regexes to trip up. Fixes #6202.
2021-04-09 21:53:14 +02:00
AnotherTest
1bdc1cf77e LibRegex: Consider named capture groups as normal capture groups too 2021-04-05 09:02:06 +02:00
AnotherTest
be0182d049 LibRegex: Reset capture group indices when resetting parser state 2021-04-05 09:02:06 +02:00
AnotherTest
6bbb26fdaf LibRegex: Allow references to capture groups that aren't parsed yet
This only applies to the ECMA262 parser.
This behaviour is an ECMA262-specific quirk, such references always
generate zero-length matches (even on subsequent passes).
Also adds a test in LibJS's test suite.

Fixes #6039.
2021-04-01 21:55:47 +02:00
AnotherTest
e0ac85288e LibRegex: Allow missing high bound in {x,y} quantifiers
Fixes #5518.
2021-02-27 07:31:01 +01:00
AnotherTest
91bf3dc7fe LibRegex: Match the escaped part of escaped syntax characters
Previously, `\^` would've matched `\`, not `^`.
2021-02-27 07:31:01 +01:00
AnotherTest
f05e518cbc LibRegex: Implement section B.1.4. of the ECMA262 spec
This allows the parser to deal with crazy patterns like the one
in #5517.
2021-02-27 07:31:01 +01:00
Andreas Kling
5d180d1f99 Everywhere: Rename ASSERT => VERIFY
(...and ASSERT_NOT_REACHED => VERIFY_NOT_REACHED)

Since all of these checks are done in release builds as well,
let's rename them to VERIFY to prevent confusion, as everyone is
used to assertions being compiled out in release.

We can introduce a new ASSERT macro that is specifically for debug
checks, but I'm doing this wholesale conversion first since we've
accumulated thousands of these already, and it's not immediately
obvious which ones are suitable for ASSERT.
2021-02-23 20:56:54 +01:00
asynts
1a3a0836c0 Everywhere: Use CMake to generate AK/Debug.h.
This was done with the help of several scripts, I dump them here to
easily find them later:

    awk '/#ifdef/ { print "#cmakedefine01 "$2 }' AK/Debug.h.in

    for debug_macro in $(awk '/#ifdef/ { print $2 }' AK/Debug.h.in)
    do
        find . \( -name '*.cpp' -o -name '*.h' -o -name '*.in' \) -not -path './Toolchain/*' -not -path './Build/*' -exec sed -i -E 's/#ifdef '$debug_macro'/#if '$debug_macro'/' {} \;
    done

    # Remember to remove WRAPPER_GERNERATOR_DEBUG from the list.
    awk '/#cmake/ { print "set("$2" ON)" }' AK/Debug.h.in
2021-01-25 09:47:36 +01:00
asynts
01879d27c2 Everywhere: Replace a bundle of dbg with dbgln.
These changes are arbitrarily divided into multiple commits to make it
easier to find potentially introduced bugs with git bisect.
2021-01-16 11:54:35 +01:00
Andreas Kling
13d7c09125 Libraries: Move to Userland/Libraries/ 2021-01-12 12:17:46 +01:00