Commit Graph

253 Commits

Author SHA1 Message Date
Timothy Flynn
b7a95cba65 LibUnicode: Implement grammar validators for Unicode TR-35
ECMA-402 requires validating user input against the EBNF grammar for
Unicode locales described in TR-35: https://www.unicode.org/reports/tr35

This commit adds validators for that grammar, as well as other helper to
e.g. canonicalize a locale string.
2021-08-26 22:04:09 +01:00
Timothy Flynn
262e412634 AK: Implement method to convert a String/StringView to title case
This implementation preserves consecutive spaces in the orginal string.
2021-08-26 22:04:09 +01:00
Jean-Baptiste Boric
c97f7ea23b Tests: Test setjmp/sigsetjmp LibC functions
Since there are no real users of these functions in Serenity's
userland and this is my third attempt at this... This time, the great
LibTest test suite will make sure that I do it right!
2021-08-26 00:54:23 +02:00
Jan de Visser
85a84b0794 LibSQL: Introduce Serializer as a mediator between Heap and client code
Classes reading and writing to the data heap would communicate directly
with the Heap object, and transfer ByteBuffers back and forth with it.
This makes things like caching and locking hard. Therefore all data
persistence activity will be funneled through a Serializer object which
in turn submits it to the Heap.

Introducing this unfortunately resulted in a huge amount of churn, in
which a number of smaller refactorings got caught up as well.
2021-08-21 22:03:30 +02:00
Jan de Visser
d074a601df LibSQL+SQLServer: Bare bones INSERT and SELECT statements
This patch provides very basic, bare bones implementations of the
INSERT and SELECT statements. They are *very* limited:
- The only variant of the INSERT statement that currently works is
   SELECT INTO schema.table (column1, column2, ....) VALUES
      (value11, value21, ...), (value12, value22, ...), ...
   where the values are literals.
- The SELECT statement is even more limited, and is only provided to
  allow verification of the INSERT statement. The only form implemented
  is: SELECT * FROM schema.table

These statements required a bit of change in the Statement::execute
API. Originally execute only received a Database object as parameter.
This is not enough; we now pass an ExecutionContext object which
contains the Database, the current result set, and the last Tuple read
from the database. This object will undoubtedly evolve over time.

This API change dragged SQLServer::SQLStatement into the patch.

Another API addition is Expression::evaluate. This method is,
unsurprisingly, used to evaluate expressions, like the values in the
INSERT statement.

Finally, a new test file is added: TestSqlStatementExecution, which
tests the currently implemented statements. As the number and flavour of
implemented statements grows, this test file will probably have to be
restructured.
2021-08-21 22:03:30 +02:00
Jan de Visser
b74721e604 LibSQL: Redesign Value implementation and add new types
The implemtation of the Value class was based on lambda member variables
implementing type-dependent behaviour. This was done to ensure that
Values can be used as stack-only objects; the simplest alternative,
virtual methods, forces them onto the heap. The problem with the the
lambda approach is that it bloats the Values (which are supposed to be
lightweight objects) quite considerably, because every object contains
more than a dozen function pointers.

The solution to address both problems (we want Values to be able to live
on the stack and be as lightweight as possible) chosen here is to
encapsulate type-dependent behaviour and state in an implementation
class, and let the Value be an AK::Variant of those implementation
classes. All methods of Value are now basically straight delegates to
the implementation object using the Variant::visit method.

One issue complicating matters is the addition of two aggregate types,
Tuple and Array, which each contain a Vector of Values. At this point
Tuples and Arrays (and potential future aggregate types) can't contain
these aggregate types. This is limiting and needs to be addressed.

Another area that needs attention is the nomenclature of things; it's
a bit of a tangle of 'ValueBlahBlah' and 'ImplBlahBlah'. It makes sense
right now I think but admit we probably can do better.

Other things included here:
- Added the Boolean and Null types (and Tuple and Array, see above).
- to_string now always succeeds and returns a String instead of an
  Optional. This had some impact on other sources.
- Added a lot of tests.
- Started moving the serialization mechanism more towards where I want
  it to be, i.e. a 'DataSerializer' object which just takes
  serialization and deserialization requests and knows for example how
  to store long strings out-of-line.

One last remark: There is obviously a naming clash between the Tuple
class and the Tuple Value type. This is intentional; I plan to make the
Tuple class a subclass of Value (and hence Key and Row as well).
2021-08-21 22:03:30 +02:00
Jan de Visser
a5e28f2897 LibSQL: Make TupleDescriptor a shared pointer instead of a stack object
Tuple descriptors are basically the same for for example all rows in
a table. Makes sense to share them instead of copying them for every
single row.
2021-08-21 22:03:30 +02:00
Timothy Flynn
562d4e497b LibRegex: Treat pattern string characters as unsigned
For example, consider the following pattern:

    new RegExp('\ud834\udf06', 'u')

With this pattern, the regex parser should insert the UTF-8 encoded
bytes 0xf0, 0x9d, 0x8c, and 0x86. However, because these characters are
currently treated as normal char types, they have a negative value since
they are all > 0x7f. Then, due to sign extension, when these characters
are cast to u64, the sign bit is preserved. The result is that these
bytes are inserted as 0xfffffffffffffff0, 0xffffffffffffff9d, etc.

Fortunately, there are only a few places where we insert bytecode with
the raw characters. In these places, be sure to treat the bytes as u8
before they are cast to u64.
2021-08-20 19:16:33 +02:00
Andreas Kling
13f4890c38 LibCore: Make Core::File::open() return OSError in case of failure 2021-08-20 15:31:46 +02:00
Timothy Flynn
4f2cbe119b LibRegex: Allow Unicode escape sequences in capture group names
Unfortunately, this requires a slight divergence in the way the capture
group names are stored. Previously, the generated byte code would simply
store a view into the regex pattern string, so no string copying was
required.

Now, the escape sequences are decoded into a new string, and a vector
of all parsed capture group names are stored in a vector in the parser
result structure. The byte code then stores a view into the
corresponding string in that vector.
2021-08-19 23:49:25 +02:00
Timothy Flynn
fd8ccedf2b AK: Add GenericLexer API to consume an escaped Unicode code point
This parsing is already duplicated between LibJS and LibRegex, and will
shortly be needed in more places in those libraries. Move it to AK to
prevent further duplication.

This API will consume escaped Unicode code points of the form:
    \\u{code point}
    \\unnnn (where each n is a hexadecimal digit)
    \\unnnn\\unnnn (where the two escaped values are a surrogate pair)
2021-08-19 23:49:25 +02:00
Timothy Flynn
325eabc770 LibRegex: Ensure the GoBack operation decrements the code unit index
This was missed in commit 27d555bab0.
2021-08-18 09:47:09 +04:30
Timothy Flynn
a9716ad44e LibRegex: In non-Unicode mode, parse \u{4} as a repetition pattern 2021-08-18 09:47:09 +04:30
davidot
7613c22b06 LibJS: Add a mode to parse JS as a module
In a module strict mode should be enabled at the start of parsing and we
allow import and export statements.
2021-08-15 23:51:47 +01:00
Timothy Flynn
9509433e25 LibRegex: Implement and use a REPEAT operation for bytecode repetition
Currently, when we need to repeat an instruction N times, we simply add
that instruction N times in a for-loop. This doesn't scale well with
extremely large values of N, and ECMA-262 allows up to N = 2^53 - 1.

Instead, add a new REPEAT bytecode operation to defer this loop from the
parser to the runtime executor. This allows the parser to complete sans
any loops (for this instruction), and allows the executor to bail early
if the repeated bytecode fails.

Note: The templated ByteCode methods are to allow the Posix parsers to
continue using u32 because they are limited to N = 2^20.
2021-08-15 11:43:45 +01:00
Timothy Flynn
f1ce998d73 LibRegex+LibJS: Combine named and unnamed capture groups in MatchState
Combining these into one list helps reduce the size of MatchState, and
as a result, reduces the amount of memory consumed during execution of
very large regex matches.

Doing this also allows us to remove a few regex byte code instructions:
ClearNamedCaptureGroup, SaveLeftNamedCaptureGroup, and NamedReference.
Named groups now behave the same as unnamed groups for these operations.
Note that SaveRightNamedCaptureGroup still exists to cache the matched
group name.

This also removes the recursion level from the MatchState, as it can
exist as a local variable in Matcher::execute instead.
2021-08-15 11:43:45 +01:00
Timothy Flynn
1a173be29d LibRegex: Disallow unescaped quantifiers in Unicode mode 2021-08-15 11:43:45 +01:00
Timothy Flynn
c3e1f1f687 LibRegex: Use correct source characters for Unicode identity escapes 2021-08-15 11:43:45 +01:00
Timothy Flynn
6a485f612f LibRegex: Implement legacy octal escape parsing closer to the spec
The grammar for the ECMA-262 CharacterEscape is:

  CharacterEscape[U, N] ::
    ControlEscape
    c ControlLetter
    0 [lookahead ∉ DecimalDigit]
    HexEscapeSequence
    RegExpUnicodeEscapeSequence[?U]
    [~U]LegacyOctalEscapeSequence
    IdentityEscape[?U, ?N]

It's important to parse the standalone "\0 [lookahead ∉ DecimalDigit]"
before parsing LegacyOctalEscapeSequence. Otherwise, all standalone "\0"
patterns are parsed as octal, which are disallowed in Unicode mode.

Further, LegacyOctalEscapeSequence should also be parsed while parsing
character classes.
2021-08-15 11:43:45 +01:00
Timothy Flynn
83ca8c7e38 LibRegex: Convert LibRegex tests to use StringView in place of C-strings
A subsequent commit will add tests that require a string containing only
"\0". As a C-string, this will be interpreted as the null terminator. To
make the diff for that commit easier to grok, this commit converts all
tests to use StringView without any other functional changes.
2021-08-15 11:43:45 +01:00
Timothy Flynn
0c8f2f5aca LibRegex: Ensure escaped hexadecimals are exactly 2 digits in length 2021-08-15 11:43:45 +01:00
Timothy Flynn
2e4b6fd1ac LibRegex: Ensure escaped code points are exactly 4 digits in length 2021-08-15 11:43:45 +01:00
Timothy Flynn
e887314472 LibRegex: Fix ECMA-262 parsing of invalid identity escapes
* Only alphabetic (A-Z, a-z) characters may be escaped with \c. The loop
  currently parsing \c includes code points between the upper/lower case
  groups.
* In Unicode mode, all invalid identity escapes should cause a parser
  error, even in browser-extended mode.
* Avoid an infinite loop when parsing the pattern "\c" on its own.
2021-08-15 11:43:45 +01:00
Brian Gianforcaro
a2a5cb0f24 AK: Add Time::is_negative() to detect negative time values 2021-08-15 12:20:38 +02:00
Daniel Bertalan
0a36cea9dc Tests: Re-enable UserspaceEmulator tests on the Clang build
Now that problems that made UE crash have been fixed, this test should
now pass.
2021-08-14 18:42:14 +02:00
Itamar
e57fdb63f8 Tests: Add regression tests for the LibCpp preprocessor
Similarly to the LibCpp parser regression tests, these tests run the
preprocessor on the .cpp test files under
Userland/LibCpp/Tests/preprocessor, and compare the output with existing
.txt ground truth files.
2021-08-14 12:40:55 +02:00
Timothy Flynn
df14d11a11 LibRegex: Disallow invalid interval qualifiers in Unicode mode
Fixes all remaining 'built-ins/RegExp/property-escapes' test262 tests.
2021-08-11 13:11:01 +02:00
Timothy Flynn
1e91334008 LibUnicode: Handle edge-case script extensions, Common and Inherited
These script extensions have some peculiar behavior in the Unicode spec.
The UCD ScriptExtension file does not contain these scripts. Rather, it
is implied the code points which have these scripts as an extension are
the code points that both:

  1. Have Common or Inherited as their primary script value
  2. Do not have any other script value in their script extension lists

Because these are not explictly listed in the UCD, we must manually form
these script extensions.
2021-08-11 13:11:01 +02:00
Timothy Flynn
47bb350ebd LibUnicode: Generate separate tables for scripts and script extensions
Notice that unlike the note in populate_general_category_unions(),
script extension do indeed have code point ranges which overlap. Thus,
this commit adds code to handle that, and hooks it into the GC unions.
2021-08-11 13:11:01 +02:00
Timothy Flynn
5ac23d244d LibUnicode: Generate separate tables for Unicode properties
Similar to General Categories, this generates separate tables for the
Property list.
2021-08-11 13:11:01 +02:00
Timothy Flynn
b06c104076 LibUnicode: Include Unassigned code points in the Other General Category
Now that the generator parses unassigned General Category properties, it
can include Unassigned (Cn) in the Other (C) category.
2021-08-11 13:11:01 +02:00
Timothy Flynn
7dce2bfe23 LibUnicode: Generate separate tables for General Category properties
Previously, each code point's General Category was part of the generated
UnicodeData structure. This ultimately presented two problems, one
functional and one performance related:

  * Some General Categories are applied to unassigned code points, for
    example the Unassigned (Cn) category. Unassigned code points are
    strictly excluded from UnicodeData.txt, so by relying on that file,
    the generator is unable to handle these categories.

  * Lookups for General Categories are slower when searching through the
    large UnicodeData hash map. Even though lookups are O(1), the hash
    function turned out to be slower than binary searching through a
    category-specific table.

So, now a table is generated for each General Category. When querying a
code point for a category, a binary search is done on each code point
range in that category's table to check if code point has that category.

Further, General Categories are now parsed from the UCD file
DerivedGeneralCategory.txt. This file is a normal "prop list" file and
contains the categories for unassigned code points.
2021-08-11 13:11:01 +02:00
Mandar Kulkarni
aaf232f903 Tests: Add test for String::bijective_base_from() 2021-08-09 14:14:07 +04:30
Daniel Bertalan
146dcf4856 Tests: Disable UserspaceEmulator tests for Clang builds
There seems to be more incorrect assumptions about Clang-built
executables' memory layout than expected. These make the CI fail even
though the system is functional in all other aspects. While this is
being fixed, let's just disable tests for UserspaceEmulator.
2021-08-08 10:55:36 +02:00
Daniel Bertalan
7396e4aedc LibDebug: Store 64-bit numbers in AttributeValue
This helps us avoid weird truncation issues and fixes a bug on Clang
builds where truncation while reading caused the DIE offsets following
large LEB128 numbers to be incorrect. This removes the need for the
separate `LongUnsignedNumber` type.
2021-08-08 10:55:36 +02:00
Daniel Bertalan
5f2f460cc8 Tests: Add Clang pragma for turning off optimizations
Clang does not accept `GCC optimize("O0")`, so it fails to build the
system with it.
2021-08-08 10:55:36 +02:00
Itamar
4673a517f6 LibCpp: Do lexing in the Preprocessor
We now call Preprocessor::process_and_lex() and pass the result to the
parser.

Doing the lexing in the preprocessor will allow us to maintain the
original position information of tokens after substituting definitions.
2021-08-07 21:24:11 +02:00
Lenny Maiorani
8e949c5c91 Tests: Remove unused variables for clang build
Problem:
- Clang will not build `Tests/LibTLS` due to unused variables.

Solution:
- Remove the unused variables.
2021-08-06 23:55:27 +02:00
TheFightingCatfish
4e8e1b7b3a AK: Improve the parsing of data urls
Improve the parsing of data urls in URLParser to bring it more up-to-
spec. At the moment, we cannot parse the components of the MIME type
since it is represented as a string, but the spec requires it to be
parsed as a "MIME type record".
2021-08-06 10:45:17 +02:00
Timothy Flynn
484ccfadc3 LibRegex: Support property escapes of Unicode script extensions 2021-08-04 13:50:32 +01:00
Timothy Flynn
06088df729 LibRegex: Support property escapes of the Unicode script property
Note that unlike binary properties and general categories, scripts must
be specified in the non-binary (Script=Value) form.
2021-08-04 13:50:32 +01:00
Brian Gianforcaro
4df1657898 Tests: Add coverage for sys$alarm() success case 2021-08-03 18:44:01 +02:00
Brian Gianforcaro
ea401fb3c3 Tests: Add coverage for sys$alarm() canceling a stale timer
This is a regression test to validate the functionality that was
reported broken in #9071, where the kernel would spin attempting
to cancel a stale timer.
2021-08-03 18:44:01 +02:00
Timothy Flynn
dc9f516339 LibRegex: Generate negated property escapes as a single instruction
These were previously generated as two instructions, Compare [Inverse]
and Compare [Property].
2021-08-02 21:02:09 +04:30
Timothy Flynn
4de4312827 LibRegex: Support property escapes of the form \p{Type=Value}
Before now, only binary properties could be parsed. Non-binary props are
of the form "Type=Value", where "Type" may be General_Category, Script,
or Script_Extension (or their aliases). Of these, LibUnicode currently
supports General_Category, so LibRegex can parse only that type.
2021-08-02 21:02:09 +04:30
Timothy Flynn
1e10d6d7ce LibRegex: Support property escapes of Unicode General Categories
This changes LibRegex to parse the property escape as a Variant of
Unicode Property & General Category values. A byte code instruction is
added to perform matching based on General Category values.
2021-08-02 21:02:09 +04:30
Ali Mohammad Pur
85d87cbcc8 LibRegex: Add some tests for Fork{Stay,Jump} performance
Without the previous fixes, these will blow up the stack.
2021-08-02 17:22:50 +04:30
Brian Gianforcaro
d1644c26d6 Tests: Remove unused header includes 2021-08-01 08:10:16 +02:00
Brian Gianforcaro
c54ae3afd6 Tests: Fix AK/TestJSON.cpp by not relying on disk resources
The following commit broke Tests/AK/TestJSON.cpp as it removed the
file that the test loaded from disk to validate JSON parsing.

    commit ad141a2286
    Author: Andreas Kling <kling@serenityos.org>
    Date:   Sat Jul 31 15:26:14 2021 +0200

        Base: Remove "test.frm" from HackStudio test project

Instead of restoring the file, lets just embed a bit of JSON in the
test case to avoid using external resources, as they obviously are
surprising and make the test less portable across environments.
2021-07-31 23:56:40 +02:00
Timothy Flynn
d485cf29d7 LibRegex+LibUnicode: Begin implementing Unicode property escapes
This supports some binary property matching. It does not support any
properties not yet parsed by LibUnicode, nor does it support value
matching (such as Script_Extensions=Latin).
2021-07-30 21:26:31 +01:00