ladybird

mirror of https://github.com/LadybirdBrowser/ladybird.git synced 2024-07-14 08:40:35 +03:00

Author	SHA1	Message	Date
Timothy Flynn	672a555f98	LibCore+LibJS+LibUnicode: Port retrieving time zone offsets to ICU The changes to tests are due to LibTimeZone incorrectly interpreting time stamps in the TZDB. The TZDB will list zone transitions in either UTC or the zone's local time (which is then subject to DST offsets). LibTimeZone did not handle the latter at all. For example: The following rule is in effect until November 18, 6PM UTC. America/Chicago -5:50:36 - LMT 1883 Nov 18 18:00u The following rule is in effect until March 1, 2AM in Chicago time. But at that time, a DST transition occurs, so the local time is actually 3AM. America/Chicago -6:00 Chicago C%sT 1936 Mar 1 2:00	2024-06-26 10:14:02 +02:00
Timothy Flynn	1b2d47e6bb	LibJS+LibUnicode: Port retrieving available regional time zones to ICU	2024-06-26 10:14:02 +02:00
Timothy Flynn	4fc0fba646	LibCore+LibJS+LibUnicode: Port retrieving available time zones to ICU This required updating some LibJS spec steps to their latest versions, as the data expected by the old steps does not quite match the APIs that are available with the ICU. The new spec steps are much more aligned.	2024-06-26 10:14:02 +02:00
Timothy Flynn	d3e809bcd4	LibJS+LibUnicode: Port retrieving the system time zone to ICU	2024-06-26 10:14:02 +02:00
Timothy Flynn	ebdb92eef6	LibUnicode+Everywhere: Merge LibLocale back into LibUnicode LibLocale was split off from LibUnicode a couple years ago to reduce the number of applications on SerenityOS that depend on CLDR data. Now that we use ICU, both LibUnicode and LibLocale are actually linking in this data. And since vcpkg gives us static libraries, both libraries are over 30MB in size. This patch reverts the separation and merges LibLocale into LibUnicode again. We now have just one library that includes the ICU data. Further, this will let LibUnicode share the locale cache that previously would only exist in LibLocale.	2024-06-23 19:52:45 +02:00
Timothy Flynn	aa3a30870b	LibUnicode: Replace code point bidirectional classes with ICU	2024-06-22 14:56:39 +02:00
Timothy Flynn	ab56b8c8dc	LibUnicode: Remove the locale-unaware text segmentation implementation	2024-06-20 13:46:54 +02:00
Timothy Flynn	5cf818e305	LibUnicode: Replace case transformations and comparison with ICUs There are a couple of differences here due to using ICU: 1. Titlecasing behaves slightly differently. We previously transformed "123dollars" to "123Dollars", as we would use word segmentation to split a string into words, then transform the first cased character to titlecase. ICU doesn't go quite that far, and leaves the string as "123dollars". While this is a behavior change, the only user of this API is the `text-transform: capitalize;` CSS rule, and we now match the behavior of other browsers. 2. There isn't an API to compare strings with case insensitivity without allocating case-folded strings for both the left- and right-hand-side strings. Our implementation was previously allocation-free; however, in a benchmark, ICU is still ~1.4x faster.	2024-06-20 10:59:55 +02:00
Timothy Flynn	8d7216f4e0	LibUnicode: Replace IDNA ASCII conversion with ICU	2024-06-18 21:07:56 +02:00
Timothy Flynn	1feef17bf7	LibUnicode: Remove completely unused code point name & block name data These were used for e.g. the Character Map on Serenity, but are not used at all for Ladybird.	2024-06-18 21:07:56 +02:00
Simon Wanner	5bcb019106	LibUnicode: Add IDNA::to_ascii This implements the ToASCII operation of Unicode Technical Standard 46	2023-12-10 08:04:58 -05:00
Simon Wanner	cfd0a60863	LibUnicode: Add Punycode::encode	2023-12-10 08:04:58 -05:00
Simon Wanner	299d35aadc	LibUnicode: Add Punycode::decode	2023-12-10 08:04:58 -05:00
Shannon Booth	d777b279e3	LibUnicode+Tests: Remove now unused `to_unicode_*_full` methods Relocating all of the tests for these in LibUnicode over to the AK String testsuite.	2023-11-28 17:15:27 -05:00
Timothy Flynn	139c575cc9	LibUnicode: Update to Unicode version 15.1.0 https://unicode.org/versions/Unicode15.1.0/ This update includes a new set of code point properties, Indic Conjunct Break. These may have the values Consonant, Linker, or Extend. These are used in text segmentation to prevent breaking on some extended grapheme cluster sequences.	2023-09-15 18:30:26 +02:00
Timothy Flynn	02a8683266	LibUnicode+LibJS: Stop propagating small OOM errors from normalization This API only performs small allocations, and is only used by LibJS.	2023-09-09 13:03:25 -04:00
Sam Atkins	0d021a63c7	LibUnicode: Generate data for bidirectional character types This will let us examine code points to determine the rtl/ltr direction of a piece of text.	2023-08-20 16:21:35 -04:00
Timothy Flynn	456211932f	LibUnicode: Perform code point case conversion lookups in constant time Similar to commit `0652cc4`, we now generate 2-stage lookup tables for case conversion information. Only about 1500 code points are actually cased. This means that case information is rather highly compressible, as the blocks we break the code points into will generally all have no casing information at all. In total, this change: * Does not change the size of libunicode.so (which is nice because, generally, the 2-stage lookup tables are expected to trade a bit of size for performance). * Reduces the runtime of the new benchmark test case added here from 1.383s to 1.127s (about an 18.5% improvement).	2023-07-28 05:28:50 +02:00
Timothy Flynn	0652cc48c0	LibUnicode: Perform code point property lookups in constant time We currently produce a single table for all categories of code point properties (GeneralCategory, Script, etc.). Each row contains a field indicating the range of code points to which that property applies. At runtime, we then do a binary search through that table to decide if a code point has a property. This changes our approach to generate a 2-stage lookup table for each of those categories. There is an in-depth explanation of these tables above the new `create_code_point_tables` method. The end effect is that code point property lookup is reduced from a binary search to constant-time array lookups. In total, this change: * Increases the size of libunicode.so from 2.7 MB to 2.9 MB. * Reduces the runtime of the new benchmark test case added here from 3.576s to 1.020s (a 3.5x speedup). * In a profile of resizing a TextEditor window with a 3MB file open, the runtime of checking if a code point has a word break property reduces from ~81% to ~56%.	2023-07-26 08:36:20 +02:00
Timothy Flynn	c950f88611	LibUnicode: Stop generating Block property data We started generating this data in commit `0505e03`, but it was unused. It's still not used, so let's remove it, rather than bloating the size of libunicode.so with unused data. If we need it in the future, it's trivial to add back. Note we have always used the block name data from that commit, and that is still present here.	2023-07-26 08:36:20 +02:00
Timothy Flynn	f8a0365002	LibUnicode: Detect ZWJ sequences when filtering by emoji presentation This was preventing some unqualified emoji sequences from rendering properly, such as the custom SerenityOS flag. We rendered the flag correctly when given the fully qualified sequence: U+1F3F3 U+FE0F U+200D U+1F41E But were not detecting the unqualified sequence as an emoji when also filtering for emoji-presentation sequences: U+1F3F3 U+200D U+1F41E	2023-03-05 20:21:57 +01:00
Timothy Flynn	73239fdd82	LibUnicode: Add a unit test for Unicode grapheme and word segmentation These include tests for previously broken boundary conditions.	2023-02-25 22:23:39 +01:00
Timothy Flynn	1484d3d9f5	LibUnicode: Add a method to check if a code point could start an emoji	2023-02-24 19:48:47 +01:00
Timothy Flynn	5cbf054651	LibUnicode: Fix typos causing text segmentation on mid-word punctuation For example the words "can't" and "32.3" should not have boundaries detected on the "'" and "." code points, respectively. The String test cases fixed here are because "b'ar" is now considered one word.	2023-02-15 12:36:47 +01:00
Timothy Flynn	8f2589b3b0	LibUnicode: Parse and generate case folding code point data Case folding rules have a similar mapping style as special casing rules, where one code point may map to zero or more case folding rules. These will be used for case-insensitive string comparisons. To see how case folding can differ from other casing rules, consider "ß" (U+00DF): >>> "ß".lower() 'ß' >>> "ß".upper() 'SS' >>> "ß".title() 'Ss' >>> "ß".casefold() 'ss'	2023-01-18 14:43:40 +00:00
Timothy Flynn	bc51017a03	LibUnicode: Support full case folding for titlecasing a string Unicode declares that to titlecase a string, the first cased code point after each word boundary should be transformed to its titlecase mapping. All other codepoints are transformed to their lowercase mapping.	2023-01-16 18:33:44 -05:00
Timothy Flynn	b562348d31	LibUnicode: Generate simple case folding mappings for titlecase Note we already generate the special case foldings for titlecase.	2023-01-16 18:33:44 -05:00
Timothy Flynn	3d22efccca	LibUnicode+LibJS: Propagate OOM from Unicode normalization	2023-01-09 22:48:15 +00:00
Timothy Flynn	1ff29afc45	LibUnicode+LibJS+LibWeb: Propagate OOM from Unicode case transformations	2023-01-09 22:48:15 +00:00
Timothy Flynn	f38c68177b	LibUnicode: Update code point ideographic replacements for Unicode 15	2022-10-07 18:17:40 +01:00
matcool	104b51b912	LibUnicode: Fix Hangul syllable composition for specific cases This fixes `combine_hangul_code_points` which would try to combine a LVT syllable with a trailing consonant, resulting in a wrong character. Also added a test for this specific case.	2022-10-07 07:53:27 -04:00
matcool	c8d7b0a33a	Tests: Add tests for LibUnicode's normalize	2022-10-06 08:24:39 -04:00
Timothy Flynn	9e860d973e	LibLocale: Move locale source files to the LibLocale library Everything is now set up to create the LibLocale library and link it where needed.	2022-09-05 14:37:16 -04:00
Timothy Flynn	b2d2bb43ce	LibLocale: Move locale test files to the LibLocale folder	2022-09-05 14:37:16 -04:00
Timothy Flynn	43a3471298	LibLocale: Move locale source files to the LibLocale folder These are still included in LibUnicode, but this updates their location and the include paths of other files which include them.	2022-09-05 14:37:16 -04:00
Timothy Flynn	ff48220dca	Userland: Move files destined for LibLocale to the Locale namespace	2022-09-05 14:37:16 -04:00
Timothy Flynn	fc8bf7ac3e	LibUnicode+Userland: Migrate generated CLDR data to LibLocaleData Currently, LibUnicodeData contains the generated UCD and CLDR data. Move the UCD data to the main LibUnicode library, and rename LibUnicodeData to LibLocaleData. This is another preparatory change to migrate to LibLocale.	2022-09-05 14:37:16 -04:00
sin-ack	3f3f45580a	Everywhere: Add sv suffix to strings relying on StringView(char const*) Each of these strings would previously rely on StringView's char const* constructor overload, which would call __builtin_strlen on the string. Since we now have operator ""sv, we can replace these with much simpler versions. This opens the door to being able to remove StringView(char const*). No functional changes.	2022-07-12 23:11:35 +02:00
thankyouverycool	5658524aa3	Tests: Add Unicode tests for CharacterType block properties	2022-02-15 10:13:19 -05:00
Timothy Flynn	6efbafa6e0	Everywhere: Update copyrights with my new serenityos.org e-mail :^)	2022-01-31 18:23:22 +00:00
Timothy Flynn	4400150cd2	LibJS+LibUnicode: Return the appropriate time zone name depending on DST	2022-01-19 21:20:41 +00:00
Timothy Flynn	bdf02c21e1	LibUnicode: Swap the preferred order of standard time zone display names Our generator is currently preferring the DST variant of the time zone display names over the non-DST variant. LibTimeZone currently does not have DST support, and operates in a mode that basically assumes DST does not exist. Swap the display names for now just to be consistent until we have DST support. Note we will need to generate both of these variants and select the appropriate one at runtime once we have DST support.	2022-01-12 15:43:12 +01:00
Timothy Flynn	e2dfbe8f67	LibUnicode: Parse and generate long and short generic time zone names This implements the CalendarPatternStyle::{Long,Short}Generic styles of time zone name formatting.	2022-01-11 23:56:35 +01:00
Timothy Flynn	d50f5e14f8	LibUnicode: Fall back to GMT offset when a time zone name is unavailable The following table in TR-35 includes a web of fall back rules when the requested time zone style is unavailable: https://unicode.org/reports/tr35/tr35-dates.html#dfst-zone Conveniently, the subset of styles supported by ECMA-402 (and therefore LibUnicode) all either fall back to GMT offset or to a style that is unsupported but itself falls back to GMT offset.	2022-01-11 23:56:35 +01:00
Timothy Flynn	8d35563f28	LibUnicode: Implement TR-35's localized GMT offset formatting This adds an API to use LibTimeZone to convert a time zone such as "America/New_York" to a GMT offset string like "GMT-5" (short form) or "GMT-05:00" (long form).	2022-01-11 23:56:35 +01:00
Timothy Flynn	6d7d9dd324	LibUnicode: Do not assume time zones & meta zones have a 1-to-1 mapping The generator parses metaZones.json to form a mapping of meta zones to time zones (AKA "golden zone" in TR-35). This parser errantly assumed this was a 1-to-1 mapping.	2022-01-06 22:28:01 +01:00
Timothy Flynn	ffb3ba3079	Tests: Link some tests directly against LibUnicodeData These were missed in `565a880ce5`. This wasn't an issue because these tests don't pledge/unveil anything, so they could happily dlopen() the library at runtime. But this is now needed in order to migrate LibUnicode towards weak symbols instead.	2022-01-04 22:49:43 +00:00
Timothy Flynn	7e6ad172a4	LibUnicode: Support code point names that apply to ranges of code points For example, consider the following adjacent entries in UnicodeData.txt: 3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;; 4DBF;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;; Our current implementation would assign the display name "CJK Ideograph Extension A" to code points U+3400 & U+4DBF, but not to the code points in between. Not only should those code points be assigned a name, but the Unicode spec also has formatting rules on what the names should be (the names for these ranged code points are not as they appear in UnicodeData.txt). The spec also defines names for code point ranges that actually are listed individually in UnicodeData.txt. For example: 2F800;CJK COMPATIBILITY IDEOGRAPH-2F800;Lo;0;L;4E3D;;;;N;;;;; 2F801;CJK COMPATIBILITY IDEOGRAPH-2F801;Lo;0;L;4E38;;;;N;;;;; 2F802;CJK COMPATIBILITY IDEOGRAPH-2F802;Lo;0;L;4E41;;;;N;;;;; Code points are only coalesced into a range if all fields after the name are equivalent. Our parser will insert the range and its name formatting pattern when it comes across the first code point in that range, then ignore other code points in that range. This reduces the number of names we generated by nearly 2,000.	2021-11-30 11:24:02 +01:00
Timothy Flynn	93ee922027	LibUnicode: Support locales-without-script aliases for ECMA-402 As noted by ECMA-402, if a supported locale contains all of a language, script, and region subtag, then the implementation must also support the locale without the script subtag. The most complicated example of this is the zh-TW locale. The list of locales in the CLDR database does not include zh-TW or its maximized zh-Hant-TW variant. Instead, it includes the zh-Hant locale. However, zh-Hant-TW is listed in the default-content locale list in the cldr-core package. This defines an alias from zh-Hant-TW to zh-Hant. We must then also support the zh-Hant-TW alias without the script subtag: zh-TW. This transitively maps zh-TW to zh-Hant, which is a case quite heavily tested by test262.	2021-11-19 11:45:35 +01:00
Timothy Flynn	357c97dfa8	LibUnicode: Parse the CLDR's defaultContent.json locale list This file contains the list of locales which default to their parent locale's values. In the core CLDR dataset, these locales have their own files, but they are empty (except for identity data). For example: https://github.com/unicode-org/cldr/blob/main/common/main/en_US.xml In the JSON export, these files are excluded, so we currently are not recognizing these locales just by iterating the locale files. This is a prerequisite for upgrading to CLDR version 40. One of these default-content locales is the popular "en-US" locale, which defaults to "en" values. We were previously inferring the existence of this locale from the "en-US-POSIX" locale (many implementations, including ours, strip variants such as POSIX). However, v40 removes the "en-US-POSIX" locale entirely, meaning that without this change, we wouldn't know that "en-US" exists (we would default to "en"). For more detail on this and other v40 changes, see: https://cldr.unicode.org/index/downloads/cldr-40#h.nssoo2lq3cba	2021-11-09 20:44:52 +01:00
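
Commits 0652cc48c0 and 456211932f above describe replacing a binary search over code point ranges with generated 2-stage lookup tables. The following is a minimal sketch of that general technique, not the code from `create_code_point_tables`: the function names, the block size of 256, and the use of Python's `unicodedata` as the property source are all assumptions made for illustration.

```python
import unicodedata

BLOCK_SIZE = 256  # assumed block size; the real generator may use another


def build_two_stage_table(values, block_size=BLOCK_SIZE):
    """Compress a flat per-code-point list into (stage1, stage2).

    stage2 stores each distinct block of values once, end to end.
    stage1 maps every block of code points to its block index in stage2.
    """
    if len(values) % block_size != 0:
        raise ValueError("len(values) must be a multiple of block_size")

    stage1 = []
    stage2 = []
    seen = {}  # block contents -> block index within stage2
    for start in range(0, len(values), block_size):
        block = tuple(values[start:start + block_size])
        if block not in seen:
            seen[block] = len(stage2) // block_size
            stage2.extend(block)
        stage1.append(seen[block])
    return stage1, stage2


def lookup(stage1, stage2, code_point, block_size=BLOCK_SIZE):
    """Two array reads, independent of how many ranges the property has."""
    block_index = stage1[code_point // block_size]
    return stage2[block_index * block_size + code_point % block_size]


def build_uppercase_table():
    """Hypothetical example property: General_Category == Lu."""
    flat = [unicodedata.category(chr(cp)) == "Lu" for cp in range(0x110000)]
    return flat, build_two_stage_table(flat)
```

Because most blocks of the code space carry no value for a given property, they all collapse into one shared block in `stage2`, which is why the commit messages report little or no size growth for the constant-time lookup.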

84 Commits
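
Commit 104b51b912 in the list above fixes `combine_hangul_code_points`, which tried to combine an LVT syllable with a trailing consonant. As a rough illustration of the rule involved, here is a sketch of pairwise Hangul composition using the constants from the Unicode Standard's Hangul syllable composition algorithm. The name `compose_hangul` and its structure are hypothetical and are not taken from LibUnicode.

```python
# Constants from the Unicode Standard's conjoining jamo arithmetic.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
S_COUNT = L_COUNT * V_COUNT * T_COUNT  # 11172 precomposed syllables


def compose_hangul(first, second):
    """Return the composed code point for (first, second), or None."""
    # Leading consonant + vowel -> LV syllable.
    if L_BASE <= first < L_BASE + L_COUNT and V_BASE <= second < V_BASE + V_COUNT:
        l_index = first - L_BASE
        v_index = second - V_BASE
        return S_BASE + (l_index * V_COUNT + v_index) * T_COUNT

    # LV syllable + trailing consonant -> LVT syllable.
    # A syllable is LV only if its index is a multiple of T_COUNT; an LVT
    # syllable already has a trailing consonant and must not take another.
    s_index = first - S_BASE
    is_lv = 0 <= s_index < S_COUNT and s_index % T_COUNT == 0
    if is_lv and T_BASE < second < T_BASE + T_COUNT:
        return first + (second - T_BASE)

    return None
```

With this check in place, `compose_hangul(0xAC00, 0x11A8)` yields U+AC01, while `compose_hangul(0xAC01, 0x11A8)` returns `None` because U+AC01 is already an LVT syllable.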