Data analysts should be able to Text.match, Text.match_all, Text.is_match to find or check matches (#3841)

Implements https://www.pivotaltracker.com/story/show/181266092

# Important Notes
Also renaming `Text.location_of` and `Text.location_of_all` to `Text.locate` and `Text.locate_all`.
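
For readers unfamiliar with the three methods named in the title, the sketch below models their intended semantics in Python rather than Enso, so that it can be run standalone. It is an illustration under stated assumptions, not the Enso implementation: the helper names `match`, `match_all` and `is_match` and the `last` flag are hypothetical stand-ins for `Text.match`, `Text.match_all`, `Text.is_match` and `Matching_Mode.Last`, and Python's `re` engine is assumed to behave like Enso's regex engine on these simple patterns.

```python
import re


def match(text, term, last=False):
    """Model of Text.match: first (or last) regex match, or None."""
    found = [m.group(0) for m in re.finditer(term, text)]
    if not found:
        return None
    return found[-1] if last else found[0]


def match_all(text, term):
    """Model of Text.match_all: every non-overlapping match, possibly empty."""
    return [m.group(0) for m in re.finditer(term, text)]


def is_match(text, pattern):
    """Model of Text.is_match: True only if the whole text matches."""
    return re.fullmatch(pattern, text) is not None


if __name__ == "__main__":
    print(match("aabbbbccccaabcaaaa", "a[ab]c"))
    print(match_all("aabcbbccaacaa", "a[ab]c"))
    print(is_match("contact@enso.org", ".+@.+"))
```

The inputs are taken from the documentation examples in this commit. Note the asymmetry the model preserves: `match` yields `None` when nothing is found, while `match_all` yields an empty list.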
Radosław Waśko 2022-11-18 23:17:42 +01:00 committed by GitHub
parent 8f3bfe8ce2
commit 5b6fd74929
7 changed files with 283 additions and 337 deletions

View File

@ -241,6 +241,7 @@
create derived Columns.][3782]
- [Added support for milli and micro seconds, new short form for rename_columns
and fixed issue with compare_to versus Nothing][3874]
- [Aligned `Text.match`/`Text.locate` API][3841]
[debug-shortcuts]:
https://github.com/enso-org/enso/blob/develop/app/gui/docs/product/shortcuts.md#debug
@ -383,6 +384,7 @@
[3782]: https://github.com/enso-org/enso/pull/3782
[3863]: https://github.com/enso-org/enso/pull/3863
[3874]: https://github.com/enso-org/enso/pull/3874
[3841]: https://github.com/enso-org/enso/pull/3841
#### Enso Compiler

View File

@ -126,119 +126,64 @@ Text.characters self =
self.each bldr.append
bldr.to_vector
## ALIAS Match Text
## ALIAS find
Matches the text in `self` against the provided regex `pattern`, returning
the match(es) if present or `Nothing` if there are no matches.
Matches the text in `self` against the provided `term`, returning the first
or last match if present or `Nothing` if there are no matches.
Arguments:
- pattern: The pattern to match `self` against. We recommend using _raw text_
- term: The pattern to match `self` against. We recommend using _raw text_
to write your patterns.
- mode: This argument specifies how many matches the engine will try and
find. When mode is set to either `Regex_Mode.First` or `Regex_Mode.Full`,
this method will return either a single `Match` or `Nothing`. If set to an
`Integer` or `Regex_Mode.All`, this method will return either
a `Vector Match` or `Nothing`.
- match_ascii: Enables or disables pure-ASCII matching for the regex. If you
know your data only contains ASCII then you can enable this for a
performance boost on some regex engines.
- case_insensitive: Enables or disables case-insensitive matching. Case
insensitive matching behaves as if it normalises the case of all input
text before matching on it.
- dot_matches_newline: Enables or disables the dot matches newline option.
This specifies that the `.` special character should match everything
_including_ newline characters. Without this flag, it will match all
characters _except_ newlines.
- multiline: Enables or disables the multiline option. Multiline specifies
that the `^` and `$` pattern characters match the start and end of lines,
as well as the start and end of the input respectively.
- comments: Enables or disables the comments mode for the regular expression.
In comments mode, the following changes apply:
- Whitespace within the pattern is ignored, except when within a
character class or when preceded by an unescaped backslash, or within
grouping constructs (e.g. `(?...)`).
- When a line contains a `#`, that is not in a character class and is not
preceded by an unescaped backslash, all characters from the leftmost
such `#` to the end of the line are ignored. That is to say, they act
as _comments_ in the regex.
- extra_opts: Specifies additional options in a vector. This allows options
to be supplied and computed without having to break them out into arguments
to the function. Where these overlap with one of the flags (`match_ascii`,
`case_insensitive`, `dot_matches_newline`, `multiline` and `verbose`), the
flags take precedence.
! Boolean Flags and Extra Options
This function contains a number of arguments that are boolean flags that
enable or disable common options for the regex. At the same time, it also
provides the ability to specify options in the `extra_opts` argument.
Where one of the flags is _set_ (has the value `True` or `False`), the
value of the flag takes precedence over the value in `extra_opts` when
merging the options to the engine. The flags are _unset_ (have value
`Nothing`) by default.
- mode: This argument specifies whether the first or last match should be
returned.
- matcher: If a `Text_Matcher`, the text is compared using case-sensitivity
rules specified in the matcher. If a `Regex_Matcher`, the term is used as a
regular expression and matched using the associated options.
> Example
Find matches for a basic email regex in some text. NOTE: This regex is
_not_ compliant with RFC 5322.
Find the first substring matching the regex.
example_match =
regex = ".+@.+"
"contact@enso.org".match regex
Text.match : Text | Engine.Pattern -> (Regex_Mode | Matching_Mode) -> Boolean | Nothing -> Boolean | Nothing -> Boolean | Nothing -> Boolean | Nothing -> Boolean | Nothing -> Vector Option.Option -> Match | Vector Match | Nothing ! Regex.Compile_Error
Text.match self pattern mode=Regex_Mode.All match_ascii=Nothing case_insensitive=Nothing dot_matches_newline=Nothing multiline=Nothing comments=Nothing extra_opts=[] =
compiled_pattern = Regex.compile pattern match_ascii=match_ascii case_insensitive=case_insensitive dot_matches_newline=dot_matches_newline multiline=multiline comments=comments extra_opts=extra_opts
compiled_pattern.match self mode
regex = "a[ab]c"
"aabbbbccccaabcaaaa".match regex == "abc"
Text.match : Text -> (Matching_Mode.First | Matching_Mode.Last) -> Matcher -> Text | Nothing
Text.match self term mode=Matching_Mode.First matcher=Regex_Matcher.Regex_Matcher_Data =
case self.locate term mode matcher of
Nothing -> Nothing
span -> span.text
## ALIAS find_all
Matches all occurrences of the provided `term` in `self`, returning a vector
of matches.
Arguments:
- term: The pattern to match `self` against. We recommend using _raw text_
to write your patterns.
- matcher: If a `Text_Matcher`, the text is compared using case-sensitivity
rules specified in the matcher. If a `Regex_Matcher`, the term is used as a
regular expression and matched using the associated options.
> Example
Find all substrings matching the regex.
example_match =
regex = "a[ab]c"
"aabcbbccaacaa".match_all regex == ["abc", "aac"]
Text.match_all : Text -> (Text_Matcher | Regex_Matcher) -> Vector Text
Text.match_all self term=".*" matcher=Regex_Matcher.Regex_Matcher_Data =
self.locate_all term matcher . map .text
## ALIAS Check Matches
Matches the text in `self` against the provided regex `pattern`, returning
`True` if the text matches at least once, and `False` otherwise.
Checks if the whole text in `self` matches a provided `pattern`.
Arguments:
- pattern: The pattern to match `self` against. We recommend using _raw text_
to write your patterns.
- mode: This argument specifies how many matches the engine will try and
find. When mode is set to either `Regex_Mode.First` or `Regex_Mode.Full`,
this method will return either a single `Match` or `Nothing`. If set to an
`Integer` or `Regex_Mode.All`, this method will return either
a `Vector Match` or `Nothing`.
- match_ascii: Enables or disables pure-ASCII matching for the regex. If you
know your data only contains ASCII then you can enable this for a
performance boost on some regex engines.
- case_insensitive: Enables or disables case-insensitive matching. Case
insensitive matching behaves as if it normalises the case of all input
text before matching on it.
- dot_matches_newline: Enables or disables the dot matches newline option.
This specifies that the `.` special character should match everything
_including_ newline characters. Without this flag, it will match all
characters _except_ newlines.
- multiline: Enables or disables the multiline option. Multiline specifies
that the `^` and `$` pattern characters match the start and end of lines,
as well as the start and end of the input respectively.
- comments: Enables or disables the comments mode for the regular expression.
In comments mode, the following changes apply:
- Whitespace within the pattern is ignored, except when within a
character class or when preceded by an unescaped backslash, or within
grouping constructs (e.g. `(?...)`).
- When a line contains a `#`, that is not in a character class and is not
preceded by an unescaped backslash, all characters from the leftmost
such `#` to the end of the line are ignored. That is to say, they act
as _comments_ in the regex.
- extra_opts: Specifies additional options in a vector. This allows options
to be supplied and computed without having to break them out into arguments
to the function. Where these overlap with one of the flags (`match_ascii`,
`case_insensitive`, `dot_matches_newline`, `multiline` and `verbose`), the
flags take precedence.
! Boolean Flags and Extra Options
This function contains a number of arguments that are boolean flags that
enable or disable common options for the regex. At the same time, it also
provides the ability to specify options in the `extra_opts` argument.
Where one of the flags is _set_ (has the value `True` or `False`), the
value of the flag takes precedence over the value in `extra_opts` when
merging the options to the engine. The flags are _unset_ (have value
`Nothing`) by default.
- matcher: If a `Text_Matcher`, the text is compared using case-sensitivity
rules specified in the matcher. If a `Regex_Matcher`, the term is used as a
regular expression and matched using the associated options.
> Example
Checks if some text matches a basic email regex. NOTE: This regex is _not_
@ -246,74 +191,15 @@ Text.match self pattern mode=Regex_Mode.All match_ascii=Nothing case_insensitive
example_match =
regex = ".+@.+"
"contact@enso.org".matches regex
Text.matches : Text | Engine.Pattern -> Boolean | Nothing -> Boolean | Nothing -> Boolean | Nothing -> Boolean | Nothing -> Boolean | Nothing -> Vector Option.Option -> Boolean ! Regex.Compile_Error
Text.matches self pattern match_ascii=Nothing case_insensitive=Nothing dot_matches_newline=Nothing multiline=Nothing comments=Nothing extra_opts=[] =
compiled_pattern = Regex.compile pattern match_ascii=match_ascii case_insensitive=case_insensitive dot_matches_newline=dot_matches_newline multiline=multiline comments=comments extra_opts=extra_opts
"contact@enso.org".is_match regex
Text.is_match : Text -> Matcher -> Boolean ! Regex.Compile_Error
Text.is_match self pattern=".*" matcher=Regex_Matcher.Regex_Matcher_Data = case matcher of
Text_Matcher.Case_Sensitive -> self == pattern
Text_Matcher.Case_Insensitive locale -> self.equals_ignore_case pattern locale
_ : Regex_Matcher.Regex_Matcher ->
compiled_pattern = matcher.compile pattern
compiled_pattern.matches self
## ALIAS Find Text
Finds all occurrences of `pattern` in the text `self`, returning the text(s)
if present, or `Nothing` if there are no matches.
Arguments:
- pattern: The pattern to match `self` against. We recommend using _raw text_
to write your patterns.
- mode: This argument specifies how many matches the engine will try and
find. When mode is set to either `Regex_Mode.First` or `Regex_Mode.Full`,
this method will return either a single `Text` or `Nothing`. If set to an
`Integer` or `Regex_Mode.All`, this method will return either
a `Vector Text` or `Nothing`.
- match_ascii: Enables or disables pure-ASCII matching for the regex. If you
know your data only contains ASCII then you can enable this for a
performance boost on some regex engines.
- case_insensitive: Enables or disables case-insensitive matching. Case
insensitive matching behaves as if it normalises the case of all input
text before matching on it.
- dot_matches_newline: Enables or disables the dot matches newline option.
This specifies that the `.` special character should match everything
_including_ newline characters. Without this flag, it will match all
characters _except_ newlines.
- multiline: Enables or disables the multiline option. Multiline specifies
that the `^` and `$` pattern characters match the start and end of lines,
as well as the start and end of the input respectively.
- comments: Enables or disables the comments mode for the regular expression.
In comments mode, the following changes apply:
- Whitespace within the pattern is ignored, except when within a
character class or when preceded by an unescaped backslash, or within
grouping constructs (e.g. `(?...)`).
- When a line contains a `#`, that is not in a character class and is not
preceded by an unescaped backslash, all characters from the leftmost
such `#` to the end of the line are ignored. That is to say, they act
as _comments_ in the regex.
- extra_opts: Specifies additional options in a vector. This allows options
to be supplied and computed without having to break them out into arguments
to the function. Where these overlap with one of the flags (`match_ascii`,
`case_insensitive`, `dot_matches_newline`, `multiline` and `verbose`), the
flags take precedence.
! Boolean Flags and Extra Options
This function contains a number of arguments that are boolean flags that
enable or disable common options for the regex. At the same time, it also
provides the ability to specify options in the `extra_opts` argument.
Where one of the flags is _set_ (has the value `True` or `False`), the
value of the flag takes precedence over the value in `extra_opts` when
merging the options to the engine. The flags are _unset_ (have value
`Nothing`) by default.
> Example
Find words that contain three or less letters in text`"\w{1,3}"`
example_find =
text = "Now I know my ABCs"
text.find "\w{1,3}"
Text.find : Text | Engine.Pattern -> (Regex_Mode | Matching_Mode) -> Boolean | Nothing -> Boolean | Nothing -> Boolean | Nothing -> Boolean | Nothing -> Boolean | Nothing -> Vector Option.Option -> Text | Vector Text | Nothing
Text.find self pattern mode=Regex_Mode.All match_ascii=Nothing case_insensitive=Nothing dot_matches_newline=Nothing multiline=Nothing comments=Nothing extra_opts=[] =
compiled_pattern = Regex.compile pattern match_ascii=match_ascii case_insensitive=case_insensitive dot_matches_newline=dot_matches_newline multiline=multiline comments=comments extra_opts=extra_opts
compiled_pattern.find self mode
## ALIAS Split Text
Takes a delimiter and returns the vector that results from splitting `self`
@ -343,8 +229,8 @@ Text.find self pattern mode=Regex_Mode.All match_ascii=Nothing case_insensitive=
'abc def\tghi'.split '\\s+' Regex_Matcher.Regex_Matcher_Data == ["abc", "def", "ghi"]
Text.split : Text -> (Text_Matcher | Regex_Matcher) -> Vector Text
Text.split self delimiter="," matcher=Text_Matcher.Case_Sensitive = if delimiter.is_empty then Error.throw (Illegal_Argument_Error_Data "The delimiter cannot be empty.") else
case Meta.type_of matcher of
Text_Matcher.Text_Matcher ->
case matcher of
_ : Text_Matcher.Text_Matcher ->
delimiters = Vector.from_polyglot_array <| case matcher of
Text_Matcher.Case_Sensitive ->
Text_Utils.span_of_all self delimiter
@ -356,7 +242,7 @@ Text.split self delimiter="," matcher=Text_Matcher.Case_Sensitive = if delimiter
end = if i == delimiters.length then (Text_Utils.char_length self) else
delimiters.at i . codeunit_start
Text_Utils.substring self start end
Regex_Matcher.Regex_Matcher ->
_ : Regex_Matcher.Regex_Matcher ->
compiled_pattern = matcher.compile delimiter
compiled_pattern.split self mode=Regex_Mode.All
@ -438,8 +324,8 @@ Text.split self delimiter="," matcher=Text_Matcher.Case_Sensitive = if delimiter
"aaa aaa".replace "aa" "c" mode=Matching_Mode.Last matcher=Regex_Matcher . should_equal "aaa ca"
Text.replace : Text -> Text -> Matching_Mode | Regex_Mode -> (Text_Matcher | Regex_Matcher) -> Text
Text.replace self term="" new_text="" mode=Regex_Mode.All matcher=Text_Matcher.Case_Sensitive = if term.is_empty then self else
case Meta.type_of matcher of
Text_Matcher.Text_Matcher ->
case matcher of
_ : Text_Matcher.Text_Matcher ->
array_from_single_result result = case result of
Nothing -> Array.empty
_ -> Array.new_1 result
@ -463,7 +349,7 @@ Text.replace self term="" new_text="" mode=Regex_Mode.All matcher=Text_Matcher.C
Text_Utils.span_of_case_insensitive self term locale.java_locale True
_ -> Error.throw (Illegal_Argument_Error_Data "Invalid mode.")
Text_Utils.replace_spans self spans_array new_text
Regex_Matcher.Regex_Matcher ->
_ : Regex_Matcher.Regex_Matcher ->
compiled_pattern = matcher.compile term
compiled_pattern.replace self new_text mode=mode
@ -882,7 +768,7 @@ Text.starts_with self prefix matcher=Text_Matcher.Case_Sensitive = case matcher
Text_Matcher.Case_Sensitive -> Text_Utils.starts_with self prefix
Text_Matcher.Case_Insensitive locale ->
self.take (Index_Sub_Range.First prefix.length) . equals_ignore_case prefix locale=locale
Regex_Matcher.Regex_Matcher_Data _ _ _ _ _ ->
_ : Regex_Matcher.Regex_Matcher ->
preprocessed_pattern = "\A(?:" + prefix + ")"
compiled_pattern = matcher.compile preprocessed_pattern
match = compiled_pattern.match self Regex_Mode.First
@ -917,7 +803,7 @@ Text.ends_with self suffix matcher=Text_Matcher.Case_Sensitive = case matcher of
Text_Matcher.Case_Sensitive -> Text_Utils.ends_with self suffix
Text_Matcher.Case_Insensitive locale ->
self.take (Index_Sub_Range.Last suffix.length) . equals_ignore_case suffix locale=locale
Regex_Matcher.Regex_Matcher_Data _ _ _ _ _ ->
_ : Regex_Matcher.Regex_Matcher ->
preprocessed_pattern = "(?:" + suffix + ")\z"
compiled_pattern = matcher.compile preprocessed_pattern
match = compiled_pattern.match self Regex_Mode.First
@ -979,7 +865,7 @@ Text.contains self term="" matcher=Text_Matcher.Case_Sensitive = case matcher of
Text_Matcher.Case_Sensitive -> Text_Utils.contains self term
Text_Matcher.Case_Insensitive locale ->
Text_Utils.contains_case_insensitive self term locale.java_locale
Regex_Matcher.Regex_Matcher_Data _ _ _ _ _ ->
_ : Regex_Matcher.Regex_Matcher ->
compiled_pattern = matcher.compile term
match = compiled_pattern.match self Regex_Mode.First
match.is_nothing.not
@ -1031,9 +917,7 @@ Text.* self count = self.repeat count
"Hello ".repeat 2 == "Hello Hello "
Text.repeat : Integer -> Text
Text.repeat self count=1 =
## TODO max is a workaround until Range is sorted to make 0..-1 not cause an infinite loop
https://www.pivotaltracker.com/story/show/181435598
0.up_to (count.max 0) . fold "" acc-> _-> acc + self
0.up_to count . fold "" acc-> _-> acc + self
## ALIAS first, last, left, right, mid, substring
Creates a new Text by selecting the specified range of the input.
@ -1262,7 +1146,7 @@ Text.trim self where=Location.Both what=_.is_whitespace =
if start_index >= end_index then "" else
Text_Utils.substring self start_index end_index
## ALIAS find, index_of, position_of, span_of
## ALIAS index_of, position_of, span_of
Find the location of the `term` in the input.
Returns a Span representing the location at which the term was found, or
`Nothing` if the term was not found in the input.
@ -1286,9 +1170,9 @@ Text.trim self where=Location.Both what=_.is_whitespace =
> Example
Finding location of a substring.
"Hello World!".location_of "J" == Nothing
"Hello World!".location_of "o" == Span (Range 4 5) "Hello World!"
"Hello World!".location_of "o" mode=Matching_Mode.Last == Span (Range 7 8) "Hello World!"
"Hello World!".locate "J" == Nothing
"Hello World!".locate "o" == Span (Range 4 5) "Hello World!"
"Hello World!".locate "o" mode=Matching_Mode.Last == Span (Range 7 8) "Hello World!"
! Match Length
The function returns not only the index of the match but a `Span` instance
@ -1308,7 +1192,7 @@ Text.trim self where=Location.Both what=_.is_whitespace =
term = "straße"
text = "MONUMENTENSTRASSE 42"
match = text . location_of term matcher=(Text_Matcher Case_Insensitive)
match = text . locate term matcher=(Text_Matcher Case_Insensitive)
term.length == 6
match.length == 7
@ -1329,11 +1213,11 @@ Text.trim self where=Location.Both what=_.is_whitespace =
ligatures = "ffiffl"
ligatures.length == 2
term_1 = "IFF"
match_1 = ligatures . location_of term_1 matcher=(Text_Matcher Case_Insensitive)
match_1 = ligatures . locate term_1 matcher=(Text_Matcher Case_Insensitive)
term_1.length == 3
match_1.length == 2
term_2 = "ffiffl"
match_2 = ligatures . location_of term_2 matcher=(Text_Matcher Case_Insensitive)
match_2 = ligatures . locate term_2 matcher=(Text_Matcher Case_Insensitive)
term_2.length == 6
match_2.length == 2
# After being extended to full grapheme clusters, both terms "IFF" and "ffiffl" match the same span of grapheme clusters.
@ -1349,13 +1233,13 @@ Text.trim self where=Location.Both what=_.is_whitespace =
> Example
Comparing Matching in Last Mode in Regex and Text mode
"aaa".location_of "aa" mode=Matching_Mode.Last matcher=Text_Matcher == Span (Range 1 3) "aaa"
"aaa".location_of "aa" mode=Matching_Mode.Last matcher=Regex_Matcher == Span (Range 0 2) "aaa"
"aaa".locate "aa" mode=Matching_Mode.Last matcher=Text_Matcher == Span (Range 1 3) "aaa"
"aaa".locate "aa" mode=Matching_Mode.Last matcher=Regex_Matcher == Span (Range 0 2) "aaa"
"aaa aaa".location_of "aa" mode=Matching_Mode.Last matcher=Text_Matcher == Span (Range 5 7) "aaa aaa"
"aaa aaa".location_of "aa" mode=Matching_Mode.Last matcher=Regex_Matcher == Span (Range 4 6) "aaa aaa"
Text.location_of : Text -> (Matching_Mode.First | Matching_Mode.Last) -> Matcher -> Span | Nothing
Text.location_of self term="" mode=Matching_Mode.First matcher=Text_Matcher.Case_Sensitive = case matcher of
"aaa aaa".locate "aa" mode=Matching_Mode.Last matcher=Text_Matcher == Span (Range 5 7) "aaa aaa"
"aaa aaa".locate "aa" mode=Matching_Mode.Last matcher=Regex_Matcher == Span (Range 4 6) "aaa aaa"
Text.locate : Text -> (Matching_Mode.First | Matching_Mode.Last) -> Matcher -> Span | Nothing
Text.locate self term="" mode=Matching_Mode.First matcher=Text_Matcher.Case_Sensitive = case matcher of
Text_Matcher.Case_Sensitive ->
codepoint_span = case mode of
Matching_Mode.First -> Text_Utils.span_of self term
@ -1391,7 +1275,7 @@ Text.location_of self term="" mode=Matching_Mode.First matcher=Text_Matcher.Case
Nothing -> Nothing
matches -> matches.last.span 0 . to_grapheme_span
## ALIAS find_all, index_of_all, position_of_all, span_of_all
## ALIAS index_of_all, position_of_all, span_of_all
Finds all the locations of the `term` in the input.
If not found, the function returns an empty Vector.
@ -1411,8 +1295,8 @@ Text.location_of self term="" mode=Matching_Mode.First matcher=Text_Matcher.Case
> Example
Finding locations of all occurrences of a substring.
"Hello World!".location_of_all "J" == []
"Hello World!".location_of_all "o" . map .start == [4, 7]
"Hello World!".locate_all "J" == []
"Hello World!".locate_all "o" . map .start == [4, 7]
! Match Length
The function returns not only the index of the match but a `Span` instance
@ -1432,7 +1316,7 @@ Text.location_of self term="" mode=Matching_Mode.First matcher=Text_Matcher.Case
term = "strasse"
text = "MONUMENTENSTRASSE ist eine große Straße."
match = text . location_of_all term matcher=(Text_Matcher Case_Insensitive)
match = text . locate_all term matcher=(Text_Matcher Case_Insensitive)
term.length == 7
match . map .length == [7, 6]
@ -1452,12 +1336,12 @@ Text.location_of self term="" mode=Matching_Mode.First matcher=Text_Matcher.Case
ligatures = "ffifflFFIFF"
ligatures.length == 7
match_1 = ligatures . location_of_all "IFF" matcher=(Text_Matcher Case_Insensitive)
match_1 = ligatures . locate_all "IFF" matcher=(Text_Matcher Case_Insensitive)
match_1 . map .length == [2, 3]
match_2 = ligatures . location_of_all "ffiff" matcher=(Text_Matcher Case_Insensitive)
match_2 = ligatures . locate_all "ffiff" matcher=(Text_Matcher Case_Insensitive)
match_2 . map .length == [2, 5]
Text.location_of_all : Text -> Matcher -> [Span]
Text.location_of_all self term="" matcher=Text_Matcher.Case_Sensitive = if term.is_empty then Vector.new (self.length + 1) (ix -> Span_Data (Range_Data ix ix) self) else case matcher of
Text.locate_all : Text -> Matcher -> [Span]
Text.locate_all self term="" matcher=Text_Matcher.Case_Sensitive = if term.is_empty then Vector.new (self.length + 1) (ix -> Span_Data (Range_Data ix ix) self) else case matcher of
Text_Matcher.Case_Sensitive ->
codepoint_spans = Vector.from_polyglot_array <| Text_Utils.span_of_all self term
grahpeme_ixes = Vector.from_polyglot_array <| Text_Utils.utf16_indices_to_grapheme_indices self (codepoint_spans.map .codeunit_start).to_array
@ -1472,7 +1356,7 @@ Text.location_of_all self term="" matcher=Text_Matcher.Case_Sensitive = if term.
grapheme_spans = Vector.from_polyglot_array <| Text_Utils.span_of_all_case_insensitive self term locale.java_locale
grapheme_spans.map grapheme_span->
Span_Data (Range_Data grapheme_span.grapheme_start grapheme_span.grapheme_end) self
Regex_Matcher.Regex_Matcher_Data _ _ _ _ _ ->
_ : Regex_Matcher.Regex_Matcher ->
case matcher.compile term . match self Regex_Mode.All of
Nothing -> []
matches -> matches.map m-> m.span 0 . to_grapheme_span
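
The "Comparing Matching in Last Mode in Regex and Text mode" examples in the `locate` documentation give different spans for the same input. One plausible reading, sketched here in Python as an assumption rather than as the Enso implementation, is that text matching searches backwards for the last occurrence (so it may overlap an earlier one), while regex matching collects non-overlapping matches from the left and returns the final one. The helper names are hypothetical, and offsets are Python code-point indices, which coincide with grapheme indices for these ASCII inputs.

```python
import re


def locate_last_text(text, term):
    """Last occurrence by plain search; may overlap earlier occurrences."""
    start = text.rfind(term)
    return None if start < 0 else (start, start + len(term))


def locate_last_regex(text, term):
    """Last of the non-overlapping regex matches, scanned left to right."""
    spans = [m.span() for m in re.finditer(term, text)]
    return spans[-1] if spans else None


if __name__ == "__main__":
    for sample in ("aaa", "aaa aaa"):
        print(sample, locate_last_text(sample, "aa"), locate_last_regex(sample, "aa"))
```

Under this model, `"aaa"` yields `(1, 3)` for text matching and `(0, 2)` for regex matching, in line with the documented `Span (Range 1 3)` and `Span (Range 0 2)`.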

View File

@ -27,7 +27,7 @@ type Span
Arguments:
- range: The range of characters over which the span exists. The range is
assumed to have `step` equal to 1.
- text: The text over which the span exists.
- parent: The text over which the span exists.
! What is a Character?
A character is defined as an Extended Grapheme Cluster, see Unicode
@ -43,7 +43,7 @@ type Span
text = "Hello!"
range = 0.up_to 3
Span.Span_Data range text
Span_Data (range : Range.Range) (text : Text)
Span_Data (range : Range.Range) (parent : Text)
## The index of the first character included in the span.
@ -73,6 +73,10 @@ type Span
length : Integer
length self = self.range.length
## Returns the part of the text that this span covers.
text : Text
text self = self.to_utf_16_span.text
## Converts the span of extended grapheme clusters to a corresponding span
of UTF-16 code units.
@ -83,7 +87,7 @@ type Span
(Span_Data (Range 1 3) text).to_utf_16_span == (Utf_16_Span_Data (Range 1 4) text)
to_utf_16_span : Utf_16_Span
to_utf_16_span self =
Utf_16_Span_Data (range_to_char_indices self.text self.range) self.text
Utf_16_Span_Data (range_to_char_indices self.parent self.range) self.parent
# TODO Dubious constructor export
from project.Data.Text.Span.Utf_16_Span import all
@ -96,7 +100,7 @@ type Utf_16_Span
Arguments:
- range: The range of code units over which the span exists. The range is
assumed to have `step` equal to 1.
- text: The text over which the span exists.
- parent: The text over which the span exists.
> Example
Creating a span over the first three code units of the text 'a\u{301}bc'.
@ -106,7 +110,7 @@ type Utf_16_Span
example_span =
text = 'a\u{301}bc'
Span.Utf_16_Span_Data (Range 0 3) text
Utf_16_Span_Data (range : Range.Range) (text : Text)
Utf_16_Span_Data (range : Range.Range) (parent : Text)
## The index of the first code unit included in the span.
start : Integer
@ -121,6 +125,10 @@ type Utf_16_Span
length : Integer
length self = self.range.length
## Returns the part of the text that this span covers.
text : Text
text self = Text_Utils.substring self.parent self.start self.end
## Returns a span of extended grapheme clusters which is the closest
approximation of this span of code units.
@ -139,14 +147,14 @@ type Utf_16_Span
extended == Span_Data (Range 0 3) text # The span is extended to the whole string since it contained code units from every grapheme cluster.
extended.to_utf_16_span == Utf_16_Span_Data (Range 0 6) text
to_grapheme_span : Span
to_grapheme_span self = if (self.start < 0) || (self.end > Text_Utils.char_length self.text) then Error.throw (Illegal_State_Error "Utf_16_Span indices are out of range of the associated text.") else
to_grapheme_span self = if (self.start < 0) || (self.end > Text_Utils.char_length self.parent) then Error.throw (Illegal_State_Error_Data "Utf_16_Span indices are out of range of the associated text.") else
if self.end < self.start then Error.throw (Illegal_State_Error "Utf_16_Span invariant violation: start <= end") else
case self.start == self.end of
True ->
grapheme_ix = Text_Utils.utf16_index_to_grapheme_index self.text self.start
Span_Data (Range_Data grapheme_ix grapheme_ix) self.text
grapheme_ix = Text_Utils.utf16_index_to_grapheme_index self.parent self.start
Span_Data (Range_Data grapheme_ix grapheme_ix) self.parent
False ->
grapheme_ixes = Text_Utils.utf16_indices_to_grapheme_indices self.text [self.start, self.end - 1].to_array
grapheme_ixes = Text_Utils.utf16_indices_to_grapheme_indices self.parent [self.start, self.end - 1].to_array
grapheme_first = grapheme_ixes.at 0
grapheme_last = grapheme_ixes.at 1
## We find the grapheme index of the last code unit actually contained within our span and set the
@ -154,7 +162,7 @@ type Utf_16_Span
only a part of a grapheme were contained in our original span, the resulting span will be
extended to contain this whole grapheme.
grapheme_end = grapheme_last + 1
Span_Data (Range_Data grapheme_first grapheme_end) self.text
Span_Data (Range_Data grapheme_first grapheme_end) self.parent
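
The rename of the span field from `text` to `parent`, together with the new `text` method, can be modelled with a short Python sketch. This is an illustration only: the `Span` class below is a hypothetical stand-in for the Enso type, and grapheme segmentation is simplified to "a combining mark attaches to the preceding character", which is an assumption and not the full Extended Grapheme Cluster rules.

```python
import unicodedata
from dataclasses import dataclass


def grapheme_starts(text):
    """Code-point offsets at which a (simplified) grapheme cluster begins."""
    return [i for i, ch in enumerate(text)
            if i == 0 or unicodedata.combining(ch) == 0]


@dataclass(frozen=True)
class Span:
    start: int   # index of the first grapheme in the span
    end: int     # index one past the last grapheme in the span
    parent: str  # the whole text over which the span exists

    @property
    def text(self):
        """The part of `parent` that this span covers."""
        starts = grapheme_starts(self.parent) + [len(self.parent)]
        return self.parent[starts[self.start]:starts[self.end]]


if __name__ == "__main__":
    span = Span(0, 3, "Hello!")
    print(span.parent, span.text)
```

The inputs mirror the updated `Text.Span` tests in this commit: `parent` returns the full text, while `text` returns only the covered part.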
## PRIVATE
Utility function taking a range pointing at grapheme clusters and converting

View File

@ -500,7 +500,7 @@ type File
extension : Text
extension self =
name = self.name
last_dot = name.location_of "." mode=Matching_Mode.Last
last_dot = name.locate "." mode=Matching_Mode.Last
if last_dot.is_nothing then "" else
extension = name.drop (First last_dot.start)
if extension == "." then "" else extension
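
The logic of `File.extension` above, which now calls `locate` with `Matching_Mode.Last`, can be mirrored in a few lines of Python. This is a sketch of the same steps, not the Enso code; `extension` here is a free function taking the file name, which is an assumption made for the sake of a standalone example.

```python
def extension(name):
    """Return the extension including the leading dot, or "" if there is none."""
    last_dot = name.rfind(".")   # position of the last dot, -1 if absent
    if last_dot < 0:
        return ""
    ext = name[last_dot:]        # drop everything before the last dot
    return "" if ext == "." else ext


if __name__ == "__main__":
    for sample in ("archive.tar.gz", "README", "trailing."):
        print(repr(sample), repr(extension(sample)))
```

As in the Enso version, a name that ends in a bare dot is treated as having no extension.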

View File

@ -9,7 +9,7 @@ type Test_Result
Arguments:
- message: The reason why the test failed.
- details: Additional context of the error, for example the stack trace.
Failure message details
Failure message details=Nothing
## Represents a pending behavioral test.

View File

@ -10,11 +10,14 @@ spec = Test.group "Text.Span" <|
span = Span_Data (Range_Data 0 3) text
span.start . should_equal 0
span.end . should_equal 3
span.text . should_equal text
span.parent . should_equal text
span.text . should_equal "Hel"
Test.specify "should be able to be converted to code units" <|
text = 'ae\u{301}fz'
(Span_Data (Range_Data 1 3) text).to_utf_16_span . should_equal (Utf_16_Span_Data (Range_Data 1 4) text)
span = Span_Data (Range_Data 1 3) text
span.to_utf_16_span . should_equal (Utf_16_Span_Data (Range_Data 1 4) text)
span.text . should_equal 'e\u{301}f'
Test.specify "should expand to the associated grapheme clusters" <|
text = 'a\u{301}e\u{302}o\u{303}'

View File

@ -1110,16 +1110,16 @@ spec =
'✨🚀🚧'*2 . should_equal '✨🚀🚧✨🚀🚧'
Test.specify "location_of should work as shown in examples" <|
Test.specify "locate should work as shown in examples" <|
example_1 =
"Hello World!".location_of "J" == Nothing
"Hello World!".location_of "o" == Span_Data (Range_Data 4 5) "Hello World!"
"Hello World!".location_of "o" mode=Matching_Mode.Last == Span_Data (Range_Data 4 5) "Hello World!"
"Hello World!".locate "J" == Nothing
"Hello World!".locate "o" == Span_Data (Range_Data 4 5) "Hello World!"
"Hello World!".locate "o" mode=Matching_Mode.Last == Span_Data (Range_Data 4 5) "Hello World!"
example_2 =
term = "straße"
text = "MONUMENTENSTRASSE 42"
match = text . location_of term matcher=Text_Matcher.Case_Insensitive
match = text . locate term matcher=Text_Matcher.Case_Insensitive
term.length . should_equal 6
match.length . should_equal 7
@ -1127,32 +1127,32 @@ spec =
ligatures = "ffiffl"
ligatures.length . should_equal 2
term_1 = "IFF"
match_1 = ligatures . location_of term_1 matcher=Text_Matcher.Case_Insensitive
match_1 = ligatures . locate term_1 matcher=Text_Matcher.Case_Insensitive
term_1.length . should_equal 3
match_1.length . should_equal 2
term_2 = "ffiffl"
match_2 = ligatures . location_of term_2 matcher=Text_Matcher.Case_Insensitive
match_2 = ligatures . locate term_2 matcher=Text_Matcher.Case_Insensitive
term_2.length . should_equal 6
match_2.length . should_equal 2
match_1 . should_equal match_2
example_4 =
"Hello World!".location_of_all "J" . should_equal []
"Hello World!".location_of_all "o" . map .start . should_equal [4, 7]
"Hello World!".locate_all "J" . should_equal []
"Hello World!".locate_all "o" . map .start . should_equal [4, 7]
example_5 =
term = "strasse"
text = "MONUMENTENSTRASSE ist eine große Straße."
match = text . location_of_all term matcher=Text_Matcher.Case_Insensitive
match = text . locate_all term matcher=Text_Matcher.Case_Insensitive
term.length . should_equal 7
match . map .length . should_equal [7, 6]
example_6 =
ligatures = "ffifflFFIFF"
ligatures.length . should_equal 7
match_1 = ligatures . location_of_all "IFF" matcher=Text_Matcher.Case_Insensitive
match_1 = ligatures . locate_all "IFF" matcher=Text_Matcher.Case_Insensitive
match_1 . map .length . should_equal [2, 3]
match_2 = ligatures . location_of_all "ffiff" matcher=Text_Matcher.Case_Insensitive
match_2 = ligatures . locate_all "ffiff" matcher=Text_Matcher.Case_Insensitive
match_2 . map .length . should_equal [2, 5]
# Put them in blocks to avoid name clashes.
@@ -1163,165 +1163,216 @@ spec =
example_5
example_6
Test.specify "should allow to find location_of occurrences within a text" <|
"Hello World!".location_of_all "J" . should_equal []
"Hello World!".location_of_all "o" . map .start . should_equal [4, 7]
Test.specify "should allow to find locate occurrences within a text" <|
"Hello World!".locate_all "J" . should_equal []
"Hello World!".locate_all "o" . map .start . should_equal [4, 7]
accents = 'a\u{301}e\u{301}o\u{301}'
accents.location_of accent_1 . should_equal (Span_Data (Range_Data 1 2) accents)
accents.locate accent_1 . should_equal (Span_Data (Range_Data 1 2) accents)
"".location_of "foo" . should_equal Nothing
"".location_of "foo" mode=Matching_Mode.Last . should_equal Nothing
"".location_of_all "foo" . should_equal []
"".location_of "" . should_equal (Span_Data (Range_Data 0 0) "")
"".location_of "" mode=Matching_Mode.Last . should_equal (Span_Data (Range_Data 0 0) "")
"".location_of_all "" . should_equal [Span_Data (Range_Data 0 0) ""]
"".locate "foo" . should_equal Nothing
"".locate "foo" mode=Matching_Mode.Last . should_equal Nothing
"".locate_all "foo" . should_equal []
"".locate "" . should_equal (Span_Data (Range_Data 0 0) "")
"".locate "" mode=Matching_Mode.Last . should_equal (Span_Data (Range_Data 0 0) "")
"".locate_all "" . should_equal [Span_Data (Range_Data 0 0) ""]
abc = 'A\u{301}ßC'
abc.location_of "" . should_equal (Span_Data (Range_Data 0 0) abc)
abc.location_of "" mode=Matching_Mode.Last . should_equal (Span_Data (Range_Data 3 3) abc)
abc.location_of_all "" . should_equal [Span_Data (Range_Data 0 0) abc, Span_Data (Range_Data 1 1) abc, Span_Data (Range_Data 2 2) abc, Span_Data (Range_Data 3 3) abc]
abc.locate "" . should_equal (Span_Data (Range_Data 0 0) abc)
abc.locate "" mode=Matching_Mode.Last . should_equal (Span_Data (Range_Data 3 3) abc)
abc.locate_all "" . should_equal [Span_Data (Range_Data 0 0) abc, Span_Data (Range_Data 1 1) abc, Span_Data (Range_Data 2 2) abc, Span_Data (Range_Data 3 3) abc]
Test.specify "should allow case-insensitive matching in location_of" <|
Test.specify "should allow case-insensitive matching in locate" <|
hello = "Hello WORLD!"
case_insensitive = Text_Matcher.Case_Insensitive
hello.location_of "world" . should_equal Nothing
hello.location_of "world" matcher=case_insensitive . should_equal (Span_Data (Range_Data 6 11) hello)
hello.locate "world" . should_equal Nothing
hello.locate "world" matcher=case_insensitive . should_equal (Span_Data (Range_Data 6 11) hello)
hello.location_of "o" mode=Regex_Mode.First matcher=case_insensitive . should_equal (Span_Data (Range_Data 4 5) hello)
hello.location_of "o" mode=Matching_Mode.Last matcher=case_insensitive . should_equal (Span_Data (Range_Data 7 8) hello)
hello.locate "o" mode=Regex_Mode.First matcher=case_insensitive . should_equal (Span_Data (Range_Data 4 5) hello)
hello.locate "o" mode=Matching_Mode.Last matcher=case_insensitive . should_equal (Span_Data (Range_Data 7 8) hello)
accents = 'A\u{301}E\u{301}O\u{301}'
accents.location_of accent_1 matcher=case_insensitive . should_equal (Span_Data (Range_Data 1 2) accents)
accents.locate accent_1 matcher=case_insensitive . should_equal (Span_Data (Range_Data 1 2) accents)
"Strasse".location_of "ß" matcher=case_insensitive . should_equal (Span_Data (Range_Data 4 6) "Strasse")
"Monumentenstraße 42".location_of "STRASSE" matcher=case_insensitive . should_equal (Span_Data (Range_Data 10 16) "Monumentenstraße 42")
"Strasse".locate "ß" matcher=case_insensitive . should_equal (Span_Data (Range_Data 4 6) "Strasse")
"Monumentenstraße 42".locate "STRASSE" matcher=case_insensitive . should_equal (Span_Data (Range_Data 10 16) "Monumentenstraße 42")
'\u0390'.location_of '\u03B9\u0308\u0301' matcher=case_insensitive . should_equal (Span_Data (Range_Data 0 1) '\u0390')
'ԵՒ'.location_of 'և' . should_equal Nothing
'ԵՒ'.location_of 'և' matcher=case_insensitive . should_equal (Span_Data (Range_Data 0 2) 'ԵՒ')
'և'.location_of 'ԵՒ' matcher=case_insensitive . should_equal (Span_Data (Range_Data 0 1) 'և')
'\u0390'.locate '\u03B9\u0308\u0301' matcher=case_insensitive . should_equal (Span_Data (Range_Data 0 1) '\u0390')
'ԵՒ'.locate 'և' . should_equal Nothing
'ԵՒ'.locate 'և' matcher=case_insensitive . should_equal (Span_Data (Range_Data 0 2) 'ԵՒ')
'և'.locate 'ԵՒ' matcher=case_insensitive . should_equal (Span_Data (Range_Data 0 1) 'և')
ligatures = 'ffafffiflffifflſtstZ'
ligatures.location_of 'FFI' matcher=case_insensitive . should_equal (Span_Data (Range_Data 3 5) ligatures)
ligatures.location_of 'FF' matcher=case_insensitive . should_equal (Span_Data (Range_Data 0 2) ligatures)
ligatures.location_of 'ff' matcher=case_insensitive mode=Matching_Mode.Last . should_equal (Span_Data (Range_Data 7 8) ligatures)
ligatures.location_of_all 'ff' . should_equal [Span_Data (Range_Data 0 2) ligatures]
ligatures.location_of_all 'FF' matcher=case_insensitive . should_equal [Span_Data (Range_Data 0 2) ligatures, Span_Data (Range_Data 3 4) ligatures, Span_Data (Range_Data 6 7) ligatures, Span_Data (Range_Data 7 8) ligatures]
ligatures.location_of_all 'ffi' matcher=case_insensitive . should_equal [Span_Data (Range_Data 3 5) ligatures, Span_Data (Range_Data 6 7) ligatures]
'fffi'.location_of_all 'ff' matcher=case_insensitive . should_equal [Span_Data (Range_Data 0 2) 'fffi']
'fffi'.location_of_all 'ffi' . should_equal []
'fffi'.location_of_all 'ffi' matcher=case_insensitive . should_equal [Span_Data (Range_Data 1 4) 'fffi']
'FFFI'.location_of 'ffi' matcher=case_insensitive . should_equal (Span_Data (Range_Data 1 4) 'FFFI')
ligatures.locate 'FFI' matcher=case_insensitive . should_equal (Span_Data (Range_Data 3 5) ligatures)
ligatures.locate 'FF' matcher=case_insensitive . should_equal (Span_Data (Range_Data 0 2) ligatures)
ligatures.locate 'ff' matcher=case_insensitive mode=Matching_Mode.Last . should_equal (Span_Data (Range_Data 7 8) ligatures)
ligatures.locate_all 'ff' . should_equal [Span_Data (Range_Data 0 2) ligatures]
ligatures.locate_all 'FF' matcher=case_insensitive . should_equal [Span_Data (Range_Data 0 2) ligatures, Span_Data (Range_Data 3 4) ligatures, Span_Data (Range_Data 6 7) ligatures, Span_Data (Range_Data 7 8) ligatures]
ligatures.locate_all 'ffi' matcher=case_insensitive . should_equal [Span_Data (Range_Data 3 5) ligatures, Span_Data (Range_Data 6 7) ligatures]
'fffi'.locate_all 'ff' matcher=case_insensitive . should_equal [Span_Data (Range_Data 0 2) 'fffi']
'fffi'.locate_all 'ffi' . should_equal []
'fffi'.locate_all 'ffi' matcher=case_insensitive . should_equal [Span_Data (Range_Data 1 4) 'fffi']
'FFFI'.locate 'ffi' matcher=case_insensitive . should_equal (Span_Data (Range_Data 1 4) 'FFFI')
'ffiffl'.location_of 'IF' matcher=case_insensitive . should_equal (Span_Data (Range_Data 0 2) 'ffiffl')
'ffiffl'.location_of 'F' Matching_Mode.Last matcher=case_insensitive . should_equal (Span_Data (Range_Data 1 2) 'ffiffl')
'ffiffl'.location_of_all 'F' matcher=case_insensitive . should_equal [Span_Data (Range_Data 0 1) 'ffiffl', Span_Data (Range_Data 0 1) 'ffiffl', Span_Data (Range_Data 1 2) 'ffiffl', Span_Data (Range_Data 1 2) 'ffiffl']
'aaffibb'.location_of_all 'af' matcher=case_insensitive . should_equal [Span_Data (Range_Data 1 3) 'aaffibb']
'aaffibb'.location_of_all 'affi' matcher=case_insensitive . should_equal [Span_Data (Range_Data 1 3) 'aaffibb']
'aaffibb'.location_of_all 'ib' matcher=case_insensitive . should_equal [Span_Data (Range_Data 2 4) 'aaffibb']
'aaffibb'.location_of_all 'ffib' matcher=case_insensitive . should_equal [Span_Data (Range_Data 2 4) 'aaffibb']
'ffiffl'.locate 'IF' matcher=case_insensitive . should_equal (Span_Data (Range_Data 0 2) 'ffiffl')
'ffiffl'.locate 'F' Matching_Mode.Last matcher=case_insensitive . should_equal (Span_Data (Range_Data 1 2) 'ffiffl')
'ffiffl'.locate_all 'F' matcher=case_insensitive . should_equal [Span_Data (Range_Data 0 1) 'ffiffl', Span_Data (Range_Data 0 1) 'ffiffl', Span_Data (Range_Data 1 2) 'ffiffl', Span_Data (Range_Data 1 2) 'ffiffl']
'aaffibb'.locate_all 'af' matcher=case_insensitive . should_equal [Span_Data (Range_Data 1 3) 'aaffibb']
'aaffibb'.locate_all 'affi' matcher=case_insensitive . should_equal [Span_Data (Range_Data 1 3) 'aaffibb']
'aaffibb'.locate_all 'ib' matcher=case_insensitive . should_equal [Span_Data (Range_Data 2 4) 'aaffibb']
'aaffibb'.locate_all 'ffib' matcher=case_insensitive . should_equal [Span_Data (Range_Data 2 4) 'aaffibb']
"".location_of "foo" matcher=case_insensitive . should_equal Nothing
"".location_of "foo" matcher=case_insensitive mode=Matching_Mode.Last . should_equal Nothing
"".location_of_all "foo" matcher=case_insensitive . should_equal []
"".location_of "" matcher=case_insensitive . should_equal (Span_Data (Range_Data 0 0) "")
"".location_of "" matcher=case_insensitive mode=Matching_Mode.Last . should_equal (Span_Data (Range_Data 0 0) "")
"".location_of_all "" matcher=case_insensitive . should_equal [Span_Data (Range_Data 0 0) ""]
"".locate "foo" matcher=case_insensitive . should_equal Nothing
"".locate "foo" matcher=case_insensitive mode=Matching_Mode.Last . should_equal Nothing
"".locate_all "foo" matcher=case_insensitive . should_equal []
"".locate "" matcher=case_insensitive . should_equal (Span_Data (Range_Data 0 0) "")
"".locate "" matcher=case_insensitive mode=Matching_Mode.Last . should_equal (Span_Data (Range_Data 0 0) "")
"".locate_all "" matcher=case_insensitive . should_equal [Span_Data (Range_Data 0 0) ""]
abc = 'A\u{301}ßC'
abc.location_of "" matcher=case_insensitive . should_equal (Span_Data (Range_Data 0 0) abc)
abc.location_of "" matcher=case_insensitive mode=Matching_Mode.Last . should_equal (Span_Data (Range_Data 3 3) abc)
abc.location_of_all "" matcher=case_insensitive . should_equal [Span_Data (Range_Data 0 0) abc, Span_Data (Range_Data 1 1) abc, Span_Data (Range_Data 2 2) abc, Span_Data (Range_Data 3 3) abc]
abc.locate "" matcher=case_insensitive . should_equal (Span_Data (Range_Data 0 0) abc)
abc.locate "" matcher=case_insensitive mode=Matching_Mode.Last . should_equal (Span_Data (Range_Data 3 3) abc)
abc.locate_all "" matcher=case_insensitive . should_equal [Span_Data (Range_Data 0 0) abc, Span_Data (Range_Data 1 1) abc, Span_Data (Range_Data 2 2) abc, Span_Data (Range_Data 3 3) abc]
Test.specify "should allow regexes in location_of" <|
Test.specify "should allow regexes in locate" <|
hello = "Hello World!"
regex = Regex_Matcher.Regex_Matcher_Data
regex_insensitive = Regex_Matcher.Regex_Matcher_Data case_sensitivity=Case_Sensitivity.Insensitive
hello.location_of ".o" Matching_Mode.First matcher=regex . should_equal (Span_Data (Range_Data 3 5) hello)
hello.location_of ".o" Matching_Mode.Last matcher=regex . should_equal (Span_Data (Range_Data 6 8) hello)
hello.location_of_all ".o" matcher=regex . map .start . should_equal [3, 6]
hello.locate ".o" Matching_Mode.First matcher=regex . should_equal (Span_Data (Range_Data 3 5) hello)
hello.locate ".o" Matching_Mode.Last matcher=regex . should_equal (Span_Data (Range_Data 6 8) hello)
hello.locate_all ".o" matcher=regex . map .start . should_equal [3, 6]
"foobar".location_of "BAR" Regex_Mode.First matcher=regex_insensitive . should_equal (Span_Data (Range_Data 3 6) "foobar")
"foobar".locate "BAR" Regex_Mode.First matcher=regex_insensitive . should_equal (Span_Data (Range_Data 3 6) "foobar")
## Regex matching does not do case folding
"Strasse".location_of "ß" Regex_Mode.First matcher=regex_insensitive . should_equal Nothing
"Strasse".locate "ß" Regex_Mode.First matcher=regex_insensitive . should_equal Nothing
## But it should handle the Unicode normalization
accents = 'a\u{301}e\u{301}o\u{301}'
accents.location_of accent_1 Regex_Mode.First matcher=regex . should_equal (Span_Data (Range_Data 1 2) accents)
Test.specify "should correctly handle regex edge cases in location_of" pending="Figure out how to make Regex correctly handle empty patterns." <|
accents.locate accent_1 Regex_Mode.First matcher=regex . should_equal (Span_Data (Range_Data 1 2) accents)
Test.specify "should correctly handle regex edge cases in locate" pending="Figure out how to make Regex correctly handle empty patterns." <|
regex = Regex_Matcher.Regex_Matcher_Data
"".location_of "foo" matcher=regex . should_equal Nothing
"".location_of "foo" matcher=regex mode=Matching_Mode.Last . should_equal Nothing
"".location_of_all "foo" matcher=regex . should_equal []
"".location_of "" matcher=regex . should_equal (Span_Data (Range_Data 0 0) "")
"".location_of_all "" matcher=regex . should_equal [Span_Data (Range_Data 0 0) ""]
"".location_of "" matcher=regex mode=Matching_Mode.Last . should_equal (Span_Data (Range_Data 0 0) "")
"".locate "foo" matcher=regex . should_equal Nothing
"".locate "foo" matcher=regex mode=Matching_Mode.Last . should_equal Nothing
"".locate_all "foo" matcher=regex . should_equal []
"".locate "" matcher=regex . should_equal (Span_Data (Range_Data 0 0) "")
"".locate_all "" matcher=regex . should_equal [Span_Data (Range_Data 0 0) ""]
"".locate "" matcher=regex mode=Matching_Mode.Last . should_equal (Span_Data (Range_Data 0 0) "")
abc = 'A\u{301}ßC'
abc.location_of "" matcher=regex . should_equal (Span_Data (Range_Data 0 0) abc)
abc.location_of_all "" matcher=regex . should_equal [Span_Data (Range_Data 0 0) abc, Span_Data (Range_Data 0 0) abc, Span_Data (Range_Data 1 1) abc, Span_Data (Range_Data 2 2) abc, Span_Data (Range_Data 3 3) abc]
abc.location_of "" matcher=regex mode=Matching_Mode.Last . should_equal (Span_Data (Range_Data 3 3) abc)
abc.locate "" matcher=regex . should_equal (Span_Data (Range_Data 0 0) abc)
abc.locate_all "" matcher=regex . should_equal [Span_Data (Range_Data 0 0) abc, Span_Data (Range_Data 0 0) abc, Span_Data (Range_Data 1 1) abc, Span_Data (Range_Data 2 2) abc, Span_Data (Range_Data 3 3) abc]
abc.locate "" matcher=regex mode=Matching_Mode.Last . should_equal (Span_Data (Range_Data 3 3) abc)
Test.specify "should handle overlapping matches as shown in the examples" <|
"aaa".location_of "aa" mode=Matching_Mode.Last matcher=Text_Matcher.Case_Sensitive . should_equal (Span_Data (Range_Data 1 3) "aaa")
"aaa".location_of "aa" mode=Matching_Mode.Last matcher=Regex_Matcher.Regex_Matcher_Data . should_equal (Span_Data (Range_Data 0 2) "aaa")
"aaa".locate "aa" mode=Matching_Mode.Last matcher=Text_Matcher.Case_Sensitive . should_equal (Span_Data (Range_Data 1 3) "aaa")
"aaa".locate "aa" mode=Matching_Mode.Last matcher=Regex_Matcher.Regex_Matcher_Data . should_equal (Span_Data (Range_Data 0 2) "aaa")
"aaa aaa".location_of "aa" mode=Matching_Mode.Last matcher=Text_Matcher.Case_Sensitive . should_equal (Span_Data (Range_Data 5 7) "aaa aaa")
"aaa aaa".location_of "aa" mode=Matching_Mode.Last matcher=Regex_Matcher.Regex_Matcher_Data . should_equal (Span_Data (Range_Data 4 6) "aaa aaa")
"aaa aaa".locate "aa" mode=Matching_Mode.Last matcher=Text_Matcher.Case_Sensitive . should_equal (Span_Data (Range_Data 5 7) "aaa aaa")
"aaa aaa".locate "aa" mode=Matching_Mode.Last matcher=Regex_Matcher.Regex_Matcher_Data . should_equal (Span_Data (Range_Data 4 6) "aaa aaa")
Test.specify "should allow to match one or more occurrences of a pattern in the text" <|
"abacadae".match_all "a[bc]" . should_equal ["ab", "ac"]
"abacadae".match_all "a." . should_equal ["ab", "ac", "ad", "ae"]
"abacadae".match_all "a.*" . should_equal ["abacadae"]
"abacadae".match_all "a.+?" . should_equal ["ab", "ac", "ad", "ae"]
"abacadae".match "a[bc]" mode=Matching_Mode.Last . should_equal "ac"
"abacadae".match "a." mode=Matching_Mode.Last . should_equal "ae"
"abacadae".match "a.*" mode=Matching_Mode.Last . should_equal "abacadae"
"abacadae".match "a.+?" mode=Matching_Mode.Last . should_equal "ae"
"abacadae".match "a[bc]" matcher=Text_Matcher.Case_Sensitive . should_equal Nothing
"abABacAC".match "ab" matcher=Text_Matcher.Case_Sensitive mode=Matching_Mode.Last . should_equal "ab"
"abABacAC".match "ab" matcher=Text_Matcher.Case_Insensitive mode=Matching_Mode.Last . should_equal "AB"
"abABacAC".match_all "ab" matcher=Text_Matcher.Case_Sensitive . should_equal ["ab"]
"abABacAC".match_all "ab" matcher=Text_Matcher.Case_Insensitive . should_equal ["ab", "AB"]
"abacadae".match_all "a[bc]" matcher=Text_Matcher.Case_Sensitive . should_equal []
"Strasse and Straße".match_all "STRASSE" matcher=Text_Matcher.Case_Sensitive . should_equal []
"Strasse and Straße".match_all "STRASSE" matcher=Text_Matcher.Case_Insensitive . should_equal ["Strasse", "Straße"]
Test.specify "should default to exact matching for locate but regex for match" <|
txt = "aba[bc]adacae"
"ab".locate "ab" . should_equal (Span_Data (Range_Data 0 2) "ab")
"ab".locate "a[bc]" . should_equal Nothing
"ab".locate_all "a[bc]" . should_equal []
txt.locate "a[bc]" . should_equal (Span_Data (Range_Data 2 7) txt)
txt.locate_all "a[bc]" . should_equal [Span_Data (Range_Data 2 7) txt]
"ab".match "a[bc]" . should_equal "ab"
"a[bc]".match "a[bc]" . should_equal Nothing
"a[bc]".match_all "a[bc]" . should_equal []
txt.match "a[bc]" . should_equal "ab"
txt.match_all "a[bc]" . should_equal ["ab", "ac"]
Test.group "Regex matching" <|
Test.specify "should be possible on text" <|
match = "My Text: Goes Here".match "^My Text: (.+)$" mode=Regex_Mode.First
match . should_be_a Default_Engine.Match_Data
match.group 1 . should_equal "Goes Here"
match = "My Text: Goes Here".match "^My Text: (.+)$"
match.should_equal "My Text: Goes Here"
Test.specify "should be possible on unicode text" <|
match = "Korean: 건반".match "^Korean: (.+)$" mode=Regex_Mode.First
match . should_be_a Default_Engine.Match_Data
match.group 1 . should_equal "건반"
txt = "maza건반zaa"
txt.match "^a..z$" . should_equal Nothing
txt.match "^m..a..z.a$" . should_equal txt
txt.match "a..z" . should_equal "a건반z"
Test.specify "should be possible in ascii mode" <|
match = "İ".match "\w" mode=Regex_Mode.First match_ascii=True
match = "İ".match "\w" matcher=(Regex_Matcher.Regex_Matcher_Data match_ascii=True)
match.should_equal Nothing
Test.specify "should be possible in case-insensitive mode" <|
match = "MY".match "my" mode=Regex_Mode.First case_insensitive=True
match . should_be_a Default_Engine.Match_Data
match.group 0 . should_equal "MY"
match = "MY".match "my" matcher=(Regex_Matcher.Regex_Matcher_Data case_sensitivity=Case_Sensitivity.Insensitive)
match.should_equal "MY"
Test.specify "should be possible in dot_matches_newline mode" <|
match = 'Foo\n'.match "(....)" mode=Regex_Mode.First dot_matches_newline=True
match . should_be_a Default_Engine.Match_Data
match.group 0 . should_equal 'Foo\n'
match = 'Foo\n'.match "(....)" matcher=(Regex_Matcher.Regex_Matcher_Data dot_matches_newline=True)
match.should_equal 'Foo\n'
Test.specify "should be possible in multiline mode" <|
text = """
Foo
bar
match = text.match "^(...)$" multiline=True
match.length . should_equal 2
match.at 0 . group 1 . should_equal "Foo"
match.at 1 . group 1 . should_equal "bar"
match = text.match_all "^(...)$" matcher=(Regex_Matcher.Regex_Matcher_Data multiline=True)
match.should_equal ["Foo", "bar"]
Test.specify "should be possible in comments mode" <|
match = "abcde".match "(..) # Match two of any character" comments=True mode=Regex_Mode.First
match . should_be_a Default_Engine.Match_Data
match.group 0 . should_equal "ab"
match = "abcde".match "(..) # Match two of any character" matcher=(Regex_Matcher.Regex_Matcher_Data comments=True)
match.should_equal "ab"
Test.group "Regex matches" <|
Test.specify "should be possible on text" <|
"My Text: Goes Here".matches "^My Text: (.+)$" . should_be_true
Test.group "Text.is_match" <|
Test.specify "should default to regex" <|
"My Text: Goes Here".is_match "^My Text: (.+)$" . should_be_true
"555-801-1923".is_match "^\d{3}-\d{3}-\d{4}$" . should_be_true
"Hello".is_match "^[a-z]+$" . should_be_false
"Hello".is_match "^[a-z]+$" (Regex_Matcher.Regex_Matcher_Data case_sensitivity=Case_Sensitivity.Insensitive) . should_be_true
Test.specify "should only match whole input" <|
"Hello".is_match "[a-z]" . should_be_false
"x".is_match "[a-z]" . should_be_true
Test.specify "should allow Text_Matcher too" <|
"foobar".is_match "foobar" matcher=Text_Matcher.Case_Sensitive . should_be_true
"foobar".is_match "FOOBAR" matcher=Text_Matcher.Case_Sensitive . should_be_false
"foobar".is_match "foo.*" matcher=Text_Matcher.Case_Sensitive . should_be_false
"foobar".is_match "foo" matcher=Text_Matcher.Case_Sensitive . should_be_false
"foobar".is_match "foobar" matcher=Text_Matcher.Case_Insensitive . should_be_true
"foobar".is_match "FOOBAR" matcher=Text_Matcher.Case_Insensitive . should_be_true
"foobar".is_match "foo.*" matcher=Text_Matcher.Case_Insensitive . should_be_false
"foobar".is_match "foo" matcher=Text_Matcher.Case_Insensitive . should_be_false
Test.specify "should be possible on unicode text" <|
"Korean: 건반".matches "^Korean: (.+)$" . should_be_true
"Korean: 건반".is_match "^Korean: (.+)$" . should_be_true
Test.specify "should be possible in ascii mode" <|
"İ".matches "\w" match_ascii=True . should_be_false
"İ".is_match "\w" (Regex_Matcher.Regex_Matcher_Data match_ascii=True) . should_be_false
Test.specify "should be possible in case-insensitive mode" <|
"MY".matches "my" case_insensitive=True . should_be_true
"MY".is_match "my" (Regex_Matcher.Regex_Matcher_Data case_sensitivity=Case_Sensitivity.Insensitive) . should_be_true
Test.specify "should be possible in dot_matches_newline mode" <|
'Foo\n'.matches "(....)" dot_matches_newline=True . should_be_true
'Foo\n'.is_match "(....)" (Regex_Matcher.Regex_Matcher_Data dot_matches_newline=True) . should_be_true
multiline_matches_message = """
This test does not make sense once we require matches to match the
@@ -1332,33 +1383,33 @@ spec =
text = """
Foo
bar
text.matches "^(...)$" multiline=True . should_be_true
text.is_match "^(...)$" (Regex_Matcher.Regex_Matcher_Data multiline=True) . should_be_true
Test.specify "should be possible in comments mode" <|
"abcde".matches "(.....) # Match any five characters" comments=True . should_be_true
"abcde".is_match "(.....) # Match any five characters" (Regex_Matcher.Regex_Matcher_Data comments=True) . should_be_true
Test.group "Regex finding" <|
Test.specify "should be possible on text" <|
match = "My Text: Goes Here".find "^My Text: (.+)$" mode=Regex_Mode.First
match = "My Text: Goes Here".match "^My Text: (.+)$" mode=Matching_Mode.First
match . should_be_a Text
match . should_equal "My Text: Goes Here"
Test.specify "should be possible on unicode text" <|
match = "Korean: 건반".find "^Korean: (.+)$" mode=Regex_Mode.First
match = "Korean: 건반".match "^Korean: (.+)$" mode=Matching_Mode.First
match . should_be_a Text
match . should_equal "Korean: 건반"
Test.specify "should be possible in ascii mode" <|
match = "İ".find "\w" mode=Regex_Mode.First match_ascii=True
match = "İ".match "\w" matcher=(Regex_Matcher.Regex_Matcher_Data match_ascii=True)
match . should_equal Nothing
Test.specify "should be possible in case-insensitive mode" <|
match = "MY".find "my" mode=Regex_Mode.First case_insensitive=True
match = "MY".match "my" matcher=(Regex_Matcher.Regex_Matcher_Data case_sensitivity=Case_Sensitivity.Insensitive)
match . should_be_a Text
match . should_equal "MY"
Test.specify "should be possible in dot_matches_newline mode" <|
match = 'Foo\n'.find "(....)" mode=Regex_Mode.First dot_matches_newline=True
match = 'Foo\n'.match "(....)" matcher=(Regex_Matcher.Regex_Matcher_Data dot_matches_newline=True)
match . should_be_a Text
match . should_equal 'Foo\n'
@@ -1366,13 +1417,11 @@ spec =
text = """
Foo
bar
match = text.find "^(...)$" multiline=True
match.length . should_equal 2
match.at 0 . should_equal "Foo"
match.at 1 . should_equal "bar"
match = text.match_all "^(...)$" matcher=(Regex_Matcher.Regex_Matcher_Data multiline=True)
match . should_equal ["Foo", "bar"]
Test.specify "should be possible in comments mode" <|
match = "abcde".find "(..) # Match two of any character" comments=True mode=Regex_Mode.First
match = "abcde".match "(..) # Match two of any character" matcher=(Regex_Matcher.Regex_Matcher_Data comments=True)
match . should_be_a Text
match . should_equal "ab"