Data analysts should be able to use Text.replace to substitute parts of the text (#3393)

Implements https://www.pivotaltracker.com/story/show/181266274
Radosław Waśko 2022-04-13 21:21:47 +02:00 committed by GitHub
parent 0ab46bc6f8
commit 0ea5dc2a6f
15 changed files with 448 additions and 141 deletions


@@ -105,6 +105,7 @@
- [Implemented `Text.reverse`][3377]
- [Implemented support for most Table aggregations in the Database
  backend.][3383]
+ [Update `Text.replace` to new API.][3393]

[debug-shortcuts]:
  https://github.com/enso-org/enso/blob/develop/app/gui/docs/product/shortcuts.md#debug
@@ -160,6 +161,7 @@
[3383]: https://github.com/enso-org/enso/pull/3383
[3385]: https://github.com/enso-org/enso/pull/3385
[3392]: https://github.com/enso-org/enso/pull/3392
+[3393]: https://github.com/enso-org/enso/pull/3393

#### Enso Compiler


@@ -424,52 +424,21 @@ Text.split separator=Split_Kind.Whitespace mode=Mode.All match_ascii=Nothing cas
        pattern.split this mode=mode

    ## ALIAS Replace Text

-      Replaces each occurrence of `old_sequence` with `new_sequence`, returning
-      `this` unchanged if no matches are found.
+      Replaces the first, last, or all occurrences of term with new_text in the
+      input. If `term` is empty, the function returns the input unchanged.

       Arguments:
-      - old_sequence: The pattern to search for in `this`.
-      - new_sequence: The text to replace every occurrence of `old_sequence` with.
-      - mode: This argument specifies how many matches the engine will try to
-        replace.
-      - match_ascii: Enables or disables pure-ASCII matching for the regex. If you
-        know your data only contains ASCII then you can enable this for a
-        performance boost on some regex engines.
-      - case_insensitive: Enables or disables case-insensitive matching. Case
-        insensitive matching behaves as if it normalises the case of all input
-        text before matching on it.
-      - dot_matches_newline: Enables or disables the dot matches newline option.
-        This specifies that the `.` special character should match everything
-        _including_ newline characters. Without this flag, it will match all
-        characters _except_ newlines.
-      - multiline: Enables or disables the multiline option. Multiline specifies
-        that the `^` and `$` pattern characters match the start and end of lines,
-        as well as the start and end of the input respectively.
-      - comments: Enables or disables the comments mode for the regular expression.
-        In comments mode, the following changes apply:
-        - Whitespace within the pattern is ignored, except when within a
-          character class or when preceded by an unescaped backslash, or within
-          grouping constructs (e.g. `(?...)`).
-        - When a line contains a `#`, that is not in a character class and is not
-          preceded by an unescaped backslash, all characters from the leftmost
-          such `#` to the end of the line are ignored. That is to say, they act
-          as _comments_ in the regex.
-      - extra_opts: Specifies additional options in a vector. This allows options
-        to be supplied and computed without having to break them out into arguments
-        to the function. Where these overlap with one of the flags (`match_ascii`,
-        `case_insensitive`, `dot_matches_newline`, `multiline` and `verbose`), the
-        flags take precedence.
-
-      ! Boolean Flags and Extra Options
-        This function contains a number of arguments that are boolean flags that
-        enable or disable common options for the regex. At the same time, it also
-        provides the ability to specify options in the `extra_opts` argument.
-        Where one of the flags is _set_ (has the value `True` or `False`), the
-        value of the flag takes precedence over the value in `extra_opts` when
-        merging the options to the engine. The flags are _unset_ (have value
-        `Nothing`) by default.
+      - term: The term to find.
+      - new_text: The new text to replace occurrences of `term` with.
+        If `matcher` is a `Regex_Matcher`, `new_text` can include replacement
+        patterns (such as `$<n>`) for a marked group.
+      - mode: Specifies which instances of term the engine tries to find. When the
+        mode is `First` or `Last`, this method replaces the first or last instance
+        of term in the input. If set to `All`, it replaces all instances of term in
+        the input.
+      - matcher: If a `Text_Matcher`, the text is compared using case-sensitivity
+        rules specified in the matcher. If a `Regex_Matcher`, the term is used as a
+        regular expression and matched using the associated options.

       > Example
         Replace letters in the text "aaa".
@@ -477,15 +446,87 @@ Text.split separator=Split_Kind.Whitespace mode=Mode.All match_ascii=Nothing cas

             'aaa'.replace 'aa' 'b' == 'ba'

       > Example
-        Replace every word of two letters or less with the string "SMOL".
-
-            example_replace =
-                text = "I am a very smol word."
-                text.replace "\w\w(?!\w)"
+        Replace all occurrences of letters 'l' and 'o' with '#'.
+
+            "Hello World!".replace "[lo]" "#" matcher=Regex_Matcher == "He### W#r#d!"
+
+      > Example
+        Replace the first occurrence of letter 'l' with '#'.
+
+            "Hello World!".replace "l" "#" mode=Matching_Mode.First == "He#lo World!"
+
+      > Example
+        Replace texts in quotes with parentheses.
+
+            '"abc" foo "bar" baz'.replace '"(.*?)"' '($1)' matcher=Regex_Matcher == '(abc) foo (bar) baz'
+
+      ! Matching Grapheme Clusters
+        In case-insensitive mode, a single character can match multiple characters,
+        for example `ß` will match `ss` and `SS`, and the ligature `ffi` will match
+        `ffi` or `f` etc. Thus in this mode, it is sometimes possible for a term to
+        match only a part of some single grapheme cluster, for example in the text
+        `ffia` the term `ia` will match just one-third of the first grapheme `ffi`.
+        Since we do not have the resolution to distinguish such partial matches, a
+        match which matched just a part of some grapheme cluster is extended and
+        treated as if it matched the whole grapheme cluster. Thus the whole
+        grapheme cluster may be replaced with the replacement text even if just a
+        part of it was matched.
+
+      > Example
+        Extended partial matches in case-insensitive mode.
+
+            # The ß symbol matches the letter `S` twice in case-insensitive mode, because it folds to `ss`.
+            'ß'.replace 'S' 'A' matcher=(Text_Matcher Case_Insensitive) . should_equal 'AA'
+            # The 'ffi' ligature is a single grapheme cluster, so even if just a part of it is matched, the whole grapheme is replaced.
+            'affib'.replace 'i' 'X' matcher=(Text_Matcher Case_Insensitive) . should_equal 'aXb'
+
+      ! Last Match in Regex Mode
+        Regex always performs the search from the front and matching the last
+        occurrence means selecting the last of the matches while still generating
+        matches from the beginning. This will lead to slightly different behavior
+        for overlapping occurrences of a pattern in Regex mode than in exact text
+        matching mode where the matches are searched for from the back.
+
+      > Example
+        Comparing matching in Last mode in Regex and Text mode.
+
+            "aaa".replace "aa" "c" mode=Matching_Mode.Last matcher=Text_Matcher . should_equal "ac"
+            "aaa".replace "aa" "c" mode=Matching_Mode.Last matcher=Regex_Matcher . should_equal "ca"
+            "aaa aaa".replace "aa" "c" matcher=Text_Matcher . should_equal "ca ca"
+            "aaa aaa".replace "aa" "c" mode=Matching_Mode.First matcher=Text_Matcher . should_equal "ca aaa"
+            "aaa aaa".replace "aa" "c" mode=Matching_Mode.Last matcher=Text_Matcher . should_equal "aaa ac"
+            "aaa aaa".replace "aa" "c" matcher=Regex_Matcher . should_equal "ca ca"
+            "aaa aaa".replace "aa" "c" mode=Matching_Mode.First matcher=Regex_Matcher . should_equal "ca aaa"
+            "aaa aaa".replace "aa" "c" mode=Matching_Mode.Last matcher=Regex_Matcher . should_equal "aaa ca"
-   Text.replace : Text | Engine.Pattern -> Text -> Mode.Mode -> Boolean | Nothing -> Boolean | Nothing -> Boolean | Nothing -> Boolean | Nothing -> Boolean | Nothing -> Vector.Vector Option.Option -> Text
-   Text.replace old_sequence new_sequence mode=Mode.All match_ascii=Nothing case_insensitive=Nothing dot_matches_newline=Nothing multiline=Nothing comments=Nothing extra_opts=[] =
-       compiled_pattern = Regex.compile old_sequence match_ascii=match_ascii case_insensitive=case_insensitive dot_matches_newline=dot_matches_newline multiline=multiline comments=comments extra_opts=extra_opts
-       compiled_pattern.replace this new_sequence mode
+   Text.replace : Text -> Text -> (Matching_Mode.First | Matching_Mode.Last | Mode.All) -> (Text_Matcher | Regex_Matcher) -> Text
+   Text.replace term="" new_text="" mode=Mode.All matcher=Text_Matcher = if term.is_empty then this else
+       case matcher of
+           Text_Matcher case_sensitivity ->
+               array_from_single_result result = case result of
+                   Nothing -> Array.empty
+                   _ -> Array.new_1 result
+               spans_array = case case_sensitivity of
+                   True -> case mode of
+                       Mode.All ->
+                           Text_Utils.span_of_all this term
+                       Matching_Mode.First ->
+                           array_from_single_result <| Text_Utils.span_of this term
+                       Matching_Mode.Last ->
+                           array_from_single_result <| Text_Utils.last_span_of this term
+                   Case_Insensitive locale -> case mode of
+                       Mode.All ->
+                           Text_Utils.span_of_all_case_insensitive this term locale.java_locale
+                       Matching_Mode.First ->
+                           array_from_single_result <|
+                               Text_Utils.span_of_case_insensitive this term locale.java_locale False
+                       Matching_Mode.Last ->
+                           array_from_single_result <|
+                               Text_Utils.span_of_case_insensitive this term locale.java_locale True
+               Text_Utils.replace_spans this spans_array new_text
+           Regex_Matcher _ _ _ _ _ ->
+               compiled_pattern = matcher.compile term
+               compiled_pattern.replace this new_text mode=mode
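The `Text_Matcher` branch of the new `Text.replace` first collects the spans of all, first, or last occurrences, then splices `new_text` over them in one pass. A minimal Python sketch of that two-step flow (an illustration with hypothetical helper names, not the Enso code; the real implementation works on UTF-16 spans via `Text_Utils`, and `Last` searches from the back like `Text_Utils.last_span_of`):

```python
def find_spans(text, term, mode="all"):
    """Collect (start, end) spans of `term` in `text` (code-point indices).
    Mode "last" searches from the back; "all"/"first" scan from the front."""
    if mode == "last":
        ix = text.rfind(term)
        return [] if ix == -1 else [(ix, ix + len(term))]
    spans = []
    ix = text.find(term)
    while ix != -1:
        spans.append((ix, ix + len(term)))
        if mode == "first":
            break
        # Continue after the match, so occurrences never overlap.
        ix = text.find(term, ix + len(term))
    return spans

def replace_spans(text, spans, new_text):
    """Splice `new_text` over each span, keeping the text between spans."""
    parts, current = [], 0
    for start, end in spans:
        parts.append(text[current:start])
        parts.append(new_text)
        current = end
    parts.append(text[current:])
    return "".join(parts)
```

For example, `replace_spans("aaa aaa", find_spans("aaa aaa", "aa"), "c")` yields `"ca ca"`, matching the `Text_Matcher` examples in the documentation above.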
## ALIAS Get Words ## ALIAS Get Words
@@ -1223,16 +1264,16 @@ Text.trim where=Location.Both what=_.is_whitespace =
       which contains both the start and end indices, allowing to determine the
       length of the match. This is useful not only with regex matches (where a
       regular expression can have matches of various lengths) but also for case
-      insensitive matching. In case insensitive mode, a single character can
+      insensitive matching. In case-insensitive mode, a single character can
       match multiple characters, for example `ß` will match `ss` and `SS`, and
-      the ligature `ffi` will match `ffi` or `f` etc. Thus in case insensitive
+      the ligature `ffi` will match `ffi` or `f` etc. Thus in case-insensitive
       mode, the length of the match can be shorter or longer than the term that
       was being matched, so it is extremely important to not rely on the length
       of the matched term when analysing the matches as they may have different
       lengths.

       > Example
-        Match length differences in case insensitive matching.
+        Match length differences in case-insensitive matching.

             term = "straße"
             text = "MONUMENTENSTRASSE 42"
@@ -1241,7 +1282,7 @@ Text.trim where=Location.Both what=_.is_whitespace =
             match.length == 7

       ! Matching Grapheme Clusters
-        In case insensitive mode, a single character can match multiple characters,
+        In case-insensitive mode, a single character can match multiple characters,
         for example `ß` will match `ss` and `SS`, and the ligature `ffi` will match
         `ffi` or `f` etc. Thus in this mode, it is sometimes possible for a term to
         match only a part of some single grapheme cluster, for example in the text
@@ -1266,6 +1307,22 @@ Text.trim where=Location.Both what=_.is_whitespace =
             match_2.length == 2
             # After being extended to full grapheme clusters, both terms "IFF" and "ffiffl" match the same span of grapheme clusters.
             match_1 == match_2
! Last Match in Regex Mode
Regex always performs the search from the front and matching the last
occurrence means selecting the last of the matches while still generating
matches from the beginning. This will lead to slightly different behavior
for overlapping occurrences of a pattern in Regex mode than in exact text
matching mode where the matches are searched for from the back.
> Example
Comparing Matching in Last Mode in Regex and Text mode
"aaa".location_of "aa" mode=Matching_Mode.Last matcher=Text_Matcher == Span (Range 1 3) "aaa"
"aaa".location_of "aa" mode=Matching_Mode.Last matcher=Regex_Matcher == Span (Range 0 2) "aaa"
"aaa aaa".location_of "aa" mode=Matching_Mode.Last matcher=Text_Matcher == Span (Range 5 7) "aaa aaa"
"aaa aaa".location_of "aa" mode=Matching_Mode.Last matcher=Regex_Matcher == Span (Range 4 6) "aaa aaa"
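The asymmetry described in the "Last Match in Regex Mode" note can be reproduced outside Enso. A small Python illustration (not the Enso implementation) of why Regex mode and exact-text mode disagree on overlapping occurrences:

```python
import re

def last_span_regex(text, pattern):
    """Regex mode: generate non-overlapping matches from the front,
    then select the last of them."""
    spans = [m.span() for m in re.finditer(pattern, text)]
    return spans[-1] if spans else None

def last_span_text(text, term):
    """Exact-text mode: search for the last occurrence from the back."""
    ix = text.rfind(term)
    return None if ix == -1 else (ix, ix + len(term))
```

On `"aaa"` with term `"aa"`, text mode finds the span `(1, 3)` while regex mode finds `(0, 2)`, matching the `location_of` examples above.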
   Text.location_of : Text -> (Matching_Mode.First | Matching_Mode.Last) -> Matcher -> Span | Nothing
   Text.location_of term="" mode=Matching_Mode.First matcher=Text_Matcher.new = case matcher of
       Text_Matcher case_sensitive -> case case_sensitive of
@@ -1274,7 +1331,7 @@ Text.location_of term="" mode=Matching_Mode.First matcher=Text_Matcher.new = cas
               Matching_Mode.First -> Text_Utils.span_of this term
               Matching_Mode.Last -> Text_Utils.last_span_of this term
           if codepoint_span.is_nothing then Nothing else
-              start = Text_Utils.utf16_index_to_grapheme_index this codepoint_span.start
+              start = Text_Utils.utf16_index_to_grapheme_index this codepoint_span.codeunit_start
               ## While the codepoint_span may have different code unit length
                  from our term, the `length` counted in grapheme clusters is
                  guaranteed to be the same.
@@ -1293,7 +1350,7 @@ Text.location_of term="" mode=Matching_Mode.First matcher=Text_Matcher.new = cas
           case Text_Utils.span_of_case_insensitive this term locale.java_locale search_for_last of
               Nothing -> Nothing
               grapheme_span ->
-                  Span (Range grapheme_span.start grapheme_span.end) this
+                  Span (Range grapheme_span.grapheme_start grapheme_span.grapheme_end) this
       Regex_Matcher _ _ _ _ _ -> case mode of
           Matching_Mode.First ->
               case matcher.compile term . match this Mode.First of
@@ -1332,16 +1389,16 @@ Text.location_of term="" mode=Matching_Mode.First matcher=Text_Matcher.new = cas
       which contains both the start and end indices, allowing to determine the
       length of the match. This is useful not only with regex matches (where a
       regular expression can have matches of various lengths) but also for case
-      insensitive matching. In case insensitive mode, a single character can
+      insensitive matching. In case-insensitive mode, a single character can
       match multiple characters, for example `ß` will match `ss` and `SS`, and
-      the ligature `ffi` will match `ffi` or `f` etc. Thus in case insensitive
+      the ligature `ffi` will match `ffi` or `f` etc. Thus in case-insensitive
       mode, the length of the match can be shorter or longer than the term that
       was being matched, so it is extremely important to not rely on the length
       of the matched term when analysing the matches as they may have different
       lengths.

       > Example
-        Match length differences in case insensitive matching.
+        Match length differences in case-insensitive matching.

             term = "strasse"
             text = "MONUMENTENSTRASSE ist eine große Straße."
@@ -1350,7 +1407,7 @@ Text.location_of term="" mode=Matching_Mode.First matcher=Text_Matcher.new = cas
             match . map .length == [7, 6]

       ! Matching Grapheme Clusters
-        In case insensitive mode, a single character can match multiple characters,
+        In case-insensitive mode, a single character can match multiple characters,
         for example `ß` will match `ss` and `SS`, and the ligature `ffi` will match
         `ffi` or `f` etc. Thus in this mode, it is sometimes possible for a term to
         match only a part of some single grapheme cluster, for example in the text
@@ -1374,7 +1431,7 @@ Text.location_of_all term="" matcher=Text_Matcher.new = case matcher of
       Text_Matcher case_sensitive -> if term.is_empty then Vector.new (this.length + 1) (ix -> Span (Range ix ix) this) else case case_sensitive of
           True ->
               codepoint_spans = Vector.Vector <| Text_Utils.span_of_all this term
-              grahpeme_ixes = Vector.Vector <| Text_Utils.utf16_indices_to_grapheme_indices this (codepoint_spans.map .start).to_array
+              grahpeme_ixes = Vector.Vector <| Text_Utils.utf16_indices_to_grapheme_indices this (codepoint_spans.map .codeunit_start).to_array
               ## While the codepoint_spans may have different code unit lengths
                  from our term, the `length` counted in grapheme clusters is
                  guaranteed to be the same.
@@ -1385,7 +1442,7 @@ Text.location_of_all term="" matcher=Text_Matcher.new = case matcher of
           Case_Insensitive locale ->
               grapheme_spans = Vector.Vector <| Text_Utils.span_of_all_case_insensitive this term locale.java_locale
               grapheme_spans.map grapheme_span->
-                  Span (Range grapheme_span.start grapheme_span.end) this
+                  Span (Range grapheme_span.grapheme_start grapheme_span.grapheme_end) this
       Regex_Matcher _ _ _ _ _ ->
           case matcher.compile term . match this Mode.All of
               Nothing -> []


@@ -39,6 +39,7 @@ import Standard.Base.Data.Text.Regex
import Standard.Base.Data.Text.Regex.Engine
import Standard.Base.Data.Text.Regex.Option as Global_Option
import Standard.Base.Data.Text.Regex.Mode
+import Standard.Base.Data.Text.Matching_Mode
import Standard.Base.Polyglot.Java as Java_Ext

from Standard.Base.Data.Text.Span as Span_Module import Utf_16_Span
@@ -533,7 +534,7 @@ type Pattern
             pattern = engine.compile "aa []
             input = "aabbaabbbbbaab"
             pattern.replace input "REPLACED"
-    replace : Text -> Text -> (Mode.First | Integer | Mode.All | Mode.Full) -> Text
+    replace : Text -> Text -> (Mode.First | Integer | Mode.All | Mode.Full | Matching_Mode.Last) -> Text
    replace input replacement mode=Mode.All =
        do_replace_mode mode start end = case mode of
            Mode.First ->
@@ -559,8 +560,26 @@ type Pattern
                internal_matcher.replaceAll replacement
            Mode.Full ->
                case this.match input mode=Mode.Full of
-                   Match _ _ _ _ -> replacement
+                   Match _ _ _ _ -> this.replace input replacement Mode.First
                    Nothing -> input
+           Matching_Mode.Last ->
+               all_matches = this.match input
+               all_matches_count = if all_matches.is_nothing then 0 else all_matches.length
+               if all_matches_count == 0 then input else
+                   internal_matcher = this.build_matcher input start end
+                   buffer = StringBuffer.new
+                   last_match_index = all_matches_count - 1
+                   go match_index =
+                       internal_matcher.find
+                       case match_index == last_match_index of
+                           True -> internal_matcher.appendReplacement buffer replacement
+                           False -> @Tail_Call go (match_index + 1)
+                   go 0
+                   internal_matcher.appendTail buffer
+                   buffer.to_text
            Mode.Bounded _ _ _ -> Panic.throw <|
                Mode_Error "Modes cannot be recursive."
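The new `Matching_Mode.Last` branch walks the matcher forward and only appends the replacement on the final match. The same effect can be sketched in Python with `re` (an illustration, not the Enso/Java code):

```python
import re

def replace_last(text, pattern, replacement):
    """Replace only the last regex match, scanning matches from the front."""
    matches = list(re.finditer(pattern, text))
    if not matches:
        return text
    last = matches[-1]
    # expand() resolves backreferences such as r"\1" in the replacement.
    return text[:last.start()] + last.expand(replacement) + text[last.end():]
```

Applied to the docstring's input, `replace_last("aabbaabbbbbaab", "aa", "REPLACED")` replaces only the final `"aa"`.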


@@ -81,22 +81,22 @@ type Text_Sub_Range
               if delimiter.is_empty then (Range 0 0) else
                   span = Text_Utils.span_of text delimiter
                   if span.is_nothing then (Range 0 (Text_Utils.char_length text)) else
-                      (Range 0 span.start)
+                      (Range 0 span.codeunit_start)
           Before_Last delimiter ->
               if delimiter.is_empty then (Range 0 (Text_Utils.char_length text)) else
                   span = Text_Utils.last_span_of text delimiter
                   if span.is_nothing then (Range 0 (Text_Utils.char_length text)) else
-                      (Range 0 span.start)
+                      (Range 0 span.codeunit_start)
           After delimiter ->
               if delimiter.is_empty then (Range 0 (Text_Utils.char_length text)) else
                   span = Text_Utils.span_of text delimiter
                   if span.is_nothing then (Range 0 0) else
-                      (Range span.end (Text_Utils.char_length text))
+                      (Range span.codeunit_end (Text_Utils.char_length text))
           After_Last delimiter ->
               if delimiter.is_empty then (Range 0 0) else
                   span = Text_Utils.last_span_of text delimiter
                   if span.is_nothing then (Range 0 0) else
-                      (Range span.end (Text_Utils.char_length text))
+                      (Range span.codeunit_end (Text_Utils.char_length text))
           While predicate ->
               indices = find_sub_range_end text _-> start-> end->
                   predicate (Text_Utils.substring text start end) . not
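The four delimiter-based ranges above reduce to find-from-front vs find-from-back plus a choice of which side to keep. A Python sketch of the same semantics on code points (the real code returns UTF-16 code-unit ranges and special-cases empty delimiters, as shown above):

```python
def before(text, delim):
    """Everything before the first occurrence; whole text if absent."""
    ix = text.find(delim)
    return text if ix == -1 else text[:ix]

def before_last(text, delim):
    """Everything before the last occurrence; whole text if absent."""
    ix = text.rfind(delim)
    return text if ix == -1 else text[:ix]

def after(text, delim):
    """Everything after the first occurrence; empty if absent."""
    ix = text.find(delim)
    return "" if ix == -1 else text[ix + len(delim):]

def after_last(text, delim):
    """Everything after the last occurrence; empty if absent."""
    ix = text.rfind(delim)
    return "" if ix == -1 else text[ix + len(delim):]
```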


@@ -12,6 +12,7 @@ import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;
import org.enso.base.text.CaseFoldedString;
+import org.enso.base.text.CaseFoldedString.Grapheme;
import org.enso.base.text.GraphemeSpan;
import org.enso.base.text.Utf16Span;
@@ -231,19 +232,6 @@ public class Text_Utils {
    return CaseFoldedString.simpleFold(string, locale);
  }

-  /**
-   * Replaces all occurrences of {@code oldSequence} within {@code str} with {@code newSequence}.
-   *
-   * @param str the string to process
-   * @param oldSequence the substring that is searched for and will be replaced
-   * @param newSequence the string that will replace occurrences of {@code oldSequence}
-   * @return {@code str} with all occurrences of {@code oldSequence} replaced with {@code
-   *     newSequence}
-   */
-  public static String replace(String str, String oldSequence, String newSequence) {
-    return str.replace(oldSequence, newSequence);
-  }
-
  /**
   * Gets the length of char array of a string
   *
@@ -306,7 +294,7 @@ public class Text_Utils {
    StringSearch search = new StringSearch(needle, haystack);
    ArrayList<Utf16Span> occurrences = new ArrayList<>();
-    long ix;
+    int ix;
    while ((ix = search.next()) != StringSearch.DONE) {
      occurrences.add(new Utf16Span(ix, ix + search.getMatchLength()));
    }
@@ -456,13 +444,21 @@ public class Text_Utils {
   * @return a minimal {@code GraphemeSpan} which contains all code units from the match
   */
  private static GraphemeSpan findExtendedSpan(CaseFoldedString string, int position, int length) {
-    int firstGrapheme = string.codeUnitToGraphemeIndex(position);
+    Grapheme firstGrapheme = string.findGrapheme(position);
    if (length == 0) {
-      return new GraphemeSpan(firstGrapheme, firstGrapheme);
+      return new GraphemeSpan(
+          firstGrapheme.index,
+          firstGrapheme.index,
+          firstGrapheme.codeunit_start,
+          firstGrapheme.codeunit_start);
    } else {
-      int lastGrapheme = string.codeUnitToGraphemeIndex(position + length - 1);
-      int endGrapheme = lastGrapheme + 1;
-      return new GraphemeSpan(firstGrapheme, endGrapheme);
+      Grapheme lastGrapheme = string.findGrapheme(position + length - 1);
+      int endGraphemeIndex = lastGrapheme.index + 1;
+      return new GraphemeSpan(
+          firstGrapheme.index,
+          endGraphemeIndex,
+          firstGrapheme.codeunit_start,
+          lastGrapheme.codeunit_end);
    }
  }
@@ -485,4 +481,30 @@ public class Text_Utils {
  public static boolean is_all_whitespace(String text) {
    return text.codePoints().allMatch(UCharacter::isUWhiteSpace);
  }
+
+  /**
+   * Replaces all provided spans within the text with {@code newSequence}.
+   *
+   * @param str the string to process
+   * @param spans the spans to replace; the spans should be sorted by their starting point in
+   *     non-decreasing order; the behaviour is undefined if these requirements are not satisfied
+   * @param newSequence the string that will replace the spans
+   * @return {@code str} with all provided spans replaced with {@code newSequence}
+   */
+  public static String replace_spans(String str, List<Utf16Span> spans, String newSequence) {
+    StringBuilder sb = new StringBuilder();
+    int current_ix = 0;
+    for (Utf16Span span : spans) {
+      if (span.codeunit_start > current_ix) {
+        sb.append(str, current_ix, span.codeunit_start);
+      }
+      sb.append(newSequence);
+      current_ix = span.codeunit_end;
+    }
+    // Add the remaining part of the string (if any).
+    sb.append(str, current_ix, str.length());
+    return sb.toString();
+  }
}


@@ -13,6 +13,20 @@ import java.util.Locale;
 * indices back in the original string.
 */
public class CaseFoldedString {
+  public static class Grapheme {
+    /** The grapheme index of the given grapheme in the string. */
+    public final int index;
+
+    /** The codeunit indices of start and end of the given grapheme in the original string. */
+    public final int codeunit_start, codeunit_end;
+
+    public Grapheme(int index, int codeunit_start, int codeunit_end) {
+      this.index = index;
+      this.codeunit_start = codeunit_start;
+      this.codeunit_end = codeunit_end;
+    }
+  }
+
  private final String foldedString;
  /**
@@ -24,33 +38,67 @@ public class CaseFoldedString {
   */
  private final int[] graphemeIndexMapping;

+  /**
+   * A mapping from code units in the transformed string to the first code-unit of the corresponding
+   * grapheme in the original string.
+   *
+   * <p>The mapping must be valid from indices from 0 to {@code foldedString.length()+1}
+   * (inclusive).
+   */
+  private final int[] codeunitStartIndexMapping;
+
+  /**
+   * A mapping from code units in the transformed string to the end code-unit of the corresponding
+   * grapheme in the original string.
+   *
+   * <p>The mapping must be valid from indices from 0 to {@code foldedString.length()+1}
+   * (inclusive).
+   */
+  private final int[] codeunitEndIndexMapping;

  /**
   * Constructs a new instance of the folded string.
   *
   * @param foldeString the string after applying the case folding transformation
   * @param graphemeIndexMapping a mapping created during the transformation which maps code units
   *     in the transformed string to their corresponding graphemes in the original string
+   * @param codeunitStartIndexMapping a mapping created during the transformation which maps code
+   *     units in the transformed string to first codeunits of corresponding graphemes in the
+   *     original string
+   * @param codeunitEndIndexMapping a mapping created during the transformation which maps code
+   *     units in the transformed string to end codeunits of corresponding graphemes in the original
+   *     string
   */
-  private CaseFoldedString(String foldeString, int[] graphemeIndexMapping) {
+  private CaseFoldedString(
+      String foldeString,
+      int[] graphemeIndexMapping,
+      int[] codeunitStartIndexMapping,
+      int[] codeunitEndIndexMapping) {
    this.foldedString = foldeString;
    this.graphemeIndexMapping = graphemeIndexMapping;
+    this.codeunitStartIndexMapping = codeunitStartIndexMapping;
+    this.codeunitEndIndexMapping = codeunitEndIndexMapping;
  }
  /**
-   * Maps a code unit in the folded string to the corresponding grapheme in the original string.
+   * Finds the grapheme corresponding to a code unit in the folded string.
   *
   * @param codeunitIndex the index of the code unit in the folded string, valid indices range from
   *     0 to {@code getFoldedString().length()+1} (inclusive), allowing to also ask for the
   *     position of the end code unit which is located right after the end of the string - which
   *     should always map to the analogous end grapheme.
-   * @return the index of the grapheme from the original string that after applying the
-   *     transformation contains the requested code unit
+   * @return the index of the first code unit of the grapheme from the original string that after
+   *     applying the transformation contains the requested code unit
   */
-  public int codeUnitToGraphemeIndex(int codeunitIndex) {
+  public Grapheme findGrapheme(int codeunitIndex) {
    if (codeunitIndex < 0 || codeunitIndex > this.foldedString.length()) {
      throw new IndexOutOfBoundsException(codeunitIndex);
    }
-    return graphemeIndexMapping[codeunitIndex];
+    return new Grapheme(
+        graphemeIndexMapping[codeunitIndex],
+        codeunitStartIndexMapping[codeunitIndex],
+        codeunitEndIndexMapping[codeunitIndex]);
  }
/** Returns the transformed string. */ /** Returns the transformed string. */
@ -74,7 +122,9 @@ public class CaseFoldedString {
breakIterator.setText(charSequence); breakIterator.setText(charSequence);
StringBuilder stringBuilder = new StringBuilder(charSequence.length()); StringBuilder stringBuilder = new StringBuilder(charSequence.length());
Fold foldAlgorithm = caseFoldAlgorithmForLocale(locale); Fold foldAlgorithm = caseFoldAlgorithmForLocale(locale);
IntArrayBuilder index_mapping = new IntArrayBuilder(charSequence.length() + 1); IntArrayBuilder grapheme_mapping = new IntArrayBuilder(charSequence.length() + 1);
IntArrayBuilder codeunit_start_mapping = new IntArrayBuilder(charSequence.length() + 1);
IntArrayBuilder codeunit_end_mapping = new IntArrayBuilder(charSequence.length() + 1);
// We rely on the fact that ICU Case Folding is _not_ context-sensitive, i.e. the mapping of // We rely on the fact that ICU Case Folding is _not_ context-sensitive, i.e. the mapping of
// each grapheme cluster is independent of surrounding ones. Regular casing is // each grapheme cluster is independent of surrounding ones. Regular casing is
@ -87,7 +137,9 @@ public class CaseFoldedString {
String foldedGrapheme = foldAlgorithm.apply(grapheme); String foldedGrapheme = foldAlgorithm.apply(grapheme);
stringBuilder.append(foldedGrapheme); stringBuilder.append(foldedGrapheme);
for (int i = 0; i < foldedGrapheme.length(); ++i) { for (int i = 0; i < foldedGrapheme.length(); ++i) {
index_mapping.add(grapheme_index); grapheme_mapping.add(grapheme_index);
codeunit_start_mapping.add(current);
codeunit_end_mapping.add(next);
} }
grapheme_index++; grapheme_index++;
@ -96,10 +148,13 @@ public class CaseFoldedString {
// The mapping should also be able to handle a {@code str.length()} query, so we add one more // The mapping should also be able to handle a {@code str.length()} query, so we add one more
// element to the mapping pointing to a non-existent grapheme after the end of the text. // element to the mapping pointing to a non-existent grapheme after the end of the text.
index_mapping.add(grapheme_index); grapheme_mapping.add(grapheme_index);
return new CaseFoldedString( return new CaseFoldedString(
stringBuilder.toString(), index_mapping.unsafeGetStorageAndInvalidateTheBuilder()); stringBuilder.toString(),
grapheme_mapping.unsafeGetStorageAndInvalidateTheBuilder(),
codeunit_start_mapping.unsafeGetStorageAndInvalidateTheBuilder(),
codeunit_end_mapping.unsafeGetStorageAndInvalidateTheBuilder());
} }
/** /**
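The parallel arrays built above let a match found in the case-folded string be mapped back to a span of the original string, even though folding can change string length (e.g. `ß` folds to `ss`). A minimal standalone sketch of that idea, not the Enso implementation: the class and variable names are illustrative, and `toUpperCase().toLowerCase()` only approximates full ICU case folding.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class FoldedIndexDemo {
  public static void main(String[] args) {
    String original = "Straße";
    BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
    it.setText(original);

    StringBuilder folded = new StringBuilder();
    List<Integer> starts = new ArrayList<>(); // original start code unit per folded code unit
    List<Integer> ends = new ArrayList<>();   // original end code unit per folded code unit

    int start = it.first();
    for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
      // Fold each grapheme independently; upper-then-lower approximates case folding.
      String foldedGrapheme =
          original.substring(start, end).toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
      folded.append(foldedGrapheme);
      for (int i = 0; i < foldedGrapheme.length(); i++) {
        starts.add(start);
        ends.add(end);
      }
    }

    // Find a case-insensitive match in the folded text and map it back to the original.
    int at = folded.indexOf("sse");
    System.out.println(folded);                                               // strasse
    System.out.println(original.substring(starts.get(at), ends.get(at + 2))); // ße
  }
}
```

The mapping is what allows `Text.replace` to splice the replacement into the *original* text rather than the folded copy.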
@@ -9,20 +9,21 @@ package org.enso.base.text;
 * <p>Represents an empty span if start and end indices are equal. Such an empty span refers to the
 * space just before the grapheme corresponding to index start.
 */
-public class GraphemeSpan {
-  public final long start, end;
+public class GraphemeSpan extends Utf16Span {
+  public final int grapheme_start, grapheme_end;

  /**
   * Constructs a span of characters (understood as extended grapheme clusters).
   *
-  * @param start index of the first extended grapheme cluster contained within the span (or
+  * @param grapheme_start index of the first extended grapheme cluster contained within the span (or
   *     location of the span if it is empty)
-  * @param end index of the first extended grapheme cluster after start that is not contained
-  *     within the span
+  * @param grapheme_end index of the first extended grapheme cluster after start that is not
+  *     contained within the span
+  * @param codeunit_start code unit index of {@code grapheme_start}
+  * @param codeunit_end code unit index of {@code grapheme_end}
   */
-  public GraphemeSpan(long start, long end) {
-    this.start = start;
-    this.end = end;
+  public GraphemeSpan(int grapheme_start, int grapheme_end, int codeunit_start, int codeunit_end) {
+    super(codeunit_start, codeunit_end);
+    this.grapheme_start = grapheme_start;
+    this.grapheme_end = grapheme_end;
  }
}
@@ -8,11 +8,11 @@ package org.enso.base.text;
 */
public class Utf16Span {
-  public final long start, end;
+  public final int codeunit_start, codeunit_end;

  /** Constructs a span of UTF-16 code units. */
-  public Utf16Span(long start, long end) {
-    this.start = start;
-    this.end = end;
+  public Utf16Span(int codeunit_start, int codeunit_end) {
+    this.codeunit_start = codeunit_start;
+    this.codeunit_end = codeunit_end;
  }
}
@@ -376,7 +376,7 @@ spec prefix table_builder supports_case_sensitive_columns pending=Nothing =
        expect_column_names ["bar", "foo_001", "foo_1", "Foo_2", "foo_3", "foo_21", "foo_100"] <| table.sort_columns (Sort_Method natural_order=True case_sensitive=Case_Insensitive.new)
        expect_column_names ["foo_3", "foo_21", "foo_100", "foo_1", "foo_001", "bar", "Foo_2"] <| table.sort_columns (Sort_Method order=Sort_Order.Descending)

-    Test.specify "should correctly handle case insensitive sorting" <|
+    Test.specify "should correctly handle case-insensitive sorting" <|
        expect_column_names ["bar", "foo_001", "foo_1", "foo_100", "Foo_2", "foo_21", "foo_3"] <| table.sort_columns (Sort_Method case_sensitive=Case_Insensitive.new)

    Test.specify "should correctly handle natural order sorting" <|
@@ -412,7 +412,7 @@ spec prefix table_builder supports_case_sensitive_columns pending=Nothing =
        expect_column_names ["FirstColumn", "beta", "gamma", "Another"] <|
            table.rename_columns (Column_Mapping.By_Name map (Text_Matcher True))

-    Test.specify "should work by name case insensitively" <|
+    Test.specify "should work by name case-insensitively" <|
        map = Map.from_vector [["ALPHA", "FirstColumn"], ["DELTA", "Another"]]
        expect_column_names ["FirstColumn", "beta", "gamma", "Another"] <|
            table.rename_columns (Column_Mapping.By_Name map (Text_Matcher Case_Insensitive.new))
@@ -5,6 +5,7 @@ import Standard.Test
import Standard.Base.Data.Text.Regex
import Standard.Base.Data.Text.Regex.Engine.Default as Default_Engine
import Standard.Base.Data.Text.Regex.Mode
+import Standard.Base.Data.Text.Matching_Mode
import Standard.Base.Data.Text.Regex.Option as Global_Option
from Standard.Base.Data.Text.Span as Span_Module import Utf_16_Span
@@ -399,6 +400,11 @@ spec =
            match = pattern.replace input "REPLACED" mode=Mode.Full
            match . should_equal "REPLACED"

+        Test.specify "should correctly replace entire input in Full mode even if partial matches are possible" <|
+            pattern = engine.compile "(aa)+" []
+            pattern.replace "aaa" "REPLACED" mode=Mode.Full . should_equal "aaa"
+            pattern.replace "aaaa" "REPLACED" mode=Mode.Full . should_equal "REPLACED"
+
        Test.specify "should return the input for a full replace if the pattern doesn't match the entire input" <|
            pattern = engine.compile "(..)" []
            input = "aa ab"
@@ -417,6 +423,35 @@ spec =
            match = pattern.replace input "REPLACED" mode=Mode.All
            match . should_equal "REPLACEDREPLACEDb"

+        Test.specify "should handle capture groups in replacement" <|
+            pattern = engine.compile "(?<capture>[a-z]+)" []
+            pattern.replace "foo bar, baz" "[$1]" mode=Mode.All . should_equal "[foo] [bar], [baz]"
+            pattern.replace "foo bar, baz" "[$1]" mode=0 . should_equal "foo bar, baz"
+            pattern.replace "foo bar, baz" "[$1]" mode=1 . should_equal "[foo] bar, baz"
+            pattern.replace "foo bar, baz" "[$1]" mode=2 . should_equal "[foo] [bar], baz"
+            pattern.replace "foo bar, baz" "[$1]" mode=3 . should_equal "[foo] [bar], [baz]"
+            pattern.replace "foo bar, baz" "[$1]" mode=4 . should_equal "[foo] [bar], [baz]"
+            pattern.replace "foo bar, baz" "[$1]" mode=Mode.First . should_equal "[foo] bar, baz"
+            pattern.replace "foo bar, baz" "[$1]" mode=Matching_Mode.Last . should_equal "foo bar, [baz]"
+            pattern.replace "foo bar, baz" "[${capture}]" mode=Mode.All . should_equal "[foo] [bar], [baz]"
+            pattern.replace "foo bar, baz" "[${capture}]" mode=0 . should_equal "foo bar, baz"
+            pattern.replace "foo bar, baz" "[${capture}]" mode=1 . should_equal "[foo] bar, baz"
+            pattern.replace "foo bar, baz" "[${capture}]" mode=2 . should_equal "[foo] [bar], baz"
+            pattern.replace "foo bar, baz" "[${capture}]" mode=3 . should_equal "[foo] [bar], [baz]"
+            pattern.replace "foo bar, baz" "[${capture}]" mode=4 . should_equal "[foo] [bar], [baz]"
+            pattern.replace "foo bar, baz" "[${capture}]" mode=Mode.First . should_equal "[foo] bar, baz"
+            pattern.replace "foo bar, baz" "[${capture}]" mode=Matching_Mode.Last . should_equal "foo bar, [baz]"
+
+        Test.specify "should handle capture groups in replacement in Full mode" <|
+            pattern = engine.compile "([a-z]+)" []
+            pattern.replace "foo bar, baz" "[$1]" mode=Mode.Full . should_equal "foo bar, baz"
+            pattern.replace "foo" "[$1]" mode=Mode.Full . should_equal "[foo]"
+            pattern_2 = engine.compile '<a href="(?<addr>.*?)">(?<name>.*?)</a>' []
+            pattern_2.replace '<a href="url">content</a>' "$2 <- $1" mode=Mode.Full . should_equal "content <- url"
+            pattern_2.replace '<a href="url">content</a>' "${name} <- ${addr}" mode=Mode.Full . should_equal "content <- url"
+
    Test.group "Match.group" <|
        engine = Default_Engine.new
        pattern = engine.compile "(.. .. )(?<letters>.+)()??(?<empty>)??" []
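The `$1` / `${name}` substitution syntax exercised in the tests above follows the java.util.regex conventions; assuming the `Default_Engine` delegates to Java's regex engine, the same behaviour can be reproduced directly with `Matcher.replaceAll` (class name is illustrative):

```java
import java.util.regex.Pattern;

public class CaptureGroupReplaceDemo {
  public static void main(String[] args) {
    // Named groups can be referenced either by number ($1, $2) or by name (${addr}, ${name}).
    Pattern p = Pattern.compile("<a href=\"(?<addr>.*?)\">(?<name>.*?)</a>");
    String input = "<a href=\"url\">content</a>";
    System.out.println(p.matcher(input).replaceAll("$2 <- $1"));           // content <- url
    System.out.println(p.matcher(input).replaceAll("${name} <- ${addr}")); // content <- url
  }
}
```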
@@ -52,10 +52,10 @@ spec =
        codeunits = Vector.new folded.getFoldedString.utf_16.length+1 ix->ix
        grapheme_ixes = codeunits.map ix->
-           folded.codeUnitToGraphemeIndex ix
+           folded.findGrapheme ix . index
        grapheme_ixes . should_equal [0, 0, 1, 2, 3, 3, 4, 4, 4, 5, 6]

-       Test.expect_panic_with (folded.codeUnitToGraphemeIndex -1) Polyglot_Error
-       Test.expect_panic_with (folded.codeUnitToGraphemeIndex folded.getFoldedString.utf_16.length+1) Polyglot_Error
+       Test.expect_panic_with (folded.findGrapheme -1) Polyglot_Error
+       Test.expect_panic_with (folded.findGrapheme folded.getFoldedString.utf_16.length+1) Polyglot_Error

main = Test.Suite.run_main here.spec
@@ -942,7 +942,7 @@ spec =
        abc.location_of "" mode=Matching_Mode.Last . should_equal (Span (Range 3 3) abc)
        abc.location_of_all "" . should_equal [Span (Range 0 0) abc, Span (Range 1 1) abc, Span (Range 2 2) abc, Span (Range 3 3) abc]

-    Test.specify "should allow case insensitive matching in location_of" <|
+    Test.specify "should allow case-insensitive matching in location_of" <|
        hello = "Hello WORLD!"
        case_insensitive = Text_Matcher Case_Insensitive.new
        hello.location_of "world" . should_equal Nothing
@@ -1022,6 +1022,13 @@ spec =
        abc.location_of_all "" matcher=regex . should_equal [Span (Range 0 0) abc, Span (Range 0 0) abc, Span (Range 1 1) abc, Span (Range 2 2) abc, Span (Range 3 3) abc]
        abc.location_of "" matcher=regex mode=Matching_Mode.Last . should_equal (Span (Range 3 3) abc)

+    Test.specify "should handle overlapping matches as shown in the examples" <|
+        "aaa".location_of "aa" mode=Matching_Mode.Last matcher=Text_Matcher . should_equal (Span (Range 1 3) "aaa")
+        "aaa".location_of "aa" mode=Matching_Mode.Last matcher=Regex_Matcher . should_equal (Span (Range 0 2) "aaa")
+        "aaa aaa".location_of "aa" mode=Matching_Mode.Last matcher=Text_Matcher . should_equal (Span (Range 5 7) "aaa aaa")
+        "aaa aaa".location_of "aa" mode=Matching_Mode.Last matcher=Regex_Matcher . should_equal (Span (Range 4 6) "aaa aaa")
+
    Test.group "Regex matching" <|
        Test.specify "should be possible on text" <|
            match = "My Text: Goes Here".match "^My Text: (.+)$" mode=Regex_Mode.First
@@ -1179,35 +1186,144 @@ spec =
        splits.at 1 . should_equal "c"
        splits.at 2 . should_equal "e"

-    Test.group "Regex replacement" <|
-        Test.specify "should be possible on text" <|
-            result = "ababab".replace "b" "a"
-            result . should_equal "aaaaaa"
+    Test.group "Text.replace" <|
+        Test.specify "should work as in examples" <|
+            'aaa'.replace 'aa' 'b' . should_equal 'ba'
+            "Hello World!".replace "[lo]" "#" matcher=Regex_Matcher . should_equal "He### W#r#d!"
+            "Hello World!".replace "l" "#" mode=Matching_Mode.First . should_equal "He#lo World!"
+            '"abc" foo "bar" baz'.replace '"(.*?)"' '($1)' matcher=Regex_Matcher . should_equal '(abc) foo (bar) baz'
+            'ß'.replace 'S' 'A' matcher=(Text_Matcher Case_Insensitive) . should_equal 'AA'
+            'affib'.replace 'i' 'X' matcher=(Text_Matcher Case_Insensitive) . should_equal 'aXb'
+
+        Test.specify "should correctly handle empty-string edge cases" <|
+            [Mode.All, Matching_Mode.First, Matching_Mode.Last] . each mode->
+                'aaa'.replace '' 'foo' mode=mode . should_equal 'aaa'
+                ''.replace '' '' mode=mode . should_equal ''
+                'a'.replace 'a' '' mode=mode . should_equal ''
+                ''.replace 'a' 'b' mode=mode . should_equal ''
+            'aba' . replace 'a' '' Matching_Mode.First . should_equal 'ba'
+            'aba' . replace 'a' '' Matching_Mode.Last . should_equal 'ab'
+            'aba' . replace 'a' '' . should_equal 'b'
+            'aba' . replace 'c' '' . should_equal 'aba'
+
+        Test.specify "should correctly handle first, all and last matching with overlapping occurrences" <|
+            "aaa aaa".replace "aa" "c" . should_equal "ca ca"
+            "aaa aaa".replace "aa" "c" mode=Matching_Mode.First . should_equal "ca aaa"
+            "aaa aaa".replace "aa" "c" mode=Matching_Mode.Last . should_equal "aaa ac"
+
+        Test.specify "should correctly handle case-insensitive matches" <|
+            'AaąĄ' . replace "A" "-" matcher=(Text_Matcher Case_Insensitive) . should_equal '--ąĄ'
+            'AaąĄ' . replace "A" "-" . should_equal '-aąĄ'
+            'HeLlO wOrLd' . replace 'hElLo' 'Hey,' matcher=(Text_Matcher True) . should_equal 'HeLlO wOrLd'
+            'HeLlO wOrLd' . replace 'hElLo' 'Hey,' matcher=(Text_Matcher Case_Insensitive) . should_equal 'Hey, wOrLd'
+            "Iiİı" . replace "i" "-" . should_equal "I-İı"
+            "Iiİı" . replace "I" "-" . should_equal "-iİı"
+            "Iiİı" . replace "İ" "-" . should_equal "Ii-ı"
+            "Iiİı" . replace "ı" "-" . should_equal "Iiİ-"
+            "Iiİı" . replace "i" "-" matcher=(Text_Matcher Case_Insensitive) . should_equal "--İı"
+            "Iiİı" . replace "I" "-" matcher=(Text_Matcher Case_Insensitive) . should_equal "--İı"
+            "Iiİı" . replace "İ" "-" matcher=(Text_Matcher Case_Insensitive) . should_equal "Ii-ı"
+            "Iiİı" . replace "ı" "-" matcher=(Text_Matcher Case_Insensitive) . should_equal "Iiİ-"
+            tr_insensitive = Text_Matcher (Case_Insensitive (Locale.new "tr"))
+            "Iiİı" . replace "i" "-" matcher=tr_insensitive . should_equal "I--ı"
+            "Iiİı" . replace "I" "-" matcher=tr_insensitive . should_equal "-iİ-"
+            "Iiİı" . replace "İ" "-" matcher=tr_insensitive . should_equal "I--ı"
+            "Iiİı" . replace "ı" "-" matcher=tr_insensitive . should_equal "-iİ-"
+
+        Test.specify "should correctly handle Unicode edge cases" <|
+            'sśs\u{301}' . replace 's' 'O' . should_equal 'Ośs\u{301}'
+            'sśs\u{301}' . replace 's' 'O' Matching_Mode.Last . should_equal 'Ośs\u{301}'
+            'śs\u{301}s' . replace 's' 'O' Matching_Mode.First . should_equal 'śs\u{301}O'
+            'sśs\u{301}' . replace 'ś' 'O' . should_equal 'sOO'
+            'sśs\u{301}' . replace 's\u{301}' 'O' . should_equal 'sOO'
+            'SŚS\u{301}' . replace 's' 'O' . should_equal 'SŚS\u{301}'
+            'SŚS\u{301}' . replace 's' 'O' Matching_Mode.Last . should_equal 'SŚS\u{301}'
+            'ŚS\u{301}S' . replace 's' 'O' Matching_Mode.First . should_equal 'ŚS\u{301}S'
+            'SŚS\u{301}' . replace 'ś' 'O' . should_equal 'SŚS\u{301}'
+            'SŚS\u{301}' . replace 's\u{301}' 'O' . should_equal 'SŚS\u{301}'
+            'SŚS\u{301}' . replace 's' 'O' matcher=(Text_Matcher Case_Insensitive) . should_equal 'OŚS\u{301}'
+            'SŚS\u{301}' . replace 's' 'O' Matching_Mode.Last matcher=(Text_Matcher Case_Insensitive) . should_equal 'OŚS\u{301}'
+            'ŚS\u{301}S' . replace 's' 'O' Matching_Mode.First matcher=(Text_Matcher Case_Insensitive) . should_equal 'ŚS\u{301}O'
+            'SŚS\u{301}' . replace 'ś' 'O' matcher=(Text_Matcher Case_Insensitive) . should_equal 'SOO'
+            'SŚS\u{301}' . replace 's\u{301}' 'O' matcher=(Text_Matcher Case_Insensitive) . should_equal 'SOO'
+            '✨🚀🚧😍😃😍😎😙😉☺' . replace '🚧😍' '|-|:)' . should_equal '✨🚀|-|:)😃😍😎😙😉☺'
+            'Rocket Science' . replace 'Rocket' '🚀' . should_equal '🚀 Science'

-        Test.specify "should be possible on unicode text" <|
            "Korean: 건반".replace "건반" "keyboard" . should_equal "Korean: keyboard"

-        Test.specify "should be possible in ascii mode" <|
-            result = "İiİ".replace "\w" "a" match_ascii=True
-            result . should_equal "İaİ"
+        Test.specify "will approximate ligature matches" <|
+            # TODO do we want to improve this? highly non-trivial for very rare edge cases
+            ## Currently we lack 'resolution' to extract a partial match from
+               the ligature to keep it, probably would need some special
+               mapping.
+            'ffiffi'.replace 'ff' 'aa' matcher=(Text_Matcher Case_Insensitive) . should_equal 'aaaa'
+            'ffiffi'.replace 'ff' 'aa' mode=Matching_Mode.First matcher=(Text_Matcher Case_Insensitive) . should_equal 'aaffi'
+            'ffiffi'.replace 'ff' 'aa' mode=Matching_Mode.Last matcher=(Text_Matcher Case_Insensitive) . should_equal 'ffiaa'
+            'affiffib'.replace 'IF' 'X' matcher=(Text_Matcher Case_Insensitive) . should_equal 'aXb'
+            'aiffiffz' . replace 'if' '-' matcher=(Text_Matcher Case_Insensitive) . should_equal 'a--fz'
+            'AFFIB'.replace 'ffi' '-' matcher=(Text_Matcher Case_Insensitive) . should_equal 'A-B'
-        Test.specify "should be possible in case-insensitive mode" <|
-            result = "abaBa".replace "b" "a" case_insensitive=True
-            result . should_equal "aaaaa"
+            'ß'.replace 'SS' 'A' matcher=(Text_Matcher Case_Insensitive) . should_equal 'A'
+            'ß'.replace 'S' 'A' matcher=(Text_Matcher Case_Insensitive) . should_equal 'AA'
+            'ß'.replace 'S' 'A' mode=Matching_Mode.First matcher=(Text_Matcher Case_Insensitive) . should_equal 'A'
+            'ß'.replace 'S' 'A' mode=Matching_Mode.Last matcher=(Text_Matcher Case_Insensitive) . should_equal 'A'
+            'STRASSE'.replace 'ß' '-' matcher=(Text_Matcher Case_Insensitive) . should_equal 'STRA-E'

-        Test.specify "should be possible in dot_matches_newline mode" <|
-            result = 'ab\na'.replace "b." "a" dot_matches_newline=True
-            result . should_equal "aaa"
+        Test.specify "should perform simple replacement in Regex mode" <|
+            "ababab".replace "b" "a" matcher=Regex_Matcher . should_equal "aaaaaa"
+            "ababab".replace "b" "a" mode=Matching_Mode.First matcher=Regex_Matcher . should_equal "aaabab"
+            "ababab".replace "b" "a" mode=Matching_Mode.Last matcher=Regex_Matcher . should_equal "ababaa"
+            "aaaa".replace "aa" "c" matcher=Regex_Matcher . should_equal "cc"
+            "aaaa".replace "aa" "c" mode=Matching_Mode.First matcher=Regex_Matcher . should_equal "caa"
+            "aaaa".replace "aa" "c" mode=Matching_Mode.Last matcher=Regex_Matcher . should_equal "aac"
+            "aaa".replace "aa" "c" matcher=Regex_Matcher . should_equal "ca"
+            "aaa".replace "aa" "c" mode=Matching_Mode.First matcher=Regex_Matcher . should_equal "ca"
+            "aaa".replace "aa" "c" mode=Matching_Mode.Last matcher=Text_Matcher . should_equal "ac"
+            "aaa".replace "aa" "c" mode=Matching_Mode.Last matcher=Regex_Matcher . should_equal "ca"
+            "aaa aaa".replace "aa" "c" matcher=Text_Matcher . should_equal "ca ca"
+            "aaa aaa".replace "aa" "c" mode=Matching_Mode.First matcher=Text_Matcher . should_equal "ca aaa"
+            "aaa aaa".replace "aa" "c" mode=Matching_Mode.Last matcher=Text_Matcher . should_equal "aaa ac"
+            "aaa aaa".replace "aa" "c" matcher=Regex_Matcher . should_equal "ca ca"
+            "aaa aaa".replace "aa" "c" mode=Matching_Mode.First matcher=Regex_Matcher . should_equal "ca aaa"
+            "aaa aaa".replace "aa" "c" mode=Matching_Mode.Last matcher=Regex_Matcher . should_equal "aaa ca"
+
+        Test.specify "in Regex mode should work with Unicode" <|
+            "Korean: 건반".replace "건반" "keyboard" matcher=Regex_Matcher . should_equal "Korean: keyboard"
+            'sśs\u{301}'.replace 'ś' '-' matcher=Regex_Matcher . should_equal 's--'
+            'sśs\u{301}'.replace 's\u{301}' '-' matcher=Regex_Matcher . should_equal 's--'
+
+        Test.specify "in Regex mode should support various Regex options" <|
+            r1 = "İiİ".replace "\w" "a" matcher=(Regex_Matcher match_ascii=True)
+            r1 . should_equal "İaİ"
+            r2 = "abaBa".replace "b" "a" matcher=(Regex_Matcher case_sensitive=Case_Insensitive)
+            r2 . should_equal "aaaaa"
+            r3 = 'ab\na'.replace "b." "a" matcher=(Regex_Matcher dot_matches_newline=True)
+            r3 . should_equal "aaa"
-        Test.specify "should be possible in multiline mode" <|
            text = """
                Foo
                bar
-            result = text.replace '\n' "" multiline=True
-            result . should_equal "Foobar"
+            r4 = text.replace '\n' "" matcher=(Regex_Matcher multiline=True)
+            r4 . should_equal "Foobar"
-        Test.specify "should be possible in comments mode" <|
-            result = "ababd".replace "b\w # Replacing a `b` followed by any word character" "a" comments=True
-            result . should_equal "aaa"
+            r5 = "ababd".replace "b\w # Replacing a `b` followed by any word character" "a" matcher=(Regex_Matcher comments=True)
+            r5 . should_equal "aaa"
+
+        Test.specify "in Regex mode should allow referring to capture groups in substitutions" <|
+            '<a href="url">content</a>'.replace '<a href="(.*?)">(.*?)</a>' '$2 is at $1' matcher=Regex_Matcher . should_equal 'content is at url'
+            '<a href="url">content</a>'.replace '<a href="(?<address>.*?)">(?<text>.*?)</a>' '${text} is at ${address}' matcher=Regex_Matcher . should_equal 'content is at url'

main = Test.Suite.run_main here.spec
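The overlap semantics exercised by the tests above — plain text matching consumes non-overlapping occurrences left-to-right, while `Matching_Mode.Last` replaces only the right-most occurrence — can be sketched in plain Java (class and helper names are illustrative, not part of the Enso codebase):

```java
public class OverlapDemo {
  // Replace every non-overlapping occurrence, scanning left-to-right.
  static String replaceAllLeftToRight(String s, String term, String repl) {
    StringBuilder out = new StringBuilder();
    int pos = 0;
    while (true) {
      int at = s.indexOf(term, pos);
      if (at < 0) break;
      out.append(s, pos, at).append(repl);
      pos = at + term.length(); // skip past the match: no overlapping hits
    }
    return out.append(s.substring(pos)).toString();
  }

  // Replace only the right-most occurrence.
  static String replaceLast(String s, String term, String repl) {
    int at = s.lastIndexOf(term);
    if (at < 0) return s;
    return s.substring(0, at) + repl + s.substring(at + term.length());
  }

  public static void main(String[] args) {
    System.out.println(replaceAllLeftToRight("aaa aaa", "aa", "c")); // ca ca
    System.out.println(replaceLast("aaa aaa", "aa", "c"));           // aaa ac
  }
}
```

This matches the `Text_Matcher` expectations in the tests; the regex-based `Last` results differ because the regex engine enumerates matches from the left and then takes the final one.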
@@ -28,7 +28,7 @@ spec =
        table = Table.from_rows header [row_1]
        expect table '{"df_color":["red"],"df_label":["name"],"df_latitude":[11],"df_longitude":[10],"df_radius":[195]}'

-    Test.specify "is case insensitive" <|
+    Test.specify "is case-insensitive" <|
        header = ['latitude' , 'LONGITUDE' , 'LaBeL']
        row_1 = [11 , 10 , 09 ]
        row_2 = [21 , 20 , 19 ]
@@ -46,7 +46,7 @@ spec =
        table = Table.from_rows header [row_1, row_2]
        expect table 'value' [10,20]

-    Test.specify "is case insensitive" <|
+    Test.specify "is case-insensitive" <|
        header = ['α', 'Value']
        row_1 = [11 , 10 ]
        row_2 = [21 , 20 ]
@@ -49,7 +49,7 @@ spec =
        table = Table.from_rows header [row_1]
        expect table (labels 'x' 'y') '[{"color":"ff0000","label":"label","shape":"square","size":50,"x":11,"y":10}]'

-    Test.specify "is case insensitive" <|
+    Test.specify "is case-insensitive" <|
        header = ['X' , 'Y' , 'Size' , 'Shape' , 'Label' , 'Color' ]
        row_1 = [11 , 10 , 50 , 'square' , 'label' , 'ff0000']
        table = Table.from_rows header [row_1]