From 8c529d3cff9671de0d92ec6587b8b0c6fb229614 Mon Sep 17 00:00:00 2001
From: Maxime Coste
Date: Fri, 13 Oct 2017 13:14:31 +0800
Subject: [PATCH] Regex: add a regex.asciidoc documentation page describing the syntax

---
 doc/manpages/regex.asciidoc | 178 ++++++++++++++++++++++++++++++++++++
 1 file changed, 178 insertions(+)
 create mode 100644 doc/manpages/regex.asciidoc

diff --git a/doc/manpages/regex.asciidoc b/doc/manpages/regex.asciidoc
new file mode 100644
index 000000000..db8418768
--- /dev/null
+++ b/doc/manpages/regex.asciidoc
@@ -0,0 +1,178 @@
+kakoune(k)
+==========
+
+NAME
+----
+regex - a
+
+Regex Syntax
+------------
+
+Kakoune regex syntax is based on the ECMAScript syntax, as defined by the
+ECMA-262 standard.
+
+Kakoune regexes always run on Unicode codepoint sequences, not on bytes.
+
+Literals
+--------
+
+Every character except the syntax characters `\^$.*+?[]{}|()` matches
+itself. Syntax characters can be escaped with a backslash, so `\$` will
+match a literal `$` and `\\` will match a literal `\`.
+
+Some additional literals are available as escape sequences:
+
+* `\f` matches the form feed character.
+* `\n` matches the line feed character.
+* `\r` matches the carriage return character.
+* `\t` matches the tabulation character.
+* `\v` matches the vertical tabulation character.
+
+Character classes
+-----------------
+
+The `[` character introduces a character class, which can match multiple
+characters.
+
+A character class contains a list of literals, character ranges,
+and character class escapes surrounded by `[` and `]`.
+
+If the first character inside a character class is `^`, then the character
+class is negated, meaning that it matches every character not specified
+in the character class.
+
+Literals match themselves, including syntax characters, so `^`
+does not need to be escaped in a character class. `[*+]` matches both
+the `*` character and the `+` character.
+Literal escape sequences are supported, so `[\n\r]` matches both the line
+feed and carriage return characters.
+
+The `]` character needs to be escaped for it to match a literal `]`
+instead of closing the character class.
+
+Character ranges are written as `<start character>-<end character>`, so
+`[A-Z]` matches all upper case basic letters. `[A-Z0-9]` will match all
+upper case basic letters and all basic digits.
+
+A `-` character in a character class that does not specify a
+range is treated as a literal `-`, so `[A-Z-+]` matches all upper case
+characters, the `-` character, and the `+` character.
+
+Supported character class escapes are:
+
+* `\d` which matches all digits.
+* `\w` which matches all word characters.
+* `\s` which matches all whitespace characters.
+* `\h` which matches all horizontal whitespace characters.
+
+Using an upper case letter instead of a lower case one will negate
+the character class, meaning for example that `\D` will match every
+non-digit character.
+
+Character class escapes can be used outside of a character class: `\d`
+is equivalent to `[\d]`.
+
+Any character
+-------------
+
+`.` matches any character, including new lines.
+
+Groups
+------
+
+Regex atoms can be grouped using `(` and `)` or `(?:` and `)`. If `(` is
+used, the group will be a capturing group, which means the positions in
+the subject string that matched between `(` and `)` will be recorded.
+
+Capture groups are numbered starting at 1 (0 is a special capture group
+for the whole sequence that matched). They are numbered in the order of
+appearance of their `(` in the regex.
+
+`(?:` introduces a non-capturing group, which will not record the
+matched positions.
+
+Alternations
+------------
+
+`|` introduces an alternation, which will either match its left hand side,
+or its right hand side (preferring the left hand side).
+
+For example, `foo|bar` matches either `foo` or `bar`; `foo(bar|baz|qux)`
+matches `foo` followed by either `bar`, `baz` or `qux`.
+
+Quantifier
+----------
+
+Literals, character classes, `.` (any character) and groups can be followed
+by a quantifier, which specifies the number of times they can match.
+
+* `?` matches zero or one times.
+* `*` matches zero or more times.
+* `+` matches one or more times.
+* `{n}` matches exactly n times.
+* `{n,}` matches n or more times.
+* `{n,m}` matches n to m times.
+* `{,m}` matches zero to m times.
+
+By default, quantifiers are *greedy*, which means they will prefer to
+match more characters if possible. Suffixing a quantifier with `?` will
+make it non-greedy, meaning it will prefer to match fewer characters.
+
+Zero width assertions
+---------------------
+
+Assertions do not consume any character, but will prevent the regex
+from matching if they are not fulfilled.
+
+* `^` matches at the start of a line, that is just after a new line
+  character, or at the beginning of the subject (unless it is specified
+  that the subject beginning is not a start of line).
+* `$` matches at the end of a line, that is just before a new line, or
+  at the end of the subject (unless it is specified that the subject end
+  is not an end of line).
+* `\b` matches at a word boundary, when one of the previous character
+  and current character is a word character, and the other is not.
+* `\B` matches at a non word boundary, when the previous character
+  and the current character are both word characters, or both are not.
+* `\A` matches at the beginning of the subject string.
+* `\z` matches at the end of the subject string.
+* `\K` always matches, and resets the start position of the matching
+  text to the current position.
+
+More complex assertions can be expressed with lookarounds:
+
+* `(?=...)` is a lookahead, it will match if its content matches the text
+  following the current position.
+* `(?!...)` is a negative lookahead, it will match if its content does
+  not match the text following the current position.
+* `(?<=...)` is a lookbehind, it will match if its content matches
+  the text preceding the current position.
+* `(?