From fa747062dcee62f3e176f2b1fd6e7a513fd250c2 Mon Sep 17 00:00:00 2001
From: HjalmarrSv <58831450+HjalmarrSv@users.noreply.github.com>
Date: Tue, 17 Dec 2019 20:40:51 +0100
Subject: [PATCH] Modernized

I wanted to properly parse the links on
https://dumps.wikimedia.org/mirrors.html when the page is copied as text.
The proposed changes do the job. Basically, I replaced the + at the end
of line 5 with *(\/)?

The pipe symbol could lead to crashes, which is why I split line 5 into
three lines. After reading various posts, I suggest not using the
pipe (|).
---
 scripts/tokenizer/basic-protected-patterns | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/scripts/tokenizer/basic-protected-patterns b/scripts/tokenizer/basic-protected-patterns
index 57a0dd485..5ccb071d6 100644
--- a/scripts/tokenizer/basic-protected-patterns
+++ b/scripts/tokenizer/basic-protected-patterns
@@ -2,4 +2,6 @@
 <\S+( [a-zA-Z0-9]+\=\"?[^\"]\")+ ?\/?>
 <\S+( [a-zA-Z0-9]+\=\'?[^\']\')+ ?\/?>
 [\w\-\_\.]+\@([\w\-\_]+\.)+[a-zA-Z]{2,}
-(http[s]?|ftp):\/\/[^:\/\s]+(\/\w+)*\/[\w\-\.]+
+http[s]?:\/\/[^:\/\s]+(\/\w+)*\/[\w\-\.]*(\/)?
+ftp[s]?:\/\/[^:\/\s]+(\/\w+)*\/[\w\-\.]*(\/)?
+rsync:\/\/[^:\/\s]+(\/\w+)*\/[\w\-\.]*(\/)?
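
The effect of replacing the trailing + with *(\/)? can be checked with a small sketch. It assumes Python's re module as a stand-in for the regex engine the tokenizer actually uses, and the sample URLs and the helper first_match are hypothetical, chosen only for illustration. It compares the removed pattern with the three added ones on URLs that end in a slash.

```python
import re

# Pattern removed by the patch (line 5 of basic-protected-patterns).
OLD_PATTERNS = [r"(http[s]?|ftp):\/\/[^:\/\s]+(\/\w+)*\/[\w\-\.]+"]

# Patterns added by the patch, one per scheme.
NEW_PATTERNS = [
    r"http[s]?:\/\/[^:\/\s]+(\/\w+)*\/[\w\-\.]*(\/)?",
    r"ftp[s]?:\/\/[^:\/\s]+(\/\w+)*\/[\w\-\.]*(\/)?",
    r"rsync:\/\/[^:\/\s]+(\/\w+)*\/[\w\-\.]*(\/)?",
]


def first_match(patterns, text):
    """Return the text matched by the first pattern that hits, or None."""
    for pattern in patterns:
        found = re.search(pattern, text)
        if found:
            return found.group(0)
    return None


# Hypothetical sample URLs, not taken from the mirrors page.
SAMPLES = [
    "https://example.org/mirror/dumps/",
    "https://example.org/",
    "rsync://example.org/dumps/",
    "http://example.org/a/file.txt",
]

if __name__ == "__main__":
    for url in SAMPLES:
        print(url)
        print("  old:", first_match(OLD_PATTERNS, url))
        print("  new:", first_match(NEW_PATTERNS, url))
```

Under these assumptions, the old pattern drops the trailing slash of a directory URL, misses a host-only URL entirely, and has no rsync alternative, while the new patterns match each sample in full.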