commit d2ddec4cddecd50907c2908cb70fab4cec7e2294 Author: Lettier Date: Mon Jul 22 20:30:28 2019 -0400 Adds initial files. diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..1a2c3a8 --- /dev/null +++ b/.gitignore @@ -0,0 +1,9 @@ +*stack-work* +*cabal*sandbox* +*dist* +*tmp* +*blend1 +*blend2 +*blend3 +*blend4 +*blend5 diff --git a/LICENSE b/LICENSE new file mode 100644 index 0000000..b9d6d44 --- /dev/null +++ b/LICENSE @@ -0,0 +1,29 @@ +BSD 3-Clause License + +Copyright (c) 2019, David Lettier +All rights reserved. + +Redistribution and use in source and binary forms, with or without +modification, are permitted provided that the following conditions are met: + +* Redistributions of source code must retain the above copyright notice, this + list of conditions and the following disclaimer. + +* Redistributions in binary form must reproduce the above copyright notice, + this list of conditions and the following disclaimer in the documentation + and/or other materials provided with the distribution. + +* Neither the name of the copyright holder nor the names of its + contributors may be used to endorse or promote products derived from + this software without specific prior written permission. + +THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" +AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE +DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE +FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR +SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER +CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, +OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. diff --git a/README.md b/README.md new file mode 100644 index 0000000..f491870 --- /dev/null +++ b/README.md @@ -0,0 +1,1577 @@ + +

+Parsing With Haskell Parser Combinators +
+ +

+
+ +# Parsing With Haskell Parser Combinators + +Need to parse something? +Never heard of a "parser combinator"? +Looking to learn some Haskell? +Awesome! +Below is everything you'll need to get up and parsing with Haskell parser combinators. +From here you can try tackling esoteric data serialization formats, +compiler front ends, +domain specific languages—you name it! + +- [Building The Demos](#building-the-demos) +- [Running The Demos](#running-the-demos) +- [Parser Combinator](#parser-combinator) +- [Version Number](#version-number) +- [SRT](#srt) +- [Exercises](#exercises) + +## Building The Demos + +Included with this guide are two demo programs. + +`version-number-parser` parses a file for a version number. +`srt-file-parser` parses a file for SRT subtitles. +Feel free to try them out with the files found in `test-input/`. + +### Stack + +Download the Haskell tool [Stack](https://docs.haskellstack.org/en/stable/README/) +and then run the following. + +```bash +git clone https://github.com/lettier/parsing-with-haskell-parser-combinators +cd parsing-with-haskell-parser-combinators +stack build +``` + +### Cabal + +If using Cabal, you can run the following. + +```bash +git clone https://github.com/lettier/parsing-with-haskell-parser-combinators +cd parsing-with-haskell-parser-combinators +cabal sandbox init +cabal --require-sandbox build +cabal --require-sandbox install +``` + +## Running The Demos + +After building the two demo programs, you can run them like so. + +### Stack + +To try the version number parser, run the following. + +```bash +cd parsing-with-haskell-parser-combinators +stack exec -- version-number-parser +What is the version output file path? +test-input/gifcurry-version-output.txt +``` + +To try the SRT file parser, run the following. + +```bash +cd parsing-with-haskell-parser-combinators +stack exec -- srt-file-parser +What is the SRT file path? +test-input/subtitles.srt +``` + +### Cabal + +To try the version number parser, run the following. + +```bash +cd parsing-with-haskell-parser-combinators +.cabal-sandbox/bin/version-number-parser +What is the version output file path? +test-input/gifcurry-version-output.txt +``` + +To try the SRT file parser, run the following. + +```bash +cd parsing-with-haskell-parser-combinators +.cabal-sandbox/bin/srt-file-parser +What is the SRT file path? +test-input/subtitles.srt +``` + +## Parser Combinator + + +

+Parser Combinators +
+ +

+
+ +One of the better ways to learn about the parsing strategy, +[parser combinator](https://en.wikipedia.org/wiki/Parser_combinator), +is to look at an implementation of one. + +
+

+Parsers built using combinators are straightforward to construct, readable, modular, well-structured, and easily maintainable. +

+ +—Parser combinator - Wikipedia + +

+
+ +### ReadP + +Let's take a look under the hood of [ReadP](https://hackage.haskell.org/package/base-4.12.0.0/docs/Text-ParserCombinators-ReadP.html), +a parser combinator library found in base. +Since it is in base, you should already have it. + +:bulb: Note, you may want to try out [Parsec](https://hackage.haskell.org/package/parsec) after getting familiar with ReadP. +It too is a parser combinator library that others prefer to ReadP. +As an added bonus, it is included in +[GHC's boot libraries](https://gitlab.haskell.org/ghc/ghc/wikis/commentary/libraries/version-history) +as of GHC version 8.4.1. + +#### P Data Type + +```haskell +-- (c) The University of Glasgow 2002 + +data P a + = Get (Char -> P a) + | Look (String -> P a) + | Fail + | Result a (P a) + | Final [(a,String)] + deriving Functor +``` + +We'll start with the `P` data type. +The `a` in `P a` is up to you (the library user) and can be whatever you'd like. +The compiler creates a functor instance automatically and there are hand-written instances for +applicative, +monad, +`MonadFail`, +and alternative. + +:bulb: Note, for more on functors, applicatives, and monads, checkout +[Your easy guide to Monads, Applicatives, & Functors](https://medium.com/@lettier/your-easy-guide-to-monads-applicatives-functors-862048d61610). + +`P` is a [sum type](https://en.wikipedia.org/wiki/Tagged_union) with five cases. + +- `Get` consumes a single character from the input string and returns a new `P`. +- `Look` accepts a duplicate of the input string and returns a new `P`. +- `Fail` indicates the parser finished without a result. +- `Result` holds a possible parsing and another `P` case. +- `Final` is a list of two-tuples. The first tuple element is a possible parsing of the input + and the second tuple element is the rest of the input string that wasn't consumed by `Get`. + +#### Run + +```haskell +-- (c) The University of Glasgow 2002 + +run :: P a -> ReadS a +run (Get f) (c:s) = run (f c) s +run (Look f) s = run (f s) s +run (Result x p) s = (x,s) : run p s +run (Final r) _ = r +run _ _ = [] +``` + +`run` is the heart of the ReadP parser. +It does all of the heavy lifting as it recursively runs through all of the parser states that we saw up above. +You can see that it takes a `P` and returns a `ReadS`. + +```haskell +-- (c) The University of Glasgow 2002 + +type ReadS a = String -> [(a,String)] +``` + +`ReadS a` is a type alias for `String -> [(a,String)]`. +So whenever you see `ReadS a`, think `String -> [(a,String)]`. + +```haskell +-- (c) The University of Glasgow 2002 + +run :: P a -> String -> [(a,String)] +run (Get f) (c:s) = run (f c) s +run (Look f) s = run (f s) s +run (Result x p) s = (x,s) : run p s +run (Final r) _ = r +run _ _ = [] +``` + +`run` pattern matches the different cases of `P`. + +- If it's `Get`, + it calls itself with a new `P` (returned by passing the function `f`, in `Get f`, the next character `c` in the input string) + and the rest of the input string `s`. +- If it's `Look`, + it calls itself with a new `P` (returned by passing the function `f`, in `Look f`, the input string `s`) + and the input string. + Notice how `Look` doesn't consume any characters from the input string like `Get` does. +- If it's `Result`, + it assembles a two-tuple—containing the parsed result and what's left of the input string—and + prepends this to the result of a recursive call that runs with another `P` case and the input string. +- If it's `Final`, `run` returns a list of two-tuples containing parsed results and input string leftovers. +- For anything else, `run` returns an empty list. + For example, if the case is `Fail`, `run` will return an empty list. + +```haskell +> run (Get (\ a -> Get (\ b -> Result [a,b] Fail))) "12345" +[("12","345")] +``` + +ReadP doesn't expose `run` but if it did, you could call it like this. +The two `Get`s consume the `'1'` and `'2'`, leaving the `"345"` behind. + +```haskell +> run (Get (\ a -> Get (\ b -> Result [a,b] Fail))) "12345" +> run (Get (\ b -> Result ['1',b] Fail)) "2345" +> run (Result ['1','2'] Fail) "345" +> (['1', '2'], "345") : run (Fail) "345" +> (['1', '2'], "345") : [] +[("12","345")] +``` + +Running through each recursive call, you can see how we arrived at the final result. + +```haskell +> run (Get (\ a -> Get (\ b -> Result [a,b] (Final [(['a','b'],"c")])))) "12345" +[("12","345"),("ab","c")] +``` + +Using `Final`, you can include a parsed result in the final list of two-tuples. + +#### readP_to_S + +```haskell +-- (c) The University of Glasgow 2002 + + readP_to_S :: ReadP a -> ReadS a +-- readP_to_S :: ReadP a -> String -> [(a,String)] + readP_to_S (R f) = run (f return) +``` + +While ReadP doesn't expose `run` directly, it does expose it via `readP_to_S`. +`readP_to_S` introduces a `newtype` called `ReadP`. +`readP_to_S` accepts a `ReadP a`, a string, and returns a list of two-tuples. + +#### ReadP Newtype + + +

+ReadP Newtype +
+ +

+
+ +```haskell +-- (c) The University of Glasgow 2002 + +newtype ReadP a = R (forall b . (a -> P b) -> P b) +``` + +Here's the definition of `ReadP a`. +There are instances for functor, applicative, monad, `MonadFail`, alternative, and `MonadPlus`. +The `R` constructor takes a function that takes another function and returns a `P`. +The accepted function takes whatever you chose for `a` and returns a `P`. + +```haskell +-- (c) The University of Glasgow 2002 + +readP_to_S (R f) = run (f return) +``` + +Recall that `P` is a monad and `return`'s type is `a -> m a`. +So `f` is the `(a -> P b) -> Pb` function and `return` is the `(a -> P b)` function. +Ultimately, `run` gets the `P b` it expects. + +```haskell +-- (c) The University of Glasgow 2002 + +readP_to_S (R f) inputString = run (f return) inputString +-- ^^^^^^^^^^^ ^^^^^^^^^^^ +``` + +It's left off in the source code but remember that `readP_to_S` and `run` expects an input string. + +```haskell +-- (c) The University of Glasgow 2002 + +instance Functor ReadP where + fmap h (R f) = R (\k -> f (k . h)) +``` + +Here's the functor instance definition for `ReadP`. + +```haskell +> readP_to_S (fmap toLower get) "ABC" +[('a',"BC")] + +> readP_to_S (toLower <$> get) "ABC" +[('a',"BC")] +``` + +This allows us to do something like this. +`fmap` functor maps `toLower` over the functor `get` which equals `R Get`. +Recall that the type of `Get` is `(Char -> P a) -> P a` which the `ReadP` constructor (`R`) accepts. + +```haskell +-- (c) The University of Glasgow 2002 + +fmap h (R f ) = R (\ k -> f (k . h )) +fmap toLower (R Get) = R (\ k -> Get (k . toLower)) +``` + +Here you see the functor definition rewritten for the `fmap toLower get` example. + +#### Applicative P Instance + +Looking up above, how did `readP_to_S` return `[('a',"BC")]` when we only used `Get` which doesn't terminate `run`? +The answer lies in the applicative definition for `P`. + +```haskell +-- (c) The University of Glasgow 2002 + +instance Applicative P where + pure x = Result x Fail + (<*>) = ap +``` + +`return` equals `pure` so we could rewrite `readP_to_S (R f) = run (f return)` to be `readP_to_S (R f) = run (f pure)`. +By using `return` or rather `pure`, `readP_to_S` sets `Result x Fail` as the final case `run` will encounter. +If reached, +`run` will terminate and we'll get our list of parsings. + +```haskell +> readP_to_S (fmap toLower get) "ABC" + +-- Use the functor instance to transform fmap toLower get. +> readP_to_S (R (\ k -> Get (k . toLower))) "ABC" + +-- Call run which removes R. +> run ((\ k -> Get (k . toLower)) pure) "ABC" + +-- Call function with pure to get rid of k. +> run (Get (pure . toLower)) "ABC" + +-- Call run for Get case to get rid of Get. +> run ((pure . toLower) 'A') "BC" + +-- Call toLower with 'A' to get rid of toLower. +> run (pure 'a') "BC" + +-- Use the applicative instance to transform pure 'a'. +> run (Result 'a' Fail) "BC" + +-- Call run for the Result case to get rid of Result. +> ('a', "BC") : run (Fail) "BC" + +-- Call run for the Fail case to get rid of Fail. +> ('a', "BC") : [] + +-- Prepend. +[('a',"BC")] +``` + +Here you see the flow from `readP_to_S` to the parsed result. + +#### Alternative P Instance + +```haskell +-- (c) The University of Glasgow 2002 + +instance Alternative P where + -- ... + + -- most common case: two gets are combined + Get f1 <|> Get f2 = Get (\c -> f1 c <|> f2 c) + + -- results are delivered as soon as possible + Result x p <|> q = Result x (p <|> q) + p <|> Result x q = Result x (p <|> q) + + -- ... +``` + +The `Alternative` instance for `P` allows us to split the flow of the parser into a left and right path. +This comes in handy when the input can go none, one, or (more rarely) two of two ways. + +```haskell +> readP_to_S ((get >>= \ a -> return a) <|> (get >> get >>= \ b -> return b)) "ABC" +[('A',"BC"),('B',"C")] +``` + +The `<|>` operator or function introduces a fork in the parser's flow. +The parser will travel through both the left and right paths. +The end result will contain all of the possible parsings that went left +and all of the possible parsings that went right. +If both paths fail, then the whole parser fails. + +:bulb: Note, in other parser combinator implementations, +when using the `<|>` operator, +the parser will go left or right but not both. +If the left succeeds, the right is ignored. +The right is only processed if the left side fails. + +```haskell +> readP_to_S ((get >>= \ a -> return [a]) <|> look <|> (get >> get >>= \a -> return [a])) "ABC" +[("ABC","ABC"),("A","BC"),("B","C")] +``` + +You can chain the `<|>` operator for however many options or alternatives there are. +The parser will return a possible parsing involving each. + +#### ReadP Failure + +```haskell +-- (c) The University of Glasgow 2002 + +instance Monad ReadP where + fail _ = R (\_ -> Fail) + R m >>= f = R (\k -> m (\a -> let R m' = f a in m' k)) +``` + +Here is the `ReadP` monad instance. +Notice the definition for `fail`. + +```haskell +> readP_to_S ((\ a b c -> [a,b,c]) <$> get <*> get <*> get) "ABC" +[("ABC","")] + +> readP_to_S ((\ a b c -> [a,b,c]) <$> get <*> fail "" <*> get) "ABC" +[] + +> readP_to_S (get >>= \ a -> get >>= \ b -> get >>= \ c -> return [a,b,c]) "ABC" +[("ABC","")] + +> readP_to_S (get >>= \ a -> get >>= \ b -> fail "" >>= \ c -> return [a,b,c]) "ABC" +[] +``` + +You can cause an entire parser path to abort by calling `fail`. +Since ReadP doesn't provide a direct way to generate a `Result` or `Final` case, +the return value will be an empty list. +If the failed path is the only path, then the entire result will be an empty list. +Recall that when `run` matches `Fail`, it returns an empty list. + +```haskell +-- (c) The University of Glasgow 2002 + +instance Alternative P where + -- ... + + -- fail disappears + Fail <|> p = p + p <|> Fail = p + + -- ... +``` + +Going back to the alternative `P` instance, +you can see how a failure on either side (but not both) will not fail the whole parser. + +```haskell +> readP_to_S (get >>= \ a -> get >>= \ b -> pfail >>= \ c -> return [a,b,c]) "ABC" +[] +``` + +Instead of using `fail`, ReadP provides `pfail` which allows you to generate a `Fail` case directly. + +## Version Number + + +

+Version Number +
+ +

+
+ +[Gifcurry](https://github.com/lettier/gifcurry), +the Haskell-built video editor for GIF makers, shells out to various different programs. +To ensure compatibility, it needs the version number for each of the programs it shells out to. +One of those programs is ImageMagick. + +```bash +Version: ImageMagick 6.9.10-14 Q16 x86_64 2018-10-24 https://imagemagick.org +Copyright: © 1999-2018 ImageMagick Studio LLC +License: https://imagemagick.org/script/license.php +Features: Cipher DPC HDRI Modules OpenCL OpenMP +``` + +Here you see the output of `convert --version`. +How could you parse this to capture the 6, 9, 10, and 14? + +Looking at the output, +we know the version number is a collection of numbers separated by either a period or a dash. +This definition covers the dates as well so we'll make sure that the first two numbers are separated by a period. +That way, if they put a date before the version number, we won't get the wrong result. + + +

+Version Number Parser +
+ +

+
+ +```txt +1. Consume zero or more characters that are not 0 through 9 and go to 2. +2. Consume zero or more characters that are 0 through 9, save this number, and go to 3. +3. Look at the rest of the input and go to 4. +4. If the input + - is empty, go to 6. + - starts with a period, go to 1. + - starts with a dash + - and you have exactly one number, go to 5. + - and you have more than one number, go to 1. + - doesn't start with a period or dash + - and you have exactly one number, go to 5. + - you have more than one number, go to 6. +5. Delete any saved numbers and go to 1. +6. Return the numbers found. +``` + +Before we dive into the code, here's the algorithm we'll be following. + +### Building The Version Number Parser + +```haskell +parseVersionNumber + :: [String] + -> ReadP [String] +parseVersionNumber + nums + = do + _ <- parseNotNumber + num <- parseNumber + let nums' = nums ++ [num] + parseSeparator nums' parseVersionNumber +``` + +`parseVersionNumber` is the main parser combinator that parses an input string for a version number. +It accepts a list of strings and returns a list of strings in the context of the `ReadP` data type. +The accepted list of strings is not the input that gets parsed but rather the list of numbers found so far. +For the first function call, the list is empty since it hasn't parsed anything yet. + +```haskell +parseVersionNumber + nums +``` + +Starting from the top, +`parseVersionNumber` takes a list of strings which are the current list of numbers found so far. + +```haskell + _ <- parseNotNumber +``` + +`parseNotNumber` consumes everything that isn't a number from the input string. +Since we are not interested in the result, we discard it (`_ <-`). + +```haskell + num <- parseNumber + let nums' = nums ++ [num] +``` + +Next we consume everything that is a number and then add that to the list of numbers found so far. + +```haskell + parseSeparator nums' parseVersionNumber +``` + +After `parseVersionNumber` has processed the next number, it passes the list of numbers found and itself to `parseSeparator`. + +#### Parsing The Separator + +```haskell +parseSeparator + :: [String] + -> ([String] -> ReadP [String]) + -> ReadP [String] +parseSeparator + nums + f + = do + next <- look + case next of + "" -> return nums + (c:_) -> + case c of + '.' -> f nums + '-' -> if length nums == 1 then f [] else f nums + _ -> if length nums == 1 then f [] else return nums +``` + +Here you see `parseSeparator`. + +```haskell + next <- look + case next of + "" -> return nums + (c:_) -> +``` + +`look` allows us to get what's left of the input string without consuming it. +If there's nothing left, it returns the numbers found. +However, if there is something left, it analyzes the first character. + +```haskell + case c of + '.' -> f nums + '-' -> if length nums == 1 then f [] else f nums + _ -> if length nums == 1 then f [] else return nums +``` + +If the next character is a period, call `parseVersionNumber` again with the current list of numbers found. +If it's a dash and we have exactly one number, call `parseVersionNumber` with an empty list of numbers since it's a date. +If it's a dash and we don't have exactly one number, call `parseVersionNumber` with the list of numbers found so far. +Otherwise, +call `parseVersionNumber` with an empty list if we have exactly one number +or return the numbers found if we don't have exactly one number. + +#### Parsing Non-numbers + +```haskell +parseNotNumber + :: ReadP String +parseNotNumber + = + munch (not . isNumber) +``` + +`parseNotNumber` uses `munch` which `ReadP` provides. +`munch` is given the predicate `(not . isNumber)` which returns true for any character that isn't 0 through 9. + +```haskell +munch :: (Char -> Bool) -> ReadP String +``` + +`munch` continuously calls `get` if the next character in the input string satisfies the predicate. +If it doesn't, `munch` returns the characters that did, if any. +Since it only uses `get`, munch always succeeds. + +:bulb: Note, `parseNumber` is similar to `parseNotNumber`. +Instead of `not . isNumber`, the predicate is just `isNumber`. + +#### Munch Versus Many + +```haskell +parseNotNumber' + :: ReadP String +parseNotNumber' + = + many (satisfy (not . isNumber)) +``` + +Instead of using `munch`, +you could write `parseNotNumber` like this, +using `many` and `satisfy`—both of which ReadP provides. +Looking at the type signature for `many`, it accepts a single parser combinator (`ReadP a`). +In this instance, it's being given the parser combinator `satisfy`. + +```haskell +> readP_to_S (satisfy (not . isNumber)) "a" +[('a',"")] + +> readP_to_S (satisfy (not . isNumber)) "1" +[] +``` + +`satisfy` takes a predicate and uses `get` to consume the next character. +If the accepted predicate returns true, `satisfy` returns the character. +Otherwise, `satisfy` calls `pfail` and fails. + +```haskell +> readP_to_S (munch (not . isNumber)) "abc123" +[("abc","123")] + +> readP_to_S (many (satisfy (not . isNumber))) "abc123" +[("","abc123"),("a","bc123"),("ab","c123"),("abc","123")] +``` + +Using `many` can give you unwanted results. +Ultimately, `many` introduces one or more `Result` cases. +Because of this, `many` always succeeds. + +```haskell +> readP_to_S (many look) "abc123" +-- Runs forever. +``` + +`many` will run your parser until it fails or runs out of input. +If your parser never fails or never runs out of input, `many` will never return. + +```haskell +> readP_to_S (many (get >>= \ a -> return (read (a : "") :: Int))) "12345" +[([],"12345"),([1],"2345"),([1,2],"345"),([1,2,3],"45"),([1,2,3,4],"5"),([1,2,3,4,5],"")] +``` + +For every index in the result, +the parsed result will be the outcome of having ran the parser index times on the entire input. + +```haskell +> let parser = get >>= \ a -> return (read (a : "") :: Int) +> let many' results = return results <|> (parser >>= \ result -> many' (results ++ [result])) +> readP_to_S (many' []) "12345" +[([],"12345"),([1],"2345"),([1,2],"345"),([1,2,3],"45"),([1,2,3,4],"5"),([1,2,3,4,5],"")] +``` + +Here's an alternate definition for `many`. +On the left side of `<|>`, +it returns the current parser results. +On the right side of `<|>`, +it runs the parser, +adds that result to the current parser results, +and calls itself with the updated results. +This has a cumulative sum type effect where index `i` is the parser result appended to the parser result at +`i - 1`, +`i - 2`, +..., +and `1`. + +### Running The Version Number Parser + +Now that we built the parser, let's run it. + +```haskell +> let inputString = +> "Some Program (C) 1234-56-78 All rights reserved.\n\ +> \Version: 12.345.6-7\n\ +> \License: Some open source license." +> readP_to_S (parseVersionNumber []) inputString +[(["12","345","6","7"],"\nLicense: Some open source license.")] +``` + +You can see it extracted the version number correctly even with the date coming before it. + +## SRT + + +

+SRT +
+ +

+
+ +Now let's parse something more complicated—SRT files. + +For the release of +[Gifcurry](https://lettier.github.io/gifcurry) +six, I needed to parse +[SRT (SubRip Text) files](http://www.visualsubsync.org/help/srt). +SRT files contain subtitles that video processing programs use to display text on top of a video. +Typically this text is the dialog of a movie translated into various different languages. +By keeping the text separate from the video, +there only needs to be one video which saves time, storage space, and bandwidth. +The video software can swap out the text without having to swap out the video. +Contrast this with burning-in or hard-coding the subtitles where the text becomes a part of the image data that makes up the video. +In this case, you would need a video for each collection of subtitles. + + +

+Gifcurry +
+Inner Video © Blender Foundation | www.sintel.org +

+
+ +Gifcurry can take a SRT file and burn-in the subtitles for the video slice your select. + +```txt +7 +00:02:09,400 --> 00:02:13,800 +What brings you to +the land of the gatekeepers? + +8 +00:02:15,000 --> 00:02:17,500 +I'm searching for someone. + +9 +00:02:18,000 --> 00:02:22,200 +Someone very dear? +A kindred spirit? +``` + +Here you see the English subtitles for +[Sintel](https://durian.blender.org/) (© Blender Foundation | www.sintel.org). + +### SRT Format + +
+

+SRT is perhaps the most basic of all subtitle formats. +

+ +—SRT Subtitle | Matrosk + +

+
+ +The SRT file format consists of blocks, one for each subtitle, separated by an empty line. + +```txt +2 +``` + +At the top of the block is the index. +This determines the order of the subtitles. +Hopefully the subtitles are already in order and all of them have unique indexes but this may not be the case. + +```txt +01:04:13,000 --> 02:01:01,640 X1:167 X2:267 Y1:33 Y2:63 +``` + +After the index is the start time, end time, and an optional set of points specifying the rectangle the +subtitle text should go in. + +```txt +01:04:13,000 +``` + +The timestamp format is `hours:minutes:seconds,milliseconds`. + +:bulb: Note the comma instead of the period separating the seconds from the milliseconds. + +```txt +This is the actual subtitle +text. It can span multiple lines. +It may include formating +like bold, italic, +underline, +and font color. +``` + +The third and last part of a block is the subtitle text. +It can span multiple lines and ends when there is an empty line. +The text can include formatting tags reminiscent of HTML. + +### Building The SRT Parser + + +

+Parsing SRT +
+ +

+
+ +```haskell +parseSrt + :: ReadP [SrtSubtitle] +parseSrt + = + manyTill parseBlock (skipSpaces >> eof) +``` + +`parseSrt` is the main parser combinator that handles everything. +It parses each block until it reaches the end of the file (`eof`) or input. +To be on the safe side, +there could be trailing whitespace between the last block and the end of the file. +To handle this, it parses zero or more characters of whitespace (`skipSpaces`) before parsing +the end of the file (`skipSpaces >> eof`). +If there is still input left by the time `eof` is reached, `eof` will fail and this will return nothing. +Therefore, it's important that `parseBlock` doesn't leave any thing but whitespace behind. + +#### Building The SRT Block Parser + +```haskell +parseBlock + :: ReadP SrtSubtitle +parseBlock + = do + i <- parseIndex + (s, e) <- parseTimestamps + c <- parseCoordinates + t <- parseTextLines + return + SrtSubtitle + { index = i + , start = s + , end = e + , coordinates = c + , taggedText = t + } +``` + +As we went over earlier, a block consists of an index, timestamps, possibly some coordinates, and some lines of text. +In this version of `parseBlock`, you see the more imperative do notation style with the record syntax. + +```haskell +parseBlock' + :: ReadP SrtSubtitle +parseBlock' + = + SrtSubtitle + <$> parseIndex + <*> parseStartTimestamp + <*> parseEndTimestamp + <*> parseCoordinates + <*> parseTextLines +``` + +Here's another way you could write `parseBlock`. +This is the applicative style. +Just be sure to get the order right. +For example, I could've accidentally mixed up the start and end timestamps. + +#### Building The SRT Index Parser + + +

+Parsing The Index +
+ +

+
+ +```haskell +parseIndex + :: ReadP Int +parseIndex + = + skipSpaces + >> readInt <$> parseNumber +``` + +At the top of the block is the index. +Here you see `skipSpaces` again. +After skipping over whitespace, +it parses the input for numbers and converts it to an actual integer. + +```haskell +readInt + :: String + -> Int +readInt + = + read +``` + +`readInt` looks like this. + +```haskell +> read "123" :: Int +123 +> read "1abc" :: Int +*** Exception: Prelude.read: no parse +``` + +Normally using `read` directly can be dangerous. +`read` may not be able to convert the input to the specified type. +However, `parseNumber` will only return the 10 numerical digit characters (`['0'..'9']`) +so using `read` directly becomes safe. + +#### Building The SRT Timestamps Parser + + +

+Parsing The Timestamps +
+ +

+
+ +Parsing the timestamps are a little more involved than parsing the index. + +```haskell +parseTimestamps + :: ReadP (Timestamp, Timestamp) +parseTimestamps + = do + _ <- char '\n' + s <- parseTimestamp + _ <- skipSpaces + _ <- string "-->" + _ <- skipSpaces + e <- parseTimestamp + return (s, e) +``` + +This is the main combinator for parsing the timestamps. + +`char` parses the character you give it or it fails. +If it fails then `parseTimestamps` fails, ultimately causing `parseSrt` to fail +so there must be a newline character after the index. + +`string` is like `char` except instead of just one character, it +parses the string of characters you give it or it fails. + +```haskell +parseStartTimestamp + :: ReadP Timestamp +parseStartTimestamp + = + char '\n' + >> parseTimestamp +``` + +`parseTimestamps` parses both timestamps, +but for the applicative style (`parseSrt'`), +we need a parser just for the start timestamp. + +```haskell +parseEndTimestamp + :: ReadP Timestamp +parseEndTimestamp + = + skipSpaces + >> string "-->" + >> skipSpaces + >> parseTimestamp +``` + +This parses everything between the timestamps and returns the end timestamp. + +```haskell +parseTimestamp + :: ReadP Timestamp +parseTimestamp + = do + h <- parseNumber + _ <- char ':' + m <- parseNumber + _ <- char ':' + s <- parseNumber + _ <- char ',' <|> char '.' + m' <- parseNumber + return + Timestamp + { hours = readInt h + , minutes = readInt m + , seconds = readInt s + , milliseconds = readInt m' + } +``` + +This parses the four numbers that make up the timestamp. +The first three numbers are separated by a colon and the last one is separated by a comma. +To be more forgiving, however, we allow the possibility of there being a period instead of a comma. + +```haskell +> readP_to_S (char '.' <|> char ',') "..." +[('.',"..")] + +> readP_to_S (char '.' <|> char ',') ",.." +[(',',"..")] +``` + +:bulb: Note, when using `char` with `<|>`, +only one side can succeed (two `char` enter, one `char` leave) +since `char` consumes a single character and two characters cannot occupy the same space. + +#### Building The SRT Coordinates Parser + + +

+Parsing The Coordinates +
+ +

+
+ +The coordinates are an optional part of the block but if included, will be on the same line as the timestamps. + +```haskell +parseCoordinates + :: ReadP (Maybe SrtSubtitleCoordinates) +parseCoordinates + = + option Nothing $ do + _ <- skipSpaces1 + x1 <- parseCoordinate 'x' 1 + _ <- skipSpaces1 + x2 <- parseCoordinate 'x' 2 + _ <- skipSpaces1 + y1 <- parseCoordinate 'y' 1 + _ <- skipSpaces1 + y2 <- parseCoordinate 'y' 2 + return + $ Just + SrtSubtitleCoordinates + { x1 = readInt x1 + , x2 = readInt x2 + , y1 = readInt y1 + , y2 = readInt y2 + } +``` + +`option` takes two arguments. +The first argument is returned if the second argument, a parser, fails. +So if the coordinates parser fails, `parseCoordinates` will return `Nothing`. +Put another way, the coordinates parser failing does not cause the whole parser to fail. +This block will just have `Nothing` for its `coordinates` "field". + +```haskell +parseCoordinate + :: Char + -> Int + -> ReadP String +parseCoordinate + c + n + = do + _ <- char (Data.Char.toUpper c) <|> char (Data.Char.toLower c) + _ <- string $ show n ++ ":" + parseNumber +``` + +This parser allows the coordinate labels to be in either uppercase or lowercase. +For example, `x1:1 X2:2 Y1:3 y2:4` would succeed. + +#### Building The SRT Text Parser + + +

+Parsing The Text +
+ +

+
+ +Parsing the text is the most involved portion due to the HTML-like tag formatting. + +Tag parsing can be challenging—just ask anyone who parses them with a regular expression. +To make this easier on us—and for the user—we'll use a +[tag soup](https://en.wikipedia.org/wiki/Tag_soup) +kind of approach. +The parser will allow unclosed and/or wrongly nested tags. +It will also allow any tag and not just `b`, `u`, `i`, and `font`. + +```haskell +parseTextLines + :: ReadP [TaggedText] +parseTextLines + = + char '\n' + >> (getTaggedText <$> manyTill parseAny parseEndOfTextLines) +``` + +We start out by matching on a newline character. +After that, we functor map or fmap (`<$>`) `getTaggedText` over the subtitle text characters until we reach the end of the text lines. + +```haskell +parseEndOfTextLines + :: ReadP () +parseEndOfTextLines + = + void (string "\n\n") <|> eof +``` + +We stop collecting characters (`parseAny`) when we reach two newline characters or the end of the file. +This signals the end of the block. + +```haskell +getTaggedText + :: String + -> [TaggedText] +getTaggedText + s + = + fst + $ foldl + folder + ([], []) + parsed + where +``` + +`getTaggedText` folds through the parsed text from left to right, returning the accumulated tagged text. + +```haskell + parsed + :: [String] + parsed + = + case readP_to_S (parseTaggedText []) s of + [] -> [s] + r@(_:_) -> (fst . last) r +``` + +`parsed` returns a list of one or more strings. +It attempts to parse the input text for tags. +If that fails, `parsed` returns the input string inside a list. +Otherwise, if `parseTaggedText` succeeds, `parse` returns the last possible parsing (`(fst . last) r`). + +```haskell + folder + :: ([TaggedText], [Tag]) + -> String + -> ([TaggedText], [Tag]) + folder + (tt, t) + x + | isTag x = (tt, updateTags t x) + | otherwise = (tt ++ [TaggedText { text = x, tags = t}], t) +``` + +As `folder` moves from left to right, over the parsed strings, it checks if the current string is a tag. +If it is a tag, it updates the current set of active tags (`t`). +Otherwise, it appends another tagged piece of text associated with the set of active tags. + +```haskell +updateTags + :: [Tag] + -> String + -> [Tag] +updateTags + tags + x + | isClosingTag x = remove compare' tags (makeTag x) + | isOpeningTag x = add compare' tags (makeTag x) + | otherwise = tags + where + compare' + :: Tag + -> Tag + -> Bool + compare' + a + b + = + name a /= name b +``` + +`updateTags` updates the `tags` given by either removing or adding the given tag (`x`) depending on if it is a closing or opening tag. +If it is neither, it just returns the passed set of tags. +`add` will overwrite an existing tag if `tags` already has a tag by the same name. +You can see this in the `compare'` function given. + +To keep the parser simple, if an opening tag `T` is found, `T` gets added to the list of tags +or overwrites an exiting `T` if already present. +If a corresponding closing `/T` is found, then `T` is removed from the list of tags, if present. +It doesn't matter if there is two or more `T`s in a row, +one or more `T`s without a closing `/T`, +and/or there's a closing `/T` without an opening `T`. + +```haskell +makeTag + :: String + -> Tag +makeTag + s + = + Tag + { name = getTagName s + , attributes = getTagAttributes s + } +``` + +`makeTag` assembles a tag from the given string (`s`). +Each `Tag` has a name and zero or more attributes. + +```haskell +parseTaggedText + :: [String] + -> ReadP [String] +parseTaggedText + strings + = do + s <- look + case s of + "" -> return strings + _ -> do + r <- munch1 (/= '<') <++ parseClosingTag <++ parseOpeningTag + parseTaggedText $ strings ++ [r] +``` + +`parseTaggedText` returns the input string broken up into pieces. +Each piece is either the text enclosed by tags, a closing tag, or an opening tag. +After it splits off a piece, it adds it to the other pieces and calls itself again. +If the remaining input string is empty, it returns the list of strings found. + +```haskell +> readP_to_S (string "ab" <++ string "abc") "abcd" +[("ab","cd")] + +> readP_to_S (string "ab" +++ string "abc") "abcd" +[("ab","cd"),("abc","d")] + +> readP_to_S (string "ab" <|> string "abc") "abcd" +[("ab","cd"),("abc","d")] +``` + +The `<++` operator is left biased meaning that if the left side succeeds, it won't even bother with the right. +Recall that when we run the parser, we get a list of all the possible parsings. +All of these possible parsings are the result of the parser having traveled through all of the possible paths. +By using `<++`, +we receive the possible parsings from the left path and from the right path if and only if the left side failed. +If you'd like all of the possible parsings through the left and right side, +you can use the `+++` operator provided by `ReadP`. +`+++` is just `<|>` which we saw up above. + +```haskell +parseOpeningTag + :: ReadP String +parseOpeningTag + = do + _ <- char '<' + t <- munch1 (\ c -> c /= '/' && c /= '>') + _ <- char '>' + return $ "<" ++ t ++ ">" +``` + +An opening tag is an opening angle bracket, some text that doesn't include a forward slash, and the next immediate closing angle bracket. + +```haskell +parseClosingTag + :: ReadP String +parseClosingTag + = do + _ <- char '<' + _ <- char '/' + t <- munch1 (/= '>') + _ <- char '>' + return $ "" +``` + +A closing tag is an opening angle bracket, a forward slash, some text, and the next immediate closing angle bracket. + + +

+Parsing Tags +
+ +

+
+ +```haskell +getTagAttributes + :: String + -> [TagAttribute] +getTagAttributes + s + = + if isOpeningTag s + then + case readP_to_S (parseTagAttributes []) s of + [] -> [] + (x:_) -> fst x + else + [] +``` + +Opening tags can have attributes. +For example, ``. +Each attribute is a two-tuple, key-value pair. +In the above example, `color` would be the key and `#101010` would be the value. + +```haskell +getTagName + :: String + -> String +getTagName + s + = + case readP_to_S parseTagName s of + [] -> "" + (x:_) -> toLower' $ fst x +``` + +This returns the tag name in lowercase. + +```haskell +parseTagName + :: ReadP String +parseTagName + = do + _ <- char '<' + _ <- munch (== '/') + _ <- skipSpaces + n <- munch1 (\ c -> c /= ' ' && c /= '>') + _ <- munch (/= '>') + _ <- char '>' + return n +``` + +The tag name is the first string of non-whitespace characters +after the opening angle bracket, +a possible forward slash, +and some possible whitespace +and before some more whitespace +and/or the closing angle bracket. + +```haskell +parseTagAttributes + :: [TagAttribute] + -> ReadP [TagAttribute] +parseTagAttributes + tagAttributes + = do + s <- look + case s of + "" -> return tagAttributes + _ -> do + let h = head s + case h of + '>' -> return tagAttributes + '<' -> trimTagname >> parseTagAttributes' + _ -> parseTagAttributes' + where + parseTagAttributes' + :: ReadP [TagAttribute] + parseTagAttributes' + = do + tagAttribute <- parseTagAttribute + parseTagAttributes + ( add + (\ a b -> fst a /= fst b) + tagAttributes + tagAttribute + ) +``` + +`parseTagAttributes` recursively goes through the input string, collecting up the key-value pairs. +At the start of the tag (`<`), it first trims the tag name before tackling the attributes. +It stops parsing for attributes when it reaches the closing angle bracket (`>`). +If a tag happens to have duplicate attributes (based on the key), +`add` will ensure only the latest one remains in the list. + +```haskell +trimTagname + :: ReadP () +trimTagname + = + char '<' + >> skipSpaces + >> munch1 (\ c -> c /= ' ' && c /= '>') + >> return () +``` + +This trims or discards the tag name. + +```haskell +parseTagAttribute + :: ReadP TagAttribute +parseTagAttribute + = do + _ <- skipSpaces + k <- munch1 (/= '=') + _ <- string "=\"" + v <- munch1 (/= '\"') + _ <- char '\"' + _ <- skipSpaces + return (toLower' k, v) +``` + +The attribute key is any string of non-whitespace characters before the equal sign. +The attribute value is any characters after the equal sign and double quote and before the next immediate double quote. + +```haskell +isTag + :: String + -> Bool +isTag + s + = + isOpeningTag s || isClosingTag s +``` + +A string is a tag if it is either an opening tag or a closing tag. + +```haskell +isOpeningTag + :: String + -> Bool +isOpeningTag + s + = + isPresent $ readP_to_S parseOpeningTag s +``` + +A string is an opening tag if the opening tag parser succeeds. + +```haskell +isClosingTag + :: String + -> Bool +isClosingTag + s + = + isPresent $ readP_to_S parseClosingTag s +``` + +A string is a closing tag if the closing tag parser succeeds. + +### Running The SRT Parser + + +

+Parsed SRT Results +
+ +

+
+ +Now that we've assembled the parser, let's try it out. + +```haskell +> let srt = +> " 1\n\ +> \0:0:0,1 --> 0:1:0.2 x1:1 X2:3 y1:4 y2:10\n\ +> \This is some \n \ +> \subtitle \n\ +> \text. " +> readP_to_S parseSrt srt +[([ SrtSubtitle + { index = 1 + , start = Timestamp {hours = 0, minutes = 0, seconds = 0, milliseconds = 1} + , end = Timestamp {hours = 0, minutes = 1, seconds = 0, milliseconds = 2} + , coordinates = Just (SrtSubtitleCoordinates {x1 = 1, x2 = 3, y1 = 4, y2 = 10}) + , taggedText = [ TaggedText + { text = "This is some " + , tags = [ Tag {name = "font", attributes = [("color","blue")]} + ] + } + , TaggedText + { text = "\n subtitle \n" + , tags = [ Tag {name = "font", attributes = [("color","blue")]} + , Tag {name = "b", attributes = []} + , Tag {name = "u", attributes = []} + , Tag {name = "i", attributes = []} + ] + } + , TaggedText + { text = "text." + , tags = [ Tag {name = "font", attributes = [("color","blue")]} + , Tag {name = "b", attributes = []} + , Tag {name = "i", attributes = []} + ] + } + , TaggedText + { text = " " + , tags = [ Tag {name = "font", attributes = [("color","blue")]} + , Tag {name = "i", attributes = []} + ] + } + ] + } + ] +, "" +)] +``` + +Here you see the result of parsing a test string. +Notice the errors in the test string like the use of a period instead of a comma or the duplicate tag attribute. + +## Exercises + +- Write a program that can convert an SRT file to a JSON file. +- Rewrite the version number parser using Parsec instead of ReadP. +- Rewrite the SRT parser using Parsec instead of ReadP. + +## Copyright + +(C) 2019 David Lettier +
+[lettier.com](https://www.lettier.com/) diff --git a/Setup.hs b/Setup.hs new file mode 100644 index 0000000..9a994af --- /dev/null +++ b/Setup.hs @@ -0,0 +1,2 @@ +import Distribution.Simple +main = defaultMain diff --git a/docs/_build-docs.sh b/docs/_build-docs.sh new file mode 100755 index 0000000..e860121 --- /dev/null +++ b/docs/_build-docs.sh @@ -0,0 +1,23 @@ +#!/usr/bin/env bash + +SCRIPT_PATH="$(cd "$(dirname "$0")"; pwd -P)" +MAIN_TITLE="Parsing With Haskell Parser Combinators" +DESCRIPTION="Need to parse something? Never heard of a parser combinator? \ +Looking to learn some Haskell? \ +Awesome! This is everything you'll need to get up and parsing with Haskell parser combinators. \ +From here you can try tackling esoteric data serialization formats, compiler front ends, domain specific languages—you name it!" +REPO_URL="https://github.com/lettier/parsing-with-haskell-parser-combinators" +AUTHOR="David Lettier" +CSS="style.css" + +$PANDOC \ + -f gfm \ + -t html5 \ + --highlight-style=breezedark \ + --template=$SCRIPT_PATH/_template.html5 \ + $SCRIPT_PATH/../README.md \ + --metadata pagetitle="$MAIN_TITLE" \ + --metadata author-meta="$AUTHOR" \ + --metadata description="$DESCRIPTION" \ + --metadata css=$CSS \ + -o "$SCRIPT_PATH/index.html" diff --git a/docs/_template.html5 b/docs/_template.html5 new file mode 100644 index 0000000..b551f12 --- /dev/null +++ b/docs/_template.html5 @@ -0,0 +1,79 @@ + + + + + + + + + + + + + +$for(author-meta)$ + +$endfor$ +$if(date-meta)$ + +$endif$ +$if(keywords)$ + +$endif$ + $if(title-prefix)$$title-prefix$ – $endif$$pagetitle$ + +$if(highlighting-css)$ + +$endif$ +$if(math)$ + $math$ +$endif$ + +$for(header-includes)$ + $header-includes$ +$endfor$ +$for(css)$ + +$endfor$ + + +$for(include-before)$ + $include-before$ +$endfor$ +$if(title)$ +
+

$title$

+$if(subtitle)$ +

$subtitle$

+$endif$ +$for(author)$ +

$author$

+$endfor$ +$if(date)$ +

$date$

+$endif$ +
+$endif$ +$if(toc)$ + +$endif$ +$body$ +$for(include-after)$ + $include-after$ +$endfor$ + + diff --git a/docs/index.html b/docs/index.html new file mode 100644 index 0000000..8003fb5 --- /dev/null +++ b/docs/index.html @@ -0,0 +1,1064 @@ + + + + + + + + + + + + + + + Parsing With Haskell Parser Combinators + + + + + + + +

+Parsing With Haskell Parser Combinators +
+ +

+
+ +

Parsing With Haskell Parser Combinators

+

Need to parse something? Never heard of a "parser combinator"? Looking to learn some Haskell? Awesome! Below is everything you'll need to get up and parsing with Haskell parser combinators. From here you can try tackling esoteric data serialization formats, compiler front ends, domain specific languages—you name it!

+ +

Building The Demos

+

Included with this guide are two demo programs.

+

version-number-parser parses a file for a version number. srt-file-parser parses a file for SRT subtitles. Feel free to try them out with the files found in test-input/.

+

Stack

+

Download the Haskell tool Stack and then run the following.

+
git clone https://github.com/lettier/parsing-with-haskell-parser-combinators
+cd parsing-with-haskell-parser-combinators
+stack build
+

Cabal

+

If using Cabal, you can run the following.

+
git clone https://github.com/lettier/parsing-with-haskell-parser-combinators
+cd parsing-with-haskell-parser-combinators
+cabal sandbox init
+cabal --require-sandbox build
+cabal --require-sandbox install
+

Running The Demos

+

After building the two demo programs, you can run them like so.

+

Stack

+

To try the version number parser, run the following.

+
cd parsing-with-haskell-parser-combinators
+stack exec -- version-number-parser
+What is the version output file path?
+test-input/gifcurry-version-output.txt
+

To try the SRT file parser, run the following.

+
cd parsing-with-haskell-parser-combinators
+stack exec -- srt-file-parser
+What is the SRT file path?
+test-input/subtitles.srt
+

Cabal

+

To try the version number parser, run the following.

+
cd parsing-with-haskell-parser-combinators
+.cabal-sandbox/bin/version-number-parser
+What is the version output file path?
+test-input/gifcurry-version-output.txt
+

To try the SRT file parser, run the following.

+
cd parsing-with-haskell-parser-combinators
+.cabal-sandbox/bin/srt-file-parser
+What is the SRT file path?
+test-input/subtitles.srt
+

Parser Combinator

+ +

+Parser Combinators +
+ +

+
+ +

One of the better ways to learn about the parsing strategy, parser combinator, is to look at an implementation of one.

+
+

+Parsers built using combinators are straightforward to construct, readable, modular, well-structured, and easily maintainable. +

+ +—Parser combinator - Wikipedia + +

+
+ +

ReadP

+

Let's take a look under the hood of ReadP, a parser combinator library found in base. Since it is in base, you should already have it.

+

💡 Note, you may want to try out Parsec after getting familiar with ReadP. It too is a parser combinator library that others prefer to ReadP. As an added bonus, it is included in GHC's boot libraries as of GHC version 8.4.1.

+

P Data Type

+
-- (c) The University of Glasgow 2002
+
+data P a
+  = Get (Char -> P a)
+  | Look (String -> P a)
+  | Fail
+  | Result a (P a)
+  | Final [(a,String)]
+  deriving Functor
+

We'll start with the P data type. The a in P a is up to you (the library user) and can be whatever you'd like. The compiler creates a functor instance automatically and there are hand-written instances for applicative, monad, MonadFail, and alternative.

+

💡 Note, for more on functors, applicatives, and monads, checkout Your easy guide to Monads, Applicatives, & Functors.

+

P is a sum type with five cases.

+
    +
  • Get consumes a single character from the input string and returns a new P.
  • +
  • Look accepts a duplicate of the input string and returns a new P.
  • +
  • Fail indicates the parser finished without a result.
  • +
  • Result holds a possible parsing and another P case.
  • +
  • Final is a list of two-tuples. The first tuple element is a possible parsing of the input and the second tuple element is the rest of the input string that wasn't consumed by Get.
  • +
+

Run

+
-- (c) The University of Glasgow 2002
+
+run :: P a -> ReadS a
+run (Get f)      (c:s) = run (f c) s
+run (Look f)     s     = run (f s) s
+run (Result x p) s     = (x,s) : run p s
+run (Final r)    _     = r
+run _            _     = []
+

run is the heart of the ReadP parser. It does all of the heavy lifting as it recursively runs through all of the parser states that we saw up above. You can see that it takes a P and returns a ReadS.

+
-- (c) The University of Glasgow 2002
+
+type ReadS a = String -> [(a,String)]
+

ReadS a is a type alias for String -> [(a,String)]. So whenever you see ReadS a, think String -> [(a,String)].

+
-- (c) The University of Glasgow 2002
+
+run :: P a -> String -> [(a,String)]
+run (Get f)      (c:s) = run (f c) s
+run (Look f)     s     = run (f s) s
+run (Result x p) s     = (x,s) : run p s
+run (Final r)    _     = r
+run _            _     = []
+

run pattern matches the different cases of P.

+
    +
  • If it's Get, it calls itself with a new P (returned by passing the function f, in Get f, the next character c in the input string) and the rest of the input string s.
  • +
  • If it's Look, it calls itself with a new P (returned by passing the function f, in Look f, the input string s) and the input string. Notice how Look doesn't consume any characters from the input string like Get does.
  • +
  • If it's Result, it assembles a two-tuple—containing the parsed result and what's left of the input string—and prepends this to the result of a recursive call that runs with another P case and the input string.
  • +
  • If it's Final, run returns a list of two-tuples containing parsed results and input string leftovers.
  • +
  • For anything else, run returns an empty list. For example, if the case is Fail, run will return an empty list.
  • +
+
> run (Get (\ a -> Get (\ b -> Result [a,b] Fail))) "12345"
+[("12","345")]
+

ReadP doesn't expose run but if it did, you could call it like this. The two Gets consume the '1' and '2', leaving the "345" behind.

+
> run (Get (\ a -> Get (\ b -> Result [a,b] Fail))) "12345"
+> run (Get (\ b -> Result ['1',b] Fail)) "2345"
+> run (Result ['1','2'] Fail) "345"
+> (['1', '2'], "345") : run (Fail) "345"
+> (['1', '2'], "345") : []
+[("12","345")]
+

Running through each recursive call, you can see how we arrived at the final result.

+
> run (Get (\ a -> Get (\ b -> Result [a,b] (Final [(['a','b'],"c")])))) "12345"
+[("12","345"),("ab","c")]
+

Using Final, you can include a parsed result in the final list of two-tuples.

+

readP_to_S

+
-- (c) The University of Glasgow 2002
+
+   readP_to_S :: ReadP a -> ReadS a
+-- readP_to_S :: ReadP a -> String -> [(a,String)]
+   readP_to_S (R f) = run (f return)
+

While ReadP doesn't expose run directly, it does expose it via readP_to_S. readP_to_S introduces a newtype called ReadP. readP_to_S accepts a ReadP a, a string, and returns a list of two-tuples.

+

ReadP Newtype

+ +

+ReadP Newtype +
+ +

+
+ +
-- (c) The University of Glasgow 2002
+
+newtype ReadP a = R (forall b . (a -> P b) -> P b)
+

Here's the definition of ReadP a. There are instances for functor, applicative, monad, MonadFail, alternative, and MonadPlus. The R constructor takes a function that takes another function and returns a P. The accepted function takes whatever you chose for a and returns a P.

+
-- (c) The University of Glasgow 2002
+
+readP_to_S (R f) = run (f return)
+

Recall that P is a monad and return's type is a -> m a. So f is the (a -> P b) -> Pb function and return is the (a -> P b) function. Ultimately, run gets the P b it expects.

+
-- (c) The University of Glasgow 2002
+
+readP_to_S (R f) inputString = run (f return) inputString
+--               ^^^^^^^^^^^                  ^^^^^^^^^^^
+

It's left off in the source code but remember that readP_to_S and run expects an input string.

+
-- (c) The University of Glasgow 2002
+
+instance Functor ReadP where
+  fmap h (R f) = R (\k -> f (k . h))
+

Here's the functor instance definition for ReadP.

+
> readP_to_S (fmap toLower get) "ABC"
+[('a',"BC")]
+
+> readP_to_S (toLower <$> get) "ABC"
+[('a',"BC")]
+

This allows us to do something like this. fmap functor maps toLower over the functor get which equals R Get. Recall that the type of Get is (Char -> P a) -> P a which the ReadP constructor (R) accepts.

+
-- (c) The University of Glasgow 2002
+
+fmap h       (R f  ) = R (\ k -> f   (k . h      ))
+fmap toLower (R Get) = R (\ k -> Get (k . toLower))
+

Here you see the functor definition rewritten for the fmap toLower get example.

+

Applicative P Instance

+

Looking up above, how did readP_to_S return [('a',"BC")] when we only used Get which doesn't terminate run? The answer lies in the applicative definition for P.

+
-- (c) The University of Glasgow 2002
+
+instance Applicative P where
+  pure x = Result x Fail
+  (<*>) = ap
+

return equals pure so we could rewrite readP_to_S (R f) = run (f return) to be readP_to_S (R f) = run (f pure). By using return or rather pure, readP_to_S sets Result x Fail as the final case run will encounter. If reached, run will terminate and we'll get our list of parsings.

+
> readP_to_S (fmap toLower get) "ABC"
+
+-- Use the functor instance to transform fmap toLower get.
+> readP_to_S (R (\ k -> Get (k . toLower))) "ABC"
+
+-- Call run which removes R.
+> run ((\ k -> Get (k . toLower)) pure) "ABC"
+
+-- Call function with pure to get rid of k.
+> run (Get (pure . toLower)) "ABC"
+
+-- Call run for Get case to get rid of Get.
+> run ((pure . toLower) 'A') "BC"
+
+-- Call toLower with 'A' to get rid of toLower.
+> run (pure 'a') "BC"
+
+-- Use the applicative instance to transform pure 'a'.
+> run (Result 'a' Fail) "BC"
+
+-- Call run for the Result case to get rid of Result.
+> ('a', "BC") : run (Fail) "BC"
+
+-- Call run for the Fail case to get rid of Fail.
+> ('a', "BC") : []
+
+-- Prepend.
+[('a',"BC")]
+

Here you see the flow from readP_to_S to the parsed result.

+

Alternative P Instance

+
-- (c) The University of Glasgow 2002
+
+instance Alternative P where
+  -- ...
+
+  -- most common case: two gets are combined
+  Get f1     <|> Get f2     = Get (\c -> f1 c <|> f2 c)
+
+  -- results are delivered as soon as possible
+  Result x p <|> q          = Result x (p <|> q)
+  p          <|> Result x q = Result x (p <|> q)
+
+  -- ...
+

The Alternative instance for P allows us to split the flow of the parser into a left and right path. This comes in handy when the input can go none, one, or (more rarely) two of two ways.

+
> readP_to_S ((get >>= \ a -> return a) <|> (get >> get >>= \ b -> return b)) "ABC"
+[('A',"BC"),('B',"C")]
+

The <|> operator or function introduces a fork in the parser's flow. The parser will travel through both the left and right paths. The end result will contain all of the possible parsings that went left and all of the possible parsings that went right. If both paths fail, then the whole parser fails.

+

💡 Note, in other parser combinator implementations, when using the <|> operator, the parser will go left or right but not both. If the left succeeds, the right is ignored. The right is only processed if the left side fails.

+
> readP_to_S ((get >>= \ a -> return [a]) <|> look <|> (get >> get >>= \a -> return [a])) "ABC"
+[("ABC","ABC"),("A","BC"),("B","C")]
+

You can chain the <|> operator for however many options or alternatives there are. The parser will return a possible parsing involving each.

+

ReadP Failure

+
-- (c) The University of Glasgow 2002
+
+instance Monad ReadP where
+  fail _    = R (\_ -> Fail)
+  R m >>= f = R (\k -> m (\a -> let R m' = f a in m' k))
+

Here is the ReadP monad instance. Notice the definition for fail.

+
> readP_to_S ((\ a b c -> [a,b,c]) <$> get <*> get <*> get) "ABC"
+[("ABC","")]
+
+> readP_to_S ((\ a b c -> [a,b,c]) <$> get <*> fail "" <*> get) "ABC"
+[]
+
+> readP_to_S (get >>= \ a -> get >>= \ b -> get >>= \ c -> return [a,b,c]) "ABC"
+[("ABC","")]
+
+> readP_to_S (get >>= \ a -> get >>= \ b -> fail "" >>= \ c -> return [a,b,c]) "ABC"
+[]
+

You can cause an entire parser path to abort by calling fail. Since ReadP doesn't provide a direct way to generate a Result or Final case, the return value will be an empty list. If the failed path is the only path, then the entire result will be an empty list. Recall that when run matches Fail, it returns an empty list.

+
-- (c) The University of Glasgow 2002
+
+instance Alternative P where
+  -- ...
+
+  -- fail disappears
+  Fail       <|> p          = p
+  p          <|> Fail       = p
+
+  -- ...
+

Going back to the alternative P instance, you can see how a failure on either side (but not both) will not fail the whole parser.

+
> readP_to_S (get >>= \ a -> get >>= \ b -> pfail >>= \ c -> return [a,b,c]) "ABC"
+[]
+

Instead of using fail, ReadP provides pfail which allows you to generate a Fail case directly.

+

Version Number

+ +

+Version Number +
+ +

+
+ +

Gifcurry, the Haskell-built video editor for GIF makers, shells out to various different programs. To ensure compatibility, it needs the version number for each of the programs it shells out to. One of those programs is ImageMagick.

+
Version: ImageMagick 6.9.10-14 Q16 x86_64 2018-10-24 https://imagemagick.org
+Copyright: © 1999-2018 ImageMagick Studio LLC
+License: https://imagemagick.org/script/license.php
+Features: Cipher DPC HDRI Modules OpenCL OpenMP
+

Here you see the output of convert --version. How could you parse this to capture the 6, 9, 10, and 14?

+

Looking at the output, we know the version number is a collection of numbers separated by either a period or a dash. This definition covers the dates as well so we'll make sure that the first two numbers are separated by a period. That way, if they put a date before the version number, we won't get the wrong result.

+ +

+Version Number Parser +
+ +

+
+ +
1. Consume zero or more characters that are not 0 through 9 and go to 2.
+2. Consume zero or more characters that are 0 through 9, save this number, and go to 3.
+3. Look at the rest of the input and go to 4.
+4. If the input
+    - is empty, go to 6.
+    - starts with a period, go to 1.
+    - starts with a dash
+        - and you have exactly one number, go to 5.
+        - and you have more than one number, go to 1.
+    - doesn't start with a period or dash
+        - and you have exactly one number, go to 5.
+        - you have more than one number, go to 6.
+5. Delete any saved numbers and go to 1.
+6. Return the numbers found.
+

Before we dive into the code, here's the algorithm we'll be following.

+

Building The Version Number Parser

+
parseVersionNumber
+  ::  [String]
+  ->  ReadP [String]
+parseVersionNumber
+  nums
+  = do
+  _         <- parseNotNumber
+  num       <- parseNumber
+  let nums' = nums ++ [num]
+  parseSeparator nums' parseVersionNumber
+

parseVersionNumber is the main parser combinator that parses an input string for a version number. It accepts a list of strings and returns a list of strings in the context of the ReadP data type. The accepted list of strings is not the input that gets parsed but rather the list of numbers found so far. For the first function call, the list is empty since it hasn't parsed anything yet.

+
parseVersionNumber
+  nums
+

Starting from the top, parseVersionNumber takes a list of strings which are the current list of numbers found so far.

+
  _         <- parseNotNumber
+

parseNotNumber consumes everything that isn't a number from the input string. Since we are not interested in the result, we discard it (_ <-).

+
  num       <- parseNumber
+  let nums' = nums ++ [num]
+

Next we consume everything that is a number and then add that to the list of numbers found so far.

+
  parseSeparator nums' parseVersionNumber
+

After parseVersionNumber has processed the next number, it passes the list of numbers found and itself to parseSeparator.

+

Parsing The Separator

+
parseSeparator
+  ::  [String]
+  ->  ([String] -> ReadP [String])
+  ->  ReadP [String]
+parseSeparator
+  nums
+  f
+  = do
+  next <- look
+  case next of
+    ""    -> return nums
+    (c:_) ->
+      case c of
+        '.' -> f nums
+        '-' -> if length nums == 1 then f [] else f nums
+        _   -> if length nums == 1 then f [] else return nums
+

Here you see parseSeparator.

+
  next <- look
+  case next of
+    ""    -> return nums
+    (c:_) ->
+

look allows us to get what's left of the input string without consuming it. If there's nothing left, it returns the numbers found. However, if there is something left, it analyzes the first character.

+
      case c of
+        '.' -> f nums
+        '-' -> if length nums == 1 then f [] else f nums
+        _   -> if length nums == 1 then f [] else return nums
+

If the next character is a period, call parseVersionNumber again with the current list of numbers found. If it's a dash and we have exactly one number, call parseVersionNumber with an empty list of numbers since it's a date. If it's a dash and we don't have exactly one number, call parseVersionNumber with the list of numbers found so far. Otherwise, call parseVersionNumber with an empty list if we have exactly one number or return the numbers found if we don't have exactly one number.

+

Parsing Non-numbers

+
parseNotNumber
+  ::  ReadP String
+parseNotNumber
+  =
+  munch (not . isNumber)
+

parseNotNumber uses munch which ReadP provides. munch is given the predicate (not . isNumber) which returns true for any character that isn't 0 through 9.

+
munch :: (Char -> Bool) -> ReadP String
+

munch continuously calls get if the next character in the input string satisfies the predicate. If it doesn't, munch returns the characters that did, if any. Since it only uses get, munch always succeeds.

+

💡 Note, parseNumber is similar to parseNotNumber. Instead of not . isNumber, the predicate is just isNumber.

+

Munch Versus Many

+
parseNotNumber'
+  ::  ReadP String
+parseNotNumber'
+  =
+  many (satisfy (not . isNumber))
+

Instead of using munch, you could write parseNotNumber like this, using many and satisfy—both of which ReadP provides. Looking at the type signature for many, it accepts a single parser combinator (ReadP a). In this instance, it's being given the parser combinator satisfy.

+
> readP_to_S (satisfy (not . isNumber)) "a"
+[('a',"")]
+
+> readP_to_S (satisfy (not . isNumber)) "1"
+[]
+

satisfy takes a predicate and uses get to consume the next character. If the accepted predicate returns true, satisfy returns the character. Otherwise, satisfy calls pfail and fails.

+
> readP_to_S (munch (not . isNumber)) "abc123"
+[("abc","123")]
+
+> readP_to_S (many (satisfy (not . isNumber))) "abc123"
+[("","abc123"),("a","bc123"),("ab","c123"),("abc","123")]
+

Using many can give you unwanted results. Ultimately, many introduces one or more Result cases. Because of this, many always succeeds.

+
> readP_to_S (many look) "abc123"
+-- Runs forever.
+

many will run your parser until it fails or runs out of input. If your parser never fails or never runs out of input, many will never return.

+
> readP_to_S (many (get >>= \ a -> return (read (a : "") :: Int))) "12345"
+[([],"12345"),([1],"2345"),([1,2],"345"),([1,2,3],"45"),([1,2,3,4],"5"),([1,2,3,4,5],"")]
+

For every index in the result, the parsed result will be the outcome of having ran the parser index times on the entire input.

+
> let parser        = get >>= \ a -> return (read (a : "") :: Int)
+> let many' results = return results <|> (parser >>= \ result -> many' (results ++ [result]))
+> readP_to_S (many' []) "12345"
+[([],"12345"),([1],"2345"),([1,2],"345"),([1,2,3],"45"),([1,2,3,4],"5"),([1,2,3,4,5],"")]
+

Here's an alternate definition for many. On the left side of <|>, it returns the current parser results. On the right side of <|>, it runs the parser, adds that result to the current parser results, and calls itself with the updated results. This has a cumulative sum type effect where index i is the parser result appended to the parser result at i - 1, i - 2, ..., and 1.

+

Running The Version Number Parser

+

Now that we built the parser, let's run it.

+
> let inputString =
+>     "Some Program (C) 1234-56-78 All rights reserved.\n\
+>     \Version: 12.345.6-7\n\
+>     \License: Some open source license."
+> readP_to_S (parseVersionNumber []) inputString
+[(["12","345","6","7"],"\nLicense: Some open source license.")]
+

You can see it extracted the version number correctly even with the date coming before it.

+

SRT

+ +

+SRT +
+ +

+
+ +

Now let's parse something more complicated—SRT files.

+

For the release of Gifcurry six, I needed to parse SRT (SubRip Text) files. SRT files contain subtitles that video processing programs use to display text on top of a video. Typically this text is the dialog of a movie translated into various different languages. By keeping the text separate from the video, there only needs to be one video which saves time, storage space, and bandwidth. The video software can swap out the text without having to swap out the video. Contrast this with burning-in or hard-coding the subtitles where the text becomes a part of the image data that makes up the video. In this case, you would need a video for each collection of subtitles.

+ +

+Gifcurry +
+Inner Video © Blender Foundation | www.sintel.org +

+
+ +

Gifcurry can take a SRT file and burn-in the subtitles for the video slice your select.

+
7
+00:02:09,400 --> 00:02:13,800
+What brings you to
+the land of the gatekeepers?
+
+8
+00:02:15,000 --> 00:02:17,500
+I'm searching for someone.
+
+9
+00:02:18,000 --> 00:02:22,200
+Someone very dear?
+A kindred spirit?
+

Here you see the English subtitles for Sintel (© Blender Foundation | www.sintel.org).

+

SRT Format

+
+

+SRT is perhaps the most basic of all subtitle formats. +

+ +—SRT Subtitle | Matrosk + +

+
+ +

The SRT file format consists of blocks, one for each subtitle, separated by an empty line.

+
2
+

At the top of the block is the index. This determines the order of the subtitles. Hopefully the subtitles are already in order and all of them have unique indexes but this may not be the case.

+
01:04:13,000 --> 02:01:01,640 X1:167 X2:267 Y1:33 Y2:63
+

After the index is the start time, end time, and an optional set of points specifying the rectangle the subtitle text should go in.

+
01:04:13,000
+

The timestamp format is hours:minutes:seconds,milliseconds.

+

💡 Note the comma instead of the period separating the seconds from the milliseconds.

+
This is the actual subtitle
+text. It can span multiple lines.
+It may include formating
+like <b>bold</b>, <i>italic</i>,
+<u>underline</u>,
+and <font color="#010101">font color</font>.
+

The third and last part of a block is the subtitle text. It can span multiple lines and ends when there is an empty line. The text can include formatting tags reminiscent of HTML.

+

Building The SRT Parser

+ +

+Parsing SRT +
+ +

+
+ +
parseSrt
+  ::  ReadP [SrtSubtitle]
+parseSrt
+  =
+  manyTill parseBlock (skipSpaces >> eof)
+

parseSrt is the main parser combinator that handles everything. It parses each block until it reaches the end of the file (eof) or input. To be on the safe side, there could be trailing whitespace between the last block and the end of the file. To handle this, it parses zero or more characters of whitespace (skipSpaces) before parsing the end of the file (skipSpaces >> eof). If there is still input left by the time eof is reached, eof will fail and this will return nothing. Therefore, it's important that parseBlock doesn't leave any thing but whitespace behind.

+

Building The SRT Block Parser

+
parseBlock
+  ::  ReadP SrtSubtitle
+parseBlock
+  = do
+  i      <- parseIndex
+  (s, e) <- parseTimestamps
+  c      <- parseCoordinates
+  t      <- parseTextLines
+  return
+    SrtSubtitle
+      { index       = i
+      , start       = s
+      , end         = e
+      , coordinates = c
+      , taggedText  = t
+      }
+

As we went over earlier, a block consists of an index, timestamps, possibly some coordinates, and some lines of text. In this version of parseBlock, you see the more imperative do notation style with the record syntax.

+
parseBlock'
+  ::  ReadP SrtSubtitle
+parseBlock'
+  =
+      SrtSubtitle
+  <$> parseIndex
+  <*> parseStartTimestamp
+  <*> parseEndTimestamp
+  <*> parseCoordinates
+  <*> parseTextLines
+

Here's another way you could write parseBlock. This is the applicative style. Just be sure to get the order right. For example, I could've accidentally mixed up the start and end timestamps.

+

Building The SRT Index Parser

+ +

+Parsing The Index +
+ +

+
+ +
parseIndex
+  ::  ReadP Int
+parseIndex
+  =
+      skipSpaces
+  >>  readInt <$> parseNumber
+

At the top of the block is the index. Here you see skipSpaces again. After skipping over whitespace, it parses the input for numbers and converts it to an actual integer.

+
readInt
+  ::  String
+  ->  Int
+readInt
+  =
+  read
+

readInt looks like this.

+
> read "123" :: Int
+123
+> read "1abc" :: Int
+*** Exception: Prelude.read: no parse
+

Normally using read directly can be dangerous. read may not be able to convert the input to the specified type. However, parseNumber will only return the 10 numerical digit characters (['0'..'9']) so using read directly becomes safe.

+

Building The SRT Timestamps Parser

+ +

+Parsing The Timestamps +
+ +

+
+ +

Parsing the timestamps are a little more involved than parsing the index.

+
parseTimestamps
+  ::  ReadP (Timestamp, Timestamp)
+parseTimestamps
+  = do
+  _   <- char '\n'
+  s   <- parseTimestamp
+  _   <- skipSpaces
+  _   <- string "-->"
+  _   <- skipSpaces
+  e   <- parseTimestamp
+  return (s, e)
+

This is the main combinator for parsing the timestamps.

+

char parses the character you give it or it fails. If it fails then parseTimestamps fails, ultimately causing parseSrt to fail so there must be a newline character after the index.

+

string is like char except instead of just one character, it parses the string of characters you give it or it fails.

+
parseStartTimestamp
+  ::  ReadP Timestamp
+parseStartTimestamp
+  =
+      char '\n'
+  >>  parseTimestamp
+

parseTimestamps parses both timestamps, but for the applicative style (parseSrt'), we need a parser just for the start timestamp.

+
parseEndTimestamp
+  ::  ReadP Timestamp
+parseEndTimestamp
+  =
+      skipSpaces
+  >>  string "-->"
+  >>  skipSpaces
+  >>  parseTimestamp
+

This parses everything between the timestamps and returns the end timestamp.

+
parseTimestamp
+  ::  ReadP Timestamp
+parseTimestamp
+  = do
+  h  <- parseNumber
+  _  <- char ':'
+  m  <- parseNumber
+  _  <- char ':'
+  s  <- parseNumber
+  _  <- char ',' <|> char '.'
+  m' <- parseNumber
+  return
+    Timestamp
+      { hours        = readInt h
+      , minutes      = readInt m
+      , seconds      = readInt s
+      , milliseconds = readInt m'
+      }
+

This parses the four numbers that make up the timestamp. The first three numbers are separated by a colon and the last one is separated by a comma. To be more forgiving, however, we allow the possibility of there being a period instead of a comma.

+
> readP_to_S (char '.' <|> char ',') "..."
+[('.',"..")]
+
+> readP_to_S (char '.' <|> char ',') ",.."
+[(',',"..")]
+

💡 Note, when using char with <|>, only one side can succeed (two char enter, one char leave) since char consumes a single character and two characters cannot occupy the same space.

+

Building The SRT Coordinates Parser

+ +

+Parsing The Coordinates +
+ +

+
+ +

The coordinates are an optional part of the block but if included, will be on the same line as the timestamps.

+
parseCoordinates
+  ::  ReadP (Maybe SrtSubtitleCoordinates)
+parseCoordinates
+  =
+  option Nothing $ do
+    _  <- skipSpaces1
+    x1 <- parseCoordinate 'x' 1
+    _  <- skipSpaces1
+    x2 <- parseCoordinate 'x' 2
+    _  <- skipSpaces1
+    y1 <- parseCoordinate 'y' 1
+    _  <- skipSpaces1
+    y2 <- parseCoordinate 'y' 2
+    return
+      $ Just
+        SrtSubtitleCoordinates
+          { x1 = readInt x1
+          , x2 = readInt x2
+          , y1 = readInt y1
+          , y2 = readInt y2
+          }
+

option takes two arguments. The first argument is returned if the second argument, a parser, fails. So if the coordinates parser fails, parseCoordinates will return Nothing. Put another way, the coordinates parser failing does not cause the whole parser to fail. This block will just have Nothing for its coordinates "field".

+
parseCoordinate
+  ::  Char
+  ->  Int
+  ->  ReadP String
+parseCoordinate
+  c
+  n
+  = do
+  _  <- char (Data.Char.toUpper c) <|> char (Data.Char.toLower c)
+  _  <- string $ show n ++ ":"
+  parseNumber
+

This parser allows the coordinate labels to be in either uppercase or lowercase. For example, x1:1 X2:2 Y1:3 y2:4 would succeed.

+

Building The SRT Text Parser

+ +

+Parsing The Text +
+ +

+
+ +

Parsing the text is the most involved portion due to the HTML-like tag formatting.

+

Tag parsing can be challenging—just ask anyone who parses them with a regular expression. To make this easier on us—and for the user—we'll use a tag soup kind of approach. The parser will allow unclosed and/or wrongly nested tags. It will also allow any tag and not just b, u, i, and font.

+
parseTextLines
+  ::  ReadP [TaggedText]
+parseTextLines
+  =
+      char '\n'
+  >>  (getTaggedText <$> manyTill parseAny parseEndOfTextLines)
+

We start out by matching on a newline character. After that, we functor map or fmap (<$>) getTaggedText over the subtitle text characters until we reach the end of the text lines.

+
parseEndOfTextLines
+  ::  ReadP ()
+parseEndOfTextLines
+  =
+  void (string "\n\n") <|> eof
+

We stop collecting characters (parseAny) when we reach two newline characters or the end of the file. This signals the end of the block.

+
getTaggedText
+  ::  String
+  ->  [TaggedText]
+getTaggedText
+  s
+  =
+  fst
+    $ foldl
+      folder
+      ([], [])
+      parsed
+  where
+

getTaggedText folds through the parsed text from left to right, returning the accumulated tagged text.

+
    parsed
+      ::  [String]
+    parsed
+      =
+      case readP_to_S (parseTaggedText []) s of
+        []      -> [s]
+        r@(_:_) -> (fst . last) r
+

parsed returns a list of one or more strings. It attempts to parse the input text for tags. If that fails, parsed returns the input string inside a list. Otherwise, if parseTaggedText succeeds, parse returns the last possible parsing ((fst . last) r).

+
    folder
+      ::  ([TaggedText], [Tag])
+      ->  String
+      ->  ([TaggedText], [Tag])
+    folder
+      (tt, t)
+      x
+      | isTag x   = (tt, updateTags t x)
+      | otherwise = (tt ++ [TaggedText { text = x, tags = t}], t)
+

As folder moves from left to right, over the parsed strings, it checks if the current string is a tag. If it is a tag, it updates the current set of active tags (t). Otherwise, it appends another tagged piece of text associated with the set of active tags.

+
updateTags
+  ::  [Tag]
+  ->  String
+  ->  [Tag]
+updateTags
+  tags
+  x
+  | isClosingTag x = remove compare' tags (makeTag x)
+  | isOpeningTag x = add    compare' tags (makeTag x)
+  | otherwise      = tags
+  where
+    compare'
+      ::  Tag
+      ->  Tag
+      ->  Bool
+    compare'
+      a
+      b
+      =
+      name a /= name b
+

updateTags updates the tags given by either removing or adding the given tag (x) depending on if it is a closing or opening tag. If it is neither, it just returns the passed set of tags. add will overwrite an existing tag if tags already has a tag by the same name. You can see this in the compare' function given.

+

To keep the parser simple, if an opening tag T is found, T gets added to the list of tags or overwrites an exiting T if already present. If a corresponding closing /T is found, then T is removed from the list of tags, if present. It doesn't matter if there is two or more Ts in a row, one or more Ts without a closing /T, and/or there's a closing /T without an opening T.

+
makeTag
+  ::  String
+  ->  Tag
+makeTag
+  s
+  =
+  Tag
+    { name       = getTagName       s
+    , attributes = getTagAttributes s
+    }
+

makeTag assembles a tag from the given string (s). Each Tag has a name and zero or more attributes.

+
parseTaggedText
+  ::  [String]
+  ->  ReadP [String]
+parseTaggedText
+  strings
+  = do
+  s <- look
+  case s of
+    "" -> return strings
+    _  -> do
+      r <- munch1 (/= '<') <++ parseClosingTag <++ parseOpeningTag
+      parseTaggedText $ strings ++ [r]
+

parseTaggedText returns the input string broken up into pieces. Each piece is either the text enclosed by tags, a closing tag, or an opening tag. After it splits off a piece, it adds it to the other pieces and calls itself again. If the remaining input string is empty, it returns the list of strings found.

+
> readP_to_S (string "ab" <++ string "abc") "abcd"
+[("ab","cd")]
+
+> readP_to_S (string "ab" +++ string "abc") "abcd"
+[("ab","cd"),("abc","d")]
+
+> readP_to_S (string "ab" <|> string "abc") "abcd"
+[("ab","cd"),("abc","d")]
+

The <++ operator is left biased meaning that if the left side succeeds, it won't even bother with the right. Recall that when we run the parser, we get a list of all the possible parsings. All of these possible parsings are the result of the parser having traveled through all of the possible paths. By using <++, we receive the possible parsings from the left path and from the right path if and only if the left side failed. If you'd like all of the possible parsings through the left and right side, you can use the +++ operator provided by ReadP. +++ is just <|> which we saw up above.

+
parseOpeningTag
+  ::  ReadP String
+parseOpeningTag
+  = do
+  _ <- char '<'
+  t <- munch1 (\ c -> c /= '/' && c /= '>')
+  _ <- char '>'
+  return $ "<" ++ t ++ ">"
+

An opening tag is an opening angle bracket, some text that doesn't include a forward slash, and the next immediate closing angle bracket.

+
parseClosingTag
+  ::  ReadP String
+parseClosingTag
+  = do
+  _ <- char '<'
+  _ <- char '/'
+  t <- munch1 (/= '>')
+  _ <- char '>'
+  return $ "</" ++ t ++ ">"
+

A closing tag is an opening angle bracket, a forward slash, some text, and the next immediate closing angle bracket.

+ +

+Parsing Tags +
+ +

+
+ +
getTagAttributes
+  ::  String
+  ->  [TagAttribute]
+getTagAttributes
+  s
+  =
+  if isOpeningTag s
+    then
+      case readP_to_S (parseTagAttributes []) s of
+        []    -> []
+        (x:_) -> fst x
+    else
+      []
+

Opening tags can have attributes. For example, <font color="#101010">. Each attribute is a two-tuple, key-value pair. In the above example, color would be the key and #101010 would be the value.

+
getTagName
+  ::  String
+  ->  String
+getTagName
+  s
+  =
+  case readP_to_S parseTagName s of
+    []    -> ""
+    (x:_) -> toLower' $ fst x
+

This returns the tag name in lowercase.

+
parseTagName
+  ::  ReadP String
+parseTagName
+  = do
+  _ <- char '<'
+  _ <- munch (== '/')
+  _ <- skipSpaces
+  n <- munch1 (\ c -> c /= ' ' && c /= '>')
+  _ <- munch  (/= '>')
+  _ <- char '>'
+  return n
+

The tag name is the first string of non-whitespace characters after the opening angle bracket, a possible forward slash, and some possible whitespace and before some more whitespace and/or the closing angle bracket.

+
parseTagAttributes
+  ::  [TagAttribute]
+  ->  ReadP [TagAttribute]
+parseTagAttributes
+  tagAttributes
+  = do
+  s <- look
+  case s of
+    "" -> return tagAttributes
+    _  -> do
+      let h = head s
+      case h of
+        '>' -> return tagAttributes
+        '<' -> trimTagname >> parseTagAttributes'
+        _   -> parseTagAttributes'
+  where
+    parseTagAttributes'
+      ::  ReadP [TagAttribute]
+    parseTagAttributes'
+      = do
+      tagAttribute <- parseTagAttribute
+      parseTagAttributes
+        ( add
+            (\ a b -> fst a /= fst b)
+            tagAttributes
+            tagAttribute
+        )
+

parseTagAttributes recursively goes through the input string, collecting up the key-value pairs. At the start of the tag (<), it first trims the tag name before tackling the attributes. It stops parsing for attributes when it reaches the closing angle bracket (>). If a tag happens to have duplicate attributes (based on the key), add will ensure only the latest one remains in the list.

+
trimTagname
+  :: ReadP ()
+trimTagname
+  =
+      char '<'
+  >> skipSpaces
+  >> munch1 (\ c -> c /= ' ' && c /= '>')
+  >> return ()
+

This trims or discards the tag name.

+
parseTagAttribute
+  ::  ReadP TagAttribute
+parseTagAttribute
+  = do
+  _ <- skipSpaces
+  k <- munch1 (/= '=')
+  _ <- string "=\""
+  v <- munch1 (/= '\"')
+  _ <- char '\"'
+  _ <- skipSpaces
+  return (toLower' k, v)
+

The attribute key is any string of non-whitespace characters before the equal sign. The attribute value is any characters after the equal sign and double quote and before the next immediate double quote.

+
isTag
+  ::  String
+  ->  Bool
+isTag
+  s
+  =
+  isOpeningTag s || isClosingTag s
+

A string is a tag if it is either an opening tag or a closing tag.

+
isOpeningTag
+  ::  String
+  ->  Bool
+isOpeningTag
+  s
+  =
+  isPresent $ readP_to_S parseOpeningTag s
+

A string is an opening tag if the opening tag parser succeeds.

+
isClosingTag
+  ::  String
+  ->  Bool
+isClosingTag
+  s
+  =
+  isPresent $ readP_to_S parseClosingTag s
+

A string is a closing tag if the closing tag parser succeeds.

+

Running The SRT Parser

+ +

+Parsed SRT Results +
+ +

+
+ +

Now that we've assembled the parser, let's try it out.

+
> let srt =
+>       " 1\n\
+>       \0:0:0,1 --> 0:1:0.2  x1:1 X2:3  y1:4 y2:10\n\
+>       \<font color=\"red\" color=\"blue\">This is some <b><u><i>\n \
+>       \subtitle \n\
+>       \</u>text.</b>  "
+> readP_to_S parseSrt srt
+[([ SrtSubtitle
+      { index = 1
+      , start = Timestamp {hours = 0, minutes = 0, seconds = 0, milliseconds = 1}
+      , end   = Timestamp {hours = 0, minutes = 1, seconds = 0, milliseconds = 2}
+      , coordinates = Just (SrtSubtitleCoordinates {x1 = 1, x2 = 3, y1 = 4, y2 = 10})
+      , taggedText =  [ TaggedText
+                        { text = "This is some "
+                        , tags = [ Tag {name = "font", attributes = [("color","blue")]}
+                                 ]
+                        }
+                      , TaggedText
+                          { text = "\n subtitle \n"
+                          , tags = [ Tag {name = "font", attributes = [("color","blue")]}
+                                   , Tag {name = "b",    attributes = []}
+                                   , Tag {name = "u",    attributes = []}
+                                   , Tag {name = "i",    attributes = []}
+                                   ]
+                          }
+                      , TaggedText
+                          { text = "text."
+                          , tags = [ Tag {name = "font", attributes = [("color","blue")]}
+                                   , Tag {name = "b",    attributes = []}
+                                   , Tag {name = "i",    attributes = []}
+                                   ]
+                          }
+                      , TaggedText
+                          { text = "  "
+                          , tags = [ Tag {name = "font", attributes = [("color","blue")]}
+                                   , Tag {name = "i",    attributes = []}
+                                   ]
+                          }
+                      ]
+      }
+  ]
+, ""
+)]
+

Here you see the result of parsing a test string. Notice the errors in the test string like the use of a period instead of a comma or the duplicate tag attribute.

+

Exercises

+
    +
  • Write a program that can convert an SRT file to a JSON file.
  • +
  • Rewrite the version number parser using Parsec instead of ReadP.
  • +
  • Rewrite the SRT parser using Parsec instead of ReadP.
  • +
+ +

(C) 2019 David Lettier
lettier.com

+ + diff --git a/docs/style.css b/docs/style.css new file mode 100644 index 0000000..5ea6ec7 --- /dev/null +++ b/docs/style.css @@ -0,0 +1,332 @@ +html { + font-size: 100%; + overflow-y: scroll; + -webkit-text-size-adjust: 100%; + -ms-text-size-adjust: 100%; +} + +body { + color: #444; + font-family: Helvetica, Arial, sans-serif; + font-size: 20px; + line-height: 2; + padding: 1em; + margin: auto; + max-width: 887px; + background: #fefefe; +} + +a { + color: #059; + text-decoration: none; +} + +a:visited { + color: #048; +} + +a:hover { + color: #06a; +} + +a:active { + color: #06a; +} + +a:focus { + outline: thin dotted; +} + +*::-moz-selection { + background: rgba(0, 200, 255, 0.3); + color: #111; +} + +*::selection { + background: rgba(0, 200, 255, 0.3); + color: #111; +} + +a::-moz-selection { + background: rgba(0, 200, 255, 0.3); + color: #048; +} + +a::selection { + background: rgba(0, 200, 255, 0.3); + color: #048; +} + +a > span.emoji { + font-size: 30px; + margin-left: 5px; +} + +p { + margin: 1em 0; +} + +img { + max-width: 100%; +} + +h1, h2, h3, h4, h5, h6 { + color: #111; + line-height: 125%; + margin-top: 1em; + font-weight: lighter; + font-family: 'Roboto Condensed', Helvetica, Arial, sans-serif; +} + +h4, h5, h6 { + font-weight: bold; +} + +h1 { + font-size: 2.5em; +} + +h2 { + font-size: 2em; +} + +h3 { + font-size: 1.5em; +} + +h4 { + font-size: 1.2em; +} + +h5 { + font-size: 1em; +} + +h6 { + font-size: 0.9em; +} + +blockquote { + color: #666666; + margin: 0; + padding-left: 3em; + border-left: 0.5em #EEE solid; +} + +hr { + display: block; + height: 2px; + border: 0; + border-top: 1px solid #aaa; + border-bottom: 1px solid #eee; + margin: 1em 0; + padding: 0; +} + +pre, code, kbd, samp { + font-family: monospace; + font-size: 14px; +} + +pre { + white-space: pre; + white-space: pre-wrap; + word-wrap: break-word; + padding: 15px; +} + +b, strong { + font-weight: bold; +} + +p > code { + font-weight: bold; +} + +dfn { + font-style: italic; +} + +ins { + background: #ff9; + color: #000; + text-decoration: none; +} + +mark { + background: #ff0; + color: #000; + font-style: italic; + font-weight: bold; +} + +sub, sup { + font-size: 75%; + line-height: 0; + position: relative; + vertical-align: baseline; +} + +sup { + top: -0.5em; +} + +sub { + bottom: -0.25em; +} + +ul, ol { + margin: 1em 0; + padding: 0 0 0 2em; +} + +li p:last-child { + margin-bottom: 0; +} + +ul ul, ol ol { + margin: .3em 0; +} + +dl { + margin-bottom: 1em; +} + +dt { + font-weight: bold; + margin-bottom: .8em; +} + +dd { + margin: 0 0 .8em 2em; +} + +dd:last-child { + margin-bottom: 0; +} + +img { + border: 0; + -ms-interpolation-mode: bicubic; + vertical-align: middle; +} + +figure { + display: block; + text-align: center; + margin: 1em 0; +} + +figure img { + border: none; + margin: 0 auto; +} + +figcaption { + font-size: 0.8em; + font-style: italic; + margin: 0 0 .8em; +} + +table { + margin-bottom: 2em; + border-bottom: 1px solid #ddd; + border-right: 1px solid #ddd; + border-spacing: 0; + border-collapse: collapse; +} + +table th { + padding: .2em 1em; + background-color: #eee; + border-top: 1px solid #ddd; + border-left: 1px solid #ddd; +} + +table td { + padding: .2em 1em; + border-top: 1px solid #ddd; + border-left: 1px solid #ddd; + vertical-align: top; +} + +kbd { + border: 1px solid #999; + padding: 5px; + border-radius: 2px; + background-color: #555; + color: #eee; + white-space: nowrap; +} + +.author { + font-size: 1.2em; + text-align: center; +} + +@media print { + * { + background: transparent !important; + color: black !important; + filter: none !important; + -ms-filter: none !important; + } + + body { + font-size: 12pt; + max-width: 100%; + } + + a, a:visited { + text-decoration: underline; + } + + hr { + height: 1px; + border: 0; + border-bottom: 1px solid black; + } + + a[href]:after { + content: " (" attr(href) ")"; + } + + abbr[title]:after { + content: " (" attr(title) ")"; + } + + .ir a:after, a[href^="javascript:"]:after, a[href^="#"]:after { + content: ""; + } + + pre, blockquote { + border: 1px solid #999; + padding-right: 1em; + page-break-inside: avoid; + } + + tr, img { + page-break-inside: avoid; + } + + img { + max-width: 100% !important; + } + + @page :left { + margin: 15mm 20mm 15mm 10mm; +} + + @page :right { + margin: 15mm 10mm 15mm 20mm; +} + + p, h2, h3 { + orphans: 3; + widows: 3; + } + + h2, h3 { + page-break-after: avoid; + } +} diff --git a/parsing-with-haskell-parser-combinators.cabal b/parsing-with-haskell-parser-combinators.cabal new file mode 100644 index 0000000..3acc4d5 --- /dev/null +++ b/parsing-with-haskell-parser-combinators.cabal @@ -0,0 +1,27 @@ +name: parsing-with-haskell-parser-combinators +version: 0.0.0.0 +homepage: https://github.com/lettier/parsing-with-haskell-parser-combinators +author: David Lettier +copyright: 2019 David Lettier +license: BSD3 +build-type: Simple +cabal-version: >= 1.10 +extra-source-files: README.md + +source-repository head + type: git + location: https://github.com/lettier/parsing-with-haskell-parser-combinators + +executable version-number-parser + main-is: src/version-number-parser.hs + build-depends: + base >=4.7 && <5 + ghc-options: -threaded -rtsopts -with-rtsopts=-N + default-language: Haskell2010 + +executable srt-file-parser + main-is: src/srt-file-parser.hs + build-depends: + base >=4.7 && <5 + ghc-options: -threaded -rtsopts -with-rtsopts=-N + default-language: Haskell2010 diff --git a/src/srt-file-parser.hs b/src/srt-file-parser.hs new file mode 100644 index 0000000..84ce186 --- /dev/null +++ b/src/srt-file-parser.hs @@ -0,0 +1,476 @@ +{- + SRT File Parser + (C) 2019 David Lettier + lettier.com +-} + +{-# LANGUAGE + NamedFieldPuns +#-} + +import Control.Applicative ((<|>)) +import Control.Monad +import Text.ParserCombinators.ReadP +import Data.Char +import Data.Maybe + +type TagAttribute = (String, String) + +data Tag = + Tag + { name :: String + , attributes :: [TagAttribute] + } + deriving (Show, Read) + +data TaggedText = + TaggedText + { text :: String + , tags :: [Tag] + } + deriving (Show, Read) + +data Timestamp = + Timestamp + { hours :: Int + , minutes :: Int + , seconds :: Int + , milliseconds :: Int + } + deriving (Show, Read) + +data SrtSubtitleCoordinates = + SrtSubtitleCoordinates + { x1 :: Int + , x2 :: Int + , y1 :: Int + , y2 :: Int + } + deriving (Show, Read) + +data SrtSubtitle = + SrtSubtitle + { index :: Int + , start :: Timestamp + , end :: Timestamp + , coordinates :: Maybe SrtSubtitleCoordinates + , taggedText :: [TaggedText] + } + deriving (Show, Read) + +main + :: IO () +main + = do + putStrLn "What is the SRT file path?" + filePath <- getLine + text <- readFile filePath + let result = + case readP_to_S parseSrt text of + [] -> [] + r@(_:_) -> fst $ last r + putStrLn "" + print result + +parseSrt + :: ReadP [SrtSubtitle] +parseSrt + = + manyTill parseBlock (skipSpaces >> eof) + +parseBlock + :: ReadP SrtSubtitle +parseBlock + = do + i <- parseIndex + (s, e) <- parseTimestamps + c <- parseCoordinates + t <- parseTextLines + return + SrtSubtitle + { index = i + , start = s + , end = e + , coordinates = c + , taggedText = t + } + +parseBlock' + :: ReadP SrtSubtitle +parseBlock' + = + SrtSubtitle + <$> parseIndex + <*> parseStartTimestamp + <*> parseEndTimestamp + <*> parseCoordinates + <*> parseTextLines + +parseIndex + :: ReadP Int +parseIndex + = + skipSpaces + >> readInt <$> parseNumber + +parseTimestamps + :: ReadP (Timestamp, Timestamp) +parseTimestamps + = do + _ <- char '\n' + s <- parseTimestamp + _ <- skipSpaces + _ <- string "-->" + _ <- skipSpaces + e <- parseTimestamp + return (s, e) + +parseStartTimestamp + :: ReadP Timestamp +parseStartTimestamp + = + char '\n' + >> parseTimestamp + +parseEndTimestamp + :: ReadP Timestamp +parseEndTimestamp + = + skipSpaces + >> string "-->" + >> skipSpaces + >> parseTimestamp + +parseTimestamp + :: ReadP Timestamp +parseTimestamp + = do + h <- parseNumber + _ <- char ':' + m <- parseNumber + _ <- char ':' + s <- parseNumber + _ <- char ',' <|> char '.' + m' <- parseNumber + return + Timestamp + { hours = readInt h + , minutes = readInt m + , seconds = readInt s + , milliseconds = readInt m' + } + +parseCoordinates + :: ReadP (Maybe SrtSubtitleCoordinates) +parseCoordinates + = + option Nothing $ do + _ <- skipSpaces1 + x1 <- parseCoordinate 'x' 1 + _ <- skipSpaces1 + x2 <- parseCoordinate 'x' 2 + _ <- skipSpaces1 + y1 <- parseCoordinate 'y' 1 + _ <- skipSpaces1 + y2 <- parseCoordinate 'y' 2 + return + $ Just + SrtSubtitleCoordinates + { x1 = readInt x1 + , x2 = readInt x2 + , y1 = readInt y1 + , y2 = readInt y2 + } + +parseCoordinate + :: Char + -> Int + -> ReadP String +parseCoordinate + c + n + = do + _ <- char (Data.Char.toUpper c) <|> char (Data.Char.toLower c) + _ <- string $ show n ++ ":" + parseNumber + +parseTextLines + :: ReadP [TaggedText] +parseTextLines + = + char '\n' + >> (getTaggedText <$> manyTill parseAny parseEndOfTextLines) + +getTaggedText + :: String + -> [TaggedText] +getTaggedText + s + = + fst + $ foldl + folder + ([], []) + parsed + where + parsed + :: [String] + parsed + = + case readP_to_S (parseTaggedText []) s of + [] -> [s] + r@(_:_) -> (fst . last) r + folder + :: ([TaggedText], [Tag]) + -> String + -> ([TaggedText], [Tag]) + folder + (tt, t) + x + | isTag x = (tt, updateTags t x) + | otherwise = (tt ++ [TaggedText { text = x, tags = t}], t) + +updateTags + :: [Tag] + -> String + -> [Tag] +updateTags + tags + x + | isClosingTag x = remove compare' tags (makeTag x) + | isOpeningTag x = add compare' tags (makeTag x) + | otherwise = tags + where + compare' + :: Tag + -> Tag + -> Bool + compare' + a + b + = + name a /= name b + +makeTag + :: String + -> Tag +makeTag + s + = + Tag + { name = getTagName s + , attributes = getTagAttributes s + } + +parseEndOfTextLines + :: ReadP () +parseEndOfTextLines + = + void (string "\n\n") <|> eof + +parseTaggedText + :: [String] + -> ReadP [String] +parseTaggedText + strings + = do + s <- look + case s of + "" -> return strings + _ -> do + r <- munch1 (/= '<') <++ parseClosingTag <++ parseOpeningTag + parseTaggedText $ strings ++ [r] + +parseOpeningTag + :: ReadP String +parseOpeningTag + = do + _ <- char '<' + t <- munch1 (\ c -> c /= '/' && c /= '>') + _ <- char '>' + return $ "<" ++ t ++ ">" + +parseClosingTag + :: ReadP String +parseClosingTag + = do + _ <- char '<' + _ <- char '/' + t <- munch1 (/= '>') + _ <- char '>' + return $ "" + +getTagAttributes + :: String + -> [TagAttribute] +getTagAttributes + s + = + if isOpeningTag s + then + case readP_to_S (parseTagAttributes []) s of + [] -> [] + (x:_) -> fst x + else + [] + +getTagName + :: String + -> String +getTagName + s + = + case readP_to_S parseTagName s of + [] -> "" + (x:_) -> toLower' $ fst x + +parseTagName + :: ReadP String +parseTagName + = do + _ <- char '<' + _ <- munch (== '/') + _ <- skipSpaces + n <- munch1 (\ c -> c /= ' ' && c /= '>') + _ <- munch (/= '>') + _ <- char '>' + return n + +parseTagAttributes + :: [TagAttribute] + -> ReadP [TagAttribute] +parseTagAttributes + tagAttributes + = do + s <- look + case s of + "" -> return tagAttributes + _ -> do + let h = head s + case h of + '>' -> return tagAttributes + '<' -> trimTagname >> parseTagAttributes' + _ -> parseTagAttributes' + where + parseTagAttributes' + :: ReadP [TagAttribute] + parseTagAttributes' + = do + tagAttribute <- parseTagAttribute + parseTagAttributes + ( add + (\ a b -> fst a /= fst b) + tagAttributes + tagAttribute + ) + +trimTagname + :: ReadP () +trimTagname + = + char '<' + >> skipSpaces + >> munch1 (\ c -> c /= ' ' && c /= '>') + >> return () + +parseTagAttribute + :: ReadP TagAttribute +parseTagAttribute + = do + _ <- skipSpaces + k <- munch1 (/= '=') + _ <- string "=\"" + v <- munch1 (/= '\"') + _ <- char '\"' + _ <- skipSpaces + return (toLower' k, v) + +parseAny + :: ReadP Char +parseAny + = + satisfy (const True) + +parseNumber + :: ReadP String +parseNumber + = + munch1 isNumber + +skipSpaces1 + :: ReadP () +skipSpaces1 + = + void $ skipMany1 (char ' ') + +isTag + :: String + -> Bool +isTag + s + = + isOpeningTag s || isClosingTag s + +isOpeningTag + :: String + -> Bool +isOpeningTag + s + = + isPresent $ readP_to_S parseOpeningTag s + +isClosingTag + :: String + -> Bool +isClosingTag + s + = + isPresent $ readP_to_S parseClosingTag s + +readInt + :: String + -> Int +readInt + = + read + +toLower' + :: String + -> String +toLower' + = + map toLower + +remove + :: (a -> a -> Bool) + -> [a] + -> a + -> [a] +remove + f + xs + x + = + filter + (f x) + xs + +add + :: (a -> a -> Bool) + -> [a] + -> a + -> [a] +add + f + xs + x + | isPresent xs = remove f xs x ++ [x] + | otherwise = [x] + +isPresent + :: Foldable t + => t a + -> Bool +isPresent + = + not . null diff --git a/src/version-number-parser.hs b/src/version-number-parser.hs new file mode 100644 index 0000000..50b88ec --- /dev/null +++ b/src/version-number-parser.hs @@ -0,0 +1,71 @@ +{- + Version Number Parser + (C) 2019 David Lettier + lettier.com +-} + +import Control.Monad +import Text.ParserCombinators.ReadP +import Data.Char +import Data.Maybe + +main + :: IO () +main + = do + putStrLn "What is the version output file path?" + filePath <- getLine + text <- readFile filePath + let result = + case readP_to_S (parseVersionNumber []) text of + [] -> [] + r@(_:_) -> map readInt $ fst $ last r + putStrLn "" + print result + +parseVersionNumber + :: [String] + -> ReadP [String] +parseVersionNumber + nums + = do + _ <- parseNotNumber + num <- parseNumber + let nums' = nums ++ [num] + parseSeparator nums' parseVersionNumber + +parseSeparator + :: [String] + -> ([String] -> ReadP [String]) + -> ReadP [String] +parseSeparator + nums + f + = do + next <- look + case next of + "" -> return nums + (c:_) -> + case c of + '.' -> f nums + '-' -> if length nums == 1 then f [] else f nums + _ -> if length nums == 1 then f [] else return nums + +parseNotNumber + :: ReadP String +parseNotNumber + = + munch (not . isNumber) + +parseNumber + :: ReadP String +parseNumber + = + munch1 isNumber + +readInt + :: String + -> Int +readInt + = + read diff --git a/stack.yaml b/stack.yaml new file mode 100644 index 0000000..cdcc25b --- /dev/null +++ b/stack.yaml @@ -0,0 +1,3 @@ +resolver: lts-13.27 +packages: +- . diff --git a/stack.yaml.lock b/stack.yaml.lock new file mode 100644 index 0000000..7fb2c31 --- /dev/null +++ b/stack.yaml.lock @@ -0,0 +1,12 @@ +# This file was autogenerated by Stack. +# You should not edit this file by hand. +# For more information, please see the documentation at: +# https://docs.haskellstack.org/en/stable/lock_files + +packages: [] +snapshots: +- completed: + size: 500539 + url: https://raw.githubusercontent.com/commercialhaskell/stackage-snapshots/master/lts/13/27.yaml + sha256: 690db832392afe55733b4c7023fd29b1b1c660ee42f1fb505b86b07394ca994e + original: lts-13.27 diff --git a/test-input/gifcurry-version-output.txt b/test-input/gifcurry-version-output.txt new file mode 100644 index 0000000..4e9fa91 --- /dev/null +++ b/test-input/gifcurry-version-output.txt @@ -0,0 +1,22 @@ + +         ▄▄▄▄▄▄▄▄                                                                              +     ▄▄████   ▀▀███▄                                                                          +      ████▀   ▄    ▀███           ▄    ▐██▌   ▄███▄                                           +  ▄   ▐███   ████   ▀███      ▄███▀▀██       ███                                              + ▐█▌   ██   ▐███     ████    ███        ▐██ █████▌ ▄█████ ▐██▌  ██▌  ██▄██▌ ██▄██▌ ██▌   ███  + ███   ▐▌   ███      ▐███▌   ███  ████▌ ▐██   ██▌  ███     ▐██▌  ██▌  ███▀   ███▀   ▐██  ███   + ████      ███▀  ▐█   ███▌   ███    ██▌ ▐██   ██▌  ███     ▐██▌  ██▌  ██▌    ██▌     ██▌▐██    + ▐███▄    ▐██▌   ██    ██     ███▄▄▄██▌ ▐██   ██▌   ███▄▄█ ███▄███▌  ██▌    ██▌      ████▌    +  ▀███   ▀███   ▐███   ▀        ▀▀▀▀▀    ▀▀   ▀▀      ▀▀▀    ▀▀▀   ▀▀     ▀▀        ███     +    ███▄   ▀    ████▌                                                                ███▀      +      ▀███▄▄   █████▀                                                                          +          ▀▀▀▀▀▀▀                                                                              + + +Gifcurry 6.0.0.0 +(C) 2016 David Lettier +https://lettier.com + +Wanna help out Gifcurry? Star it on GitHub! ☺ Thanks for helping out—you rock! +https://github.com/lettier/gifcurry/stargazers + diff --git a/test-input/imagemagick-version-output.txt b/test-input/imagemagick-version-output.txt new file mode 100644 index 0000000..91c242b --- /dev/null +++ b/test-input/imagemagick-version-output.txt @@ -0,0 +1,4 @@ +Version: ImageMagick 2018-10-24 6.9.10-14 Q16 x86_64 https://imagemagick.org +Copyright: © 1999-2018 ImageMagick Studio LLC +License: https://imagemagick.org/script/license.php +Features: Cipher DPC HDRI Modules OpenCL OpenMP diff --git a/test-input/subtitles.srt b/test-input/subtitles.srt new file mode 100644 index 0000000..008bc95 --- /dev/null +++ b/test-input/subtitles.srt @@ -0,0 +1,27 @@ + + + +1 +01:00:12,000 --> 01:00:15,000 +This is a subtitle. + + + + +2 +1:02:18,010 --> 01:05:10,010 +This is some subtitle +text that spans multiple lines. +It includes formatting +like bold, italic, +underline, < font color="#010101" color="#333" > +font > color
, and << even +nested tags over multiple + lines. + +3 +03:23:11,010 --> 03:55:17.110 X1:123 X2:223 Y1:50 Y2:101 +This subtitle specifies a text box +using X1, X2, Y1, and Y2. + +