Two months later, I dug more deeply into parsing in general, and I think I gained some insight into this.
My main confusion here was due to there being two kinds of parsers: “ad hoc” and “formal”. For example, a PureScript compiler is a “formal parser”, because it cares about the validity of everything. But PureScript syntax highlighting and indentation in Emacs (or any editor) is “ad hoc” parsing, because it only cares about some constructions, such as keywords, balanced parentheses, etc. You can see that what an “ad hoc parser” checks is a subset of what a “formal parser” checks.
grep is trivially an ad hoc parser. But the “Parsing” library may be used to write a “formal” parser, and getting down to “ad hoc” requires explicitly ignoring unrelated text.
So the expected code matches all text up to the pattern (same as a .*?foo regexp):
myParser :: Parser String String
myParser = anyTill (string "foo") *> pure "foo"

main :: Effect Unit
main = do
  logShow $ runParser "foo ffoo foo" (many $ try $ myParser)
Now, I know I could do something like myParser <|> anyChar, and I was doing that. However, for a 7 kB file with just 2 matches this was returning a List with thousands of Nil values, which is slow, and the non-matches then have to be filtered out.
That was happening because I naively tried to implement an “ad hoc” parser in a library that provides a “formal” parser API. The myParser <|> anyChar here was an emulation of ad hoc matching: apply myParser at every char, advancing one char on failure. This of course worked awkwardly (returning the thousands of non-matches). Instead, it is necessary to parse the “uninteresting” portion of the text as well (the .*? above).
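A minimal sketch of what that per-character emulation amounts to, for illustration only. The names step and perChar are hypothetical, non-matches are modelled as Nothing, and it assumes a version of the Parsing library whose Parsing.Combinators exports a List-returning many:

```purescript
module Main where

import Prelude

import Control.Alt ((<|>))
import Data.List (List, catMaybes)
import Data.Maybe (Maybe(..))
import Effect (Effect)
import Effect.Console (logShow)
import Parsing (Parser, runParser)
import Parsing.Combinators (many, try)
import Parsing.String (anyChar, string)

-- Try the pattern at the current position; if it fails, consume one
-- character and record a non-match instead.
step :: Parser String (Maybe String)
step = (Just <$> try (string "foo")) <|> (Nothing <$ anyChar)

-- One list element per match *and* per skipped character.
perChar :: Parser String (List (Maybe String))
perChar = many step

main :: Effect Unit
main = logShow $ map catMaybes $ runParser "foo ffoo foo" perChar
```

Every skipped character contributes one Nothing to the list, which is why the result has to be filtered with catMaybes afterwards. The anyTill version avoids this by consuming the uninteresting text inside the parser itself.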
Additionally, today I stumbled upon an orthogonal challenge with a related solution: I have a parser that returns a HashTable with a single match, whereas my ultimate goal is a single HashTable with all matches. So, inconveniently, I had to concatenate the tables.
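The concatenation step can be sketched roughly like this. It is an assumption-laden illustration, not the actual code: Data.Map stands in for HashTable, the entry parser and its "foo"/"bar" patterns are hypothetical, and the stored value is just a match count:

```purescript
module Main where

import Prelude

import Control.Alt ((<|>))
import Data.Foldable (foldl)
import Data.Map (Map)
import Data.Map as Map
import Data.Tuple (Tuple(..))
import Effect (Effect)
import Effect.Console (logShow)
import Parsing (Parser, runParser)
import Parsing.Combinators (many, try)
import Parsing.String (anyTill, string)

-- Hypothetical parser: skip ahead to the next "foo" or "bar" and
-- return a one-entry table for that match.
entry :: Parser String (Map String Int)
entry = do
  Tuple _ key <- anyTill (string "foo" <|> string "bar")
  pure (Map.singleton key 1)

-- Run the parser repeatedly, then merge the one-entry tables,
-- adding the counts when the same key shows up again.
entries :: Parser String (Map String Int)
entries = foldl (Map.unionWith (+)) Map.empty <$> many (try entry)

main :: Effect Unit
main = logShow $ runParser "foo xbar foo" entries
```

This still builds many small tables and merges them after the fact; it does not accumulate into one table while parsing.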
I think building the single table directly deserves a separate question, given what I found above. But I think implementing it requires digging into the internals of the ParserT monad, perhaps using runParserT' or writing some function that explicitly manipulates the monad's internal state. I tried doing this, but didn't succeed offhand. Currently it is kind of too complicated for me, and I have things with a better time-spent/usefulness ratio to attend to (that is, for my experience level; I'm sure that for a Parsing maintainer reading this post it would be a no-brainer, so a solution is welcome if somebody has one).
On an unrelated note: I found that the PureScript core libs provide a Data.Map module that serves the same purpose as HashTable. I was using the non-core HashTable because Data.Map wasn't coming up when I searched for a PureScript hashmap. Mentioning this just in case someone ends up in the same situation.