[SOLVED] Parsing: how to correctly advance on fail?

Hi-Angel · October 15, 2024, 12:47pm

Related to “how to get all available matches”. I want the usual behavior you’d expect from regexp engines: for every character, try to match, on fail advance by one. E.g. if you do echo foo ffoo foo | grep foo you’ll get all 3 matches, because when grep reaches ffoo and fails to match ff with fo, it simply advances by a single character and tries again.

Now, I know I could do something like myParser <|> anyChar and I was doing that. However, for a 7kb file with just 2 matches this was returning a List with thousands of Nils, which is both slow and then have to be filtered out.

So my question is: how can I achieve the behavior of simply advancing the parser internal index by one on fail without returning anything related to the failure?

Code for the grep example:

module Main where

import Prelude

import Effect (Effect)
import Effect.Console (logShow)
import Parsing (runParser, Parser)
import Parsing.Combinators (many, try)
import Parsing.String (string)

myParser :: Parser String String
myParser = string "foo"

main :: Effect Unit
main = do
  logShow $ runParser "foo ffoo foo" (many $ try $ myParser)

I want this to return 3 matches and perhaps a single Nil for the end, but definitely not a dozen.

natefaubion · October 16, 2024, 12:00am

I think that’s provided in the Parser.String.Replace module:
https://pursuit.purescript.org/packages/purescript-parsing/10.2.0/docs/Parsing.String.Replace#v:splitCap

Hi-Angel · October 16, 2024, 9:25pm

Thanks, that might work, though neither performant due to the Left return values, nor convenient for the same reason.

Additionally, today I stumbled upon orthogonal challenge but related solution: I have a parser that returns HashTable with a single match, whereas my ultimate goal is having a single HashTable with all matches. So, inconveniently, I had to concatenate the tables. If a solution exists, it would be similar to the discussed problem, because I’d have to modify results of many after every match, whether it be dropping a result entirely (e.g. if I matched anyChar branch in a myParser <|> anyChar-like expression), or inserting to the table.

After sitting for a while I came up with a solution:

module Main where

import Prelude

import Data.Maybe (Maybe(..))
import Effect (Effect)
import Effect.Console (logShow)
import Parsing (Parser, runParser)
import Parsing.Combinators (optionMaybe)
import Parsing.String (anyCodePoint, string)

-- Search for all matches in the text
manyViaRetries :: ∀ r. Parser String r -> Parser String (Array r)
manyViaRetries p = manyViaRetries' p []
  where
    manyViaRetries' _ acc = optionMaybe p >>= case _ of
      Just v  -> manyViaRetries' p $ acc <> [v]
      Nothing -> optionMaybe anyCodePoint >>= case _ of
        Just _  -> manyViaRetries' p acc -- try again at next position
        Nothing -> pure acc              -- end of string

myParser :: Parser String String
myParser = string "foo"

main :: Effect Unit
main = logShow $ runParser "foo ffoo foo" $ manyViaRetries myParser

Hi-Angel · October 17, 2024, 7:59am

Well, after applying this code to the real parser it turned out it fails randomly, and upon reducing to the minimal testcase it turns out this is due to Parsing module making distinction between “failing without consuming” and “failing with consuming”. Yeah, different kinds of “fails”.

“What that means?” you’re asking? That means the above works for non-matching string "foobar" but will fail for non-matching string "foo" *> string "bar". Here, optionMaybe is documented as:

[returns] Nothing in the case where a parser fails without consuming input.

So a simple string "foobar" fails without consuming, but anything more complicated will consume input before failing, so parser failure will propagate past optionMaybe via the monad.

I’ve no idea what to do. I’m so tired of this Parsing module, you can’t even imagine… I’m removing the Resolved mark till anybody comes up with an idea, but for now for my MVP I guess I will have a half-backed parser with carefully crafted inputs, and if I’d go past MVP and there’s still no solution, I’d probably have to look at either writing an ad-hoc parser or removing PureScript from backend completely, whichever is faster…

jamesbrock · October 17, 2024, 1:31pm

Yeah that’s exactly how it works.

It is confusing. This is how most traditional monadic parser combinator libraries in Haskell work, including Parsec, Megaparsec, Attoparsec. Other parser combinator libraries have made different choices about this.

There is reasoning behind it though. There is more discussion in the Alt instance section of the docs for the type ParserT.

rhendric · October 18, 2024, 6:50am

Have you tried anyTill?

myParser :: Parser String String
myParser = map snd <<< anyTill $ string "foo"

Hi-Angel · October 18, 2024, 8:25am

Thank you, an interesting solution, similar to splitCap

Hi-Angel · December 4, 2024, 12:24pm

2 months later, I dug more deeply into parsing in general, and I think I got some insights on this.

My main confusion here due to there being two kinds of parsers: “ad hoc” and “formal”. For example, a PureScript compiler is a “formal parser”, because it cares about validity of everything. But PureScript syntax highlight and indentation in Emacs (or any editor) is an “ad hoc” parsing, because it only cares about some constructions, such as keywords, parentheses being balanced, etc. You can see that “ad hoc parser” is a subset of “formal parser”.

grep here is an ad hoc parser, trivially. But “Parsing” library may be used to write a “formal” parser, and getting down to “ad hoc” requires explicitly ignoring unrelated text.

So expected code is matching all text till the pattern (same as a .*?foo regexp):

myParser :: Parser String String
myParser = anyTill (string "foo") *> pure "foo"

main :: Effect Unit
main = do
  logShow $ runParser "foo ffoo foo" (many $ try $ myParser)

Now, I know I could do something like myParser <|> anyChar and I was doing that. However, for a 7kb file with just 2 matches this was returning a List with thousands of Nils, which is both slow and then have to be filtered out.

That was happening because I tried naively implementing an “ad hoc” parser in a library that provides “formal” parser API. The myParser <|> anyChar here was an emulation of an ad hoc matching by applying myParser to every char, advancing one char on fail. This of course worked awkwardly (returning the thousands of non-matches). Instead it’s necessary to parse the “uninteresting” portion of text as well (the .*? above).

Additionally, today I stumbled upon orthogonal challenge but related solution: I have a parser that returns HashTable with a single match, whereas my ultimate goal is having a single HashTable with all matches. So, inconveniently, I had to concatenate the tables.

I think, this deserves a separate question, given what I found above. But I think, implementing this requires digging into the internals of ParserT monad, and perhaps using runParserT' or writing some function which would explicitly manipulate the monad internal state. I tried doing this, but didn’t succeed offhand. Currently it is kinda too complicated for me and I just have things with a more pressing time-spent/usefulness ratio (that is, for my experience level — I’m sure if a Parsing maintainer will read this post, it will be a no-brainer for them, so yeah, solution to this is welcome if somebody has one ).

On unrelated note: I found that PureScript core libs provide a Data.HashMap module that does same thing as HashTable. I was using non-core HashTable because I wasn’t getting Data.Map during the search for PureScript hashmap. Mentioning just in case someone gets into the same situation.

Hi-Angel · December 4, 2024, 2:57pm

I didn’t get it back then, but I get it now. Seeing this as necessary probably requires some experience. The parser I wrote 2 months ago has a problem that sometimes I would make incorrect config, but then the parser would parse it without errors and with the “incorrect part” (and a bunch of other stuff due to that) missing in the result. I was wondering how to handle that correctly, but I had far too many other problems (with the parser in particular), and that one was the least important.

Now that I am revisiting the parsing situation, I figured that disregarding if I continue using “Parsing” or would have to write one myself, I’d have to dig into the existing “state of art”. I know Parsec is kind of established standard in the industry, because it is used by pandoc, and AFAIK pandoc is one of the best (if not the best) format converters. So I thought: seeing the success of pandoc, Parsec can’t be that bad, right? So I cursory read and experimented with half of this tutorial, and then also went through Design Patterns for Parser Combinators (Functional Pearl) paper (I recommend mostly paying attention to chapter 3 in this paper though).

And now I get it: sometimes I need to stop parsing, but calling fail explicitly is too much work. Imagine having to do that on every line inside a 10-lines long parser — that would be 20 lines, half of which just checking if we want to fail. So instead it is implemented implicitly, by making the distinction between fail with and without position consumed. So the beginning of a complex parser would be code that may fail without consuming input, by using try if necessary. And the rest of it would consume it, so that fail would actually mean fail.

Hi-Angel · February 25, 2025, 1:53pm

On that one, found a solution too, posting to have it recorded in the annals of PureScript tutoriality. Comes down to tail recursion:

foldParse ::  ∀ a. (a -> TextParser a) -> a -> TextParser a
foldParse f init = do
  acc :: Maybe a <- optionMaybe $ f init
  case acc of
    Nothing   -> pure init
    Just acc' -> foldParse f acc'

It’s your typical foldl, except we’re folding the underlying monad. Internally it uses optionMaybe to check/ignore if parser couldn’t consume any input. So no “Unexpected EOF” error, for example. A failure with having input consumed is propagated as usual.

A small usage example in terms of the thread-starter case (i.e. “grep implementation” via Parsing), where I use foldParse to concatenate each subsequent match to a string. So running over foo ffoo foo returns Found: foo, foo, foo, (didn’t handle the trailing comma for simplicity):

Code (click to unroll)

module Main where

import Prelude

import Data.Maybe (Maybe(..))
import Data.String.CodeUnits as Str
import Effect (Effect)
import Effect.Console (logShow)
import Parsing (Parser, runParser)
import Parsing.Combinators (optionMaybe, try, (<|>))
import Parsing.String (anyChar, string)

type TextParser = Parser String

foldParse ::  ∀ a. (a -> TextParser a) -> a -> TextParser a
foldParse f init = do
  acc :: Maybe a <- optionMaybe $ f init
  case acc of
    Nothing   -> pure init
    Just acc' -> foldParse f acc'

anyCharS :: TextParser String
anyCharS = anyChar >>= pure <<< Str.singleton

skipChar :: ∀ a. a -> TextParser a
skipChar marker = anyCharS *> pure marker

myParser :: TextParser String
myParser = string "foo"

main :: Effect Unit
main = logShow $ runParser "foo ffoo foo" $ foldParse concatMyParseToString "Found: "
  where
    concatMyParseToString acc = do
      ret <- try myParser <|> skipChar "skipped"
      pure if ret == "skipped" then acc
           else acc <> ret <> ", "