Initial release of `purescript-language-cst-parser`

Hi, everyone! I’m super excited to announce an initial release of purescript-language-cst-parser. It’s a parser for the PureScript language (based off of current master) written in PureScript. It supports error recovery, and is fast enough (hopefully) for common tooling. I originally wrote the compiler’s CST parser so the community could develop syntactic tooling, but I think having to write and distribute Haskell to do it has been a huge barrier. No excuses anymore!

Writing a performant, stack-safe parser-combinator-based parser in PureScript is extremely difficult :laughing:. My initial attempt used purescript-parsing with a trampoline, and the slowest file in the package set parsed in 15s! :scream_cat: So I’ve been on a journey writing a faster parser combinator interpreter, and I’ve gotten that time down to about 1.1s. That’s still obviously not great in absolute terms for a single file, but the mean time for the 0.13 package set is about 9ms. Note this is all on my (several year old) machine so YMMV.

I’d like to thank @thomashoneyman, @colinwahl, @garyb, and @kritzcreek for helping me get to this release. There’s still work to do to improve error recovery. Right now it only works in layout contexts (top-level declarations, do/ado statements, let/where), and lexer errors do not currently recover at all. It’s also likely that things might change with the types and interface, so I’m keeping it at 0.x for now.


I followed the readme’s instructions to run npm run parse-package-set. I got this output. Looks like the error is in parsing these lines in purescript-quantities:

---- [Error 1 of 1. Failed in 93.995ms ] ----

  Unexpected operator - at line 450, column 16

  Unexpected operator - at line 451, column 16

  Unexpected operator - at line 452, column 16

  Unexpected operator - at line 453, column 16

  Unexpected operator - at line 454, column 16

Successfully parsed 1969 of 1970 modules.

---- [ Success Case Timing Information ] ----
Fastest Parse Times:
     0.115ms  /tmp/cst-integration-BsFrOP/.spago/metadata/v0.13.8/src/Metadata.purs
     0.202ms  /tmp/cst-integration-BsFrOP/.spago/sjcl/v0.0.1/src/Crypto/SJCL/MAC/Types.purs
     0.206ms  /tmp/cst-integration-BsFrOP/.spago/sjcl/v0.0.1/src/Crypto/SJCL/Cipher/Mode/Types.purs
     0.225ms  /tmp/cst-integration-BsFrOP/.spago/sjcl/v0.0.1/src/Crypto/SJCL/Cipher/Block/Types.purs
     0.228ms  /tmp/cst-integration-BsFrOP/.spago/undefined/v1.0.2/src/Undefined.purs
     0.238ms  /tmp/cst-integration-BsFrOP/.spago/payload/v0.3.0/src/Payload/Internal/Utils.purs
     0.242ms  /tmp/cst-integration-BsFrOP/.spago/payload/v0.3.0/src/Payload/Client/Internal/EncodeUri.purs
     0.248ms  /tmp/cst-integration-BsFrOP/.spago/validated-molecule/v1.0.5/src/BondOrder.purs
     0.252ms  /tmp/cst-integration-BsFrOP/.spago/milkis/v7.4.0/src/Milkis.Impl.purs
     0.253ms  /tmp/cst-integration-BsFrOP/.spago/bucketchain-simple-api/v3.0.0/src/Bucketchain/SimpleAPI/Auth.purs
     0.263ms  /tmp/cst-integration-BsFrOP/.spago/node-electron/v0.0.2/src/Node/Electron/Menu.purs
     0.273ms  /tmp/cst-integration-BsFrOP/.spago/react-basic-dnd/v7.0.0/src/React/Basic/ReactDND/Backends/HTML5Backend.purs
     0.275ms  /tmp/cst-integration-BsFrOP/.spago/react-basic-dnd/v7.0.0/src/React/Basic/ReactDND/Backends/TouchBackend.purs
     0.281ms  /tmp/cst-integration-BsFrOP/.spago/sjcl/v0.0.1/src/Crypto/SJCL/Hash/RIPEMD160.purs
     0.291ms  /tmp/cst-integration-BsFrOP/.spago/payload/v0.3.0/src/Payload/Server/Path.purs
     0.293ms  /tmp/cst-integration-BsFrOP/.spago/react-basic-dnd/v7.0.0/src/React/Basic/ReactDND/Backends/TestBackend.purs
     0.305ms  /tmp/cst-integration-BsFrOP/.spago/hyper/v0.11.0/src/Hyper/Header.purs
     0.306ms  /tmp/cst-integration-BsFrOP/.spago/literals/v0.1.1/src/Literals/Null.purs
     0.317ms  /tmp/cst-integration-BsFrOP/.spago/milkis/v7.4.0/src/Milkis.Impl.Node.purs
     0.318ms  /tmp/cst-integration-BsFrOP/.spago/prelude/v4.1.1/src/Data/Boolean.purs

Slowest Parse Times:
  1700.467ms  /tmp/cst-integration-BsFrOP/.spago/unicode/v4.0.1/src/Data/Char/Unicode/Internal.purs
  1323.569ms  /tmp/cst-integration-BsFrOP/.spago/react-basic-dom/v3.1.0/src/React/Basic/DOM/Generated.purs
   457.701ms  /tmp/cst-integration-BsFrOP/.spago/react-basic-native/v0.1.3/src/React/Basic/Native/Generated.purs
   434.924ms  /tmp/cst-integration-BsFrOP/.spago/react-basic-dom/v3.1.0/src/React/Basic/DOM/SVG.purs
   318.444ms  /tmp/cst-integration-BsFrOP/.spago/colors/v5.0.0/src/Color/Scale/Perceptual.purs
   235.720ms  /tmp/cst-integration-BsFrOP/.spago/arraybuffer-class/v0.2.5/src/Data/ArrayBuffer/Class.purs
   158.897ms  /tmp/cst-integration-BsFrOP/.spago/halogen-bootstrap/v8.0.0/src/Halogen/Themes/Bootstrap3.purs
   139.754ms  /tmp/cst-integration-BsFrOP/.spago/ace/v7.0.0/src/Ace/EditSession.purs
   127.005ms  /tmp/cst-integration-BsFrOP/.spago/halogen-bootstrap4/v0.1.4/src/Halogen/Themes/Bootstrap4.purs
   125.111ms  /tmp/cst-integration-BsFrOP/.spago/ordered-collections/v1.6.1/src/Data/Map/Internal.purs
   121.026ms  /tmp/cst-integration-BsFrOP/.spago/bucketchain-static/v0.3.0/src/Bucketchain/Static/ContentType.purs
   109.189ms  /tmp/cst-integration-BsFrOP/.spago/typelevel-rowlist-limits/v0.0.6/src/Type/RowList/Limits/LimitCount.purs
   107.542ms  /tmp/cst-integration-BsFrOP/.spago/react/v8.0.0/src/React/DOM/Props.purs
   106.779ms  /tmp/cst-integration-BsFrOP/.spago/halogen/v5.0.1/src/Halogen/HTML/Elements.purs
    98.056ms  /tmp/cst-integration-BsFrOP/.spago/canvas-action/v5.0.1/src/Graphics/CanvasAction.purs
    97.523ms  /tmp/cst-integration-BsFrOP/.spago/querydsl/v0.10.1/src/QueryDsl.purs
    93.647ms  /tmp/cst-integration-BsFrOP/.spago/ace/v7.0.0/src/Ace/Editor.purs
    92.337ms  /tmp/cst-integration-BsFrOP/.spago/concur-react/v0.4.2/src/Concur/React/Props.purs
    89.082ms  /tmp/cst-integration-BsFrOP/.spago/colors/v5.0.0/src/Color.purs
    89.075ms  /tmp/cst-integration-BsFrOP/.spago/react/v8.0.0/src/React/DOM.purs

Mean Parse: 13.225ms

Right, the parser is based off of master which has breaking changes for negative literal binders. It’s an expected failure.

Really great! :smiley:

If you didn’t use purescript-parsing in the end, what kind of parse combinators did you use and will you release those as a separate library?

1 Like

A roundabout answer, but an overall performance story:

I originally started with purescript-parsing for the main parser, and purescript-string-parsers for the lexer. Tearing out purescript-string-parsers was kind of a no-brainer. The lexer doesn’t need to be stack safe since it’s only consuming a single token and mostly uses regular expressions anyway. I also originally started with purescript-unicode, but it’s character class lookups are too slow and I’m not really sure how to make them reasonably faster. 25% of the parse time was in character lookups. We switched it out to use character groups from the native RegExp engine and that time basically disappeared. Improving the lexer significantly would probably take using a generator or a brave soul manually inlining all the parse states. The latter is possible, and not particularly difficult, just super tedious.

The main parser though needs to be stack safe, however. This means adding Trampoline to ParserT. If you are going to add Free for a trampoline, then you might as well just write a Free-based parser. I started with this, and it cut the parse time in half. I then removed Free and just inlined it into the Parser type, which cut it in half again. The amount of allocation is just insane. I also added primitives for many-like parsers, which allows me to collect results in a mutable array instead of using Array.cons. This was a respectable improvement.

One interesting thing is that I have proof (in a branch) that the PS grammar is context-free. The parser can be implemented with purely applicative combinators. I didn’t end up committing to it because performance suffered, likely due to a lot more indirection needed since essentially everything needed to go through defer.

I’m not sure how much more there is to possible optimize within the parser implementation itself. Writing an interpreter in the FFI would definitely help, but that immediately makes this library non-portable, and I’d like for it to be usable on other backends. Getting more performance would just require a different approach I think, again, using some kind of generator. One possibility is a grammar EDSL that one could write backends for, and then you could essentially write something that JITs to eval’ed JS code, or can run it within a pure PS interpreter.

That said, I’m open to releasing what I’ve written as a separate parser library, but I’m not sure how interested I am in maintaining it outside of the current needs of this project.