Typelevel validation of HTML/XML output (Proof of concept typelevel regexes)

cheery · October 6, 2020, 11:23pm

I wrote something that can may turn out to be fairly interesting to everybody here. Lets start with an illustration in the theme of this post. These demos are written in Haskell, I’m sorry about that but I really hurry to show this out.

Update: There’s purescript-rady in my repositories. That makes it a bit easier to develop and try this library.

minimal =
  tagged "head"
      (tagged "title" text
       `Interleave` anything)
  `Group`
  tagged "body" anything

Here we got a schema written in our language. Validator.nu uses RELAX NG validators that are similar to this. If I’ve understood it right, they are not sufficient to really validate HTML5, but they do validate the structure of a document.

Here’s a twist of this work. The inferred type for the above schema is:

minimal :: Shape Node ((String, [Node]), [Node])

It’s a type with two parameters, (Node) and ((String, [Node]), [Node]). This is the type of the information you need in order to construct a document. Now let me show the API of this library:

parse :: Shape x y -> [x] -> Maybe y
generate :: Shape x a -> a -> [x]

To illustrate I show how it goes if I plug the shape we have into these functions:

parse minimal :: [Node] -> Maybe ((String, [Node]), [Node])
generate minimal :: ((String, [Node]), [Node]) -> [Node]

Next I’m going to show you what the generate does:

> generate minimal (("Hello html", []), [Node "p" [Text "This is text"]])
[Node "head" [Node "title" [Text "Hello html"]],
 Node "body" [Node "p" [Text "This is text"]]]

And if you feed the result back into the parse minimal, you get:

Just (("Hello html",[]),[Node "p" [Text "This is text"]])

If you give in an unambigous expression, it is well-behaved in a sense that the following equations are satisfied:

Just a = parse p <<< generate p
map (generate p) (parse p a) = Just a

In case that the expression is ambiguous, it prefers the leftmost branch in alternatives.

The parsing itself happens through Brzozowski derivatives.

gist.github.com

https://gist.github.com/cheery/8eacfed16811e4fa68c9a3256de18ea7

Rady.hs

{-# LANGUAGE GADTs #-}
module Rady where

import Data.Void

data Node = Node String [Node]
          | Text String
    deriving (Show)

tagged :: String -> Shape Node a -> Shape Node a

This file has been truncated. show original

Rady.purs

module Rady where

import Data.Exists

import Control.Applicative (pure)
import Control.Apply ((<*>))
import Control.Bind (composeKleisli, (>>=))
import Control.Category (identity)
import Control.Semigroupoid ((<<<))
import Data.Array (concat, uncons, (:))

This file has been truncated. show original

I don’t think anything of this is new but I’ve figured some use for it. For instance, Oleg Grenrus connected regular expressions to types some while ago. Other interesting links I found:

David Aspinall had a proposal of doing a RELAX NG schema validator in Haskell.
James Clark wrote such an algorithm, though he didn’t extend it to types.

Overall I started with something similar that was behind the previous link and then thought out how to bring the schema to the type level. Although it was supposed to verify the schema reads/produces correct results, I was surprised how much it also ended up verifying that the parser/generator -functions themselves are correct.

Next I’ll likely go and generate some RSS feeds with this code, so I’m not even touching the parsing side of it, but I wanted to see how it goes because I like structures that can be constructed/deconstructed.
There are plenty of other uses for this library beyond HTML/XML generation/parsing.

All right, so hopefully it’s clear what this library is doing to you. I’ll explain more if somebody asks for elaboration. I’d wish for some feedback on these ideas:

I wonder about a good name for this project. I call it “Rady” or “Rad”, a shortening of “Relax already!” as a joke to the name of RELAX & RELAX NG. It’s more general than those libraries though…
What kind of example projects you’d like to see about the use of this library?
I’d like that the program would be able to print out the schema aside the produced document. I could use some proposals on how to do that.
I’d like the parser be capable of explaining why the input doesn’t pass the “regex”. I wonder about how to structure this part of the library.
Maybe also it’d be fun to examine some fuzzy pattern matching. That might allow to use this library for filling up “templates” by skipping elements not mentioned in the pattern.
I wonder if it’s possible to check on type-level whether a pattern is ambiguous or not?

Why did I make this? I’m trying to get my blog scripts off Python into Haskell and I feel I needed something that verifies the HTML I produced from the system has sort of validation instead of just having an untyped dump containing whatever the program might just output into the file.