Make Char represent an opaque code point, instead of utf-16 code unit

drathier · October 5, 2021, 3:39pm

Hi everyone,

We’re primarily using the purerl backend for PureScript, and we’re having quite a lot of issues with string parsing, due to differing Unicode encodings between JavaScript and Erlang.

Erlang, which Purerl compiles to, uses utf-8 encoded strings. Thus if you want to do FFI, it makes sense to use utf-8 encoded strings in Purerl too. Javascript however uses utf-16 encoded strings, so most library code written for the JS backend to deal with the Char type will not port correctly to other backends.

You could argue that JS is correct here, since the Char type is defined as: “A single character (UTF-16 code unit). The JavaScript representation is a normal String, which is guaranteed to contain one code unit. This means that astral plane characters (i.e. those with code point values greater than 0xFFFF) cannot be represented as Char values.” But I’d very much like the Char type to represent an opaque code point, rather than a code unit. If it’s an opaque code point, backends can use whatever internal encoding they want: Go can use int32 runes, Erlang can use utf-8 strings, and JavaScript can use utf-16 strings (like it already does).
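To make the code unit vs. code point distinction concrete, here is a minimal TypeScript sketch of what the JS side sees for an astral plane character. The variable names are only for illustration; the calls are standard JavaScript string methods.

```typescript
// "a" is in the Basic Multilingual Plane; U+1F600 is an astral plane character.
const bmp = "a";
const astral = "\u{1F600}";

// .length counts UTF-16 code units, so the astral character counts as 2.
const bmpUnits = bmp.length;       // 1
const astralUnits = astral.length; // 2

// Array.from iterates a string by code point, so it sees 1 element.
const astralPoints = Array.from(astral).length;

// Indexing by code unit yields one half of a surrogate pair. That half is
// all a code-unit-based Char can hold.
const firstUnit = astral.charCodeAt(0);  // 0xD83D (high surrogate)
const secondUnit = astral.charCodeAt(1); // 0xDE00 (low surrogate)

// codePointAt combines the pair back into the full code point.
const wholePoint = astral.codePointAt(0); // 0x1F600
```

So a single astral character is two values under the current Char definition, but would be one value if Char were a code point.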

Right now the C, Erlang/Elixir, Scheme and seemingly the C++/Go backends all use utf-8 encoded strings. Even though that’s not according to spec, I think it’s the right thing to do and that we should change the spec.

Counterarguments

One could argue that there are lots of subtle places where this will break existing code, and that’s a valid point. But in the current state, there’s also a lot of code that is already subtly broken because of this, so I think it’s better to change this sooner rather than later.

There’s also a valid argument in terms of performance, if you’re writing performance critical code. Slicing or indexing into a string wouldn’t be as efficient anymore. I’m fairly confident that this is not an issue for most code, but I’m sure this will affect someone.
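As a rough illustration of where the cost comes from: on a utf-16 string, finding the n-th code point means walking from the start, since each code point takes one or two code units. The helper below, `codePointAtIndex`, is a hypothetical name made up for this sketch, not an existing library function.

```typescript
// Hypothetical helper: returns the code point at the given code point index,
// or undefined if the index is out of range. It scans from the start of the
// string, so it is O(n), unlike charCodeAt, which is O(1) per code unit.
function codePointAtIndex(s: string, index: number): number | undefined {
  let unit = 0; // current position, measured in UTF-16 code units
  for (let seen = 0; unit < s.length; seen++) {
    const cp = s.codePointAt(unit);
    if (cp === undefined) return undefined;
    if (seen === index) return cp;
    // Astral code points occupy two code units, everything else one.
    unit += cp > 0xFFFF ? 2 : 1;
  }
  return undefined;
}

const sample = "a\u{1F600}b";
const byCodePoint = codePointAtIndex(sample, 2); // 0x62, the "b"
const byCodeUnit = sample.charCodeAt(2);         // 0xDE00, a low surrogate
```

The same index gives different answers depending on whether you count code units or code points, and only the code unit version is constant time.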

Unpaired surrogates also wouldn’t be representable anymore in Char with the Javascript backend. I don’t think this is an issue for most, but I’d also be surprised if it didn’t affect anyone.
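For reference, this is what an unpaired surrogate looks like on the JS side. The sketch assumes the new Char would only allow Unicode scalar values (code points outside the surrogate range 0xD800 to 0xDFFF), which is my reading of the proposal rather than something the spec says today; `isScalarValue` is a hypothetical helper name.

```typescript
// A high surrogate with no low surrogate after it. JavaScript accepts this
// as a perfectly valid one-code-unit string.
const lone = "\uD83D";
const loneLength = lone.length;        // 1
const loneValue = lone.codePointAt(0); // 0xD83D, the bare surrogate

// Hypothetical check for what a scalar-value-based Char could hold.
function isScalarValue(cp: number): boolean {
  const inRange = Number.isInteger(cp) && cp >= 0 && cp <= 0x10FFFF;
  const isSurrogate = cp >= 0xD800 && cp <= 0xDFFF;
  return inRange && !isSurrogate;
}

const loneFits = isScalarValue(0xD83D);    // false: representable today, not after
const astralFits = isScalarValue(0x1F600); // true: not representable today, fine after
```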

Original suggestion

This was afaict originally suggested by Nate here: “Consider changing `Char` to represent a code point rather than a UTF-16 code unit” (Issue #3662, purescript/purescript on GitHub), so do go there and read the feedback people have posted over the last 3 years.

Best,
Filip