Blogged: thoughts on PureScript package management

I’ve just published a post with some thoughts on package management, and I thought some of you might be interested in reading it: https://harry.garrood.me/blog/purescript-package-management-in-2019/

5 Likes

Regarding the one thing in package management that you found most annoying (which was using git tags in your case): I find it odd that package authors need to manually publish to Pursuit, and I would be confused if the same were expected for a package repository. I imagine it should be possible to add a project, as a whole, to a tracking system; the tracking system would occasionally poll the project for changes, and if there are any it would calculate a new version of the project, publish the updated docs, and publish to a package repository. Anyways, I just didn’t want to stay silent about this thought I’ve long held on the topic.

2 Likes

Yes, the situation with publishing to Pursuit being separate from publishing a release is slightly less than ideal. It’s mostly because Pursuit isn’t a package registry, in that it’s not where people download packages’ source code from (I very much do not want to be responsible for a package registry).

It’s worth noting that if you use pulp version and pulp publish for making releases, there’s no manual “upload to Pursuit” step involved; it’s taken care of for you by pulp.

We have previously talked about the kind of system you’re imagining, see https://github.com/purescript/pursuit/issues/96. We didn’t implement it because I thought that pulp version and pulp publish would be sufficient to address the need (that publishing to Pursuit is extra work which means that people often forget it). But perhaps we can reconsider this.

1 Like

Interesting! I wasn’t aware of this history! Thanks for linking it!

1 Like

@hdgarrood thanks for the great post and for the kind words! :blush:

Over the past months I’ve reflected on some of the things you touched on, so I’ll do a bit of a brain dump on them here (I’ll also write down some more context to help readers not involved with the details of this).

Version bounds

Premise: humans are not good at versioning, and bounds are accurate only if they are machine-checked (corollary: bounds that have not been machine-checked are fictional)

In JS and other dynamic languages you have to run the thing in order to machine-check it, but having a nice type system and some care from humans (i.e. “breaking the types” on behaviour change) means that we can run a huge CI that compiles all the various combinations of packages together to see if the human-inputted bounds make sense.
This is what the Hackage Matrix CI does (here’s a random package from it), though even there they don’t compile with the whole set of versions in the bounds (i.e. one working build plan per compiler version is enough).
This is basically what we do in package-sets too, except in our case it’s enough that there’s a build plan with the latest versions in the set.
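
To illustrate the kind of mechanical check involved, here is a toy sketch (all package names, versions and bounds below are made up, and this is not what the package-sets CI actually runs - that just compiles the whole set together): given a pinned set of versions and the bounds each package declares on its dependencies, a script can flag any bound that excludes a pinned version, which is roughly the bounds-vs-snapshot check that Stackage performs.

```python
# Toy sketch: report declared bounds that the pinned versions in a
# package set do not satisfy (lower bound inclusive, upper exclusive).
package_set = {
    "prelude": (4, 1, 1),
    "arrays": (5, 3, 0),
    "lists": (5, 4, 1),
}

declared_bounds = {
    "arrays": {"prelude": ((4, 0, 0), (5, 0, 0))},
    "lists": {"prelude": ((4, 0, 0), (5, 0, 0)),
              "arrays": ((4, 0, 0), (5, 0, 0))},  # too strict for arrays 5.3.0
}

def check(package_set, declared_bounds):
    for pkg, deps in declared_bounds.items():
        for dep, (lower, upper) in deps.items():
            pinned = package_set[dep]
            if not (lower <= pinned < upper):
                print(f"{pkg}: bound on {dep} excludes pinned version {pinned}")

check(package_set, declared_bounds)
# -> lists: bound on arrays excludes pinned version (5, 3, 0)
```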

In the end this means that I think the “package set” approach is equivalent to the “version bounds” one, except there’s combinatorially more CI involved in the latter (the assumption here is that humans are not supposed to hand-pick the versions of packages to use: either there’s a single version available, or a solver picks one for you).

A PureScript package registry

I also agree that we should have a proper package registry and the current situation is less than ideal.

As @chexxor mentioned, the fact that the Pursuit DB is disjoint from the Bower registry is sometimes uncomfortable, so I wrote down a proposal to avoid manual publishing of docs and instead sync packages and releases from Bower (which is the “single source of truth” at the moment).

While Entropic sounds interesting, it currently looks like a self-hosted npm, which means we’d still need to upload docs somewhere, and that somewhere would still need to be kept in sync.
I also think we shouldn’t try too hard to get packages on npm (or the next npm), as that would be convenient for the JS backend but not for the others (they’d still have to use the whatever-VM package manager to get packages for that VM), so this means having a somehow-integrated PureScript-only registry.

Features I’d like for it:

  • package uploads
  • integrity checks for packages (i.e. hashing package contents)
  • docs for packages
  • storage for package sets
  • should be distributed

This looks like a distributed Hackage+Stackage, and there are no existing registries that do this out of the box (though if you squint you can see that it is basically a Nix cache with some sugar on top, more on this below)
It looks like this idea has been considered in the making of Stack 2 too: in the Stack 2 announcement post you can read about a “Pantry server” which would be a distributed content-addressable storage server (it probably makes more sense if you also read all the Pantry articles linked from the above post. For context, “Pantry” is the part of Stack that fetches stuff)

In the end I think implementing something would not be a big effort: such a backend can be put together with a git repo (holding the metadata, hashes, checksums, etc.) and an S3 bucket (holding the package uploads and generated files); there’s a small sketch of what this could look like after the list below.
Advantages of doing this (note that this is basically nixpkgs: git repo + huge CI):

  • not necessary to implement authentication, as packages would be added as PRs (so we’d exploit authentication implemented in GitHub/GitLab/$GitProvider)
  • the git repo itself is easily mirrored on GitHub/GitLab/self-hosted so availability is guaranteed if clients implement fallback logic in case of unavailability
  • the file storage is easily distributed, since mirroring an S3 bucket somewhere else is trivial
  • everything is just static files and there’s no actual HTTP backend running, so yay, fewer security holes
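
To make the git repo + S3 idea a bit more concrete, here is a rough sketch of what a registry entry and the client-side integrity check could look like. Everything in it (the metadata fields, the bucket name, the file paths) is invented for illustration and is not an actual format proposal:

```python
import hashlib
import urllib.request

# Hypothetical registry entry, one file per package version committed to
# the git repo, e.g. packages/prelude/4.1.1.json
entry = {
    "name": "prelude",
    "version": "4.1.1",
    # where the uploaded tarball lives (an S3 bucket, or any static mirror)
    "url": "https://example-registry.s3.amazonaws.com/prelude/4.1.1.tar.gz",
    # integrity check: hash of the tarball contents
    "sha256": "<64 hex chars>",
}

def fetch_and_verify(entry: dict) -> bytes:
    """Download the tarball and reject it if its hash doesn't match the
    metadata committed to the git repo."""
    with urllib.request.urlopen(entry["url"]) as resp:
        data = resp.read()
    if hashlib.sha256(data).hexdigest() != entry["sha256"]:
        raise ValueError(f"checksum mismatch for {entry['name']}@{entry['version']}")
    return data

# A client configured with several mirrors of the bucket could simply try
# them in order - the "fallback logic" mentioned above.
```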

@hdgarrood if any of this makes sense I can put together a small prototype

3 Likes

Have you considered machine-creating versions of a package? I started writing a program to do that for PureScript projects on GitHub, but I postponed it until I could figure out the right way to handle a module’s transitive dependencies.
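
For readers not familiar with the idea, here is a toy sketch of how a version bump could be inferred by diffing a module’s exported names and type signatures. It is only an illustration of the concept (similar in spirit to Elm’s package diff), not the program mentioned above, and as noted the hard part is doing this across transitive dependencies.

```python
# Toy sketch: infer a semver bump from two snapshots of a module's exported
# signatures, represented here as plain name -> signature dictionaries.
def infer_bump(old_api: dict, new_api: dict) -> str:
    removed_or_changed = any(
        name not in new_api or new_api[name] != sig
        for name, sig in old_api.items()
    )
    added = any(name not in old_api for name in new_api)
    if removed_or_changed:
        return "major"  # an existing export changed or disappeared
    if added:
        return "minor"  # new exports only, existing ones untouched
    return "patch"      # no visible API change

old = {"map": "forall a b. (a -> b) -> List a -> List b"}
new = {"map": "forall a b. (a -> b) -> List a -> List b",
       "filter": "forall a. (a -> Boolean) -> List a -> List a"}
print(infer_bump(old, new))  # -> minor
```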

I feel like it would be really nice to have more options of function implementations to choose from, rather than just whatever is in the latest package set.

This doesn’t address the issue of correct bounds. A machine can verify that I changed a function signature, and that it’s therefore a major breaking change. But what if another library of mine never calls that function and only depends on other parts of the package? It should still be able to use the new version of the library, even though the existing bounds say it can’t. This is why Hackage has the bounds matrix, and revisions, which let administrators bump bounds after a version has been published.

Premise: humans are not good at versioning, and bounds are accurate only if they are machine-checked (corollary: bounds that have not been machine-checked are fictional)

I have a few minor objections to this.

Package authors necessarily have incomplete information when they are talking about version bounds; they don’t know what will change in future versions of their dependencies. It’s pretty common that an upstream breaking release doesn’t change anything you’re actually using, for instance, so you can relax your upper bound without making any code changes.

Also, bounds usually only become incorrect after they are published, as a result of newly published versions of upstream packages. Most bounds are correct (at least, in the sense of not being overly lax) at the time of publishing. This isn’t really a case of humans not being good at something; I think it’s more a case of being forced to guess because of incomplete information.

Bounds can be wrong in two different ways: they can be too strict or too lax. These cause problems in quite different ways: bounds which are too strict can prevent install plans from being found, whereas bounds which are too lax can cause compile errors. The resulting situations are different enough that I think the two cases need to be considered separately.

It’s often not possible for a machine to be able to tell when a version bound is too strict. For example, if an Aeson update changes the way a particular data type is serialized, my package may continue to compile and pass tests. On this basis, a machine might tell me I have an overly strict upper bound on Aeson. However, updating to the new version of Aeson could break backwards compatibility, which means that relaxing the bound isn’t safe. In fact, we’ve actually had this exact scenario happen with the compiler in the past.

Of course, a machine can check to see whether all bounds are satisfied if we take the latest versions of each package in some set of packages; this is part of what Stackage does. But this is different from checking if bounds are correct; even if all bounds are correct, it may still be the case that there’s a problem because the latest versions of things don’t build together, and so we might want one or more package maintainers to address this by making some code changes in order to allow relaxing some bounds.

Detecting bounds which are too lax is feasible as well, of course, with something like Hackage matrix builder (as you pointed out).

So the way I’d put it is that bounds are often wrong simply because getting them right all the time requires being able to see into the future. I don’t think it’s the case that bounds which have not been machine-checked are fictional; I don’t think you can really say anything stronger than “the more time has passed since a package’s version bounds were last checked, the less likely they are to be correct”.

1 Like

While Entropic sounds interesting, it currently looks like a self-hosted npm, which means we’d still need to upload docs somewhere, and that somewhere would still need to be kept in sync.
I also think we shouldn’t try too hard to get packages on npm (or the next npm), as that would be convenient for the JS backend but not for the others (they’d still have to use the whatever-VM package manager to get packages for that VM), so this means having a somehow-integrated PureScript-only registry.

I’m not sure about this. Firstly, JS is the default backend, and is already privileged in a number of ways, most obviously that it’s the only backend which ships with the compiler. I am also pretty sure that it’s the most used backend, as having been designed with compilation to JavaScript in mind is one of PureScript’s distinguishing selling points. Therefore I think it makes sense to privilege the JS backend in this way too by having PureScript packages hosted in a registry alongside JS packages. It’s also worth noting that npm (and Entropic) have explicitly stated that they are not just for JavaScript.

I also don’t think it follows that just because npm hosts JS packages and therefore would be convenient for the JS backend, we shouldn’t use it because people using other backends would have to get their packages from elsewhere. Firstly, as we know already from experience, getting packages from more than one registry is perfectly doable (e.g. almost any PureScript project targeting JS will use packages from Bower as well as from npm). Secondly, the alternative is making it equally inconvenient for everyone, which is worse, surely? I think it’s preferable to have a situation where only people who are using alternate backends need to use two separate package registries, rather than needing everyone to use two separate package registries.

More importantly, though, I’m really really not keen on taking responsibility for package hosting. Even if we can design a registry in such a way that the ops burden is relatively low, there’s still a bunch of other things to consider. Legal issues in particular are not something I want to have to deal with at all. Neither are package name disputes. I’m also a little concerned about how we’d pay our hosting bill; soliciting sponsorship from companies is probably doable, but the administration work of accepting donations and estimating costs and so on is again not really something I think we should try to take on. The benefit of being able to integrate Pursuit with our package registry is, to me, not appealing enough to consider taking these additional responsibilities on.

1 Like

What you point out here is an inevitable problem of version bounds: they will all at some point break (by being either too tight or too lax) because of incomplete information.
And we cannot do anything about this; it’s just good to keep it in mind.

But what I meant above is that humans often get bounds wrong even when getting them right only requires looking at the past (so a totally preventable issue).
Example based on a similar case that happened to me recently:

  • I can totally publish a package on Hackage with the (fictional, and too lax) bound text < 1.0, while it’s actually broken with text <= 0.8.1 (but I don’t know it)
  • then we figure out that the bound is not correct when someone tries using it with text@0.8.0 (this is when it gets “machine-checked”)
  • then we restrict the bound to text > 0.8.0 && < 1.0
  • until someone tries to compile with text@0.8.1 and hits the same problem, so in the end we get the “correct” bound (one that reflects reality)

Of course many bounds in the wild are also correct, but not all of them, so we have to consider all of them inaccurate until the compiler tells us otherwise

The JS backend is privileged for good reasons, but listing “it’s the most used backend” as a reason to privilege it more sounds like a circular argument, as the causal effect goes the other way: it’s the most used backend because it’s the only one that ships with the compiler :smile:

As an aside, I think that shipping other backends with the compiler would incentivize their usage a lot. E.g. at work we wanted to try out the Erlang backend but haven’t managed to yet, because it’s not included by default and the inertia of setting up the pipeline is too big compared to just passing some flag to the compiler (I bet it’s not hard, I’m just noting how the incentives are not there. I’m also not advocating for shipping new backends with the compiler tomorrow, I know about the maintenance burden etc etc)

It would not be a different situation from today though - i.e. we’d still have to use two package repositories, one for the PureScript stuff and the other for the backend-related packages

These are fair points, but fortunately we’re not alone in this, and none of these problems sounds insurmountable:

  • For legal protection there are things like The Software Freedom Law Center and the Electronic Frontier Foundation that offer legal help for open source projects
  • Package name disputes can be handled either as today on Bower (FCFS/FIFO) or as npm does (“we have a policy about it, you write us and we give you our indisputable decision”)
  • Hosting bill: the packages from package-sets currently fit in ~50MB, and S3 costs $0.023/GB, so storing the current set would cost about a tenth of a cent. All scripts can be run on free CI, or by volunteers on their spare hardware (in case you’re counting hands, I’m available for this), or on donated hardware (see next point)
  • companies do not have to donate cash, and in fact cash is about the most inconvenient thing to donate here because of the things you mentioned. Much better things to donate:
    • hardware/computing power
    • contributors time
    • help with bureaucracy (since companies have infrastructure in place for this)

Would you be open to changing your mind if we had more manpower/money or sponsorship from companies than we do today?

The JS backend is privileged for good reasons, but listing “it’s the most used backend” as a reason to privilege it more sounds like a circular argument, as the causal effect goes the other way: it’s the most used backend because it’s the only one that ships with the compiler :smile:

Yes of course, you’re absolutely right. I do think it would be nice to make it easier to use other backends, although I’d consider that low priority at the moment.

None of these problems sounds insurmountable: […]

I agree that these problems are surmountable, but to me it’s a question of allocation of resources and efficiency: I’m not convinced that investing time and/or money in this would give us as good a return on our investment as other things we could be doing. I think doing this would involve quite a bit of work on the core maintainers’ part, even if we did manage to offload a lot of the work to other contributors/companies.

Regarding S3, by the way, it’s worth noting that we’d need to store every version of every package ever published, and we’d be billed per request and per byte of data transferred too. Those charges are also admittedly fairly small but I have no idea how large the volume would be.

Additionally, I think making a half-decent first pass at a package registry is doable, but I think there is a huge gap between a half-decent first pass and a proper registry which has had lots of work put into it. For instance, npm has had people working full-time for years on these sorts of things, making installation (including solving) fast, handling package deprecations sensibly without breaking downstream builds, handling package access control, providing package discoverability and statistics, and so on.

I also don’t think language-specific package registries are a particularly good thing, and I don’t think people will be particularly keen on learning how to use a new one. Almost everyone using PureScript already knows how to use npm, so I think using the npm registry would be much better from a UX perspective.

1 Like