Over the years I sketched some notes about dealing with Pursuit in the new registry world, so what follows is a rough chop of those, tailored to this thread. As the quote goes, I apologise for the long letter, as I didn’t have the leisure to make it shorter.
## Background on Pursuit V1
Pursuit is currently the home of all PureScript documentation, and an informal registry - if you can find a package there, then you can probably use it in your project.
Traditionally you would get your documentation in there by calling `pulp publish`, which:
- cut a git tag in your repo
- collected some info from Bower to pass to the compiler
- called `purs publish` with those, to get the compiler to gather all the docs in the project, generate `docs.json` files and a resolution of your package versions/addresses
- packaged all of that up, and hit Pursuit’s API with the result
- at that point Pursuit would call the compiler itself with a custom renderer to generate HTML docs (from the `docs.json` files), and cross-link them to the GitHub repos (through the resolution file), to get nice source links
- that’s it: your docs are published, and Pursuit will serve them itself
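For reference, the bundle that ends up on Pursuit is roughly shaped like the record below - a hand-wavy sketch, assuming field names for illustration rather than quoting the exact schema:

```purescript
module PublishPayload where

import Data.Argonaut.Core (Json)

-- A rough sketch of what `purs publish` assembles for Pursuit.
-- Field names here are illustrative, not the exact schema.
type PublishPayload =
  { packageMeta :: Json            -- metadata collected from Bower
  , version :: String              -- the git tag that was cut
  , modules :: Array Json          -- the per-module docs.json contents
  , resolvedDependencies :: Json   -- versions/addresses, used for source cross-links
  , compilerVersion :: String
  }
```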
## A vision for Pursuit V2
Pursuit V1 is based on the requirements of a PureScript registry that relies on Bower. The new Registry changes some of those requirements (e.g. no Bower, support for monorepos, git hosting through providers other than GitHub, etc.), and that requires either patching Pursuit V1 or using something else to render the docs.
At this point it seems fairly clear to me that both the core team and the community do not have much appetite to maintain a webservice written in Haskell, so new thing it is - this gives us the opportunity to rethink the current model in the wake of the new Registry.
## What do we expect from something like Pursuit?
Currently:
- host compiler-generated docs, patched with links to the original source, and the package READMEs
- search for packages, functions, and types
Now, if we are going to ditch Pursuit V1 entirely, I question the need for a webserver in the first place: compiler-generated docs are entirely static files, READMEs are as well, and wonderful languages such as PureScript allow us to implement client-side search, provided the search index can be broken up into small enough chunks. These index chunks are static files as well!
Pursuit can then be hosted entirely on an S3-like service, alongside the packages for the Registry (which we host on DigitalOcean’s S3 thing, called Spaces).
This line of thinking is actually much older than the new Registry, as Spago has integrated a Pursuit approximation since 2019 (gosh, that was 5y ago, time flies): `spago docs` will generate docs from the compiler, build a sharded search index over the whole package set, and add a client-side app to let you search these docs.
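To make the static-search idea concrete, here is a minimal sketch of prefix-based shard selection - an assumed scheme for illustration, not necessarily the exact chunking that `docs-search` uses today:

```purescript
module SearchShard where

import Prelude
import Data.String (take, toLower)

-- Hypothetical scheme: the index is split into chunks keyed by the
-- first two characters of the identifier, so the client only needs
-- to fetch the chunk relevant to the query, e.g. "index/fo.json"
-- when searching for "foldMap". Every chunk is a plain static file.
shardUrl :: String -> String
shardUrl query = "index/" <> toLower (take 2 query) <> ".json"
```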
This almost covers the entirety of the requirement list from above:
1. we get compiler-generated docs, but they have neither links to GitHub nor READMEs for the packages
2. the type-search is roughly on par with Pursuit
Point (1) above is where the whole `purs publish` issue ties in: as detailed in the registry ticket and the gist that Thomas linked above, it’s the piece of the compiler that fills in information about re-exports, while at the same time adding restrictions (clean git tree, etc.).
We are currently calling `purs publish` in the registry pipeline to upload docs to Pursuit, but to allow things like monorepo publishing we should either patch the current `purs publish` (and hopefully not break the current Pursuit) or go around it - in the gist above I propose that we patch `purs docs` to add the information that we need instead, but today I believe that we should just patch `purs publish` to remove all restrictions.
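To see why re-exports need special treatment at all, consider a module that re-exports declarations coming from a dependency - `purs docs`, looking at a single package in isolation, has no documentation to attach to them:

```purescript
-- MyLib re-exports declarations that live in another package
-- (purescript-maybe). The docs for `fromMaybe` are not in this
-- package's source, so something has to resolve the dependency
-- tree and fill them in - that's the part `purs publish` does today.
module MyLib
  ( module Data.Maybe
  ) where

import Data.Maybe (Maybe(..), fromMaybe)
```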
Once we have this functionality, the Registry pipeline - i.e. the batch job that is kicked off on our infra when you call `spago publish` - can call `purs docs`, patch the HTML to reference GitHub locations, generate the index, and upload index+HTML to S3.
This is the bare minimum that would get us to parity with Pursuit, but once things are integrated with the Registry pipeline we can easily add info about package sets, compiler versions, the registry index, etc., since we are dealing with that info in there anyways.
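Sketching that pipeline step out in code (all names below are hypothetical placeholders, not an existing API - just to show the shape of the job):

```purescript
module Pipeline where

import Prelude
import Effect.Aff (Aff)

-- Placeholder types standing in for the real artifacts.
type DocsJson = String
type Html = String
type Shard = String

-- Hypothetical batch-job step, kicked off by `spago publish`:
-- generate docs, patch in source links, refresh the search index,
-- and push everything to the bucket as static files.
publishDocs :: String -> String -> Aff Unit
publishDocs package version = do
  docsJson <- runPursDocs package version
  html <- patchSourceLinks package version (renderHtml docsJson)
  shards <- updateSearchIndex docsJson
  uploadToS3 html shards

-- Stubs so the sketch typechecks; real implementations would shell
-- out to purs, rewrite the HTML, and talk to the S3 API.
runPursDocs :: String -> String -> Aff DocsJson
runPursDocs _ _ = pure "docs.json contents"

renderHtml :: DocsJson -> Html
renderHtml _ = "<html></html>"

patchSourceLinks :: String -> String -> Html -> Aff Html
patchSourceLinks _ _ html = pure html

updateSearchIndex :: DocsJson -> Aff (Array Shard)
updateSearchIndex _ = pure []

uploadToS3 :: Html -> Array Shard -> Aff Unit
uploadToS3 _ _ = pure unit
```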
## The path to Pursuit V2
I want to stress that most of the pieces are already there thanks to `docs-search`. Things that are missing:
- patch `purs docs` to add `reExports` (see the gist linked above), or `purs publish` to remove the git restrictions
- patch the generated docs to include the READMEs and GitHub/other-hosting-provider source links
As with every approach, there are risk factors that are not quantified at the moment (these are known unknowns; unknown unknowns are not listed for obvious reasons):
- `docs-search` works off a whole package set to generate its index, and does not need to worry about incremental updates. The Registry pipeline will need to update the index in an efficient way, to avoid thrashing the cache (i.e. unstable sharding, where a new package triggers the regeneration of many chunks). I don’t have a clear view of what the chunking logic will do in this case, but I do believe it’s possible, and maybe even easy, to ensure that the sharding of the index is stable - see the sketch after this list.
- for the client-side search to work well on poor connections the shards can’t be too big nor too many, or the client will have to download a ton of data, which is not always possible/pleasant. We’ll need to benchmark that. However, it’s an easy one to solve in case we can’t shrink the index to an acceptable size: we’d have a webserver answer search queries, as just a thin layer acting like a “local client”: it would download the whole index locally and answer queries from the internet using that local index. This would still be an improvement over the current situation, as the rest of the assets would still be static files, with just the search being server-based.
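On the sharding-stability point: with a prefix scheme like the one sketched earlier, publishing a package only rewrites the chunks its own identifiers map into, and every other chunk stays byte-identical, so CDN caches stay warm. A rough illustration, under the same hypothetical scheme:

```purescript
module ShardStability where

import Prelude
import Data.Set (Set)
import Data.Set as Set
import Data.String (take, toLower)

-- Same hypothetical prefix scheme as in the earlier sketch.
shardKey :: String -> String
shardKey ident = toLower (take 2 ident)

-- The shards to regenerate when a package is published are exactly
-- those its identifiers fall into; the rest of the index is left
-- untouched, which is what keeps the sharding stable.
affectedShards :: Array String -> Set String
affectedShards idents = Set.fromFoldable (map shardKey idents)
```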