Chipper CI started in 2019 with a super naive build system. It worked great to start, but quickly hit scaling issues. Adding servers was a manual process, and the system often failed to clean up after itself properly, requiring a bunch of manual intervention.
That brought us to the new build system in late 2020, which ran on a Nomad cluster of EC2 servers. That system worked great and required very little maintenance. I actually can't remember any downtime caused by this build system! It's still around too - as both a backup and for some specific customers.
So, Why Another Build System?
One thing that always bugged me was running servers 24/7. I really wanted an on-demand compute platform, but the trade-offs in AWS services never made sense.
Fargate (serverless Docker) took too long to spin up containers, and cost 3x as much as EC2 servers. Running individual EC2 servers for each job also took too long. Lambda had too many execution limits (the 15-minute limit in particular, but there were others).
So while the build servers were autoscaled (based on business hours), the active servers standing by were often just...standing by.
The Latest System
Now we're on our 3rd iteration of build systems!
Under the hood, these run on bare-metal servers using Firecracker, which creates micro-VMs. That's the same technology powering Lambda on AWS, but without the AWS-added limitations.
These machines can quickly be created, and come with a bunch of other fun features such as private networking and logging.
What's the result of all this?
- More resources per build by default
- Less (often no) resource contention from other builds
- More resources for ancillary services (Chrome for Dusk tests, databases, etc)
- Cheaper, more efficient compute
- Higher bandwidth costs 😅 (we use S3 for the build cache, among other things, so we're paying more egress cost to send the build cache to Fly build VMs. Working on that!)
What's In a Build?
Each build gets its own "cluster" of VMs - the build VM plus any ancillary services defined, such as Redis, MySQL, etc. These are in a private network, and the VMs can communicate with each other.
Since the hostnames of each VM are a bit random, we take pains to ensure environment variables are available to builds so you don't really need to think or know about them.
This includes setting variables like REDIS_HOST in each build VM, allowing Laravel to pick up these values automatically without you having to configure anything.
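As an illustration, the environment inside a build VM might look something like this (the hostnames here are made up - the real values are injected per build):

```shell
# Hypothetical hostnames for illustration; Chipper injects the real
# values into each build VM's environment.
export REDIS_HOST="redis.internal"
export DB_HOST="mysql.internal"

# Laravel's default config (config/database.php) already reads REDIS_HOST
# and DB_HOST from the environment, so builds pick these up with zero config.
echo "REDIS_HOST=$REDIS_HOST"
echo "DB_HOST=$DB_HOST"
```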
How Chipper Operates on Fly
On the previous build system, Nomad essentially acted as an HTTP API we could ping with a job definition. Nomad then made sure the job spec was valid and implemented it, taking care of details such as "which server has room to run all these containers?".
With an on-demand platform like Fly.io, we don't need to operate a Nomad cluster. We now talk to Fly.io's API, which is responsible for all of that stuff.
On our end, we replaced the Nomad API with our own home-grown thing - a micro-service that takes a job spec (a murder of JSON) and converts it to HTTP requests sent to Fly.io. It handles things like retrying API calls if they fail, and edge cases such as retrying builds in different regions (if one is having issues or is out of capacity).
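The region-failover part is conceptually simple. Here's a rough sketch in shell, where launch_machine stands in for the real HTTP call to Fly.io's API - everything here is hypothetical, not Chipper's actual code:

```shell
#!/bin/sh
# launch_machine is a stand-in for the real API call that creates a
# machine from the job spec. Hypothetical, for illustration only:
# we pretend only "ord" currently has capacity, to exercise the failover path.
launch_machine() {
  region="$1"
  [ "$region" = "ord" ]
}

# Try each region in order until one accepts the build.
for region in iad ord lhr; do
  if launch_machine "$region"; then
    echo "build VM launched in $region"
    break
  fi
  echo "region $region unavailable, trying the next one" >&2
done
```

The real service layers retries on top of this, but the "walk the region list until something sticks" shape is the core of it.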
One fun thing (and the only thing that's not backwards-compatible) is related to Dusk tests.
We now run Chrome in a separate VM, instead of using a huge, bloated build VM. Luckily, Dusk supports this!
However, it's a bit annoying, as you need to adjust your Dusk tests a tad (documented here).
Cypress & Puppeteer
Honestly, like 99% of Chipper's issues are related to Node.
Users of Cypress and Puppeteer can't use a remote VM - in those cases, NodeJS needs to be in the same VM as Chrome.
The solution there is to install Chrome into the VM yourself via a set of (fairly quick!) apt-get install commands.
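For reference, installing Chrome on a Debian/Ubuntu-based VM generally looks something like this (a sketch of the standard Google repo setup; check Chipper's docs for the exact, supported commands):

```shell
# Sketch: install Google Chrome on a Debian/Ubuntu-based build VM.
# Add Google's signing key and apt repository, then install the package.
wget -q -O - https://dl.google.com/linux/linux_signing_key.pub \
  | gpg --dearmor > /usr/share/keyrings/google-chrome.gpg
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/google-chrome.gpg] http://dl.google.com/linux/chrome/deb/ stable main" \
  > /etc/apt/sources.list.d/google-chrome.list
apt-get update && apt-get install -y google-chrome-stable
```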