Our Naive Build System

February 7th, 2020

Our build system is fantastically naive, and I (mostly) love it.

A build system for a CI app usually involves spinning up some Docker containers, running some commands, and reporting the results.

There are a LOT of details glossed over in that explanation.

One such detail is how to run your containers. There are a lot of options!

Hosted Solutions

Some hosted solutions that we could conceivably try out include:

  • AWS CodeBuild (which is itself a CI system)
  • AWS ECS/EKS (which is a "real" scheduler built on top of EC2)
  • Google Cloud Build (another CI system)
  • Google hosted Kubernetes (GKE)
  • Many others!

Each of these has drawbacks that stopped us from using them.

First, all the hosted solutions cost roughly 2x-3x more per hour than running your own EC2 servers. We calculated the build hours/minutes used per month and compared that cost in something like AWS CodeBuild against running a few EC2 servers (full-time!) - even then, running our own EC2 servers was cheaper.
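Here's a back-of-envelope version of that comparison. All of the prices and usage numbers below are hypothetical placeholders, not real AWS pricing - the point is just the shape of the math:

```python
# Back-of-envelope cost comparison - all prices and usage numbers here
# are hypothetical placeholders, not real AWS pricing.
HOSTED_PER_MINUTE = 0.01   # e.g. a hosted CI's per-build-minute rate
EC2_PER_HOUR = 0.10        # a self-managed build server

build_minutes_per_month = 60_000

hosted_cost = build_minutes_per_month * HOSTED_PER_MINUTE

# Three EC2 servers running full time (~730 hours per month):
ec2_cost = 3 * 730 * EC2_PER_HOUR

print(f"hosted: ${hosted_cost:,.2f} / month, ec2: ${ec2_cost:,.2f} / month")
```

With enough build minutes per month, the per-minute billing of a hosted service overtakes the flat cost of always-on servers.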

The trade-off of course is the requirement to manage your own servers!

The top contender (AWS CodeBuild) made a few things hard, such as streaming command output in a way that lets us do real-time updates within Chipper, and running multiple containers (so you can, for example, use MySQL/Redis within a build).

ECS/EKS (including the serverless Fargate option) are also expensive. More importantly, however, they are really built to host long-running applications - not short-lived CI jobs. In addition to a higher hourly cost, they can also have prohibitively slow (and inconsistent) spin-up times.

Google services are neat, but wading through their documentation was somehow worse than AWS's documentation. That's...saying something. I had no idea how to get started in there, and by the time we were looking, we had a working solution.

Docker Schedulers

A "scheduler" is a thing that decides how and where a container "service" is run. Schedulers typically run across a cluster/fleet of servers and take into account server resource usage, along with custom parameters defined by developers, to decide which server(s) should run which containers.

It's usually up to the developer to say something like "this service needs this much RAM and CPU available, and these need to be in the same local network". The scheduler takes those parameters (and others, depending on the scheduler) and runs the services.

Managing schedulers, especially ones you host yourself, can be very complex.

For our first version of Chipper CI, we avoided using these. The results have been better than expected!

Chipper CI's Setup

Chipper CI's build system is basically just Laravel running queue jobs that, in turn, run Docker commands directly on whichever host the queue job is being processed on.

We use AWS Autoscaling to scale on a schedule (rather than one based on server resource usage). It turns out build server demand is very predictable based on business hours, and a time-based schedule works great. This lets us save a lot of money by spinning down servers every night and on weekends.
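Scheduled scaling is configured with scheduled actions on the Auto Scaling group. A sketch of what that looks like with the AWS CLI - the group name, action names, times, and sizes here are all illustrative:

```shell
# Scale the build fleet up for business hours (times are UTC)...
aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name chipper-build-servers \
  --scheduled-action-name scale-up-weekday-mornings \
  --recurrence "0 8 * * 1-5" \
  --min-size 3 --max-size 6 --desired-capacity 3

# ...and back down every evening (weekends get no scale-up at all).
aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name chipper-build-servers \
  --scheduled-action-name scale-down-evenings \
  --recurrence "0 20 * * 1-5" \
  --min-size 0 --max-size 6 --desired-capacity 0
```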

There's a future post to be made about how to gracefully shut down servers that are terminated during AWS autoscaling.

So, we have a set of build servers. These build servers have the Chipper CI application on them. Each runs 3 workers that only process our ProcessBuild queue jobs. This means that any one server will run a maximum of 3 builds concurrently. Since every server's workers pull from the same queue, jobs are distributed across the build servers at random.
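In a typical Laravel setup, "3 workers per server" is just a process manager running three copies of `queue:work`. A hypothetical Supervisor config (program name, paths, and queue name are illustrative):

```ini
; Three queue workers per build server, pinned to the builds queue.
[program:chipper-build-worker]
command=php /var/www/chipper/artisan queue:work sqs --queue=builds --tries=1
numprocs=3
process_name=%(program_name)s_%(process_num)02d
autostart=true
autorestart=true
```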

The Process

The general process for a build to run within Chipper CI is as follows:

  1. A webhook comes in, and Chipper CI checks to see if the webhook is one it should take action on. If so, a ProcessBuild queue job is dispatched.
  2. The ProcessBuild job has a concurrency checker to ensure the team running the build isn't using all of their available build containers (if they are, the job is delayed and attempted again).
  3. If the job can begin, we generate a docker-compose.yml file that configures the containers needed, environment variables, security settings, and CPU/RAM limitations.
  4. The build container gathers the assets it needs (deploy keys, project pipeline scripts), and we run those scripts. We capture the output to record it (and update web browsers via Pusher), and save the exit code of each step.
  5. When a job is complete, we run some extra scripts to handle build caches and create artifacts from Laravel log files and/or Dusk screenshots.
  6. Finally, we shut down and remove the build containers used in that job.
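The generated compose file from step 3 might look roughly like this - the service names, images, and resource limits below are illustrative, not our actual config:

```yaml
# Illustrative sketch of a generated docker-compose.yml for one build.
version: "2.4"
services:
  build:
    image: chipperci/runner:latest   # hypothetical build image
    environment:
      - CI=true
      - DB_HOST=mysql
    cpus: 2
    mem_limit: 4g
    depends_on: [mysql, redis]
  mysql:
    image: mysql:5.7
    environment:
      - MYSQL_ROOT_PASSWORD=secret
  redis:
    image: redis:5-alpine
```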

This is all done in PHP by running docker-compose commands via Symfony\Component\Process\Process. This was, essentially, the first iteration of the build system. It was meant to be a proof of concept, but it worked so well that we've kept it!
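The core trick is streaming each line of output as it arrives, rather than waiting for the command to finish. Our version uses Symfony Process in PHP; here's the same pattern sketched in Python for illustration:

```python
import subprocess

def stream_command(args):
    """Run a command and yield its stdout line by line - a sketch of
    streaming docker-compose output as it happens, rather than
    collecting it all at the end."""
    proc = subprocess.Popen(
        args, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    for line in proc.stdout:
        yield line.rstrip("\n")  # e.g. forward to storage / Pusher here
    proc.wait()
    yield f"exit code: {proc.returncode}"

# Stand-in for a `docker-compose` invocation:
for line in stream_command(["echo", "hello from the build"]):
    print(line)
```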

Issues

This build system is great in that it's relatively simple and straightforward, and, despite being a bit naive, it works better than we ever expected.

However, it's not at all perfect.

  1. The build system has no knowledge of server resource usage, so it's possible for one server to have 3 builds running while other servers have none. That's both inefficient and a potential cause for slower builds.
  2. PHP has no real asynchronous abilities (we tried a few of the popular async libraries). There are a few things that really need to be asynchronous but are instead "hacked" in:
    • We need to periodically check if a build was cancelled mid-build
    • We need to send pipeline command output to storage (and to Pusher) in a way that is throttled
    • A quirk of using SQS FIFO queues: We need to periodically update the ProcessBuild job's VisibilityTimeout so a job doesn't get released for another worker to run (thus having a build being run more than once)
  3. Our more naive style of running builds can leave containers hanging forever or time out during cleanup operations - this is mostly invisible to end-users, but it means we need to periodically log into servers and clean them up during heavy usage.
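The SQS visibility "hack" amounts to extending the message's VisibilityTimeout from inside the build loop, but only every so often. A sketch of that throttled heartbeat in Python (the interval and timeout values are made up; `sqs` can be a boto3 SQS client):

```python
import time

class VisibilityHeartbeat:
    """Extend an SQS message's VisibilityTimeout at most once per
    `interval` seconds, called opportunistically from the build loop,
    so another worker doesn't pick up (and re-run) a long build."""

    def __init__(self, sqs, queue_url, receipt_handle,
                 interval=60, extend_by=120):
        self.sqs = sqs
        self.queue_url = queue_url
        self.receipt_handle = receipt_handle
        self.interval = interval
        self.extend_by = extend_by
        self.last_beat = 0.0

    def beat(self, now=None):
        now = time.monotonic() if now is None else now
        if now - self.last_beat < self.interval:
            return False  # too soon since the last extension; skip
        self.sqs.change_message_visibility(
            QueueUrl=self.queue_url,
            ReceiptHandle=self.receipt_handle,
            VisibilityTimeout=self.extend_by,
        )
        self.last_beat = now
        return True
```

The build loop calls `beat()` on every pass; only one call per interval actually hits the SQS API.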

The Future

There are a lot of ways to improve the build system!

One is to put more responsibility into the container itself - a small script could run the pipeline scripts and send the output to the application, for example. That frees up the ProcessBuild job from having to manage that through PHP.
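A minimal sketch of what such an in-container runner could do - execute each pipeline step, collect its output and exit code, and stop at the first failure (reporting back to the application is left out; everything here is hypothetical, not our actual runner):

```python
import subprocess

def run_pipeline(steps):
    """Run (name, command) pipeline steps in order, recording each
    step's exit code and output. Stop at the first failing step."""
    results = []
    for name, command in steps:
        proc = subprocess.run(
            command, shell=True, capture_output=True, text=True
        )
        results.append((name, proc.returncode, proc.stdout))
        if proc.returncode != 0:
            break  # later steps are skipped once a step fails
    return results
```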

Another larger change we'll eventually make is to use a container scheduler. We'll have to evaluate what makes sense, but one strong candidate is Nomad.

Nomad seems to be both the simplest to manage and best for short-lived CI style jobs.
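Nomad has a "batch" job type that is run-to-completion, which maps well onto CI builds. A rough sketch of what a build job spec could look like (job, image, and resource values are illustrative):

```hcl
# Illustrative Nomad job for a short-lived CI build.
job "build" {
  type = "batch"  # run-to-completion, unlike a long-lived service

  group "pipeline" {
    task "runner" {
      driver = "docker"

      config {
        image = "chipperci/runner:latest"
      }

      resources {
        cpu    = 2000  # MHz
        memory = 4096  # MB
      }
    }
  }
}
```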

This should let us expand Chipper's offering with features such as additional available containers, server size options for higher paid tiers, "debug over SSH" support, and a host of other ideas we have.

At the same time, we should be able to better use server resources (spreading usage across servers efficiently).

We're looking forward to it!