Our New Build System

September 20th, 2020

We launched our new build system!

What does this mean for Chipper CI users?

  1. It's faster
  2. There's less competition for resources
  3. It's more resilient and easier to scale

The new build system is an overhaul of Chipper CI's core task: The actual running of builds.

That means we're at feature parity with the older system - we haven't added features yet. However, we were able to upgrade the build servers, allowing us to allocate more RAM to each build.

This is the first step towards adding planned features and enhancements - and the platform on which we'll do that is much more solid.

Old Build System

Our original build system is a bit more naive than we'd like. While its simplicity has mostly been a positive, the downsides began to affect development and build speed.

You can read the details of the old build system in the article linked above, but here are the basics:

  1. Each build server ran 3 queue workers
  2. Each worker ran one build at a time
  3. There were, therefore, up to 3 concurrent jobs per build server

To run builds, we fired a job into the queue. Whichever build server picked up the job ran the complete lifecycle of a build. It retrieved cache files, started the necessary containers, ran pipeline scripts, "streamed" script output, handled post-build tasks, managed error handling, and more.

It was simple and effective, but we could see it start to break down:

  1. We often saw some build servers fully utilized while others were not being used
  2. Development was slowing down, as it became harder to add more advanced features without introducing bugs

New Build System

The new build system uses Nomad, a job scheduler that specializes in effectively using resources for batch jobs (such as CI builds) and long-lived services (such as a web application).

Nomad affords us a way to "fire and forget" a build job. Instead of one queue worker managing the entire lifecycle of a build, Chipper sends a request to Nomad's API and ... that's it. After that, Chipper just reacts to incoming data.

To achieve this, we had to rethink everything about the build system, but the results are much better.

Nomad

Nomad's focus is orchestrating the placement of workloads (usually in containers). It's comparable to Kubernetes, although it's much simpler to operate (in exchange for having fewer features).

We decided on Nomad for the following reasons:

  • Its relative simplicity to use and operate
  • Its concept of Batch jobs (great for CI builds)
  • Its ability to fairly distribute jobs among build servers
  • It's a proven, battle-tested system

As an "end user" of Nomad, you'll find it pretty simple to get started.

Within Nomad, you create a Job. Each Job has one or more Tasks. In our case, each Task describes a docker-based service, such as the build container, a database container, or a cache container.

Here's a very simple Job, described in HCL:

job "some-build" {
    datacenters = ["us-east-2a"]
    type = "batch"

    group "some-build-group" {

        network {
            mode = "bridge"
        }

        task "build" {
            driver = "docker"
            leader = true
            config {
                image = "chipperci/builder:latest"
            }
            
            env {
                BUILD_ID = "some-build"
                CI_COMMIT_SHA = "some-sha"
                AND = "all the other env vars"
            }

            resources {
                cpu = 2000
                memory = 2048
            }
        }

        task "mysql" {
            driver = "docker"
            
            config {
                image = "mysql:5.7"
            }
            
            resources {
                cpu    = 1000
                memory = 1024
            }

            env {
                MYSQL_ROOT_PASSWORD = "secret"
                MYSQL_DATABASE = "chipperci"
                MYSQL_USER = "chipperci"
                MYSQL_PASSWORD = "secret"
            }
        }
    }
}

This is what you might submit to Nomad to kick off a build requiring MySQL 5.7 to run tests.

Reality, of course, is a bit more complex. For example, we have pre-start tasks to download cache files, and post-stop tasks to update the build cache and store any Laravel logs/Dusk screenshots if the build failed.

So, this can run some containers. But how do we actually run a build?

Running Builds

The old build system had to run many docker exec... commands to run build steps. The new build system is more self-contained - builds run themselves!

In practice, this means the following:

  1. There's a utility application used to run build scripts against the build container
  2. This utility reports results to a microservice via gRPC (which, importantly, enables the streaming of build output)
  3. The microservice was designed to be simple. It's essentially a relay, receiving build data and converting it to queue jobs that Laravel can read.
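The relay idea in step 3 can be sketched in Go. The real system receives a gRPC stream, but a channel stands in for it here to keep the sketch self-contained; the BuildEvent shape and field names are hypothetical, since the actual message types aren't public:

```go
package main

import "encoding/json"

// BuildEvent is a hypothetical shape for the data the build utility
// streams to the relay microservice.
type BuildEvent struct {
	BuildID string `json:"build_id"`
	Step    string `json:"step"`
	Output  string `json:"output"`
}

// relay drains a stream of build events and converts each one into a
// queue payload that a Laravel worker could consume. The enqueue callback
// stands in for pushing onto the actual queue backend.
func relay(events <-chan BuildEvent, enqueue func(payload []byte) error) error {
	for ev := range events {
		payload, err := json.Marshal(ev)
		if err != nil {
			return err
		}
		if err := enqueue(payload); err != nil {
			return err
		}
	}
	return nil
}
```

The appeal of this design is that the relay holds no state: it just translates one protocol (a stream) into another (queue jobs), which keeps it simple and easy to reason about.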

The server infrastructure needed to run Nomad and support the microservice is a bit more complicated, but the ideas within it (and the microservice) are all fairly simple. Most importantly, it has so far shown itself to be less prone to failure than our original build system.

Currently, the new build system is at feature parity with the old one. However, it's a major refactoring - code complexity is reduced, and our feature velocity will benefit.

Golang

The utilities mentioned here are all written in Golang. We dipped our toes into using Golang early in 2020, but now we're pretty deep in!

I've been asked why we're using Golang instead of more PHP. The reasons for Golang boil down to these:

  1. Golang (via exec.Command) runs bash scripts, including (especially?) NodeJS tasks, faster than PHP (via Symfony Process)
  2. It compiles into a single binary file, so you don't need an entire runtime installed wherever it's used (especially nice in containers)
  3. Its gRPC support is first-class (even though gRPC is language-agnostic)
  4. It has real concurrency, required when you need to do a bunch of checks all while running commands
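Points 1 and 4 come together in how a build step actually runs. Here's a minimal Go sketch of running a script via exec.Command and streaming its output line by line as it's produced; the helper name is illustrative:

```go
package main

import (
	"bufio"
	"os/exec"
)

// runStep executes a build step through the shell and streams each line
// of output to onLine as it is produced, rather than waiting for the
// command to finish - the same "run a script, stream the output" job
// the build utility performs.
func runStep(script string, onLine func(string)) error {
	cmd := exec.Command("sh", "-c", script)
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		return err
	}
	if err := cmd.Start(); err != nil {
		return err
	}
	scanner := bufio.NewScanner(stdout)
	for scanner.Scan() {
		onLine(scanner.Text()) // each line streams out as it arrives
	}
	if err := scanner.Err(); err != nil {
		return err
	}
	return cmd.Wait()
}
```

In a real build runner, stderr would be drained in a separate goroutine alongside stdout - the kind of concurrent plumbing that's natural in Go and awkward elsewhere.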

The details of why each of these is important get interesting, but I'll save those for future articles.