Chipper CI used to not guarantee the order in which builds were run. This is because the vast majority of queue systems do not guarantee that jobs are run in any particular order.
Luckily, Amazon provides a FIFO (first in, first out) queue within the Simple Queue Service. Here's how we were able to take advantage of it to ensure builds were run in order!
The Old Setup
Chipper CI has a collection of build servers. These are EC2 servers that have the best mix of CPU and RAM at a cost we can afford given our monthly charge to customers (and that allowed us to move out of the t2/t3 server varieties and their CPU credit system).
Each build server has 3 workers on it, which means at any point in time, up to 3 concurrent builds might be running on a server (more on how we manage Docker, etc, in another post).
We didn't do anything particularly fancy for these workers - they were instances of the Chipper CI application running php artisan horizon --environment=production_builds. We were using Laravel Horizon, which uses Redis under the hood.
This worked great, except it did not allow us to guarantee the order of builds.
It could get a bit grim.
Guaranteed Build Order
One way to guarantee build order is to create a system yourself. Some code could check "if the build before me is still pending and not complete, then release the job and try again later".
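To make that idea concrete, here is a minimal sketch of such a check in plain PHP. The function name shouldReleaseBuild and the array shape are hypothetical, chosen for illustration; this is not Chipper CI's code.

```php
<?php
// Hypothetical sketch: decide whether a build must wait for an earlier one.
// $builds is a list of ['id' => int, 'status' => string], ordered oldest first.
function shouldReleaseBuild(array $builds, int $buildId): bool
{
    foreach ($builds as $build) {
        if ($build['id'] === $buildId) {
            return false; // Reached our own build: nothing earlier is unfinished
        }
        if ($build['status'] !== 'complete') {
            return true; // An earlier build is still pending or running
        }
    }

    return false; // Build not found in the list: nothing to wait for
}

$builds = [
    ['id' => 1, 'status' => 'complete'],
    ['id' => 2, 'status' => 'pending'],
    ['id' => 3, 'status' => 'pending'],
];

$buildTwoMustWait = shouldReleaseBuild($builds, 2);   // false
$buildThreeMustWait = shouldReleaseBuild($builds, 3); // true
```

In a queue job, a true result would translate to releasing the job and trying again later.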
Another way is to use a messaging system (a bit different from a queueing system) such as Apache Kafka. Hosting and managing another distributed system isn't something David and I were interested in doing, so we kept looking for options.
Luckily AWS has us covered for guaranteeing queue job order. They have a queue type called "FIFO" queues (first in, first out). The benefit of using SQS is that it's (mostly) baked into Laravel already, and, most importantly, it's a system we don't need to manage ourselves.
SQS FIFO queues guarantee the following (that regular SQS queues do not):
- Job order
- "Exactly-once processing" and de-duplication of jobs
So, SQS FIFO jobs will only come in the order they are sent. You will not be able to process the 3rd queue job in line before the 2nd job is completed.
Issues to Work Through
We had to resolve a few issues getting FIFO queues to work.
First, Laravel doesn't fully support FIFO queues out of the box, but this package adds support for FIFO's extra settings. That was, luckily, an easy hurdle!
Second, we had to work through a few logical issues related to Chipper CI.
For example, if every build job in Chipper CI ran in the FIFO queue, then we could only process one job at a time. Even if we had multiple workers (3 per build server in our case), SQS FIFO's job ordering would only allow one job to be run at a time, in sequence. Clearly that wouldn't do.
Luckily, we didn't need to do anything drastic (like create a queue per team). Instead, FIFO queues have a concept of a "group" - it will only guarantee job order within a group. In Chipper CI's case, this means that each team has their own group.
So, if you have 3 workers and 3 teams push up new builds, each worker could churn through the team's builds in the order they come in.
dispatch(
    (new ProcessBuild($build))
        ->onMessageGroup($build->team_id)
        ->onConnection('ordered-builds')
);
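To see why groups keep several workers busy while still preserving order, here is a simplified in-memory model of the behavior. It is an illustration only, not the SQS API: the GroupedFifoQueue class and its method names are made up for this sketch.

```php
<?php
// Simplified model of FIFO message groups (an illustration, not the SQS API).
// A message can be received only if no message of its group is being processed.
final class GroupedFifoQueue
{
    private array $messages = []; // Each entry: ['group' => string, 'body' => string]
    private array $inFlight = []; // group => true while one of its messages is processing

    public function send(string $group, string $body): void
    {
        $this->messages[] = ['group' => $group, 'body' => $body];
    }

    public function receive(): ?array
    {
        foreach ($this->messages as $index => $message) {
            if (!isset($this->inFlight[$message['group']])) {
                $this->inFlight[$message['group']] = true;
                unset($this->messages[$index]);
                return $message;
            }
        }

        return null; // Every remaining message belongs to a busy group
    }

    public function complete(array $message): void
    {
        unset($this->inFlight[$message['group']]);
    }
}

$queue = new GroupedFifoQueue();
$queue->send('team-1', 'build-a');
$queue->send('team-1', 'build-b');
$queue->send('team-2', 'build-c');

$first = $queue->receive();  // team-1 / build-a
$second = $queue->receive(); // team-2 / build-c, because team-1 is busy
$third = $queue->receive();  // null: build-b waits for build-a
$queue->complete($first);
$fourth = $queue->receive(); // team-1 / build-b
```

Two workers can run build-a and build-c side by side, but build-b is held back until build-a completes.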
This leads to our third issue. Chipper CI's teams each get 1 build at a time by default. However, you can upgrade to run concurrent builds. If we grouped by team alone, each team would be locked in to one build at a time.
So, we need to group by something else - in our case, we chose to group by team id + project id. This allows multiple concurrent builds and, because the group includes the project ID, also has the net effect of spreading builds across a team's multiple projects (and removing the potentially problematic behavior of running multiple builds within a single project at the same time).
Some Interesting Bits
With the FIFO queue logic details tied up, we find that we still need to enforce build concurrency within the business logic of Chipper CI. This is something we wrangled prior to switching to the FIFO queues.
To control the number of concurrent builds per team, we use Laravel's Redis::funnel() feature. It looks something like this:
$concurrencyKey = sprintf('team-%s', $team->id);

// We "releaseAfter" 1 hour (the max build time allowed)
Redis::funnel($concurrencyKey)
    ->limit($team->concurrency)
    ->releaseAfter(3600)
    ->then(function () {
        $this->processBuild(); // We're allowed to proceed!
    }, function () {
        $this->release(30); // Release the job and try again in 30 seconds
    });
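If the funnel is new to you, this rough in-memory model shows the shape of the idea: a fixed number of slots, one callback when a slot is free, another when none is. The InMemoryFunnel class is hypothetical and deliberately simplified; the real feature keeps its slots in Redis so they are shared across workers, and the sketch ignores waiting and expiry.

```php
<?php
// In-memory model of a concurrency funnel (hypothetical and simplified;
// the real feature stores its slots in Redis so all workers share them).
final class InMemoryFunnel
{
    private int $limit;
    private int $active = 0;

    public function __construct(int $limit)
    {
        $this->limit = $limit;
    }

    public function then(callable $onAcquired, callable $onBusy)
    {
        if ($this->active >= $this->limit) {
            return $onBusy(); // No free slot: the caller would release the job
        }

        $this->active++;
        try {
            return $onAcquired();
        } finally {
            $this->active--; // Free the slot once the work is done
        }
    }
}

$funnel = new InMemoryFunnel(1); // A team with a concurrency of 1
$log = [];

$funnel->then(function () use ($funnel, &$log) {
    $log[] = 'build 1 running';
    // A second build arriving while the first is running finds no free slot
    $funnel->then(
        function () use (&$log) { $log[] = 'build 2 running'; },
        function () use (&$log) { $log[] = 'build 2 released'; }
    );
}, function () use (&$log) {
    $log[] = 'build 1 released';
});
// $log is now ['build 1 running', 'build 2 released']
```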
Dispatching the ProcessBuild job now looks like this:
dispatch(
    (new ProcessBuild($build))
        ->onMessageGroup(sprintf('%s-%s', $build->team_id, $build->project_id))
        ->onConnection('ordered-builds')
);
Builds are in order now!
Visibility Timeout
One quirk of SQS Queues is the concept of a Visibility Timeout.
Once a job is taken by a worker, it has a certain amount of time where that job is no longer "visible" to other workers. This is called the Visibility Timeout.
After that timeout expires, the job becomes available for another worker to take, regardless of whether the first worker has finished processing it.
This means that the Visibility Timeout should be longer than the job is expected to take (preferably with a comfortable buffer).
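A toy model makes the risk easier to see. The function below is purely illustrative (times are plain seconds, and the 10-minute build is a made-up figure); it is not how SQS is queried.

```php
<?php
// Toy model of a visibility timeout (illustrative only; times are plain seconds).
function isVisibleToOtherWorkers(int $receivedAt, int $visibilityTimeout, int $now): bool
{
    return $now >= $receivedAt + $visibilityTimeout;
}

$receivedAt = 0;
$buildDuration = 600; // A hypothetical 10-minute build

// With a 30-second timeout, the job reappears while the build is still running
$redeliveredMidBuild = isVisibleToOtherWorkers($receivedAt, 30, $buildDuration - 1);

// With a timeout longer than the build, it stays hidden until the build finishes
$hiddenUntilDone = !isVisibleToOtherWorkers($receivedAt, 900, $buildDuration);
```

Both variables end up true: the short timeout lets a second worker pick up a build that is still in progress, while the longer one does not.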
Chipper CI jobs can range from a few seconds to a full hour. Additionally, we can't set a long Visibility Timeout for all jobs, as releasing a ProcessBuild job to retry in a few seconds is very common thanks to the throttling of concurrent jobs. A long Visibility Timeout would mean builds might not start until long after they were pushed.
In a language with asynchronous or concurrency support, the job could periodically extend its own Visibility Timeout. However, we can't really do that cleanly in PHP.
There are two strategies we can try:
- Delete the job very early - This allows you to completely ignore the Visibility Timeout. As a trade off, the job will only ever be attempted once (or, in our case, when a ProcessBuild job is allowed to run by the Redis::funnel() feature).
- Use a second queue job to track and extend the Visibility Timeout of the first. We chose this route as it gives us freedom to not worry about a few harder bits related to tracking if a build has failed and in dealing with Docker containers.
So, we introduce a new queue job! This one doesn't need to run in a FIFO queue - it can be dumped into the regular queue. We fire the job to track Visibility Timeout early within ProcessBuild like this:
$concurrencyKey = sprintf('team-%s', $team->id);
// We "releaseAfter" 1 hour (the max build time allowed)
Redis::funnel($concurrencyKey)
    ->limit($team->concurrency)
    ->releaseAfter(3600)
    ->then(function () {
        // Dispatch immediately, avoiding the use of __destruct() that the dispatch() helper uses.
        // Dispatcher here is Illuminate\Contracts\Bus\Dispatcher.
        app(Dispatcher::class)->dispatch(
            (new ExtendProcessBuildVisibilityTimeout(
                $this->job->getQueue(),
                $this->job->getSqsJob()['ReceiptHandle'],
                $this->build->id
            ))->delay(30)
        );

        $this->processBuild(); // We're allowed to proceed!
    }, function () {
        $this->release(30); // Release the job and try again in 30 seconds
    });
This uses the dispatcher to dispatch the job immediately. Otherwise Laravel uses a handy PendingDispatch object that doesn't dispatch until the __destruct() method is called (which may be too late in our case).
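The timing of a destructor is easy to get wrong, so here is a plain-PHP illustration of destructor-deferred work. The DeferredDispatch class is hypothetical and only mimics the behavior described above; it is not Laravel's source. It shows the case where the pending object is held in a variable, so its destructor only runs when that variable goes out of scope.

```php
<?php
// Plain-PHP illustration of destructor-deferred dispatch (hypothetical class,
// modelled loosely on the behaviour described above, not Laravel's source).
final class DeferredDispatch
{
    private string $job;
    private $log; // Holds a reference to the caller's log array

    public function __construct(string $job, array &$log)
    {
        $this->job = $job;
        $this->log = &$log;
    }

    public function __destruct()
    {
        // The "dispatch" only happens when the object is destroyed
        $this->log[] = "dispatched {$this->job}";
    }
}

function runBuild(array &$log): void
{
    $pending = new DeferredDispatch('extend-timeout', $log); // Held in a variable
    $log[] = 'build finished';
    // $pending is destroyed only here, when the function returns
}

$log = [];
runBuild($log);
// $log is now ['build finished', 'dispatched extend-timeout']
```

Calling the dispatcher directly sidesteps the question of when the object is destroyed.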
The ExtendProcessBuildVisibilityTimeout queue job looks a little like this:
class ExtendProcessBuildVisibilityTimeout implements ShouldQueue
{
    use Queueable; // Illuminate\Bus\Queueable provides ->delay()

    public function __construct($buildQueue, $buildReceiptHandler, $buildId, $previousTimeoutSeconds = 60) { /* boilerplate setters */ }

    public function handle()
    {
        $build = Build::findOrFail($this->buildId);

        if (/* build has been cancelled, is complete, or is still pending */) {
            return $this->reattempt($this->previousTimeoutSeconds)->delay(30);
        }

        // Calculate the new timeout (double it each attempt, up to an hour, 3600 seconds)
        $newTimeoutSeconds = $this->previousTimeoutSeconds * 2;
        if ($newTimeoutSeconds > 3600) {
            $newTimeoutSeconds = 3600;
        }

        $this->extendVisibilityTimeout($newTimeoutSeconds);

        if ($newTimeoutSeconds < 3600) {
            // This job is re-run 60 seconds before the next visibility timeout to help ensure we don't
            // run it *after* the ProcessBuild's visibility timeout has already been reached
            $this->reattempt($newTimeoutSeconds)->delay($newTimeoutSeconds - 60);
        }
    }

    // Reattempt fires a new job, it does not release this job
    protected function reattempt($newTimeoutSeconds)
    {
        return dispatch(new static($this->buildQueue, $this->buildReceiptHandler, $this->buildId, $newTimeoutSeconds));
    }

    protected function extendVisibilityTimeout($newTimeoutSeconds)
    {
        // Use the SQS client and call `changeMessageVisibility` to change the ProcessBuild's job visibility
    }
}
This allows us to extend a job's Visibility Timeout up to the full hour that may be needed to complete a build.
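To see how quickly the doubling reaches that hour, the arithmetic from handle() above can be pulled out into a small pure function. The function name visibilityTimeoutSchedule is made up for this sketch; the starting value of 60 and the cap of 3600 come from the class above.

```php
<?php
// Reproduces the doubling arithmetic from handle() as a pure function.
// Returns the visibility timeouts (in seconds) requested on successive runs.
function visibilityTimeoutSchedule(int $previousTimeoutSeconds = 60, int $max = 3600): array
{
    $schedule = [];

    do {
        $newTimeoutSeconds = min($previousTimeoutSeconds * 2, $max);
        $schedule[] = $newTimeoutSeconds;
        $previousTimeoutSeconds = $newTimeoutSeconds;
    } while ($newTimeoutSeconds < $max); // Stop re-running once the cap is reached

    return $schedule;
}

$schedule = visibilityTimeoutSchedule();
// [120, 240, 480, 960, 1920, 3600]
```

So a build that runs for the full hour needs only six runs of the tracking job.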