Never Miss a Webhook

September 28th, 2020

Webhook Overload

Chipper CI accepts webhooks from the 3 major Git providers: GitHub, BitBucket, and GitLab. These get sent to us based on events that occur on user repositories (opening PRs, pushing up commits, tagging commits, etc.).

The PHP/Laravel code for this within Chipper CI is extremely standard - set up some routes and controllers, and you're off to the races.

The code (with abstractions to handle webhooks from the various Git providers) looks mostly like this:

// routes.php
// Accept webhooks from GitHub & verify webhook signature
Route::post('/webhooks/github', 'GitHubWebhookController')
    ->middleware('webhook.github')
    ->name('webhooks.github');


// GitHubWebhookController.php
public function __invoke(Request $request)
{
    return with(WebhookEventFactory::fromRequest($request), function ($event) {
        $event->handle();

        return response()->json(['event_type' => class_basename($event)]);
    });
}

The code handling these webhooks gets a bit complicated, but for this article, we're more concerned with making sure we don't miss a webhook.

Incoming webhooks are high volume and "spiky"

Webhooks from the Git providers tend to be relatively high volume.

For example, GitHub may send webhooks from every repo in an entire organization (depending on how the GitHub App is "installed" during the OAuth flow).

In addition to high volume, we also see spikes in incoming webhooks. If we get enough webhooks at once, this can easily overload a server.

This has happened when a git provider had issues of their own. BitBucket, for example, tends to send a flood of incoming webhooks when their system recovers from an outage.

Missing webhooks means missing builds & deployments.

It's therefore extremely important to not miss (or lose) any webhooks if you can help it.

So, how do you plan for unpredictably spiky traffic patterns?

Let's first talk about what happens when you get a spike in webhooks.

PHP servers are fairly easy to overload. This isn't necessarily due to limited server resources, but rather to PHP-FPM's process model, where you must define a maximum number of processes that can spin up. This limits the number of concurrent requests that can be handled.

The defaults are very low, and we bumped ours up to as much as the server could take. However, monitoring the Nginx access logs and PHP-FPM error logs showed that we did in fact drop the occasional webhook.

This is most evident in PHP-FPM logs, which will show the following warnings:

WARNING: [pool xxx] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers)...

WARNING: [pool xxx] server reached pm.max_children setting, consider raising it
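For context, these limits come from the PHP-FPM pool configuration. The path and values below are illustrative only - the right numbers depend on how much memory each worker uses on your server:

```ini
; e.g. /etc/php/7.4/fpm/pool.d/www.conf (path varies by distro/PHP version)
; pm.max_children caps concurrent requests; anything beyond it queues or drops
pm = dynamic
pm.max_children = 50
pm.start_servers = 10
pm.min_spare_servers = 5
pm.max_spare_servers = 15
```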

So we knew we needed to solve that problem, preferably in a way that let us never think about it again.

The initial ideas we had to get around this issue, and why we discarded them, are as follows:

  1. Larger servers: This is more money, and doesn't guarantee we'll never hit a limit again.
  2. Load Balancing: This is more expensive and more complicated than we wanted to get.
  3. Another Language (Golang): Certainly much higher concurrency can be achieved, but then we're managing more code, more deployments, and more stuff to monitor.
  4. Laravel Vapor: This could work, since it uses much of the same solution we ended up using. However, we have certain setups that would make this painful, such as already having AWS infrastructure in place (Vapor doesn't currently let you add resources into an existing VPC).

API Gateway

AWS has hosted solutions that are great for this use case, and they are not expensive at our scale.

Using AWS's API Gateway, we were able to offload webhook handling.

The pricing is $3.50/million requests for the first 333 million requests/month. At our scale, this is cheap!

AWS currently has 2 versions of API Gateway: Version 1 ("REST APIs") and Version 2 ("HTTP APIs"). Version 1 is a bit harder to work with but has better integrations with AWS services (at the time we were coding this up, at least; I believe that's somewhat changed now).

We chose Version 1 as we wanted incoming webhooks to be transformed into an SQS job.

The SQS queue we created is a FIFO queue, in order to ensure we process webhooks in the order they came in.

And we set that up! It worked great. API Gateway allows you to read the incoming JSON data and perform some transformations on it, which let us create an SQS job with the JSON payload included in it.

The only issue we came across was that Bitbucket, for some reason, would not properly encode its JSON into the SQS job. I never tracked down the exact reason, but we ended up having to base64 encode the JSON payload into the SQS job (and even then, we had to do it in a super specific order of operations). That was hours down the drain :D
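As a sketch of that workaround (function names and payload here are ours for illustration, not Chipper CI's actual code): base64-encode the raw JSON before placing it in the SQS message body, and decode it again before parsing on the consuming side.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// encodePayload base64-encodes the raw JSON body so that SQS/JSON
// string-escaping quirks (like the ones we hit with Bitbucket)
// can't mangle it in transit.
func encodePayload(raw []byte) string {
	return base64.StdEncoding.EncodeToString(raw)
}

// decodePayload reverses encodePayload and unmarshals the JSON.
func decodePayload(encoded string) (map[string]interface{}, error) {
	raw, err := base64.StdEncoding.DecodeString(encoded)
	if err != nil {
		return nil, err
	}
	var out map[string]interface{}
	err = json.Unmarshal(raw, &out)
	return out, err
}

func main() {
	body := []byte(`{"actor":"example","push":{"changes":[]}}`)
	enc := encodePayload(body)
	dec, _ := decodePayload(enc)
	fmt.Println(dec["actor"]) // prints "example"
}
```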

Laravel Implementation

One thing I like about our implementation is that we kept our Laravel routes, controllers, and tests for processing incoming webhooks.

We accomplished this by "faking" an HTTP request into the application when processing webhooks.

Each webhook fires off a ProcessQueuedWebhook job, which contains both the JSON payload and the HTTP request data. This means we could replay the request using Laravel's HTTP kernel (the Illuminate\Contracts\Http\Kernel contract).

This includes the headers (if provided) that allow you to check the webhook's authenticity.

The code to do this is basically mimicking what you see in public/index.php, and also in Laravel's TestCase to test incoming HTTP requests programmatically.

// file app/Jobs/ProcessQueuedWebhook.php

use Illuminate\Contracts\Http\Kernel as HttpKernel;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Http\Request;
use Illuminate\Support\Facades\Log;
use Symfony\Component\HttpFoundation\Request as SymfonyRequest;

class ProcessQueuedWebhook implements ShouldQueue
{
    public function call($job, $data)
    {
        try {
            $response = $this->postJson($data);
        } catch (\Exception $e) {
            Log::error($e);
        }

        // We manage the job lifecycle ourselves
        $job->delete();
    }

    protected function postJson($data)
    {
        // Not shown here
        $content = $this->retrieveJsonString($data);

        $headers = array_merge([
            'CONTENT_LENGTH' => mb_strlen($content, '8bit'),
            'CONTENT_TYPE' => 'application/json',
            'Accept' => 'application/json',
        ], $data['headers']);

        $kernel = app(HttpKernel::class);
        $files = [];
        $parameters = [];
        $cookies = [];
        // Not shown here
        $server = $this->transformHeadersToServerVars($headers);
        $route = route(sprintf('webhooks.%s', $data['provider']));

        $symfonyRequest = SymfonyRequest::create(
            // Not shown here
            $this->prepareUrlForRequest($route), 'POST', $parameters,
            $cookies, $files, $server, $content
        );

        $response = $kernel->handle(
            $request = Request::createFromBase($symfonyRequest)
        );

        $kernel->terminate($request, $response);

        return $response;
    }
}

There are a few interesting things here.

  1. The job class uses a call() method (not handle()), which takes a $job object and a $data array. This is NOT what you usually see in Laravel's queue system. The SQS job is actually created by API Gateway (and a Lambda function, which I describe later). The job data is crafted in a format that Laravel can process.
  2. The methods marked "not shown here" can be found by digging into Laravel's TestCase class, which has a similar implementation for sending HTTP requests programmatically.

So the GitHub App and the GitLab/BitBucket OAuth apps were all configured to use the API Gateway endpoints for webhooks. These then created a queue job (in SQS) that Laravel was able to understand. The queue job then replayed the webhook HTTP data against the application.

This worked great, but we still hit some limitations!

SQS Job Limitations

Some time later, we found that we occasionally hit SQS's message size limitation of 256KB.

It turns out that some JSON payloads were quite large. The one that tipped us off was a PR created from a large number of commits.

The workaround was to save the payload to S3.

The SQS job would contain a reference to the S3 "key" (the filename), and the code running the job would have to grab the JSON payload from S3 before processing it.

This change had some extra benefits - we would never lose a JSON payload if we ever needed to test it out for bugs, use it to improve unit tests, or replay it for a customer. We could also use S3 lifecycle rules to delete old webhooks after a period of time.
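For example, a lifecycle configuration like this (the prefix and retention period are hypothetical) expires stored payloads after 90 days, and can be applied with the AWS CLI's put-bucket-lifecycle-configuration command:

```json
{
    "Rules": [
        {
            "ID": "expire-old-webhook-payloads",
            "Filter": { "Prefix": "webhooks/" },
            "Status": "Enabled",
            "Expiration": { "Days": 90 }
        }
    ]
}
```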

Adding in Lambda

We found that API Gateway didn't allow us to save the payload to S3 AND fire off an SQS job.

One option would be to have API Gateway save the HTTP request to S3 and then use S3 events to create an SQS job when a new object was added to the bucket.

However, we needed control over how the SQS job data was set, as we needed it to be in a format that Laravel could read natively for artisan queue:work to function.

To do this, we used Lambda. As Vapor users know (or not, since Vapor is a great abstraction), Lambda integrates very well with API Gateway. Being able to run our own code for each webhook let us do some interesting things (cheaply and at scale).

We're still using API Gateway version 1, since we had all this setup already. However, if we were re-creating it now, we would use API Gateway V2, which is a bit simpler.

The Lambda function used is our second bit of Golang to go into production. It reads some of the headers and a bit of the JSON payload to determine which Git provider the webhook is from, along with the project name. Then we save the JSON payload to S3 and fire off the Laravel-compatible SQS job.

Laravel's queue worker then picks the job up, gets the JSON from S3, and processes the payload.

This works great - it scales to a level we'll never really need, and makes sure our own servers don't require needlessly complicated or expensive setups.

API Gateway + Lambda at our scale is also much cheaper than a load balancer and extra web servers!

Between Lambda's free tier and API Gateway's pricing, we're spending roughly a dollar a month on this setup.

The Lambda Function

In our case, using API Gateway V1 as a proxy to a Lambda function, we could start from this boilerplate (using Golang).

Our own code is pretty simple. We read the HTTP headers, guess which Git provider sent the webhook, save the JSON payload to S3, and fire off a job to SQS.

The interesting part is creating an SQS job that is compatible with Laravel.

It looks a bit like this:

type MessageBody struct {
	Job  string          `json:"job"`
	Data MessageBodyData `json:"data"`
}

type MessageBodyData struct {
	Provider string            `json:"provider"`
	Method   string            `json:"method"`
	Query    string            `json:"query"`
	S3File   string            `json:"s3File"`
	Headers  map[string]string `json:"headers"`
}

func handleRequest(ctx context.Context, request events.APIGatewayProxyRequest) (events.APIGatewayProxyResponse, error) {
	// Omitted is our own boilerplate to detect the Git provider
	// and upload the JSON payload to S3

	message, err := json.Marshal(&MessageBody{
		// The Laravel job class and method called
		Job: "App\\Jobs\\ProcessQueuedWebhook@call",
		Data: MessageBodyData{ // Add the job $data
			Provider: provider,
			Method:   request.HTTPMethod,
			Query:    stringifyQueryParams(request.QueryStringParameters),
			S3File:   s3FileName,
			Headers:  request.Headers,
		},
	})

	if err != nil {
		level.Error(logger).Log("message", fmt.Sprintf("could not marshal sqs job: %v", err))
		return standardResp(), nil
	}

	svc := sqs.New(sess)
	_, err = svc.SendMessage(&sqs.SendMessageInput{
		DelaySeconds: aws.Int64(0),
		MessageBody:  aws.String(string(message)),
		// FIFO queue requires a MessageGroupId
		MessageGroupId: aws.String(provider + "/" + repository),
		QueueUrl:       aws.String(os.Getenv("CHIPPER_SQS_URL")),
	})

	if err != nil {
		level.Error(logger).Log("message", fmt.Sprintf("could not create sqs job: %v", err))
		return standardResp(), nil
	}

	return standardResp(), nil
}

Making a Laravel-ready queue job from outside of Laravel essentially means having a payload of:

{
    "job": "App\\Jobs\\SomeClass@foo",
    "data": {
        "foo": "bar"
    }
}

The SomeClass job class then needs a method foo() that accepts an SqsJob $job and a $data array. The major caveat is that you'll need to manage the job lifecycle yourself. This usually just means calling $job->delete() and/or $job->release() in the right places.