There are a number of ways to host a .NET API or worker on the three major cloud providers. Even with all the different options available, such as PaaS and CaaS, the most common solution still used today is to have an application running on a virtual machine in the cloud.
The ever-increasing connectivity of the world doesn’t allow for the luxury of downtime to perform an upgrade on the production environment. So what do you do when you’ve created a new version of your service but don’t want to blindly upgrade all your instances and hope for the best?
Canary deployments are one of the safest and cheapest ways to deploy to production while ensuring that we don’t experience any major outages. The term comes from the “Canary in the Coal Mine” metaphor, rooted in the real-life practice of miners taking caged canaries down into the mine with them. Canaries are extremely sensitive to environmental changes and acted as an early warning system that a potentially harmful gas was in the air. If a canary fell over, it signalled an immediate evacuation of the mine.
Much like the miners, a canary deployment involves upgrading one of our service instances to the latest version and monitoring it for any unexpected behaviours. At the first sign of trouble, we can revert the affected instance to the old version of the code. Otherwise we’d deploy to the remaining instances and complete any post-deployment work.
How It Applies to .NET Applications
.NET APIs are generally run in IIS on a set of machines behind a load balancer. That’s convenient, because a load balancer is exactly what’s needed to enable canary deployments: it gives us the flexibility and robustness to upgrade a single machine at a time and to manage the traffic directed to our canary service.
Workers will generally be built around Console Applications or Windows Services and fed work by a queue. The queue mechanism makes it easy to scale the service across multiple instances since each instance of the service only dequeues a message when it’s ready to process it.
Let’s look at the process in depth by way of an example. For our fictional scenario, we have a worker that processes a message from a queue whenever a user deposits an amount into their bank account. The message indicates the amount of the transaction and the account to credit. The worker reads the message and updates the affected account accordingly. This worker is deployed across three instances to share the load of incoming messages.
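The worker’s core loop can be sketched as follows. This is a minimal, language-agnostic sketch in Python rather than the actual .NET implementation; the message fields, account names, and in-memory “database” are all illustrative assumptions.

```python
import queue

# Illustrative in-memory stand-in for the accounts database.
accounts = {"acct-1": 100.0}

def process_deposit(message, db):
    """Apply a single deposit message to the affected account."""
    account = message["account"]
    db[account] = db.get(account, 0.0) + message["amount"]

def run_worker(work_queue, db):
    """Dequeue messages one at a time, as each worker instance would."""
    while True:
        try:
            message = work_queue.get_nowait()
        except queue.Empty:
            break  # a real worker would block or poll instead of exiting
        process_deposit(message, db)
        work_queue.task_done()

q = queue.Queue()
q.put({"account": "acct-1", "amount": 25.0})
q.put({"account": "acct-2", "amount": 10.0})
run_worker(q, accounts)
print(accounts)  # {'acct-1': 125.0, 'acct-2': 10.0}
```

Because each instance only dequeues a message when it’s ready, adding or removing instances (as a canary deployment does) requires no coordination between workers.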
A few weeks later you add a new feature to the worker that writes the same amount to a journal log. You develop and test the feature as you normally would. The pull request gets merged, the build passes, and the artifact is deployed to your pre-production environments.
The steps are a bit different when it comes time to deploy to production:
1. One of the instances (the canary) is removed from service and stops processing requests. For our worker example, this means stopping the service so that it no longer dequeues messages. For an API, the equivalent operation is removing the instance from the load balancer.
2. The new build is deployed to the canary instance from step 1.
3. The instance is put back into service and starts processing requests again.
4. Validate that the deployment is working as expected. This is ideally accomplished by running a suite of automated tests, but can also be done by monitoring the incoming production traffic for errors. See below for more details on how to do this.
5. Make a go/no-go decision for the rest of the deployment:
   - If everything went well, it’s a simple matter of repeating steps 1 to 3 for each of the remaining instances.
   - If there were issues serious enough to warrant a rollback, repeat steps 1 to 3 to downgrade the canary instance to the old version of the code.
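The steps above can be condensed into a small orchestration sketch. The helper functions passed in (`deploy_one`, `validate`) are hypothetical stand-ins for calls to your load balancer, deployment tool, and monitoring system, not a real API.

```python
def rollout(instances, new_version, old_version, deploy_one, validate):
    """Deploy new_version to one canary, validate, then roll forward or back."""
    canary, *rest = instances
    deploy_one(canary, new_version)        # steps 1-3 applied to the canary
    if validate(canary):                   # step 4: watch tests/traffic
        for instance in rest:              # go: upgrade the remaining instances
            deploy_one(instance, new_version)
        return "rolled-forward"
    deploy_one(canary, old_version)        # no-go: downgrade the canary
    return "rolled-back"

# Usage with fake helpers that just record which version each instance runs:
versions = {}
def fake_deploy(instance, version):
    versions[instance] = version

result = rollout(["i-1", "i-2", "i-3"], "v2", "v1",
                 fake_deploy, validate=lambda i: versions[i] == "v2")
print(result, versions)  # rolled-forward {'i-1': 'v2', 'i-2': 'v2', 'i-3': 'v2'}
```

The key property is that only one instance is ever at risk until the go/no-go decision has been made.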
Validating and Monitoring Results
The simplest way to validate the deployment on the canary instance is to monitor the production traffic coming into it once it’s operational again. You may want to expose the new feature to only a small subset of incoming traffic, which can be controlled through weighted traffic-routing. The three major clouds all support this in different ways.
A good monitoring tool (such as New Relic) will come in handy if you’re going down the production-traffic path to validate the deployment. These tools provide visual ways to detect errors or crashes in your service. New Relic supports ASP.NET applications out of the box, but requires a bit of extra work to monitor a non-IIS application such as a Windows Service.
Running a suite of automated tests against the canary is ideal, but requires a bit of planning ahead of time since you need a way to target the canary once it’s back in service. Otherwise you’ll have no idea whether the tests ran against the right instance. There are a few different strategies to accomplish this, but the easiest is probably to configure the load balancer to redirect traffic based on a header or IP address.
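The header-based approach can be sketched like this. The header name `X-Canary` and the instance names are assumptions for illustration; real load balancers implement the equivalent matching through their routing rules.

```python
def route(request_headers, stable_instances, canary_instance):
    """Send requests carrying the opt-in header to the canary."""
    if request_headers.get("X-Canary") == "true":
        return canary_instance
    return stable_instances[0]  # a real LB would balance across the pool

# Test traffic opts in via the header; normal traffic never sees the canary.
print(route({"X-Canary": "true"}, ["i-1", "i-2"], "canary"))  # canary
print(route({}, ["i-1", "i-2"], "canary"))                    # i-1
```

Your automated test suite then simply sets the header on every request, guaranteeing the assertions run against the canary and not a stable instance.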
You’ll also want to invest some time to automate the deployment process. One of the most popular tools to perform deployments is Octopus Deploy, and they’ve written an extensive article on how to achieve Canary deployments.
There’s one aspect of canary deployments which I’ve avoided so far: what to do in the case of a breaking schema change to the database. It’s a bit more complicated, and for that reason I’m going to write a separate article about it in the coming weeks.