Applications built in ASP.NET Core can be healthy or unhealthy, just like humans. A doctor performing a health check on a human typically checks vitals, and then prescribes further tests for anything that doesn’t seem right. It’s not quite as straightforward with an ASP.NET Core health check, since each application is unique, with different requirements and tolerances for failure.
The rules of thumb below are what I use to decide what should be verified in a health check. It’s a starting point that can be used to reflect on what should happen in your own applications. But before getting into that, let’s look at why health checks are needed in the first place.
Why You Might Need a Health Check
An ASP.NET Core application needs to be resilient to failures, but there are times when a failure can’t be prevented. Today’s .NET applications are typically hosted in IIS behind a load balancer, in a Docker container, or in a Pod in Kubernetes. All of these hosts use health checks to determine whether an application is able to process incoming requests. Kubernetes even goes a step further by restarting an unhealthy container in the hope of bringing it back online.
Here’s a technology-agnostic diagram of a host working with an application to determine its underlying health condition. (I used Kubernetes as the host system, but it could be replaced by any other host and the diagram would be the same.)
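To make the host’s side of that diagram concrete, here’s a sketch of what a Kubernetes liveness probe for such an application might look like. The endpoint path, port, and timing values are assumptions; adjust them to match your own deployment.

```yaml
# Excerpt from a container spec in a Pod/Deployment.
# Kubernetes calls the endpoint periodically; enough consecutive
# failures cause the container to be restarted.
livenessProbe:
  httpGet:
    path: /healthz          # assumed health endpoint path
    port: 80
  initialDelaySeconds: 10   # give the app time to start up
  periodSeconds: 30         # how often the probe runs
  failureThreshold: 3       # restart after 3 consecutive failures
```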
A health check endpoint returns one of two status codes: 200 OK or 503 Service Unavailable. An OK response means the application is healthy, whereas a Service Unavailable indicates some kind of problem. At that point, traffic is directed to other, healthy instances by the host system. If there are no healthy instances, then a 500-level error is returned to the caller.
Health checks are executed periodically throughout the lifetime of the application. In that way, an application can go from healthy to unhealthy and back to healthy, as the state of its dependencies changes.
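In ASP.NET Core, the built-in health check middleware produces exactly this behavior. Here’s a minimal sketch of the wiring, assuming the .NET 6+ minimal hosting model; the `/healthz` path is my choice, not a framework default.

```csharp
var builder = WebApplication.CreateBuilder(args);

// Register the health check services. With no individual checks added,
// the endpoint simply reports Healthy, confirming the process is up
// and able to serve requests.
builder.Services.AddHealthChecks();

var app = builder.Build();

// Returns 200 OK when all registered checks pass,
// and 503 Service Unavailable when any of them fail.
app.MapHealthChecks("/healthz");

app.Run();
```

The host system then only needs to poll `/healthz` and act on the status code.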
It’s theoretically possible for one instance of a service to be healthy while another is not. But what usually happens is that an application that’s scaled to many instances will have the same health status across all its instances, since they all perform the same health checks.
With that in mind, let’s identify the different components of an application that can be checked.
What Should Be Checked
Before writing any code, take some time to map out all the dependencies in your application. Think about what you do with each of them and what would happen if there was a degradation of service, or worse, a complete failure. Then ask yourself the following question for each dependency: can the application still perform its work and/or provide a meaningful response, even if not as complete?
- If the answer is no, it’s likely that some kind of health check is needed for that dependency. We’ll look at some examples in just a bit.
- If the answer is yes, it may not be necessary to verify the dependency’s state. The classic example of this is the Netflix app that provides personalized suggestions. If the personalized suggestions service isn’t available, the application can fall back to a pre-defined list of suggestions.
The majority of an application’s dependencies fall into three categories:
- The web app or API itself
- Other internal applications or APIs
- External cloud and third-party services
Let’s look at each of those categories separately to see what we should do about their health state.
The Web App or API Itself
One of the most common errors when deploying an application for the first time is that its configuration doesn’t work. Sometimes a connection string is incorrectly formatted; other times it hasn’t been replaced with the right value for the environment in question.
A health check is a great place to validate that your connection strings are all valid, be they for a database, messaging service, or anything else for that matter. You can make sure that every connection string is valid by opening a connection to the underlying service in the health check.
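As a sketch of that idea, the custom health check below opens a database connection to prove the connection string works. It assumes the `Microsoft.Data.SqlClient` package; the class name and the "Default" connection string name are hypothetical, and community packages such as AspNetCore.HealthChecks.SqlServer offer ready-made equivalents.

```csharp
using Microsoft.Data.SqlClient;
using Microsoft.Extensions.Diagnostics.HealthChecks;

public class ConnectionStringHealthCheck : IHealthCheck
{
    private readonly string _connectionString;

    public ConnectionStringHealthCheck(string connectionString)
        => _connectionString = connectionString;

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        try
        {
            // Opening a connection proves the string is well-formed and
            // that the server is reachable with these credentials.
            await using var connection = new SqlConnection(_connectionString);
            await connection.OpenAsync(cancellationToken);
            return HealthCheckResult.Healthy("Database connection succeeded.");
        }
        catch (Exception ex)
        {
            return HealthCheckResult.Unhealthy("Database connection failed.", ex);
        }
    }
}
```

It can then be registered alongside the endpoint, for example with `builder.Services.AddHealthChecks().AddCheck("database", new ConnectionStringHealthCheck(builder.Configuration.GetConnectionString("Default")));`.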
Other Internal Applications or APIs
Your application’s health often depends on the health of other services that you’ve developed yourself. Application A might need to call service B, say to authenticate a user’s credentials. In this scenario, Application A doesn’t need to check service B’s health. Service B should be responsible for checking its own health and signaling when it’s unhealthy.
You might consider adding a health check on an internal dependency if the caller has strict response time requirements. But even then, you’ll want to use some aggregate information rather than a single call to decide if the response time has exceeded its limits.
External Cloud and Third-Party Services
Cloud services are designed to be incredibly reliable. There are still occasional outages here and there, but for the most part, any resource provided by the big cloud providers is likely to be more reliable than any application you or I will ever build. For that reason, I find it somewhat pointless to check the health of services in AWS, Azure, or Google Cloud.
It’s much more likely that any failure with a cloud service is due to a configuration error. As mentioned previously, you could check the connection string or credentials to the underlying resource, but beware that transient errors such as timeouts can still happen. It’s not a bad idea to include simple retry logic to avoid generating false positives.
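A minimal retry sketch inside a health check might look like the following. The attempt count, the one-second delay, and the `_storageClient.PingAsync` dependency call are all placeholders for your own logic.

```csharp
public async Task<HealthCheckResult> CheckHealthAsync(
    HealthCheckContext context, CancellationToken cancellationToken = default)
{
    const int maxAttempts = 3;
    for (var attempt = 1; attempt <= maxAttempts; attempt++)
    {
        try
        {
            // Hypothetical call to the cloud resource being verified.
            await _storageClient.PingAsync(cancellationToken);
            return HealthCheckResult.Healthy();
        }
        catch when (attempt < maxAttempts)
        {
            // Likely a transient fault: wait briefly and try again instead
            // of immediately reporting the service as unhealthy.
            await Task.Delay(TimeSpan.FromSeconds(1), cancellationToken);
        }
        catch (Exception ex)
        {
            return HealthCheckResult.Unhealthy(
                $"Resource unreachable after {maxAttempts} attempts.", ex);
        }
    }
    return HealthCheckResult.Unhealthy(); // unreachable, but satisfies the compiler
}
```

If you need exponential backoff or jitter, a library such as Polly provides more complete retry policies than this hand-rolled loop.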
An application or API can also rely on a service that’s run by someone else. Public APIs like Facebook’s and Twitter’s are extremely reliable, but if you’re using John Doe’s Random API, it’s worth setting up an early warning via a health check that something has gone awry. While you’re at it, consider what the application should do when the service does eventually go down: is there a way to degrade the experience without completely killing your own service?
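As a sketch, a check on an external dependency can be as simple as pinging its status endpoint over HTTP. The URL here is a placeholder, and reporting Degraded rather than Unhealthy on failure is one way to raise that early warning without taking your own instance out of rotation.

```csharp
using Microsoft.Extensions.Diagnostics.HealthChecks;

public class ExternalApiHealthCheck : IHealthCheck
{
    private readonly HttpClient _httpClient;

    // Inject an HttpClient, e.g. one registered via AddHttpClient.
    public ExternalApiHealthCheck(HttpClient httpClient) => _httpClient = httpClient;

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        try
        {
            using var response = await _httpClient.GetAsync(
                "https://api.example.com/status", cancellationToken);

            return response.IsSuccessStatusCode
                ? HealthCheckResult.Healthy()
                : HealthCheckResult.Degraded(
                    $"External API returned {(int)response.StatusCode}.");
        }
        catch (HttpRequestException ex)
        {
            return HealthCheckResult.Degraded("External API is unreachable.", ex);
        }
    }
}
```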
Even after all this research, I still struggle to find the right balance between checking too little and too much. Put in too little, and your health checks are nothing more than a formality. Put in too much, and you’ll be burdening your service with extra I/O operations.
I’ll summarize my thoughts on the matter with the following bullet points:
- Think about each dependency your service takes on, and how best to deal with it when it eventually fails.
- Implement checks for the dependencies that your service can’t live without.
- Implement checks for services that are more likely to fail.
- Find workarounds for less important dependencies so that your application isn’t brought to its knees when the dependency fails.