Intercepting Production Issues? Let Metrics and Alarms do the work


Once an application starts running in production, we typically hope that everything will run smoothly. After all, the team did their best, so what could go wrong?

However, the moment you are confident about your product is also the time you have to be the most vigilant. After all, we have Murphy’s Law for a reason. We can do sanity checks and have the support team watch out for reported issues, but some problems don’t show themselves to users even as they break our application from the inside. For instance, how would you know about a piece of code that fails only when enough users are using a certain feature at the same time? And if the service restarts gracefully, would you even get a chance to learn what’s causing it, or that it’s happening at all? Even with our five senses on alert, such problems can live in our application rent-free. But knowing this doesn’t have to be a cause for anxiety.

At Mighty Bear Games, we aim to be proactive when it comes to production issues. We watch out not just for things to be fixed, but for opportunities to make the player experience better. But that doesn’t mean we keep people on-call around the clock — we also value quality rest, and that shouldn’t come at the price of keeping an application healthy. This is where metrics and alarms come in handy.

We typically set them up on our backend servers, which gives us the flexibility to focus on other things while staying updated on current server states. We have a CloudWatch dashboard in AWS containing different statistics that we check regularly. These graphs show us the trend of each metric over time. Additionally, we’ve added alarms for any signs that would require immediate action. Let me share the metrics that have been most helpful to us recently and how they have, on one occasion, saved us from actual doomsday.

Maximum CPU Usage

CPU Utilisation measures the percentage of CPU units used by the service. We aggregate the measurement over a span of one minute for each backend service and take only the maximum. Our server instances are set to scale up and down based on CPU usage, and this metric also shows us the trend of high- and low-activity periods. There is no alarm associated with it, but we monitor it regularly to see whether the trend goes way beyond what looks normal or expected.
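As a rough sketch of how such a query can be built, here are the kinds of parameters you would pass to CloudWatch’s GetMetricStatistics API (for example via boto3’s `cloudwatch.get_metric_statistics(**params)`). The namespace, cluster, and service names below are placeholder assumptions, not our actual setup:

```python
from datetime import datetime, timedelta, timezone

def cpu_max_params(service_name: str, hours: int = 24) -> dict:
    """Build GetMetricStatistics parameters for the per-minute maximum
    CPU utilisation of one backend service (names are illustrative)."""
    now = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/ECS",  # assumption: services run on ECS
        "MetricName": "CPUUtilization",
        "Dimensions": [
            {"Name": "ClusterName", "Value": "game-backend"},
            {"Name": "ServiceName", "Value": service_name},
        ],
        "StartTime": now - timedelta(hours=hours),
        "EndTime": now,
        "Period": 60,              # aggregate over one-minute windows
        "Statistics": ["Maximum"], # keep only the peak in each window
    }

params = cpu_max_params("matchmaking")
```

The same shape works for any metric; only the namespace, dimensions, and statistic change.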

A few months back, during one of our server health checks, we noticed an increase in this metric for one of our services. There were no feature changes added to this service, so the trend was highly unexpected. While it wasn’t causing any bugs, we still investigated right away because it looked so out of the ordinary.

CPU Utilization trend in a span of 30 days

As you can see in the image, the trend in the latter half is significantly and consistently higher than in the first half. The middle, where the trend changed, was when we deployed a new custom metric that would run for each incoming request. The rest of the changes that came with the same deployment were minor and affected only one part of the whole service. Among all the new code, the custom metric was the most likely culprit, as it added logic that runs on every request.

After removing this custom metric, the trend went back to normal:

CPU Utilization after doing hotfix

In this image, the steep decrease was on the day we did the hotfix. This shows that the new custom metric was using up significant resources, so we had to decide whether it was worth keeping. Without the metric telling us how the CPU was being utilised, we would not have been able to re-evaluate our internal tools and decide which ones are practical.

Count of Log Errors

Count of log errors for different services

Every developer must have used ‘logError()’ and its many variations. Logging errors and additional messages helps us investigate issues, since it tells us the state of things when an error happens. This alone is helpful, but the level of detail doesn’t have to end there. With this metric, we monitor the number of error-level logs in each service. Each color you see in the graph corresponds to a service.
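One common way to turn error logs into a count metric is a CloudWatch Logs metric filter that matches error-level lines and publishes one data point per match. This is an illustrative sketch of the parameters for the PutMetricFilter API; the log group layout, namespace, and filter pattern are assumptions, not our actual configuration:

```python
def error_count_filter(service: str) -> dict:
    """Build PutMetricFilter parameters that count error-level log
    lines for one service (names and pattern are illustrative)."""
    return {
        "logGroupName": f"/backend/{service}",  # assumed log group layout
        "filterName": f"{service}-error-count",
        "filterPattern": '"ERROR"',  # match lines containing ERROR
        "metricTransformations": [{
            "metricName": "LogErrorCount",
            "metricNamespace": "Game/Backend",
            "metricValue": "1",    # each matching line counts as one
            "defaultValue": 0.0,   # report zero when nothing matches
        }],
    }

params = error_count_filter("matchmaking")
```

With one filter per service, the resulting metrics can be graphed side by side, which gives exactly the per-service colored lines described above.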

Some errors, say a mismatch between an item and its currency, may happen occasionally and still need to be logged even though they don’t need to be fixed. In such cases, we rely on trends to know whether an error needs our attention. Having data like this gives us the confidence that services are working as expected even after deploying new changes.

In the image above, you only see two obvious colors, but if you look closely, there is another line in purple. When the metric was first set up, all the services were producing noticeable counts, as you can see in orange and green. However, a lot of those errors neither broke the game nor required any action. This is an example of how a metric can be used incorrectly: when a metric loses its real meaning, we risk misinterpreting the data. Therefore, we had to review the logs and make sure we only use ‘logError()’ for things we actually need to know about. We don’t want to panic just because a player’s session expired, right?

Count of HTTP Response Codes

Count of different HTTP response codes from all services

This metric counts the number of times our endpoints sent back specific response codes. The most useful to us is code 500, which means something failed in the server while processing the request. However, since we do server health checks only at regular intervals, we needed an alarm for urgent cases. Unfortunately, we had to learn of this need the hard way.

In Disney Melee Mania, we have game events that run periodically. The availability of rewards and items depends on which event is running for that period, so we need to make sure that there is always an event up. This caused an issue when, during an event rollover, there was nothing scheduled next. Every time the client requested an event, the server threw an error. Although this can be handled gracefully by adding a fallback, it would still be incorrect not to have an event.

We were lucky to discover this issue during a low-activity period. However, we couldn’t rely on luck all the time, so one of our action points moving forward was to set up an alarm for when the 500 error count rises abnormally. The tricky part is that the count may vary from time to time depending on how many players are active. In other words, we couldn’t use a fixed number to tell if something’s off. What we found perfect for this scenario was CloudWatch’s anomaly detection band. It uses machine learning to follow a trend, and all you have to do is supply the acceptable range within which the actual value may vary.
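Here is a sketch of what such an alarm could look like as parameters for CloudWatch’s PutMetricAlarm API. The band is expressed with the ANOMALY_DETECTION_BAND metric math function, whose second argument controls how wide the acceptable range is; the metric names and numbers below are illustrative, not our actual configuration:

```python
def http500_anomaly_alarm(band_width: float = 2.0) -> dict:
    """Build PutMetricAlarm parameters for an anomaly-detection alarm
    on the count of HTTP 500 responses (names are illustrative)."""
    return {
        "AlarmName": "http-500-anomaly",
        # fire only when the value crosses the band's upper edge
        "ComparisonOperator": "GreaterThanUpperThreshold",
        "EvaluationPeriods": 3,
        "ThresholdMetricId": "band",  # compare m1 against the band
        "Metrics": [
            {
                "Id": "m1",
                "MetricStat": {
                    "Metric": {
                        "Namespace": "Game/Backend",   # assumed namespace
                        "MetricName": "Http500Count",  # assumed metric name
                    },
                    "Period": 300,  # five-minute buckets
                    "Stat": "Sum",
                },
                "ReturnData": True,
            },
            {
                "Id": "band",
                # width of the acceptable range around the learned trend
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})",
                "ReturnData": True,
            },
        ],
    }

params = http500_anomaly_alarm(2.0)
```

A wider band tolerates bigger swings before alerting, so the width is something to tune against how noisy the traffic is.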

Alarm message when data is out of bounds

Alarms such as these are most helpful when the whole team is off enjoying studio breaks and holidays. We just let metrics and alarms do the work.

If you want to know more about our experience with production issues and how we handle them, Andrew Ching, our Senior Backend Engineer, wrote an article about our very first adventure after Disney Melee Mania went live. All these experiences shape us to become more proactive. If you’ve had any experiences like this, we would love to hear about them. Go ahead and let us know in the comment section, and don’t forget to leave me some claps!

Intercepting Production Issues? Let Metrics and Alarms do the work was originally published in Mighty Bear Games on Medium, where people are continuing the conversation by highlighting and responding to this story.