Don’t DDoS your own servers
On the first day after the launch of Disney Melee Mania (which you should check out if you haven’t), one of our services went down at 6pm SGT. It recovered shortly after. On the second day, the same thing happened. Once is happenstance, twice is coincidence, right? Not quite.
Given that both outages happened at precisely 6pm, the cause was probably something scheduled. We saw massive spikes in traffic the moment the clock struck 6pm, and they overwhelmed the service meant to handle these requests.
After some investigation, we found that our endpoints were getting flooded with repeated requests from numerous sources, which made us think we were under a DDoS attack. However, after digging a little bit more, we saw the same patterns in our internal development environment as well. This meant that it wasn’t a bad actor attacking us. We were doing it to ourselves.
As it turned out, our game client had a bug which, under certain circumstances, caused login requests to be resent dozens of times in the span of a few seconds. Unfortunately for us, the trigger conditions were very common among players who opened the game at 6pm. What happens at 6pm in our game? Players get a notification telling them that their shop’s items have refreshed. Consequently, many players would open the game around that time, and the two issues combined generated enough traffic to take out our servers for a short while.
We fixed the authentication issue, but the shop refresh design was still causing a smaller spike in traffic. These spikes can happen for any kind of scheduled event, and if they’re not correctly managed, trouble will ensue like it did for us. Fortunately, there are a few ways to manage the spikes by changing how the client makes the requests, or to eliminate the spikes altogether by changing the system’s design. Let’s talk about solutions (soon)!
DDoS stands for Distributed Denial of Service. Technically speaking, the term should only be used when load is maliciously generated for a targeted service, but inadvertently generating massive amounts of load for your own service does the same thing: degrading the service or even making it unavailable.
Whenever a player does something in Disney Melee Mania which needs a server response, a request is sent to the server containing data for the request, e.g. a request to unlock the Eve character. The server handles these requests as quickly as it can, and then responds to the client. Sometimes, these requests can fail for many reasons: the request may no longer be valid by the time it’s processed by the server, or maybe the device lost connectivity momentarily. Usually when a request fails, the client will then retry the request. Regardless of whether the request will eventually be successful, every request (retry or not) burdens the servers with more load.
While we can utilise autoscaling setups since we’re using a cloud provider, autoscaling won’t have enough time to react if the spike is large and grows fast enough, and the service will be affected anyway. It’s important to give any scaling setups enough time to react.
As with most load-related issues, eliminating the spike is usually best, and ideally that’s done by spreading the load over a much longer period of time (e.g. over the whole day rather than in a 1-minute period). Sometimes, that’s not possible, so we would have to spread the load by modifying how the traffic is generated, i.e. how the requests are sent.
Changing the request traffic pattern
Let’s talk about the less complicated solution: changing how the requests are sent. The goal is to spread out the incoming traffic over a slightly longer period of time, usually on the order of seconds or minutes. This can be accomplished by jittering the requests, making each client send the request a random amount of time later. For example, if the client was supposed to send it at 18:00:00, it would now send it at some time between 18:00:00 and 18:00:05 instead. If there were 1000 clients making the same request, the server would now need to handle roughly 200 requests per second for 5 seconds, rather than all 1000 requests in the first second, which is much more manageable.
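To make the jitter arithmetic concrete, here is a minimal Python sketch rather than our actual client code. The names `jittered_delay` and `simulate_arrivals` are hypothetical, and the 5-second window and 1000 clients are simply the numbers from the example above.

```python
import random


def jittered_delay(max_jitter_s=5.0, rng=None):
    """Return a random delay, in seconds, within [0, max_jitter_s]."""
    rng = rng if rng is not None else random
    return rng.uniform(0.0, max_jitter_s)


def simulate_arrivals(num_clients=1000, max_jitter_s=5.0, seed=42):
    """Count how many requests land in each one-second bucket.

    Every client was due to send at the same instant, but each one
    waits its own jittered delay first.
    """
    rng = random.Random(seed)  # seeded only so the simulation is repeatable
    buckets = [0] * int(max_jitter_s)
    for _ in range(num_clients):
        delay = jittered_delay(max_jitter_s, rng)
        # Clamp so a delay of exactly max_jitter_s falls in the last bucket.
        index = min(int(delay), len(buckets) - 1)
        buckets[index] += 1
    return buckets


if __name__ == "__main__":
    for second, count in enumerate(simulate_arrivals()):
        print(f"18:00:0{second} -> {count} requests")
```

Each bucket ends up near 200 requests instead of a single bucket holding all 1000. In a real client, the delay would be applied with a timer before the request is sent.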
What if the load is still too much, and some requests are failing because of that? Remember that with each failed request, the client will keep retrying the request, causing additional load, and this can result in a cascade of retries that, when coupled with the original load, will overwhelm the server. The solution here is to implement an exponential backoff, which progressively reduces the rate of retries until a request succeeds or a maximum number of attempts is reached. For example, the first retry can be made after a delay of 2s, the second after 4s, the third after 8s, with the delay doubling on each attempt. This spreads the additional load from retries over a much longer duration.
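Here is a rough sketch of exponential backoff in Python, not the implementation in our game client. It assumes a base delay of 2s that doubles on every retry, a 60s cap, and at most 5 retries; all of these values, along with the names `backoff_delays` and `send_with_retries`, are illustrative assumptions.

```python
import time


def backoff_delays(base_s=2.0, factor=2.0, cap_s=60.0, max_retries=5):
    """Return the delay before each retry, growing by `factor` up to `cap_s`."""
    return [min(cap_s, base_s * factor ** n) for n in range(max_retries)]


def send_with_retries(send, base_s=2.0, factor=2.0, cap_s=60.0,
                      max_retries=5, sleep=time.sleep):
    """Call `send` until it returns True or the retries run out.

    `send` is a hypothetical callable that returns True on success.
    `sleep` is injectable so the function can be exercised without waiting.
    """
    if send():
        return True
    for delay in backoff_delays(base_s, factor, cap_s, max_retries):
        sleep(delay)  # wait longer and longer between attempts
        if send():
            return True
    return False  # give up and let the caller surface the error
```

With the defaults, the waits are 2s, 4s, 8s, 16s and 32s. Adding a random jitter to each delay keeps clients that failed at the same moment from retrying in lockstep.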
And if the load is still too much? Then load shedding may need to come into effect. Usually, requests are processed in the order that they come in. If there is a request that’s behind many others, it has to wait for its turn to be processed. Load shedding simply drops requests so that the queue of requests doesn’t grow too large. There are many different strategies for choosing which requests to drop. Most commonly, the newest or oldest requests are dropped, but there can also be more intelligent control by prioritising certain types of (usually less important) requests to drop.
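The two simplest strategies, dropping the newest or the oldest request, can be sketched as a bounded queue. This is an illustrative Python sketch, not how our servers are implemented; the `SheddingQueue` class and its method names are hypothetical.

```python
from collections import deque


class SheddingQueue:
    """A FIFO request queue that sheds load once it reaches `capacity`."""

    def __init__(self, capacity, drop="newest"):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        if drop not in ("newest", "oldest"):
            raise ValueError("drop must be 'newest' or 'oldest'")
        self.capacity = capacity
        self.drop = drop
        self.items = deque()
        self.dropped = 0

    def offer(self, request):
        """Enqueue `request`; return whichever request was shed, or None."""
        if len(self.items) < self.capacity:
            self.items.append(request)
            return None
        self.dropped += 1
        if self.drop == "newest":
            return request  # reject the incoming request
        shed = self.items.popleft()  # evict the request that waited longest
        self.items.append(request)
        return shed

    def take(self):
        """Return the next request to process, or None if the queue is empty."""
        return self.items.popleft() if self.items else None
```

Dropping the oldest request can make sense when the client has probably already timed out and retried, so that processing it would be wasted work. A priority-aware version would pick the victim by request type instead.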
Changing the design
What we’ve covered so far handles the spikes at the networking level. We can go a level higher in abstraction and change the system design to prevent these spikes from happening in the first place.
Let’s take the example of refreshing each player’s shop daily. In the original design, when it’s 6pm, the client makes a request to the server, gets the new shop data and displays the new shop items to the player. Instead of retrieving the data at exactly 6pm, the implementation could be changed to store the next set of items on the device beforehand, so no request would have to be made at exactly 6pm. This does mean that changes will need to be made so that the client retrieves the new items before the switch, perhaps right after the client has authenticated with the server, but as a result, this change will distribute the load more evenly across the day!
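A client-side version of this idea might look like the Python sketch below. It is only an illustration of the approach, not our shipped design: `ShopRotation`, `ShopCache` and their fields are hypothetical names, and it assumes the server sends the next rotation together with the time it becomes active.

```python
from dataclasses import dataclass


@dataclass
class ShopRotation:
    items: list
    active_from: float  # epoch seconds at which this rotation goes live


class ShopCache:
    """Holds the current shop plus a prefetched next rotation."""

    def __init__(self, current, upcoming=None):
        self.current = current
        self.upcoming = upcoming

    def store_upcoming(self, rotation):
        """Save the next rotation, e.g. right after authentication."""
        self.upcoming = rotation

    def items_at(self, now):
        """Return the items to display at `now`, switching over locally."""
        if self.upcoming is not None and now >= self.upcoming.active_from:
            # The refresh time has passed: swap in the prefetched items
            # without contacting the server.
            self.current = self.upcoming
            self.upcoming = None
        return self.current.items
```

The request for the upcoming rotation now happens whenever each player logs in, so the server sees that traffic spread across the day rather than concentrated at 6pm.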
How the design should be changed is very specific to the system. There really isn’t a one-size-fits-all solution, so it’s important to think about user behavior on a large scale while designing your features to see how it may affect the incoming traffic patterns.
We’ve seen how load spikes can potentially ruin the experience for players, and how bugs and design choices can lead to pretty disastrous results. Retries and backoff should be implemented for all network requests, and where appropriate, jitter too. Remember to spend some time thinking about how your players will act for each feature, so you don’t end up DDoSing your own servers!