What’s your Cloud strategy?
“Cloud” is no longer the buzzword it used to be. And this is because anyone who needs a server has the offerings of a bazillion cloud providers to choose from. It takes a few minutes of your time (and the last four digits of your credit card) to bring up a server. So which provider should you be using? Should you be using one of them or all of them? Should you be using managed services or building your own? I will try to answer some of your questions by describing the cloud strategy at Mighty Bear Games, how it has evolved over the last few years and the lessons we have learned along the way.
World of Cloud
Back in 2018 the team was hard at work on World of Legends — an open-world MMORPG (some of us call it ma-morp-gah). Like other games in the genre, WoL required a suite of online features to function. Players could roam shared worlds, challenge other players to arena battles, interact with NPCs, level up their characters, defeat time-limited bosses, fight as a guild, compete on global leaderboards, and chat with each other — you get the idea: WoL was massive.
Massive also happened to be the name of the server-side tech that would power WoL. Massive was built using Java and heavily inspired by Netflix OSS. Without going into details about the architecture — an entire post by itself — some of the components involved were Kafka, Redis, Eureka and SQL databases.
We used AWS for Massive (see how inspired we were by Netflix). The cloud strategy at the time was simple — we didn’t want vendor lock-in. This meant we would only use vanilla AWS offerings, such as EC2 instances, that would let us run any workload we wanted. Anything that required AWS-specific tech was a no-go. So if all of AWS’ data centres had to shut down, or if we weren’t happy with AWS, or if we wanted to run our services across multiple cloud providers, switching to Google Cloud or Microsoft Azure would be a breeze. Now that sounds like a sensible plan. Why would anyone want to be locked in with a vendor, right? Some of you reading this will agree. The rest of you still have a chance at being sensible.
We had no idea of the scope of the work we had agreed to undertake. Initially, it was just a matter of getting the individual components connected and running on EC2. Infrastructure-as-code using Ansible was a step in the right direction. But of course there were bugs, and to debug them we needed logs. And before that, we needed ssh keys for all the devs so they could securely access the EC2 instances and grep the logs. Great, that’s one less problem to think about. But one night something went wrong and our players were unable to enter battles.
Dev 1: Hmmm… there are exceptions in the logs — looks like a data migration issue from the last update. We need a way to be alerted to exceptions like these. How should we do it?
Dev 2: Ummm… how about we set up a cron job that greps the logs for the words “Error” or “Exception” and sends a notification to Slack with the details?
Dev 1: Wow, that’s a fantastic idea! Let’s get it in asap. That’s one less thing to worry about.
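For illustration, the kind of script that cron job might have run looks something like this. This is a sketch, not our actual code: the log path, the Slack webhook URL, and the script location in the crontab comment are all placeholders.

```python
"""Scan a log file for errors and post them to Slack — a sketch of the
cron-job approach described above. Paths and the webhook URL are made up."""
import json
import urllib.request

LOG_PATH = "/var/log/massive/service.log"  # hypothetical log location
SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def find_errors(lines, keywords=("Error", "Exception")):
    """Return the log lines containing any of the given keywords."""
    return [line for line in lines if any(k in line for k in keywords)]

def notify_slack(lines):
    """Post the offending lines to a Slack incoming webhook."""
    payload = {"text": "Errors detected in logs:\n" + "\n".join(lines[:10])}
    req = urllib.request.Request(
        SLACK_WEBHOOK,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

def check_logs():
    with open(LOG_PATH) as f:
        errors = find_errors(f)
    if errors:
        notify_slack(errors)

# Crontab entry (every 5 minutes):
#   */5 * * * * /usr/bin/python3 /opt/scripts/check_logs.py
```

Simple enough to ship in an afternoon — which, as you’ll see, was exactly the problem.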
2 weeks later
Dev 1: Hey Dev 3, looks like our chat service is down and I can’t bring it up. Could you have a look?
Dev 3: Yeah sure man. But I just got a new laptop because my previous one crashed, and I lost my ssh key in the crash. Could you help me set up a new key and I’ll get right to it?
Dev 1: Yeah sure man. Here you go.
2 hours later
Dev 3: So about the chat service — the instance ran out of disk space because of the log files. I’ve written a cron job that removes any logs older than two weeks.
Dev 1: Awesome, I love the initiative you’re taking! Thanks! Now that’s one less thing to worry about.
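That cleanup job could have been as simple as the following sketch. The directory is hypothetical, and in practice the same thing is a one-liner with `find ... -mtime +14 -delete` — the point is just how easily these little fixes accumulate.

```python
"""Delete log files older than two weeks — a sketch of the cleanup cron job
described above. The log directory is a placeholder, not our real setup."""
import time
from pathlib import Path

MAX_AGE_SECONDS = 14 * 24 * 60 * 60  # two weeks

def old_logs(log_dir, now=None, max_age=MAX_AGE_SECONDS):
    """Return the .log files last modified more than max_age seconds ago."""
    now = time.time() if now is None else now
    return [p for p in Path(log_dir).glob("*.log")
            if now - p.stat().st_mtime > max_age]

def clean(log_dir="/var/log/massive"):  # hypothetical directory
    for path in old_logs(log_dir):
        path.unlink()

# Crontab entry (daily at 03:00):
#   0 3 * * * /usr/bin/python3 /opt/scripts/clean_logs.py
```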
This kept on going. The few things that we were occasionally worrying about became a laundry list of worries that wouldn’t end.
- First it was just the logs and errors. But then errors started going undetected because the cron job didn’t account for every scenario, so we had to constantly maintain the cron job.
- Services would crash at random, so a manual restart was required each time. It occurred to nobody that something called systemd could have handled this for us.
- Kafka instances had to be regularly restarted (we still don’t know why), and once an instance was allocated a new IP, Zookeeper would refuse to function until it was manually reconfigured with the correct IP.
- The crashes were eventually traced to a memory leak, which wasn’t diagnosed until we added Prometheus and Grafana for monitoring. Two more components we decided to manage ourselves.
- Scaling instances was a fully manual process, so if traffic spiked one night while you were asleep and the current set of instances couldn’t handle it, well, good luck with that.
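On the restart point: a unit file along these lines would have let systemd supervise a service and bring it back up automatically after a crash. The service name, jar path, and description here are made up for illustration.

```ini
# /etc/systemd/system/chat-service.service — hypothetical unit file
[Unit]
Description=Massive chat service
After=network.target

[Service]
ExecStart=/usr/bin/java -jar /opt/massive/chat-service.jar
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```

One `systemctl enable --now chat-service` later and the 2 a.m. manual restarts go away — hindsight is wonderful.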
Developers were spending time maintaining servers instead of developing new features. And this was a small team of six devs tasked with building full-stack features while also maintaining the infrastructure; there were no dedicated infrastructure engineers or sysadmins. Costs were higher than they should have been. And despite all the effort to avoid vendor lock-in, we still ended up using AWS-specific tech such as Route 53 and S3.
Getting Butter At It
Fast forward to 2019, when we began experimenting with side projects. With the lessons from World of Legends fresh in our minds, we decided to shift our cloud strategy — why not use AWS’ managed services for some of our existing components? So instead of managing the Redis instance ourselves, why not use AWS ElastiCache? And instead of running Spring Boot services on EC2, why not containerise the services and run them on AWS ECS — their managed container orchestration service?
Managed services massively (see what I did there) boosted our productivity.
- Devs didn’t have to worry about a crashed service because the ECS engine took care of restarting those services.
- Basic service metrics such as CPU and memory usage were easy to analyse and could be used to trigger scaling events.
- Logs were directly routed to CloudWatch and nobody worried about disk space.
- CloudWatch alerts were sent to Slack via Lambda and adding new alerts/metrics was easy enough that we decommissioned the Prometheus setup in favour of CloudWatch.
- There was no need for configuring instance access since there was no need to ssh.
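The alerting path above (CloudWatch alarm → SNS → Lambda → Slack) can be sketched as a small Lambda handler like the one below. This is an illustration, not our production function: the webhook URL is a placeholder, and the message formatting is my own, though the SNS event shape is the standard one CloudWatch alarm notifications arrive in.

```python
"""AWS Lambda handler that forwards CloudWatch alarm notifications to Slack.
A sketch of the alerting path described above; the webhook is a placeholder."""
import json
import urllib.request

SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def format_alarm(sns_message):
    """Turn a CloudWatch alarm SNS message (a JSON string) into Slack text."""
    alarm = json.loads(sns_message)
    return (f"*{alarm['AlarmName']}* is {alarm['NewStateValue']}: "
            f"{alarm['NewStateReason']}")

def handler(event, context):
    # SNS delivers the CloudWatch alarm as a JSON string in the message body.
    text = format_alarm(event["Records"][0]["Sns"]["Message"])
    payload = json.dumps({"text": text}).encode()
    req = urllib.request.Request(
        SLACK_WEBHOOK,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```

Compare that to the cron-grep contraption from the WoL days: same outcome, but the detection, delivery, and retries are someone else’s problem.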
At the risk of sounding like a sponsored segment (it’s not) ECS was a game changer.
At some point in 2019 we started working on Butter Royale, and given the project’s tight timeline we knew we didn’t have the bandwidth to repeat the mistakes of WoL. So we went all in on AWS, with DynamoDB, CodePipeline, Elastic Load Balancing and more. After almost a year in production with 99.9% uptime, regular content updates and a growing player base, it’s safe to say we made the right decision.
Leveraging AWS’ offerings allowed the team to
- Focus on features that were important for the players
- Reduce maintenance overhead
- Reduce server costs
- Sleep soundly at night
The only real cost: we aren’t moving away from AWS anytime soon. But in our experience, the benefits clearly outweigh the downside of being locked in. It would be unfair to say that everyone should make the same decision. If we had a team as big as Netflix or the FBI to manage EC2 instances, we could certainly do more ourselves.
Given our circumstances, however, embracing AWS was a deliberate decision that has worked out very well for us. With each new project we look deeper into AWS’ offerings. That’s not to say we pick up anything we spot on AWS; instead, we evaluate what AWS has to offer and see where it fits our needs. Given a choice between self-managing an infrastructure component and using an AWS-managed one, there has to be a strong reason (that isn’t vendor lock-in) for us to pass on AWS.
If you enjoyed this article, here’s a great follow-up: Multi-Cloud is the Worst Practice. If you disagree, I’d love to hear about your experience!