Optimizing Gaijin Ent.'s Crossout for cost-efficiency and reliability using EKS, Karpenter and Agones

7 minute read
Content level: Advanced

This article will be helpful for game studios of any size planning to build a cost-efficient and reliable session-based multiplayer game.

Crossout teaser

Andrey Kazakov, Gaijin Entertainment
Mikhail Ishenin, AWS
Alexey Kazmin, AWS

Epic Games' Fortnite runs on AWS – awesome! But can a smaller studio benefit from running a multiplayer game in the cloud?

Let's first define the top challenges such a studio faces today:

  • Driving down costs, specifically upfront costs and people costs
  • Raising the speed of development, testing, and deployment of new and improved features
  • Keeping player satisfaction high, with uptime being one of the top metrics here

To discuss these questions, we will invite several experts from different GameTech companies who will tell their stories.

AWS for Games, customer stories #1, Gaijin Ent.

For the costs component today, we've invited Andrey Kazakov from Gaijin Entertainment. Gaijin Entertainment has released over 30 game titles, including the widely popular War Thunder, Enlisted and many other games, and Andrey is a seasoned DevOps Lead running production for Crossout.

Crossout is an MMO game where you can craft unique armored cars, ride them into battle and fight against other players in post-apocalyptic arenas. It has many game modes, but the ones with the highest demands on infrastructure are PvP and PvE, which require lots of resources to host their sessions.

After running the game with the help of a traditional hosting provider for some time, we discovered several problems:

  • We kept resources over-provisioned at off-peak time.
  • When we needed to scale up in response to demand for game servers, we deployed servers manually, and that could take up to 24 hours with dedicated hosting.
  • Cutting the peaks with AWS EC2 was fast enough, but quite expensive.

Andrey Kazakov, Gaijin Entertainment

In fact, infrastructure costs can be pretty high if you are using On-Demand instances. Savings Plans and Reserved Instances can help here, but they are not very efficient in a dynamic environment such as game server hosting, which depends on player activity and influxes from marketing campaigns. Those influxes can be as high as 2x the upper baseline.

We discussed this internally and also asked our AWS Account Team to address this problem, and we finally decided to use auto-scaling techniques, which allowed us to save with AWS EC2 Spot instances.

Andrey Kazakov, Gaijin Entertainment

Let's make a quick detour here. AWS EC2 Spot instances can save up to 90% compared to On-Demand EC2 prices, but AWS can reclaim them at any time, with a 2-minute termination notice and a non-guaranteed rebalancing signal before that. That may not sound like a good fit for a game server, where the session length could be 10-20 times greater than those 2 minutes, but many game companies do use them to host their game servers. Let's discover how!

There are several best practices for using AWS EC2 Spot instances, but let's consider the most relevant ones in more detail.

  1. Use appropriate instance types. You can use Spot Instance Advisor to find the average frequency of interruption for a specific instance type in a specific region. The best instance types are those which have a frequency of interruption of <5%.
  2. Use a diversified set of instance types. As you can see in the Spot Instance Advisor, different instance types of the same instance family usually have different frequencies of interruption. For example, at the time of writing, in the us-east-1 region, the t4g.2xlarge instance type has a frequency of interruption of >20%, whereas t4g.xlarge has <5%. If your workload scales horizontally, as most game servers do, you'd be better off taking twice as many t4g.xlarge instances and dramatically reducing the frequency of interruption.
  3. Use a diversified set of instance families. Let's say you know your workload runs smoothly on c6g.xlarge, but Spot Instance Advisor at the time of writing, in the us-east-1 region, shows that its frequency of interruption is 5-10%, which is greater than we'd want it to be. In this case, we can try c6gd.xlarge, which is a tiny bit more expensive but has a frequency of interruption of <5%, so it is a good way to go.
  4. Use appropriate tooling. Always make sure that your tooling handles both the termination signal and the EC2 instance rebalance recommendation signal, so that it can gracefully shut down an endangered instance before the actual termination signal comes.
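The first three practices can be sketched as a simple filter. This is only an illustration: the table below copies the interruption bands quoted above for us-east-1 at the time of writing, and the helper names `pick_instance_types` and `families` are hypothetical. Real values change over time and should be read from Spot Instance Advisor.

```python
# Interruption-frequency bands as quoted in the text (us-east-1, at the time
# of writing). Treat them as sample input, not as current data.
ADVISOR_BANDS = {
    "t4g.xlarge": "<5%",
    "t4g.2xlarge": ">20%",
    "c6g.xlarge": "5-10%",
    "c6gd.xlarge": "<5%",
}

# Bands ordered from least to most frequently interrupted.
BAND_ORDER = ["<5%", "5-10%", "10-15%", "15-20%", ">20%"]


def pick_instance_types(bands, max_band="<5%"):
    """Return instance types whose band is at or below max_band, sorted by name."""
    limit = BAND_ORDER.index(max_band)
    return sorted(
        name for name, band in bands.items()
        if BAND_ORDER.index(band) <= limit
    )


def families(instance_types):
    """Return the distinct instance families (the part before the dot)."""
    return sorted({name.split(".")[0] for name in instance_types})


if __name__ == "__main__":
    chosen = pick_instance_types(ADVISOR_BANDS)
    print(chosen, families(chosen))
```

With the sample table, the filter keeps one type from each of two families, which is the kind of diversification practices 2 and 3 aim for.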

Now, back to Andrey.

We reviewed several auto-scaling solutions and went for Kubernetes, because it could allow us, all at once, to:

  • Save money with properly configured Spot Instances
  • Scale the server fleet aligned with the number of concurrent users playing right now (CCU)
  • Automate game servers provisioning in a unified way

All that can allow us to save both money and time.

To set up proper auto-scaling, we adopted Agones, an open-source platform for scaling and orchestrating multiplayer game servers on Kubernetes. Our developer (yes, a single developer) integrated the Agones SDK into our game server code. We started a canary test, gradually increasing the percentage of game sessions using Agones. Now, Agones handles most game sessions, eliminating the need for dedicated servers.

Game server lifecycle

Game Server Lifecycle looks like this. The scheme is pretty complicated, but the important thing here is that the Game Server must come to a Ready state when it is ready to serve game sessions. When the game session ends, Agones deletes the pod, and this allows us to avoid bugs and memory leaks in game servers.
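To make the key transitions concrete, here is a toy state machine. It is a deliberately simplified model with only four states, and the `GameServer` class is a hypothetical helper for illustration, not the Agones SDK. The real lifecycle has more states than shown here.

```python
from enum import Enum


class State(Enum):
    SCHEDULED = "Scheduled"
    READY = "Ready"
    ALLOCATED = "Allocated"
    SHUTDOWN = "Shutdown"


# Simplified subset of transitions (an assumption made for this sketch).
ALLOWED = {
    State.SCHEDULED: {State.READY, State.SHUTDOWN},
    State.READY: {State.ALLOCATED, State.SHUTDOWN},
    State.ALLOCATED: {State.SHUTDOWN},
    State.SHUTDOWN: set(),
}


class GameServer:
    """Toy model of one game server process (not the Agones SDK)."""

    def __init__(self):
        self.state = State.SCHEDULED

    def _move(self, target):
        if target not in ALLOWED[self.state]:
            raise ValueError(
                f"illegal transition {self.state.value} -> {target.value}"
            )
        self.state = target

    def ready(self):
        # The server finished loading and can accept a game session.
        self._move(State.READY)

    def allocate(self):
        # A game session was assigned to this server.
        self._move(State.ALLOCATED)

    def shutdown(self):
        # The session ended; the pod is deleted rather than reused.
        self._move(State.SHUTDOWN)
```

Because `Shutdown` has no outgoing transitions, every session starts on a fresh process, which is the property that keeps leaks from accumulating.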

Fleet and FleetAutoscaler are Kubernetes CRDs provided by Agones. A Fleet represents a group of game servers ready to host games, while a FleetAutoscaler dynamically scales the number of game servers in a Fleet based on demand.
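One way to picture how a FleetAutoscaler sizes a Fleet is a buffer rule: keep a fixed number of Ready servers on top of those currently hosting sessions, within lower and upper limits. The function below is a minimal sketch of that rule for an absolute buffer. Its parameter names loosely mirror the buffer size and replica limits of an autoscaler policy, but the function itself is an illustration, not Agones code.

```python
def desired_replicas(allocated, buffer_size, min_replicas, max_replicas):
    """Fleet size that keeps buffer_size Ready servers beyond the allocated
    ones, clamped to the range [min_replicas, max_replicas]."""
    if buffer_size < 0 or min_replicas > max_replicas:
        raise ValueError("invalid autoscaler limits")
    target = allocated + buffer_size
    return max(min_replicas, min(target, max_replicas))


if __name__ == "__main__":
    # 10 sessions in progress, 5 standby servers wanted, at most 100 servers.
    print(desired_replicas(allocated=10, buffer_size=5,
                           min_replicas=0, max_replicas=100))
```

As sessions are allocated, the target grows by the same amount, so there is always a warm pool for the next match until the upper limit is reached.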

We specify the desired number of standby servers, and Agones ensures the fleet reaches that state. But we also needed to scale the resources that host game sessions, the EC2 instances themselves, and for that we chose Karpenter. Karpenter provisions new cluster nodes when demand is high, selecting the most cost-effective Spot instances, and removes excess nodes from the cluster when demand is lower.

By specifying a list of preferred instances or instance families, Karpenter ensures that the instance with the lowest price is provisioned in the Kubernetes cluster.
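The idea of picking the cheapest option from an allowed list can be sketched as follows. This is a toy model, not Karpenter's algorithm: the prices are made-up placeholders, `InstanceType` and `cheapest_plan` are hypothetical names, and a real provisioner weighs more than price alone (pod constraints, availability zones and so on).

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    name: str
    vcpu: int
    hourly_price: float  # hypothetical Spot price, not real data


def cheapest_plan(candidates, pods, vcpu_per_pod):
    """Return (instance type, node count, hourly cost) with the lowest total
    cost for running `pods` game servers, or None if nothing fits."""
    best = None
    for it in candidates:
        pods_per_node = it.vcpu // vcpu_per_pod
        if pods_per_node == 0:
            continue  # a single pod does not fit on this instance type
        nodes = math.ceil(pods / pods_per_node)
        cost = nodes * it.hourly_price
        if best is None or cost < best[2]:
            best = (it, nodes, cost)
    return best


if __name__ == "__main__":
    allowed = [
        InstanceType("c6g.xlarge", 4, 0.06),
        InstanceType("c6gd.xlarge", 4, 0.07),
        InstanceType("t4g.xlarge", 4, 0.05),
    ]
    plan = cheapest_plan(allowed, pods=10, vcpu_per_pod=1)
    print(plan[0].name, plan[1])
```

Keeping several families in the allowed list matters more than the exact prices: it gives the provisioner alternatives when one Spot pool is reclaimed.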

Karpenter works in tandem with Node Termination Handler, which handles the EC2 instance rebalance recommendation signal: it cordons the node at risk, waits for it to be drained and removes it from the cluster, while Karpenter spins up another node of a different type.
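For readers who want to see what handling these signals involves, below is a minimal sketch of the decision logic. It assumes the notices are read from the EC2 instance metadata service at the two paths shown and that the termination notice is a small JSON document with a `time` field. The `decide` function and its return values are hypothetical and only illustrate the order of precedence, they are not how Node Termination Handler is implemented.

```python
import json
from datetime import datetime, timezone

# Instance metadata paths an on-node agent could poll; a "not found" response
# means no notice has been issued yet.
SPOT_ACTION_PATH = "/latest/meta-data/spot/instance-action"
REBALANCE_PATH = "/latest/meta-data/events/recommendations/rebalance"


def seconds_until_action(body, now):
    """Parse a termination notice and return the seconds left before it
    takes effect (negative if the time has already passed)."""
    doc = json.loads(body)
    when = datetime.strptime(doc["time"], "%Y-%m-%dT%H:%M:%SZ")
    when = when.replace(tzinfo=timezone.utc)
    return (when - now).total_seconds()


def decide(rebalance_seen, action_body):
    """Pick the next step: 'drain' on a termination notice, 'cordon' on a
    rebalance recommendation only, otherwise 'none'."""
    if action_body is not None:
        return "drain"  # about two minutes left, so stop waiting
    if rebalance_seen:
        return "cordon"  # early warning: stop scheduling new sessions here
    return "none"
```

The early rebalance signal is what makes Spot workable for sessions longer than two minutes: a cordoned node receives no new sessions, so running ones have a chance to finish before the node goes away.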

Here is the more or less final scheme that we currently have. Karpenter manages EKS nodes, while Agones spawns new Game Server instances on those nodes.

Final schema

Working with Kubernetes and developing a solution for it from scratch is not a walk in the park. The MVP was ready in two weeks, but it took about 6 months to reach production. Was all the effort worth it? Let's look at the results:

  • Game servers are now set up automatically within 1.5 minutes, significantly reducing the latency compared to the previous maximum of 1 day. This allows us to make the most of incoming traffic, whether organic or marketing-generated.
  • Server costs have been reduced by 80% compared to hosting them with a dedicated server provider, which is totally acceptable for the unit economics of the game.
  • The time spent on server maintenance has significantly decreased. Once the cluster is set up, everything runs smoothly without frequent manual interventions.

So yes, all the objectives have been successfully accomplished.

Andrey Kazakov, Gaijin Entertainment

The main point of this story is simple: treat the cloud as a tool which can solve even heavily constrained tasks, if you use it right.

Why this project was a huge success:

  • Containerized game servers within Kubernetes
  • Agones to manage game sessions
  • Karpenter to manage Kubernetes nodes
  • Spot Instances to reduce costs, while using all the best practices to keep interruptions at a satisfactorily low level

See you in our next interview and happy architecting!
