If this is the first time you’re hearing about Chaos Engineering, you might be wondering what it even is.
It was explained very well by Tammy Butow during a recent TestTalks interview.
She suggested we think of it as being like preventative medicine in that it’s a disciplined approach to identifying failures before they become outages.
Controlled Madness
The best approach is to proactively test how the system responds under stress. You can then identify and fix failures before they impact your customers or damage your reputation through the bad publicity an outage can generate. (Remember all the bad press healthcare.gov received when it first launched?)
The idea behind Chaos Engineering is to compare what you believe will happen in your distributed system with what actually happens.
To learn how to build resilient software systems, you can use a chaos test tool to break things in your environment on purpose and see if it actually fails the way you believed it would.
Here’s the thing.
You can always sit in a room, draw diagrams on a whiteboard, and form a hypothesis about how things “might” break or fail, but you never really know until there’s an actual failure.
That’s the big idea here.
It’s all about carefully thought out experiments, rather than randomly inserting chaos or injecting failure into your data center. This test approach is not about simply walking into your office on Monday morning and saying that you're going to bring down production. That isn’t how it works.
Chaos Engineering (CE) is achieved through thoughtful, planned-out experiments that help reveal weaknesses in your systems.
In keeping with the preventative-medicine analogy, CE works like a vaccine: it injects a little bit of harm, but it's for the overall good.
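To make that concrete, here's a minimal sketch of such an experiment in plain Python, with no specific tool assumed. The checkout health endpoint, the "recommendations" Docker container, and the 5% error-rate threshold are all hypothetical stand-ins for whatever your own hypothesis involves. The shape is always the same: measure the steady state, inject one small failure, compare reality to your prediction, then roll everything back.

```python
# A minimal chaos-experiment sketch (no specific tool assumed).
# Hypothesis: if the "recommendations" dependency goes down, the checkout
# service keeps answering, because it should fall back to a default response.
# The URL, container name, and 5% threshold are illustrative assumptions.
import subprocess
import time
import urllib.request

CHECKOUT_URL = "http://localhost:8080/checkout/health"  # hypothetical endpoint
DEPENDENCY_CONTAINER = "recommendations"                # hypothetical container


def error_rate(url: str, samples: int = 20) -> float:
    """Fraction of requests that fail or time out."""
    failures = 0
    for _ in range(samples):
        try:
            with urllib.request.urlopen(url, timeout=2):
                pass
        except OSError:  # covers URLError, HTTPError, and timeouts
            failures += 1
        time.sleep(0.2)
    return failures / samples


if __name__ == "__main__":
    # 1. Measure the steady state.
    baseline = error_rate(CHECKOUT_URL)
    print(f"Steady-state error rate: {baseline:.0%}")

    # 2. Inject the failure, keeping the blast radius to a single container.
    subprocess.run(["docker", "stop", DEPENDENCY_CONTAINER], check=True)
    try:
        # 3. Measure again and compare what happened to what you predicted.
        during = error_rate(CHECKOUT_URL)
        print(f"Error rate during failure: {during:.0%}")
        print("Hypothesis held" if during <= 0.05 else "Hypothesis failed: we found a weakness")
    finally:
        # 4. Always roll the experiment back.
        subprocess.run(["docker", "start", DEPENDENCY_CONTAINER], check=True)
```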
Where did Chaos Engineering Start?
If you’re a tester, the term Chaos Monkey might sound familiar to you. It was heavily talked about back in 2011 when the Netflix team created it as a way to test server failures during their move from bare metal to Amazon Web Services (AWS) in the Cloud. What’s really cool is that Netflix open-sourced it, which means you can download it for free from GitHub.
That’s the good news.
The bad news is that it’s not the most user-friendly tool out there…especially for newbies. But no worries – I’ll let you know about a more accessible option later in this post.
Before we get to that, however, I want to explain why I think there is going to be more demand for Chaos Engineering than ever before.
Why do you Need Chaos Engineering?
As more and more companies move from a monolithic architecture to a microservice architecture, many of the engineering teams I’ve spoken with aren’t even 100% certain what each service does or how one impacts another.
In some of the more extreme cases, especially when it comes to more complex systems, they don't even necessarily know what microservice dependencies they have in production.
As a tester you’ve probably noticed that it’s getting harder to keep up with the pace at which our companies are trying to develop and introduce software solutions to meet the demands of our customers.
It gets even worse when you begin to realize how expensive system downtime is for our companies. Not only are we wasting considerable funds on interruptions, but we’re most likely losing customer loyalty as well.
On the more human side of things, you've probably found that the constant interruptions of supporting these outages are causing some engineers to burn out pretty quickly.
Chaos engineering is something that can help with a lot of these scenarios.
Are you convinced yet?
If so, you might be wondering how to get started.
Prerequisites Before Starting
Before you start there are some prerequisites you’ll need to have in place:
•   Monitoring/observability
•   On-call and Incident Management
•   Cost of downtime per hour
The most important prerequisite is monitoring and observability, so you know how your system is currently doing.
Without it, you won’t be able to measure how your system behaves as you're performing chaos experiments.
You should also already have an On-Call and Incident Management program in place, and know your cost of downtime per hour.
Knowing the cost of downtime is crucial for communicating the value that Chaos Engineering brings to your organization and for getting management buy-in.
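If you've never put a number on it, even a back-of-the-envelope estimate helps that conversation. Here's a tiny sketch; every figure in it is made up, so plug in your own revenue, impact, and staffing numbers:

```python
# Back-of-the-envelope downtime cost per hour. All numbers below are
# placeholders; substitute your own revenue, impact, and staffing figures.
hourly_revenue = 50_000        # revenue the affected system normally earns per hour
revenue_impact = 0.60          # fraction of that revenue lost while it's down
engineers_on_incident = 8      # people pulled onto the incident
loaded_hourly_rate = 120       # fully loaded cost per engineer-hour

cost_per_hour = hourly_revenue * revenue_impact + engineers_on_incident * loaded_hourly_rate
print(f"Estimated downtime cost: ${cost_per_hour:,.0f} per hour")  # -> $30,960 with these numbers
```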
When you have these things, you're also better able to know what services are the most highly critical in your infrastructure.
The CEO of Honeycomb recently said that “Chaos Engineering without observability… is just chaos.”
You want to know how your system handles things without Chaos Engineering, and you want to know how it's going to handle the chaos experiment as it moves forward.
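As a concrete example of the kind of observability check you want in place, here's a minimal sketch that reads a service's 5xx error rate from Prometheus before and during an experiment. It assumes Prometheus is reachable on localhost:9090, and the metric name (http_requests_total) and job label (checkout) are placeholders for your own.

```python
# Minimal observability probe: ask Prometheus (assumed at localhost:9090)
# for a service's 5xx error rate over the last five minutes.
# The metric name and job label are placeholders; use your own.
import json
import urllib.parse
import urllib.request

PROMETHEUS = "http://localhost:9090/api/v1/query"
QUERY = (
    'sum(rate(http_requests_total{job="checkout",code=~"5.."}[5m])) / '
    'sum(rate(http_requests_total{job="checkout"}[5m]))'
)


def current_error_rate() -> float:
    """Return the current 5xx error rate, or 0.0 if the query has no data."""
    url = PROMETHEUS + "?" + urllib.parse.urlencode({"query": QUERY})
    with urllib.request.urlopen(url, timeout=5) as resp:
        body = json.load(resp)
    result = body["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0


if __name__ == "__main__":
    print(f"Current 5xx error rate: {current_error_rate():.2%}")
    # Capture this before the experiment (steady state) and again while the
    # failure is injected, then compare the two against your hypothesis.
```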
Use Cases
There are many use cases for this technique, but these are the ones you're most likely to run into in the real world:
- Outage reproduction
- On-call training
- Strengthen new products
- Battle test new infrastructure and services
- Logs and disk failures (see the disk-pressure sketch after this list)
- Prepare for launches or high-traffic days
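For the disk-failure case mentioned above, a sketch like the following is about as simple as it gets: fill a scratch directory with a bounded amount of junk data, watch how your service's logging and disk alerts react, and then clean up. The path and the 512 MB fill size are arbitrary choices for illustration; a dedicated tool gives you much safer controls for this kind of attack.

```python
# A crude, bounded disk-pressure experiment. The scratch location and the
# 512 MB fill size are arbitrary; keep the amount well below the free space
# on the volume you're testing.
import os
import shutil
import tempfile

FILL_MB = 512


def fill_disk(directory: str, megabytes: int) -> str:
    """Write `megabytes` of junk data into `directory` and return the file path."""
    target = os.path.join(directory, "chaos_filler.bin")
    with open(target, "wb") as f:
        for _ in range(megabytes):
            f.write(os.urandom(1024 * 1024))  # one megabyte per iteration
    return target


if __name__ == "__main__":
    scratch = tempfile.mkdtemp(prefix="chaos_disk_")
    print(f"Free space before: {shutil.disk_usage(scratch).free // 2**20} MB")
    filler = fill_disk(scratch, FILL_MB)
    print(f"Free space during: {shutil.disk_usage(scratch).free // 2**20} MB")
    # While the space is consumed, watch how your service's log writes,
    # disk alerts, and cleanup jobs behave; then always roll back.
    os.remove(filler)
    os.rmdir(scratch)
    print("Cleaned up.")
```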
Tools to Use
Getting started with Chaos Engineering can be quite complicated (and a little scary). But once you've been doing it for a while, you'll become quite good at it, and you’ll no longer be afraid to run Chaos Engineering attacks. You'll understand that it’s a scientific process.
As I mentioned earlier, you can use Chaos Monkey, which is a great option.
But if you want a more intuitive option with a friendly UI, as well as an API that lets you programmatically run all of your Chaos Engineering experiments (including disaster recovery testing), then definitely check out Gremlin Free.
For a good “getting started” guide/demo, check out Ana Medina’s 2019 PerfGuild session on the subject.
One of the best things about PerfGuild is that if you missed it, you can still get the recordings of the live event now and start binge watching. Bonus: you’ll get to view Ana’s Q&A session on Chaos Engineering.
If you haven't already, you can register to get instant access to PerfGuild 2019 now.