Many security professionals think that their job is to prevent bad actors from gaining access to trusted resources. In isolation, that is a true statement, but it obscures the real goal: the job of infosec pros is to ensure that only appropriate actors can access trusted resources. One way to accomplish this is through Security Chaos Engineering, which tests security resilience before an attack happens. It is an evolution of the pioneering work first done at Netflix many years ago, and there are now a number of products and practitioners in this field.
The concept is simple to explain but exceedingly hard to implement. One reason this engineering mindset is needed has to do with the way breaches are understood inside organizations. Too often we don’t think about our IT infrastructure holistically, and when a breach happens we just plug the hole and move on. How many post-breach memos have you read where the author says, “we are taking steps to ensure this never happens again”? Technically that promise can be kept, but it misses the point: the next breach will happen somewhere else in our network, caused by some other “hole.” Another reason is that the average software stack has become so complex and distributed that it is hard to comprehend and defend. It isn’t a matter of if you will have a breach, but when, how, and which part of your systems will be compromised.
Adopting chaos engineering means that you look for potential points of failure across all of your IT systems. Part of this should be inherent in any lifecycle governance of your systems. But part is also being clever about how you test them. If you think you have this covered with penetration testing, think again. The usual pen test engagement is a single moment in time when a SWAT team inhabits a conference room (perhaps virtually these days) and tests their mettle against your security defenses. Chaos engineering, by contrast, is a continuous practice: your team is always testing your systems and software. Sadly, the old point-in-time methods don’t work anymore. For example, just because you bought a firewall several years ago and spent time defining a rule set doesn’t mean those rules are relevant or effective today. Your systems might be completely different now and no longer protected by them. And these days, with rising cases of ransomware and data exfiltration, you want to catch these attacks before they do real damage.
Netflix was one of the first companies to popularize chaos engineering several years ago with a tool called Chaos Monkey. It was designed to test the company’s Amazon Web Services infrastructure by constantly, and randomly, shutting down production servers. This always-on feature is important, because no single event will do enough damage or provide enough insight to harden your systems or find the weakest points in your infrastructure. Now that we live in an era of complex security events, in which multiple malware techniques are combined in a coordinated campaign, we need to design and test for more sophisticated and longer-lasting attacks. We need better tools, and that is where Security Chaos Engineering can help. In addition to the open source tools that came from Netflix, there are commercial products such as Verodin/Mandiant’s Security Validation, SafeBreach’s Breach and Attack Simulation, and AttackIQ’s Security Optimization Platform, to name a few.
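To make the random-shutdown idea concrete, here is a minimal Python sketch that simulates it against an in-memory list of servers. It is not Netflix's code and makes no cloud API calls; the `Instance` class, the group names, and the `protected_groups` parameter are all assumptions made up for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    """A hypothetical server record; a real tool would query a cloud inventory."""
    name: str
    group: str
    running: bool = True


def pick_victim(fleet, rng, protected_groups=frozenset()):
    """Choose one running instance at random, skipping protected groups."""
    candidates = [
        inst for inst in fleet
        if inst.running and inst.group not in protected_groups
    ]
    return rng.choice(candidates) if candidates else None


def run_round(fleet, rng, protected_groups=frozenset()):
    """Shut down one randomly chosen instance and return it (or None)."""
    victim = pick_victim(fleet, rng, protected_groups)
    if victim is not None:
        # Stand-in for a real "terminate instance" call.
        victim.running = False
    return victim


def service_available(fleet, group):
    """A group survives the round if at least one of its instances still runs."""
    return any(inst.running and inst.group == group for inst in fleet)


fleet = [
    Instance("web-1", "web"),
    Instance("web-2", "web"),
    Instance("db-1", "db"),
]
rng = random.Random(42)  # seeded so a run can be replayed
victim = run_round(fleet, rng, protected_groups={"db"})
print(f"stopped {victim.name}; web still up: {service_available(fleet, 'web')}")
```

Running rounds like this on a schedule, rather than once, is what turns a one-off test into the always-on practice described above.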
Customers who have used these tools suggest the following best practices:
- Have an action plan: don’t change more than one variable at a time
- Define the rules of engagement (including the scaling up of your systems) so you maintain control when things go south
- Know your “blast radius” and the disruptive implications of your tests
- Use a tool that integrates with your SIEM logs (for example, SafeBreach can work with RSA’s NetWitness Platform)
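The first three practices can be checked before an experiment is allowed to run. The Python sketch below is one hypothetical way to do that; the `Experiment` fields, the `validate` function, and the blast-radius limit are illustrative assumptions, not features of any product named here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Experiment:
    """A hypothetical description of one chaos test."""
    name: str
    changed_variables: tuple  # what the test alters
    targets: tuple            # systems the test is allowed to touch
    abort_condition: str      # rule of engagement: when to stop


def validate(experiment, max_blast_radius):
    """Return a list of problems; an empty list means the plan may proceed."""
    problems = []
    if len(experiment.changed_variables) != 1:
        problems.append("change exactly one variable per experiment")
    if len(experiment.targets) > max_blast_radius:
        problems.append(
            f"blast radius {len(experiment.targets)} exceeds limit {max_blast_radius}"
        )
    if not experiment.abort_condition.strip():
        problems.append("define an abort condition before starting")
    return problems


plan = Experiment(
    name="open-egress-port",
    changed_variables=("firewall rule: allow outbound 4444",),
    targets=("staging-web-1",),
    abort_condition="stop if the staging error rate rises above baseline",
)
print(validate(plan, max_blast_radius=2))
```

A plan that changes two variables at once, or that names more targets than the agreed limit, comes back with a list of reasons and should not be started.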
The last of these practices bears further explanation. A single SIEM log entry can easily be overlooked, especially if you are hunting for it in a massive dataset. Security Chaos Engineering tools can automatically find these entries and advise you about their implications, such as the need to tighten a too-loosely-defined access roles policy.
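As a rough illustration of what such a lookup involves, the Python sketch below tags each simulated attack with an identifier and then scans exported SIEM events for it. The JSON-lines export format and the `simulation_id` field are assumptions for this example only; real SIEM products have their own schemas and query interfaces.

```python
import json


def find_undetected(simulated_ids, siem_lines):
    """Return the simulation IDs that never appear in the exported SIEM events."""
    wanted = set(simulated_ids)
    seen = set()
    for line in siem_lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of aborting the scan
        if not isinstance(event, dict):
            continue
        sim_id = event.get("simulation_id")  # hypothetical field name
        if isinstance(sim_id, str) and sim_id in wanted:
            seen.add(sim_id)
    return sorted(wanted - seen)


export = [
    '{"rule": "egress-blocked", "simulation_id": "sim-001"}',
    '{"rule": "login-ok", "user": "alice"}',
    "not json at all",
]
print(find_undetected(["sim-001", "sim-002"], export))
```

Any identifier the function returns marks a simulated attack that left no trace in the logs, which is the gap worth investigating first.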
If you haven’t yet examined any of these chaos engineering tools, whether for general systems analysis or for security-related issues, now might be the time to take a closer look. Every security team should shift its mindset from patching after a security event to proactively anticipating future attacks.