Being proactive is the key to staying safe online, especially for businesses and organizations that operate websites and mobile applications. If you wait for threats to appear, then in most cases it is too late to defend against them. Many data breaches come about this way, with hackers uncovering security gaps that had previously gone undetected.
The average web developer wants to assume that their code and projects will always function as intended. Reality is a lot messier than that, and organizations need to expect the unexpected. For years, cybersecurity experts have recommended (and still do recommend) a practice known as penetration testing, where internal users pose as hackers and probe servers, applications, and websites for exposed areas.
The next evolution of penetration testing is a practice known as Chaos Engineering. The theory is that the only way to keep online systems secure and stable is to introduce deliberate, randomized experiments that test overall stability. In this article, we'll dive deeper into Chaos Engineering and the ways it can be implemented effectively.
Origin of Chaos Engineering
The cloud computing movement has revolutionized the technology industry but also brought with it a larger degree of complexity. Gone are the days when companies would run a handful of Windows servers from their local office. Now organizations of all sizes are leveraging the power of the cloud by hosting their data, applications, and services in shared data centers.
Back in 2010, Netflix was one of the first businesses to build its entire product offering around a cloud-based infrastructure. It deployed its video streaming technology in data centers around the world in order to deliver content at high speed and quality. But what Netflix engineers realized was that they had little control over the back-end hardware they were using in the cloud. Thus, Chaos Engineering was born.
The first experiment that Netflix ran was called Chaos Monkey, and it had a simple purpose. The tool would randomly select a server node within the company's cloud platform and completely shut it down. The idea was to simulate the kind of random server failures that happen in real life. Netflix believed that the only way they could be prepared for hardware issues was to initiate some themselves.
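To make the concept concrete, here is a minimal Python sketch of a Chaos Monkey-style experiment against AWS EC2 using the boto3 SDK. The `ChaosGroup` opt-in tag is an assumption made for illustration, and Netflix's actual tool is far more sophisticated, with scheduling and safety controls built in.

```python
import random

import boto3  # AWS SDK for Python; assumes credentials are configured

ec2 = boto3.client("ec2", region_name="us-east-1")

def pick_random_instance(group: str = "chaos-eligible"):
    """Return one running instance ID that has opted in to experiments."""
    # The ChaosGroup tag is hypothetical; opt-in tagging keeps the blast
    # radius limited to servers the team has agreed to sacrifice.
    resp = ec2.describe_instances(
        Filters=[
            {"Name": "tag:ChaosGroup", "Values": [group]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    ids = [
        inst["InstanceId"]
        for reservation in resp["Reservations"]
        for inst in reservation["Instances"]
    ]
    return random.choice(ids) if ids else None

if __name__ == "__main__":
    victim = pick_random_instance()
    if victim:
        print(f"Terminating {victim} to simulate a random server failure")
        ec2.terminate_instances(InstanceIds=[victim])
```

The important design choice is that the failure is random rather than hand-picked: if engineers chose the target themselves, they would unconsciously avoid the nodes they already suspect are fragile.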
Tools to use
It's important not to rush into the practice of Chaos Engineering. If your experiments are not properly designed and planned, the results can be disastrous and yield little useful knowledge. Best practice is to nominate a small group of IT staff to lead the activities.
Every chaos experiment should begin with a hypothesis, where the team asks what might happen if their cloud-based platform experienced a particular issue or outage. Then a test should be designed with as small a scope as possible while still providing useful analysis.
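One simple way to enforce that discipline is to write the hypothesis down in a structured form before anything gets broken. The sketch below uses a hypothetical Python schema; the field names and example values are illustrative, not an industry standard.

```python
from dataclasses import dataclass

@dataclass
class ChaosExperiment:
    """A hypothesis-first experiment definition (hypothetical schema)."""
    hypothesis: str           # what we believe stays true under failure
    steady_state_metric: str  # the measurement that defines "healthy"
    fault: str                # the single failure being injected
    blast_radius: str         # smallest scope that still yields a signal
    abort_condition: str      # when to stop the test and roll back

checkout_cache_failure = ChaosExperiment(
    hypothesis="Checkout stays available if one cache node fails",
    steady_state_metric="p99 checkout latency below 800 ms",
    fault="Terminate one cache replica in the staging cluster",
    blast_radius="Staging environment, one node only",
    abort_condition="Error rate above 2% for 60 seconds",
)
```

If the steady-state metric holds while the fault is active, the hypothesis survives; if it doesn't, the experiment has found exactly the kind of weakness it was designed to expose.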
One area where companies often need to focus their chaos experiments is in relation to global traffic handling. Because most of their development and testing is done in a local environment, it can be challenging to understand how cloud infrastructure will handle distributed loads.
A virtual private network (VPN) client can come in handy for this type of experimentation. While it encrypts all incoming and outgoing traffic, which is generally useful in its own right, it also changes the apparent source IP address and location information. The latter is the important part here. You can use VPN services to simulate users from other countries or regions and add them as variables in a chaos experiment.
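Commercial VPN clients typically work at the operating-system level, so for a scripted experiment the same effect is often achieved with region-specific proxy or VPN gateway endpoints. Here is a minimal Python sketch using the requests library; the proxy URLs are hypothetical placeholders for whatever regional exit points your provider exposes.

```python
import requests  # pip install requests

# Hypothetical exit points; in practice these would be the regional
# VPN or proxy endpoints offered by your provider.
REGION_PROXIES = {
    "germany": "http://de.proxy.example.com:8080",
    "japan": "http://jp.proxy.example.com:8080",
    "brazil": "http://br.proxy.example.com:8080",
}

def probe_from_region(region: str, url: str) -> float:
    """Request a URL so that it appears to originate from the given
    region, and return the response time in seconds."""
    proxy = REGION_PROXIES[region]
    resp = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
    resp.raise_for_status()
    return resp.elapsed.total_seconds()

for region in REGION_PROXIES:
    latency = probe_from_region(region, "https://example.com/health")
    print(f"{region}: {latency:.2f}s")
```

Running probes like this while a fault is active shows whether users in distant regions see the same degradation as local ones, which is exactly the distributed-load question a purely local test environment cannot answer.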
Benefits of chaos experiments
The philosophy of Chaos Engineering can seem counterproductive at first. Why would an organization deliberately break something that could negatively impact the data and services running in live production systems? The answer is that although chaos experiments can have some short-term negative consequences, they more often surface larger risks that are looming in the future.
Avoiding service outages and cyberattacks is the key goal of any chaos experiment. You want to push your cloud infrastructure to its theoretical limits in order to understand how it will react to heavy traffic or unexpected failures. The analysis gained from a chaos experiment should inform IT decision making about cloud architecture. For example, resources may be scaled differently, or firewall rules may be tightened to close potential vulnerabilities.
Chaos experiments also offer a great opportunity for companies to see how their staff react in a time of crisis. With cloud-based systems, fire drills happen all the time due to issues with hardware, software, or networking. Chaos experiments can help identify bottlenecks and problem points in any incident response process.
The future of Chaos Engineering
Typically, chaos experiments are run for a limited period of time and are not an everyday activity. That's because they require a significant amount of upfront planning and participation from developers, quality assurance engineers, and cloud architects. It's simply not feasible or smart to be running chaos experiments all the time. The result might be, well, chaotic. Not in a good way.
So what is the future of Chaos Engineering? In all likelihood, the practice will soon be paired with artificial intelligence and machine learning tools that automate much of the analysis and recovery work. These smart systems will be better at scanning networks and cloud environments to monitor performance and stability during tests. They may even be able to identify threats or dependencies that were previously unknown.
The bottom line
The practice of Chaos Engineering has provided value to organizations of all sizes that base their technology resources in a cloud environment. Because of the complexity of these systems, it is vital to run experiments and tests in order to uncover potential weaknesses and security gaps.
A chaos experiment should be designed to provide useful information without putting all systems at risk of failure. Cloud architectures are designed to be redundant, so that if one node or component goes down, the application or website stays up and the impact on users is minimized.