LUC #39: Chaos Engineering: Embracing Chaos To Strengthen System Resilience and Reliability

Plus, how the most prominent API architecture styles work, Linux permissions explained, and an overview of the most popular deployment patterns

A big thank you to our partner Postman who keeps this newsletter free to the reader.

You've probably used Postman to send HTTP requests. But did you know you can also send GraphQL, gRPC, WebSocket, and MQTT requests?

Chaos Engineering: Embracing Chaos To Strengthen System Resilience

Deliberately inject faults into your system.

Yes, you read that correctly. This is what Chaos Engineering is all about.

By deliberately injecting faults into systems in a controlled manner, Chaos Engineering helps teams proactively identify and address potential failures before they occur.

Today we’ll look into the principles behind Chaos Engineering, key practices, how to implement it, and how it applies across various organization sizes.

Let’s dive in!

The Need for Chaos Engineering

Unscheduled downtime is never a good thing. The loss of user trust can have a devastating impact. And for some systems, unscheduled downtime can lead to major consequences. Think healthcare.

However, modern software systems are inherently complex. And complexity only increases when a system becomes distributed.

There are bound to be points of failure.

And that’s why proactively finding and addressing them is so important.

This is where Chaos Engineering comes in. It helps overcome complexity challenges through controlled fault injection, deliberately introducing disturbances to test and fortify an application's resilience.

Core Principles of Chaos Engineering

Chaos Engineering is grounded in five fundamental principles:

Hypothesis-driven approach

It begins with defining what normal system behavior looks like, establishing metrics that reflect the system's steady state.

This clear definition sets the stage for understanding the impact of introduced variables.
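A steady state can be written down as explicit thresholds that an experiment is later checked against. The following is a minimal sketch, assuming two illustrative metrics (error rate and 95th-percentile latency); the names `SteadyState`, `p95`, and `hypothesis_holds` are hypothetical and not taken from any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyState:
    """Hypothetical steady-state definition: limits a healthy system stays within."""
    max_error_rate: float       # fraction of failed requests, e.g. 0.01 = 1%
    max_p95_latency_ms: float   # ceiling for 95th-percentile latency


def p95(latencies_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    rank = max(1, -(-95 * len(ordered) // 100))  # ceil(0.95 * n)
    return ordered[rank - 1]


def hypothesis_holds(steady, latencies_ms, errors, total):
    """True if the observed metrics stay within the steady-state limits."""
    error_rate = errors / total
    return (error_rate <= steady.max_error_rate
            and p95(latencies_ms) <= steady.max_p95_latency_ms)


# Assumed example thresholds: at most 1% errors, p95 latency under 300 ms.
steady = SteadyState(max_error_rate=0.01, max_p95_latency_ms=300)
```

The hypothesis for an experiment then becomes "`hypothesis_holds` stays true while the fault is active", which gives the experiment a clear pass or fail outcome.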

Real-world event simulation

This involves introducing variables that simulate real-world disruptions such as network outages or traffic spikes.

These simulations are not random; they mimic disruptions likely to occur in the actual environment, providing valuable insights into how the system copes.

Experiments in production

While testing in a controlled setting has its place, real-world conditions often reveal unforeseen vulnerabilities.

These tests in production (or as close to it as possible) should be carefully monitored to ensure accuracy without causing significant user impact.

Testing in production is controversial. It works well in some contexts, but in many others the risks are too great.

If it makes sense for an organization to run experiments in production, it must proceed with prudence, limiting the blast radius and user impact in case something goes wrong.

Approaches for doing this are covered shortly.

Automation at scale

As systems scale, manually conducting experiments becomes impractical.

Automated tools and scripts allow systematic testing across various parts of the system, making the process efficient and comprehensive.
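As a rough illustration of what such automation might look like, here is a minimal sketch of an experiment runner. The `Experiment` fields and the `run_all` function are hypothetical; a real setup would plug in actual fault injection and monitoring calls.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    """Hypothetical experiment description used only for this sketch."""
    name: str
    inject: Callable[[], None]         # starts the fault
    rollback: Callable[[], None]       # removes the fault
    steady_state: Callable[[], bool]   # True while the system looks healthy


def run_all(experiments):
    """Run each experiment in turn and record whether steady state held."""
    results = {}
    for exp in experiments:
        if not exp.steady_state():
            # Never inject faults into a system that is already unhealthy.
            results[exp.name] = "skipped"
            continue
        exp.inject()
        try:
            results[exp.name] = "passed" if exp.steady_state() else "failed"
        finally:
            exp.rollback()  # always remove the fault, even if a check raises
    return results
```

Because every experiment shares the same shape, the same runner can be scheduled against many services without hand-written steps for each one.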

Minimizing impact

This involves strategies like starting with smaller experiments to limit the 'blast radius' and gradually scaling up.

The goal is to learn and improve without compromising overall system stability, ensuring a balance between resilience testing and maintaining operational integrity.
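One way to keep the blast radius small is to expose only a fixed percentage of hosts or users to an experiment. The sketch below assumes a hash-based selection so the same targets are chosen on every run; the function name `in_blast_radius` and the identifiers passed to it are hypothetical.

```python
import hashlib


def in_blast_radius(target_id: str, experiment: str, percent: float) -> bool:
    """Deterministically select roughly `percent`% of targets for an experiment."""
    digest = hashlib.sha256(f"{experiment}:{target_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # stable bucket 0..9999
    return bucket < percent * 100


# Start with about 1% of hosts, then widen only once results look safe.
hosts = [f"host-{i}" for i in range(1000)]
first_wave = [h for h in hosts if in_blast_radius(h, "kill-cache", 1)]
```

Since a target's bucket never changes, raising the percentage only adds targets. Everything included at 1% is still included at 5%, which makes gradual scale-up predictable.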

Key Practices in Chaos Engineering

Implementing Chaos Engineering involves a series of systematic practices including:

Fault injection

Methods and tools are used to deliberately introduce faults, ranging from server crashes to network disruptions, to assess their impact on the system.
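At the application level, fault injection can be as simple as wrapping a call so that it sometimes fails. This is a minimal sketch under that assumption; `inject_faults`, `InjectedFault`, and `fetch_profile` are hypothetical names, and dedicated tools typically inject faults at the infrastructure or network level instead.

```python
import functools
import random


class InjectedFault(Exception):
    """Raised in place of a real failure during an experiment."""


def inject_faults(failure_rate: float, rng: random.Random):
    """Decorator that makes a call fail with probability `failure_rate`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if rng.random() < failure_rate:
                raise InjectedFault(f"injected fault in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Hypothetical service call that fails about 20% of the time.
# A seeded generator keeps the experiment repeatable.
@inject_faults(failure_rate=0.2, rng=random.Random(42))
def fetch_profile(user_id):
    return {"id": user_id}
```

Callers of `fetch_profile` can then be observed to see whether their retries, timeouts, and fallbacks behave as intended.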

Load testing

This practice stresses the system beyond normal capacity to identify its breaking points and understand under what conditions it fails. This is very important for ensuring that the system can handle unexpected surges in demand.
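The search for a breaking point can be sketched as a stepwise ramp. In the example below, `simulated_error_rate` is a stand-in for a real measurement (the 500 requests-per-second capacity is an assumed figure), and `find_breaking_point` is a hypothetical helper.

```python
def simulated_error_rate(rps: int, capacity_rps: int = 500) -> float:
    """Stand-in for a real measurement: requests beyond capacity are rejected."""
    if rps <= capacity_rps:
        return 0.0
    return (rps - capacity_rps) / rps


def find_breaking_point(measure, start_rps=100, step_rps=100,
                        max_rps=2000, max_error_rate=0.05):
    """Ramp load in steps; return the first load whose error rate exceeds the limit."""
    for rps in range(start_rps, max_rps + 1, step_rps):
        if measure(rps) > max_error_rate:
            return rps
    return None  # no breaking point found within the tested range


breaking_point = find_breaking_point(simulated_error_rate)
```

In a real test, `measure` would drive traffic at the given rate against a test environment and return the observed error rate.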

Dependency testing

Essential for systems reliant on third-party services, this tests the system's response to failures or unpredictable behaviors of external services.
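A dependency test often replaces the external service with a stub that fails on purpose, then checks that the system degrades gracefully. The sketch below assumes a hypothetical exchange-rate provider and a cached fallback; all names and values are illustrative.

```python
def get_exchange_rate(provider, currency, cached_rates):
    """Return a live rate, falling back to the last cached value if the provider fails."""
    try:
        rate = provider(currency)
        cached_rates[currency] = rate
        return rate, "live"
    except (TimeoutError, ConnectionError):
        if currency in cached_rates:
            return cached_rates[currency], "cached"
        raise  # no fallback available, so surface the failure


def healthy_provider(currency):
    """Stub for the third-party service when it works (made-up rate)."""
    return 1.08


def failing_provider(currency):
    """Stub that simulates the third-party service timing out."""
    raise TimeoutError("provider did not respond")
```

Swapping `healthy_provider` for `failing_provider` shows whether callers receive a cached value or an unhandled error.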

State transition testing

It ensures the system smoothly transitions between different states, maintaining stability during dynamic changes.
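One way to check transitions is to record the states a component passes through during an experiment and compare them with the transitions that are allowed. The state names and the `ALLOWED` table below are assumptions made for illustration.

```python
# Hypothetical lifecycle of a service instance: state -> states it may move to.
ALLOWED = {
    "starting": {"healthy", "failed"},
    "healthy": {"degraded", "draining"},
    "degraded": {"healthy", "failed", "draining"},
    "draining": {"stopped"},
    "failed": {"starting"},
    "stopped": set(),
}


def first_invalid_transition(states):
    """Return the first (current, next) pair that is not allowed, or None."""
    for current, nxt in zip(states, states[1:]):
        if nxt not in ALLOWED.get(current, set()):
            return (current, nxt)
    return None


observed = ["starting", "healthy", "degraded", "healthy", "draining", "stopped"]
problem = first_invalid_transition(observed)  # None: every step is allowed
```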

Monitoring and observability

This is a key aspect of chaos engineering. Effective monitoring and observability allow teams to track the system's performance in real-time and respond quickly to issues that arise during testing.

Game days

Collaborative exercises in which teams simulate incidents and practice responding to them, enhancing their incident-handling skills. It’s a form of proactive training to prepare the team for real-world scenarios.

Postmortem and learning

After experiments, postmortem analysis is conducted to review, understand, and apply learnings for system improvement.

There are many other techniques and strategies employed in chaos engineering, each adding to the robustness and resilience of systems in different ways.

Implementing Chaos Engineering

Integrating Chaos Engineering into an organization's culture requires a thoughtful and strategic approach.

The first step is to identify the critical components of the system that would benefit most from enhanced resilience.

To do this, begin by conducting controlled experiments in a staging environment, carefully analyzing the impact on these components. This allows a safe exploration of the system's vulnerabilities without immediate risk to production environments.

Gradually move these experiments into the production setting as confidence and process understanding increase.

When running experiments in production, combine techniques that limit the blast radius: use feature flags to switch changes off quickly, have a rollback plan, and use canary releases to expose only a small set of users.

These are just some tips to make running experiments safe. The main point is to ensure there is a strategy that combines multiple techniques and approaches to mitigate user impact if something breaks.
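Two of these safeguards, a flag that switches the fault on and off and an automatic abort, can be combined in a few lines. This is a minimal sketch: `run_guarded`, the `flags` dictionary, and the 10% abort threshold are hypothetical, and the error-rate reader stands in for a real monitoring query.

```python
def run_guarded(flags, flag_name, read_error_rate, abort_above, checks):
    """Enable a fault behind a flag, watch a metric, and always switch the flag off."""
    flags[flag_name] = True  # turn the fault on
    try:
        for _ in range(checks):
            if read_error_rate() > abort_above:
                return "aborted"  # stop early when user impact grows too large
        return "completed"
    finally:
        flags[flag_name] = False  # rollback runs on every exit path


# Example with made-up readings: the third reading crosses the 10% limit.
flags = {"inject_latency": False}
readings = iter([0.01, 0.02, 0.30, 0.01])
outcome = run_guarded(flags, "inject_latency", lambda: next(readings),
                      abort_above=0.10, checks=4)
```

Placing the rollback in a `finally` block means the flag is cleared whether the experiment completes, aborts, or raises an unexpected error.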

Alongside these technical steps, fostering a cultural shift within the organization is very important: a shift to where failures are viewed as invaluable learning opportunities, and where open communication and teamwork in analysis are prioritized. Sharing insights and learnings from each experiment helps in cultivating a shared responsibility for system robustness.

Applicability Across Company Sizes

Chaos Engineering has gained widespread adoption for enhancing system resilience, demonstrating its effectiveness in complex, large-scale environments.

Its principles, however, are valuable to organizations of every size, from startups to big tech.

How it’s applied might look different at a startup than at a big tech company, but much of it is the same.

For example, a startup should still start small, in staging, and then move to production only when it’s safe to do so, with a strategy for mitigating user impact in place.

Chaos Engineering fosters a cultural shift toward proactive problem-solving and resilience. This shift is highly beneficial for smaller organizations, helping them cultivate a mindset focused on continuous improvement and robust system design.

Wrapping Up

Chaos Engineering is a fairly new practice, pioneered by Netflix around 2011 in response to the challenges posed by distributed systems. It has since gained broad adoption as a sophisticated practice to ensure system resilience and reliability.

By embracing this practice, companies can further safeguard against unforeseen disruptions, ensuring that their systems are robust and reliable even under stress.

How Do The Most Prominent API Architecture Styles Work? (Recap)

REST: Uses HTTP methods for operations, which provides a consistent interface. Its stateless nature ensures scalability, while URI-based resource identification provides structure.

GraphQL: Unlike REST, it uses a single endpoint. Clients specify exactly the data they need, and the server delivers it in a single query.

SOAP: Once dominant, SOAP remains vital in enterprises for its security and transactional robustness. It’s XML-based, versatile across various transport protocols, and includes WS-Security for comprehensive message security.

gRPC: Runs over HTTP/2, offering bidirectional streaming and multiplexing, and uses Protocol Buffers for efficient serialization. It supports various programming languages and diverse use cases across different domains.

WebSockets: Provides a full-duplex communication channel over a single, long-lived connection. It is ideal for applications requiring real-time communication.

MQTT: A lightweight messaging protocol optimized for high-latency or unreliable networks. It uses an efficient publish/subscribe model.
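To make the publish/subscribe model more concrete, here is a small in-memory sketch of topic-based delivery using MQTT-style wildcards (`+` for one level, `#` for all remaining levels). It is not a real MQTT client or broker; `topic_matches` and `Broker` are hypothetical names, and details such as `$`-prefixed system topics, QoS, and retained messages are left out.

```python
def topic_matches(topic_filter: str, topic: str) -> bool:
    """Check a topic against a filter with MQTT-style '+' and '#' wildcards."""
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(filter_levels):
        if level == "#":
            return True  # matches all remaining levels, including none
        if i >= len(topic_levels):
            return False
        if level != "+" and level != topic_levels[i]:
            return False
    return len(filter_levels) == len(topic_levels)


class Broker:
    """Toy in-memory broker: delivers each message to every matching subscriber."""

    def __init__(self):
        self.subscriptions = []

    def subscribe(self, topic_filter, callback):
        self.subscriptions.append((topic_filter, callback))

    def publish(self, topic, payload):
        for topic_filter, callback in self.subscriptions:
            if topic_matches(topic_filter, topic):
                callback(topic, payload)
```

Publishers and subscribers never reference each other directly. They only share topic names, which is what keeps the model lightweight.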

Linux Permissions Explained (Recap)

Linux is a multi-user OS that has robust built-in user and group permissions.

These permissions provide the ability to limit who has access to a file or directory and what actions (read, write, or execute) they are allowed to perform.

There are three permission types for each file and directory:

  • Read (r): Allows reading of a file or listing of the directory's contents.

  • Write (w): Allows modifying the contents of a file, or creating and deleting files in a directory.

  • Execute (x): Allows a file to be run as a program, or a directory to be entered into.

There are three types of users to whom permissions are assigned:

  • User (u): The owner of the file or directory.

  • Group (g): Other users who are members of the file's group.

  • Others (o): All other users who are not the owner or members of the group.
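The permission types and user classes above map directly onto the mode bits of a file. As a small sketch, Python's standard `stat` module can decode them; the helper names `describe` and `others_can_write` and the example mode `754` are chosen for illustration.

```python
import stat


def describe(mode: int) -> dict:
    """Split a file mode into rwx triplets for user, group, and others."""
    text = stat.filemode(mode)  # e.g. '-rwxr-xr--'
    return {"user": text[1:4], "group": text[4:7], "others": text[7:10]}


def others_can_write(mode: int) -> bool:
    """True if users outside the owner and group may modify the file."""
    return bool(mode & stat.S_IWOTH)


# Octal 754 on a regular file: user rwx (7), group r-x (5), others r-- (4).
mode = stat.S_IFREG | 0o754
permissions = describe(mode)
```

For a real file, the mode can be read with `os.stat(path).st_mode` and passed to the same helpers.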

The Most Popular Deployment Patterns

Blue/green deployment: Uses two environments to ensure zero downtime; one hosts the live version while the other hosts the new version for testing. This setup allows for an easy rollback if needed.

Canary deployment: Rolls out changes to a small subset of users first, enabling performance monitoring and gathering feedback. If successful, the update can be gradually extended to more users.

Rolling deployment: Updates the software in phases, ensuring most of the system remains operational. It’s ideal for systems that require continuous operation.

Feature toggles: Act like switches for new features. They allow teams to deploy features quietly, turning them on for specific users when it makes sense.

A/B testing: Tests different feature versions with various user groups to identify the most effective one. Useful for validating user preference and effectiveness based on concrete data.
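As one illustration of the patterns above, a rolling deployment can be sketched as updating servers in small batches and stopping as soon as a batch looks unhealthy. The function `rolling_update`, the server names, and the health check are hypothetical; real orchestrators handle this with far more safeguards.

```python
def rolling_update(servers, new_version, batch_size, is_healthy):
    """Update servers batch by batch; roll back and stop at the first unhealthy batch."""
    names = list(servers)
    for start in range(0, len(names), batch_size):
        batch = names[start:start + batch_size]
        previous = {name: servers[name] for name in batch}
        for name in batch:
            servers[name] = new_version
        if not all(is_healthy(name) for name in batch):
            servers.update(previous)  # restore only the failing batch
            return False
    return True


# Made-up fleet in which server "c" fails its health check after the update.
fleet = {"a": "v1", "b": "v1", "c": "v1", "d": "v1"}
succeeded = rolling_update(fleet, "v2", batch_size=2,
                           is_healthy=lambda name: name != "c")
```

Because only one batch changes at a time, most of the fleet keeps serving traffic throughout, and a bad release is contained to the batch that exposed it.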

That wraps up this week’s issue of Level Up Coding’s newsletter!

Join us again next week where we’ll explore how the most prominent Git branching strategies work, tips and strategies for effective debugging, and how OAuth 2.0 works.