AWS Fault Injection Service — Live Health Dashboard
Chaos engineering is the practice of intentionally injecting failures into a system to test its ability to withstand unexpected disruptions. Rather than waiting for outages to happen in production, teams proactively simulate real-world failure scenarios — such as server crashes, network latency, or database failovers — in a controlled environment. The goal is to uncover weaknesses before they become customer-facing incidents, building confidence that the system can handle turbulent conditions gracefully.
The discipline was pioneered at Netflix with their famous Chaos Monkey tool, which randomly terminated production instances to ensure their services could tolerate infrastructure failures. Today, AWS Fault Injection Service (FIS) brings this practice to the cloud with managed experiment templates that let you safely inject faults across EC2, RDS, ECS, EKS, and networking layers. By running chaos experiments regularly, teams shift from a reactive posture to a proactive one — validating that auto-scaling policies, failover mechanisms, and monitoring alerts actually work when it matters most.
This is a 3-tier web application deployed across two Availability Zones, built to demonstrate resilience testing with AWS Fault Injection Service (FIS).
The architecture is designed to be resilient — when faults are injected, you can observe how the system recovers automatically through ALB health checks, Auto Scaling replacement, and RDS Multi-AZ failover.
Application Load Balancer: internet-facing, HTTP :80, health checks every 10s
Auto Scaling Group: 2–4 t3.small Flask instances across 2 AZs
RDS MySQL 8.0 Multi-AZ: db.t3.micro, auto-failover enabled
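A quick back-of-envelope check on the recovery path: the 10-second health-check interval comes from this setup, while the unhealthy threshold of 2 consecutive failures is an assumed value (the ALB target-group default), giving a rough worst-case detection window:

```python
# Worst-case time before the ALB marks a dead target unhealthy and stops
# routing traffic to it. Interval is from this setup; threshold of 2 is an
# assumed ALB target-group default.

def detection_time(interval_s: int, unhealthy_threshold: int) -> int:
    """Seconds of consecutive failed checks needed to mark a target unhealthy."""
    return interval_s * unhealthy_threshold

print(detection_time(10, 2))  # 20 seconds under these assumptions
```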
Four pre-configured experiments are available. Run them from the FIS Console or via AWS CLI.
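For CLI-averse workflows, the same experiments can be started programmatically. A minimal sketch using boto3, where the template ID, tag key, and tag value are placeholders:

```python
# Sketch: starting a FIS experiment via boto3. The template ID and tags below
# are placeholders; the real call requires AWS credentials and an existing
# experiment template.
import uuid

def start_experiment_params(template_id: str) -> dict:
    """Build the keyword arguments for fis.start_experiment()."""
    return {
        "clientToken": str(uuid.uuid4()),        # idempotency token
        "experimentTemplateId": template_id,
        "tags": {"triggered-by": "dashboard"},   # placeholder tag
    }

# Live invocation (not run here):
# import boto3
# fis = boto3.client("fis")
# resp = fis.start_experiment(**start_experiment_params("EXT_PLACEHOLDER"))
# print(resp["experiment"]["state"]["status"])
```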
Terminates a single EC2 instance. Watch the ASG launch a replacement and the ALB reroute traffic within 3–5 minutes.
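An experiment like this typically pairs a tag-scoped target with the `aws:ec2:terminate-instances` action. A sketch of the relevant template fragment, where the tag key/value are assumptions for this demo:

```python
# Fragment of a FIS experiment template for terminating one tagged instance.
# The resource tag is a placeholder; COUNT(1) picks exactly one match.
terminate_fragment = {
    "targets": {
        "one-instance": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"App": "fis-demo"},   # assumed tag
            "selectionMode": "COUNT(1)",           # exactly one random instance
        }
    },
    "actions": {
        "terminate": {
            "actionId": "aws:ec2:terminate-instances",
            "targets": {"Instances": "one-instance"},
        }
    },
}
```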
Simulates a full Availability Zone failure. Traffic shifts to the surviving AZ. RDS may failover if the primary was in the affected AZ (~60s).
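One common way to scope an AZ-failure experiment is a target filter on the instance's placement, so every tagged instance in a single zone is affected. A sketch, with the AZ name and tag as placeholders:

```python
# FIS target definition selecting all tagged instances in one AZ.
# The AZ name and resource tag are placeholders for this demo.
az_target = {
    "resourceType": "aws:ec2:instance",
    "resourceTags": {"App": "fis-demo"},  # assumed tag
    "filters": [
        {"path": "Placement.AvailabilityZone", "values": ["us-east-1a"]}
    ],
    "selectionMode": "ALL",  # act on every instance in the zone
}
```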
Injects 500ms of network latency via SSM. Page loads slow down visibly, and the fault cleans up automatically after 120 seconds.
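Under the hood this kind of fault is driven by the `aws:ssm:send-command` FIS action running the AWS-owned `AWSFIS-Run-Network-Latency` document. A sketch of the action definition; the region, target name, and document parameter names are assumptions to verify against the document's schema:

```python
# Sketch of a FIS SSM latency action. documentParameters is a JSON string;
# the region, parameter names, and target key are placeholders/assumptions.
latency_action = {
    "actionId": "aws:ssm:send-command",
    "parameters": {
        "documentArn": "arn:aws:ssm:us-east-1::document/AWSFIS-Run-Network-Latency",
        "documentParameters": '{"DelayMilliseconds": "500", "DurationSeconds": "120"}',
        "duration": "PT3M",  # how long FIS waits for the command to complete
    },
    "targets": {"Instances": "app-instances"},  # placeholder target name
}
```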
Reboots the RDS primary with forced failover. The standby takes over as the new primary. DB connections drop briefly (~30–60s).
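The forced failover maps to the `aws:rds:reboot-db-instances` FIS action with its `forceFailover` parameter. A sketch, where the target name is a placeholder:

```python
# Sketch of a FIS action that reboots the RDS primary with forced failover,
# promoting the Multi-AZ standby. FIS parameter values are strings; the
# target name is a placeholder.
reboot_action = {
    "actionId": "aws:rds:reboot-db-instances",
    "parameters": {"forceFailover": "true"},
    "targets": {"DBInstances": "primary-db"},  # placeholder target name
}
```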