AWS Fault Injection Service — Live Health Dashboard
Chaos engineering is the practice of intentionally injecting failures into a system to test its ability to withstand unexpected disruptions. Rather than waiting for outages to happen in production, teams proactively simulate real-world failure scenarios — such as server crashes, network latency, or database failovers — in a controlled environment. The goal is to uncover weaknesses before they become customer-facing incidents, building confidence that the system can handle turbulent conditions gracefully.
The discipline was pioneered at Netflix with their famous Chaos Monkey tool, which randomly terminated production instances to ensure their services could tolerate infrastructure failures. Today, AWS Fault Injection Service (FIS) brings this practice to the cloud with managed experiment templates that let you safely inject faults across EC2, RDS, ECS, EKS, and networking layers. By running chaos experiments regularly, teams shift from a reactive posture to a proactive one — validating that auto-scaling policies, failover mechanisms, and monitoring alerts actually work when it matters most.
This is a 3-tier web application deployed across two Availability Zones, built to demonstrate resilience testing with AWS Fault Injection Service (FIS).
The architecture is designed to be resilient — when faults are injected, you can observe how the system recovers automatically through ALB health checks, Auto Scaling replacement, and RDS Multi-AZ failover.
Application Load Balancer: internet-facing, HTTP :80, health checks every 10s
Auto Scaling Group: 2–4 t3.small Flask instances across 2 AZs
RDS MySQL 8.0 Multi-AZ: db.t3.micro, auto-failover enabled
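A quick back-of-envelope check on the recovery path: the 10-second health-check interval comes from this setup, while the unhealthy threshold of 2 consecutive failures is an assumed value (the ALB target-group default), giving a rough worst-case detection window:

```python
# Worst-case time before the ALB marks a dead target unhealthy and stops
# routing traffic to it. Interval is from this setup; threshold of 2 is an
# assumed ALB target-group default.

def detection_time(interval_s: int, unhealthy_threshold: int) -> int:
    """Seconds of consecutive failed checks needed to mark a target unhealthy."""
    return interval_s * unhealthy_threshold

print(detection_time(10, 2))  # 20 seconds under these assumptions
```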
Four pre-configured experiments are available. Run them from the FIS Console or via AWS CLI.
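For CLI-averse workflows, the same experiments can be started programmatically. A minimal sketch using boto3, where the template ID, tag key, and tag value are placeholders:

```python
# Sketch: starting a FIS experiment via boto3. The template ID and tags below
# are placeholders; the real call requires AWS credentials and an existing
# experiment template.
import uuid

def start_experiment_params(template_id: str) -> dict:
    """Build the keyword arguments for fis.start_experiment()."""
    return {
        "clientToken": str(uuid.uuid4()),        # idempotency token
        "experimentTemplateId": template_id,
        "tags": {"triggered-by": "dashboard"},   # placeholder tag
    }

# Live invocation (not run here):
# import boto3
# fis = boto3.client("fis")
# resp = fis.start_experiment(**start_experiment_params("EXT_PLACEHOLDER"))
# print(resp["experiment"]["state"]["status"])
```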
Terminates a single EC2 instance. Watch the ASG launch a replacement and the ALB reroute traffic within 3–5 minutes.
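An experiment like this typically pairs a tag-scoped target with the `aws:ec2:terminate-instances` action. A sketch of the relevant template fragment, where the tag key/value are assumptions for this demo:

```python
# Fragment of a FIS experiment template for terminating one tagged instance.
# The resource tag is a placeholder; COUNT(1) picks exactly one match.
terminate_fragment = {
    "targets": {
        "one-instance": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"App": "fis-demo"},   # assumed tag
            "selectionMode": "COUNT(1)",           # exactly one random instance
        }
    },
    "actions": {
        "terminate": {
            "actionId": "aws:ec2:terminate-instances",
            "targets": {"Instances": "one-instance"},
        }
    },
}
```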
Simulates a full Availability Zone failure. Traffic shifts to the surviving AZ. RDS may failover if the primary was in the affected AZ (~60s).
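One common way to scope an AZ-failure experiment is a target filter on the instance's placement, so every tagged instance in a single zone is affected. A sketch, with the AZ name and tag as placeholders:

```python
# FIS target definition selecting all tagged instances in one AZ.
# The AZ name and resource tag are placeholders for this demo.
az_target = {
    "resourceType": "aws:ec2:instance",
    "resourceTags": {"App": "fis-demo"},  # assumed tag
    "filters": [
        {"path": "Placement.AvailabilityZone", "values": ["us-east-1a"]}
    ],
    "selectionMode": "ALL",  # act on every instance in the zone
}
```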
Injects 500ms of network latency via SSM. Page loads slow down visibly, and the fault cleans up automatically after 120 seconds.
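Under the hood this kind of fault is driven by the `aws:ssm:send-command` FIS action running the AWS-owned `AWSFIS-Run-Network-Latency` document. A sketch of the action definition; the region, target name, and document parameter names are assumptions to verify against the document's schema:

```python
# Sketch of a FIS SSM latency action. documentParameters is a JSON string;
# the region, parameter names, and target key are placeholders/assumptions.
latency_action = {
    "actionId": "aws:ssm:send-command",
    "parameters": {
        "documentArn": "arn:aws:ssm:us-east-1::document/AWSFIS-Run-Network-Latency",
        "documentParameters": '{"DelayMilliseconds": "500", "DurationSeconds": "120"}',
        "duration": "PT3M",  # how long FIS waits for the command to complete
    },
    "targets": {"Instances": "app-instances"},  # placeholder target name
}
```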
Reboots the RDS primary with forced failover. The standby takes over as the new primary. DB connections drop briefly (~30–60s).
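The forced failover maps to the `aws:rds:reboot-db-instances` FIS action with its `forceFailover` parameter. A sketch, where the target name is a placeholder:

```python
# Sketch of a FIS action that reboots the RDS primary with forced failover,
# promoting the Multi-AZ standby. FIS parameter values are strings; the
# target name is a placeholder.
reboot_action = {
    "actionId": "aws:rds:reboot-db-instances",
    "parameters": {"forceFailover": "true"},
    "targets": {"DBInstances": "primary-db"},  # placeholder target name
}
```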