🧪 FIS Resilience Demo

AWS Fault Injection Service — Live Health Dashboard

✅ Fleet Ready — Safe to run FIS experiments (2/2 healthy hosts, alarm: OK)
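
The banner above combines two signals: the healthy host count and a CloudWatch alarm state. A minimal sketch of how such a readiness gate could be computed follows. The names `FleetStatus`, `fleet_ready`, and `banner`, and the `min_healthy` threshold, are illustrative assumptions, not the demo's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FleetStatus:
    healthy_hosts: int  # targets currently passing load balancer health checks
    total_hosts: int    # targets registered with the load balancer
    alarm_state: str    # CloudWatch alarm state: "OK", "ALARM", or "INSUFFICIENT_DATA"


def fleet_ready(status: FleetStatus, min_healthy: int = 2) -> bool:
    """Return True only when every host is healthy and the alarm is OK.

    min_healthy is a hypothetical floor so that an empty or undersized
    fleet is never reported as ready.
    """
    all_healthy = status.total_hosts > 0 and status.healthy_hosts == status.total_hosts
    enough_hosts = status.healthy_hosts >= min_healthy
    return all_healthy and enough_hosts and status.alarm_state == "OK"


def banner(status: FleetStatus) -> str:
    """Format a one-line banner similar to the one shown on this page."""
    detail = (
        f"({status.healthy_hosts}/{status.total_hosts} healthy hosts, "
        f"alarm: {status.alarm_state})"
    )
    if fleet_ready(status):
        return f"Fleet Ready: safe to run FIS experiments {detail}"
    return f"Fleet Not Ready: hold experiments {detail}"
```

Gating on both signals means an experiment is not started while the fleet is still recovering from a previous one.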

📊 Live Instance Status

Instance ID: i-0139da614e8dc5796
Availability Zone: us-east-1b
Region: us-east-1
Database: ✅ Connected

🤔 What is Chaos Engineering?

Chaos engineering is the practice of intentionally injecting failures into a system to test its ability to withstand unexpected disruptions. Rather than waiting for outages to happen in production, teams proactively simulate real-world failure scenarios — such as server crashes, network latency, or database failovers — in a controlled environment. The goal is to uncover weaknesses before they become customer-facing incidents, building confidence that the system can handle turbulent conditions gracefully.

The discipline was pioneered at Netflix with its Chaos Monkey tool, which randomly terminated production instances to ensure that services could tolerate infrastructure failures. Today, AWS Fault Injection Service (FIS) offers this practice as a managed service, with experiment templates that let you safely inject faults across EC2, RDS, ECS, EKS, and networking layers. By running chaos experiments regularly, teams shift from a reactive posture to a proactive one, validating that auto-scaling policies, failover mechanisms, and monitoring alerts actually work when it matters most.

📖 About This Demo

This is a 3-tier web application deployed across 2 Availability Zones for demonstrating AWS Fault Injection Service (FIS) resilience testing.

The architecture is designed to be resilient — when faults are injected, you can observe how the system recovers automatically through ALB health checks, Auto Scaling replacement, and RDS Multi-AZ failover.

🌐 Web Tier

Application Load Balancer

Internet-facing, HTTP :80

Health checks every 10s

🖥 App Tier

Auto Scaling Group

2–4 t3.small Flask instances

Across 2 AZs

💾 Data Tier

RDS MySQL 8.0 Multi-AZ

db.t3.micro

Auto-failover enabled

💥 FIS Experiments

Four pre-configured experiments are available. Run them from the FIS Console or via AWS CLI.
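
Besides the console and the CLI, an experiment can also be started programmatically. The sketch below assumes a client object that behaves like `boto3.client("fis")`. The helper names and the template ID shown in the comments are placeholders, not values taken from this demo.

```python
import uuid


def start_fis_experiment(fis_client, template_id: str) -> str:
    """Start an experiment from a template and return the experiment ID.

    fis_client is expected to behave like boto3.client("fis").
    """
    response = fis_client.start_experiment(
        clientToken=str(uuid.uuid4()),     # idempotency token for the request
        experimentTemplateId=template_id,  # placeholder: use your own template ID
    )
    return response["experiment"]["id"]


def experiment_status(fis_client, experiment_id: str) -> str:
    """Return the current status string of an experiment, e.g. "running"."""
    response = fis_client.get_experiment(id=experiment_id)
    return response["experiment"]["state"]["status"]


# Example usage (requires boto3 and AWS credentials; the ID is hypothetical):
#   import boto3
#   fis = boto3.client("fis", region_name="us-east-1")
#   exp_id = start_fis_experiment(fis, "EXT-placeholder")
#   print(experiment_status(fis, exp_id))
```

Passing the client in as an argument keeps the helpers easy to exercise with a stub before pointing them at a real account.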

💣 Instance Failure: terminates 1 instance

Terminates a single EC2 instance. Watch the ASG launch a replacement and the ALB reroute traffic within 3–5 minutes.

🌊 AZ Outage: terminates all instances in 1 AZ

Simulates a full Availability Zone failure. Traffic shifts to the surviving AZ. If the RDS primary was in the affected AZ, RDS may also fail over to the standby (~60s).

📡 Network Disruption: 500 ms latency for 120 s

Injects 500 ms of network latency via SSM. Page loads slow down visibly. The fault is removed automatically after 120 seconds.
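
For reference, here is a sketch of what the action section of such an experiment template could look like when built in Python. It assumes the AWS-provided SSM document `AWSFIS-Run-Network-Latency` and the FIS action `aws:ssm:send-command`. The target name `app-instances`, the interface name `eth0`, and the helper names are assumptions, and the demo's real template may differ.

```python
import json


def iso_duration(seconds: int) -> str:
    """Render a whole number of seconds as an ISO 8601 duration, e.g. PT2M."""
    minutes, secs = divmod(seconds, 60)
    out = "PT"
    if minutes:
        out += f"{minutes}M"
    if secs or not minutes:
        out += f"{secs}S"
    return out


def network_latency_action(region: str, delay_ms: int = 500,
                           duration_s: int = 120,
                           target_name: str = "app-instances") -> dict:
    """Build one FIS action that adds latency through an SSM document."""
    document_params = {
        "DelayMilliseconds": str(delay_ms),
        "DurationSeconds": str(duration_s),
        "Interface": "eth0",  # assumption: depends on the instance's OS image
    }
    return {
        "actionId": "aws:ssm:send-command",
        "parameters": {
            "documentArn": f"arn:aws:ssm:{region}::document/AWSFIS-Run-Network-Latency",
            # FIS expects the document parameters as a JSON-encoded string
            "documentParameters": json.dumps(document_params),
            "duration": iso_duration(duration_s),
        },
        # target_name is a hypothetical key into the template's "targets" map
        "targets": {"Instances": target_name},
    }
```

The resulting dictionary would sit under one named entry in a template's `actions` map.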

💾 RDS Failover: forces Multi-AZ failover

Reboots the RDS primary with forced failover. The standby takes over as the new primary. DB connections drop briefly (~30–60s).
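
Because connections drop during the failover window, the application has to reconnect rather than fail permanently. The following is a minimal sketch of a retry loop with exponential backoff. It assumes the database driver's connection errors are surfaced as `ConnectionError`; the function name and default timings are illustrative and are not taken from the demo's Flask code.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_with_retry(connect: Callable[[], T],
                       max_wait_s: float = 90.0,
                       base_delay_s: float = 1.0,
                       max_delay_s: float = 10.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call connect() until it succeeds or the wait budget is used up.

    The delay doubles after each failure, capped at max_delay_s. Once the
    total time slept reaches max_wait_s, the last error is re-raised.
    """
    waited = 0.0
    delay = base_delay_s
    while True:
        try:
            return connect()
        except ConnectionError:  # assumption: driver errors map to this type
            if waited >= max_wait_s:
                raise
            sleep(delay)
            waited += delay
            delay = min(delay * 2, max_delay_s)
```

The default budget of 90 seconds is chosen to exceed the ~30–60s window quoted above. The `sleep` parameter exists so the loop can be exercised without real waiting.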

👀 What to Observe During Experiments

🏗 Architecture Diagram

[Diagram: FIS Resilience Demo Architecture]

Auto-refreshes every 5 seconds · Served by i-0139da614e8dc5796 in us-east-1b