Building Self-Healing Systems with AWS Lambda and CloudWatch
As systems grow in size and complexity, the number of alerts received by on-call SRE teams can become overwhelming, leading to alert fatigue. Many of these pages are triggered by recurring, trivial issues (such as a full temporary disk directory or a memory leak in a legacy app) that require a simple, standard response (such as cleaning up logs or restarting the process).
If an incident can be resolved by a known sequence of steps, those steps should be automated. By combining CloudWatch Alarms, EventBridge, and AWS Lambda, SREs can build self-healing systems that detect and resolve common anomalies without human intervention.
The Architecture of Self-Healing
The pattern for automated self-healing is straightforward:
[System Anomaly] ──> [CloudWatch Alarm] ──> [EventBridge Rule] ──> [Lambda Function] ──> [System Restored]
Let's break down the components of this workflow:
- Detection (CloudWatch): A CloudWatch alarm monitors a specific system metric (e.g. disk usage on an EC2 instance, or memory utilization in a container).
- Notification & Routing (EventBridge): When the alarm changes state to ALARM, it publishes an event to Amazon EventBridge.
- Execution (Lambda): EventBridge matches the event against a rule and triggers an AWS Lambda function.
- Resolution (AWS Systems Manager / API Call): The Lambda function executes remediation actions (e.g. runs an AWS Systems Manager Command to clean up log files or restarts a service).
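The routing step comes down to an EventBridge event pattern that matches the alarm's transition into ALARM. The following is a minimal sketch, not a complete setup: the helper name build_alarm_pattern and the alarm name disk-usage-high are assumptions made for illustration.

```python
import json


def build_alarm_pattern(alarm_name):
    """Build an EventBridge event pattern matching one alarm entering ALARM."""
    return {
        "source": ["aws.cloudwatch"],
        "detail-type": ["CloudWatch Alarm State Change"],
        "detail": {
            "alarmName": [alarm_name],
            "state": {"value": ["ALARM"]},
        },
    }


# "disk-usage-high" is a hypothetical alarm name used for illustration.
pattern_json = json.dumps(build_alarm_pattern("disk-usage-high"), indent=2)
print(pattern_json)
```

The resulting JSON string can be supplied as the EventPattern when creating the rule (for example with boto3's events.put_rule), with the Lambda function attached as the rule's target. The function also needs a resource-based permission that lets EventBridge invoke it.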
Real-World Example 1: Automated Disk Space Cleanup
One of the most common SRE pages is high disk space utilization. Here is how to automate the cleanup:
- CloudWatch Alarm: Set an alarm for when an EC2 instance's DiskSpaceUtilization exceeds 85%. EC2 does not report disk usage by default, so this must be a custom metric published from the instance (for example, by the CloudWatch agent).
- EventBridge Rule: Configure an EventBridge rule that triggers when the specific disk space alarm moves to the ALARM state.
- Lambda Remediation: The Lambda function receives the event containing the EC2 instance ID. It calls the AWS Systems Manager (SSM) API to run a shell script on the target instance:
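import boto3

def lambda_handler(event, context):
    ssm = boto3.client('ssm')
    # The alarm's first metric carries the InstanceId dimension
    instance_id = event['detail']['configuration']['metrics'][0]['metricStat']['metric']['dimensions']['InstanceId']
    # Run cleanup commands via SSM RunCommand (the instance must be managed by SSM)
    response = ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={'commands': [
            # Delete compressed logs older than 7 days
            'find /var/log -name "*.log.gz" -mtime +7 -delete',
            # Remove unused Docker images, containers, networks, and volumes
            'docker system prune -af --volumes'
        ]}
    )
    # Return only the command ID: the full response contains datetime
    # values that Lambda cannot serialize to JSON
    return {'commandId': response['Command']['CommandId']}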
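This clears compressed logs older than seven days and unused Docker resources (including unused volumes, so make sure nothing stored there is still needed), bringing disk utilization down, ideally before an SRE is ever paged.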
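Before wiring the function to a live alarm, the event parsing can be exercised locally. The sketch below uses a trimmed sample of a CloudWatch Alarm State Change event; the alarm name, instance ID, and the helper extract_instance_id are illustrative assumptions, and a real event carries more fields than shown here.

```python
# Trimmed, hypothetical sample of a "CloudWatch Alarm State Change" event.
SAMPLE_EVENT = {
    "source": "aws.cloudwatch",
    "detail-type": "CloudWatch Alarm State Change",
    "detail": {
        "alarmName": "disk-usage-high",
        "state": {"value": "ALARM"},
        "configuration": {
            "metrics": [
                {
                    "id": "m1",
                    "metricStat": {
                        "metric": {
                            "namespace": "System/Linux",
                            "name": "DiskSpaceUtilization",
                            "dimensions": {"InstanceId": "i-0123456789abcdef0"},
                        },
                        "period": 300,
                        "stat": "Average",
                    },
                    "returnData": True,
                }
            ]
        },
    },
}


def extract_instance_id(event):
    """Return the first InstanceId dimension found in the alarm's metrics, or None."""
    metrics = event.get("detail", {}).get("configuration", {}).get("metrics", [])
    for metric in metrics:
        # Metric-math entries have no "metricStat", so fall back to empty dicts.
        dimensions = metric.get("metricStat", {}).get("metric", {}).get("dimensions", {})
        if "InstanceId" in dimensions:
            return dimensions["InstanceId"]
    return None


print(extract_instance_id(SAMPLE_EVENT))
```

Returning None instead of raising a KeyError lets the handler skip remediation and alert a human when an alarm without an InstanceId dimension is routed to it by mistake.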
Real-World Example 2: Mitigating Memory Leaks in Containers
If an application suffers from a slow memory leak and cannot be refactored immediately, you can implement a self-healing restart cycle:
- CloudWatch Alarm: Set an alarm for when an ECS service's memory utilization stays above 90% for more than 10 minutes.
- Lambda Remediation: The Lambda function calls the ECS UpdateService API with the forceNewDeployment: true parameter, triggering a rolling deployment that starts new tasks before terminating the leaking ones. As long as the service's deployment configuration keeps enough healthy tasks running, this happens without customer-facing downtime.
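The remediation step could look roughly like the sketch below. It assumes the alarm watches the service-level MemoryUtilization metric, so the event carries ClusterName and ServiceName dimensions. The helper restart_service is a hypothetical name, and the client is passed in so the logic can be tested without AWS access.

```python
def restart_service(ecs_client, cluster, service):
    """Force a rolling redeployment of one ECS service."""
    response = ecs_client.update_service(
        cluster=cluster,
        service=service,
        forceNewDeployment=True,
    )
    # Return only plain strings: the full response contains datetime
    # values that Lambda cannot serialize to JSON.
    return {"cluster": cluster, "service": response["service"]["serviceName"]}


def lambda_handler(event, context):
    # Imported here so restart_service can be exercised without boto3 installed.
    import boto3

    # Assumption: the alarm's first metric has ClusterName/ServiceName dimensions.
    metric = event["detail"]["configuration"]["metrics"][0]["metricStat"]["metric"]
    dimensions = metric["dimensions"]
    return restart_service(
        boto3.client("ecs"),
        dimensions["ClusterName"],
        dimensions["ServiceName"],
    )
```

Reading the cluster and service from the event keeps one function reusable across several alarms. Hard-coding them in environment variables would be a reasonable alternative for a single service.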
Crucial Safeguards for Self-Healing Systems
Automated mitigation is powerful, but if designed poorly, it can exacerbate outages:
- Limit Remediation Frequency: Prevent runbook loops. If a Lambda function runs a disk cleanup, but the disk fills up again within 5 minutes, do not run it again. Fall back to paging a human.
- Implement Rate Limiting: Cap how many hosts or tasks can be remediated at once, so your self-healing scripts cannot restart all your servers simultaneously and crash your entire service.
- Log Everything: Ensure all automated actions post notifications to your team’s Slack channel or logging platforms so the on-call team is aware of what the system did.
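The first safeguard can be sketched as a small cooldown guard. This is an illustrative sketch under stated assumptions: the class name CooldownGuard, the 30-minute default, and the in-memory dictionary are all hypothetical. Lambda does not reliably keep memory between invocations, so a real implementation would store the last-run timestamps in an external store.

```python
import time


class CooldownGuard:
    """Allow at most one automated remediation per target within a cooldown window."""

    def __init__(self, cooldown_seconds=1800, clock=time.time):
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        # In-memory only, for illustration: target id -> time of last remediation.
        self._last_run = {}

    def decide(self, target_id):
        """Return "remediate" if allowed, otherwise "page_human"."""
        now = self.clock()
        last = self._last_run.get(target_id)
        if last is not None and now - last < self.cooldown_seconds:
            # The previous fix did not hold: escalate instead of looping.
            return "page_human"
        self._last_run[target_id] = now
        return "remediate"


guard = CooldownGuard(cooldown_seconds=300)
print(guard.decide("i-0123456789abcdef0"))  # first alarm for this instance
print(guard.decide("i-0123456789abcdef0"))  # repeat alarm inside the window
```

The handler would call decide with the instance or service identifier before acting, and route the "page_human" outcome to the normal paging path.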
Summary
Self-healing infrastructure minimizes operational toil and keeps your platform stable. By automating standard operating procedures using CloudWatch, EventBridge, and Lambda, you can free up your SRE team to focus on proactive engineering instead of manual incident response.