Infrastructure as Code: Drift Detection and Mitigation

Infrastructure as Code (IaC) is designed to ensure that your environment configurations match your source code. You write a Terraform configuration, execute terraform apply, and provision your resources.

However, over time, a common operational problem arises: Infrastructure Drift.

Drift occurs when changes are made directly to live cloud resources outside of the IaC workflow. Typical causes are an engineer modifying a security group port via the AWS console during a late-night debugging session, or an automated AWS patching process changing a resource setting.

If drift is left undetected, future Terraform executions can fail, overwrite critical hotfixes, or introduce silent security gaps.

The Consequences of Infrastructure Drift

Drift introduces serious operational risks:

  1. Security Vulnerabilities: An engineer temporarily opening port 22 or 3306 to troubleshoot an issue might forget to close it, leaving a database or server exposed to the public internet indefinitely.
  2. Brittle Deployments: If a manual configuration change is missing from your git repository, rebuilding the environment from scratch (e.g. during a disaster recovery scenario) will result in a broken setup.
  3. Broken Plans: The next time the CI/CD pipeline runs, terraform plan will propose reverting the manual changes, and the terraform apply that follows will undo them, potentially causing unexpected service interruptions.

Implementing Automated Drift Detection

To mitigate drift, SRE teams must implement continuous audit processes rather than relying on occasional manual runs.

Here are the primary strategies to detect drift:

1. Scheduled CI/CD Verification Jobs

Configure a nightly runner in your CI/CD platform (e.g., GitHub Actions, GitLab CI) that executes a dry-run plan:

terraform plan -detailed-exitcode -no-color

The -detailed-exitcode flag changes the CLI exit code behavior:

  • 0: Succeeded, no changes.
  • 1: Failed with an error.
  • 2: Succeeded with a non-empty diff (changes are pending). On a scheduled run against code that has already been applied, this indicates drift.

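If the job exits with code 2, have the CI runner send a Slack notification or page the on-call team with the plan output attached. Treat exit code 1 as a failed check rather than a clean result, since a plan that errored says nothing about drift.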
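
The exit-code handling above can be sketched as a small Python wrapper. This is a minimal illustration, assuming the terraform binary is on the PATH and the working directory has already been initialized. The function names and the notify placeholder are hypothetical and not part of any CI platform's API.

```python
import subprocess
import sys

# Meaning of each exit code when -detailed-exitcode is set.
PLAN_STATUS = {0: "in_sync", 1: "error", 2: "changes_detected"}


def classify_plan_exit(code: int) -> str:
    """Map a `terraform plan -detailed-exitcode` exit code to a status label."""
    # Any unexpected code is treated as an error rather than a clean result.
    return PLAN_STATUS.get(code, "error")


def run_drift_check(workdir: str = "."):
    """Run a dry-run plan and return (status, combined plan output)."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-no-color", "-input=false"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit(result.returncode), result.stdout + result.stderr


def notify(message: str) -> None:
    """Hypothetical placeholder: swap in a Slack webhook or paging call."""
    print(message, file=sys.stderr)


if __name__ == "__main__":
    status, output = run_drift_check()
    if status == "changes_detected":
        notify("Drift detected:\n" + output)
    elif status == "error":
        notify("Drift check failed:\n" + output)
    # Exit non-zero for anything other than a clean plan so the CI job is marked failed.
    sys.exit(0 if status == "in_sync" else 1)
```

Keeping the exit-code mapping in its own function makes the alerting decision easy to test without invoking Terraform.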

2. GitOps Reconcile Loops

For Kubernetes or advanced cloud infrastructure, consider moving from push-based pipelines to pull-based reconcile loops using tools like Flux, Argo CD, or Crossplane.

  • These GitOps controllers continuously monitor your active state (every few minutes) and compare it against the target state in git.
  • If drift is detected, they can be configured to auto-reconcile (automatically overwrite manual changes and restore the state to match git).
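
The compare-and-correct cycle that these controllers run can be illustrated with a toy Python sketch. It is not how Flux, Argo CD, or Crossplane are implemented. The dictionaries standing in for desired and live state, and the function names, are assumptions made purely for illustration.

```python
from typing import Any, Dict, Tuple


def diff_state(desired: Dict[str, Any], live: Dict[str, Any]) -> Dict[str, tuple]:
    """Return {setting: (desired_value, live_value)} for every setting that differs.

    A setting that is absent on one side is reported as None on that side.
    """
    drift = {}
    for key in desired.keys() | live.keys():
        want, have = desired.get(key), live.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


def reconcile(
    desired: Dict[str, Any], live: Dict[str, Any], auto_heal: bool = False
) -> Tuple[Dict[str, Any], Dict[str, tuple]]:
    """One pass of the loop: detect drift and optionally restore the desired state."""
    drift = diff_state(desired, live)
    if drift and auto_heal:
        # Auto-reconcile: overwrite the live state with what git declares.
        return dict(desired), drift
    # Detection only: report the drift and leave the live state untouched.
    return dict(live), drift


if __name__ == "__main__":
    desired = {"ingress_ports": [443], "instance_type": "t3.small"}
    live = {"ingress_ports": [443, 22], "instance_type": "t3.small"}
    print(reconcile(desired, live, auto_heal=True))
```

A real controller runs this pass on a timer or in response to change events, and applies the correction through the platform's API rather than by returning a dictionary.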

3. AWS Config and CloudTrail

Enable AWS Config to track resource changes. If a security group is modified, AWS Config records the change and evaluates it against compliance rules. Combine this with CloudTrail logs to identify which user or IAM role executed the manual change.
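
As a rough sketch of the attribution step, the Python function below filters already-parsed CloudTrail records for security group rule changes that were not made by the deployment pipeline. The field names (eventName, eventTime, userIdentity.arn, requestParameters.groupId) follow CloudTrail's record format for these API calls. The role name terraform-ci and the function name are hypothetical, and the sketch assumes the pipeline authenticates by assuming that role.

```python
from typing import Iterable, List

# EC2 API calls that add or remove security group rules.
SG_MUTATIONS = {
    "AuthorizeSecurityGroupIngress",
    "AuthorizeSecurityGroupEgress",
    "RevokeSecurityGroupIngress",
    "RevokeSecurityGroupEgress",
}


def manual_sg_changes(events: Iterable[dict], pipeline_role: str = "terraform-ci") -> List[dict]:
    """Return security group rule changes made by anyone other than the pipeline role."""
    # Assumed-role sessions carry ARNs of the form
    # arn:aws:sts::<account>:assumed-role/<role-name>/<session-name>.
    marker = f":assumed-role/{pipeline_role}/"
    findings = []
    for event in events:
        if event.get("eventName") not in SG_MUTATIONS:
            continue
        actor = (event.get("userIdentity") or {}).get("arn", "unknown")
        if marker in actor:
            continue  # Change came from the pipeline, so it is not drift.
        params = event.get("requestParameters") or {}
        findings.append(
            {
                "time": event.get("eventTime"),
                "actor": actor,
                "action": event.get("eventName"),
                "group_id": params.get("groupId"),
            }
        )
    return findings
```

Each finding names who made the change and when, which is the information the on-call engineer needs before deciding whether to revert it.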


Resolving Drift Safely

When drift is detected, the SRE team has two paths:

  1. Rollback (Align Cloud to Git): Execute terraform apply to overwrite the manual change and restore the infrastructure back to the state defined in your git repository. This is the preferred path for security drift.
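  2. Import (Align Git to Cloud): If the manual change was a legitimate upgrade, update your Terraform code to match the cloud state, then run terraform plan to confirm that it reports no changes. If the change created a resource that Terraform does not yet manage, Terraform 1.5 and later let you declare an import block, alongside a matching resource block, to bring it into state without running manual state commands:

import {
  to = aws_security_group.app_sg
  id = "sg-0123456789abcdef"
}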
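
Choosing between the two paths is easier when you can see exactly which attributes differ. The Python sketch below summarizes a plan that has been saved with terraform plan -out=tfplan and exported with terraform show -json tfplan. It relies on the resource_changes list in Terraform's JSON plan output, while the function name and the plan.json file name are assumptions. Attributes whose values are only known after apply are omitted from the JSON and may therefore show up as changed.

```python
import json
import sys
from typing import Dict, List


def changed_attributes(plan: dict) -> Dict[str, List[str]]:
    """Map each resource with pending changes to its differing top-level attributes."""
    summary = {}
    for rc in plan.get("resource_changes", []):
        change = rc.get("change") or {}
        if change.get("actions") == ["no-op"]:
            continue  # Resource already matches the configuration.
        # `before` is null for creates and `after` is null for deletes.
        before = change.get("before") or {}
        after = change.get("after") or {}
        keys = sorted(
            k for k in before.keys() | after.keys() if before.get(k) != after.get(k)
        )
        summary[rc["address"]] = keys
    return summary


if __name__ == "__main__":
    # Usage (assumed file name): python summarize_plan.py plan.json
    with open(sys.argv[1]) as handle:
        for address, attrs in changed_attributes(json.load(handle)).items():
            print(f"{address}: {', '.join(attrs) or '(no attribute diff)'}")
```

In this output, "before" is what currently exists in the cloud and "after" is what the code declares, so a changed ingress attribute on a security group usually argues for a rollback, while a changed instance size may be an upgrade worth codifying.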

Summary

Infrastructure drift degrades platform consistency. By implementing nightly scheduled validation jobs, utilizing GitOps reconcile loops, and establishing clear guidelines for resolving drift, SRE teams can maintain reliable, secure, and fully auditable cloud environments.