Demystifying Cloud Networking: AWS VPC Architecture Patterns for SREs
Cloud networking is the foundation upon which all modern cloud infrastructures are built. Yet, it is often treated as secondary to application deployment. For Site Reliability Engineers (SREs), a poorly designed virtual network is a ticking time bomb that leads to routing complexity, security vulnerabilities, single points of failure, and unexpectedly high data transfer costs.
To build reliable and secure systems on AWS, we must design Virtual Private Clouds (VPCs) with scalability, isolation, and cost-efficiency in mind.
Here is an SRE-focused guide to modern AWS VPC architecture patterns.
1. Subnet Tiering: The Foundation of Network Isolation
A secure network enforces strict isolation between different components of a system. A standard best practice is to divide your VPC CIDR block (commonly a /16 block like 10.0.0.0/16) into three distinct subnet tiers across at least three Availability Zones (AZs) for high availability:
Public Subnet Tier
- Routing: Has a direct route to the Internet Gateway (IGW) via 0.0.0.0/0.
- Resources: Application Load Balancers (ALBs), NAT Gateways, and public bastion hosts.
- SRE Guideline: Never run application servers or database instances in the public subnet. This tier should strictly serve as the ingress and egress interface.
Private (Application) Subnet Tier
- Routing: No direct inbound internet route. Outbound internet access is routed through a NAT Gateway in the public subnet.
- Resources: Compute workloads, such as Amazon ECS tasks, EKS pods, and internal EC2 microservices.
- SRE Guideline: Workloads in this tier can fetch external dependencies (e.g., packages, external APIs) securely while remaining shielded from direct public internet exposure.
Isolated (Data) Subnet Tier
- Routing: No route to the Internet Gateway or the NAT Gateway. Communication is strictly internal within the VPC.
- Resources: Databases (Amazon RDS, ElastiCache clusters) and highly sensitive queue systems.
- SRE Guideline: By completely severing outbound internet routes, you prevent database servers from initiating connections to the internet, mitigating data exfiltration risks if a workload in this tier is compromised.
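The three-tier layout can be sketched as a small CIDR plan using Python's standard `ipaddress` module. This is a minimal sketch rather than a prescribed layout: the `plan_subnets` helper, the /20 subnet size, and the AZ names are assumptions chosen for illustration.

```python
import ipaddress

def plan_subnets(vpc_cidr, azs, tiers=("public", "private", "isolated"), new_prefix=20):
    """Assign one equal-sized subnet to every (tier, AZ) pair.

    Hypothetical helper: the /20 size is an illustrative assumption.
    """
    vpc = ipaddress.ip_network(vpc_cidr)
    blocks = list(vpc.subnets(new_prefix=new_prefix))
    needed = len(tiers) * len(azs)
    if needed > len(blocks):
        raise ValueError(f"{vpc_cidr} holds {len(blocks)} /{new_prefix} blocks, need {needed}")
    plan = {}
    index = 0
    for tier in tiers:
        for az in azs:
            plan[(tier, az)] = blocks[index]
            index += 1
    return plan

# Assumed AZ names; substitute the AZs of your own region.
plan = plan_subnets("10.0.0.0/16", ["us-east-1a", "us-east-1b", "us-east-1c"])
for (tier, az), block in plan.items():
    print(f"{tier:<9} {az}  {block}")
```

A /16 split into /20 blocks yields 16 subnets, so the nine used here leave seven spare blocks for future tiers or AZs. AWS reserves five addresses in every subnet, so each /20 offers 4,091 usable addresses.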
2. Optimizing Outbound Routing: NAT Gateways vs. NAT Instances
For private subnet resources to reach the internet (e.g., to perform package updates or call external APIs), they must translate their private IPs to a public IP. AWS offers two primary options: NAT Gateways and NAT Instances.
| Criteria | AWS NAT Gateway (Managed) | EC2 NAT Instance (Self-Managed) |
|---|---|---|
| Maintenance | None. Fully managed by AWS. | High. SREs must manage patching and OS updates. |
| High Availability | Redundant within a single AZ (deploy one per AZ for zonal independence). Scales up to 100 Gbps. | Requires scripting/auto-scaling groups for failover. |
| Cost | Expensive ($0.045/hour + $0.045/GB processed in us-east-1). | Low. Costs only the price of a small EC2 instance. |
| Use Case | Production & mission-critical workloads. | Development, staging, and low-budget environments. |
Cost Optimization Tip for SREs:
NAT Gateway fees are a common source of "cloud bill shock." A NAT Gateway lives in a single Availability Zone and is billed per gateway, so the fault-tolerant production setup of one NAT Gateway per AZ (three in a three-AZ VPC) triples the hourly charge. For staging and development environments, SREs can instead route the private subnets of every AZ through a single shared NAT Gateway, accepting cross-AZ data transfer charges and the loss of zonal independence, or migrate to Graviton-based NAT Instances running lightweight Linux routing daemons. Across many environments, this can save thousands of dollars monthly.
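The trade-off can be made concrete with some quick arithmetic. The sketch below uses the rates from the table above ($0.045/hour and $0.045/GB); the 730-hour month, the 1,000 GB traffic figure, and the function name are assumptions for illustration, and cross-AZ transfer charges are deliberately left out.

```python
HOURS_PER_MONTH = 730  # assumption: average month length in hours

def nat_gateway_monthly_cost(gateways, gb_processed, hourly_rate=0.045, per_gb_rate=0.045):
    """Estimate the monthly NAT Gateway bill in USD.

    Hypothetical helper; rates are taken from the comparison table and vary by region.
    """
    fixed = gateways * hourly_rate * HOURS_PER_MONTH
    variable = gb_processed * per_gb_rate
    return fixed + variable

production = nat_gateway_monthly_cost(gateways=3, gb_processed=1000)
staging = nat_gateway_monthly_cost(gateways=1, gb_processed=1000)
print(f"three gateways: ${production:.2f}/month")
print(f"one gateway:    ${staging:.2f}/month")
print(f"saved:          ${production - staging:.2f}/month")
```

Under these assumptions the hourly charge alone differs by roughly $66 per month per environment. The per-GB processing fee is the same in both layouts, so large savings tend to come from many environments or from cutting processed traffic.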
3. Keep Traffic Internal with VPC Endpoints
By default, when an application in a private subnet communicates with another AWS service (like Amazon S3, DynamoDB, or Secrets Manager), the traffic leaves the VPC through the NAT Gateway to reach the service's public endpoint. This incurs NAT Gateway data processing fees and adds latency.
VPC Endpoints solve this by enabling private connections between your VPC and supported AWS services without leaving the Amazon network backbone.
There are two types of VPC Endpoints:
1. Gateway Endpoints (Free)
- Supported Services: Amazon S3 and Amazon DynamoDB.
- How it works: Adds a route to your subnet route tables that sends S3/DynamoDB traffic (matched by an AWS-managed prefix list) straight to the service.
- SRE Guideline: There is no hourly charge or data transfer cost for Gateway Endpoints. SREs should always configure Gateway Endpoints in every VPC to instantly reduce NAT processing fees.
2. Interface Endpoints (AWS PrivateLink)
- Supported Services: Secrets Manager, Systems Manager, ECS/ECR, CloudWatch, and most other AWS services.
- How it works: Provisions an Elastic Network Interface (ENI) with a private IP address inside your subnet.
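- SRE Guideline: While PrivateLink charges a small hourly fee (about $0.01 per endpoint per AZ) plus per-GB data processing fees, it is crucial for isolated subnets that need to fetch container images from ECR or retrieve database credentials from Secrets Manager. Because ECR serves image layers from S3, image pulls also require the S3 Gateway Endpoint.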
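To judge when an Interface Endpoint pays for itself against NAT processing fees, a rough break-even calculation helps. This is a sketch under stated assumptions: the $0.01 per-GB endpoint processing rate and the 730-hour month are not from the figures above and should be checked against current pricing for your region, and the function names are hypothetical.

```python
HOURS_PER_MONTH = 730  # assumption: average month length in hours

def interface_endpoint_monthly_cost(azs, gb, hourly_rate=0.01, per_gb_rate=0.01):
    """Monthly USD cost of one Interface Endpoint deployed in `azs` AZs.

    per_gb_rate is an assumed processing fee; verify it for your region.
    """
    return azs * hourly_rate * HOURS_PER_MONTH + gb * per_gb_rate

def nat_processing_cost(gb, per_gb_rate=0.045):
    """NAT Gateway data processing fee for the same traffic."""
    return gb * per_gb_rate

def break_even_gb(azs, hourly_rate=0.01, endpoint_per_gb=0.01, nat_per_gb=0.045):
    """Monthly traffic above which the Interface Endpoint is cheaper than NAT."""
    return azs * hourly_rate * HOURS_PER_MONTH / (nat_per_gb - endpoint_per_gb)

print(f"break-even for 3 AZs: {break_even_gb(3):.0f} GB/month")
for gb in (100, 1000, 5000):
    via_nat = nat_processing_cost(gb)
    via_endpoint = interface_endpoint_monthly_cost(3, gb)
    print(f"{gb:>5} GB  NAT ${via_nat:.2f}  endpoint ${via_endpoint:.2f}")
```

Gateway Endpoints need no such calculation: with no hourly or per-GB charge, every gigabyte of S3 or DynamoDB traffic moved off the NAT Gateway saves the full processing fee.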
4. Multi-Layered Security: SGs vs. NACLs
Securing your network requires a defense-in-depth strategy combining Security Groups and Network Access Control Lists (NACLs):
- Security Groups (Stateful / Resource-level):
  - Act as a firewall for associated resources (like an EC2 instance).
  - Support allow rules only.
  - Because they are stateful, allowing inbound traffic on port 443 automatically allows the response traffic, regardless of outbound rules.
- NACLs (Stateless / Subnet-level):
  - Act as a firewall at the subnet boundary.
  - Support both allow and deny rules, evaluated in ascending rule-number order.
  - Because they are stateless, you must explicitly allow both directions, including return traffic on ephemeral ports (typically 1024-65535).
The SRE Workflow:
Use Security Groups for day-to-day microservice authorization (e.g., "Allow App ALB Security Group to connect to App Containers on port 8080"). Use NACLs as a broad safety net to block malicious IP ranges, enforce subnet-level quarantine during incidents, or block unauthorized protocols across the entire network boundary.
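The stateful versus stateless distinction is easiest to see in a toy model. The sketch below is a deliberate simplification and not how AWS implements either control: it matches on port only (ignoring CIDR ranges and protocols), and every class and function name is hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NaclRule:
    number: int        # lower numbers are evaluated first
    action: str        # "allow" or "deny"
    port_range: tuple  # (low, high), inclusive

def nacl_allows(rules, port):
    """First matching rule wins; anything unmatched hits the implicit deny."""
    for rule in sorted(rules, key=lambda r: r.number):
        low, high = rule.port_range
        if low <= port <= high:
            return rule.action == "allow"
    return False

def nacl_round_trip_ok(inbound, outbound, server_port, client_port):
    """Stateless: the request and its response are checked independently."""
    return nacl_allows(inbound, server_port) and nacl_allows(outbound, client_port)

def sg_round_trip_ok(inbound_allowed_ports, server_port):
    """Stateful: once the request is allowed in, the response is allowed out."""
    return server_port in inbound_allowed_ports

inbound = [NaclRule(100, "allow", (443, 443))]
ephemeral_out = [NaclRule(100, "allow", (1024, 65535))]

print(sg_round_trip_ok({443}, 443))                        # security group: response implied
print(nacl_round_trip_ok(inbound, [], 443, 50000))         # NACL: no outbound rule
print(nacl_round_trip_ok(inbound, ephemeral_out, 443, 50000))

# Incident quarantine: a low-numbered deny overrides the broader allow.
quarantine = [NaclRule(90, "deny", (443, 443)), NaclRule(100, "allow", (0, 65535))]
print(nacl_allows(quarantine, 443))
```

The quarantine case mirrors the incident workflow described above: because evaluation stops at the first match, inserting a deny rule with a lower number blocks traffic without touching the existing allow rules.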
Summary
Building a reliable VPC is a balancing act between security, resilience, and cost. By structuring your VPC into public, private, and isolated subnets, auditing NAT data processing fees, and implementing Gateway Endpoints for S3 and DynamoDB, you establish a high-performance network that protects your workloads while keeping cloud expenditures under control.