
VPC Networking Resilience

Executive Summary

After diagnosing sporadic connectivity failures within a virtual private cloud (VPC), I led a full architectural review and remediation effort targeting egress fragility. By reworking the network pathing and introducing true availability zone (AZ) redundancy, we significantly hardened the platform's external communications and reduced silent failure exposure.

Situation

Production workloads experienced transient outbound failures with no clear alarm triggers. Investigation revealed a single-path egress dependency on a NAT gateway in one AZ, a hidden single point of failure exacerbated by ENI degradation and NAT gateway saturation.
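A dependency of this kind can be surfaced with a simple topology check. The following is a minimal sketch, not the tooling used in this project: the `Subnet` record, the `nat_gateway_az` mapping, and all IDs are hypothetical, and in practice the inputs would be exported from the cloud provider's API rather than typed by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subnet:
    subnet_id: str
    az: str
    default_route_target: str  # ID of the NAT gateway behind 0.0.0.0/0


def cross_az_egress(subnets, nat_gateway_az):
    """Return subnets whose default route uses a NAT gateway outside their own AZ.

    Each finding is (subnet_id, subnet_az, nat_gateway_id, nat_gateway_az).
    An unknown NAT gateway ID is reported with an AZ of None.
    """
    findings = []
    for s in subnets:
        nat_az = nat_gateway_az.get(s.default_route_target)
        if nat_az != s.az:
            findings.append((s.subnet_id, s.az, s.default_route_target, nat_az))
    return findings


def subnets_losing_egress(subnets, nat_gateway_az, failed_az):
    """Return IDs of subnets outside failed_az that would still lose egress."""
    return [
        s.subnet_id
        for s in subnets
        if s.az != failed_az
        and nat_gateway_az.get(s.default_route_target) == failed_az
    ]


# Hypothetical topology: three AZs sharing one NAT gateway in az-1.
example_subnets = [
    Subnet("subnet-a", "az-1", "nat-1"),
    Subnet("subnet-b", "az-2", "nat-1"),
    Subnet("subnet-c", "az-3", "nat-1"),
]
example_nats = {"nat-1": "az-1"}

if __name__ == "__main__":
    print(cross_az_egress(example_subnets, example_nats))
    print(subnets_losing_egress(example_subnets, example_nats, "az-1"))
```

In this example topology, an outage in az-1 removes egress for subnets in az-2 and az-3 even though those zones are otherwise healthy, which is the pattern described above.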

Challenge

While the infrastructure was distributed across multiple AZs, routing assumptions created traffic concentration and fragile fallback logic. Failures manifested in ways that evaded CloudWatch triggers and only surfaced through external telemetry. Any fix had to be minimally disruptive and fully backward compatible.

Actions Taken

- Audited ENI metrics, NAT gateway saturation, and AZ failover readiness.
- Deployed redundant NAT gateways per AZ to eliminate single-egress risks.
- Rebuilt routing tables to enable automatic fallback between AZs without packet loss.
- Introduced detailed logging and flow metrics with custom alarms on retry rates and blackhole patterns.
- Simulated subnet isolation to confirm deterministic failover and flow continuity.
- Refactored all networking IaC to codify AZ isolation and egress parity.
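The retry-rate alarms mentioned above can be illustrated with a small evaluator. This is a sketch under stated assumptions rather than the alarm configuration used here: the class name, the 5% threshold, and the three-window requirement are all hypothetical, and the per-window counts are assumed to come from flow metrics aggregated elsewhere.

```python
from collections import deque


class RetryRateAlarm:
    """Fires when the retry rate exceeds a threshold for N consecutive windows."""

    def __init__(self, threshold=0.05, windows_to_alarm=3):
        # Both defaults are illustrative values, not figures from the project.
        self.threshold = threshold
        self.recent = deque(maxlen=windows_to_alarm)

    def observe(self, attempts, retries):
        """Record one metrics window and return True if the alarm should fire."""
        rate = retries / attempts if attempts else 0.0
        self.recent.append(rate > self.threshold)
        # Fire only when the buffer is full and every window breached.
        return len(self.recent) == self.recent.maxlen and all(self.recent)


if __name__ == "__main__":
    alarm = RetryRateAlarm()
    windows = [(1000, 10), (1000, 80), (1000, 90), (1000, 100)]
    for attempts, retries in windows:
        print(attempts, retries, alarm.observe(attempts, retries))
```

Requiring several consecutive breaching windows trades a little detection latency for fewer false alarms from one-off retry spikes, which matters when the underlying failures are intermittent.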

Results

Egress-related packet loss dropped to near-zero levels. Subnet- or AZ-specific failures triggered automatic routing realignment in under a second. No user-facing impact was recorded during subsequent outage simulations. The network layer now forms a resilient foundation for broader multi-region expansion.

Reflections

Real availability isn't guaranteed by distributing resources; it's earned by testing how failures propagate. This project highlighted the critical role of invisible dependencies like ENIs and NAT gateways, and the need to validate assumptions through active disruption. Diagrams don't deliver resilience. Architecture under pressure does.
