Critical Lambda Crash Recovery

Executive Summary

A production-critical serverless function began terminating silently under secure network constraints, with no logs, no errors, and no telemetry. When standard diagnostics failed, I led a full forensic teardown, reengineered the runtime environment, and isolated the fault—restoring system stability without replatforming or missing release targets.

Situation

A cloud-native function within a secure network segment failed mid-execution during outbound HTTPS calls. No exceptions were thrown, no logs recorded, and no traffic was emitted. Despite the green status from platform observability, the workload was silently terminating—mid-flight.

Challenge

Built-in metrics showed only "successful" invocations. Native networking modules like https and net failed silently. With vendor-side support unable to access runtime diagnostics, the issue appeared untraceable. Platform-level causes were suspected but unconfirmed.

Actions Taken

- Rebuilt deployment artefacts in a containerised Linux environment to match system-level dependencies.
- Installed a clean runtime matching production node versions to ensure consistency.
- Created isolated test functions using only native modules to eliminate framework interference.
- Conducted outbound packet tracing to verify zero external traffic initiation.
- Deployed control tests outside the secure segment to confirm environmental variance.
- Removed all native module dependencies to prevent binary incompatibility.
- Hardened deployment via clean production-only builds with locked package states.

Results

Confirmed that the function was terminating during socket initialisation due to runtime instability in isolated network conditions. Refactored networking logic and rebuilt the deployment pipeline, avoiding full migration or downtime. This preserved uptime, maintained trust with stakeholders, and prevented a costly platform change.

For full technical detail, read the full investigation report.

Reflections

Diagnosing silent runtime failure requires discipline, tooling independence, and calm under pressure. This case reinforced the value of isolated reproducibility and vendor-agnostic validation when system telemetry fails to deliver truth.

← Back to Recent Work Privacy | Terms and Conditions