Debugging Memory Leaks in Production

2024

How we traced a heap that had grown to 16GB and brought it back down to 250MB.

The Symptom

Production alerts showed:

  • Heap usage climbing from 250MB → 16GB over 4 days
  • Response times degrading
  • Container OOM kills

Investigation

Step 1: Heap Dumps

We captured multiple heap snapshots over time using the Chrome DevTools Protocol, so that later snapshots could be compared against earlier ones.
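For a Node.js service, one way to take such a snapshot from inside the process is the built-in inspector module, which speaks the Chrome DevTools Protocol. The following is a minimal sketch under that assumption; the function name captureHeapSnapshot and the output path are illustrative and not necessarily the tooling used in this investigation.

```javascript
const inspector = require('node:inspector');
const fs = require('node:fs');

// Illustrative helper (hypothetical name): streams a heap snapshot to filePath
// using the HeapProfiler domain of the Chrome DevTools Protocol.
function captureHeapSnapshot(filePath) {
  return new Promise((resolve, reject) => {
    const session = new inspector.Session();
    session.connect();
    const fd = fs.openSync(filePath, 'w');

    // The snapshot arrives in chunks; append each chunk to the file.
    session.on('HeapProfiler.addHeapSnapshotChunk', (message) => {
      fs.writeSync(fd, message.params.chunk);
    });

    session.post('HeapProfiler.takeHeapSnapshot', null, (err) => {
      fs.closeSync(fd);
      session.disconnect();
      if (err) reject(err);
      else resolve(filePath);
    });
  });
}
```

Calling captureHeapSnapshot('/tmp/before.heapsnapshot') before a batch run and again afterwards gives two files that can be loaded into the Memory panel of Chrome DevTools. Taking a snapshot pauses the process while the heap is walked, so on a large heap it is worth doing on a single instance rather than the whole fleet.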

Step 2: Timeline Analysis

Comparing the snapshots showed objects accumulating after each batch job run.
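The comparison itself can be done in the DevTools Comparison view, but the idea is simple enough to sketch: take per-constructor object counts from two snapshots and rank what grew. The counts below are made up for illustration and are not figures from this incident; diffObjectCounts is a hypothetical helper.

```javascript
// Rank constructors by how much their object count grew between two snapshots.
// `before` and `after` map constructor name -> object count.
function diffObjectCounts(before, after) {
  const growth = [];
  for (const [name, afterCount] of Object.entries(after)) {
    const beforeCount = before[name] ?? 0;
    const delta = afterCount - beforeCount;
    if (delta > 0) {
      growth.push({ name, before: beforeCount, after: afterCount, delta });
    }
  }
  // Largest growth first: these are the first candidates to inspect.
  return growth.sort((a, b) => b.delta - a.delta);
}

// Hypothetical counts, for illustration only.
const beforeBatch = { WebSocket: 120, Buffer: 900, Timeout: 40 };
const afterBatch = { WebSocket: 480, Buffer: 2100, Timeout: 40 };

const growth = diffObjectCounts(beforeBatch, afterBatch);
console.log(growth.map((g) => `${g.name}: +${g.delta}`).join(', '));
// prints "Buffer: +1200, WebSocket: +360"
```

Constructors whose counts rise after every run and never fall back are the ones worth tracing to their retainers.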

Step 3: Root Cause

We discovered WebSocket connections in our real-time service that were never closed after an error:

// Bug: the error handler only logs, so the connection is never closed
socket.on('error', () => {
  // Missing: socket.close();
  console.log('error');
});

The Fix

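// Fix: close the connection on error and release its resources on close
socket.on('error', (err) => {
  console.error('socket error:', err);
  socket.close();
});
socket.on('close', () => {
  // Runs for both normal and error shutdowns, so cleanup lives in one place
  cleanup();
});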
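The post does not show what cleanup() does. As a rough sketch of the pattern, assume the service keeps per-connection state in a registry: the leak is then every entry that is never deleted. The registry, the track function, and the FakeSocket stand-in below are all hypothetical and exist only to make the example runnable without a network.

```javascript
const { EventEmitter } = require('node:events');

// Hypothetical registry of per-connection state. Entries that are never
// deleted keep their socket (and everything it references) alive.
const connections = new Map();

function track(id, socket) {
  connections.set(id, { socket, openedAt: Date.now() });

  // On error, only close. Releasing state happens in the 'close' handler.
  socket.on('error', (err) => {
    console.error(`socket ${id} error:`, err.message);
    socket.close();
  });

  // Single place where the registry entry is released.
  socket.on('close', () => {
    connections.delete(id);
  });
}

// Stand-in for a real WebSocket: close() emits 'close' exactly once.
class FakeSocket extends EventEmitter {
  constructor() {
    super();
    this.closed = false;
  }
  close() {
    if (this.closed) return;
    this.closed = true;
    this.emit('close');
  }
}

const socket = new FakeSocket();
track('client-1', socket);
socket.emit('error', new Error('connection reset'));
console.log(connections.size); // prints 0
```

Keeping the release logic in the close handler rather than the error handler means a connection that ends normally and one that ends on an error follow the same path.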

Results

Metric                Before      After
Heap (steady state)   16GB        250MB
Container memory      16GB        2GB
Response times        degrading   stable

Key Takeaways

  1. Always clean up resources - especially async connections
  2. Monitor trends - not just absolute values
  3. Heap snapshots - invaluable for root cause analysis
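To make the second takeaway concrete, here is a minimal sketch of trend-based alerting: fit a least-squares line through recent heap samples and alert on the slope rather than on the current value. The sample values and the threshold are hypothetical; in a Node.js process the readings could come from process.memoryUsage().heapUsed.

```javascript
// Least-squares slope of heap usage over time, in MB per hour.
// `samples` is an array of { hours, heapMb } points.
function heapGrowthPerHour(samples) {
  const n = samples.length;
  if (n < 2) return 0;

  const meanT = samples.reduce((sum, p) => sum + p.hours, 0) / n;
  const meanH = samples.reduce((sum, p) => sum + p.heapMb, 0) / n;

  let num = 0;
  let den = 0;
  for (const { hours, heapMb } of samples) {
    num += (hours - meanT) * (heapMb - meanH);
    den += (hours - meanT) ** 2;
  }
  return den === 0 ? 0 : num / den;
}

// Hypothetical readings, for illustration only.
const samples = [
  { hours: 0, heapMb: 250 },
  { hours: 1, heapMb: 400 },
  { hours: 2, heapMb: 550 },
  { hours: 3, heapMb: 700 },
];

const GROWTH_ALERT_MB_PER_HOUR = 50; // hypothetical threshold
const slope = heapGrowthPerHour(samples);
if (slope > GROWTH_ALERT_MB_PER_HOUR) {
  console.warn(`heap growing at ${slope} MB/hour`);
}
```

An alert on absolute heap size only fires once the process is already close to its limit, while a slope alert can fire on the first day of a multi-day climb.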