Debugging Memory Leaks in Production
2024
Tracing a 16GB heap down to 250MB.
The Symptom
Production alerts showed:
- Heap usage climbing from 250MB → 16GB over 4 days
- Response times degrading
- Container OOM kills
Investigation
Step 1: Heap Dumps
Captured multiple heap snapshots using the Chrome DevTools Protocol.
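One way to take such a snapshot from inside a running Node.js process is the built-in `node:inspector` module, which speaks the Chrome DevTools Protocol. The sketch below assumes a Node.js service and is not necessarily the tooling used in this investigation; `captureHeapSnapshot` and the output path are hypothetical names.

```javascript
const inspector = require('node:inspector');
const fs = require('node:fs');

// Hypothetical helper: streams a heap snapshot of the current process to a file.
function captureHeapSnapshot(filePath) {
  return new Promise((resolve, reject) => {
    const session = new inspector.Session();
    session.connect();
    const fd = fs.openSync(filePath, 'w');

    // The snapshot arrives in chunks; append each one to the file.
    session.on('HeapProfiler.addHeapSnapshotChunk', (message) => {
      fs.writeSync(fd, message.params.chunk);
    });

    session.post('HeapProfiler.takeHeapSnapshot', null, (err) => {
      session.disconnect();
      fs.closeSync(fd);
      if (err) reject(err);
      else resolve(filePath);
    });
  });
}

// Usage (the path is an assumption):
// captureHeapSnapshot(`/tmp/heap-${Date.now()}.heapsnapshot`).then(console.log);
```

The resulting `.heapsnapshot` files can be loaded into the Memory tab of Chrome DevTools and compared pairwise. Taking a snapshot pauses the process and needs extra memory, so it is worth triggering it deliberately rather than on a tight schedule.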
Step 2: Timeline Analysis
Found objects accumulating after batch job runs.
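A lightweight way to confirm this kind of correlation is to sample heap usage around each batch job run. The following is a minimal sketch using `process.memoryUsage()`; `createHeapSampler` and the commented-out `runBatchJob` are hypothetical names, not part of the original service.

```javascript
// Hypothetical helper: records heap usage at labelled points in time.
function createHeapSampler() {
  const samples = [];
  return {
    samples,
    // Record the current heap and RSS figures under a label.
    sample(label) {
      const { heapUsed, rss } = process.memoryUsage();
      const entry = {
        label,
        time: Date.now(),
        heapUsedMB: heapUsed / (1024 * 1024),
        rssMB: rss / (1024 * 1024),
      };
      samples.push(entry);
      return entry;
    },
    // Heap growth in MB between the first and the latest sample.
    growthMB() {
      if (samples.length < 2) return 0;
      return samples[samples.length - 1].heapUsedMB - samples[0].heapUsedMB;
    },
  };
}

// Usage: sample before and after a batch job run.
const sampler = createHeapSampler();
sampler.sample('before batch job');
// await runBatchJob(); // assumption: your job's entry point
sampler.sample('after batch job');
console.log(`heap growth: ${sampler.growthMB().toFixed(1)} MB`);
```

A single before/after pair is noisy because garbage collection timing varies. Growth that repeats across many runs and never comes back down is the signal to look for.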
Step 3: Root Cause
Discovered unclosed WebSocket connections in our real-time service:
// Bug: connection not closed on errors
socket.on('error', () => {
  // Missing: socket.close();
  console.log('error');
});
The Fix
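// Correct: always close the connection on error
socket.on('error', (err) => {
  console.error('socket error:', err);
  socket.close();
});
// Release per-connection resources once the socket has closed
socket.on('close', () => {
  cleanup();
});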
Results
| Metric | Before | After |
|---|---|---|
| Heap usage | climbing to 16GB over 4 days | steady at 250MB |
| Container memory | 16GB | 2GB |
| Response times | degrading | stable |
Key Takeaways
- Always clean up resources, especially async connections
- Monitor trends, not just absolute values
- Heap snapshots are invaluable for root cause analysis
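To illustrate the point about trends: a heap at 1GB may sit below an absolute alert threshold while still growing steadily toward an OOM kill. The sketch below fits a least-squares line through heap samples and flags sustained growth. The function names, the sample shape, and the 50 MB/hour default threshold are assumptions for illustration only.

```javascript
// Least-squares slope of heap usage in MB per hour.
// `samples` is an array of { timeMs, heapMB } objects (hypothetical shape).
function heapSlopeMBPerHour(samples) {
  if (samples.length < 2) return 0;
  const hours = samples.map((s) => s.timeMs / 3600000);
  const meanX = hours.reduce((a, b) => a + b, 0) / hours.length;
  const meanY = samples.reduce((a, s) => a + s.heapMB, 0) / samples.length;

  let numerator = 0;
  let denominator = 0;
  samples.forEach((s, i) => {
    numerator += (hours[i] - meanX) * (s.heapMB - meanY);
    denominator += (hours[i] - meanX) ** 2;
  });
  // All samples at the same instant: no trend can be computed.
  return denominator === 0 ? 0 : numerator / denominator;
}

// Flags a leak when the fitted growth rate exceeds the threshold.
function looksLikeLeak(samples, thresholdMBPerHour = 50) {
  return heapSlopeMBPerHour(samples) > thresholdMBPerHour;
}

// Usage: samples shaped like the incident (250MB rising to 16GB over 4 days).
const HOUR = 3600000;
const incident = [
  { timeMs: 0, heapMB: 250 },
  { timeMs: 48 * HOUR, heapMB: 8125 },
  { timeMs: 96 * HOUR, heapMB: 16000 },
];
console.log(looksLikeLeak(incident)); // true
```

In practice the samples would come from whatever metrics system is already in place, and the threshold would need tuning to the service's normal daily variation.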