Memory Leaks in Production: Detection, Diagnosis, and Prevention

Philip Rehberger · Apr 5, 2026 · 7 min read

Memory leaks manifest as gradual performance degradation and unexplained restarts. They're often subtle and hard to reproduce locally. Learn how to detect, diagnose, and fix them in production.

Memory leaks have a distinctive signature in production: everything works fine after a deployment, then gradually slows down over hours or days, until the process is restarted and the cycle begins again. The restart masks the symptom, but the leak continues.

In garbage-collected languages, leaks typically mean objects that should be freed are being held by references that aren't being cleaned up. In languages with manual memory management, it's often a missing free call or improper resource cleanup. In both cases, the process grows indefinitely until it crashes or becomes too slow to function.

Recognizing Memory Leaks

Before diagnosing, recognize the pattern:

Leak indicators:
  - Memory usage grows monotonically over time
  - Restarts temporarily fix performance issues
  - Memory usage after GC still trends upward
  - Process eventually OOMKilled by the OS or container runtime
  - Response times degrade over a process's lifetime

Not necessarily a leak:
  - Memory usage increases with traffic (expected — caches fill)
  - Memory spikes during peak hours, returns to baseline at night
  - High memory after a large batch job that caches results

The key distinction: a leak grows continuously without bound. Normal memory growth plateaus.

Monitoring for Memory Leaks

Process-Level Memory Tracking

// Node.js: track memory over time
const MEMORY_CHECK_INTERVAL = 30 * 1000;  // 30 seconds
const MEMORY_ALERT_THRESHOLD = 500 * 1024 * 1024;  // 500MB

setInterval(() => {
  const usage = process.memoryUsage();

  const metrics = {
    rss:              Math.round(usage.rss / 1024 / 1024),            // Resident Set Size (total)
    heapTotal:        Math.round(usage.heapTotal / 1024 / 1024),     // Heap allocated
    heapUsed:         Math.round(usage.heapUsed / 1024 / 1024),      // Heap in use
    external:         Math.round(usage.external / 1024 / 1024),      // C++ objects
    arrayBuffers:     Math.round(usage.arrayBuffers / 1024 / 1024),  // ArrayBuffers
    uptimeMinutes:    Math.round(process.uptime() / 60)
  };

  logger.info('memory_usage', metrics);

  if (usage.heapUsed > MEMORY_ALERT_THRESHOLD) {
    logger.warn('high_memory_usage', {
      heapUsedMB: metrics.heapUsed,
      threshold: MEMORY_ALERT_THRESHOLD / 1024 / 1024
    });
  }
}, MEMORY_CHECK_INTERVAL);

Alerting on Growth Rate

Absolute memory usage isn't as informative as the growth rate:

# Python: track memory with growth rate detection
import psutil
import time
from collections import deque

class MemoryLeakDetector:
    def __init__(self, window_minutes=60):
        self.window = deque(maxlen=window_minutes * 2)  # One sample per 30s check = 2/minute
        self.process = psutil.Process()

    def check(self):
        current_mb = self.process.memory_info().rss / 1024 / 1024
        self.window.append((time.time(), current_mb))

        if len(self.window) < 10:
            return  # Not enough data

        # Calculate growth rate over the window
        oldest_time, oldest_mb = self.window[0]
        newest_time, newest_mb = self.window[-1]
        elapsed_minutes = (newest_time - oldest_time) / 60

        if elapsed_minutes > 0:
            growth_rate_per_hour = (newest_mb - oldest_mb) / elapsed_minutes * 60

            if growth_rate_per_hour > 50:  # Growing > 50MB/hour
                # alert() is a placeholder for your paging/alerting hook
                alert(f'Potential memory leak: +{growth_rate_per_hour:.1f}MB/hour')
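
The two-sample growth rate above depends entirely on the oldest and newest readings, so a single noisy sample can skew it. A least-squares slope over the whole window is more robust. A sketch of that idea on synthetic samples, with `growth_rate_mb_per_hour` as a hypothetical helper name (no psutil required):

```python
def growth_rate_mb_per_hour(samples):
    """Least-squares slope of (time_sec, rss_mb) samples, converted to MB/hour."""
    n = len(samples)
    ts = [t for t, _ in samples]
    ms = [m for _, m in samples]
    t_mean = sum(ts) / n
    m_mean = sum(ms) / n
    num = sum((t - t_mean) * (m - m_mean) for t, m in samples)
    den = sum((t - t_mean) ** 2 for t in ts)
    return (num / den) * 3600 if den else 0.0

# Synthetic series: a process gaining 0.25MB per 30-second sample (= 30MB/hour)
leaky = [(i * 30, 200 + i * 0.25) for i in range(120)]
print(round(growth_rate_mb_per_hour(leaky)))  # 30
```

A flat series returns a slope of 0, so a cache that fills and plateaus won't trip the alert the way it can with a two-point estimate.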

Container-Level Monitoring

# Kubernetes: set memory limits and track OOMKills
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: api
          resources:
            requests:
              memory: "256Mi"
            limits:
              memory: "512Mi"  # Container killed if it exceeds this

# Check for OOMKill events
kubectl get events --field-selector reason=OOMKilling

# Check container restart count (repeated restarts = likely leak)
kubectl get pods -o json | jq '
  .items[]
  | {
      name: .metadata.name,
      restarts: .status.containerStatuses[0].restartCount,
      lastState: .status.containerStatuses[0].lastState
    }
  | select(.restarts > 5)'

Diagnosing Node.js Memory Leaks

Heap Snapshots

A heap snapshot captures all objects in memory and their references at a point in time. Compare two snapshots (before and after a suspected leak) to find what's growing:

// Take heap snapshot via HTTP endpoint (internal only)
const v8 = require('v8');
const fs = require('fs');

app.get('/internal/heap-snapshot', (req, res) => {
  // Require internal auth token
  if (req.headers['x-internal-token'] !== process.env.INTERNAL_TOKEN) {
    return res.status(403).end();
  }

  const filename = `/tmp/heap-${Date.now()}.heapsnapshot`;
  // writeHeapSnapshot blocks the event loop while writing, then returns the file path
  const snapshotPath = v8.writeHeapSnapshot(filename);

  res.json({ file: snapshotPath, size: fs.statSync(snapshotPath).size });
});

# From a running production pod
kubectl exec -it api-pod-abc123 -- node -e "
const v8 = require('v8');
const f = v8.writeHeapSnapshot('/tmp/heap1.heapsnapshot');
console.log('Written to:', f);
"

# Download the snapshot
kubectl cp api-pod-abc123:/tmp/heap1.heapsnapshot ./heap1.heapsnapshot

# Open in Chrome DevTools:
# DevTools → Memory → Load → select heapsnapshot file

Take a snapshot, wait for the leak to grow, take another. Use the "Comparison" view in Chrome DevTools to see what objects grew.

Common Node.js Leak Patterns

Unbounded in-memory caches:

// Leak: cache grows without limit
const cache = new Map();

function getUser(userId) {
  if (!cache.has(userId)) {
    cache.set(userId, db.query('SELECT * FROM users WHERE id = ?', [userId]));
  }
  return cache.get(userId);
}
// After 1 million unique users, cache holds 1 million objects

// Fix: use an LRU cache with a size limit (lru-cache v10+ API)
const { LRUCache } = require('lru-cache');
const cache = new LRUCache({
  max: 1000,        // At most 1000 entries; least-recently-used evicted first
  ttl: 1000 * 300,  // 5 minute TTL per entry
});

Event listener accumulation:

// Leak: listener added on every request but never removed
app.get('/stream', (req, res) => {
  // BUG: new listener added for every incoming request
  emitter.on('data', (data) => {
    res.write(data);
  });

  // When request ends, listener is never cleaned up
  // After 1000 requests: 1000 listeners firing for every event
});

// Fix: clean up listeners when connection closes
app.get('/stream', (req, res) => {
  const handler = (data) => res.write(data);
  emitter.on('data', handler);

  // Critical: remove listener when client disconnects
  req.on('close', () => {
    emitter.off('data', handler);
  });
});

Closures capturing large objects:

// Leak: closure captures entire large object
function processLargeDataset(data) {
  const summary = { count: data.length, total: data.reduce((s, d) => s + d.value, 0) };

  // BUG: timer closure captures 'data' (potentially hundreds of MB)
  setTimeout(() => {
    console.log('Processing complete for dataset size:', data.length);
    // 'data' is kept alive until this timer fires
  }, 60000);

  return summary;
}

// Fix: only capture what you need
function processLargeDataset(data) {
  const summary = { count: data.length, total: data.reduce((s, d) => s + d.value, 0) };
  const dataLength = data.length;  // Capture only the primitive

  setTimeout(() => {
    console.log('Processing complete for dataset size:', dataLength);
    // 'data' is now eligible for GC
  }, 60000);

  return summary;
}

Diagnosing Python Memory Leaks

tracemalloc

Python's built-in memory tracing:

import tracemalloc
import linecache

def take_snapshot(label):
    """Print the top 10 memory allocations."""
    snapshot = tracemalloc.take_snapshot()
    stats = snapshot.statistics('lineno')

    print(f"\n=== Memory snapshot: {label} ===")
    for stat in stats[:10]:
        frame = stat.traceback[0]
        filename = frame.filename
        lineno = frame.lineno
        line = linecache.getline(filename, lineno).strip()
        print(f"{stat.size / 1024:.1f} KB: {filename}:{lineno}: {line}")

# Usage:
tracemalloc.start()
handle_requests_for_a_while()
take_snapshot('after_load_test')
tracemalloc.stop()
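
The two-snapshot comparison workflow described for Node.js works in Python too: tracemalloc snapshots can be diffed with Snapshot.compare_to, which sorts allocations by how much they grew between the two points. A minimal self-contained sketch:

```python
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

leaked = []
for i in range(10_000):
    leaked.append(f"payload-{i}" * 10)  # simulate ~1.5MB of leaked strings

after = tracemalloc.take_snapshot()

# Statistics sorted by growth between the two snapshots, biggest first
for stat in after.compare_to(before, 'lineno')[:5]:
    print(stat)
```

The top entry points at the line doing the accumulating, with a `size=... (+...)` delta showing how much it grew.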

memory_profiler for Function-Level Analysis

from memory_profiler import profile

@profile
def process_orders(order_ids: list[int]):
    results = []
    for order_id in order_ids:
        order = Order.objects.select_related('user', 'items').get(id=order_id)
        results.append(transform_order(order))
    return results

# Run with:
# python -m memory_profiler my_script.py

# Output:
# Line #  Mem usage   Increment   Line Contents
# ==============================================
#  5      45.2 MiB    45.2 MiB   def process_orders(order_ids):
#  6      45.2 MiB    0.0 MiB     results = []
#  7      45.2 MiB    0.0 MiB     for order_id in order_ids:
#  8     182.7 MiB  137.5 MiB       order = Order.objects...get(id=order_id)
# ← Memory grows here each iteration, not released

Django ORM Leak: Iterator vs All

# Leak: loads all 1M orders into memory at once
def export_orders():
    orders = Order.objects.filter(status='completed').all()
    for order in orders:  # All 1M orders already in RAM
        write_to_csv(order)

# Fix: use iterator() for memory-efficient streaming
def export_orders():
    orders = Order.objects.filter(status='completed').iterator(chunk_size=1000)
    for order in orders:  # Loads 1000 at a time, releases after processing
        write_to_csv(order)
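
Outside Django, the same fix is a paged generator: fetch one bounded chunk at a time so only that chunk is ever resident. A sketch, where `fetch_page` is a hypothetical stand-in for any LIMIT/OFFSET query:

```python
def iter_chunks(fetch_page, chunk_size=1000):
    """Yield rows one bounded page at a time, so only one page is resident."""
    offset = 0
    while True:
        page = fetch_page(offset, chunk_size)
        if not page:
            return
        yield from page
        offset += chunk_size

# fetch_page backed by a list here; in practice, a LIMIT/OFFSET database query
data = list(range(3500))

def fetch_page(offset, limit):
    return data[offset:offset + limit]

print(sum(1 for _ in iter_chunks(fetch_page)))  # 3500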

Diagnosing Go Memory Leaks

// Go: expose pprof endpoint for production profiling
import (
    "net/http"
    _ "net/http/pprof"  // registers /debug/pprof/* handlers on DefaultServeMux
    "runtime/debug"
)

func main() {
    // Enable pprof on an internal-only port
    go func() {
        http.ListenAndServe("127.0.0.1:6060", nil)
    }()

    // Tune GC aggressiveness (the same knob as the GOGC env var)
    debug.SetGCPercent(20)  // Trigger GC more aggressively
    // Default is 100 (GC when the heap grows 100% over the live set)
    // Lower = more frequent GC, less memory; higher = less frequent GC, more memory

    startApplication()
}

# Capture heap profile
go tool pprof http://localhost:6060/debug/pprof/heap

(pprof) top20          # Show top allocators by memory
(pprof) list myFunc    # Annotated source: bytes allocated per line
(pprof) web            # Open flame graph in browser

# Compare two heap profiles to find what's growing
go tool pprof -base heap1.pb.gz heap2.pb.gz
(pprof) top20  # Shows allocations that grew between snapshots

Common Go leak: goroutine leak

// Leak: goroutine started, never exits
func handleRequest(conn net.Conn) {
    go func() {
        // This goroutine blocks forever if channel is never closed
        data := <-dataChannel
        process(data)
    }()
    // If dataChannel is never closed, goroutine accumulates
}

// Check the goroutine count (a steadily growing count means a goroutine leak):
//   curl "http://localhost:6060/debug/pprof/goroutine?debug=1"

// Fix: use context cancellation
func handleRequest(ctx context.Context, conn net.Conn) {
    go func() {
        select {
        case data := <-dataChannel:
            process(data)
        case <-ctx.Done():
            return  // Goroutine exits when request context is cancelled
        }
    }()
}

Prevention Practices

Code Review Checklist

Review these patterns for memory issues:

□ Caches: does every cache have a size limit and/or TTL?
□ Event listeners: is every addEventListener paired with removeEventListener?
□ Database results: is the result set size bounded? (LIMIT, pagination)
□ Background goroutines/threads: do they have exit conditions?
□ Long-lived objects: do they hold references to large data?
□ Connection pools: are connections returned after use?
□ File handles: are files/sockets closed in finally/defer blocks?
□ Timers: are recurring timers cleared when components unmount?
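
For the first checklist item, Python's standard library already provides a bounded cache: functools.lru_cache caps the entry count and evicts least-recently-used entries automatically. A quick sketch:

```python
from functools import lru_cache

@lru_cache(maxsize=1024)  # Bounded: least-recently-used entries evicted past 1024
def expensive_lookup(key: int) -> str:
    return f"value-{key}"  # Stand-in for a DB or API call

for i in range(5000):  # 5000 distinct keys hit the cache...
    expensive_lookup(i)

print(expensive_lookup.cache_info().currsize)  # 1024 — never grows past maxsize
```

Note that lru_cache has no TTL; for entries that can go stale, a TTL-aware cache (like the lru-cache example above in Node.js, or cachetools.TTLCache in Python) is the safer default.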

Load Testing for Memory

Run load tests long enough to surface leaks:

# k6: sustained load test to surface memory leaks
k6 run --duration 1h --vus 50 \
    --out influxdb=http://influxdb:8086/k6 \
    load-test.js

# Monitor memory during test:
# kubectl top pods -w  # Watch memory over the hour
# Alert if any pod grows > 200MB above baseline

A 30-minute load test often reveals leaks that a 5-minute smoke test misses. If memory is still climbing at the end of the hour-long run, you have a leak.

Memory leaks punish you quietly and then suddenly. The process that's been slowly bloating for three days crashes at 3 AM on a Friday. Building memory tracking, alerting, and regular heap analysis into your workflow is the difference between finding leaks in development and finding them during incidents.
