Microservices introduce distributed system complexity where failures are inevitable. Resilience patterns help your system survive partial failures gracefully. Here's how to implement patterns that keep your services running when things go wrong.
Why Resilience Matters
Failure Modes
In distributed systems, partial failures are the norm rather than the exception. Without proper resilience patterns, a single slow or failing service can cascade through your entire system, bringing everything down. Understanding common failure modes helps you design appropriate defenses.
The following diagram illustrates how failures propagate differently with and without resilience patterns. You can see how a single slow service can block all dependent operations, versus how timeouts and fallbacks keep the system functional.
Distributed system failures:
├── Network partitions
├── Service crashes
├── Slow responses (worse than failures)
├── Resource exhaustion
├── Cascading failures
└── Partial degradation
Without resilience:
Order Service → User Service (slow)
              → Payment Service (waiting)
              → Inventory Service (waiting)
              → All services stuck, system down
With resilience:
Order Service → User Service (timeout, fallback)
              → Payment Service (continues)
              → Inventory Service (continues)
              → Degraded but functional
Slow responses are often worse than outright failures because they tie up resources while callers wait. A fast failure lets you recover quickly, while a slow response keeps threads and connections occupied indefinitely.
Circuit Breaker
Implementation
The circuit breaker pattern prevents cascading failures by failing fast when a downstream service is unhealthy. Like an electrical circuit breaker, it opens when too many failures occur, allowing the failing service time to recover while protecting your system from resource exhaustion.
This implementation tracks failures and successes, transitioning between three states: Closed (normal operation), Open (failing fast), and HalfOpen (testing if the service has recovered). You can customize the thresholds and timeout based on your service's characteristics.
class CircuitBreaker
{
private CircuitState $state = CircuitState::Closed;
private int $failureCount = 0;
private int $successCount = 0;
private ?DateTimeImmutable $lastFailureTime = null;

public function __construct(
private string $name,
private int $failureThreshold,
private int $successThreshold,
private int $timeout, // seconds the circuit stays open before a reset attempt
) {
}
public function call(callable $operation, ?callable $fallback = null): mixed
{
if ($this->state === CircuitState::Open) {
if ($this->shouldAttemptReset()) {
$this->state = CircuitState::HalfOpen;
} else {
return $this->handleOpen($fallback);
}
}
try {
$result = $operation();
$this->recordSuccess();
return $result;
} catch (Throwable $e) {
$this->recordFailure();
return $this->handleFailure($e, $fallback);
}
}
private function handleFailure(Throwable $e, ?callable $fallback): mixed
{
if ($fallback !== null) {
return $fallback();
}
throw $e;
}
private function recordSuccess(): void
{
$this->failureCount = 0;
if ($this->state === CircuitState::HalfOpen) {
$this->successCount++;
if ($this->successCount >= $this->successThreshold) {
$this->state = CircuitState::Closed;
$this->successCount = 0;
}
}
}
private function recordFailure(): void
{
$this->failureCount++;
$this->lastFailureTime = new DateTimeImmutable();
$this->successCount = 0;
// Any failure while HalfOpen means the service has not recovered: reopen immediately.
if ($this->state === CircuitState::HalfOpen
|| $this->failureCount >= $this->failureThreshold) {
$this->state = CircuitState::Open;
}
}
private function shouldAttemptReset(): bool
{
$timeSinceFailure = time() - $this->lastFailureTime->getTimestamp();
return $timeSinceFailure >= $this->timeout;
}
private function handleOpen(?callable $fallback = null): mixed
{
if ($fallback) {
return $fallback();
}
throw new CircuitOpenException("Circuit breaker {$this->name} is open");
}
}
enum CircuitState
{
case Closed; // Normal operation
case Open; // Failing fast
case HalfOpen; // Testing recovery
}
The three states manage the lifecycle of failure handling. Closed means normal operation, Open means failing fast without calling the downstream service, and HalfOpen allows a limited number of test requests to check if the service has recovered.
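The transition rules can be summarized in isolation. This standalone sketch encodes them as a pure function; the event names are illustrative and not part of the CircuitBreaker API above:

```php
<?php

// The circuit breaker lifecycle as a pure transition function. Event names
// are illustrative; the CircuitBreaker class embeds the same rules inside
// recordSuccess() and recordFailure().
function nextState(string $state, string $event): string
{
    return match ([$state, $event]) {
        ['closed', 'failure_threshold_reached'] => 'open',
        ['open', 'timeout_elapsed'] => 'half_open',
        ['half_open', 'success_threshold_reached'] => 'closed',
        ['half_open', 'failure'] => 'open',
        default => $state, // all other events leave the state unchanged
    };
}
```

Note that a failure while HalfOpen reopens the circuit immediately: the test request has already shown the service is still unhealthy.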
Usage
Integrating a circuit breaker into a service call requires defining failure and success thresholds, a timeout period, and an optional fallback behavior. This example shows how a payment service might use a circuit breaker to protect against payment gateway outages.
class PaymentService
{
private CircuitBreaker $circuitBreaker;

// The gateway client used below is injected; its type name is illustrative.
public function __construct(private PaymentGateway $gateway)
{
$this->circuitBreaker = new CircuitBreaker(
name: 'payment-gateway',
failureThreshold: 5,
successThreshold: 3,
timeout: 30 // seconds before attempting recovery
);
}
public function processPayment(Order $order): PaymentResult
{
return $this->circuitBreaker->call(
operation: fn() => $this->gateway->charge($order->total),
fallback: fn() => $this->queueForRetry($order)
);
}
private function queueForRetry(Order $order): PaymentResult
{
Queue::push(new ProcessPaymentJob($order));
return new PaymentResult(
status: PaymentStatus::Pending,
message: 'Payment queued for processing'
);
}
}
The fallback function provides graceful degradation. Instead of failing the entire order, this implementation queues the payment for later processing and informs the user that their payment is pending.
Retry Pattern
Exponential Backoff
Retries help recover from transient failures, but naive retry strategies can overwhelm an already struggling service. Exponential backoff with jitter spreads out retry attempts, giving the downstream service time to recover while avoiding retry storms.
This retry handler increases the delay between attempts exponentially and adds random jitter to prevent synchronized retries from multiple clients. You can configure which exception types should trigger retries versus immediate failure.
class RetryHandler
{
public function execute(
callable $operation,
int $maxRetries = 3,
int $baseDelayMs = 100,
float $multiplier = 2.0,
array $retryableExceptions = [TransientException::class]
): mixed {
$attempt = 0;
$lastException = null;
while ($attempt < $maxRetries) {
try {
return $operation();
} catch (Throwable $e) {
if (!$this->isRetryable($e, $retryableExceptions)) {
throw $e;
}
$lastException = $e;
$attempt++;
if ($attempt < $maxRetries) {
$delay = $this->calculateDelay($attempt, $baseDelayMs, $multiplier);
usleep($delay * 1000);
}
}
}
throw new MaxRetriesExceededException(
"Max retries ({$maxRetries}) exceeded",
previous: $lastException
);
}
private function calculateDelay(int $attempt, int $baseDelayMs, float $multiplier): int
{
// Exponential backoff with jitter
$exponentialDelay = $baseDelayMs * pow($multiplier, $attempt - 1);
$jitter = random_int(0, (int)($exponentialDelay * 0.1));
return (int)$exponentialDelay + $jitter;
}
private function isRetryable(Throwable $e, array $retryableExceptions): bool
{
foreach ($retryableExceptions as $retryable) {
if ($e instanceof $retryable) {
return true;
}
}
return false;
}
}
// Usage
$result = $retry->execute(
operation: fn() => $this->httpClient->get('/api/data'),
maxRetries: 3,
retryableExceptions: [
ConnectionException::class,
TimeoutException::class,
RateLimitException::class,
]
);
The jitter prevents synchronized retries from multiple clients that failed at the same time. Without jitter, all failed requests would retry at exactly the same intervals, creating repeated load spikes.
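The growth of those intervals is easy to check in isolation. This sketch mirrors calculateDelay with the handler's defaults; with a 100ms base and multiplier 2.0, attempts wait roughly 100ms, 200ms, then 400ms, each plus up to 10% jitter:

```php
<?php

// Mirrors RetryHandler::calculateDelay with its default parameters.
function backoffDelayMs(int $attempt, int $baseDelayMs = 100, float $multiplier = 2.0): int
{
    $exponential = $baseDelayMs * pow($multiplier, $attempt - 1);
    $jitter = random_int(0, (int)($exponential * 0.1)); // up to 10% extra
    return (int)$exponential + $jitter;
}

foreach ([1, 2, 3] as $attempt) {
    echo 'attempt ', $attempt, ': ', backoffDelayMs($attempt), "ms\n";
}
```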
Idempotency for Safe Retries
Retries are only safe if the operation is idempotent, meaning it can be executed multiple times with the same result. For non-idempotent operations like payments, you need to track whether an operation has already been processed using an idempotency key.
This pattern uses a cache-based lock to ensure that concurrent requests with the same idempotency key don't result in duplicate processing. The double-check after acquiring the lock handles race conditions.
class IdempotentPaymentProcessor
{
public function process(string $idempotencyKey, PaymentRequest $request): PaymentResult
{
// Check if already processed
$existing = $this->getExistingResult($idempotencyKey);
if ($existing) {
return $existing;
}
// Acquire lock to prevent concurrent processing
$lock = Cache::lock("payment:{$idempotencyKey}", 30);
if (!$lock->get()) {
// Another process is handling this request
return $this->waitForResult($idempotencyKey);
}
try {
// Double-check after acquiring lock
$existing = $this->getExistingResult($idempotencyKey);
if ($existing) {
return $existing;
}
$result = $this->processPayment($request);
$this->storeResult($idempotencyKey, $result);
return $result;
} finally {
$lock->release();
}
}
private function storeResult(string $key, PaymentResult $result): void
{
Cache::put("payment_result:{$key}", $result, now()->addHours(24));
}
}
The double-check pattern after acquiring the lock handles race conditions where another process completed the operation while we were waiting for the lock. Store idempotency keys with results long enough for all possible retries to complete.
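One simple way to obtain such a key is to derive it deterministically from the logical operation, so every retry of the same payment presents the same key. A minimal sketch, with illustrative parameters rather than the article's PaymentRequest fields:

```php
<?php

// Derive the idempotency key from stable attributes of the operation, so
// client retries of the same logical payment reuse the same key.
function idempotencyKey(string $orderId, int $amountCents, string $currency): string
{
    return hash('sha256', "payment:{$orderId}:{$amountCents}:{$currency}");
}

// Retrying the identical request yields the identical key, so the processor
// above returns the stored result instead of charging twice.
$first = idempotencyKey('order-1001', 4999, 'USD');
$retry = idempotencyKey('order-1001', 4999, 'USD');
```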
Timeout Pattern
Configurable Timeouts
Every external call should have a timeout to prevent indefinite waiting. Different services have different latency profiles, so configure timeouts specifically for each dependency rather than using a global default.
This service client maintains a mapping of service-specific timeout configurations. You can tune connection and read timeouts independently based on each service's expected behavior.
class HttpClientWithTimeout
{
public function request(string $method, string $url, array $options = []): Response
{
$options = array_merge([
'connect_timeout' => 5, // Seconds allowed to establish the connection
'timeout' => 30, // Seconds allowed for the entire request
'read_timeout' => 30, // Seconds between reads when streaming the body
], $options);
return $this->client->request($method, $url, $options);
}
}
// Service-specific timeouts
class ServiceClient
{
private array $timeouts = [
'user-service' => ['connect' => 2, 'read' => 5],
'payment-service' => ['connect' => 5, 'read' => 30],
'inventory-service' => ['connect' => 2, 'read' => 10],
];
public function call(string $service, string $endpoint): Response
{
$timeout = $this->timeouts[$service] ?? ['connect' => 5, 'read' => 15];
return $this->client->request('GET', "{$service}{$endpoint}", [
'connect_timeout' => $timeout['connect'],
'timeout' => $timeout['read'],
]);
}
}
Connection timeouts should be shorter than read timeouts since they only need to establish the initial connection. Read timeouts depend on expected response time for the specific operation.
Deadline Propagation
When a request passes through multiple services, each service should know how much time remains to complete the overall request. Deadline propagation passes this information through the call chain, allowing services to fail fast when the deadline has already passed.
This implementation tracks the absolute deadline and provides methods to check remaining time. You can pass the context through service calls to ensure downstream services don't waste resources on expired requests.
class RequestContext
{
private float $deadline;
public function __construct(float $timeout)
{
$this->deadline = microtime(true) + $timeout;
}
public function remainingTime(): float
{
return max(0, $this->deadline - microtime(true));
}
public function hasExpired(): bool
{
return $this->remainingTime() <= 0;
}
}
// Propagate deadline through service calls
class OrderService
{
public function createOrder(array $data, RequestContext $context): Order
{
if ($context->hasExpired()) {
throw new DeadlineExceededException();
}
// Call downstream services with remaining time
$user = $this->userService->get($data['user_id'], $context);
if ($context->hasExpired()) {
throw new DeadlineExceededException();
}
$inventory = $this->inventoryService->reserve($data['items'], $context);
// Continue with remaining operations
}
}
Check the deadline before each potentially slow operation to avoid wasting resources on a request that the caller has already abandoned. This is especially important for expensive operations like database writes.
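A related use of the remaining time is to cap each downstream timeout at the caller's remaining budget, so no single call can wait past the overall deadline. A self-contained sketch, where Deadline is a trimmed copy of RequestContext:

```php
<?php

// A trimmed copy of RequestContext, included to keep the sketch self-contained.
final class Deadline
{
    private float $deadline;

    public function __construct(float $timeoutSeconds)
    {
        $this->deadline = microtime(true) + $timeoutSeconds;
    }

    public function remaining(): float
    {
        return max(0.0, $this->deadline - microtime(true));
    }
}

// Never wait longer for a downstream call than the caller is willing to wait.
function downstreamTimeout(Deadline $ctx, float $serviceDefault): float
{
    return min($serviceDefault, $ctx->remaining());
}

$ctx = new Deadline(2.0);                // 2s total budget for this request
$timeout = downstreamTimeout($ctx, 5.0); // capped near the ~2s remaining
```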
Bulkhead Pattern
Thread Pool Isolation
The bulkhead pattern isolates failures by limiting the resources available to any single component. If one downstream service causes all its allocated threads to block, other services can continue operating normally using their separate resource pools.
This bulkhead manager uses semaphores to limit concurrent access to each named resource. When the bulkhead is full, new requests are rejected immediately rather than waiting indefinitely.
class BulkheadManager
{
// Semaphore here stands in for a concurrency primitive from an async library;
// PHP has no built-in userland semaphore with a timed acquire().
private array $semaphores = [];
public function execute(string $name, callable $operation, int $maxConcurrent = 10): mixed
{
$semaphore = $this->getSemaphore($name, $maxConcurrent);
if (!$semaphore->acquire(timeout: 1000)) {
throw new BulkheadRejectedException("Bulkhead {$name} is full");
}
try {
return $operation();
} finally {
$semaphore->release();
}
}
private function getSemaphore(string $name, int $maxConcurrent): Semaphore
{
if (!isset($this->semaphores[$name])) {
$this->semaphores[$name] = new Semaphore($maxConcurrent);
}
return $this->semaphores[$name];
}
}
// Usage: Isolate external service calls
class ExternalApiClient
{
public function fetch(string $endpoint): array
{
return $this->bulkhead->execute(
name: 'external-api',
operation: fn() => $this->doFetch($endpoint),
maxConcurrent: 20
);
}
}
Size bulkheads based on the downstream service's capacity and your service's tolerance for queued requests. Too small means rejecting requests unnecessarily; too large reduces the isolation benefit.
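Little's law gives a useful starting point for that sizing: average concurrency equals arrival rate times average latency, so a bulkhead slightly above that product admits normal traffic while capping pathological queueing. A rough sketch, with an illustrative headroom factor:

```php
<?php

// Little's law: average concurrency = arrival rate x average latency.
// Headroom above that admits normal variance without losing isolation.
function suggestedBulkheadSize(float $requestsPerSecond, float $avgLatencySeconds, float $headroom = 1.5): int
{
    return (int)ceil($requestsPerSecond * $avgLatencySeconds * $headroom);
}

// 40 rps at 250ms average latency: 10 concurrent on average, 15 with headroom.
$size = suggestedBulkheadSize(40.0, 0.25);
```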
Queue-Based Bulkhead
A queue-based bulkhead provides backpressure by accepting requests into a queue when all workers are busy, then rejecting requests once the queue is full. This smooths out load spikes while still protecting against overload.
This implementation adds a bounded queue in front of the processing workers. Requests wait in the queue during transient overload but are rejected when sustained overload fills the queue.
class QueueBulkhead
{
// Deferred and async() below assume an async runtime (for example, AMPHP);
// they are not PHP built-ins.
private SplQueue $queue;
private int $currentlyProcessing = 0;

public function __construct(
private int $maxQueueSize = 100,
private int $processingCapacity = 10,
) {
$this->queue = new SplQueue();
}
public function submit(callable $task): mixed
{
if ($this->queue->count() >= $this->maxQueueSize) {
throw new BulkheadRejectedException('Queue is full');
}
$promise = new Deferred();
$this->queue->enqueue(['task' => $task, 'promise' => $promise]);
$this->processQueue();
return $promise->promise();
}
private function processQueue(): void
{
while ($this->currentlyProcessing < $this->processingCapacity
&& !$this->queue->isEmpty()) {
$item = $this->queue->dequeue();
$this->currentlyProcessing++;
async(function () use ($item) {
try {
$result = ($item['task'])();
$item['promise']->resolve($result);
} catch (Throwable $e) {
$item['promise']->reject($e);
} finally {
$this->currentlyProcessing--;
$this->processQueue();
}
});
}
}
}
The queue size determines how much latency variation you're willing to accept. Larger queues can absorb bigger spikes but increase maximum response time under load.
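That trade-off can be estimated directly: a task admitted at the back of a full queue waits roughly queue-size over worker-count rounds of average task time. A back-of-the-envelope sketch:

```php
<?php

// Rough worst-case added latency for a task at the back of a full queue:
// it must wait for maxQueueSize / workers batches of processing.
function worstCaseQueueWait(int $maxQueueSize, int $workers, float $avgTaskSeconds): float
{
    return ($maxQueueSize / $workers) * $avgTaskSeconds;
}

// 100 queued items, 10 workers, 50ms tasks: about half a second of extra wait.
$wait = worstCaseQueueWait(100, 10, 0.05);
```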
Fallback Pattern
Graceful Degradation
When a service fails, you often have multiple fallback strategies of decreasing quality but increasing reliability. This example shows a cascading fallback from primary API to cache to database to a placeholder response.
Each fallback level provides less complete data but higher availability. The flags on the returned object indicate data quality so the UI can communicate this to users.
class ProductService
{
public function getDetails(string $productId): ProductDetails
{
try {
// Try primary service
return $this->productApi->getDetails($productId);
} catch (ServiceUnavailableException $e) {
// Fallback 1: Try cache
$cached = Cache::get("product:{$productId}");
if ($cached) {
return $cached->withStaleWarning();
}
// Fallback 2: Try database
$product = Product::find($productId);
if ($product) {
return ProductDetails::fromModel($product)->withLimitedData();
}
// Fallback 3: Return placeholder
return ProductDetails::placeholder($productId);
}
}
}
class ProductDetails
{
public string $id = '';
public string $name = '';
public bool $isStale = false;
public bool $isLimited = false;
public bool $isPlaceholder = false;
public function withStaleWarning(): self
{
$clone = clone $this;
$clone->isStale = true;
return $clone;
}
public function withLimitedData(): self
{
$clone = clone $this;
$clone->isLimited = true;
return $clone;
}
public static function placeholder(string $id): self
{
$details = new self();
$details->id = $id;
$details->name = 'Product information unavailable';
$details->isPlaceholder = true;
return $details;
}
}
The flags indicating stale, limited, or placeholder data let the UI communicate data quality to users. A slightly stale price is usually better than showing nothing, but users should know the information might not be current.
Feature Toggle Fallbacks
For complex features that depend on external services, you can check service health proactively and route to fallback implementations before users experience failures. This pattern combines health checks with feature toggles.
By checking service health before attempting the primary implementation, you can avoid the latency penalty of waiting for a timeout. The fallback provides immediate response, even if it's less sophisticated.
class FeatureService
{
public function getRecommendations(User $user): array
{
// If ML service is down, use rule-based fallback
if (!$this->healthCheck->isHealthy('ml-service')) {
return $this->getRuleBasedRecommendations($user);
}
try {
return $this->mlService->getRecommendations($user);
} catch (ServiceException $e) {
Log::warning('ML service failed, using fallback', [
'user_id' => $user->id,
'error' => $e->getMessage(),
]);
return $this->getRuleBasedRecommendations($user);
}
}
private function getRuleBasedRecommendations(User $user): array
{
// Simple rule-based recommendations
return Product::query()
->where('category', $user->preferredCategory)
->orderBy('popularity', 'desc')
->limit(10)
->get()
->toArray();
}
}
Proactive health checks avoid the latency penalty of waiting for a timeout when you already know a service is unhealthy. The fallback recommendation algorithm may be less sophisticated, but it responds instantly.
Rate Limiting
Token Bucket
Rate limiting protects both your service and downstream dependencies from being overwhelmed. The token bucket algorithm provides smooth rate limiting with the ability to handle short bursts, making it more user-friendly than strict rate limits.
This implementation replenishes tokens over time up to a maximum capacity. You can acquire multiple tokens for expensive operations, and the burst capacity allows handling short traffic spikes.
class RateLimiter
{
private int $capacity;
private int $refillRate; // tokens per second
private float $tokens;
private float $lastRefill;
public function __construct(int $capacity, int $refillRate)
{
$this->capacity = $capacity;
$this->refillRate = $refillRate;
$this->tokens = $capacity;
$this->lastRefill = microtime(true);
}
public function tryAcquire(int $tokens = 1): bool
{
$this->refill();
if ($this->tokens >= $tokens) {
$this->tokens -= $tokens;
return true;
}
return false;
}
private function refill(): void
{
$now = microtime(true);
$elapsed = $now - $this->lastRefill;
$tokensToAdd = $elapsed * $this->refillRate;
$this->tokens = min($this->capacity, $this->tokens + $tokensToAdd);
$this->lastRefill = $now;
}
}
// Usage with circuit breaker
class ResilientClient
{
public function call(string $service, callable $operation): mixed
{
// Check rate limit first
if (!$this->rateLimiter->tryAcquire()) {
throw new RateLimitExceededException();
}
// Then circuit breaker
return $this->circuitBreaker->call($operation);
}
}
The capacity determines how much burst traffic is allowed, while the refill rate controls sustained throughput. Setting capacity equal to refill rate gives you one second of burst capacity.
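That relationship can be stated directly: a full bucket absorbs capacity divided by refill rate seconds' worth of sustained-rate traffic before throttling begins. A quick sketch:

```php
<?php

// Burst headroom of a token bucket: how many seconds of sustained-rate
// traffic a full bucket can absorb before throttling begins.
function burstSeconds(int $capacity, int $refillRatePerSecond): float
{
    return $capacity / $refillRatePerSecond;
}

// capacity == refill rate gives exactly one second of burst headroom;
// a larger capacity buys proportionally more.
$oneSecond = burstSeconds(100, 100);
$fiveSeconds = burstSeconds(500, 100);
```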
Health Checks
Dependency Health
A comprehensive health check system monitors not just whether your service is running, but whether all its dependencies are accessible. This information drives circuit breakers, load balancer decisions, and operational alerting.
This implementation registers named health checks and runs them on demand, capturing both success status and latency. The results provide a complete picture of service health for operational dashboards and automated systems.
class HealthCheckService
{
private array $checks = [];
public function register(string $name, callable $check): void
{
$this->checks[$name] = $check;
}
public function check(): HealthStatus
{
$results = [];
$healthy = true;
foreach ($this->checks as $name => $check) {
try {
$start = microtime(true);
$check();
$duration = microtime(true) - $start;
$results[$name] = [
'status' => 'healthy',
'latency_ms' => round($duration * 1000, 2),
];
} catch (Throwable $e) {
$healthy = false;
$results[$name] = [
'status' => 'unhealthy',
'error' => $e->getMessage(),
];
}
}
return new HealthStatus($healthy, $results);
}
}
// Register checks
$health->register('database', fn() => DB::select('SELECT 1'));
$health->register('redis', fn() => Redis::ping());
$health->register('payment-api', fn() => Http::get('https://api.stripe.com/health'));
Include latency measurements in health check results to catch performance degradation before it causes failures. A dependency responding in 5 seconds instead of 50ms is technically healthy but effectively broken for your application.
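A three-way status makes that explicit. This sketch classifies a check as degraded when it succeeds but exceeds a latency threshold; the threshold value is illustrative:

```php
<?php

// Classify a dependency by latency as well as success, so a "technically
// healthy" but slow dependency is surfaced before it fails outright.
function classifyHealth(bool $succeeded, float $latencyMs, float $degradedThresholdMs = 500.0): string
{
    if (!$succeeded) {
        return 'unhealthy';
    }
    return $latencyMs > $degradedThresholdMs ? 'degraded' : 'healthy';
}

$status = classifyHealth(true, 5000.0); // slow but answering: 'degraded'
```

A load balancer or circuit breaker can then treat degraded dependencies differently from hard failures, for example by shedding optional traffic first.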
Conclusion
Resilience patterns protect your microservices from cascade failures. Use circuit breakers to fail fast when dependencies are unhealthy. Implement retries with exponential backoff and idempotency for safe recovery. Set appropriate timeouts and propagate deadlines through the call chain. Isolate failures with bulkheads to prevent resource exhaustion. Provide fallbacks for graceful degradation. These patterns together create systems that survive partial failures and maintain availability when things go wrong.