Database sharding distributes data across multiple database instances to handle scale that a single server cannot. When your dataset grows beyond what fits on one machine, or when query load exceeds what one server can handle, sharding provides horizontal scalability. But this power comes with significant complexity that affects every layer of your application.
The fundamental tradeoff is between scale and simplicity. A single database gives you ACID transactions, straightforward queries, and simple operations. Sharding gives you near-unlimited scalability but fragments that simplicity. Queries that span shards become complex. Transactions across shards require distributed coordination. Operations that were trivial become engineering challenges.
Understanding sharding strategies helps you make informed decisions about when sharding is necessary and how to implement it with minimal pain.
Shard Keys and Distribution
The shard key determines which shard holds each piece of data. Choosing the right key is the most critical decision in sharding. A good key distributes data evenly, keeps related data together, and supports your query patterns.
Hash-based sharding applies a hash function to the shard key and uses the result to select a shard. This provides even distribution regardless of key values. If you shard by user ID, hashing spreads users evenly across shards even when IDs are sequential. Note that the simple hash-modulo scheme below remaps most keys whenever the shard count changes; consistent hashing, discussed under resharding, avoids that.
class ShardRouter
{
    /** @var DatabaseConnection[] */
    private array $shards;
    private int $shardCount;

    public function __construct(array $shards)
    {
        $this->shards = array_values($shards);
        $this->shardCount = count($this->shards);
    }

    public function getShard(string $shardKey): DatabaseConnection
    {
        // crc32() is non-negative on 64-bit PHP, so the modulo
        // always yields a valid shard index.
        $hash = crc32($shardKey);
        $shardIndex = $hash % $this->shardCount;
        return $this->shards[$shardIndex];
    }

    public function getShardForUser(int $userId): DatabaseConnection
    {
        return $this->getShard("user:{$userId}");
    }
}
Range-based sharding assigns contiguous key ranges to each shard. Users 1-1000000 go to shard 1, users 1000001-2000000 to shard 2, and so on. This keeps sequential IDs together, which can be useful for range queries, but risks hotspots if recent data is accessed more frequently.
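A minimal range router might look like the following sketch. The boundary values and class name are illustrative, not a particular library's API:

```php
<?php
// Illustrative range-based router: each entry maps an inclusive
// upper bound of the ID range to a shard index.
class RangeShardRouter
{
    /** @var array<int, int> upper bound => shard index */
    private array $ranges;

    public function __construct(array $ranges)
    {
        ksort($ranges); // ensure boundaries are scanned in ascending order
        $this->ranges = $ranges;
    }

    public function getShardIndex(int $userId): int
    {
        foreach ($this->ranges as $upperBound => $shardIndex) {
            if ($userId <= $upperBound) {
                return $shardIndex;
            }
        }
        throw new OutOfRangeException("No shard covers user {$userId}");
    }
}
```

With boundaries `[1000000 => 0, 2000000 => 1]`, user 5 routes to shard 0 and user 1500000 to shard 1. A real implementation would store the boundary table somewhere shared so it can grow as new ranges open up.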
Directory-based sharding uses a lookup table to map keys to shards. This provides maximum flexibility but adds a lookup step to every query. The directory itself can become a bottleneck or single point of failure.
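The lookup step can be sketched as below; the in-memory array stands in for a real directory service or table, and the names are assumptions for illustration:

```php
<?php
// Illustrative directory-based router: a lookup table maps each key to
// its shard, so individual keys can be reassigned (e.g. to drain a shard).
class DirectoryShardRouter
{
    /** @var array<string, int> shard key => shard index */
    private array $directory = [];
    private int $defaultShard;

    public function __construct(int $defaultShard = 0)
    {
        $this->defaultShard = $defaultShard;
    }

    public function assign(string $shardKey, int $shardIndex): void
    {
        $this->directory[$shardKey] = $shardIndex;
    }

    public function getShardIndex(string $shardKey): int
    {
        // Every query pays this extra indirection; in production the
        // directory is cached aggressively so it doesn't become a bottleneck.
        return $this->directory[$shardKey] ?? $this->defaultShard;
    }
}
```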
Choosing a Shard Key
The ideal shard key has high cardinality, even distribution, and matches your access patterns. User ID works well for user-centric applications where most queries are scoped to a single user. Tenant ID works for multi-tenant SaaS where each tenant's data is independent.
Poor shard keys create hotspots. Sharding by timestamp concentrates all recent activity on one shard. Sharding by country puts most traffic on a few shards while others sit idle. Sharding by a low-cardinality field like status creates severe imbalance.
// Good: High cardinality, even distribution
class OrderRepository
{
    public function save(Order $order): void
    {
        // Shard by user_id - queries are usually user-scoped
        $shard = $this->router->getShardForUser($order->userId);
        $shard->insert('orders', $order->toArray());
    }

    public function findByUser(int $userId): array
    {
        // Single shard query - efficient
        $shard = $this->router->getShardForUser($userId);
        return $shard->select('orders', ['user_id' => $userId]);
    }
}

// Problematic: Requires scatter-gather for common queries
class OrderRepository
{
    public function findByDateRange(DateTime $start, DateTime $end): array
    {
        // Must query ALL shards and merge results
        $results = [];
        foreach ($this->router->getAllShards() as $shard) {
            $results = array_merge(
                $results,
                $shard->select('orders', ['created_at' => ['between', $start, $end]])
            );
        }
        return $results;
    }
}
Cross-Shard Queries
Some queries inevitably span multiple shards. A report aggregating all orders, a search across all users, or any query not filtered by the shard key requires scatter-gather: query all shards, collect results, merge them in the application.
Scatter-gather queries are expensive. They scale linearly with shard count; more shards mean more parallel queries to execute and more results to merge. They're also less predictable; one slow shard delays the entire query.
Design your schema to minimize cross-shard queries. Denormalize data so common queries can be satisfied from a single shard. Use secondary indexes in a search system for queries that don't align with your shard key. Accept that some queries will be slow and plan accordingly.
class CrossShardQuery
{
    public function aggregateOrderStats(): array
    {
        $shards = $this->router->getAllShards();

        // Execute in parallel for better performance
        $promises = [];
        foreach ($shards as $shard) {
            $promises[] = $this->async->execute(function () use ($shard) {
                return $shard->query("
                    SELECT
                        COUNT(*) as order_count,
                        SUM(total) as total_revenue
                    FROM orders
                    WHERE created_at > ?
                ", [now()->subDays(30)]);
            });
        }

        // Wait for all shards and merge. The global average must be
        // recomputed from the merged totals: averaging per-shard averages
        // would weight every shard equally regardless of its order count.
        $results = $this->async->all($promises);
        $orderCount = array_sum(array_column($results, 'order_count'));
        $totalRevenue = array_sum(array_column($results, 'total_revenue'));
        return [
            'order_count' => $orderCount,
            'total_revenue' => $totalRevenue,
            'average_order' => $orderCount > 0 ? $totalRevenue / $orderCount : 0,
        ];
    }
}
Transactions Across Shards
Single-database transactions don't work across shards. When data spans shards, you need distributed transaction protocols or alternative consistency approaches.
Two-phase commit (2PC) coordinates transactions across shards but has significant drawbacks: blocking behavior, coordinator failure vulnerability, and performance overhead. It's rarely used for application-level sharding.
Saga patterns break distributed operations into a sequence of local transactions with compensating actions for rollback. If transferring money between users on different shards, debit on one shard, credit on the other, and run compensation logic to reverse the debit if the credit fails.
class TransferSaga
{
    public function transfer(int $fromUserId, int $toUserId, Money $amount): void
    {
        $fromShard = $this->router->getShardForUser($fromUserId);
        $toShard = $this->router->getShardForUser($toUserId);

        // Step 1: Debit source account (local transaction on one shard)
        $fromShard->beginTransaction();
        try {
            $fromShard->decrement('accounts', 'balance', $amount->cents(), [
                'user_id' => $fromUserId,
            ]);
            $debitId = $fromShard->insertGetId('transactions', [
                'user_id' => $fromUserId,
                'amount' => -$amount->cents(),
                'status' => 'pending',
            ]);
            $fromShard->commit();
        } catch (Exception $e) {
            $fromShard->rollback();
            throw $e;
        }

        // Step 2: Credit destination account (local transaction on the other shard)
        $toShard->beginTransaction();
        try {
            $toShard->increment('accounts', 'balance', $amount->cents(), [
                'user_id' => $toUserId,
            ]);
            $toShard->insert('transactions', [
                'user_id' => $toUserId,
                'amount' => $amount->cents(),
                'status' => 'complete',
            ]);
            $toShard->commit();

            // Mark source transaction complete
            $fromShard->update('transactions', ['status' => 'complete'], ['id' => $debitId]);
        } catch (Exception $e) {
            $toShard->rollback();
            // Compensate: reverse the debit on the source shard
            // (compensateDebit issues the offsetting credit - not shown)
            $this->compensateDebit($fromShard, $fromUserId, $amount, $debitId);
            throw $e;
        }
    }
}
Resharding
Eventually, you'll need to add shards or rebalance data. Resharding is one of sharding's most challenging operations. Data must move between shards while the system continues serving requests.
Online resharding migrates data incrementally while the system remains available. New writes go to both old and new locations during migration. Reads check both until migration completes. This requires careful coordination and application awareness of the migration state.
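The routing logic for that dual-write phase can be sketched as below. The modulo shard selection, class name, and migration-state arrays are assumptions for illustration, not a specific tool's API:

```php
<?php
// Illustrative routing for the dual-write phase of an online resharding.
class ReshardingRouter
{
    public function __construct(
        private int $oldShardCount,
        private int $newShardCount,
        private array $migratingKeys = [], // keys currently being copied
        private array $migratedKeys = [],  // keys fully moved
    ) {
    }

    private function oldShard(string $key): string
    {
        return 'old-' . (crc32($key) % $this->oldShardCount);
    }

    private function newShard(string $key): string
    {
        return 'new-' . (crc32($key) % $this->newShardCount);
    }

    /** Shards a write must go to for this key. */
    public function writeTargets(string $key): array
    {
        if (in_array($key, $this->migratingKeys, true)) {
            // Dual-write so neither layout misses the update mid-migration
            return [$this->oldShard($key), $this->newShard($key)];
        }
        return in_array($key, $this->migratedKeys, true)
            ? [$this->newShard($key)]
            : [$this->oldShard($key)];
    }

    /** Shards a read should consult, in preference order. */
    public function readTargets(string $key): array
    {
        if (in_array($key, $this->migratingKeys, true)) {
            return [$this->newShard($key), $this->oldShard($key)];
        }
        return $this->writeTargets($key);
    }
}
```

Once the copy for a key range finishes and is verified, its keys move from the migrating set to the migrated set, and the old copies can be deleted.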
Consistent hashing simplifies resharding by minimizing data movement. When adding a shard, only the keys that map to the new shard move. Virtual nodes, which place each physical shard at many points on the hash ring, smooth out the distribution and spread the rebalancing load across all remaining shards when a node is added or removed.
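A minimal ring with virtual nodes might look like this sketch (the class name and virtual-node count are illustrative):

```php
<?php
// Illustrative consistent-hash ring with virtual nodes. Each physical
// shard is hashed onto the ring many times; a key is served by the first
// ring position at or after the key's hash, wrapping around at the end.
class ConsistentHashRing
{
    /** @var array<int, string> ring position => shard name, sorted */
    private array $ring = [];

    public function __construct(array $shardNames, int $virtualNodes = 100)
    {
        foreach ($shardNames as $name) {
            for ($i = 0; $i < $virtualNodes; $i++) {
                $this->ring[crc32("{$name}#{$i}")] = $name;
            }
        }
        ksort($this->ring);
    }

    public function getShard(string $key): string
    {
        $hash = crc32($key);
        foreach ($this->ring as $position => $name) {
            if ($position >= $hash) {
                return $name;
            }
        }
        // Wrapped past the last position: use the first one on the ring
        return reset($this->ring);
    }
}
```

Adding a shard inserts its virtual nodes into the ring, and only keys whose hashes fall into those new arcs move; with plain hash-modulo routing, nearly every key would be remapped. A production ring would use binary search over the sorted positions rather than a linear scan.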
When Not to Shard
Sharding's complexity is substantial. Before sharding, exhaust simpler scaling options. Vertical scaling, read replicas, caching, and query optimization often provide sufficient scale with far less complexity.
If your data fits on one machine and query load can be handled by read replicas, you probably don't need sharding. If you can archive old data to reduce active dataset size, sharding might be unnecessary. If caching can reduce database load sufficiently, that's simpler than sharding.
Consider sharding when you've truly exhausted other options, when you need write scalability (not just read), or when data volume exceeds single-machine capacity. Even then, consider managed sharding solutions that handle the operational complexity.
Conclusion
Database sharding enables horizontal scalability for datasets and workloads that exceed single-server capacity. The shard key determines how data distributes and must align with your access patterns. Cross-shard queries and transactions require careful handling that adds application complexity.
Sharding is a significant architectural commitment. It affects how you design schemas, write queries, handle transactions, and operate your infrastructure. The complexity cost is high enough that sharding should be a last resort after simpler approaches are exhausted. But when you genuinely need it, sharding provides a path to virtually unlimited scale.