Service discovery enables services to find and communicate with each other dynamically. In distributed systems where services scale up and down, hard-coded addresses don't work. Here's how to implement service discovery effectively.
Why Service Discovery?
The Problem
In dynamic environments, services start and stop frequently. Manual configuration can't keep up with the pace of change in modern infrastructure. The contrast between traditional static configuration and service discovery illustrates why this capability is essential for modern architectures.
Without service discovery:
- Hard-coded IPs that change when services restart
- Manual configuration updates
- No automatic failover
- Can't scale dynamically
With service discovery:
- Services register themselves
- Clients discover services by name
- Automatic removal of unhealthy instances
- Dynamic scaling just works
Discovery Patterns
There are two main approaches to service discovery. Each has trade-offs in terms of complexity, performance, and client requirements. Understanding these patterns helps you choose the right approach for your architecture.
Client-side discovery:
Client → Registry → Get addresses → Client calls service directly
Server-side discovery:
Client → Load Balancer → Registry → Route to service
Client-side discovery gives you more control and eliminates the load balancer as a potential bottleneck, but requires smarter clients. Server-side discovery keeps clients simple but adds infrastructure complexity.
DNS-Based Discovery
Kubernetes DNS
Kubernetes provides built-in service discovery through DNS. When you create a Service, Kubernetes automatically creates DNS records that resolve to the service's ClusterIP. This approach is elegant because it requires no special client libraries or configuration.
The following Service definition creates DNS records that allow any pod in the cluster to reach your service by name.
# Service creates DNS records automatically
apiVersion: v1
kind: Service
metadata:
name: user-service
namespace: production
spec:
selector:
app: user-service
ports:
- port: 80
targetPort: 8080
# DNS records created:
# user-service.production.svc.cluster.local → ClusterIP
# user-service.production.svc → ClusterIP
# user-service → ClusterIP (within same namespace)
Notice that Kubernetes creates multiple DNS names with varying levels of specificity. Within the same namespace, you can use just the service name for simplicity.
Your application code doesn't need to know about service discovery at all. Just use the service name, and DNS handles the resolution. This simplicity is one of Kubernetes' greatest strengths.
# Application code - just use service name
import requests
def get_user(user_id: str):
# DNS resolves to service ClusterIP
response = requests.get(f"http://user-service/users/{user_id}")
return response.json()
Headless Services
For stateful workloads where you need to connect to specific pods, headless services return individual pod IPs instead of a single ClusterIP. You'll use this pattern for databases, caches, and other stateful systems where clients need to distinguish between instances.
# Headless service returns pod IPs directly
apiVersion: v1
kind: Service
metadata:
name: database
spec:
clusterIP: None # Headless
selector:
app: postgres
ports:
- port: 5432
# DNS returns individual pod IPs:
# database.default.svc.cluster.local → Pod IP 1, Pod IP 2, Pod IP 3
# Useful for stateful workloads where client needs to connect to specific pods
The key difference is clusterIP: None, which tells Kubernetes not to allocate a virtual IP. Instead, DNS queries return all pod IPs directly.
Database replication and leader election scenarios often use headless services to address specific replicas directly. This allows clients to route writes to the primary and reads to replicas.
AWS Cloud Map
AWS Cloud Map integrates service discovery with ECS and other AWS services. It creates DNS records in a private hosted zone that your services query. This Terraform configuration shows how to set up Cloud Map for an ECS service.
# Terraform - AWS Cloud Map
resource "aws_service_discovery_private_dns_namespace" "main" {
name = "internal.example.com"
vpc = aws_vpc.main.id
}
resource "aws_service_discovery_service" "user_service" {
name = "user-service"
dns_config {
namespace_id = aws_service_discovery_private_dns_namespace.main.id
dns_records {
ttl = 10
type = "A"
}
routing_policy = "MULTIVALUE" # Return all healthy instances
}
health_check_custom_config {
failure_threshold = 1
}
}
# ECS service registers automatically
resource "aws_ecs_service" "user_service" {
name = "user-service"
cluster = aws_ecs_cluster.main.id
task_definition = aws_ecs_task_definition.user_service.arn
service_registries {
registry_arn = aws_service_discovery_service.user_service.arn
}
}
# Applications use DNS:
# user-service.internal.example.com
The MULTIVALUE routing policy returns all healthy instances, enabling client-side load balancing. The short TTL ensures clients see new instances quickly. When ECS scales your service, Cloud Map automatically updates DNS records.
Consul Service Discovery
Service Registration
Consul uses agents running on each node to register services and perform health checks. This JSON configuration defines a service with an HTTP health check. You can place this file in Consul's config directory for automatic registration on agent startup.
// consul-service.json
{
"service": {
"name": "user-service",
"id": "user-service-1",
"port": 8080,
"tags": ["v1", "production"],
"meta": {
"version": "1.2.3",
"protocol": "http"
},
"check": {
"http": "http://localhost:8080/health",
"interval": "10s",
"timeout": "5s"
}
}
}
The tags and meta fields let you add arbitrary metadata that clients can use for routing decisions. For example, you might route traffic based on version tags during a canary deployment.
Services can also register programmatically using the Consul API or client libraries. This is useful when service details aren't known until runtime.
# Register via API
curl -X PUT -d @consul-service.json \
http://consul:8500/v1/agent/service/register
Service Discovery
Consul clients query the catalog or health endpoints to find available service instances. This Go example shows client-side service discovery with random load balancing. You'll typically wrap this in a client library that handles caching and connection pooling.
// Go client with Consul
import (
"github.com/hashicorp/consul/api"
)
type ServiceDiscovery struct {
client *api.Client
}
func (sd *ServiceDiscovery) GetHealthyInstances(serviceName string) ([]*api.ServiceEntry, error) {
entries, _, err := sd.client.Health().Service(serviceName, "", true, nil)
if err != nil {
return nil, err
}
return entries, nil
}
func (sd *ServiceDiscovery) GetServiceURL(serviceName string) (string, error) {
entries, err := sd.GetHealthyInstances(serviceName)
if err != nil || len(entries) == 0 {
return "", fmt.Errorf("no healthy instances of %s", serviceName)
}
// Simple round-robin (in practice, use proper load balancing)
entry := entries[rand.Intn(len(entries))]
return fmt.Sprintf("http://%s:%d", entry.Service.Address, entry.Service.Port), nil
}
// Usage
url, err := discovery.GetServiceURL("user-service")
if err != nil {
return err
}
resp, err := http.Get(url + "/users/123")
Notice the true parameter in Health().Service() which filters for only passing health checks. In production, cache the service list and refresh it periodically rather than querying Consul for every request.
Consul DNS Interface
Consul also exposes service discovery through DNS, which means any application that can do DNS lookups can discover services without Consul-specific client code. This approach works well when you can't modify application code or when using off-the-shelf software.
# Consul provides DNS interface
# Query for healthy instances
dig @consul user-service.service.consul
# Query for specific tag
dig @consul v1.user-service.service.consul
# In application - just use DNS
curl http://user-service.service.consul:8080/users/123
Consul Connect (Service Mesh)
For encrypted service-to-service communication, Consul Connect provides a service mesh with automatic mTLS. Sidecar proxies handle encryption transparently. This configuration defines a service that communicates with a database through the mesh.
# Sidecar proxy for mTLS and service mesh
service {
name = "user-service"
port = 8080
connect {
sidecar_service {
proxy {
upstreams {
destination_name = "database"
local_bind_port = 5432
}
}
}
}
}
# Application connects to localhost:5432
# Consul proxy handles routing and mTLS
Your application connects to localhost, and the sidecar proxy handles service discovery, load balancing, and encryption. This keeps your application code simple and security concerns out of your application logic.
etcd Service Discovery
Registration
etcd provides a distributed key-value store that can serve as a service registry. The lease mechanism ensures stale registrations are automatically removed. This pattern is commonly used in Kubernetes, which uses etcd as its backing store.
The following Go code demonstrates how to register a service with a lease that expires if not renewed.
// etcd-based service registration
import (
clientv3 "go.etcd.io/etcd/client/v3"
)
type EtcdRegistry struct {
client *clientv3.Client
lease clientv3.LeaseID
}
func (r *EtcdRegistry) Register(serviceName, instanceID, address string) error {
// Create lease for TTL
lease, err := r.client.Grant(context.Background(), 30) // 30 second TTL
if err != nil {
return err
}
r.lease = lease.ID
// Register service with lease
key := fmt.Sprintf("/services/%s/%s", serviceName, instanceID)
value := fmt.Sprintf(`{"address": "%s", "port": 8080}`, address)
_, err = r.client.Put(context.Background(), key, value, clientv3.WithLease(lease.ID))
if err != nil {
return err
}
// Keep lease alive
ch, err := r.client.KeepAlive(context.Background(), lease.ID)
if err != nil {
return err
}
go func() {
for range ch {
// Lease renewed
}
}()
return nil
}
func (r *EtcdRegistry) Deregister(serviceName, instanceID string) error {
key := fmt.Sprintf("/services/%s/%s", serviceName, instanceID)
_, err := r.client.Delete(context.Background(), key)
return err
}
The KeepAlive goroutine renews the lease periodically. If the service crashes, the lease expires and etcd automatically removes the registration. This self-healing behavior is crucial for reliable service discovery.
Discovery with Watch
etcd's watch feature enables real-time updates when services register or deregister. This eliminates polling and ensures clients have current service lists. You'll use this pattern when service changes need to propagate immediately.
func (r *EtcdRegistry) Watch(serviceName string, callback func([]ServiceInstance)) {
prefix := fmt.Sprintf("/services/%s/", serviceName)
// Initial fetch
resp, _ := r.client.Get(context.Background(), prefix, clientv3.WithPrefix())
instances := parseInstances(resp)
callback(instances)
// Watch for changes
watchCh := r.client.Watch(context.Background(), prefix, clientv3.WithPrefix())
go func() {
for watchResp := range watchCh {
for _, event := range watchResp.Events {
switch event.Type {
case clientv3.EventTypePut:
// Instance added or updated
case clientv3.EventTypeDelete:
// Instance removed
}
}
// Refresh full list
resp, _ := r.client.Get(context.Background(), prefix, clientv3.WithPrefix())
instances := parseInstances(resp)
callback(instances)
}
}()
}
The callback pattern lets you update a local cache whenever the service list changes. This approach provides the lowest latency for service discovery updates.
Health Checking
Active Health Checks
Health checks verify that services can actually handle requests. Return detailed status when healthy, and appropriate error codes when dependencies are down. This allows load balancers and service meshes to route around problems.
// Health check endpoint
func healthHandler(w http.ResponseWriter, r *http.Request) {
// Check dependencies
checks := map[string]bool{
"database": checkDatabase(),
"cache": checkRedis(),
"disk": checkDiskSpace(),
}
allHealthy := true
for _, healthy := range checks {
if !healthy {
allHealthy = false
break
}
}
if allHealthy {
w.WriteHeader(http.StatusOK)
json.NewEncoder(w).Encode(map[string]interface{}{
"status": "healthy",
"checks": checks,
})
} else {
w.WriteHeader(http.StatusServiceUnavailable)
json.NewEncoder(w).Encode(map[string]interface{}{
"status": "unhealthy",
"checks": checks,
})
}
}
Including individual check results in the response helps with debugging. Operations teams can quickly see which dependency is causing problems.
Kubernetes Probes
Kubernetes supports three types of probes, each serving a different purpose. Using them correctly ensures smooth deployments and reliable self-healing. Misconfiguring probes is a common source of deployment issues.
apiVersion: apps/v1
kind: Deployment
spec:
template:
spec:
containers:
- name: app
livenessProbe:
httpGet:
path: /health/live
port: 8080
initialDelaySeconds: 10
periodSeconds: 10
failureThreshold: 3
readinessProbe:
httpGet:
path: /health/ready
port: 8080
initialDelaySeconds: 5
periodSeconds: 5
failureThreshold: 3
startupProbe:
httpGet:
path: /health/startup
port: 8080
failureThreshold: 30
periodSeconds: 10
The initialDelaySeconds gives your application time to start before probes begin. The failureThreshold determines how many consecutive failures trigger action.
Implement each probe to check what it's actually asking. Liveness checks if the process is alive. Readiness checks if it can handle traffic. Startup gives slow-starting apps time to initialize. Getting these right prevents unnecessary restarts while ensuring traffic only reaches healthy instances.
// Different health endpoints for different purposes
func livenessHandler(w http.ResponseWriter, r *http.Request) {
// Is the process alive?
w.WriteHeader(http.StatusOK)
}
func readinessHandler(w http.ResponseWriter, r *http.Request) {
// Can we handle traffic?
if dbConnected && cacheConnected && !shuttingDown {
w.WriteHeader(http.StatusOK)
} else {
w.WriteHeader(http.StatusServiceUnavailable)
}
}
func startupHandler(w http.ResponseWriter, r *http.Request) {
// Has initialization completed?
if initialized {
w.WriteHeader(http.StatusOK)
} else {
w.WriteHeader(http.StatusServiceUnavailable)
}
}
Keep liveness checks simple. A liveness probe that checks external dependencies can cause cascading failures when those dependencies have issues.
Client-Side Load Balancing
gRPC with Consul
gRPC supports client-side load balancing natively. Register a custom resolver that queries your service registry, and gRPC handles the rest. This gives you fine-grained control over load balancing without external infrastructure.
import (
"google.golang.org/grpc"
"google.golang.org/grpc/resolver"
_ "github.com/hashicorp/consul/api/grpc" // Register consul resolver
)
// Register consul resolver
resolver.Register(consulResolver)
// Connect with client-side load balancing
conn, err := grpc.Dial(
"consul://user-service",
grpc.WithDefaultServiceConfig(`{"loadBalancingPolicy":"round_robin"}`),
grpc.WithInsecure(),
)
// Client automatically discovers and load balances across instances
client := pb.NewUserServiceClient(conn)
The round-robin policy distributes requests evenly. Other policies like pick_first or custom weighted algorithms are also available. The connection automatically adapts as services scale up or down.
HTTP Client with Service Discovery
For HTTP clients, wrap your HTTP client to perform service discovery before each request. Cache the service list to avoid discovery latency on every call. This pattern works with any HTTP client library.
type DiscoveryHTTPClient struct {
discovery ServiceDiscovery
client *http.Client
}
func (c *DiscoveryHTTPClient) Get(serviceName, path string) (*http.Response, error) {
// Get healthy instances
instances, err := c.discovery.GetHealthyInstances(serviceName)
if err != nil || len(instances) == 0 {
return nil, fmt.Errorf("no instances available for %s", serviceName)
}
// Simple round-robin
instance := instances[rand.Intn(len(instances))]
url := fmt.Sprintf("http://%s:%d%s", instance.Address, instance.Port, path)
return c.client.Get(url)
}
// Usage
resp, err := client.Get("user-service", "/users/123")
In production, maintain a local cache of service instances and refresh it in the background. This avoids discovery overhead on the critical path while ensuring reasonably current service information.
Graceful Shutdown
When a service shuts down, it must deregister from service discovery before stopping. This prevents other services from sending requests to a dying instance. The shutdown sequence is critical for zero-downtime deployments.
func main() {
// Register service
registry := NewServiceRegistry()
registry.Register("my-service", "instance-1", "10.0.0.1:8080")
// Start server
server := &http.Server{Addr: ":8080"}
go server.ListenAndServe()
// Wait for shutdown signal
quit := make(chan os.Signal, 1)
signal.Notify(quit, syscall.SIGINT, syscall.SIGTERM)
<-quit
// Graceful shutdown
log.Println("Shutting down...")
// 1. Mark as not ready (stop receiving new traffic)
isReady.Store(false)
// 2. Deregister from service discovery
registry.Deregister("my-service", "instance-1")
// 3. Wait for in-flight requests
time.Sleep(5 * time.Second)
// 4. Shutdown server
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
server.Shutdown(ctx)
log.Println("Shutdown complete")
}
The sleep before server shutdown gives time for service discovery caches and load balancers to remove this instance from their pools. Without this delay, you'll see connection errors during deployments.
Conclusion
Service discovery is essential for dynamic distributed systems. DNS-based discovery (Kubernetes DNS, Cloud Map) is simple and works well for most cases. Consul provides advanced features like health checking, service mesh, and cross-datacenter discovery. etcd offers strong consistency for critical coordination. Implement health checks that accurately reflect service readiness, use client-side load balancing for better resilience, and ensure graceful shutdown to prevent dropped connections. The right choice depends on your infrastructure, but all approaches share the goal of decoupling service location from service identity.