Infrastructure testing validates that your infrastructure code works correctly before deployment. As infrastructure-as-code becomes standard practice, treating infrastructure with the same rigor as application code, including testing, becomes essential. Untested infrastructure changes cause outages in many ways: misconfigured security groups, broken networking, and resource limits that only manifest at scale.
Testing infrastructure differs from testing application code. You're testing declarations of desired state, not algorithms. The "execution environment" is a cloud provider API, not a local runtime. Tests are slower and potentially more expensive. These differences require adapted testing strategies.
Unit Testing Infrastructure Code
Unit tests validate individual infrastructure components in isolation. For Terraform, this means testing modules. For CloudFormation, testing nested stacks. For Pulumi or CDK, testing constructs or components.
Unit tests verify that given certain inputs, your infrastructure code produces expected outputs. They don't actually create resources; they validate the configuration that would be created.
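    # Testing a Terraform module with pytest by driving the terraform CLI.
    # Assumes terraform is on PATH, the AWS provider can be configured from the
    # environment (region and credentials), the tests run from the repository
    # root, and the module under test lives in ./modules/vpc.
    import json
    import os
    import subprocess

    MODULE_DIR = os.path.abspath("modules/vpc")


    def plan_vpc_module(workdir, arguments):
        """Plan the VPC module with the given HCL arguments; return planned values by address."""
        # Local module sources must start with ./ or ../, so use a relative path
        source = os.path.relpath(MODULE_DIR, workdir)
        config = 'module "vpc" {\n  source = "%s"\n%s\n}\n' % (source, arguments)
        # Write the configuration before init so Terraform can resolve the module
        (workdir / "main.tf").write_text(config)

        def terraform(*args):
            return subprocess.run(
                ["terraform", *args], cwd=workdir, check=True,
                capture_output=True, text=True,
            ).stdout

        terraform("init", "-input=false")
        terraform("plan", "-input=false", "-out=tfplan")
        plan = json.loads(terraform("show", "-json", "tfplan"))
        return {rc["address"]: rc["change"]["after"] for rc in plan["resource_changes"]}


    class TestVpcModule:
        def test_vpc_creates_expected_subnets(self, tmp_path):
            resources = plan_vpc_module(tmp_path, """
              cidr_block         = "10.0.0.0/16"
              availability_zones = ["us-east-1a", "us-east-1b"]
              environment        = "test"
            """)

            # Resources inside a module are addressed as module.<name>.<type>.<name>
            assert "module.vpc.aws_vpc.main" in resources
            assert any(a.startswith("module.vpc.aws_subnet.public") for a in resources)
            assert resources["module.vpc.aws_vpc.main"]["cidr_block"] == "10.0.0.0/16"

        def test_vpc_tags_include_environment(self, tmp_path):
            resources = plan_vpc_module(tmp_path, """
              environment = "production"
            """)

            tags = resources["module.vpc.aws_vpc.main"]["tags"]
            assert tags["Environment"] == "production"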
Policy-as-code tools like Open Policy Agent (OPA), Checkov, or Sentinel validate infrastructure against organizational policies. These catch compliance violations before deployment: ensuring encryption is enabled, required tags are present, or prohibited resource types aren't used.
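    # OPA policy for Terraform plans (input is the output of `terraform show -json`)
    package terraform

    # True when at least one root block device of the instance is encrypted.
    # Keeping the iteration in a helper avoids negating an expression that
    # contains a wildcard.
    has_encrypted_root_volume(resource) {
        resource.change.after.root_block_device[_].encrypted
    }

    # Deny EC2 instances without an encrypted root volume
    deny[msg] {
        resource := input.resource_changes[_]
        resource.type == "aws_instance"
        resource.change.actions[_] != "delete"  # skip resources that are only being destroyed
        not has_encrypted_root_volume(resource)
        msg := sprintf("EC2 instance %s must have encrypted root volume", [resource.address])
    }

    # Require specific tags on all resources that expose a tags attribute
    required_tags := {"Environment", "Team", "CostCenter"}

    deny[msg] {
        resource := input.resource_changes[_]
        tags := resource.change.after.tags
        missing := required_tags - {tag | tags[tag]}
        count(missing) > 0
        msg := sprintf("Resource %s missing required tags: %v", [resource.address, missing])
    }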
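The required-tags rule can also be expressed as a plain Python check over the same plan JSON, which is handy for experimenting with a policy before committing it to Rego. This is a minimal sketch, not a replacement for OPA: the function name `missing_tag_violations` is hypothetical, and it assumes that resource types without a `tags` attribute should be skipped, mirroring the Rego rule above.

```python
REQUIRED_TAGS = {"Environment", "Team", "CostCenter"}


def missing_tag_violations(plan, required=REQUIRED_TAGS):
    """Return one message per planned resource that lacks a required tag."""
    violations = []
    for rc in plan.get("resource_changes", []):
        # "after" is None for resources that are being destroyed
        after = rc.get("change", {}).get("after") or {}
        if "tags" not in after:
            continue  # assumption: this resource type has no tags attribute
        tags = after["tags"] or {}  # unset tags appear as null in plan JSON
        missing = sorted(required - set(tags))
        if missing:
            violations.append(
                f"Resource {rc['address']} missing required tags: {missing}"
            )
    return violations


if __name__ == "__main__":
    # Hand-written stand-in for the output of `terraform show -json`
    sample_plan = {
        "resource_changes": [
            {"address": "aws_s3_bucket.logs",
             "change": {"after": {"tags": {"Environment": "test"}}}},
        ]
    }
    for message in missing_tag_violations(sample_plan):
        print(message)
```

Keeping the check a pure function over a dictionary means it can be unit tested with hand-written plan fragments, without running Terraform at all.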
Integration Testing
Integration tests create real infrastructure in an isolated environment and validate it works correctly. This catches issues that unit tests miss: IAM permission problems, networking configuration errors, service interactions, and cloud provider behavior differences.
The cost is real resources. Integration tests are slower and incur cloud charges. Balance coverage against cost by testing critical paths thoroughly while using lighter validation for simpler components.
    // Terratest for integration testing
    package test

    import (
        "testing"
        "time"

        "github.com/gruntwork-io/terratest/modules/aws"
        http_helper "github.com/gruntwork-io/terratest/modules/http-helper"
        "github.com/gruntwork-io/terratest/modules/terraform"
        "github.com/stretchr/testify/assert"
    )

    func TestVpcModule(t *testing.T) {
        t.Parallel()

        terraformOptions := &terraform.Options{
            TerraformDir: "../modules/vpc",
            Vars: map[string]interface{}{
                "environment": "test",
                "cidr_block":  "10.99.0.0/16",
            },
        }

        // Clean up resources when the test completes, even if it fails
        defer terraform.Destroy(t, terraformOptions)

        // Create the infrastructure
        terraform.InitAndApply(t, terraformOptions)

        // Validate outputs
        vpcId := terraform.Output(t, terraformOptions, "vpc_id")
        assert.NotEmpty(t, vpcId)

        publicSubnetIds := terraform.OutputList(t, terraformOptions, "public_subnet_ids")
        assert.Len(t, publicSubnetIds, 2)

        // Validate the actual AWS resources by querying the VPC back from the API
        vpc := aws.GetVpcById(t, vpcId, "us-east-1")
        assert.Equal(t, vpcId, vpc.Id)
        assert.GreaterOrEqual(t, len(vpc.Subnets), len(publicSubnetIds))
    }

    func TestWebApplicationStack(t *testing.T) {
        t.Parallel()

        terraformOptions := &terraform.Options{
            TerraformDir: "../stacks/web-app",
            Vars: map[string]interface{}{
                "environment": "test",
            },
        }

        defer terraform.Destroy(t, terraformOptions)
        terraform.InitAndApply(t, terraformOptions)

        // Get the load balancer URL
        albUrl := terraform.Output(t, terraformOptions, "alb_url")

        // Validate the application responds: expect HTTP 200 with body "OK",
        // retrying up to 30 times with 5 seconds between attempts
        http_helper.HttpGetWithRetry(
            t,
            albUrl,
            nil, // no custom TLS configuration
            200,
            "OK",
            30,
            5*time.Second,
        )
    }
Contract Testing
Contract tests validate that infrastructure meets the expectations of the applications that will run on it. The application defines what it needs: specific open ports, IAM permissions, and available environment variables. Tests then verify that the infrastructure provides them.
    # Application defines its infrastructure contract
    class InfrastructureContract:
        required_environment_variables = [
            "DATABASE_URL",
            "REDIS_URL",
            "AWS_REGION",
        ]
        required_ports = {
            "database": 5432,
            "redis": 6379,
            "https": 443,
        }
        required_iam_actions = [
            "s3:GetObject",
            "s3:PutObject",
            "sqs:SendMessage",
            "sqs:ReceiveMessage",
        ]


    # Test infrastructure against contract.
    # `deployed_infrastructure` is a pytest fixture exposing the deployed resources;
    # `extract_actions` and `action_matches` are helper functions not shown here.
    def test_infrastructure_meets_contract(deployed_infrastructure):
        contract = InfrastructureContract()

        # Check environment variables
        task_def = deployed_infrastructure.ecs_task_definition
        container = task_def["containerDefinitions"][0]
        env_vars = {e["name"] for e in container.get("environment", [])}
        for required_var in contract.required_environment_variables:
            assert required_var in env_vars, f"Missing environment variable: {required_var}"

        # Check security group allows required ports
        sg_rules = deployed_infrastructure.security_group_rules
        for name, port in contract.required_ports.items():
            assert any(
                rule["from_port"] <= port <= rule["to_port"]
                for rule in sg_rules
            ), f"Port {port} ({name}) not allowed by security group"

        # Check IAM permissions
        iam_policy = deployed_infrastructure.task_role_policy
        allowed_actions = extract_actions(iam_policy)
        for required_action in contract.required_iam_actions:
            assert action_matches(required_action, allowed_actions), \
                f"Missing IAM permission: {required_action}"
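The contract test calls two helpers, `extract_actions` and `action_matches`, without defining them. One possible implementation is sketched below, assuming the policy is a standard IAM policy document already parsed into a dictionary. It is deliberately simplified: it only looks at `Allow` statements and ignores `Deny`, `NotAction`, `Resource`, and `Condition`, so a match means the action is granted somewhere, not that it is effective for every resource.

```python
from fnmatch import fnmatchcase


def extract_actions(policy_document):
    """Collect the Action patterns from every Allow statement in an IAM policy."""
    statements = policy_document.get("Statement", [])
    if isinstance(statements, dict):  # IAM permits a single statement object
        statements = [statements]

    actions = set()
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        action = statement.get("Action", [])
        # Action may be a single string or a list of strings
        actions.update([action] if isinstance(action, str) else action)
    return actions


def action_matches(required_action, allowed_actions):
    """Return True if any allowed pattern, wildcards included, covers the action."""
    # IAM action names are case-insensitive and support * and ? wildcards
    required = required_action.lower()
    return any(fnmatchcase(required, pattern.lower()) for pattern in allowed_actions)


if __name__ == "__main__":
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": ["s3:Get*", "s3:PutObject"], "Resource": "*"},
            {"Effect": "Allow", "Action": "sqs:*", "Resource": "*"},
        ],
    }
    allowed = extract_actions(policy)
    print(action_matches("s3:GetObject", allowed))
    print(action_matches("s3:DeleteObject", allowed))
```

Wildcard handling matters here because task role policies often grant `s3:Get*` or `sqs:*` rather than listing each action, and an exact string comparison would report those as missing.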
Ephemeral Environments
Ephemeral environments spin up isolated copies of infrastructure for testing, then tear them down when complete. This enables safe testing without affecting shared environments.
Namespace isolation keeps test resources separate. Unique naming prefixes, separate AWS accounts, or Kubernetes namespaces prevent test resources from colliding with each other or with production.
    # CI pipeline with ephemeral infrastructure testing
    name: Infrastructure Tests

    on:
      pull_request:
        paths:
          - 'terraform/**'

    jobs:
      test:
        runs-on: ubuntu-latest
        env:
          # Each pull request gets its own uniquely named environment
          TF_VAR_environment: "pr-${{ github.event.pull_request.number }}"
        steps:
          - uses: actions/checkout@v4

          - name: Setup Terraform
            uses: hashicorp/setup-terraform@v3
            with:
              # Disable the wrapper so redirected output contains only Terraform's JSON
              terraform_wrapper: false

          - name: Terraform Init
            run: terraform init
            working-directory: terraform/

          - name: Terraform Plan
            run: terraform plan -out=tfplan
            working-directory: terraform/

          - name: Policy Validation
            # Runs in terraform/ so tfplan is found; assumes conftest is installed
            # on the runner and the policies use `package terraform`
            run: |
              terraform show -json tfplan > plan.json
              conftest test plan.json -p ../policies/ --namespace terraform
            working-directory: terraform/

          - name: Create Test Environment
            run: terraform apply -auto-approve tfplan
            working-directory: terraform/

          - name: Run Integration Tests
            run: |
              go test -v ./tests/integration/...

          - name: Destroy Test Environment
            # Always tear down, even when earlier steps fail
            if: always()
            run: terraform destroy -auto-approve
            working-directory: terraform/
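Unique naming prefixes are just as useful for test runs started outside CI, for example from a developer laptop. The helper below is a small sketch of one way to build them; the function name `ephemeral_name`, the 32-character default limit, and the lowercase-and-hyphens format are assumptions chosen to fit common cloud naming rules, not requirements of any particular provider.

```python
import re
import uuid


def _slug(text):
    """Lowercase text and collapse anything outside a-z, 0-9 into single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", str(text).lower()).strip("-")


def ephemeral_name(base, run_id=None, max_length=32):
    """Build a resource prefix that is unique to one test run.

    run_id can be a CI value such as a pull request number; when omitted,
    a random 8-character suffix is generated.
    """
    suffix = _slug(run_id) if run_id is not None else uuid.uuid4().hex[:8]
    room = max_length - len(suffix) - 1  # space left for the base and a hyphen
    if room < 1:
        raise ValueError("run_id is too long for max_length")
    # Truncate the base, never the suffix, so uniqueness is preserved
    prefix = _slug(base)[:room].rstrip("-")
    return f"{prefix}-{suffix}"


if __name__ == "__main__":
    print(ephemeral_name("Web App", run_id="PR-123"))
    print(ephemeral_name("vpc"))
```

The result can be passed to Terraform as a variable, for instance through the `TF_VAR_environment` environment variable used in the workflow above.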
Chaos and Resilience Testing
Infrastructure should be resilient to failures. Chaos testing validates that your infrastructure handles component failures gracefully. Kill an availability zone, terminate instances, saturate network links, and verify the system continues operating.
    # Chaos test for multi-AZ infrastructure.
    # `infrastructure` is a pytest fixture wrapping the deployed environment.
    import time

    import requests


    def test_survives_az_failure(infrastructure):
        """Verify application survives loss of one availability zone."""
        # Get initial state
        initial_response = requests.get(infrastructure.load_balancer_url, timeout=10)
        assert initial_response.status_code == 200

        # Simulate AZ failure by stopping all instances in one AZ
        az_to_fail = infrastructure.availability_zones[0]
        failed_instances = infrastructure.stop_instances_in_az(az_to_fail)

        try:
            # Wait for health checks to mark instances unhealthy
            time.sleep(60)

            # Verify application still responds
            response = requests.get(infrastructure.load_balancer_url, timeout=10)
            assert response.status_code == 200

            # Verify auto-scaling launched replacement instances
            new_instances = infrastructure.get_running_instances()
            assert len(new_instances) >= infrastructure.min_instances
        finally:
            # Restore failed instances, even if an assertion above failed
            infrastructure.start_instances(failed_instances)
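The fixed 60-second sleep in the chaos test is a guess: too short and the test fails spuriously, too long and every run wastes time. A polling helper is a common alternative. The sketch below is one way to write it; `wait_until` and its parameters are hypothetical names, and the `clock` and `sleep` arguments exist only so the helper can be tested without real delays.

```python
import time


def wait_until(check, timeout=300.0, interval=5.0,
               clock=time.monotonic, sleep=time.sleep):
    """Call check() until it returns a truthy value or timeout seconds elapse.

    Returns the truthy value from check(); raises TimeoutError otherwise.
    """
    deadline = clock() + timeout
    while True:
        result = check()
        if result:
            return result
        if clock() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        sleep(interval)


if __name__ == "__main__":
    attempts = []

    def ready():
        attempts.append(1)
        return len(attempts) >= 3  # becomes true on the third poll

    wait_until(ready, timeout=5, interval=0.1)
    print(f"condition met after {len(attempts)} checks")
```

In the chaos test, the replacement-capacity assertion could then become `wait_until(lambda: len(infrastructure.get_running_instances()) >= infrastructure.min_instances)`, which passes as soon as auto-scaling catches up and fails with a clear timeout if it never does.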
Continuous Validation
Production infrastructure should be continuously validated, not just at deployment time. Configuration drift, manual changes, and cloud provider behavior changes can invalidate assumptions.
    # Continuous infrastructure validation
    import logging

    import boto3

    logger = logging.getLogger(__name__)


    class InfrastructureValidator:
        def __init__(self, region_name=None):
            # Clients use the default credential chain; region falls back to the environment
            self.ec2 = boto3.client("ec2", region_name=region_name)
            self.rds = boto3.client("rds", region_name=region_name)

        def alert(self, message):
            """Report a finding; replace with a paging or chat integration."""
            logger.warning(message)

        def validate_security_groups(self):
            """Verify no security groups allow unrestricted access."""
            for sg in self.ec2.describe_security_groups()["SecurityGroups"]:
                for rule in sg["IpPermissions"]:
                    for ip_range in rule.get("IpRanges", []):
                        if ip_range.get("CidrIp") == "0.0.0.0/0":
                            # Only HTTP and HTTPS may be open to the world
                            if rule.get("FromPort") not in [80, 443]:
                                self.alert(
                                    f"Security group {sg['GroupId']} allows "
                                    f"unrestricted access on port {rule.get('FromPort')}"
                                )

        def validate_encryption(self):
            """Verify all EBS volumes are encrypted."""
            for volume in self.ec2.describe_volumes()["Volumes"]:
                if not volume["Encrypted"]:
                    self.alert(f"Volume {volume['VolumeId']} is not encrypted")

        def validate_backups(self):
            """Verify backup retention policies are active."""
            for db in self.rds.describe_db_instances()["DBInstances"]:
                if db["BackupRetentionPeriod"] < 7:
                    self.alert(
                        f"Database {db['DBInstanceIdentifier']} has "
                        f"backup retention of only {db['BackupRetentionPeriod']} days"
                    )
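A stricter variant of the security group check would also consider port ranges, IPv6 sources, and rules that cover all protocols. The sketch below expresses that as a pure function over one entry of the `IpPermissions` list returned by `describe_security_groups`; the function names and the allowed-port set are assumptions for illustration. Note that it also flags world-open ICMP rules, since their type codes never fall in the allowed set.

```python
ALLOWED_PUBLIC_PORTS = frozenset({80, 443})


def is_open_to_world(permission):
    """True if the rule admits traffic from any IPv4 or IPv6 address."""
    ipv4_open = any(
        r.get("CidrIp") == "0.0.0.0/0" for r in permission.get("IpRanges", [])
    )
    ipv6_open = any(
        r.get("CidrIpv6") == "::/0" for r in permission.get("Ipv6Ranges", [])
    )
    return ipv4_open or ipv6_open


def violates_public_port_policy(permission, allowed=ALLOWED_PUBLIC_PORTS):
    """True if a world-open rule exposes any port outside the allowed set."""
    if not is_open_to_world(permission):
        return False
    from_port = permission.get("FromPort")
    to_port = permission.get("ToPort")
    # Protocol "-1" means all protocols; such rules carry no port range
    if permission.get("IpProtocol") == "-1" or from_port is None or to_port is None:
        return True
    # Check every port in the range, not just the first one
    return any(port not in allowed for port in range(from_port, to_port + 1))


if __name__ == "__main__":
    rule = {
        "IpProtocol": "tcp",
        "FromPort": 80,
        "ToPort": 8080,
        "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
    }
    print(violates_public_port_policy(rule))
```

Because the function takes a plain dictionary, it can be covered by fast unit tests and then called from `validate_security_groups` for each rule.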
Conclusion
Infrastructure testing prevents outages caused by misconfiguration. Unit tests validate infrastructure code logic. Policy tests enforce organizational standards. Integration tests verify real resource behavior. Contract tests ensure infrastructure meets application requirements. Chaos tests validate resilience.
The testing strategy should match your risk tolerance and change velocity. Critical infrastructure serving production traffic deserves comprehensive testing. Development utilities might need less rigor. But some testing is always better than none, and infrastructure failures are often more impactful than application bugs.