Infrastructure Testing Strategies

Reverend Philip Jan 14, 2026 6 min read

Test your infrastructure code before deployment. Learn Terratest, InSpec, and policy testing with OPA.

Infrastructure testing validates that your infrastructure code works correctly before deployment. As infrastructure-as-code becomes standard practice, it is essential to treat infrastructure with the same rigor as application code, testing included. Untested infrastructure changes have caused countless outages, from misconfigured security groups to broken networking to resource limits that only manifest at scale.

Testing infrastructure differs from testing application code. You're testing declarations of desired state, not algorithms. The "execution environment" is a cloud provider API, not a local runtime. Tests are slower and potentially more expensive. These differences require adapted testing strategies.

Unit Testing Infrastructure Code

Unit tests validate individual infrastructure components in isolation. For Terraform, this means testing modules. For CloudFormation, testing nested stacks. For Pulumi or CDK, testing constructs or components.

Unit tests verify that given certain inputs, your infrastructure code produces expected outputs. They don't actually create resources; they validate the configuration that would be created.

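# Testing Terraform plans with pytest and the terraform CLI.
# Planning creates no resources, but `terraform plan` still needs the terraform
# binary on PATH and provider settings (here, AWS region and credentials).
import json
import subprocess
from pathlib import Path

import pytest

# Assumed layout: this file sits in tests/, next to a modules/vpc directory.
VPC_MODULE = (Path(__file__).resolve().parent.parent / "modules" / "vpc").as_posix()


class TestVpcModule:
    @pytest.fixture
    def plan(self, tmp_path):
        """Return a function that plans a configuration and maps address -> planned values."""
        def _plan(config):
            # Write the configuration before init so the module gets installed
            (tmp_path / "main.tf").write_text(config)
            for args in (["init", "-input=false"], ["plan", "-input=false", "-out=tfplan"]):
                subprocess.run(["terraform", *args], cwd=tmp_path, check=True)
            shown = subprocess.run(
                ["terraform", "show", "-json", "tfplan"],
                cwd=tmp_path, check=True, capture_output=True, text=True,
            )
            changes = json.loads(shown.stdout)["resource_changes"]
            return {rc["address"]: rc["change"]["after"] for rc in changes}
        return _plan

    def test_vpc_creates_expected_subnets(self, plan):
        resources = plan(f"""
        module "vpc" {{
          source = "{VPC_MODULE}"

          cidr_block = "10.0.0.0/16"
          availability_zones = ["us-east-1a", "us-east-1b"]
          environment = "test"
        }}
        """)

        # Resources inside a module are addressed as module.<name>.<type>.<name>
        assert "module.vpc.aws_vpc.main" in resources
        assert any(a.startswith("module.vpc.aws_subnet.public") for a in resources)
        assert resources["module.vpc.aws_vpc.main"]["cidr_block"] == "10.0.0.0/16"

    def test_vpc_tags_include_environment(self, plan):
        # Assumes the module defines defaults for the variables not set here
        resources = plan(f"""
        module "vpc" {{
          source = "{VPC_MODULE}"
          environment = "production"
        }}
        """)

        tags = resources["module.vpc.aws_vpc.main"]["tags"]
        assert tags["Environment"] == "production"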

Policy-as-code tools like Open Policy Agent (OPA), Checkov, or Sentinel validate infrastructure against organizational policies. These catch compliance violations before deployment: ensuring encryption is enabled, required tags are present, or prohibited resource types aren't used.

# OPA policy for Terraform plans (JSON from `terraform show -json`)
package terraform

import rego.v1

# Deny EC2 instances without an encrypted root volume
deny contains msg if {
    some resource in input.resource_changes
    resource.type == "aws_instance"
    not root_volume_encrypted(resource)

    msg := sprintf("EC2 instance %s must have encrypted root volume", [resource.address])
}

# True when at least one root_block_device block sets encrypted = true
root_volume_encrypted(resource) if {
    some device in resource.change.after.root_block_device
    device.encrypted == true
}

# Require specific tags on every resource that has a tags attribute
required_tags := {"Environment", "Team", "CostCenter"}

deny contains msg if {
    some resource in input.resource_changes
    tags := resource.change.after.tags
    missing := required_tags - {tag | tags[tag]}
    count(missing) > 0

    msg := sprintf("Resource %s missing required tags: %v", [resource.address, missing])
}
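
To make the shape of the policy input concrete, the sketch below re-implements the required-tags rule above in plain Python and runs it on a hand-written plan fragment. It is an illustration only, not a replacement for OPA: the sample resources are invented, and the function name `missing_tag_violations` is hypothetical. Only the `resource_changes`, `address`, and `change.after` fields come from Terraform's JSON plan output.

```python
# Required tags, mirroring the policy rule above.
REQUIRED_TAGS = {"Environment", "Team", "CostCenter"}


def missing_tag_violations(plan):
    """Return one message per planned resource that lacks a required tag."""
    violations = []
    for resource in plan.get("resource_changes", []):
        after = resource["change"].get("after") or {}
        if "tags" not in after:
            # No tags attribute (or the resource is being deleted): skip it.
            continue
        tags = after["tags"] or {}
        missing = REQUIRED_TAGS - set(tags)
        if missing:
            violations.append(
                f"Resource {resource['address']} missing required tags: {sorted(missing)}"
            )
    return violations


# Hypothetical plan fragment, trimmed to the fields the check reads.
sample_plan = {
    "resource_changes": [
        {
            "address": "aws_vpc.main",
            "type": "aws_vpc",
            "change": {"after": {"tags": {"Environment": "test", "Team": "platform"}}},
        },
        {
            "address": "aws_s3_bucket.logs",
            "type": "aws_s3_bucket",
            "change": {
                "after": {
                    "tags": {"Environment": "test", "Team": "platform", "CostCenter": "42"}
                }
            },
        },
    ]
}

for message in missing_tag_violations(sample_plan):
    print(message)
```

Only the VPC is reported, because it lacks the CostCenter tag.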

Integration Testing

Integration tests create real infrastructure in an isolated environment and validate it works correctly. This catches issues that unit tests miss: IAM permission problems, networking configuration errors, service interactions, and cloud provider behavior differences.

The cost is real resources. Integration tests are slower and incur cloud charges. Balance coverage against cost by testing critical paths thoroughly while using lighter validation for simpler components.

// Terratest for integration testing
package test

import (
    "testing"
    "time"

    "github.com/gruntwork-io/terratest/modules/aws"
    http_helper "github.com/gruntwork-io/terratest/modules/http-helper"
    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/stretchr/testify/assert"
)

func TestVpcModule(t *testing.T) {
    t.Parallel()

    terraformOptions := &terraform.Options{
        TerraformDir: "../modules/vpc",
        Vars: map[string]interface{}{
            "environment": "test",
            "cidr_block":  "10.99.0.0/16",
        },
    }

    // Clean up resources when test completes
    defer terraform.Destroy(t, terraformOptions)

    // Create the infrastructure
    terraform.InitAndApply(t, terraformOptions)

    // Validate outputs
    vpcId := terraform.Output(t, terraformOptions, "vpc_id")
    assert.NotEmpty(t, vpcId)

    publicSubnetIds := terraform.OutputList(t, terraformOptions, "public_subnet_ids")
    assert.Len(t, publicSubnetIds, 2)

    // Validate actual AWS resources (assumes the module deploys to us-east-1)
    vpc := aws.GetVpcById(t, vpcId, "us-east-1")
    assert.Equal(t, vpcId, vpc.Id)
    assert.NotEmpty(t, vpc.Subnets)
}

func TestWebApplicationStack(t *testing.T) {
    t.Parallel()

    terraformOptions := &terraform.Options{
        TerraformDir: "../stacks/web-app",
        Vars: map[string]interface{}{
            "environment": "test",
        },
    }

    defer terraform.Destroy(t, terraformOptions)
    terraform.InitAndApply(t, terraformOptions)

    // Get the load balancer URL
    albUrl := terraform.Output(t, terraformOptions, "alb_url")

    // Validate the application responds
    http_helper.HttpGetWithRetry(
        t,
        albUrl,
        nil,
        200,
        "OK",
        30,
        5*time.Second,
    )
}

Contract Testing

Contract tests validate that infrastructure meets the expectations of applications that will run on it. The application defines what it needs: specific open ports, IAM permissions, and available environment variables. Tests verify that the infrastructure provides them.

# Application defines its infrastructure contract
class InfrastructureContract:
    required_environment_variables = [
        "DATABASE_URL",
        "REDIS_URL",
        "AWS_REGION",
    ]

    required_ports = {
        "database": 5432,
        "redis": 6379,
        "https": 443,
    }

    required_iam_actions = [
        "s3:GetObject",
        "s3:PutObject",
        "sqs:SendMessage",
        "sqs:ReceiveMessage",
    ]

# Test infrastructure against contract
def test_infrastructure_meets_contract(deployed_infrastructure):
    contract = InfrastructureContract()

    # Check environment variables
    task_def = deployed_infrastructure.ecs_task_definition
    env_vars = {e["name"] for e in task_def["containerDefinitions"][0]["environment"]}

    for required_var in contract.required_environment_variables:
        assert required_var in env_vars, f"Missing environment variable: {required_var}"

    # Check security group allows required ports
    sg_rules = deployed_infrastructure.security_group_rules
    for name, port in contract.required_ports.items():
        assert any(
            rule["from_port"] <= port <= rule["to_port"]
            for rule in sg_rules
        ), f"Port {port} ({name}) not allowed by security group"

    # Check IAM permissions
    iam_policy = deployed_infrastructure.task_role_policy
    allowed_actions = extract_actions(iam_policy)

    for required_action in contract.required_iam_actions:
        assert action_matches(required_action, allowed_actions), \
            f"Missing IAM permission: {required_action}"
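
The test above calls `extract_actions` and `action_matches` without defining them. One possible sketch follows, assuming `task_role_policy` is a standard IAM policy document parsed into a dict. It is deliberately simplified: it reads only `Allow` statements and ignores `Deny`, `NotAction`, `Resource`, and `Condition`, so a real contract check would need more than this.

```python
from fnmatch import fnmatchcase


def extract_actions(policy_document):
    """Collect the Action patterns from every Allow statement in an IAM policy document."""
    statements = policy_document.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may appear without a list
        statements = [statements]

    actions = set()
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        statement_actions = statement.get("Action", [])
        if isinstance(statement_actions, str):
            statement_actions = [statement_actions]
        actions.update(statement_actions)
    return actions


def action_matches(required_action, allowed_actions):
    """Return True if any allowed pattern (wildcards permitted) covers required_action."""
    required = required_action.lower()  # IAM action names are case-insensitive
    return any(fnmatchcase(required, pattern.lower()) for pattern in allowed_actions)


# Hypothetical policy used only to exercise the helpers.
example_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["s3:GetObject", "s3:PutObject"], "Resource": "*"},
        {"Effect": "Allow", "Action": "sqs:*", "Resource": "*"},
        {"Effect": "Deny", "Action": "dynamodb:*", "Resource": "*"},
    ],
}

allowed = extract_actions(example_policy)
print(action_matches("sqs:SendMessage", allowed))   # covered by the sqs:* wildcard
print(action_matches("dynamodb:GetItem", allowed))  # only appears in a Deny statement
```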

Ephemeral Environments

Ephemeral environments spin up isolated copies of infrastructure for testing, then tear them down when complete. This enables safe testing without affecting shared environments.

Namespace isolation keeps test resources separate. Unique naming prefixes, separate AWS accounts, or Kubernetes namespaces prevent test resources from colliding with each other or with production.
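
As a sketch of the unique-prefix approach, the helper below builds a short name prefix from a pull request identifier plus a random suffix, so parallel runs for the same pull request do not collide. The function name `ephemeral_prefix` and the 24-character default limit are assumptions for illustration; real limits depend on the resource types you name.

```python
import re
import uuid


def ephemeral_prefix(pr_identifier, max_length=24):
    """Build a short, unique, DNS-safe prefix for one test run of a pull request."""
    # Keep only lowercase letters, digits, and hyphens.
    base = re.sub(r"[^a-z0-9-]+", "-", f"pr-{pr_identifier}".lower()).strip("-")
    # The random part keeps parallel runs of the same pull request apart.
    suffix = uuid.uuid4().hex[:6]
    # Truncate the readable part, never the random suffix.
    base = base[: max_length - len(suffix) - 1].rstrip("-")
    return f"{base}-{suffix}"


print(ephemeral_prefix(1234))
```

The result can be passed to Terraform as a variable and used in resource names and tags, which also makes leftover test resources easy to find.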

# CI pipeline with ephemeral infrastructure testing
name: Infrastructure Tests

on:
  pull_request:
    paths:
      - 'terraform/**'

jobs:
  test:
    runs-on: ubuntu-latest
    env:
      TF_VAR_environment: "pr-${{ github.event.pull_request.number }}"

    steps:
      - uses: actions/checkout@v4

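      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          # The default wrapper script can corrupt redirected `terraform show -json` output
          terraform_wrapper: false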

      - name: Terraform Init
        run: terraform init
        working-directory: terraform/

      - name: Terraform Plan
        run: terraform plan -out=tfplan
        working-directory: terraform/

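      - name: Policy Validation
        # Run where tfplan was written; conftest must be installed on the runner.
        # --namespace matches the `package terraform` declaration in the policy.
        run: |
          terraform show -json tfplan > plan.json
          conftest test plan.json -p ../policies/ --namespace terraform
        working-directory: terraform/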

      - name: Create Test Environment
        run: terraform apply -auto-approve tfplan
        working-directory: terraform/

      - name: Run Integration Tests
        run: |
          go test -v ./tests/integration/...

      - name: Destroy Test Environment
        if: always()
        run: terraform destroy -auto-approve
        working-directory: terraform/

Chaos and Resilience Testing

Infrastructure should be resilient to failures. Chaos testing validates that your infrastructure handles component failures gracefully. Kill an availability zone, terminate instances, saturate network links, and verify the system continues operating.

# Chaos test for multi-AZ infrastructure.
# `infrastructure` is a project-specific pytest fixture wrapping the cloud API.
import time

import requests


def test_survives_az_failure(infrastructure):
    """Verify application survives loss of one availability zone."""

    # Get initial state
    initial_response = requests.get(infrastructure.load_balancer_url, timeout=10)
    assert initial_response.status_code == 200

    # Simulate AZ failure by stopping all instances in one AZ
    az_to_fail = infrastructure.availability_zones[0]
    failed_instances = infrastructure.stop_instances_in_az(az_to_fail)

    try:
        # Wait for health checks to mark instances unhealthy
        time.sleep(60)

        # Verify application still responds
        response = requests.get(infrastructure.load_balancer_url, timeout=10)
        assert response.status_code == 200

        # Verify auto-scaling launched replacement instances
        new_instances = infrastructure.get_running_instances()
        assert len(new_instances) >= infrastructure.min_instances

    finally:
        # Restore failed instances
        infrastructure.start_instances(failed_instances)
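
The fixed 60-second sleep in the test above is either too long or too short, depending on how quickly health checks and auto-scaling react. A polling helper is one alternative. The sketch below is a minimal version; the name `wait_until` and its parameters are illustrative and do not come from any particular library.

```python
import time


def wait_until(condition, timeout=300.0, interval=5.0, sleep=time.sleep, clock=time.monotonic):
    """Poll condition() until it returns a truthy value or timeout seconds elapse."""
    deadline = clock() + timeout
    while True:
        result = condition()
        if result:
            return result
        if clock() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        sleep(interval)


# In the chaos test, the sleep could become (same hypothetical fixture):
# wait_until(
#     lambda: len(infrastructure.get_running_instances()) >= infrastructure.min_instances
# )
```

Injecting `sleep` and `clock` keeps the helper testable without real waiting.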

Continuous Validation

Production infrastructure should be continuously validated, not just at deployment time. Configuration drift, manual changes, and cloud provider behavior changes can invalidate assumptions.

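# Continuous infrastructure validation
import logging

import boto3

logger = logging.getLogger(__name__)

# Port ranges that may be open to the internet, as (FromPort, ToPort)
PUBLIC_PORT_RANGES = {(80, 80), (443, 443)}


class InfrastructureValidator:
    def __init__(self, region_name=None):
        # The describe_* calls below may truncate results in large accounts;
        # boto3 paginators avoid that.
        self.ec2 = boto3.client("ec2", region_name=region_name)
        self.rds = boto3.client("rds", region_name=region_name)

    def alert(self, message):
        """Report a finding. Logging is a placeholder for a real alerting hook."""
        logger.warning(message)

    def validate_security_groups(self):
        """Verify no security groups allow unrestricted access."""
        for sg in self.ec2.describe_security_groups()["SecurityGroups"]:
            for rule in sg["IpPermissions"]:
                cidrs = [r.get("CidrIp") for r in rule.get("IpRanges", [])]
                cidrs += [r.get("CidrIpv6") for r in rule.get("Ipv6Ranges", [])]
                open_to_world = "0.0.0.0/0" in cidrs or "::/0" in cidrs
                # Compare the whole range: FromPort alone misses rules like 80-65535
                ports = (rule.get("FromPort"), rule.get("ToPort"))
                if open_to_world and ports not in PUBLIC_PORT_RANGES:
                    self.alert(
                        f"Security group {sg['GroupId']} allows "
                        f"unrestricted access on ports {ports[0]}-{ports[1]}"
                    )

    def validate_encryption(self):
        """Verify all EBS volumes are encrypted."""
        for volume in self.ec2.describe_volumes()["Volumes"]:
            if not volume["Encrypted"]:
                self.alert(f"Volume {volume['VolumeId']} is not encrypted")

    def validate_backups(self):
        """Verify backup retention policies are active."""
        for db in self.rds.describe_db_instances()["DBInstances"]:
            if db["BackupRetentionPeriod"] < 7:
                self.alert(
                    f"Database {db['DBInstanceIdentifier']} has "
                    f"backup retention of only {db['BackupRetentionPeriod']} days"
                )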
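
Configuration drift can also be checked with Terraform itself. `terraform plan -detailed-exitcode` exits with 0 when there is nothing to change, 1 on error, and 2 when the plan contains changes. The sketch below wraps that call. The function name `detect_drift` and the injectable `run` parameter are assumptions for illustration. A non-empty plan also appears when configuration has changed but not yet been applied, so this is best run against the commit that was last deployed.

```python
import subprocess

# Exit codes documented for `terraform plan -detailed-exitcode`.
NO_CHANGES, PLAN_ERROR, CHANGES_PRESENT = 0, 1, 2


def detect_drift(workdir, run=subprocess.run):
    """Return True if live infrastructure differs from the configuration in workdir."""
    result = run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    if result.returncode == NO_CHANGES:
        return False
    if result.returncode == CHANGES_PRESENT:
        return True
    raise RuntimeError(f"terraform plan failed: {result.stderr.strip()}")
```

A scheduled job could call `detect_drift("terraform/")` and raise an alert when it returns True.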

Conclusion

Infrastructure testing prevents outages caused by misconfiguration. Unit tests validate infrastructure code logic. Policy tests enforce organizational standards. Integration tests verify real resource behavior. Contract tests ensure infrastructure meets application requirements. Chaos tests validate resilience.

The testing strategy should match your risk tolerance and change velocity. Critical infrastructure serving production traffic deserves comprehensive testing. Development utilities might need less rigor. But some testing is always better than none, and infrastructure failures are often more impactful than application bugs.
