Infrastructure Testing Strategies

Infrastructure testing validates that your infrastructure code works correctly before deployment. As infrastructure-as-code becomes standard practice, treating infrastructure with the same rigor as application code, including testing, becomes essential. Untested infrastructure changes are a common cause of outages, from misconfigured security groups to broken networking to resource limits that only manifest at scale.

Testing infrastructure differs from testing application code. You're testing declarations of desired state, not algorithms. The "execution environment" is a cloud provider API, not a local runtime. Tests are slower and potentially more expensive. These differences require adapted testing strategies.

Unit Testing Infrastructure Code

Unit tests validate individual infrastructure components in isolation. For Terraform, this means testing modules. For CloudFormation, testing nested stacks. For Pulumi or CDK, testing constructs or components.

Unit tests verify that given certain inputs, your infrastructure code produces expected outputs. They don't actually create resources; they validate the configuration that would be created.

The following example demonstrates how to unit test a Terraform VPC module using pytest. You'll define test configurations that exercise your module with specific inputs, then assert that the planned output matches your expectations without actually creating any cloud resources.

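# Testing a Terraform module with pytest by inspecting the JSON plan
import json
import subprocess
from pathlib import Path

import pytest

# Assumption: tests live in tests/ next to modules/; adjust to your layout
VPC_MODULE = (Path(__file__).resolve().parent.parent / "modules" / "vpc").as_posix()


def terraform(workdir, *args):
    """Run a terraform subcommand in workdir and return its stdout."""
    result = subprocess.run(
        ["terraform", *args],
        cwd=workdir, check=True, capture_output=True, text=True,
    )
    return result.stdout


@pytest.fixture
def plan_for(tmp_path):
    """Return a function that plans a configuration and maps each
    resource address to its planned attribute values."""
    def _plan(config):
        # Write the configuration first: init must see the module block
        (tmp_path / "main.tf").write_text(config)
        terraform(tmp_path, "init", "-input=false")
        terraform(tmp_path, "plan", "-input=false", "-out=tfplan")
        plan = json.loads(terraform(tmp_path, "show", "-json", "tfplan"))
        return {rc["address"]: rc["change"]["after"]
                for rc in plan.get("resource_changes", [])}
    return _plan


class TestVpcModule:
    def test_vpc_creates_expected_subnets(self, plan_for):
        resources = plan_for(f"""
        module "vpc" {{
          source = "{VPC_MODULE}"

          cidr_block = "10.0.0.0/16"
          availability_zones = ["us-east-1a", "us-east-1b"]
          environment = "test"
        }}
        """)

        # Resources inside a module are addressed as module.<name>.<type>.<name>
        assert "module.vpc.aws_vpc.main" in resources
        # Subnets created with count or for_each carry an index suffix
        assert any(a.startswith("module.vpc.aws_subnet.public") for a in resources)
        assert resources["module.vpc.aws_vpc.main"]["cidr_block"] == "10.0.0.0/16"

    def test_vpc_tags_include_environment(self, plan_for):
        # Relies on the module providing defaults for its other variables
        resources = plan_for(f"""
        module "vpc" {{
          source = "{VPC_MODULE}"
          environment = "production"
        }}
        """)

        tags = resources["module.vpc.aws_vpc.main"]["tags"]
        assert tags["Environment"] == "production"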

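Notice how the tests use Terraform's plan output rather than actually applying infrastructure. This keeps tests fast and avoids paying for real resources while still validating that your module produces the correct configuration. Keep in mind that planning typically still needs provider credentials and a configured region, because the provider is initialized and any data sources are read during the plan.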

Policy-as-code tools like Open Policy Agent (OPA), Checkov, or Sentinel validate infrastructure against organizational policies. These catch compliance violations before deployment: ensuring encryption is enabled, required tags are present, or prohibited resource types aren't used.

You can write OPA policies in the Rego language to enforce organizational standards across all your Terraform plans. The following example shows how to deny unencrypted EC2 instances and require specific tags on all resources.

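# OPA policy for Terraform plans
# Uses pre-1.0 Rego syntax; OPA 1.0 and later expect `deny contains msg if { ... }`
package terraform

# Deny EC2 instances without encryption
deny[msg] {
    resource := input.resource_changes[_]
    resource.type == "aws_instance"
    not root_volume_encrypted(resource)

    msg := sprintf("EC2 instance %s must have encrypted root volume", [resource.address])
}

# True when the planned instance declares an encrypted root block device.
# Negating this helper checks "no encrypted device exists", which also
# covers instances that declare no root_block_device at all.
root_volume_encrypted(resource) {
    resource.change.after.root_block_device[_].encrypted == true
}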

# Require specific tags on all resources
required_tags := {"Environment", "Team", "CostCenter"}

deny[msg] {
    resource := input.resource_changes[_]
    tags := resource.change.after.tags
    missing := required_tags - {tag | tags[tag]}
    count(missing) > 0

    msg := sprintf("Resource %s missing required tags: %v", [resource.address, missing])
}

These policies integrate into your CI/CD pipeline to block non-compliant infrastructure before it reaches production. The clear error messages help developers quickly understand and fix violations.
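To make the set-difference logic in the tag rule concrete, here is a rough Python equivalent that runs over the same plan JSON. It is only an illustrative sketch: the function name `missing_tag_violations` is made up for this example, and it is not a replacement for running the policy through OPA.

```python
REQUIRED_TAGS = {"Environment", "Team", "CostCenter"}


def missing_tag_violations(plan: dict) -> list[str]:
    """Return one message per planned resource that lacks a required tag."""
    violations = []
    for rc in plan.get("resource_changes", []):
        after = (rc.get("change") or {}).get("after") or {}
        if "tags" not in after:
            # Resource has no tags attribute; the Rego rule skips it too
            continue
        tags = after["tags"] or {}
        # Same idea as `required_tags - {tag | tags[tag]}` in the policy
        missing = REQUIRED_TAGS - set(tags)
        if missing:
            violations.append(
                f"Resource {rc['address']} missing required tags: {sorted(missing)}"
            )
    return violations
```

A function like this can be handy in a pytest suite for quick local feedback, while the Rego policy remains the enforced source of truth in CI.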

Integration Testing

Integration tests create real infrastructure in an isolated environment and validate it works correctly. This catches issues that unit tests miss: IAM permission problems, networking configuration errors, service interactions, and cloud provider behavior differences.

The cost is real resources. Integration tests are slower and incur cloud charges. Balance coverage against cost by testing critical paths thoroughly while using lighter validation for simpler components.

Terratest is a popular Go library for integration testing Terraform modules. The following example shows how to create real AWS resources, validate they work correctly, and clean up afterward. You'll want to run these tests in dedicated test accounts with appropriate budgets.

// Terratest for integration testing
package test

import (
    "testing"
    "time"

    "github.com/gruntwork-io/terratest/modules/aws"
    http_helper "github.com/gruntwork-io/terratest/modules/http-helper"
    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/stretchr/testify/assert"
    "github.com/stretchr/testify/require"
)

func TestVpcModule(t *testing.T) {
    t.Parallel()

    awsRegion := "us-east-1"

    terraformOptions := &terraform.Options{
        TerraformDir: "../modules/vpc",
        Vars: map[string]interface{}{
            "environment": "test",
            "cidr_block":  "10.99.0.0/16",
        },
    }

    // Clean up resources when test completes
    defer terraform.Destroy(t, terraformOptions)

    // Create the infrastructure
    terraform.InitAndApply(t, terraformOptions)

    // Validate outputs
    vpcId := terraform.Output(t, terraformOptions, "vpc_id")
    assert.NotEmpty(t, vpcId)

    publicSubnetIds := terraform.OutputList(t, terraformOptions, "public_subnet_ids")
    assert.Len(t, publicSubnetIds, 2)

    // Validate actual AWS resources.
    // CidrBlock is a *string on the Vpc struct in recent Terratest releases.
    vpc := aws.GetVpcById(t, vpcId, awsRegion)
    require.NotNil(t, vpc.CidrBlock)
    assert.Equal(t, "10.99.0.0/16", *vpc.CidrBlock)

    // Confirm the subnets exist in AWS, not just in the Terraform outputs
    subnets := aws.GetSubnetsForVpc(t, vpcId, awsRegion)
    assert.GreaterOrEqual(t, len(subnets), 2)
}

func TestWebApplicationStack(t *testing.T) {
    t.Parallel()

    terraformOptions := &terraform.Options{
        TerraformDir: "../stacks/web-app",
        Vars: map[string]interface{}{
            "environment": "test",
        },
    }

    defer terraform.Destroy(t, terraformOptions)
    terraform.InitAndApply(t, terraformOptions)

    // Get the load balancer URL
    albUrl := terraform.Output(t, terraformOptions, "alb_url")

    // Validate the application responds
    http_helper.HttpGetWithRetry(
        t,
        albUrl,
        nil,
        200,
        "OK",
        30,
        5*time.Second,
    )
}

The defer terraform.Destroy pattern ensures cleanup happens even if tests fail. Running tests in parallel with t.Parallel() reduces total test time when you have multiple independent modules to validate.

Contract Testing

Contract tests validate that infrastructure meets the expectations of applications that will run on it. The application defines what it needs: specific ports open, IAM permissions granted, environment variables available. Tests verify the infrastructure provides these.

This approach inverts the typical relationship between infrastructure and application testing. Instead of testing infrastructure in isolation, you test it against the actual requirements of its consumers. The following example shows how an application can declare its infrastructure contract and verify that deployed infrastructure satisfies it.

# Application defines its infrastructure contract
class InfrastructureContract:
    required_environment_variables = [
        "DATABASE_URL",
        "REDIS_URL",
        "AWS_REGION",
    ]

    required_ports = {
        "database": 5432,
        "redis": 6379,
        "https": 443,
    }

    required_iam_actions = [
        "s3:GetObject",
        "s3:PutObject",
        "sqs:SendMessage",
        "sqs:ReceiveMessage",
    ]

# Test infrastructure against contract
def test_infrastructure_meets_contract(deployed_infrastructure):
    contract = InfrastructureContract()

    # Check environment variables
    task_def = deployed_infrastructure.ecs_task_definition
    env_vars = {e["name"] for e in task_def["containerDefinitions"][0]["environment"]}

    for required_var in contract.required_environment_variables:
        assert required_var in env_vars, f"Missing environment variable: {required_var}"

    # Check security group allows required ports
    sg_rules = deployed_infrastructure.security_group_rules
    for name, port in contract.required_ports.items():
        assert any(
            rule["from_port"] <= port <= rule["to_port"]
            for rule in sg_rules
        ), f"Port {port} ({name}) not allowed by security group"

    # Check IAM permissions
    iam_policy = deployed_infrastructure.task_role_policy
    allowed_actions = extract_actions(iam_policy)

    for required_action in contract.required_iam_actions:
        assert action_matches(required_action, allowed_actions), \
            f"Missing IAM permission: {required_action}"

Contract tests catch a common failure mode: infrastructure changes that break application assumptions. When the contract is explicit and tested, you'll discover these issues in CI rather than in production.
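The contract test above calls `extract_actions` and `action_matches` without defining them. One possible sketch of those helpers follows, assuming the policy is a standard IAM policy document already parsed into a dict. It deliberately ignores `Deny` statements, `NotAction`, resource scoping, and conditions, so treat it as a starting point rather than a full policy evaluator.

```python
from fnmatch import fnmatchcase


def extract_actions(policy: dict) -> set[str]:
    """Collect the Action patterns from every Allow statement."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        # IAM permits a single statement that is not wrapped in a list
        statements = [statements]

    actions = set()
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        stmt_actions = stmt.get("Action", [])
        if isinstance(stmt_actions, str):
            stmt_actions = [stmt_actions]
        actions.update(stmt_actions)
    return actions


def action_matches(required: str, allowed: set[str]) -> bool:
    """True if any allowed pattern (which may use * or ?) covers the action."""
    # IAM action names are case-insensitive, so compare in lowercase
    required = required.lower()
    return any(fnmatchcase(required, pattern.lower()) for pattern in allowed)
```

With these in place, a policy that grants `s3:*` satisfies a contract requiring `s3:GetObject`. For a stricter check that accounts for denies and conditions, the IAM policy simulator is likely a better fit than pattern matching.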

Ephemeral Environments

Ephemeral environments spin up isolated copies of infrastructure for testing, then tear them down when complete. This enables safe testing without affecting shared environments.

Namespace isolation keeps test resources separate. Unique naming prefixes, separate AWS accounts, or Kubernetes namespaces prevent test resources from colliding with each other or with production.
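As a small illustration of unique naming prefixes, the sketch below derives a per-pull-request prefix that could be passed to Terraform as a variable. The function name, the `myapp` default, and the 24-character limit are assumptions for the example; real limits vary by resource type.

```python
import re


def resource_prefix(pr_number: int, project: str = "myapp", max_length: int = 24) -> str:
    """Build a lowercase, hyphen-separated prefix that is unique per pull request."""
    if pr_number <= 0:
        raise ValueError("pr_number must be positive")

    # Normalize the project name to characters most cloud resources accept
    slug = re.sub(r"[^a-z0-9]+", "-", project.lower()).strip("-")
    if not slug:
        raise ValueError("project must contain at least one letter or digit")

    suffix = f"-pr-{pr_number}"
    keep = max_length - len(suffix)
    if keep < 1:
        raise ValueError("max_length is too small for the PR suffix")

    # Trim the project slug, never the PR suffix, so uniqueness is preserved
    return slug[:keep].rstrip("-") + suffix
```

Trimming the project part rather than the suffix matters: two long project names truncated to the same length would otherwise collide across pull requests.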

The following GitHub Actions workflow demonstrates how to create an ephemeral infrastructure environment for each pull request. Resources are namespaced by PR number, validated with policy checks and integration tests, then destroyed regardless of test outcome.

# CI pipeline with ephemeral infrastructure testing
name: Infrastructure Tests

on:
  pull_request:
    paths:
      - 'terraform/**'

jobs:
  test:
    runs-on: ubuntu-latest
    env:
      TF_VAR_environment: "pr-${{ github.event.pull_request.number }}"

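    steps:
      - uses: actions/checkout@v4

      # Assumes cloud credentials for a dedicated test account are configured
      # and that conftest is installed on the runner

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          # The wrapper script can interfere with redirected JSON output
          terraform_wrapper: false

      - name: Terraform Init
        run: terraform init -input=false
        working-directory: terraform/

      - name: Terraform Plan
        run: terraform plan -input=false -out=tfplan
        working-directory: terraform/

      - name: Policy Validation
        # Runs where tfplan was written; policies/ sits at the repository root.
        # conftest only evaluates the "main" package unless told otherwise.
        run: |
          terraform show -json tfplan > plan.json
          conftest test plan.json -p ../policies/ --all-namespaces
        working-directory: terraform/

      - name: Create Test Environment
        run: terraform apply -input=false -auto-approve tfplan
        working-directory: terraform/

      - name: Run Integration Tests
        # go test stops after 10 minutes by default, which is often too short
        run: |
          go test -v -timeout 30m ./tests/integration/...

      - name: Destroy Test Environment
        if: always()
        run: terraform destroy -input=false -auto-approve
        working-directory: terraform/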

The if: always() condition on the destroy step ensures cleanup happens even when tests fail. The PR-based naming convention makes it easy to identify and manually clean up orphaned resources if something goes wrong.

Chaos and Resilience Testing

Infrastructure should be resilient to failures. Chaos testing validates that your infrastructure handles component failures gracefully. Kill an availability zone, terminate instances, saturate network links, and verify the system continues operating.

The following chaos test simulates an availability zone failure and verifies that your application survives. You'll stop all instances in one AZ, wait for health checks to detect the failure, then verify that the application remains accessible and that auto-scaling replaces the lost capacity.

# Chaos test for multi-AZ infrastructure
import time

import requests

# `infrastructure` is a hypothetical pytest fixture that wraps your cloud SDK
# calls (stopping instances, listing running instances, and so on)


def test_survives_az_failure(infrastructure):
    """Verify application survives loss of one availability zone."""

    # Get initial state
    initial_response = requests.get(infrastructure.load_balancer_url, timeout=10)
    assert initial_response.status_code == 200

    # Simulate AZ failure by stopping all instances in one AZ
    az_to_fail = infrastructure.availability_zones[0]
    failed_instances = infrastructure.stop_instances_in_az(az_to_fail)

    try:
        # Wait for health checks to mark instances unhealthy
        time.sleep(60)

        # Verify application still responds
        response = requests.get(infrastructure.load_balancer_url, timeout=10)
        assert response.status_code == 200

        # Verify auto-scaling launched replacement instances
        new_instances = infrastructure.get_running_instances()
        assert len(new_instances) >= infrastructure.min_instances

    finally:
        # Restore failed instances
        infrastructure.start_instances(failed_instances)

The try/finally block ensures instances are restored even if assertions fail. Run chaos tests in production-like environments to validate that your resilience mechanisms work as expected under realistic conditions.
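A fixed `time.sleep(60)` is either too long or too short depending on health check settings. One alternative is to poll for the expected state with a deadline. The helper below is a generic sketch; `wait_until` and its parameters are names chosen for this example, not part of any library.

```python
import time
from typing import Callable


def wait_until(
    condition: Callable[[], bool],
    timeout: float = 300.0,
    interval: float = 5.0,
) -> bool:
    """Poll condition until it returns True or the timeout elapses.

    Returns True if the condition was met, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

In the chaos test, this could replace the fixed sleep with something like `assert wait_until(lambda: len(infrastructure.get_running_instances()) >= infrastructure.min_instances)`, which passes as soon as capacity recovers and fails with a clear timeout otherwise.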

Continuous Validation

Production infrastructure should be continuously validated, not just at deployment time. Configuration drift, manual changes, and cloud provider behavior changes can invalidate assumptions.

This continuous validator runs as a scheduled job to detect security violations that may have been introduced through manual changes or configuration drift. When it finds issues, it alerts your team immediately rather than waiting for the next deployment.

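# Continuous infrastructure validation
import boto3


class InfrastructureValidator:
    # Ports that are allowed to be open to the internet
    PUBLIC_PORTS = {80, 443}

    def __init__(self, region_name=None):
        self.ec2 = boto3.client("ec2", region_name=region_name)
        self.rds = boto3.client("rds", region_name=region_name)

    def alert(self, message):
        # Placeholder: forward to your paging or chat integration
        print(f"ALERT: {message}")

    def validate_security_groups(self):
        """Verify no security groups allow unrestricted access."""
        pages = self.ec2.get_paginator("describe_security_groups").paginate()
        for page in pages:
            for sg in page["SecurityGroups"]:
                for rule in sg["IpPermissions"]:
                    from_port, to_port = rule.get("FromPort"), rule.get("ToPort")
                    # Only a single-port rule on an allowed port may be public;
                    # port ranges and all-traffic rules (no ports) are flagged
                    is_allowed = from_port == to_port and from_port in self.PUBLIC_PORTS
                    open_v4 = any(r.get("CidrIp") == "0.0.0.0/0"
                                  for r in rule.get("IpRanges", []))
                    open_v6 = any(r.get("CidrIpv6") == "::/0"
                                  for r in rule.get("Ipv6Ranges", []))
                    if (open_v4 or open_v6) and not is_allowed:
                        self.alert(
                            f"Security group {sg['GroupId']} allows unrestricted "
                            f"access on ports {from_port}-{to_port}"
                        )

    def validate_encryption(self):
        """Verify all EBS volumes are encrypted."""
        for page in self.ec2.get_paginator("describe_volumes").paginate():
            for volume in page["Volumes"]:
                if not volume["Encrypted"]:
                    self.alert(f"Volume {volume['VolumeId']} is not encrypted")

    def validate_backups(self):
        """Verify backup retention policies are active."""
        for page in self.rds.get_paginator("describe_db_instances").paginate():
            for db in page["DBInstances"]:
                if db["BackupRetentionPeriod"] < 7:
                    self.alert(
                        f"Database {db['DBInstanceIdentifier']} has "
                        f"backup retention of only {db['BackupRetentionPeriod']} days"
                    )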

Run these validators on a schedule that matches your risk tolerance. Critical production environments might warrant hourly checks, while development environments might only need daily validation.
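Configuration drift itself can also be detected with Terraform. Running `terraform plan -detailed-exitcode` exits with 0 when live infrastructure matches the configuration, 2 when changes are pending, and 1 on error. The sketch below wraps that in a scheduled check; the function names and the `working_dir` layout are assumptions for illustration.

```python
import subprocess


def interpret_plan_exit_code(code: int) -> str:
    """Map the exit code of `terraform plan -detailed-exitcode` to a status."""
    if code == 0:
        return "in_sync"  # no changes: infrastructure matches configuration
    if code == 2:
        return "drift"    # plan succeeded and found differences
    return "error"        # plan itself failed


def check_drift(working_dir: str) -> str:
    """Run a read-only plan in working_dir and report whether drift exists."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=working_dir,
        capture_output=True,
        text=True,
    )
    return interpret_plan_exit_code(result.returncode)
```

Note that a "drift" result also appears when the checked-out configuration is newer than what was last applied, so a check like this is most meaningful when run against the commit that is actually deployed.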

Conclusion

Infrastructure testing prevents outages caused by misconfiguration. Unit tests validate infrastructure code logic. Policy tests enforce organizational standards. Integration tests verify real resource behavior. Contract tests ensure infrastructure meets application requirements. Chaos tests validate resilience. Continuous validation catches drift and manual changes after deployment.

The testing strategy should match your risk tolerance and change velocity. Critical infrastructure serving production traffic deserves comprehensive testing. Development utilities might need less rigor. But some testing is always better than none, and infrastructure failures are often more impactful than application bugs.