Terraform Best Practices for Production Infrastructure

Terraform is deceptively simple to start and surprisingly complex to run well at scale. These patterns come from real production experience managing infrastructure across AWS, GCP, and Azure.

State Management

Remote State (Non-Negotiable)

Never use local state in a team environment:

terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "prod/networking/terraform.tfstate"
    region         = "ap-south-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}

Key decisions:

  • One state file per logical component (networking, compute, database)
  • State locking to prevent concurrent modifications: a DynamoDB table for the S3 backend (or use_lockfile in Terraform 1.10 and later), built-in locking for the GCS backend
  • Encryption at rest always enabled
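
For GCP, the equivalent is the gcs backend, which handles locking itself, so no separate lock table is needed. A minimal sketch, assuming a hypothetical bucket name that mirrors the S3 example:

```hcl
terraform {
  backend "gcs" {
    bucket = "mycompany-terraform-state" # hypothetical bucket name
    prefix = "prod/networking"           # one prefix per component
  }
}
```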

State File Organization

Structure state by environment and component:

states/
├── prod/
│   ├── networking/
│   ├── compute/
│   ├── database/
│   └── monitoring/
├── staging/
│   └── ...
└── shared/
    ├── dns/
    └── iam/

Small state files are faster to plan, safer to modify, and easier to recover.
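
Splitting state means components have to read each other's outputs. One common way is the terraform_remote_state data source. The sketch below assumes the networking component stored under the key shown earlier publishes a private_subnet_ids output; the local name is illustrative.

```hcl
# In prod/compute: read outputs published by prod/networking.
data "terraform_remote_state" "networking" {
  backend = "s3"
  config = {
    bucket = "mycompany-terraform-state"
    key    = "prod/networking/terraform.tfstate"
    region = "ap-south-1"
  }
}

locals {
  # Assumes the networking root module declares this output.
  private_subnet_ids = data.terraform_remote_state.networking.outputs.private_subnet_ids
}
```

Reading remote state requires read access to the whole networking state object, so keep that in mind when deciding who can run the compute component.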

Module Design

Keep Modules Focused

One module = one logical thing:

modules/
├── vpc/           # networking only
├── eks-cluster/   # cluster + node groups
├── rds/           # database instance + security group
└── monitoring/    # dashboards + alerts

Input Validation

variable "environment" {
  type = string
  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Environment must be dev, staging, or prod."
  }
}
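
Validation is not limited to fixed lists. As a further sketch, a hypothetical vpc_cidr variable (not part of the original example) can be checked with can(), which turns a function error into false:

```hcl
variable "vpc_cidr" {
  type        = string
  description = "CIDR block for the VPC (hypothetical variable for illustration)"

  validation {
    # cidrhost() errors on a malformed CIDR; can() converts that error into false.
    condition     = can(cidrhost(var.vpc_cidr, 0))
    error_message = "The vpc_cidr value must be a valid CIDR block, for example 10.0.0.0/16."
  }
}
```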

Output What Consumers Need

output "vpc_id" {
  value       = aws_vpc.main.id
  description = "VPC ID for use by dependent modules"
}

output "private_subnet_ids" {
  value       = aws_subnet.private[*].id
  description = "Private subnet IDs for workload placement"
}
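
On the consuming side, those outputs are read as module.NAME.OUTPUT. A minimal sketch, assuming the vpc module lives at a hypothetical relative path and accepts an environment input; the two resources are illustrative consumers:

```hcl
module "vpc" {
  source      = "../../modules/vpc" # hypothetical path
  environment = "prod"              # assumed module input
}

# Consumers depend on the module's outputs, not on its internals.
resource "aws_security_group" "app" {
  name   = "app"
  vpc_id = module.vpc.vpc_id
}

resource "aws_db_subnet_group" "main" {
  name       = "main"
  subnet_ids = module.vpc.private_subnet_ids
}
```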

CI/CD Integration

Pipeline Structure

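# Pipeline outline (pseudocode; the same steps map onto GitHub Actions jobs)
plan:   # on every pull request
  - terraform init -input=false
  - terraform validate
  - terraform plan -input=false -out=plan.tfplan
  - Post plan output as PR comment
  - Upload plan.tfplan as a build artifact

apply:  # only on merge to main
  - Download the reviewed plan.tfplan
  - terraform init -input=false
  - terraform apply -input=false plan.tfplan
  - Notify on success/failure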

Safety Rules

  • plan runs on every PR, so reviewers see exactly what will change
  • apply only runs after merge (never on branch push)
  • Require manual approval for production applies
  • Store the plan file as an artifact and apply exactly what was reviewed

Common Pitfalls

1. Ignoring Drift

Terraform only knows about resources it manages. Manual changes create drift:

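# Check regularly (for example, from a scheduled CI job)
terraform plan -detailed-exitcode
# Exit code 0 = no changes, 1 = error, 2 = changes pending (drift, if the code is unchanged)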
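
Not every out-of-band change is a mistake. Where another system legitimately owns an attribute, a lifecycle ignore_changes rule keeps it from showing up as drift on every plan. A sketch, assuming an autoscaling group whose desired capacity is adjusted by scaling policies; the variables and names are placeholders:

```hcl
variable "private_subnet_ids" {
  type = list(string)
}

variable "launch_template_id" {
  type = string
}

resource "aws_autoscaling_group" "web" {
  name                = "web"
  min_size            = 2
  max_size            = 10
  desired_capacity    = 2
  vpc_zone_identifier = var.private_subnet_ids

  launch_template {
    id      = var.launch_template_id
    version = "$Latest"
  }

  lifecycle {
    # Scaling policies change this at runtime; do not report it as drift.
    ignore_changes = [desired_capacity]
  }
}
```

Use this sparingly: every ignored attribute is one that Terraform will no longer correct.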

2. Hardcoding Values

Bad:

resource "aws_instance" "web" {
  ami           = "ami-0abcdef1234567890"
  instance_type = "t3.medium"
}

Better:

data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"] # Canonical
  filter {
    name   = "name"
    # Matches every Ubuntu release; narrow the pattern to pin a specific one
    values = ["ubuntu/images/hvm-ssd/ubuntu-*-amd64-server-*"]
  }
}

resource "aws_instance" "web" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = var.instance_type
}
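
The resource above references var.instance_type without showing its declaration. One possible declaration, with an assumed default and an assumed allow-list, might look like this:

```hcl
variable "instance_type" {
  type        = string
  description = "EC2 instance type for the web tier"
  default     = "t3.medium" # assumed default, taken from the hardcoded example

  validation {
    # The allowed sizes are illustrative; adjust to what the workload needs.
    condition     = contains(["t3.small", "t3.medium", "t3.large"], var.instance_type)
    error_message = "The instance_type value must be one of t3.small, t3.medium, or t3.large."
  }
}
```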

3. Giant Monolithic State

As a rule of thumb, if terraform plan routinely takes more than about 30 seconds, the state probably covers too much. Split it along component boundaries.

4. Not Using moved Blocks

When refactoring, use moved blocks (Terraform 1.1 and later) so resources are re-addressed in state instead of being destroyed and recreated:

moved {
  from = aws_instance.web
  to   = module.compute.aws_instance.web
}
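
moved blocks also cover renames and the switch from a single resource to for_each. A sketch, assuming the instance is being converted to a keyed set; the "primary" key is hypothetical:

```hcl
# Before: resource "aws_instance" "web" { ... }
# After:  resource "aws_instance" "web" { for_each = toset(["primary"]) ... }
moved {
  from = aws_instance.web
  to   = aws_instance.web["primary"]
}
```

The next plan should then report the address change rather than a destroy and a create.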

5. Secrets in State

Terraform state contains sensitive values in plaintext. Always:

  • Encrypt state at rest
  • Restrict state bucket access
  • Never commit state files to Git
  • Mark secret variables and outputs with sensitive = true (this only redacts CLI output; the values are still stored in state)
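
As a sketch of the last point, with hypothetical variable and output names:

```hcl
variable "db_password" {
  type      = string
  sensitive = true # redacted in plan and apply output
}

variable "db_host" {
  type = string
}

output "db_connection_string" {
  value = "postgres://app:${var.db_password}@${var.db_host}:5432/app"
  # Terraform rejects an output derived from a sensitive value unless it is marked.
  sensitive = true
}
```

The connection string still lands in the state file in plaintext, which is why the encryption and access rules above matter.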

Workspace vs Directory Strategy

Approach       Best For
Workspaces     Same infra, different scale (dev/staging/prod with identical structure)
Directories    Different infra per environment (prod has additional security layers)

Many teams end up with directories, because environments rarely stay identical.
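
With workspaces, per-environment differences are usually keyed off terraform.workspace. A minimal sketch, with hypothetical sizes:

```hcl
locals {
  # Hypothetical per-workspace sizing.
  instance_types = {
    dev     = "t3.small"
    staging = "t3.medium"
    prod    = "m5.large"
  }

  # Fall back to the smallest size for any other workspace, including "default".
  instance_type = lookup(local.instance_types, terraform.workspace, "t3.small")
}
```

Once lookups like this start to multiply, that is usually the sign that the environments have diverged enough to justify separate directories.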

Testing

Validate

terraform validate    # syntax check
terraform fmt -check  # formatting

Plan Analysis

terraform plan -out=plan.tfplan
terraform show -json plan.tfplan | jq '.resource_changes[] | select(.change.actions | contains(["delete"]))'

Integration Tests

Tools like Terratest or terraform test (native, added in 1.6) verify actual infrastructure behavior.
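
A minimal sketch of a native test file, assuming it sits next to a module that declares the environment variable from the Input Validation section. The file name and run names are hypothetical, and command = plan still needs working provider configuration if the module declares resources:

```hcl
# tests/environment.tftest.hcl (hypothetical file name)

run "accepts_known_environment" {
  command = plan

  variables {
    environment = "staging"
  }
}

run "rejects_unknown_environment" {
  command = plan

  variables {
    environment = "qa"
  }

  # The run passes only if the variable's validation rule fails.
  expect_failures = [
    var.environment,
  ]
}
```

Running terraform test from the module directory picks up files ending in .tftest.hcl.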

Import Existing Resources

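When inheriting unmanaged infrastructure, declare a resource block for it first (terraform import fails if the target address is not in the configuration), then import:

terraform import aws_instance.legacy i-1234567890abcdef0

Then fill in the HCL until it matches the real resource. Run terraform plan and confirm it reports no changes before you start managing the resource through Terraform.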
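
Terraform 1.5 and later can also express the import declaratively, so it goes through the same plan and review flow as any other change. A sketch using the same address and instance ID; the attribute values are placeholders, not the real instance's settings:

```hcl
import {
  to = aws_instance.legacy
  id = "i-1234567890abcdef0"
}

resource "aws_instance" "legacy" {
  # Placeholder values: replace with the settings of the real instance.
  ami           = "ami-0abcdef1234567890"
  instance_type = "t3.medium"
}
```

If the resource block has not been written yet, terraform plan -generate-config-out=generated.tf can draft one from the import block; treat the generated file as a starting point to review, not finished code.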


Terraform rewards discipline. Small state files, focused modules, CI/CD gates, and regular drift checks keep production infrastructure predictable.
