Tags: Terraform, IaC, DevOps, Cloud

Terraform Best Practices at Scale

Module design, remote state management, Terraform Cloud workflows, and secrets handling for enterprise infrastructure

March 10, 2025 · 18 min read

Module Design & Reusability

At scale, Terraform modules are the building blocks of your infrastructure. A well-designed module library eliminates duplication, enforces standards, and makes onboarding new services straightforward.

Key principles for module design:

- Single Responsibility: Each module should manage one concern, such as an EKS node group, a Helm release, or a Vault policy. Don't create "god modules" that provision entire environments.
- Input Validation: Use variable validation blocks to catch misconfigurations at plan time. Enum-style constraints (condition = contains(["staging", "production"], var.environment)) reject invalid values before any resource is touched.
- Feature Toggles: Use boolean variables (usesVault, requireHpas, requireNetworkPolicies) to enable or disable features per service. This lets one module serve 50+ microservices with different requirements.
- Dynamic Blocks: Use dynamic blocks with for_each for optional, repeatable configurations like taints, labels, and firewall rules. This keeps modules flexible without multiplying conditional resources.
- Locals for Environment Mapping: Use locals maps to derive environment-specific values (Vault paths, Sentry configs, Argo labels) from a single environment variable input.

A good test: onboarding a new microservice should require one module call with ~5 required variables, not 50 lines of boilerplate.
# Reusable EKS Node Group Module with Dynamic Taints
# modules/infrastructure/eks-node-group/main.tf

resource "aws_eks_node_group" "main" {
  # One node group per subnet; var.subnetIds maps a short key to a subnet ID.
  for_each = var.subnetIds

  cluster_name    = var.clusterName
  node_group_name = "${var.namePrefix}-${each.key}"
  node_role_arn   = var.roleArn
  ami_type        = var.amiType
  capacity_type   = var.capacityType

  subnet_ids     = [each.value]
  instance_types = var.instanceTypes

  scaling_config {
    # Fall back to minSize when no initial size is given.
    desired_size = var.initialSize == null ? var.minSize : var.initialSize
    max_size     = var.maxSize
    min_size     = var.minSize
  }

  lifecycle {
    # Let the autoscaler own desired_size after creation.
    ignore_changes = [scaling_config[0].desired_size]
  }

  # var.noScheduleTaints is a map of taint key => taint value; an empty map emits no taint blocks.
  dynamic "taint" {
    for_each = var.noScheduleTaints
    content {
      effect = "NO_SCHEDULE"
      key    = taint.key
      value  = taint.value
    }
  }

  tags = merge({ Name = "${var.namePrefix}-${each.key}" }, local.mergedTags)
}

# Service Deployment Module: validated inputs
variable "environment" {
  type = string
  validation {
    condition     = contains(["staging", "production", "productionUk"], var.environment)
    error_message = "environment must be one of: staging, production, productionUk."
  }
}

variable "technology" {
  type = string
  validation {
    condition     = contains(["PHP", "Node", "Golang", "React", "Angular"], var.technology)
    error_message = "technology must be one of: PHP, Node, Golang, React, Angular."
  }
}
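
The "Locals for Environment Mapping" principle is not covered by the snippets above, so here is a minimal sketch of how it might look. The map keys mirror the environment values validated above. The Vault paths, Argo labels, and the names environmentSettings and settings are hypothetical placeholders, not the author's actual configuration. The variable is redeclared only so the sketch stands alone.

```hcl
# Sketch only: every value in the map below is an illustrative assumption.
variable "environment" {
  type = string
  validation {
    condition     = contains(["staging", "production", "productionUk"], var.environment)
    error_message = "environment must be one of: staging, production, productionUk."
  }
}

locals {
  # One entry per allowed environment (hypothetical values).
  environmentSettings = {
    staging = {
      vaultPath = "kv/data/myproject/staging"
      argoLabel = "staging"
    }
    production = {
      vaultPath = "kv/data/myproject/production"
      argoLabel = "production"
    }
    productionUk = {
      vaultPath = "kv/data/myproject/production-uk"
      argoLabel = "production-uk"
    }
  }

  # Everything downstream reads from this single lookup.
  settings = local.environmentSettings[var.environment]
}

output "vaultPath" {
  value = local.settings.vaultPath
}
```

Because the validation block and the map share the same keys, a value that passes validation always resolves in the lookup, and callers only ever pass the one environment string.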

Remote State Management & Terraform Cloud

State management is the most critical aspect of Terraform at scale. Getting it wrong means state corruption, lost resources, and sleepless nights.

For enterprise setups, Terraform Cloud (or Terraform Enterprise) provides the best experience:

- State isolation: One workspace per environment-layer combination. Example workspace naming: aws-production-infrastructure, azure-staging-kubernetes-objects, gcp-production-vpc. This prevents a staging change from accidentally modifying production resources.
- VCS-driven workflows: Connect workspaces to your GitLab/GitHub repository with trigger_patterns scoped to specific directories. A change in /aws/production/infrastructure/** only triggers the aws-production-infrastructure workspace, not all 20 workspaces.
- Remote execution: Plans run in Terraform Cloud's infrastructure, not on developer laptops. This ensures consistent provider versions, credentials, and execution environments.
- State locking: Terraform Cloud handles locking automatically, so there is no separate lock store to manage, such as the DynamoDB table commonly paired with an S3 backend.

For GCP workloads where Terraform Cloud agent execution isn't available, the GCS backend with prefix-based state isolation works well: bucket = "terraform-state-servicex", prefix = "production/servicex/kubernetes-cluster.tfstate". Note that the GCS backend treats prefix as a path prefix rather than a file name: each workspace's state is stored as <prefix>/<workspace>.tfstate, so the .tfstate suffix here becomes part of the folder name.
# Terraform Cloud Backend Configuration
terraform {
  cloud {
    organization = "my-org"
    workspaces {
      name = "aws-production-infrastructure"
    }
  }
}

# GCS Backend for GCP Workloads
terraform {
  backend "gcs" {
    bucket = "terraform-state-servicex"
    # Path prefix, not a file name: state lands at <prefix>/<workspace>.tfstate.
    prefix = "production/servicex/kubernetes-cluster.tfstate"
  }
}

# Terraform Cloud Workspace Module with VCS Triggers
resource "tfe_workspace" "main" {
  name              = var.name
  project_id        = var.projectId
  force_delete      = false
  queue_all_runs    = true
  working_directory = "/${var.workingDirectory}"

  # Plan when the workspace's own directory or any directory it depends on changes.
  trigger_patterns = [
    for v in concat(
      [var.workingDirectory],
      var.additionalTriggerDirectories,
    ) : "${v}/**/*"
  ]

  vcs_repo {
    identifier         = "devops/terraform"
    oauth_token_id     = var.gitlabOauthTokenId
    ingress_submodules = false
  }
}

resource "tfe_workspace_settings" "main" {
  workspace_id   = tfe_workspace.main.id
  execution_mode = var.executionMode
  # An agent pool is only attached when the workspace runs in agent mode.
  agent_pool_id  = var.executionMode == "agent" ? var.agentPoolId : null
}
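
The workspace module above reads var.executionMode, var.agentPoolId, var.workingDirectory, and var.additionalTriggerDirectories, but their declarations are not shown. The sketch below is one plausible way to declare them, not the author's actual variables.tf. It applies the enum-style validation from the module design section to the execution mode and exposes the trigger-pattern expression as an output, so it can be inspected without the tfe provider. The defaults, descriptions, and the triggerPatterns output name are assumptions.

```hcl
# Hypothetical variables.tf for the workspace module (sketch).
variable "workingDirectory" {
  type        = string
  description = "Repository-relative directory the workspace runs in."
}

variable "additionalTriggerDirectories" {
  type        = list(string)
  default     = []
  description = "Extra directories (for example shared modules) that should trigger a run."
}

variable "executionMode" {
  type    = string
  default = "remote"
  validation {
    condition     = contains(["remote", "local", "agent"], var.executionMode)
    error_message = "executionMode must be one of: remote, local, agent."
  }
}

variable "agentPoolId" {
  type        = string
  default     = null
  description = "Only used when executionMode is \"agent\"."
}

locals {
  # Same expression as in the tfe_workspace resource above.
  triggerPatterns = [
    for v in concat(
      [var.workingDirectory],
      var.additionalTriggerDirectories,
    ) : "${v}/**/*"
  ]
}

output "triggerPatterns" {
  value = local.triggerPatterns
}
```

With workingDirectory = "aws/production/infrastructure" and additionalTriggerDirectories = ["modules/shared"], the output evaluates to ["aws/production/infrastructure/**/*", "modules/shared/**/*"].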

Secrets & Security with HashiCorp Vault

Secrets in Terraform are the biggest security risk if mishandled. Hardcoded credentials in .tf files, plaintext in state, or overly broad access policies can turn IaC into an attack vector.

The gold standard is HashiCorp Vault integration:

- Vault as credential source: Use vault_generic_secret data sources and dedicated vault-secret modules to fetch credentials at plan time. Database passwords, API keys, OAuth secrets: nothing is hardcoded. Values read through data sources are still recorded in state, so access to state needs to be as tightly controlled as access to Vault itself.
- Kubernetes auth backend: Each cluster environment gets its own Vault auth backend (kubernetes-prod-cluster). Service accounts are bound to namespace-scoped Vault roles with minimal policies.
- Per-namespace policies: Vault policies follow the pattern kv/data/{project}/{environment}/{namespace}/*. A service in the "payments" namespace can only read its own secrets, not those of the "users" namespace.
- Transit encryption: Vault's transit engine handles application-level encryption without exposing keys. Environment-specific transit backends (transit-production, transit-staging) isolate encryption domains.
- TOTP engine: For services requiring time-based OTP generation, Vault's TOTP engine provides centralized management.

The principle: secrets should be ephemeral, scoped, and audited. If a developer can see a production database password, something is wrong.
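
The environment-specific transit backends mentioned above could be declared along the following lines. This is a sketch under assumptions, not the author's configuration. The mount paths follow the transit-{environment} naming from the text, the key name "app-data" and the local transitEnvironments are hypothetical, and a configured Vault provider is assumed.

```hcl
terraform {
  required_providers {
    vault = {
      source = "hashicorp/vault"
    }
  }
}

locals {
  # Hypothetical list; one transit mount is created per entry.
  transitEnvironments = toset(["staging", "production"])
}

# One transit engine per environment, e.g. transit-staging and transit-production.
resource "vault_mount" "transit" {
  for_each    = local.transitEnvironments
  path        = "transit-${each.key}"
  type        = "transit"
  description = "Transit engine for ${each.key}"
}

# One encryption key per mount; "app-data" is an illustrative name.
resource "vault_transit_secret_backend_key" "appData" {
  for_each = vault_mount.transit
  backend  = each.value.path
  name     = "app-data"
}
```

Keeping the mounts separate means a policy granting access under transit-staging says nothing about transit-production, which is what isolates the encryption domains.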
# Vault Kubernetes Auth Backend Role (per namespace)
resource "vault_kubernetes_auth_backend_role" "main" {
  count        = var.usesVault ? 1 : 0
  role_name    = var.namespace
  backend      = "kubernetes-prod-cluster"
  token_period = 1800 # seconds

  # Only this service account, in this namespace, may assume the role.
  bound_service_account_names = [
    kubernetes_service_account.vault[0].metadata[0].name,
  ]
  bound_service_account_namespaces = [
    kubernetes_namespace.main.metadata[0].name,
  ]

  token_policies = concat(var.additionalVaultPolicies, [
    vault_policy.main[0].name
  ])
}

# Namespace-Scoped Vault Policy
resource "vault_policy" "main" {
  count  = var.usesVault ? 1 : 0
  name   = "prod-cluster/${var.namespace}"
  policy = data.vault_policy_document.main.hcl
}

data "vault_policy_document" "main" {
  # Read-only access, limited to this namespace's own KV path.
  rule {
    path         = "kv/data/myproject/production/${var.namespace}/*"
    capabilities = ["read"]
  }
}

# Using Vault Secrets in Resources (No Hardcoded Creds)
# Required arguments such as resource_group_name and location are omitted here for brevity.
resource "azurerm_postgresql_flexible_server" "main" {
  name                   = "prod-database"
  administrator_login    = module.vaultSecrets.secretData["dbAdminUser"]
  administrator_password = module.vaultSecrets.secretData["dbAdminPass"]
  sku_name               = "B_Standard_B2s"
  version                = "17"
  auto_grow_enabled      = true
}
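
The module.vaultSecrets reference above is not defined in the post. A minimal sketch of what such a vault-secret module might contain is shown below, assuming it simply wraps the vault_generic_secret data source named in the text. The file path, the path variable, and the internal names are hypothetical. Only the secretData output name is taken from the usage above.

```hcl
# modules/vault-secret/main.tf (hypothetical layout, sketch only)
terraform {
  required_providers {
    vault = {
      source = "hashicorp/vault"
    }
  }
}

variable "path" {
  type        = string
  description = "Full Vault path of the secret to read."
}

# Reads the secret when the configuration is planned or applied.
data "vault_generic_secret" "this" {
  path = var.path
}

# Key/value map of the secret, e.g. secretData["dbAdminUser"].
# Marked sensitive so values are redacted in plan and apply output.
output "secretData" {
  value     = data.vault_generic_secret.this.data
  sensitive = true
}
```

Marking the output as sensitive only redacts the values in CLI output. They are still written to state, which is why restricting state access matters.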

CI/CD Integration & Workspace Strategy

Terraform CI/CD at scale requires a strategy beyond "run terraform apply in a pipeline." You need workspace isolation, directory-scoped triggers, and safe promotion patterns.

Workspace naming convention: {cloud}-{environment}-{layer}. Examples: aws-production-infrastructure, azure-staging-kubernetes-objects, gcp-production-vpc. This makes it immediately clear what each workspace manages.

Directory structure mirrors workspaces:

    terraform/
      aws/
        production/
          infrastructure/        -> aws-production-infrastructure workspace
          kubernetes-objects/    -> aws-production-kubernetes-objects workspace
        staging/
          infrastructure/        -> aws-staging-infrastructure workspace
      azure/
        production/
          infrastructure/        -> azure-production-infrastructure workspace
      gcp/
        servicex/
          production/            -> gcp-production-* workspaces
      modules/                   -> shared, triggers dependent workspaces

VCS trigger patterns ensure a module change triggers all dependent workspaces. When you update modules/kubernetes-objects/prometheus-grafana, every workspace using that module should plan.

Environment promotion: Changes flow staging -> production. Staging workspaces use auto-apply; production uses manual confirmation after plan review.

Agent pools: For workspaces that need private network access (Vault configuration, Kubernetes API), use Terraform Cloud agents running inside your VPC.
# Workspace Factory Pattern
# global/terraform-cloud/main.tf

locals {
  workspaces = {
    "aws-production-infrastructure" = {
      workingDirectory = "aws/production/infrastructure"
      executionMode    = "remote"
      additionalTriggerDirectories = [
        "modules/infrastructure",
        "modules/shared",
      ]
    }
    "aws-production-kubernetes-objects" = {
      workingDirectory = "aws/production/kubernetes-objects"
      executionMode    = "agent" # needs K8s API access
      additionalTriggerDirectories = [
        "modules/kubernetes-objects",
        "modules/shared",
      ]
    }
    "azure-staging-kubernetes-objects" = {
      workingDirectory = "azure/staging/kubernetes-objects"
      executionMode    = "agent"
      additionalTriggerDirectories = [
        "modules/kubernetes-objects",
        "modules/shared",
      ]
    }
  }
}

module "workspaces" {
  for_each                     = local.workspaces
  source                       = "./modules/workspace"
  name                         = each.key
  projectId                    = tfe_project.main.id
  workingDirectory             = each.value.workingDirectory
  executionMode                = each.value.executionMode
  agentPoolId                  = tfe_agent_pool.main.id
  gitlabOauthTokenId           = var.gitlabOauthTokenId
  additionalTriggerDirectories = each.value.additionalTriggerDirectories
}
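
The promotion rule described earlier (auto-apply on staging, manual confirmation on production) is not wired into the factory above. One way it might be derived from the {cloud}-{environment}-{layer} naming convention is sketched below. The workspaceNames list and the promotion and autoApply names are hypothetical, and the sketch uses only built-in functions so it can be evaluated without any provider.

```hcl
locals {
  # Illustrative workspace names following {cloud}-{environment}-{layer}.
  workspaceNames = [
    "aws-production-infrastructure",
    "aws-staging-infrastructure",
    "azure-staging-kubernetes-objects",
  ]

  # The environment is always the second hyphen-separated segment,
  # even when the layer itself contains hyphens (kubernetes-objects).
  promotion = {
    for name in local.workspaceNames : name => {
      environment = split("-", name)[1]
      autoApply   = split("-", name)[1] == "staging"
    }
  }
}

output "promotion" {
  value = local.promotion
}
```

In a workspace module like the one shown earlier, the autoApply flag could then be passed to the auto_apply argument of tfe_workspace, so production workspaces keep the default of waiting for confirmation after the plan.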