DevOps Fundamental for DevOps Fundamentals

Posted on Jul 20

Terraform Fundamentals: DRS (Elastic Disaster Recovery)

#terraform #iac #aws #drselasticdisasterrecover

Elastic Disaster Recovery with Terraform: A Production Deep Dive

Infrastructure failures are inevitable. The question isn’t if something will go wrong, but when and how quickly you can recover. Traditional disaster recovery (DR) solutions often involve complex, manual processes and significant downtime. Modern infrastructure as code (IaC) workflows, built around Terraform, demand a more automated, repeatable, and reliable approach. Elastic Disaster Recovery (DRS), particularly when orchestrated through Terraform, provides that capability. This isn’t a “nice-to-have” anymore; it’s a core component of a resilient platform engineering stack, enabling rapid failover and minimizing business impact. DRS fits squarely within IaC pipelines, acting as a critical extension of your core infrastructure provisioning and management.

What is "DRS (Elastic Disaster Recovery)" in Terraform context?

DRS, in the context of Terraform, refers to the ability to replicate and recover infrastructure across regions or availability zones. No single Terraform resource delivers disaster recovery end to end; it’s implemented through a combination of resources that facilitate replication, failover, and failback. The core principle is to define your primary infrastructure with Terraform, then use that definition to create a secondary, standby environment. This secondary environment is kept synchronized with the primary, ready to take over in case of an outage.
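As a minimal sketch of the primary/standby idea, the configuration below declares two aliased AWS providers and creates the same kind of resource once per region. The alias names, CIDR ranges, and tags are illustrative assumptions, not values from a real deployment.

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

# Primary region: where production normally runs.
provider "aws" {
  alias  = "primary"
  region = "us-east-1"
}

# Standby region: where the recovery environment lives.
provider "aws" {
  alias  = "standby"
  region = "us-west-2"
}

# The same definition, created once per region (CIDRs are assumptions).
resource "aws_vpc" "primary" {
  provider   = aws.primary
  cidr_block = "10.0.0.0/16"
  tags       = { Name = "app-primary" }
}

resource "aws_vpc" "standby" {
  provider   = aws.standby
  cidr_block = "10.1.0.0/16"
  tags       = { Name = "app-standby" }
}
```

In a larger codebase the two resources would usually collapse into one module called twice, with the provider passed in through the `providers` argument.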

Currently, the most mature implementations are found within cloud provider ecosystems. AWS Elastic Disaster Recovery (which AWS itself abbreviates as AWS DRS) is a prime example, but the concepts translate to Azure Site Recovery and similar services on GCP. Terraform manages these services through their respective providers.

There isn’t a dedicated Terraform module for a generic “DRS” solution. Instead, you build DRS functionality using provider-specific resources. This means understanding the nuances of each cloud provider’s DR offering and translating that into Terraform code. A key caveat is that state management becomes paramount. You must ensure your Terraform state is properly versioned and secured to avoid conflicts during failover.
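One common way to get versioned, locked state is an S3 backend with a DynamoDB lock table. This is a sketch only: the bucket, key, and table names are hypothetical, and versioning has to be enabled on the bucket itself, outside this block.

```hcl
terraform {
  backend "s3" {
    bucket         = "example-terraform-state" # hypothetical bucket with versioning enabled
    key            = "dr/terraform.tfstate"    # hypothetical state path
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "example-terraform-locks" # hypothetical lock table
  }
}
```

Keeping the state bucket outside the region you expect to fail over from is worth considering, since a failover run needs to read that state.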

Use Cases and When to Use

DRS isn’t always necessary. Over-engineering DR for non-critical applications is wasteful. Here are scenarios where DRS is essential:

Regulatory Compliance: Industries like finance and healthcare often have strict RTO/RPO requirements mandating robust DR capabilities.
Multi-Region Applications: Applications designed for global reach require DR to ensure availability in the event of a regional outage.
Critical Business Services: Core services like e-commerce platforms, payment gateways, or internal tooling demand minimal downtime.
Planned Maintenance: DRS allows for zero-downtime deployments and maintenance by failing over to the secondary environment during updates.
Ransomware Protection: A geographically isolated DR environment can serve as a recovery point in the event of a ransomware attack. This requires careful network segmentation and access control.

Key Terraform Resources

Here are eight key Terraform resources used in building a DRS solution:

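aws_drs_replication_configuration_template (AWS DRS): Defines the default replication settings applied to source servers in the recovery region. Source servers join replication when the AWS Replication Agent is installed on them, not by passing an instance ID to Terraform.

   resource "aws_drs_replication_configuration_template" "example" {
     # Create this in the recovery region, for example us-west-2.
     staging_area_subnet_id           = "subnet-xxxxxxxxxxxxxxxxx"
     replication_server_instance_type = "t3.small"
     data_plane_routing               = "PRIVATE_IP"
     ebs_encryption                   = "DEFAULT"
     # Further required arguments (for example pit_policy blocks) omitted for brevity.
   }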

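Recovery plan (AWS DRS): Defines the failover sequence and dependencies. The hashicorp/aws provider does not appear to ship a recovery-plan resource, so read the block below as pseudo-configuration that expresses the intent, not as a resource type you can apply. In practice, recovery is started through the DRS console or API.

   # Illustrative only: "aws_edr_recovery_plan" is not a resource type in the hashicorp/aws provider.
   resource "aws_edr_recovery_plan" "example" {
     name = "my-recovery-plan"
     source_region = "us-east-1"
   }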

aws_instance: Defines the EC2 instances that are part of the DR plan.

   resource "aws_instance" "web" {
     ami           = "ami-xxxxxxxxxxxxxxxxx"
     instance_type = "t3.micro"
   }

aws_security_group: Defines network access rules for instances in both primary and secondary regions.

   resource "aws_security_group" "web_sg" {
     name        = "web-sg"
     description = "Allow web traffic"
   }

aws_route53_record: Used for DNS failover, pointing to the secondary region’s resources.

   resource "aws_route53_record" "www" {
     zone_id = "Zxxxxxxxxxxxxxxxxx"
     name    = "www.example.com"
     type    = "A"
     ttl     = "60"
     records = [aws_instance.web.public_ip]
   }

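azurerm_site_recovery_replication_policy (Azure Site Recovery): Defines the replication policy for Azure VMs. Retention and snapshot frequency are expressed in minutes.

   resource "azurerm_site_recovery_replication_policy" "example" {
     name                                                 = "default"
     resource_group_name                                  = "my-rg"
     recovery_vault_name                                  = "my-vault"
     recovery_point_retention_in_minutes                  = 7 * 24 * 60 # 7 days
     application_consistent_snapshot_frequency_in_minutes = 4 * 60
   }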

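azurerm_site_recovery_replicated_vm (Azure Site Recovery): Protects a VM for replication. The fabric, protection container, resource group, and VM referenced below are assumed to be defined elsewhere in the configuration.

   resource "azurerm_site_recovery_replicated_vm" "example" {
     name                                      = "my-vm"
     resource_group_name                       = "my-rg"
     recovery_vault_name                       = "my-vault"
     source_recovery_fabric_name               = azurerm_site_recovery_fabric.primary.name
     source_vm_id                              = azurerm_virtual_machine.vm.id
     recovery_replication_policy_id            = azurerm_site_recovery_replication_policy.example.id
     source_recovery_protection_container_name = azurerm_site_recovery_protection_container.primary.name
     target_resource_group_id                  = azurerm_resource_group.secondary.id
     target_recovery_fabric_id                 = azurerm_site_recovery_fabric.secondary.id
     target_recovery_protection_container_id   = azurerm_site_recovery_protection_container.secondary.id
   }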

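data.aws_region: Looks up details of a single region, such as the replication target, so the region is not hard-coded throughout the configuration. To enumerate regions, use the aws_regions data source instead.

   data "aws_region" "recovery" {
     name = "us-west-2" # region code, not the display name "US West (Oregon)"
   }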

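Dependencies are crucial. Whatever orchestrates recovery depends on the replication configuration and on the instances it protects, so reference those resources directly and let Terraform infer the ordering. Use lifecycle rules such as prevent_destroy to guard replication configurations against accidental destruction.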
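A lifecycle guard can be sketched as follows. The security group stands in for any replication-related resource, and its name is hypothetical; with `prevent_destroy` set, Terraform rejects any plan that would destroy the resource while the setting is still present in the configuration.

```hcl
resource "aws_security_group" "replication" {
  name        = "drs-replication-sg" # hypothetical name
  description = "Security group used by replication servers"

  lifecycle {
    # Fail the plan instead of destroying this resource.
    prevent_destroy = true
  }
}
```

The guard is deliberately blunt: to remove the resource on purpose, the setting has to be taken out of the configuration first, which makes destruction a reviewed code change.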

Common Patterns & Modules

Using for_each is essential for replicating infrastructure across multiple instances. Dynamic blocks allow for flexible configuration based on instance attributes. Remote backends (e.g., Terraform Cloud, S3) are non-negotiable for state locking and versioning.
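The sketch below combines the two techniques: `for_each` creates one instance per map entry, and a `dynamic` block adds one data disk per requested size. The variable shape, device names, and defaults are assumptions made for illustration.

```hcl
variable "servers" {
  description = "Hypothetical map of server name to settings."
  type = map(object({
    instance_type = string
    data_disks_gb = list(number)
  }))
  default = {
    web = { instance_type = "t3.micro", data_disks_gb = [] }
    db  = { instance_type = "t3.small", data_disks_gb = [50, 100] }
  }
}

variable "ami_id" {
  description = "AMI to launch; supply a valid ID for your region."
  type        = string
}

locals {
  # Device names assigned to data disks in order (assumes at most three disks).
  device_names = ["/dev/sdf", "/dev/sdg", "/dev/sdh"]
}

resource "aws_instance" "server" {
  for_each = var.servers

  ami           = var.ami_id
  instance_type = each.value.instance_type

  # One ebs_block_device block per requested data disk.
  dynamic "ebs_block_device" {
    for_each = each.value.data_disks_gb
    content {
      device_name = local.device_names[ebs_block_device.key]
      volume_size = ebs_block_device.value
    }
  }

  tags = { Name = each.key }
}
```

Because instances are keyed by name, not by index, removing one entry from the map affects only that instance.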

A layered module structure is recommended:

Core Infrastructure: Modules for VPC, subnets, security groups.
Application Modules: Modules for web servers, databases, etc.
DRS Module: A module that orchestrates replication and failover using the core and application modules.

While a single, comprehensive public module doesn’t exist, several community modules provide building blocks for specific components (e.g., replication configuration). Consider building your own DRS module tailored to your specific cloud provider and application requirements.

Hands-On Tutorial

This example demonstrates a basic AWS Elastic Disaster Recovery (DRS) setup.

Provider Setup:

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}

Resource Configuration:

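resource "aws_instance" "web" {
  ami           = "ami-0c55b999999999999" # Replace with a valid AMI
  instance_type = "t3.micro"
  tags = {
    Name = "Web Server"
  }
}

# Default replication settings for source servers. Apply this in the recovery
# region (for example us-west-2 through a provider alias); the instance above
# joins replication once the AWS Replication Agent is installed on it.
resource "aws_drs_replication_configuration_template" "web_replication" {
  staging_area_subnet_id           = "subnet-xxxxxxxxxxxxxxxxx" # Replace with a staging subnet
  replication_server_instance_type = "t3.small"
  data_plane_routing               = "PRIVATE_IP"
  ebs_encryption                   = "DEFAULT"
  create_public_ip                 = false
  use_dedicated_replication_server = false
  # The provider requires further arguments (for example pit_policy blocks);
  # see the resource documentation for the full list.
}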

Apply & Destroy:

terraform init
terraform plan
terraform apply
terraform destroy

terraform plan previews the resources to be created, terraform apply creates them, and terraform destroy removes them. This is a simplified example: the configuration above is a sketch and not a complete, apply-ready setup, and a full DRS implementation also requires a recovery plan and DNS failover configuration. In practice, this configuration would be integrated into a CI/CD pipeline and triggered by infrastructure changes.

Enterprise Considerations

Large organizations leverage Terraform Cloud/Enterprise for state management, remote runs, and collaboration. Sentinel or Open Policy Agent (OPA) enforce policy-as-code, ensuring compliance with security and governance standards. IAM roles are meticulously designed to follow the principle of least privilege. State locking is enforced through remote backends.

Costs are a significant factor. Replication incurs storage and network transfer costs. Multi-region deployments increase overall infrastructure spend. Scaling DRS requires careful capacity planning and automation.

Security and Compliance

Least privilege is paramount. Use aws_iam_policy (AWS) or azurerm_role_assignment (Azure) to grant only the necessary permissions to Terraform service accounts. Tagging policies enforce consistent metadata for cost allocation and security auditing. Drift detection identifies unauthorized changes to infrastructure.

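resource "aws_iam_policy" "drs_policy" {
  name        = "DRS-Terraform-Policy"
  description = "Policy for Terraform to manage Elastic Disaster Recovery resources"
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        # The IAM service prefix for Elastic Disaster Recovery is "drs".
        # Narrow the wildcard to the actions your pipeline needs.
        Action = [
          "drs:*"
        ]
        Effect   = "Allow"
        Resource = "*"
      },
    ]
  })
}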
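A tighter policy lists actions explicitly in place of a service-wide wildcard. The sketch below is an assumption about what a pipeline that only manages replication templates might need; IAM actions for Elastic Disaster Recovery use the `drs` prefix, and a real run will likely require additional permissions (for example EC2 and tagging actions), so validate the list against your own plan output.

```hcl
data "aws_iam_policy_document" "drs_scoped" {
  statement {
    sid    = "ManageReplicationTemplates"
    effect = "Allow"
    actions = [
      "drs:CreateReplicationConfigurationTemplate",
      "drs:UpdateReplicationConfigurationTemplate",
      "drs:DeleteReplicationConfigurationTemplate",
      "drs:DescribeReplicationConfigurationTemplates",
      "drs:DescribeSourceServers",
    ]
    resources = ["*"]
  }
}

resource "aws_iam_policy" "drs_scoped" {
  name   = "DRS-Terraform-Scoped" # hypothetical name
  policy = data.aws_iam_policy_document.drs_scoped.json
}
```

Building the document with `aws_iam_policy_document` lets Terraform validate the structure at plan time, which a hand-written JSON string does not.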

Integration with Other Services

DRS doesn’t operate in isolation.

Monitoring (Datadog, Prometheus): Integrate with monitoring tools to track replication status and health checks.
Alerting (PagerDuty, Opsgenie): Configure alerts for replication failures or performance degradation.
Load Balancing (AWS ALB, Azure Load Balancer): Use load balancers to distribute traffic to the secondary region during failover.
Databases (RDS, Azure SQL Database): Replicate databases to the secondary region using native replication mechanisms.
DNS (Route 53, Azure DNS): Automate DNS failover to point to the secondary region’s resources.

graph LR
    A[Terraform DRS Configuration] --> B(Monitoring - Datadog);
    A --> C(Alerting - PagerDuty);
    A --> D(Load Balancing - ALB);
    A --> E(Database Replication - RDS);
    A --> F(DNS Failover - Route 53);
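DNS failover in Route 53 can be sketched with a health check and a pair of failover records. The zone ID and IP addresses are passed in as variables, and the health-check path is hypothetical; adjust them to whatever your application exposes.

```hcl
variable "zone_id" { type = string }
variable "primary_ip" { type = string }
variable "standby_ip" { type = string }

# Probes the primary endpoint; Route 53 fails over when it turns unhealthy.
resource "aws_route53_health_check" "primary" {
  ip_address        = var.primary_ip
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health" # hypothetical health endpoint
  failure_threshold = 3
  request_interval  = 30
}

resource "aws_route53_record" "www_primary" {
  zone_id         = var.zone_id
  name            = "www.example.com"
  type            = "A"
  ttl             = 60
  records         = [var.primary_ip]
  set_identifier  = "primary"
  health_check_id = aws_route53_health_check.primary.id

  failover_routing_policy {
    type = "PRIMARY"
  }
}

resource "aws_route53_record" "www_standby" {
  zone_id        = var.zone_id
  name           = "www.example.com"
  type           = "A"
  ttl            = 60
  records        = [var.standby_ip]
  set_identifier = "standby"

  failover_routing_policy {
    type = "SECONDARY"
  }
}
```

With this layout, Route 53 answers with the standby address only while the primary health check is failing, so no Terraform run is needed at the moment of failover.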

Module Design Best Practices

Abstract DRS functionality into reusable modules. Use input variables for customization (e.g., regions, instance types, replication policies). Define clear output variables for accessing replicated resources. Use locals for internal calculations and configuration. Thorough documentation is essential.
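A module interface along these lines might look like the sketch below, which would typically be split across variables.tf, main.tf, and outputs.tf. All names and defaults are hypothetical.

```hcl
variable "primary_region" {
  description = "Region hosting the primary environment."
  type        = string
}

variable "recovery_region" {
  description = "Region hosting the standby environment."
  type        = string

  validation {
    condition     = can(regex("^[a-z]{2}(-[a-z]+)+-[0-9]$", var.recovery_region))
    error_message = "recovery_region must look like an AWS region code, for example us-west-2."
  }
}

variable "instance_type" {
  description = "Instance type for recovered servers."
  type        = string
  default     = "t3.micro"
}

locals {
  # Internal naming and tagging conventions, kept out of the module's inputs.
  name_prefix = "dr-${var.recovery_region}"
  common_tags = {
    Purpose       = "disaster-recovery"
    PrimaryRegion = var.primary_region
  }
}

output "name_prefix" {
  description = "Prefix applied to resources created by this module."
  value       = local.name_prefix
}

output "common_tags" {
  description = "Tags applied to every resource in the standby environment."
  value       = local.common_tags
}
```

Validating inputs at the module boundary surfaces a mistyped region during `terraform plan`, well before a failover depends on it.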

CI/CD Automation

# .github/workflows/dr-deploy.yml

name: DRS Deployment

on:
  push:
    branches:
      - main

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: hashicorp/setup-terraform@v2
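      # Assumes cloud credentials and backend access are already configured for the runner.
      - run: terraform init
      - run: terraform fmt -check
      - run: terraform validate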
      - run: terraform plan -out=tfplan
      - run: terraform apply tfplan

Pitfalls & Troubleshooting

State Conflicts: Multiple Terraform runs attempting to modify the same state can lead to conflicts. Use state locking.
Replication Lag: Replication may not be instantaneous. Monitor replication lag and adjust replication policies accordingly.
Network Connectivity: Ensure network connectivity between the primary and secondary regions.
IAM Permissions: Incorrect IAM permissions can prevent replication or failover.
DNS Propagation: DNS propagation delays can impact failover time. Use a low TTL for DNS records.
Incorrect AMI IDs: Using invalid AMI IDs will cause instance creation to fail.
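One way to avoid the invalid-AMI pitfall is to look the image up at plan time and drop the hard-coded ID. The name filter below is an assumption about Amazon Linux 2023 naming; verify it against the images available in your region.

```hcl
# Resolve the newest matching Amazon-owned image in the provider's region.
data "aws_ami" "al2023" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["al2023-ami-2023.*-x86_64"] # assumed naming pattern
  }
}

resource "aws_instance" "web_lookup" {
  ami           = data.aws_ami.al2023.id
  instance_type = "t3.micro"
}
```

Because AMI IDs differ per region, running the same lookup through the standby provider gives each region a valid ID without maintaining a mapping by hand.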

Pros and Cons

Pros:

Automated failover and failback.
Reduced downtime and data loss.
Improved resilience and business continuity.
Repeatable and consistent infrastructure.

Cons:

Increased complexity.
Higher costs.
Requires careful planning and configuration.
Provider-specific implementation.

Conclusion

Elastic Disaster Recovery, orchestrated through Terraform, is a critical component of a modern, resilient infrastructure. It’s not a simple undertaking, but the benefits – reduced downtime, improved data protection, and enhanced business continuity – far outweigh the challenges. Start with a proof-of-concept, evaluate existing modules, set up a CI/CD pipeline, and embrace policy-as-code to ensure a secure and reliable DRS implementation. The investment in automation and infrastructure resilience will pay dividends when the inevitable failure occurs.

DEV Community