Designing High Availability Systems

Architecture patterns and strategies for building resilient, always-on infrastructure

August 10, 2024 · 10 min read

Understanding High Availability

High availability (HA) is the ability of a system to remain operational and accessible even when individual components fail. The goal is to minimize downtime and ensure continuous service delivery.

Key metrics to understand:

- Availability Percentage: "Five nines" (99.999%) allows only 5.26 minutes of downtime per year
- RTO (Recovery Time Objective): Maximum acceptable time to restore service
- RPO (Recovery Point Objective): Maximum acceptable data loss, measured in time
- MTBF (Mean Time Between Failures): Average time between system failures
- MTTR (Mean Time To Recovery): Average time to restore service after a failure

HA is not just about redundancy; it is about designing systems that gracefully handle failures at every layer: network, compute, storage, and application.
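
The arithmetic behind these metrics is easy to check. The sketch below converts an availability target into a yearly downtime budget and estimates steady-state availability from MTBF and MTTR using the standard relation A = MTBF / (MTBF + MTTR). The function names and the sample MTBF/MTTR figures are illustrative assumptions, not measurements from a real system.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes, averaging in leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Yearly downtime budget implied by an availability target."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


def availability_from_mtbf(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        print(f"{target}% -> {downtime_minutes_per_year(target):.2f} min/year")
    # Hypothetical service: fails every 1,000 hours on average, 1 hour to restore
    print(f"availability = {availability_from_mtbf(1000, 1):.5f}")
```

Each additional nine cuts the downtime budget by a factor of ten, which is why shortening MTTR (faster detection and recovery) is usually a cheaper lever than lengthening MTBF.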

Load Balancing Strategies

Load balancing is the first line of defense in an HA architecture. It distributes traffic across multiple instances, prevents overloading individual servers, and enables seamless failover.

Layer 4 vs Layer 7 load balancing:

- L4 (Transport): Routes based on IP and port; faster but less intelligent
- L7 (Application): Routes based on HTTP headers, URLs, and cookies; more flexible, and enables path-based routing and canary deployments

Multi-tier load balancing for global applications:

1. DNS-based load balancing (Route 53, Cloud DNS) for geo-routing
2. Global load balancer (CloudFront, Azure Front Door) for edge termination
3. Regional load balancer (ALB, Azure Application Gateway) for service routing
4. Service mesh (Istio, Linkerd) for inter-service traffic management
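
To make the core idea concrete, here is a minimal sketch of round-robin distribution that skips backends marked unhealthy. It is a toy model of what a load balancer does internally, not the algorithm of any particular product; the class name, method names, and IP addresses are hypothetical.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Round-robin over backends, skipping any that are marked unhealthy."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._ring = cycle(self.backends)

    def mark_down(self, backend):
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def next_backend(self):
        # One full pass over the ring is enough to find a healthy backend
        for _ in range(len(self.backends)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends available")


if __name__ == "__main__":
    lb = RoundRobinBalancer(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
    lb.mark_down("10.0.0.2")  # e.g. a health check failed
    print([lb.next_backend() for _ in range(4)])
```

With the middle backend down, traffic alternates between the two remaining instances, and it rejoins the rotation as soon as `mark_up` is called.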

# Kubernetes Ingress (ingress-nginx) with upstream retries and session affinity
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-ingress
  annotations:
    # Stickiness: upstream-hash-by and cookie affinity are alternative
    # mechanisms; one of the two is normally enough
    nginx.ingress.kubernetes.io/upstream-hash-by: "$remote_addr"
    # Timeouts are in seconds
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "5"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "60"
    # Retry the request on another upstream for these failure types
    nginx.ingress.kubernetes.io/proxy-next-upstream: "error timeout http_502 http_503"
    nginx.ingress.kubernetes.io/proxy-next-upstream-tries: "3"
    nginx.ingress.kubernetes.io/affinity: "cookie"
    # 172800 seconds = 48 hours
    nginx.ingress.kubernetes.io/session-cookie-max-age: "172800"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - api.example.com
      secretName: api-tls-cert
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-service
                port:
                  number: 80
---
# Pod Disruption Budget for safe maintenance
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: "75%"
  selector:
    matchLabels:
      app: api-server
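
A percentage-based `minAvailable` interacts with replica count in a way that is easy to overlook. Kubernetes rounds the required pod count up, so the sketch below estimates how many voluntary evictions a budget like the one above would permit. The function name and the replica counts are illustrative assumptions.

```python
def allowed_disruptions(replicas: int, healthy: int, min_available_percent: int) -> int:
    """Voluntary evictions a percentage-based minAvailable budget permits right now."""
    # Integer ceiling of replicas * percent / 100 (percentages round up)
    required = (replicas * min_available_percent + 99) // 100
    return max(0, healthy - required)


if __name__ == "__main__":
    for replicas in (2, 3, 4, 10):
        print(replicas, "replicas ->", allowed_disruptions(replicas, replicas, 75), "evictions allowed")
```

Under this rounding, a 75% budget allows no voluntary evictions at all with three or fewer replicas, which can stall a node drain; at least four replicas are needed before one pod can be evicted.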

Failover Strategies

Failover is the automatic process of switching to a standby system when the primary fails. The effectiveness of a failover strategy directly impacts your RTO.

Active-Active vs Active-Passive:

- Active-Active: All nodes serve traffic simultaneously. Provides the best utilization and lowest failover time, but requires careful data synchronization.
- Active-Passive: Standby nodes wait idle until needed. Simpler to implement but wastes resources and has higher failover times.

Database failover patterns:

- Synchronous replication: Zero data loss but higher latency
- Asynchronous replication: Lower latency but potential for small data loss
- Semi-synchronous: A compromise in which at least one replica acknowledges each write

Automated failover components:

1. Health checks: Frequent, lightweight probes that detect failures quickly
2. Decision engine: Logic that determines when to trigger failover (avoid false positives)
3. State promotion: Automated promotion of standby to primary
4. Traffic redirection: DNS or load balancer updates to route traffic to the new primary
5. Notification: Alerting the operations team about the failover event
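
The decision engine is where false positives are filtered out. One common approach, sketched below, is to require several consecutive failed probes before triggering failover, so that a single dropped packet does not promote the standby. The class and its field names are hypothetical, and a production system would typically add probe timeouts and checks from multiple vantage points.

```python
from dataclasses import dataclass


@dataclass
class FailoverDecider:
    """Trigger failover only after N consecutive failed health probes."""

    failure_threshold: int = 3
    consecutive_failures: int = 0
    failed_over: bool = False

    def record_probe(self, healthy: bool) -> bool:
        """Feed one probe result; return True when failover should start."""
        if self.failed_over:
            return False  # already failed over; failback is a separate decision
        if healthy:
            self.consecutive_failures = 0  # a single success resets the streak
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.failed_over = True
            return True
        return False


if __name__ == "__main__":
    decider = FailoverDecider(failure_threshold=3)
    probes = [True, False, True, False, False, False]
    print([decider.record_probe(p) for p in probes])
```

In this run the isolated failure in the second probe is ignored, and failover fires only on the third consecutive failure at the end of the sequence.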

# AWS Route 53 health check and failover routing
resource "aws_route53_health_check" "primary" {
  fqdn              = "primary.example.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
  request_interval  = 10
  measure_latency   = true

  tags = {
    Name = "primary-health-check"
  }
}

resource "aws_route53_record" "failover_primary" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "api.example.com"
  type    = "A"

  alias {
    name                   = aws_lb.primary.dns_name
    zone_id                = aws_lb.primary.zone_id
    evaluate_target_health = true
  }

  failover_routing_policy {
    type = "PRIMARY"
  }

  set_identifier  = "primary"
  health_check_id = aws_route53_health_check.primary.id
}

resource "aws_route53_record" "failover_secondary" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "api.example.com"
  type    = "A"

  alias {
    name                   = aws_lb.secondary.dns_name
    zone_id                = aws_lb.secondary.zone_id
    evaluate_target_health = true
  }

  failover_routing_policy {
    type = "SECONDARY"
  }

  set_identifier = "secondary"
}
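
The health check settings above translate fairly directly into a lower bound on RTO for DNS-based failover. As a rough back-of-the-envelope sketch: detection takes about `failure_threshold` times `request_interval` seconds, and clients may then keep using a cached answer for up to the record's TTL. The 60-second TTL used below is an assumption for illustration, and the function name is hypothetical; real failover time also depends on resolver behavior and on any promotion steps in the application or database.

```python
def estimated_failover_seconds(failure_threshold: int, request_interval_s: int, dns_ttl_s: int) -> int:
    """Rough DNS failover estimate: time to detect plus time for caches to expire."""
    detection = failure_threshold * request_interval_s
    return detection + dns_ttl_s


if __name__ == "__main__":
    # Settings from the health check above, with an assumed 60 s TTL
    print(estimated_failover_seconds(3, 10, 60), "seconds")
    # Same threshold with a 30 s probe interval, for comparison
    print(estimated_failover_seconds(3, 30, 60), "seconds")
```

Under these assumptions the faster probe interval saves about a minute, which matters if the RTO target is measured in minutes.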

Multi-Region Deployment

Multi-region deployment is the gold standard for high availability. By distributing your infrastructure across geographically separated regions, you protect against regional outages and reduce latency for global users.

Architecture components:

- Global DNS with latency-based or geo-proximity routing
- Regional Kubernetes clusters with identical configurations
- Cross-region database replication with automatic promotion
- Distributed cache (Redis Cluster or ElastiCache Global Datastore)
- Object storage replication (S3 Cross-Region Replication)
- Centralized monitoring and alerting across all regions

Key challenges:

- Data consistency across regions (CAP theorem trade-offs)
- Handling split-brain scenarios during network partitions
- Managing deployments consistently across regions (GitOps helps)
- Cost optimization: running full redundancy in multiple regions is expensive
- Testing failover regularly without impacting production

The key is to automate everything. Manual failover procedures are too slow and error-prone for production systems. Use Infrastructure as Code, GitOps, and automated runbooks to ensure consistent, rapid response to any failure.
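
Split-brain is usually prevented with a majority quorum: a group of nodes may accept writes only if it can reach more than half of the cluster, so two sides of a partition can never both qualify. The sketch below shows that rule in isolation; the function names are hypothetical, and real systems implement it inside a consensus protocol such as Raft or Paxos rather than as a standalone check.

```python
def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """A partition may accept writes only if it holds a strict majority."""
    return reachable_nodes > cluster_size // 2


def tolerated_failures(cluster_size: int) -> int:
    """Nodes that can be lost while a majority still exists."""
    return (cluster_size - 1) // 2


if __name__ == "__main__":
    for size in (3, 4, 5):
        print(f"{size} nodes tolerate {tolerated_failures(size)} failure(s)")
    # An even 2/2 split of a 4-node cluster: neither side has quorum
    print(has_quorum(2, 4))
```

This is why odd cluster sizes, and three regions rather than two, are the usual choice: a fourth node adds cost without tolerating any additional failure, and a two-region deployment cannot form a majority when the link between regions fails.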

# Multi-region deployment across clusters
# Using an Argo CD ApplicationSet with the cluster generator
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: multi-region-api
  namespace: argocd
spec:
  generators:
    - clusters:
        # Target every registered cluster labeled tier=production
        selector:
          matchLabels:
            tier: production
        values:
          region: "{{metadata.labels.region}}"
  template:
    metadata:
      name: "api-{{values.region}}"
    spec:
      project: production
      source:
        repoURL: https://github.com/org/infra.git
        path: "apps/api/overlays/{{values.region}}"
        targetRevision: main
      destination:
        server: "{{server}}"
        namespace: production
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
          - CreateNamespace=true
---
# External DNS for automatic DNS record management
apiVersion: externaldns.k8s.io/v1alpha1
kind: DNSEndpoint
metadata:
  name: api-global
spec:
  endpoints:
    - dnsName: api.example.com
      recordType: A
      # TTL in seconds; DNSEndpoint sets it per endpoint via recordTTL
      recordTTL: 60
      targets:
        - 203.0.113.10  # Region 1
        - 198.51.100.20 # Region 2
        - 192.0.2.30    # Region 3
      setIdentifier: multi-region
      providerSpecific:
        - name: aws/failover
          value: PRIMARY