Architecture, HA, Cloud, Kubernetes
Designing High Availability Systems
Architecture patterns and strategies for building resilient, always-on infrastructure
August 10, 2024 · 10 min read
Understanding High Availability
High availability (HA) is the ability of a system to remain operational and accessible even when individual components fail. The goal is to minimize downtime and ensure continuous service delivery.
Key metrics to understand:
- Availability Percentage: "Five nines" (99.999%) allows only 5.26 minutes of downtime per year
- RTO (Recovery Time Objective): Maximum acceptable time to restore service
- RPO (Recovery Point Objective): Maximum acceptable data loss measured in time
- MTBF (Mean Time Between Failures): Average time between system failures
- MTTR (Mean Time To Recovery): Average time to restore service after a failure
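These metrics are related by simple arithmetic, which is worth checking by hand. The sketch below assumes a year of 365.25 days and uses the standard steady-state formula, availability = MTBF / (MTBF + MTTR). The function names are illustrative and not taken from any library.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes, averaging in leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Yearly downtime budget implied by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        budget = downtime_minutes_per_year(target)
        print(f"{target}% -> {budget:.2f} minutes of downtime per year")
    # A system that fails every 999 hours and takes 1 hour to recover
    # is up 999 out of every 1000 hours.
    print(steady_state_availability(mtbf_hours=999, mttr_hours=1))
```

The formula shows why MTTR matters as much as MTBF: halving recovery time improves availability about as much as doubling the time between failures.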
HA is not just about redundancy — it's about designing systems that gracefully handle failures at every layer: network, compute, storage, and application.
Load Balancing Strategies
Load balancing is the first line of defense in an HA architecture. It distributes traffic across multiple instances, prevents overloading individual servers, and enables seamless failover.
Layer 4 vs Layer 7 load balancing:
- L4 (Transport): Routes based on IP address and port; faster, but cannot see request content
- L7 (Application): Routes based on HTTP headers, URLs, and cookies; more flexible, and enables path-based routing and canary deployments
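The difference comes down to what information the balancer can see when it picks a backend. The following is a toy sketch, not how any particular load balancer is implemented. The backend addresses, service names, and the `x-canary` header are all hypothetical.

```python
import hashlib

BACKENDS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical backend pool


def l4_pick(client_ip: str, client_port: int, backends=BACKENDS) -> str:
    """L4-style choice: only connection metadata (IP and port) is visible."""
    key = f"{client_ip}:{client_port}".encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return backends[digest % len(backends)]


# L7-style routing table: the longest matching path prefix wins.
ROUTES = {
    "/api/": "api-service",
    "/static/": "static-service",
    "/": "web-service",
}


def l7_pick(path: str, headers: dict[str, str], routes=ROUTES) -> str:
    """L7-style choice: the request path and headers are visible."""
    if headers.get("x-canary") == "true":  # hypothetical canary opt-in header
        return "canary-service"
    prefix = max((p for p in routes if path.startswith(p)), key=len, default="/")
    return routes[prefix]


if __name__ == "__main__":
    print(l4_pick("198.51.100.7", 50432))
    print(l7_pick("/api/users", {}))
    print(l7_pick("/api/users", {"x-canary": "true"}))
```

The L4 function cannot implement the canary rule at all, because the header never reaches it. That is the flexibility trade-off in a nutshell.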
Multi-tier load balancing for global applications:
1. DNS-based load balancing (Route 53, Cloud DNS) for geo-routing
2. Global edge layer or load balancer (CloudFront, Azure Front Door) for edge termination
3. Regional load balancer (ALB, Azure Application Gateway) for service routing
4. Service mesh (Istio, Linkerd) for inter-service traffic management
```yaml
# Kubernetes Ingress (ingress-nginx) with timeouts, retries to the next
# upstream, and session affinity
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-ingress
  annotations:
    # upstream-hash-by and cookie affinity are two different stickiness
    # mechanisms; in most setups you would pick one of them.
    nginx.ingress.kubernetes.io/upstream-hash-by: "$remote_addr"
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "5"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "60"
    # Retry the request on another upstream for these failure types
    nginx.ingress.kubernetes.io/proxy-next-upstream: "error timeout http_502 http_503"
    nginx.ingress.kubernetes.io/proxy-next-upstream-tries: "3"
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-max-age: "172800"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - api.example.com
      secretName: api-tls-cert
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-service
                port:
                  number: 80
---
# Pod Disruption Budget for safe maintenance
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: "75%"
  selector:
    matchLabels:
      app: api-server
```
Failover Strategies
Failover is the automatic process of switching to a standby system when the primary fails. The effectiveness of a failover strategy directly impacts your RTO.
Active-Active vs Active-Passive:
- Active-Active: All nodes serve traffic simultaneously. Provides the best utilization and lowest failover time, but requires careful data synchronization.
- Active-Passive: Standby nodes wait idle until needed. Simpler to implement but wastes resources and has higher failover times.
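Active-Active only delivers fast failover if the surviving nodes have room to absorb the extra traffic. A back-of-the-envelope sketch, assuming load is spread evenly and is redistributed evenly after a failure (the function names are illustrative):

```python
def max_safe_utilization(nodes: int, tolerated_failures: int = 1) -> float:
    """Highest per-node utilization that still leaves room to absorb failures.

    If `tolerated_failures` nodes disappear, the rest must carry the full load,
    so each node may run at most (nodes - tolerated_failures) / nodes.
    """
    if nodes <= tolerated_failures:
        raise ValueError("need more nodes than tolerated failures")
    return (nodes - tolerated_failures) / nodes


def utilization_after_failure(current: float, nodes: int, failed: int = 1) -> float:
    """Per-node utilization once the load of `failed` nodes is redistributed."""
    if nodes <= failed:
        raise ValueError("no surviving nodes")
    return current * nodes / (nodes - failed)


if __name__ == "__main__":
    print(max_safe_utilization(nodes=2))            # two nodes: 50% each
    print(max_safe_utilization(nodes=4))            # four nodes: 75% each
    print(utilization_after_failure(0.6, nodes=2))  # above 1.0 means overload
```

With two nodes, Active-Active needs as much spare capacity as Active-Passive. The utilization advantage only appears as the node count grows.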
Database failover patterns:
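- Synchronous replication: No loss of acknowledged writes (an RPO of zero), at the cost of higher write latency
- Asynchronous replication: Lower write latency, but writes that have not yet reached a replica can be lost if the primary fails
- Semi-synchronous: A compromise in which at least one replica must acknowledge each write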
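One way to think about the three modes is how many replica acknowledgements the primary waits for before it confirms a write to the client. The sketch below makes a simplifying assumption that "synchronous" means waiting for every replica. Real databases usually let you configure which replicas count, so treat the names and rules here as illustrative only.

```python
from enum import Enum


class ReplicationMode(Enum):
    SYNCHRONOUS = "synchronous"
    SEMI_SYNCHRONOUS = "semi-synchronous"
    ASYNCHRONOUS = "asynchronous"


def replica_acks_required(mode: ReplicationMode, replica_count: int) -> int:
    """Replica acknowledgements needed before the client sees success."""
    if mode is ReplicationMode.SYNCHRONOUS:
        return replica_count          # assumption: wait for all replicas
    if mode is ReplicationMode.SEMI_SYNCHRONOUS:
        return min(1, replica_count)  # at least one replica, if any exist
    return 0                          # asynchronous: do not wait at all


def can_acknowledge_write(mode: ReplicationMode, replica_count: int,
                          acks_received: int) -> bool:
    """True once enough replicas have confirmed the write."""
    return acks_received >= replica_acks_required(mode, replica_count)


if __name__ == "__main__":
    for mode in ReplicationMode:
        print(mode.value, replica_acks_required(mode, replica_count=2))
```

Any write acknowledged with zero replica confirmations exists only on the primary at that moment, which is exactly the window in which asynchronous replication can lose data.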
Automated failover components:
1. Health checks: Frequent, lightweight probes that detect failures quickly
2. Decision engine: Logic that determines when to trigger failover (avoid false positives)
3. State promotion: Automated promotion of standby to primary
4. Traffic redirection: DNS or load balancer updates to route traffic to the new primary
5. Notification: Alerting the operations team about the failover event
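The decision engine is where false positives are usually filtered out. A common approach is to require several consecutive failed probes before acting. The class below is a minimal sketch of that idea. The class name, the threshold of 3, and the 10-second interval are illustrative values, and promotion, traffic redirection, and alerting are left out.

```python
from dataclasses import dataclass


@dataclass
class FailoverDecider:
    """Triggers failover after N consecutive failed health probes."""

    failure_threshold: int = 3
    consecutive_failures: int = 0
    failed_over: bool = False

    def record_probe(self, healthy: bool) -> bool:
        """Record one probe result; return True only when failover should fire."""
        if self.failed_over:
            return False  # already failed over; do not trigger again
        if healthy:
            self.consecutive_failures = 0  # a single success resets the count
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.failed_over = True
            return True
        return False


def approx_detection_seconds(interval_s: float, failure_threshold: int) -> float:
    """Rough time to detect an outage: probe interval times threshold."""
    return interval_s * failure_threshold


if __name__ == "__main__":
    decider = FailoverDecider(failure_threshold=3)
    probes = [False, False, True, False, False, False]
    print([decider.record_probe(p) for p in probes])
    print(approx_detection_seconds(interval_s=10, failure_threshold=3))
```

Detection time is only one part of RTO. DNS TTLs, standby promotion, and client reconnects all add to it, so a lower threshold buys speed at the cost of more false alarms.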
```hcl
# AWS Route53 health check and failover routing
resource "aws_route53_health_check" "primary" {
  fqdn              = "primary.example.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
  request_interval  = 10
  measure_latency   = true

  tags = {
    Name = "primary-health-check"
  }
}

resource "aws_route53_record" "failover_primary" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "api.example.com"
  type    = "A"

  alias {
    name                   = aws_lb.primary.dns_name
    zone_id                = aws_lb.primary.zone_id
    evaluate_target_health = true
  }

  failover_routing_policy {
    type = "PRIMARY"
  }

  set_identifier  = "primary"
  health_check_id = aws_route53_health_check.primary.id
}

resource "aws_route53_record" "failover_secondary" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "api.example.com"
  type    = "A"

  alias {
    name                   = aws_lb.secondary.dns_name
    zone_id                = aws_lb.secondary.zone_id
    evaluate_target_health = true
  }

  failover_routing_policy {
    type = "SECONDARY"
  }

  set_identifier = "secondary"
}
```
Multi-Region Deployment
Multi-region deployment is the gold standard for high availability. By distributing your infrastructure across geographically separated regions, you protect against regional outages and reduce latency for global users.
Architecture components:
- Global DNS with latency-based or geo-proximity routing
- Regional Kubernetes clusters with identical configurations
- Cross-region database replication with automatic promotion
- Distributed cache (Redis Cluster or ElastiCache Global Datastore)
- Object storage replication (S3 Cross-Region Replication)
- Centralized monitoring and alerting across all regions
Key challenges:
- Data consistency across regions (CAP theorem trade-offs)
- Handling split-brain scenarios during network partitions
- Managing deployments consistently across regions (GitOps helps)
- Cost optimization — running full redundancy in multiple regions is expensive
- Testing failover regularly without impacting production
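Split-brain is commonly avoided with a majority quorum: a node may act as primary only while it can reach more than half of the cluster. The sketch below shows just that rule. It is not a consensus protocol, and the function name is illustrative.

```python
def has_quorum(reachable_members: int, cluster_size: int) -> bool:
    """True if this side of a partition holds a strict majority.

    `reachable_members` counts the local node plus every peer it can reach.
    """
    if not 0 <= reachable_members <= cluster_size:
        raise ValueError("reachable_members must be between 0 and cluster_size")
    return reachable_members > cluster_size // 2


if __name__ == "__main__":
    # A 5-node cluster split 3/2: only the 3-node side may keep a primary.
    print(has_quorum(3, 5), has_quorum(2, 5))
    # A 4-node cluster split 2/2: neither side has a majority.
    print(has_quorum(2, 4), has_quorum(2, 4))
```

Because two disjoint groups cannot both hold a strict majority, at most one side of any partition keeps accepting writes. This is also why odd cluster sizes, and three regions rather than two, are the usual recommendation.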
The key is to automate everything. Manual failover procedures are too slow and error-prone for production systems. Use Infrastructure as Code, GitOps, and automated runbooks to ensure consistent, rapid response to any failure.
```yaml
# Multi-region deployment using an Argo CD ApplicationSet
# (cluster generator: one Application per registered production cluster)
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: multi-region-api
  namespace: argocd
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            tier: production
        values:
          region: "{{metadata.labels.region}}"
  template:
    metadata:
      name: "api-{{values.region}}"
    spec:
      project: production
      source:
        repoURL: https://github.com/org/infra.git
        path: "apps/api/overlays/{{values.region}}"
        targetRevision: main
      destination:
        server: "{{server}}"
        namespace: production
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
          - CreateNamespace=true
---
# External DNS for automatic DNS record management
apiVersion: externaldns.k8s.io/v1alpha1
kind: DNSEndpoint
metadata:
  name: api-global
  annotations:
    external-dns.alpha.kubernetes.io/ttl: "60"
spec:
  endpoints:
    - dnsName: api.example.com
      recordType: A
      targets:
        - 203.0.113.10 # Region 1
        - 198.51.100.20 # Region 2
        - 192.0.2.30 # Region 3
      setIdentifier: multi-region
      providerSpecific:
        - name: aws/failover
          value: PRIMARY
```