Why Zero Downtime Deployments Matter
Every minute of deployment downtime costs money. For an e-commerce site doing ₹1 lakh/day in revenue, a 5-minute deployment window costs roughly ₹350 in lost sales. For SaaS products, downtime triggers churn and support tickets. Most enterprise applications now carry 99.9%+ uptime SLAs, so scheduled maintenance windows are no longer acceptable. This guide implements zero-downtime deployment for real applications.
The Core Concepts
All zero-downtime deployment strategies share a common principle: never have zero instances serving traffic. The implementation details vary, but the goal is always: deploy new version → verify it works → gradually shift traffic → remove old version.
Strategy 1: Rolling Deployment
Rolling deployments replace old instances with new ones incrementally. At any point, some instances run the old version and some run the new version.
How Rolling Deployment Works
- Start with N instances of v1 (e.g., 4 instances)
- Take 1 instance out of load balancer rotation
- Deploy v2 to that instance
- Health check passes → add back to rotation
- Repeat for remaining instances
- All instances now running v2
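The loop above can be sketched in code. This is a minimal simulation, not tied to any real orchestrator: `deploy` and `healthCheck` are placeholders for your platform's actual operations.

```javascript
// Rolling-deployment sketch: replace instances one at a time, so at most
// one instance is ever out of rotation. `deploy` and `healthCheck` are
// stand-ins for real orchestrator and load balancer calls.
function rollingDeploy(instances, newVersion, { deploy, healthCheck }) {
  for (const instance of instances) {
    instance.inRotation = false;        // take out of load balancer rotation
    deploy(instance, newVersion);       // install the new version
    if (!healthCheck(instance)) {
      // Failed health check: stop the rollout; remaining instances stay on v1
      throw new Error(`health check failed on instance ${instance.id}`);
    }
    instance.inRotation = true;         // add back to rotation
  }
}

// Stubbed usage: four instances, replaced one by one
const fleet = [1, 2, 3, 4].map(id => ({ id, version: 'v1', inRotation: true }));
rollingDeploy(fleet, 'v2', {
  deploy: (inst, v) => { inst.version = v; },
  healthCheck: () => true,
});
console.log(fleet.every(i => i.version === 'v2' && i.inRotation)); // true
```

Because the failed-health-check path throws before re-adding the instance, a bad build stops the rollout with three of four instances still serving v1.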
# AWS ECS rolling deployment configuration
deployment_configuration {
  deployment_circuit_breaker {
    enable   = true
    rollback = true            # Auto-rollback on failure
  }
  maximum_percent         = 200  # Allow 8 instances during deploy (4 old + 4 new)
  minimum_healthy_percent = 100  # Never go below 4 healthy instances
}
Pros: Simple, no duplicate infrastructure cost, works with any load balancer
Cons: Mixed-version traffic during rollout, harder to roll back quickly
Strategy 2: Blue-Green Deployment
Blue-green maintains two identical production environments. At any time, one (blue) serves all traffic. The other (green) is idle or receives the new deployment. Traffic switches atomically.
Blue-Green with AWS ALB
# Switch traffic from blue to green via ALB target group
aws elbv2 modify-listener \
  --listener-arn $LISTENER_ARN \
  --default-actions Type=forward,TargetGroupArn=$GREEN_TG_ARN
# Verify health, then decommission blue
# If issues: switch back to blue in seconds
aws elbv2 modify-listener \
  --listener-arn $LISTENER_ARN \
  --default-actions Type=forward,TargetGroupArn=$BLUE_TG_ARN
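The switch-and-rollback pair can be wrapped in a verify-then-switch routine. This is a sketch: `checkHealth` and `pointListenerAt` are placeholder functions standing in for real health probes and the `modify-listener` calls above, not an AWS SDK integration.

```javascript
// Blue-green cutover sketch. checkHealth/pointListenerAt are placeholders
// for real health probes and the ALB listener switch shown above.
// Blue stays running, so a failed verification rolls back in seconds.
function blueGreenCutover({ checkHealth, pointListenerAt }) {
  if (!checkHealth('green')) {
    // Green never receives traffic if it is unhealthy
    return { active: 'blue', reason: 'green failed pre-switch health check' };
  }
  pointListenerAt('green');            // atomic traffic switch
  if (!checkHealth('green')) {
    pointListenerAt('blue');           // instant rollback to blue
    return { active: 'blue', reason: 'green failed post-switch health check' };
  }
  return { active: 'green', reason: 'cutover complete' };
}

// Stubbed usage: green is healthy, so it takes all traffic
const result = blueGreenCutover({
  checkHealth: () => true,
  pointListenerAt: () => {},
});
console.log(result.active); // 'green'
```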
Pros: Instant rollback (seconds), zero mixed-version traffic, easy to test green before switching
Cons: Double infrastructure cost during deployment, database migration complexity
Strategy 3: Canary Deployment
Canary deployments route a small percentage of traffic (1–10%) to the new version first, monitoring error rates and performance before full rollout.
# Nginx weighted routing for canary
upstream backend {
    server api-v1:3000 weight=95;  # 95% to v1
    server api-v2:3000 weight=5;   # 5% to v2 (canary)
}

# After 30 min with no errors, increase canary %
upstream backend {
    server api-v1:3000 weight=50;
    server api-v2:3000 weight=50;
}

# Full cutover
upstream backend {
    server api-v2:3000;
}
Pros: Real production traffic validates new version, gradual risk exposure, statistical validation
Cons: Complex traffic routing, mixed versions in production for extended period, requires good monitoring
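The promote-or-rollback decision during a canary step can be sketched as a simple error-rate comparison. The threshold values here are illustrative only; tune them to your own traffic volume and SLOs.

```javascript
// Canary decision sketch: compare the canary's error rate against the
// baseline (v1) and decide whether to promote or roll back. The
// thresholds (1% absolute, 2x relative) are illustrative, not a standard.
function canaryDecision(baseline, canary, { maxErrorRate = 0.01, maxRatio = 2 } = {}) {
  const baseRate = baseline.errors / baseline.requests;
  const canaryRate = canary.errors / canary.requests;
  if (canaryRate > maxErrorRate || canaryRate > baseRate * maxRatio) {
    return 'rollback';   // canary is measurably worse: shift traffic back to v1
  }
  return 'promote';      // canary looks healthy: increase its traffic weight
}

// v2 at 5% traffic, performing in line with v1
console.log(canaryDecision(
  { requests: 9500, errors: 19 },   // v1: 0.2% error rate
  { requests: 500, errors: 1 }      // v2: 0.2% error rate
)); // 'promote'
```

With only 5% of traffic, the canary sample is small, which is why the document's advice to wait (e.g. 30 minutes) between weight increases matters: the comparison needs enough requests to be statistically meaningful.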
Database Migrations: The Hard Part
Zero-downtime deployment is straightforward for stateless services. Database migrations are the challenge. The key rule: database changes must be backwards-compatible with the previous application version.
The Expand-Contract Pattern
Never rename a column or drop a column in a single deployment. Use the 3-phase expand-contract pattern:
- Expand: Add the new column alongside the old one. Deploy an application version that writes to both columns.
- Migrate: Backfill the new column with data from the old column, then deploy an application version that reads from the new column (while still writing both).
- Contract: Deploy an application version that no longer references the old column, then drop the old column.
This process takes 3 deployments but never causes downtime or breaks the running application during migration.
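The expand phase is where the application code changes: every write must hit both columns so either app version sees consistent data. A sketch of that dual write, using a hypothetical `db.query` interface and a hypothetical `full_name` to `display_name` rename:

```javascript
// Expand-phase dual write for a hypothetical full_name -> display_name
// rename. The old app version reads full_name, the new one reads
// display_name, and every write updates both so neither breaks.
// `db` is a hypothetical query interface, not a specific library.
function updateUserName(db, userId, name) {
  return db.query(
    'UPDATE users SET full_name = $1, display_name = $1 WHERE id = $2',
    [name, userId]
  );
}

// After the backfill (migrate phase), reads move to the new column only:
function getUserName(db, userId) {
  return db.query('SELECT display_name FROM users WHERE id = $1', [userId]);
}

// Stubbed usage: record the SQL to show both columns are written
const issued = [];
const db = { query: (sql, params) => issued.push({ sql, params }) };
updateUserName(db, 42, 'Ada Lovelace');
console.log(issued[0].sql.includes('display_name')); // true
```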
Non-Destructive Migration Rules
- Adding a nullable column: safe to deploy while the app runs
- Adding an index: use CREATE INDEX CONCURRENTLY, which doesn't lock the table
- Renaming a column: use expand-contract; never rename directly
- Dropping a column: only after the application no longer references it in production
- Changing a column type: use expand-contract; never ALTER COLUMN TYPE directly on a live table
Health Checks: The Gate to Zero Downtime
Your application must implement health check endpoints for the load balancer to route traffic correctly:
// Express health check endpoint
app.get('/health', async (req, res) => {
  try {
    // Check database connectivity
    await db.query('SELECT 1');
    // Check external service connectivity (optional)
    // await redis.ping();
    res.json({ status: 'healthy', version: process.env.APP_VERSION });
  } catch (error) {
    res.status(503).json({ status: 'unhealthy', error: error.message });
  }
});
The load balancer routes traffic only to instances returning 200. During deployment, new instances must pass health checks before receiving traffic.
Graceful Shutdown: Handle In-Flight Requests
When a rolling deployment removes an instance from rotation, in-flight requests must complete before the process terminates:
// Node.js graceful shutdown
const server = app.listen(3000);

process.on('SIGTERM', () => {
  console.log('SIGTERM received, shutting down gracefully');
  // Stop accepting new connections; wait for in-flight requests to finish
  server.close(() => {
    // All connections closed, safe to exit
    process.exit(0);
  });
  // Force shutdown after 30 seconds if graceful close stalls
  setTimeout(() => {
    console.error('Forced shutdown after timeout');
    process.exit(1);
  }, 30000);
});
Monitoring During Deployment
Automated rollback requires monitoring these signals during deployment:
- Error rate: HTTP 5xx responses > 1% → trigger rollback
- Latency P95: 95th percentile response time > 2x baseline → investigate
- Health check failures: Any instance failing health checks → stop rollout
AWS CodeDeploy, Kubernetes deployment rollout, and GitHub Actions all support automatic rollback triggers based on CloudWatch alarms or custom metrics.
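A sketch of how the three signals combine into a single deployment gate. The thresholds mirror the list above; the metric object's shape is hypothetical, standing in for whatever your monitoring system exposes.

```javascript
// Deployment gate sketch implementing the signals listed above:
// >1% 5xx responses triggers rollback, any failing health check stops the
// rollout, and P95 latency above 2x baseline flags for investigation.
// The metrics object shape is hypothetical.
function deploymentGate(metrics, baselineP95Ms) {
  const errorRate = metrics.http5xx / metrics.requests;
  if (errorRate > 0.01) return 'rollback';
  if (metrics.failedHealthChecks > 0) return 'stop-rollout';
  if (metrics.p95LatencyMs > 2 * baselineP95Ms) return 'investigate';
  return 'continue';
}

console.log(deploymentGate(
  { requests: 10000, http5xx: 250, failedHealthChecks: 0, p95LatencyMs: 180 },
  200
)); // 'rollback' (2.5% error rate)
```

In practice this check runs repeatedly during the rollout (e.g. once per batch in a rolling deploy, or before each canary weight increase), not just once at the end.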
Frequently Asked Questions
What is zero downtime deployment?
Zero downtime deployment is a strategy that updates an application without any period where the application is unavailable. Users experience no interruption during the deployment. Common strategies include rolling deployments (gradual instance replacement), blue-green deployments (instant traffic switch), and canary releases (gradual traffic shift).
What is the difference between blue-green and canary deployment?
Blue-green deployment switches 100% of traffic from old to new version atomically — instant cutover with instant rollback. Canary deployment gradually increases traffic to the new version (1% → 10% → 50% → 100%) with monitoring between steps. Canary is safer for catching issues affecting only some users.
How do I handle database migrations with zero downtime?
Use the expand-contract pattern: add new columns/tables alongside old ones, deploy application that uses both, migrate data, then remove old columns in a later deployment. Never rename or drop columns in the same deployment that changes application code. Use CREATE INDEX CONCURRENTLY to avoid table locks.
What is a rolling deployment?
A rolling deployment replaces application instances one-at-a-time (or in batches) rather than all at once. While deploying, some instances run the old version and some run the new version. Load balancers route traffic only to healthy instances, ensuring continuity.
How do I implement zero downtime deployment on AWS?
AWS supports zero downtime via: ECS rolling deployments (with minimum_healthy_percent=100%), CodeDeploy blue-green deployments, Application Load Balancer target group switching, and Kubernetes rolling updates on EKS. Use deployment circuit breakers for automatic rollback on failures.
Implement Zero Downtime Deployments
We set up rolling, blue-green, and canary deployment pipelines with automated rollback. Never worry about deployment downtime again. Get a DevOps consultation.