CloudFront, auto-scaling, observability in practice — by Toru Hosoi
If you run a video streaming platform long enough, you learn to recognise the heavy days. A popular drama's finale has just aired, and in the minutes that follow viewers flood in to replay what they missed. Traffic multiplies several-fold almost at once.
I've been handling spikes like this for about eight years, so here's how I think about CloudFront and everything around it.
Not all spikes look the same. I deal mostly with two patterns.
The first is the predictable spike: you can read the peak from the broadcast schedule. Think drama finales or the start of a live sports broadcast. Traffic ramps up within a few minutes of the program ending, and the peak holds for 15 to 30 minutes. Pre-scaling handles these.
The second is the unpredictable spike: breaking news, or a show that suddenly trends on social media, with no advance signal. Pre-scaling can't be set up in time, so it comes down to auto-scaling response speed. This is the one that makes me nervous.
The first thing to do under spike pressure is stop requests from reaching the origin (ECS, ALB). How much CloudFront can absorb determines almost everything else.
Different content types tolerate different cache lifetimes. Don't set one blanket TTL — design per type. I own the web/API side of the platform, so my TTL layout looks like this:
# TTL design example
Images (thumbnails, etc.)
→ TTL: 3600s (1 hour)
API responses (program info, recommendations)
→ TTL: 60-300s. Updates are infrequent but freshness matters.
Static assets (CSS / JS / fonts)
→ TTL: 31536000s (1 year). Hash the filename and ship new URLs on change.
Authenticated endpoints
→ No cache (Cache-Control: no-store)
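A minimal sketch of how an origin could express the lifetimes above as Cache-Control headers, assuming the CloudFront cache policy is set up to honour origin headers. The content-type labels and the function name are hypothetical, and the API entry uses the lower bound of the 60-300s range.

```python
# TTLs in seconds, taken from the design table above.
# The keys ("image", "api", "static") are illustrative labels, not CloudFront terms.
TTL_SECONDS = {
    "image": 3600,       # thumbnails etc.
    "api": 60,           # lower bound of the 60-300s range
    "static": 31536000,  # hashed CSS / JS / fonts
}


def cache_control(content_type: str) -> str:
    """Return a Cache-Control header value for the given content type."""
    if content_type == "authenticated":
        # Authenticated endpoints must never be stored by a shared cache.
        return "no-store"
    return f"public, max-age={TTL_SECONDS[content_type]}"


print(cache_control("image"))
print(cache_control("authenticated"))
```

Keeping the mapping in one place makes it harder for a new endpoint to slip in with an accidental blanket TTL.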
CloudFront edge locations are distributed globally. Until a piece of content is cached at each edge, every cold edge goes back to the origin, so the origin can see concentrated load. Origin Shield adds one more caching layer in a region you choose and funnels origin requests through it, which cuts origin load dramatically.
For a Japan-focused service, I use ap-northeast-1 (Tokyo) as the shield region. During spikes, origin request volume drops to a fraction of what it was.
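For reference, the Origin Shield setting lives on each origin in the distribution config. The fragment below is a sketch only: the `OriginShield`, `Enabled` and `OriginShieldRegion` field names follow the CloudFront API, while the helper name, origin ID and domain are made up, and a real origin entry needs further fields that are omitted here.

```python
def origin_with_shield(origin_id: str, domain_name: str,
                       region: str = "ap-northeast-1") -> dict:
    """Build a partial CloudFront origin entry with Origin Shield enabled.

    Only the fields relevant to Origin Shield are shown; this is not a
    complete origin definition.
    """
    return {
        "Id": origin_id,
        "DomainName": domain_name,
        "OriginShield": {
            "Enabled": True,
            "OriginShieldRegion": region,  # Tokyo for a Japan-focused service
        },
    }


# Hypothetical origin ID and ALB hostname.
fragment = origin_with_shield("alb-origin", "origin.example.com")
print(fragment["OriginShield"])
```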
If query strings or cookies vary widely, the same content ends up behind many different cache keys and you get cache misses. Exclude unnecessary query params from the cache key via a cache policy.
# Query params to strip from the cache key
utm_source, utm_medium, utm_campaign # analytics tagging
_ga, fbclid # tracking identifiers
timestamp, t # cache busters (unless intentional)
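To see why this matters, here is a small sketch of the effect of stripping those parameters. In CloudFront the stripping is done by the cache policy, not by application code; the function below only illustrates how several URL variants collapse into one cache key. The function name and the sample URLs are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Parameters from the list above that should not influence the cache key.
STRIPPED_PARAMS = {
    "utm_source", "utm_medium", "utm_campaign",
    "_ga", "fbclid",
    "timestamp", "t",
}


def cache_key(url: str) -> str:
    """Return path plus the query parameters that remain in the cache key."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in STRIPPED_PARAMS
    ]
    query = urlencode(kept)
    return parts.path + ("?" + query if query else "")


variants = [
    "/programs/42?page=2",
    "/programs/42?utm_source=x&page=2&fbclid=abc",
    "/programs/42?page=2&t=1700000000",
]
# All three variants map to the same key, so they share one cached object.
print({cache_key(u) for u in variants})
```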
Relying on auto-scaling for predictable spikes means scale-out often completes after the peak has hit. ECS Fargate and EC2 scale-out takes a few minutes. Worst case, error rates creep up mid-scale.
For that reason, I schedule pre-scaling actions driven by the program guide.
Application Auto Scaling scheduled actions let you change task counts at specific times.
# Example: scale out 30 minutes before a popular show ends
# Times are assumed to be JST; without --timezone the cron expression is evaluated in UTC.
aws application-autoscaling put-scheduled-action \
  --service-namespace ecs \
  --scalable-dimension ecs:service:DesiredCount \
  --resource-id service/cluster-name/service-name \
  --scheduled-action-name "pre-scale-drama-finale" \
  --schedule "cron(30 20 15 4 ? 2025)" \
  --timezone "Asia/Tokyo" \
  --scalable-target-action MinCapacity=20,MaxCapacity=50
# Scale back in overnight
aws application-autoscaling put-scheduled-action \
  --service-namespace ecs \
  --scalable-dimension ecs:service:DesiredCount \
  --resource-id service/cluster-name/service-name \
  --scheduled-action-name "scale-in-after-drama" \
  --schedule "cron(0 2 16 4 ? 2025)" \
  --timezone "Asia/Tokyo" \
  --scalable-target-action MinCapacity=4,MaxCapacity=20
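When the schedule comes from a program guide, the cron expression can be derived instead of typed by hand. This is a minimal sketch, assuming the guide provides the program's end time as a naive datetime in the same time zone the scheduled action is evaluated in. The function name and the 30-minute default lead are illustrative.

```python
from datetime import datetime, timedelta


def pre_scale_cron(program_end: datetime, lead_minutes: int = 30) -> str:
    """Build a one-off cron expression that fires lead_minutes before program_end.

    Field order: minutes, hours, day-of-month, month, day-of-week, year.
    """
    start = program_end - timedelta(minutes=lead_minutes)
    return (
        f"cron({start.minute} {start.hour} {start.day} "
        f"{start.month} ? {start.year})"
    )


# A finale ending at 21:00 on 15 April 2025.
print(pre_scale_cron(datetime(2025, 4, 15, 21, 0)))  # cron(30 20 15 4 ? 2025)
```

Using datetime arithmetic means a program that ends shortly after midnight rolls back to the previous day correctly.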
Freshly scaled-out tasks start against a cold cache: until the hot entries are in ElastiCache, lookups fall back to the database. Build a warm-up step into the pre-scaling flow (deliberately load the key cache entries right after scale-out) and you'll keep DB load in check when the spike arrives.
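A warm-up step can be very small. The sketch below uses a plain dict in place of the cache client and a hypothetical `load_from_db` callable, so it shows the shape of the step, not a real ElastiCache integration. The key names are made up.

```python
from typing import Callable, Iterable, MutableMapping


def warm_cache(cache: MutableMapping[str, str],
               hot_keys: Iterable[str],
               load_from_db: Callable[[str], str]) -> int:
    """Load every missing hot key into the cache; return how many were loaded."""
    loaded = 0
    for key in hot_keys:
        if key not in cache:
            # Pay the database cost now, before the spike, instead of during it.
            cache[key] = load_from_db(key)
            loaded += 1
    return loaded


# Stand-ins for the real cache and database lookup.
cache = {"program:1": "cached"}
count = warm_cache(cache, ["program:1", "program:2"], lambda k: "db:" + k)
print(count, cache)
```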
For spikes that can't be pre-scaled, tuning auto-scaling is the answer. Two things matter: how fast you scale out, and how calm you stay when scaling in.
# ECS service auto-scaling design
Scale-out:
metric : CPUUtilization or ALB RequestCountPerTarget
threshold : CPU > 60% for 1 minute
action : +30% (minimum +2 tasks)
cooldown : 60s (short)
Scale-in:
threshold : CPU < 30% for 10 minutes
action : -1 task at a time
cooldown : 600s (long)
Scale in too eagerly and the residual spike bumps you back over the threshold. You end up in a scale-out → scale-in → scale-out thrash. Longer scale-in cooldowns prevent it.
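The asymmetry between scale-out and scale-in can be sketched as a single decision function. It makes several simplifying assumptions: `cpu_percent` is already averaged over the evaluation window, one timer stands in for both cooldowns, the 30% step is rounded down, and the `min_tasks` / `max_tasks` defaults are placeholders. It is a model of the design above, not how Application Auto Scaling is implemented.

```python
def next_task_count(current: int, cpu_percent: float,
                    seconds_since_last_scaling: int,
                    min_tasks: int = 4, max_tasks: int = 50) -> int:
    """Return the desired task count under the scale-out / scale-in design above."""
    if cpu_percent > 60 and seconds_since_last_scaling >= 60:
        # Scale out fast: +30% (rounded down), but never fewer than 2 tasks.
        step = max(2, current * 30 // 100)
        return min(max_tasks, current + step)
    if cpu_percent < 30 and seconds_since_last_scaling >= 600:
        # Scale in calmly: one task at a time, after a long cooldown.
        return max(min_tasks, current - 1)
    return current


print(next_task_count(10, 75, 120))  # 13: scale-out, short cooldown has passed
print(next_task_count(10, 20, 120))  # 10: CPU is low, but the 600s cooldown holds
print(next_task_count(10, 20, 900))  # 9: scale-in by a single task
```

The second call is the thrash guard: CPU has dropped, but the long scale-in cooldown keeps capacity in place while the residual spike drains.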
CPU alone can trail a spike. Triggering scale-out as soon as ALB TargetResponseTime starts climbing lets you act before users see it.
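One way to express "CPU or climbing latency" as a trigger is sketched below. The baseline response time and the 1.5x factor are assumptions chosen for illustration; the right values depend on the service.

```python
def should_scale_out(cpu_percent: float,
                     response_time_s: float,
                     baseline_response_time_s: float,
                     latency_factor: float = 1.5) -> bool:
    """Trigger on high CPU or on response time climbing well above its baseline."""
    cpu_hot = cpu_percent > 60
    latency_climbing = response_time_s > baseline_response_time_s * latency_factor
    return cpu_hot or latency_climbing


# CPU still looks fine, but latency has more than doubled: scale out anyway.
print(should_scale_out(cpu_percent=40, response_time_s=0.45,
                       baseline_response_time_s=0.2))
```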
Spikes bring more than legitimate users — bad traffic mixes in. CloudFront-attached WAF rules stop the junk at the edge.
No matter how thorough the design is, something unexpected always shows up. You need observability that catches it fast.
# CloudWatch metrics to keep eyes on
CloudFront:
- CacheHitRate (sudden drops mean origin pressure is rising)
- 5xxErrorRate (origin errors)
- TotalErrorRate (all errors)
- Requests (spike detection)
ALB:
- TargetResponseTime (early signal of degradation)
- HTTPCode_ELB_5XX_Count
- RequestCount
ECS:
- CPUUtilization
- MemoryUtilization
- RunningTaskCount (verify scaling is working)
RDS / Aurora:
- DatabaseConnections (connection exhaustion)
- CPUUtilization
- FreeableMemory
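As an example of turning one of these metrics into an alarm, the sketch below builds the parameters for a CloudFront `5xxErrorRate` alarm. The namespace, metric name and dimensions follow the CloudWatch conventions for CloudFront; the alarm name, the 1% threshold, the evaluation settings and the distribution ID are assumptions.

```python
def cloudfront_5xx_alarm(distribution_id: str,
                         threshold_percent: float = 1.0) -> dict:
    """Build keyword arguments for a CloudWatch alarm on CloudFront 5xxErrorRate."""
    return {
        "AlarmName": f"cloudfront-5xx-{distribution_id}",
        "Namespace": "AWS/CloudFront",
        "MetricName": "5xxErrorRate",
        "Dimensions": [
            {"Name": "DistributionId", "Value": distribution_id},
            {"Name": "Region", "Value": "Global"},
        ],
        "Statistic": "Average",
        "Period": 60,            # seconds per datapoint
        "EvaluationPeriods": 3,  # assumed: three consecutive breaching minutes
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
    }


# Hypothetical distribution ID.
params = cloudfront_5xx_alarm("E123EXAMPLE")
print(params["AlarmName"], params["Threshold"])
```

The dict is shaped to be passed to boto3's `put_metric_alarm(**params)` on a CloudWatch client in us-east-1, which is where CloudFront publishes its metrics.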
CloudWatch alarms fan out through EventBridge into PagerDuty. Routes depend on severity — minor things go to Slack only; serious ones page the on-call engineer directly.
During spikes, cascading alarms are common. Flatten the noise by mapping alarm dependencies: if the origin's CPU alarm is already firing, the downstream response-time alarm is expected — only the root-cause alarm should page someone.
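The dependency idea reduces to a small set operation: page only for firing alarms that have no firing upstream cause. The alarm names and the dependency map below are hypothetical.

```python
from typing import Dict, Set


def alarms_to_page(firing: Set[str], depends_on: Dict[str, Set[str]]) -> Set[str]:
    """Return the firing alarms none of whose upstream causes is also firing."""
    return {
        alarm for alarm in firing
        if not (depends_on.get(alarm, set()) & firing)
    }


# The response-time alarm is downstream of the origin CPU alarm.
depends_on = {"alb-response-time": {"ecs-cpu"}}

# Both fire: only the root cause pages.
print(alarms_to_page({"ecs-cpu", "alb-response-time"}, depends_on))
# The downstream alarm fires alone: it is now the thing to look at.
print(alarms_to_page({"alb-response-time"}, depends_on))
```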
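After eight years of this, I think the simplest ordering is:
1. Absorb as much as possible at CloudFront (per-type TTLs, Origin Shield, tight cache keys).
2. Pre-scale for the spikes you can read from the schedule.
3. Tune auto-scaling for the ones you can't.
4. Stop bad traffic at the edge with WAF.
5. Keep observability sharp enough to catch whatever is left.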
There's no single solution. It's the combination that works — drop one of them and a surprise spike will find the gap.