Designing for traffic spikes in video streaming

01

Intro.

If you run a video streaming platform long enough, you learn to recognise the heavy days. A popular drama's finale has just aired, and viewers flood in to replay what they missed. Traffic multiplies several-fold within minutes.

I've been handling spikes like this for about eight years, so here's how I think about CloudFront and everything around it.

02

Know your spike shape.

Not all spikes look the same. I deal mostly with two patterns.

Predictable spikes

You can read the peak from the broadcast schedule. Think drama finales or the start of a live sports broadcast. Traffic ramps up within a few minutes of the program ending, and the peak holds for 15 to 30 minutes. Pre-scaling handles these.

Sudden spikes

Breaking news, a show that suddenly trends on social media — no advance signal. Pre-scaling can't be set up in time, so it comes down to auto-scaling response speed. This is the one that makes me nervous.

Which pattern you're facing changes the play. The default strategy: pre-scale anything that the schedule can predict; let CloudFront caching plus auto-scaling absorb the rest.

03

Keep requests away from the origin.

The first thing to do under spike pressure is stop requests from reaching the origin (ECS, ALB). How much CloudFront can absorb determines almost everything else.

TTLs per content type

Different content types tolerate different cache lifetimes. Don't set one blanket TTL — design per type. I own the web/API side of the platform, so my TTL layout looks like this:

# TTL design example

Images (thumbnails, etc.)
  → TTL: 3600s (1 hour)

API responses (program info, recommendations)
  → TTL: 60 to 300s. Updates are infrequent, but freshness matters.

Static assets (CSS / JS / fonts)
  → TTL: 31536000s (1 year). Hash the filename and ship new URLs on change.

Authenticated endpoints
  → No cache (Cache-Control: no-store)

Short TTLs for things that change often; long TTLs for anything versioned by filename. The trick is balancing cache efficiency and freshness.
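
As a minimal sketch of the layout above, the mapping can live in one function at the origin that picks a Cache-Control header per response. The path prefixes and the name `cache_control_for` are assumptions for illustration, not the platform's actual routing, and the API value uses the low end of the 60 to 300s range.

```python
# Hypothetical path prefixes; adjust to your own URL layout.
# Order matters: the first matching prefix wins.
TTL_RULES = [
    ("/static/", "public, max-age=31536000, immutable"),  # hashed filenames, 1 year
    ("/images/", "public, max-age=3600"),                 # thumbnails, 1 hour
    ("/api/me/", "no-store"),                             # authenticated endpoints
    ("/api/", "public, max-age=60"),                      # program info, recommendations
]

# Assumption: fail closed (no caching) when no rule matches.
DEFAULT_CACHE_CONTROL = "no-store"


def cache_control_for(path: str) -> str:
    """Return the Cache-Control value for the first matching prefix."""
    for prefix, value in TTL_RULES:
        if path.startswith(prefix):
            return value
    return DEFAULT_CACHE_CONTROL
```

CloudFront applies these headers within the minimum and maximum TTL bounds of the cache policy, so the policy bounds need to allow the values you send.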

Origin Shield as a mid-tier cache

CloudFront edge locations are distributed globally. Until an object is cached at a given edge, a miss there goes back to the origin, so a cold cache under a spike concentrates load on the origin. Origin Shield adds a caching layer in a single region that all origin fetches pass through, which cuts origin load dramatically.

For a Japan-focused service, I use ap-northeast-1 (Tokyo) as the shield region. During spikes, origin request volume drops to a fraction of what it was.
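
A rough back-of-envelope model of why this helps, assuming a cold cache and that each regional cache misses once per object in a TTL window. The function name and the counts in the usage note are illustrative, not CloudFront figures.

```python
def origin_fetches(objects: int, regional_caches: int, origin_shield: bool) -> int:
    """Worst-case cold-cache origin fetches for one TTL window.

    Without Origin Shield, every regional cache may fetch each object once.
    With it, misses funnel through one shield cache, so each object is
    fetched from the origin once.
    """
    if origin_shield:
        return objects
    return objects * regional_caches
```

With 500 objects and 10 regional caches, this simplified model gives 5000 origin fetches without the shield and 500 with it.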

Cache key design

If query strings or cookies vary widely, the same content ends up behind many different cache keys and you get cache misses. Exclude unnecessary query params from the cache key via a cache policy.

# Query params to strip from the cache key

utm_source, utm_medium, utm_campaign  # analytics tagging
_ga, fbclid                            # analytics
timestamp, t                           # cache busters (unless intentional)
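
To illustrate the effect, here is a small sketch that strips those parameters before building a key. In CloudFront the equivalent is a cache policy whose query string setting excludes these names; the function `cache_key` and the example URLs are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Parameters from the list above; they do not change the response body.
IGNORED_PARAMS = {
    "utm_source", "utm_medium", "utm_campaign",
    "_ga", "fbclid",
    "timestamp", "t",
}


def cache_key(url: str) -> str:
    """Build a cache key from the path plus the query params that matter."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in IGNORED_PARAMS
    ]
    query = urlencode(kept)
    return f"{parts.path}?{query}" if query else parts.path
```

Two URLs that differ only in tracking parameters now map to the same key, so the second request is a cache hit instead of a miss.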

04

Pre-scaling: prepare for known spikes.

Relying on auto-scaling for predictable spikes means scale-out often completes after the peak has hit. ECS Fargate and EC2 scale-out takes a few minutes. Worst case, error rates creep up mid-scale.

For that reason, I schedule pre-scaling actions driven by the program guide.

Scheduled scaling for ECS services

Application Auto Scaling scheduled actions let you change task counts at specific times.

# Example: scale out 30 minutes before a popular show ends.
# Schedule times are UTC unless --timezone is set; Asia/Tokyo is assumed here.
aws application-autoscaling put-scheduled-action \
  --service-namespace ecs \
  --scalable-dimension ecs:service:DesiredCount \
  --resource-id service/cluster-name/service-name \
  --scheduled-action-name "pre-scale-drama-finale" \
  --schedule "cron(30 20 15 4 ? 2025)" \
  --timezone "Asia/Tokyo" \
  --scalable-target-action MinCapacity=20,MaxCapacity=50

# Scale back in overnight
aws application-autoscaling put-scheduled-action \
  --service-namespace ecs \
  --scalable-dimension ecs:service:DesiredCount \
  --resource-id service/cluster-name/service-name \
  --scheduled-action-name "scale-in-after-drama" \
  --schedule "cron(0 2 16 4 ? 2025)" \
  --timezone "Asia/Tokyo" \
  --scalable-target-action MinCapacity=4,MaxCapacity=20

Finish scaling out at least 30 minutes before the program ends. The extra cost is temporary; the cost of errors during a spike is far higher.
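
If the schedule comes from a program guide, the cron expression can be derived rather than typed by hand. A minimal sketch, assuming the guide gives the program end time as a naive local datetime; `pre_scale_cron` is a hypothetical helper, not part of any AWS SDK.

```python
from datetime import datetime, timedelta


def pre_scale_cron(program_end: datetime, lead_minutes: int = 30) -> str:
    """Return a one-off cron() expression lead_minutes before program_end.

    Field order is minutes, hours, day-of-month, month, day-of-week, year.
    The expression carries no timezone, so the scheduled action must be
    created with the timezone program_end is expressed in.
    """
    start = program_end - timedelta(minutes=lead_minutes)
    return f"cron({start.minute} {start.hour} {start.day} {start.month} ? {start.year})"
```

For a show ending at 21:00 on 15 April 2025, `pre_scale_cron(datetime(2025, 4, 15, 21, 0))` yields `cron(30 20 15 4 ? 2025)`. Subtracting with `timedelta` also handles shows that end just after midnight, where the scale-out falls on the previous day.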

Warming up ElastiCache

Right after scale-out, the entries the spike will ask for may not be in ElastiCache yet, so new tasks fall back to database lookups. Build a warm-up step into the pre-scaling flow that deliberately loads key cache entries after scale-out, and you'll keep DB load in check when the spike arrives.
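
A warm-up step can be as small as the sketch below. A plain dict stands in for the ElastiCache client, and `hot_keys` and `load_from_db` are hypothetical inputs you would derive from the program guide and your own data layer.

```python
from typing import Callable, Iterable, MutableMapping


def warm_cache(
    cache: MutableMapping[str, str],
    hot_keys: Iterable[str],
    load_from_db: Callable[[str], str],
) -> int:
    """Load missing hot keys into the cache; return how many were loaded."""
    loaded = 0
    for key in hot_keys:
        if key not in cache:
            # Only keys that are missing hit the database.
            cache[key] = load_from_db(key)
            loaded += 1
    return loaded
```

Running this once after scale-out, before the program ends, moves the database reads to a quiet moment instead of the peak.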

05

Auto-scaling: catching the unpredictable.

For spikes that can't be pre-scaled, tuning auto-scaling is the answer. Two things matter: how fast you scale out, and how calm you stay when scaling in.

Aggressive out, cautious in

# ECS service auto-scaling design

Scale-out:
  metric       : CPUUtilization or ALB RequestCountPerTarget
  threshold    : CPU > 60% for 1 minute
  action       : +30% (minimum +2 tasks)
  cooldown     : 60s (short)

Scale-in:
  threshold    : CPU < 30% for 10 minutes
  action       : -1 task at a time
  cooldown     : 600s (long)

Scale in too eagerly and the residual spike bumps you back over the threshold. You end up in a scale-out → scale-in → scale-out thrash. Longer scale-in cooldowns prevent it.
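
The step sizes in the design above work out as follows. This is a sketch of the arithmetic only: rounding the 30% step up is an assumption, and the function names are hypothetical.

```python
def scale_out_target(current_tasks: int, max_tasks: int) -> int:
    """Add 30% of the current count, but at least 2 tasks, capped at max_tasks."""
    # Integer ceiling of current_tasks * 0.30 (avoids float rounding surprises).
    thirty_percent = (current_tasks * 30 + 99) // 100
    step = max(2, thirty_percent)
    return min(max_tasks, current_tasks + step)


def scale_in_target(current_tasks: int, min_tasks: int) -> int:
    """Remove one task at a time, never going below min_tasks."""
    return max(min_tasks, current_tasks - 1)
```

From 4 tasks, one scale-out step goes to 6; from 20 it goes to 26. Scale-in always moves by a single task, which together with the 600s cooldown keeps capacity in place while the tail of the spike drains.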

Watch ALB TargetResponseTime

CPU alone can trail a spike. Triggering scale-out as soon as ALB TargetResponseTime starts climbing lets you act before users see it.
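
One way to express "starts climbing" is a breach check against a baseline. In practice a CloudWatch alarm does this evaluation; the sketch below only shows the logic, and the 1.5x factor and two consecutive samples are assumed values, not recommendations from this article.

```python
from typing import Sequence


def latency_breach(
    samples_ms: Sequence[float],
    baseline_ms: float,
    factor: float = 1.5,
    consecutive: int = 2,
) -> bool:
    """True when the last `consecutive` samples all exceed baseline * factor."""
    if len(samples_ms) < consecutive:
        return False
    limit = baseline_ms * factor
    return all(sample > limit for sample in samples_ms[-consecutive:])
```

Requiring more than one sample above the limit filters out a single slow minute while still reacting faster than a CPU average would.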

06

WAF to protect the origin.

Spikes bring more than legitimate users — bad traffic mixes in. CloudFront-attached WAF rules stop the junk at the edge.

Rate-based rules — cap short-window request volume per IP. Blocks scrapers and misbehaving retry loops.
AWS Managed Rules — OWASP Top 10 coverage and bot control, with far less maintenance than rolling your own.
Geo restrictions — if the service is Japan-only, block other regions to reduce wasted origin requests.
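
To make the rate-based rule concrete, here is a sliding-window counter per IP. AWS WAF implements its own windowing internally; this class is a hypothetical illustration of the idea, not WAF's algorithm, and the limit and window in the usage note are assumed values.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class RateLimiter:
    """Sliding-window request counter per client IP."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window = window_seconds
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, ip: str, now: float) -> bool:
        """Record the request and return True if the IP is under its limit."""
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

With `RateLimiter(limit=3, window_seconds=60)`, a fourth request from the same IP inside the window is rejected, while other IPs are unaffected.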

07

Monitoring: find trouble early.

No matter how thorough the design is, something unexpected always shows up. You need observability that catches it fast.

What to watch

# CloudWatch metrics to keep eyes on

CloudFront:
  - CacheHitRate         (sudden drops mean origin pressure is rising; needs additional metrics enabled on the distribution)
  - 5xxErrorRate         (origin errors)
  - TotalErrorRate       (all errors)
  - Requests             (spike detection)

ALB:
  - TargetResponseTime   (early signal of degradation)
  - HTTPCode_ELB_5XX_Count
  - RequestCount

ECS:
  - CPUUtilization
  - MemoryUtilization
  - RunningTaskCount     (verify scaling is working; published by Container Insights)

RDS / Aurora:
  - DatabaseConnections  (connection exhaustion)
  - CPUUtilization
  - FreeableMemory
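
As a small sketch of the "sudden drop" check on cache hit rate, assuming you have hit and miss counts per interval. The 10-point threshold and both function names are assumptions for illustration.

```python
def cache_hit_rate(hits: int, misses: int) -> float:
    """Cache hit rate as a percentage; 0.0 when there were no requests."""
    total = hits + misses
    return 100.0 * hits / total if total else 0.0


def hit_rate_dropped(previous_pct: float, current_pct: float,
                     max_drop_points: float = 10.0) -> bool:
    """True when the hit rate fell by more than max_drop_points."""
    return previous_pct - current_pct > max_drop_points
```

A fall from 95% to 80% between intervals would flag here; a drift from 95% to 90% would not.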

PagerDuty routing

CloudWatch alarms fan out through EventBridge into PagerDuty. Routes depend on severity — minor things go to Slack only; serious ones page the on-call engineer directly.

During spikes, cascading alarms are common. Flatten the noise by mapping alarm dependencies: if the origin's CPU alarm is already firing, the downstream response-time alarm is expected — only the root-cause alarm should page someone.
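
The dependency mapping can be sketched as a lookup from each alarm to the upstream alarm that explains it. The alarm names are hypothetical, and the check only looks at the direct upstream alarm. CloudWatch composite alarms are one native way to express the same rule.

```python
from typing import Dict, Set

# Hypothetical alarm names: each alarm maps to the upstream alarm that explains it.
DEPENDS_ON: Dict[str, str] = {
    "alb-response-time-high": "ecs-cpu-high",
    "cloudfront-5xx-high": "alb-response-time-high",
}


def alarms_to_page(firing: Set[str], depends_on: Dict[str, str] = DEPENDS_ON) -> Set[str]:
    """Keep only alarms whose direct upstream cause is not itself firing."""
    return {name for name in firing if depends_on.get(name) not in firing}
```

When all three alarms fire together, only `ecs-cpu-high` pages. If the response-time alarm fires on its own, it still pages, because nothing upstream explains it.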

The worst failure mode is alarm fatigue — so many alarms firing that the real one goes unnoticed. Review alarm design regularly.

08

Summary: priorities for spike-resilient infra.

After eight years of this, I think the simplest ordering is:

Absorb at CloudFront first — raising cache hit rate has the best return.
Prepare for what you can predict — plan scale-out far enough ahead that it finishes before the peak.
Let auto-scaling chase what you can't — scale out fast, scale in slow.
Use observability to catch what still gets through — shorten the time between "something's wrong" and "we're fixing it."

There's no single solution. It's the combination that works — drop one of them and a surprise spike will find the gap.