July 2023 – Streaming For The Poor

“It worked every time we checked…”

You have a backup stream. You haven’t checked it in six months, the credentials are probably expired, and the person who set it up left the company. The one time live went down it wouldn’t have made a difference as the failure was elsewhere.

What took you down usually wasn’t the thing you prepared for. Most backup streams are designed for one scenario: the encoder dies, the switch flips, life goes on. That’s the clean failure. It’s also the rare one. What actually happens in production is messier, slower, and harder to call. The stream is alive but broken. The dashboard is green. The audience is already gone.

The mentality

…we probably don’t need it

The primary never failed. Not last week, not last month, not at the big event in March. The encoder is solid. The CDN is enterprise-grade. The team knows what they’re doing.

Hard to argue with. It’s not laziness. It’s a reasonable reading of a track record that hasn’t included a failure yet.

The economics

…it’s not worth the extra money/effort

The backup stream has a cost. Second encoder, second ingest, second CDN, second everything if you do it properly. That’s a real line item on a budget that already has competing priorities.

The outage has a cost too. But it’s theoretical. It might never happen. And if it does, maybe it’s a short one, maybe viewers come back, maybe the client doesn’t notice. 😈

This is the math that kills backup streams. Not negligence. A rational calculation made by people who have never experienced the failure they’re deciding not to protect against.

Single points of failure

…the backup failed too

The backup encoder is on the same switch as the primary? The same venue WiFi? Pushing to the same ingest endpoint, just different stream key? When the network goes down, both go down. When the ingest malfunctions, both fail. The backup existed. It was wired to the same failure.

A backup that fails the same way as the primary isn’t a backup. But…

True separation at every layer (encoder, network, ingest, origin, CDN) is complex to build and expensive to run. Most operations can’t justify it and don’t need it.
The gaps are rarely invisible. Most engineers can tell you where the stack is fragile. The backup just never got prioritized over everything else that needed building.

Operational readiness

…somebody knows how to switch, I think

The backup stream is configured. The runbook exists somewhere. The vendor console has a button for it. But the one person who knows how it works is not on call tonight. The doc is three product iterations old. Nobody has actually switched to the backup since the initial test months ago. The failure doesn’t wait for the right person to be available.

Automated failover is worse in a sneaky way. You stop thinking about it entirely. The script will handle it. Except the detection logic was written for a clean failure, not a half-dead stream that’s technically alive. The threshold never triggers. The switch never happens. And the humans who could override it manually are asleep, unreachable, or no longer with the company.

What does backup even mean

…we have redundancy

Backup encoder? Backup ingest? Backup origin? Backup CDN? Backup player URL? Each one protects against a different failure. Most teams pick one, check the box, and call it redundancy.

A backup encoder does nothing when the CDN degrades. A backup CDN does nothing when the origin serves a stale manifest. A backup ingest does nothing when the venue network goes down. The stack has layers and each layer can fail on its own.

Knowing which layer to back up requires knowing which layer is most likely to fail for your specific setup. That answer is different for a stadium broadcast, a remote satellite uplink, and a cloud-only pipeline. There is no universal answer. There is just the layer you skipped.

Select your redundancy plan

I’ve been wanting to lay this out as a diagram for a while. Clients always assume backup is a yes/no question. It’s not. Here are your options.

0. No real backup

Cheap until it is not. One failure anywhere takes the stream down. No recovery path, no fallback. A deliberate risk acceptance, not a resilience strategy.

1. Restart-and-recover

Someone notices, someone restarts. No parallel path, just a documented procedure and an operator who knows what to do. About five minutes of downtime if everything goes right. Longer if the right person is not reachable.

2. Fire-exit backup

A separate, deliberately degraded path. Different encoder, different network, different ingest endpoint. Not a mirror – lower quality, simpler setup – but genuinely independent. When the primary goes down, viewers stay on air within about a minute. This is the minimum viable backup for anything with a real audience.

3. Mirrored backup

A full-quality duplicate of the primary path. Separate encoder, separate ingest, separate origin. The CDN is typically still shared, which remains a single point. Recovery is faster and the experience is seamless, but the cost is substantially higher.

4. Full active-active

Both paths running simultaneously. Automated failover routes to the best available path. Viewers never see a switch because there is no switch – continuity is built into the architecture. The right choice when downtime is not an option and the budget reflects that.

Let’s gamify it 🙂

Of all, economics hurts the most, we know it.

This tool gives you a hint of what option you should deploy depending on what you’re doing and how much it costs to fail. Source is here.

Month: July 2023

The backup stream nobody cared about