Content-aware encoding for the rest of us

Netflix encodes it a hundred times…

Per-title encoding is well understood by most of us, at least conceptually. Complex content needs more bits; simple content doesn’t. Encoding everything to the same ladder wastes bandwidth on talking heads and shortchanges sports. The theory is solid.

The expensive part is no longer the encoding. It’s the analysis. Finding the optimal ladder for a title means encoding it many times at different resolutions and bitrates, measuring quality at each point, and building a curve. Netflix does this. At their scale, the storage and bandwidth savings pay for that compute many times over.

For us plebs, the math is less obvious. The question isn’t whether content-aware encoding is worth doing. It’s how much analysis you can spare before it burns a hole in your budget. Especially for content that may be very unpopular.

YouTube took a cheaper path: one throwaway encode at 240p to measure complexity, one model to make the ladder decision. One pass instead of hundreds.

Can we do cheaper?

Yes, with trade-offs.

Per-category instead of per-title

You can afford to miss perfection. You can’t afford to use the obviously wrong ladder.

Let’s invent 3 categories to cover most libraries:

  1. Simple – talking heads, slides
  2. Standard – mixed content
  3. Complex – sports, concerts, high motion

Each maps to a preset ladder with different VBV caps. Classification comes from a short probe – a few clips sampled across the file, encoded at 240p, x264 veryfast. The output bitrate tells you which bucket. CRF handles the variation from there.
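
A minimal sketch of that probe step, in Python. The 240p / x264 veryfast part is from the text; the clip count, clip length, probe CRF, averaging across clips, and both bitrate thresholds are placeholders made up for illustration, so tune them against your own library.

```python
from statistics import mean

# Hypothetical thresholds, in kbps of 240p probe output. Not values from the pipeline.
SIMPLE_MAX_KBPS = 150.0
COMPLEX_MIN_KBPS = 400.0


def sample_windows(duration_s, clips=5, clip_len_s=10.0):
    """Return (start, length) windows spread evenly across the file."""
    if duration_s < (clips + 1) * clip_len_s:
        return [(0.0, float(duration_s))]  # short file: probe all of it
    step = duration_s / (clips + 1)
    return [(round(step * (i + 1), 2), clip_len_s) for i in range(clips)]


def probe_cmd(src, start_s, length_s, out):
    """ffmpeg command for one probe clip. CRF 23 is an assumption."""
    return [
        "ffmpeg", "-v", "error", "-y",
        "-ss", str(start_s), "-t", str(length_s), "-i", src,
        "-an", "-vf", "scale=-2:240",
        "-c:v", "libx264", "-preset", "veryfast", "-crf", "23",
        out,
    ]


def kbps(size_bytes, length_s):
    """Average bitrate of an encoded clip, from its file size."""
    return size_bytes * 8 / 1000 / length_s


def classify(clip_kbps):
    """Map the mean probe bitrate to a bucket. Using the mean is an assumption."""
    avg = mean(clip_kbps)
    if avg < SIMPLE_MAX_KBPS:
        return "simple"
    if avg >= COMPLEX_MIN_KBPS:
        return "complex"
    return "standard"
```

Run each command from probe_cmd with subprocess, read the output file size, feed the per-clip kbps values to classify. Taking the max instead of the mean would bias towards the harder bucket, at the cost of over-provisioning more titles.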

CRF

CRF targets quality, not bitrate. The encoder spends more bits on complex frames and less on simple ones – the crowd scene and the post-match interview both get what they need without you predicting the right bitrate for either.

We settled on CRF 21/23/25 across the three buckets, give or take a point depending on source. (Yes, that's libx264. AV1 saves more, but the cost-benefit runs the wrong way for content nobody watches.)

The catch: uncapped CRF isn’t streaming-safe. Pair it with VBV caps (maxrate + bufsize) to put a ceiling on spending. When content hits that ceiling, quality drops below your target. Predictable delivery in exchange for occasional quality concessions on the hardest content.
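
Roughly what capped CRF looks like as ffmpeg arguments, sketched in Python. The -crf, -maxrate and -bufsize flags are real libx264 options; the ladder heights, the maxrate values, the bufsize of 2x maxrate, and which bucket gets which of the three CRF values are all assumptions for illustration.

```python
from pathlib import Path

# Hypothetical presets. CRF 21/23/25 comes from the text; which bucket gets
# which value, and every (height, maxrate_kbps) pair, are made up.
PRESETS = {
    "simple":   {"crf": 21, "rungs": [(1080, 3000), (720, 1800), (480, 900)]},
    "standard": {"crf": 23, "rungs": [(1080, 5000), (720, 3000), (480, 1500)]},
    "complex":  {"crf": 25, "rungs": [(1080, 8000), (720, 5000), (480, 2500)]},
}


def encode_cmd(src, out, height, crf, maxrate_kbps, buf_factor=2.0):
    """CRF sets the quality target; maxrate + bufsize put the VBV ceiling on it."""
    bufsize_kbps = int(maxrate_kbps * buf_factor)  # 2x maxrate is an assumption
    return [
        "ffmpeg", "-y", "-i", src,
        "-vf", f"scale=-2:{height}",
        "-c:v", "libx264", "-crf", str(crf),
        "-maxrate", f"{maxrate_kbps}k", "-bufsize", f"{bufsize_kbps}k",
        out,  # audio is left at ffmpeg defaults here
    ]


def ladder_cmds(src, bucket):
    """One encode command per rung of the bucket's preset ladder."""
    preset = PRESETS[bucket]
    stem = Path(src).stem
    return [
        encode_cmd(src, f"{stem}_{height}p.mp4", height, preset["crf"], maxrate)
        for height, maxrate in preset["rungs"]
    ]
```

A smaller bufsize clamps spikes harder and makes segment sizes more predictable; a larger one lets CRF breathe. Where you land depends on how much your player's buffer can absorb.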

Hit and miss

The probe samples a few clips. It might miss the hard parts. A documentary that’s mostly archive footage with 20 minutes of field action could probe as simple and get encoded against the wrong caps.

That’s the gamble. Most titles will probe correctly. Some won’t. You’re accepting that error in exchange for a probe that costs almost nothing.

See for yourself

👉here

Everything above, in one script. Pulled from a pipeline that’s been quietly doing its job for two years without anyone noticing. Which is the point.

But wait, there’s more!

After each encode, it runs a quick VMAF pass against the source to check whether the guess was right. Later, you can decide to re-encode with the more appropriate setting or just leave it as is, presumably based on compute availability and the title's popularity. How's that for a treat?
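
A sketch of what that check could look like, not the script's actual code. It assumes an ffmpeg build with libvmaf enabled; the subsampling interval, the VMAF floor, and the popularity cutoff are hypothetical numbers, and the function names are mine.

```python
def vmaf_cmd(encoded, source, src_w, src_h, log_path, subsample=5):
    """ffmpeg command scoring `encoded` against `source` with the libvmaf filter.

    The encode is scaled back up to the source size first. n_subsample scores
    every Nth frame, which is what keeps the pass quick.
    """
    graph = (
        f"[0:v]scale={src_w}:{src_h}:flags=bicubic[dist];"
        f"[dist][1:v]libvmaf=n_subsample={subsample}:log_fmt=json:log_path={log_path}"
    )
    return [
        "ffmpeg", "-v", "error",
        "-i", encoded, "-i", source,
        "-lavfi", graph, "-f", "null", "-",
    ]


def should_reencode(vmaf_mean, views_30d, spare_compute,
                    vmaf_floor=90.0, min_views=1000):
    """Re-encode only when the miss is real, the title is watched, and compute is idle.

    A score under the floor suggests the VBV caps were too tight for the
    content, i.e. the probe put the title in too simple a bucket.
    """
    missed = vmaf_mean < vmaf_floor
    return missed and spare_compute and views_30d >= min_views
```

Only the "too simple" miss shows up this way. A title bucketed too complex still hits its CRF target, so VMAF looks fine and nothing flags it.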

The number you have to lie about

This is the one that bites quietly. A variable-bitrate rung still advertises one fixed bitrate in the manifest, but actual delivery swings under the VBV cap. You have to pick that one number, and the cap and the average fail in opposite directions.

Advertise the cap and you name a bitrate the stream rarely hits. A viewer who could comfortably carry the real average reads as too weak for the rung and gets parked one below it. Quality you could have shipped, left on the table.

Advertise the average and the rung looks cheaper than its worst case. The viewer takes it, a hard scene climbs to the cap, the segment balloons, and the buffer drains into a stall.
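
The two failure modes in numbers, as a toy model in Python. The bitrates, the segment length, the viewer's throughput, and the "fits within 80% of measured throughput" selection rule are all assumptions picked to make the arithmetic visible; real players are smarter than this.

```python
def selectable(advertised_kbps, throughput_kbps, safety=0.8):
    """Toy ABR rule: take a rung if it fits within a safety margin of throughput."""
    return advertised_kbps <= throughput_kbps * safety


def buffer_delta_s(segment_kbps, segment_s, throughput_kbps):
    """Seconds of buffer gained (+) or lost (-) while fetching one segment."""
    download_s = segment_kbps * segment_s / throughput_kbps
    return segment_s - download_s


AVG_KBPS, CAP_KBPS = 2500, 4500   # hypothetical rung: real average vs VBV cap
SEGMENT_S, LINK_KBPS = 6, 3600    # hypothetical segment length and viewer throughput

# Advertise the cap: the viewer is judged too weak and never gets the rung,
# even though an average segment would have grown their buffer.
cap_advertised_ok = selectable(CAP_KBPS, LINK_KBPS)                # False
typical_delta = buffer_delta_s(AVG_KBPS, SEGMENT_S, LINK_KBPS)     # about +1.83 s

# Advertise the average: the viewer takes the rung, then a hard scene runs at
# the cap and every such segment costs buffer.
avg_advertised_ok = selectable(AVG_KBPS, LINK_KBPS)                # True
hard_scene_delta = buffer_delta_s(CAP_KBPS, SEGMENT_S, LINK_KBPS)  # -1.5 s
```

At -1.5 s per segment, a viewer holding 12 s of buffer stalls after eight hard segments in a row, which is 48 seconds of sustained action. This treats a hard segment as averaging exactly the cap, which is a simplification of how VBV behaves.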

Worth instrumenting if your audience is bandwidth-constrained: watch rebuffer ratio, not latency averages.

Does it scale?

Yes, but watch where the cost actually lives. The probe is fixed-cost per title and tiny, so catalog size is irrelevant: ten titles or ten million, same per-unit cost. What doesn’t scale flat is the optional re-encode-on-miss loop, because misses cluster in exactly the complex content that’s most expensive to re-encode. The cheap part scales with your library. The expensive part scales with your error rate on your hardest titles.
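
A back-of-envelope model of that split, in Python. Every number here is invented: the share of the catalog per true complexity class, the miss rates, and the CPU-hour costs are placeholders to show the shape of the bill, not measurements.

```python
# Per true complexity class: (share of catalog, probe miss rate, re-encode CPU-hours).
# All values are made up for illustration.
CLASSES = {
    "simple":   (0.6, 0.02, 1.0),
    "standard": (0.3, 0.05, 2.0),
    "complex":  (0.1, 0.20, 6.0),
}
PROBE_CPU_HOURS = 0.01  # hypothetical flat cost per title


def catalog_cost(titles):
    """Return (total probe cost, re-encode cost per class), in CPU-hours."""
    probe = titles * PROBE_CPU_HOURS
    reencode = {
        name: titles * share * miss_rate * cost
        for name, (share, miss_rate, cost) in CLASSES.items()
    }
    return probe, reencode


probe, reencode = catalog_cost(10_000)
worst = max(reencode, key=reencode.get)  # the class that dominates the re-encode bill
```

With these made-up inputs, a tenth of the catalog accounts for most of the re-encode bill, and that bill is several times the cost of probing everything. Halving the miss rate on complex titles moves the total more than anything you could do to the probe.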

Does this work for live?

No. The probe needs the file to exist first, and live has no file. For live you classify up front by channel or event metadata: a sports channel gets the complex ladder, a webinar channel gets simple. No probe involved. The probe-then-encode mechanism is VOD only.

Why only three categories?

CRF already absorbs frame-level variation inside each bucket, so the bucket’s only job is to set the VBV ceiling, and a ceiling is coarse by nature. More buckets means more preset ladders to tune and maintain for diminishing return. Three is roughly where the curve flattens.

Isn’t this just per-title with extra steps?

The savings don’t come from nailing the optimal ladder per title. They come from the long tail of simple titles that a single fixed ladder over-provisions. You’re not chasing Netflix’s quality curve; you’re refusing to ship sports bitrates to slideshows.
