Hardware/software transcoding – a hybrid low-cost approach

So this would be a significant savings, 5X.


It’s commonly understood that ‘software encoding’ means encoding video on a general-purpose device and CPU. It is resource intensive but very flexible and easy to deploy.

Conversely, ‘hardware encoding’ requires specialized equipment. Rather than running complex algorithms, it makes use of dedicated chips to either do the whole encoding or offload the most critical jobs like motion estimation and entropy coding.

Without trying to be exhaustive, let’s touch on the differences and trade-offs between the two. I promise it won’t take long:

  • Performance: given the same encoding settings, a hardware encoder will almost always be faster and more power efficient; that’s because components are designed specifically for such tasks, making the process highly economical
  • Flexibility: a hardware encoder can usually only output a specific subset of codecs and formats; the logic for these has been ‘burned’ into the hardware and can’t be changed by rewriting software
  • Cost: the hardware needs to be purchased or rented upfront, so hardware transcoding will be more expensive at first; expect it to break even and eventually pay for itself if you use it long enough
  • Form and size: devices vary a lot, but expect hardware encoders to be smaller, slimmer, and more portable than their software counterparts
  • Reliability: every device is different, yet hardware encoders are designed to run that one specific job without interruption, so they should crash less often
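
To put a number on the ‘break even’ point from the cost bullet above, here is a minimal sketch. The function name and all figures are made up for illustration; plug in your own quotes.

```python
import math
from typing import Optional


def break_even_months(upfront_cost: float,
                      monthly_cost_software: float,
                      monthly_cost_hardware: float) -> Optional[int]:
    """Months until the upfront hardware spend is recovered.

    All inputs are hypothetical: upfront_cost is what the hardware costs to
    buy, the two monthly figures are what it costs to run the same workload
    in software vs. on that hardware.
    """
    monthly_saving = monthly_cost_software - monthly_cost_hardware
    if monthly_saving <= 0:
        # The hardware never pays for itself if it is not cheaper to run.
        return None
    return math.ceil(upfront_cost / monthly_saving)


# Illustrative numbers only, not taken from any real price list.
print(break_even_months(12000, 1500, 500))
```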

But wait, this piece is about transcoding, not encoding. That is, the kind you employ for adaptive bitrate, on the server or in the cloud, for a bunch of streams at the same time. And while the variety of consumer encoder devices far exceeds that of rack-mounted transcoders, there are still a few to choose from.

GPUs

…started to offer video encoding capabilities maybe 15 years ago. Lack of flexibility aside, the ability to offload the encode/transcode workload of a computer from CPU to GPU is very much welcome. That is, when it’s your computer and you’re broadcasting one stream towards Twitch or YouTube Live.

In a server environment the magic quickly fades. The coolest (i.e. most expensive) GPUs can encode 10+ streams at the same time; but that’s not their best asset, it’s just something they can also do. And since they’re also good for many other tasks (including crypto mining), their price matches the hardware complexity and demand.

The best implementations I’ve seen will get you a 3-4 fold increase in transcoding capacity compared to CPU when stacking multiple GPUs in the same computer, yet the cost effectiveness of such setups is debatable.

FPGAs

…are devices that perform specialized tasks deeper in the hardware; in the case of video encoding, they can take performance to the next level both in terms of speed and number of streams they can process in parallel. As you can expect, they’re more expensive than GPUs but will eventually be worth it if you need to transcode a lot. And it’s these that Twitch itself transcodes on nowadays, along with Theo and many others.

Of course, it’s a whole different scenario depending on whether:

  • A. you need to transcode a fixed number of streams 24/7, or
  • B. your stream count fluctuates dramatically from day to night or during weekends

In case B, your mid-term investment in GPUs or FPGAs for transcoding could be wasted if you’re unable to fully utilize them. What if you could rent them by the minute instead? 🙂
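
To see how much under-utilization hurts, here is a rough sketch of the effective price of one stream-hour on a machine that is only partly busy. The function name and the numbers are assumptions for illustration, not real prices or capacities.

```python
def effective_cost_per_stream_hour(hourly_cost: float,
                                   capacity: int,
                                   average_load: float) -> float:
    """Cost of one stream-hour when only part of the capacity is used.

    hourly_cost  -- what the machine costs per hour (owned or rented)
    capacity     -- how many streams it can transcode at once
    average_load -- how many streams it actually transcodes on average
    """
    if capacity <= 0 or not 0 < average_load <= capacity:
        raise ValueError("need 0 < average_load <= capacity")
    utilization = average_load / capacity
    cost_at_full_load = hourly_cost / capacity
    return cost_at_full_load / utilization


# Hypothetical machine: 10.0 per hour, room for 20 streams.
print(effective_cost_per_stream_hour(10.0, 20, 20))  # scenario A, always full
print(effective_cost_per_stream_hour(10.0, 20, 5))   # scenario B, mostly idle
```

In this made-up case, running at a quarter of the capacity makes every stream-hour four times as expensive, which is exactly why per-minute rental is attractive for scenario B.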

Cloud Offerings

Yep, many of the clouds offer GPU-equipped virtual (and real) servers, and lately AWS has instances equipped with FPGAs specialized in video transcoding. It’s these that we’ll max out to squeeze the most ABR transcoding out of every buck, hold tight.

What’s in the box?

The virtual server comes in 3 sizes and is bundled with the drivers and software needed to start experimenting right away. If you’re at least a bit handy with ffmpeg or GStreamer, you’ll feel right at home. Take this one for a spin:

ffmpeg -c:v mpsoc_vcu_h264 -i input.mp4 \
  -filter_complex " \
    multiscale_xma=outputs=4: \
    out_1_width=1280: out_1_height=720: out_1_rate=full: \
    out_2_width=848:  out_2_height=480: out_2_rate=half: \
    out_3_width=640:  out_3_height=360: out_3_rate=half: \
    out_4_width=288:  out_4_height=160: out_4_rate=half  \
    [a][b][c][d]; [a]split[aa][ab]; [ab]fps=30[abb]" \
  -map "[aa]"  -b:v 4M    -c:v mpsoc_vcu_h264 -f mp4 -y ./scaled_720p60.mp4 \
  -map "[abb]" -b:v 3M    -c:v mpsoc_vcu_h264 -f mp4 -y ./scaled_720p30.mp4 \
  -map "[b]"   -b:v 2500K -c:v mpsoc_vcu_h264 -f mp4 -y ./scaled_480p30.mp4 \
  -map "[c]"   -b:v 1250K -c:v mpsoc_vcu_h264 -f mp4 -y ./scaled_360p30.mp4 \
  -map "[d]"   -b:v 625K  -c:v mpsoc_vcu_h264 -f mp4 -y ./scaled_288p30.mp4
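
If you’d rather generate the scaler part of that command than hand-edit it, the sketch below assembles the same kind of `multiscale_xma` string from a list of renditions. It only builds text and does not run ffmpeg; the option names (`outputs`, `out_N_width`, `out_N_height`, `out_N_rate`) are copied from the command above, while the helper name `multiscale_filter` is my own.

```python
from typing import List, Sequence, Tuple

# (width, height, rate) where rate is "full" or "half", as in the command above
Rendition = Tuple[int, int, str]


def multiscale_filter(renditions: Sequence[Rendition],
                      labels: Sequence[str]) -> str:
    """Build a multiscale_xma filter string with one output pad per rendition."""
    if len(renditions) != len(labels):
        raise ValueError("need exactly one label per rendition")
    parts: List[str] = [f"multiscale_xma=outputs={len(renditions)}"]
    for index, (width, height, rate) in enumerate(renditions, start=1):
        if rate not in ("full", "half"):
            raise ValueError(f"unexpected rate: {rate}")
        parts.append(f"out_{index}_width={width}")
        parts.append(f"out_{index}_height={height}")
        parts.append(f"out_{index}_rate={rate}")
    pads = "".join(f"[{label}]" for label in labels)
    return ":".join(parts) + pads


ladder = [(1280, 720, "full"), (848, 480, "half"),
          (640, 360, "half"), (288, 160, "half")]
print(multiscale_filter(ladder, ["a", "b", "c", "d"]))
```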

Cool, huh? In case I forgot to mention, there’s also a hardware decoder and a scaler in there, so the above uses next to no CPU. At a glance, it flows like this:

As you can see, the raw video (after decode and before re-encode) does not flow through your main computer’s pipeline but through the FPGA’s, allowing for lower latency than a common CPU transcode.

Going the extra mile

So we have a computer equipped with a high performance hardware transcoder. That transcoder itself has a limited number of streams and resolutions it can transcode at the same time.

However, that FPGA physically lives in a computer with lots of CPU to spare. Can we use that to software transcode even more? 😈 Like this:

Turns out that’s possible. The lower resolutions in particular won’t take as big a toll on the CPU, so we can software transcode these and leave the big ones to the hardware. There’s a slight latency penalty, as raw video needs to be moved from the FPGA to the CPU pipeline, but let’s assume we can afford that.
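
One way to picture the split is the sketch below: sort the ladder from biggest to smallest, hand renditions to the hardware until its budget runs out, and give everything smaller to the CPU. Using pixels per second as the measure of load, and the budget figure itself, are my assumptions here; the real limits depend on the card and the encoder settings.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Rendition:
    width: int
    height: int
    fps: int

    @property
    def pixel_rate(self) -> int:
        """Pixels per second, used here as a rough proxy for encoding load."""
        return self.width * self.height * self.fps


def split_ladder(ladder: Sequence[Rendition],
                 hw_budget: int) -> Tuple[List[Rendition], List[Rendition]]:
    """Give the biggest renditions to hardware, the rest to software."""
    hardware: List[Rendition] = []
    software: List[Rendition] = []
    remaining = hw_budget
    for rendition in sorted(ladder, key=lambda r: r.pixel_rate, reverse=True):
        # Once one rendition overflows the budget, all smaller ones go to CPU.
        if not software and rendition.pixel_rate <= remaining:
            hardware.append(rendition)
            remaining -= rendition.pixel_rate
        else:
            software.append(rendition)
    return hardware, software


# Same ladder as the ffmpeg example; the budget is a hypothetical number.
ladder = [Rendition(1280, 720, 60), Rendition(1280, 720, 30),
          Rendition(848, 480, 30), Rendition(640, 360, 30),
          Rendition(288, 160, 30)]
hw, sw = split_ladder(ladder, hw_budget=100_000_000)
print("hardware:", hw)
print("software:", sw)
```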

The nugget

…is here. It spins up an FPGA-enabled cloud virtual server and simply transcodes using the technique outlined above. Compared to using just the dedicated hardware, it manages to transcode (up to, depending on use case):

  • 25% more streams
  • 56% more renditions
  • 27% more pixels 

…while paying nothing more. 😎

Does it scale?

The solution works best for a fixed, well-known set of live input streams running 24/7. The main point of this research is to optimize cost, and running significantly fewer streams than a given machine supports will compromise this goal.

Nevertheless, a scaling solution can be imagined by making use of autoscaling, load balancing (for RTMP ingress) and an ‘origin resolver’ similar to this.

Is it stable?

I see no good reason for it to be less stable than the hardware-only approach. Yet I have never run this in production. The idea struck me a couple of years back while working on a large transcoding project, but it was not applicable there. I have found no context to apply it since. If anyone beats me to it, I’m waiting for feedback.

Does it fit my project?

By now, you probably have a sense of it, but if not, just give it a shot.

A quick heads-up, though: it won’t qualify if you use/require any of the following:

  • maximum resolution exceeding 4096 x 2160
  • very low latency – muxing alone (from 2 separate encoders) adds in excess of 100ms to the lag
  • user-generated content
    • some video profiles fail to decode with hardware acceleration
    • resolutions where width or height is not a multiple of 4 are not supported
  • advanced encoding settings
  • variable/unexpected video dimensions
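
The two hard limits on dimensions from the list above are easy to check before a stream ever reaches the transcoder. A minimal sketch follows; the function name is hypothetical, and reading the 4096 x 2160 limit as a per-dimension cap is my assumption.

```python
from typing import List

MAX_WIDTH = 4096
MAX_HEIGHT = 2160


def unsupported_reasons(width: int, height: int) -> List[str]:
    """Return the reasons an input size would be rejected (empty if it is fine)."""
    reasons: List[str] = []
    # Assumption: the limit applies to width and height separately.
    if width > MAX_WIDTH or height > MAX_HEIGHT:
        reasons.append("resolution exceeds 4096 x 2160")
    if width % 4 != 0 or height % 4 != 0:
        reasons.append("width and height must both be multiples of 4")
    return reasons


for size in [(1920, 1080), (854, 480), (7680, 4320)]:
    print(size, unsupported_reasons(*size) or "ok")
```

It says nothing about codec profiles or encoder settings, which would need a real probe of the stream.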
