Self-contained AI voice assistant – free and open source

We can’t use OpenAI

Ok, we’re peeking outside the box on this one. But hey, we all have to get into AI or get run by it, right? This post was written by a human.

Long story short, the client was looking for a novel approach to home (read hotel room) automation to impress the investors. It had to make extensive use of AI or it wouldn’t fly.

A couple of hours of brainstorming later, we settled on a voice-driven solution that would allow users (read hotel guests) to manage the lights, curtains, and entertainment system, but also serve as a friendly assistant for information or casual conversation, similar to ChatGPT.

The Constraints

This could not use a public service like OpenAI or Anthropic for multiple reasons, one of which being the need to ensure the privacy of user queries. The ability to deploy the entire GPT “brain” on-site (read hotel basement) was also a significant selling point. 😊

Furthermore, there needed to be strict control over which GPT model was utilized for specific deployments, along with the capability to introduce biases (political, religious, commercial) into some instances. Agreed, that's wildly controversial, but for the scope of this story let's see it as just another selling point.

Finally, it needed to be open source or at least not associated with the major industry players, allowing them to take pride in having developed their own solution.

Challenges

Hardware and Resource Demands

Running the voice-text-voice stack requires significant computational power, especially for speech recognition and (text) response generation. While GPUs are ideal for handling these demands, they aren't always available in every environment. This makes optimizing systems for mixed hardware setups critical.

Balancing performance across varied infrastructure, whether with or without GPUs, involves careful tuning to maintain responsiveness and ensure the assistant works predictably even in resource-constrained scenarios.
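To make that concrete, here's a minimal sketch of how a deployment script might pick model sizes and devices based on what the host offers. The libraries (faster-whisper for speech recognition, llama-cpp-python for text generation) and the model file names are just examples of open-source building blocks that fit this kind of stack, not necessarily what was deployed.

import torch                                   # used only to probe for a CUDA-capable GPU
from faster_whisper import WhisperModel
from llama_cpp import Llama

HAS_GPU = torch.cuda.is_available()

# speech-to-text: load the bigger, heavier model only when a GPU is around
stt = WhisperModel(
    "large-v3" if HAS_GPU else "small",
    device="cuda" if HAS_GPU else "cpu",
    compute_type="float16" if HAS_GPU else "int8",
)

# text generation: offload all layers to the GPU, or none at all
llm = Llama(
    model_path="models/assistant-q4.gguf",     # hypothetical quantized model file
    n_gpu_layers=-1 if HAS_GPU else 0,
)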

Open-source Complexities

Going full open-source is like trading a smooth highway for a bumpy back road. These tools often require extra tinkering and development compared to polished commercial options.

Without the reliability and support of paid alternatives, teams must invest additional time in troubleshooting, integration, and ongoing refinement to ensure smooth operation and high performance.

Deployment and Scalability

Implementing this at scale poses challenges beyond computing power. As more users connect, maintaining seamless performance becomes increasingly complex. The system requires careful tuning to ensure efficiency and responsiveness, turning scalability into a juggling act.

In a self-hosted setup, you don't have the flexibility of cloud-based scaling to adjust costs dynamically. However, given the hotel environment, occupancy is well known, which gives you a good idea of the maximum expected demand. This insight lets you cut down on infrastructure power during low-occupancy periods, making better use of resources and keeping things efficient.
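As a back-of-the-envelope illustration, sizing the inference pool from occupancy could be as simple as the sketch below; the concurrency ratio and per-worker capacity are made-up numbers you'd have to measure for your own deployment.

import math

def workers_needed(occupied_rooms: int,
                   peak_concurrency: float = 0.10,   # assumed share of rooms talking to the assistant at once
                   sessions_per_worker: int = 4) -> int:
    """How many inference workers to keep warm for the current occupancy."""
    concurrent = occupied_rooms * peak_concurrency
    return max(1, math.ceil(concurrent / sessions_per_worker))

# e.g. 120 occupied rooms -> 12 concurrent sessions -> 3 workers
print(workers_needed(120))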

Customization and bias

I won’t go into the ethics of it here. But technically, it can be challenging to determine how strong a bias can be before it becomes disruptive. Understanding how subtle biases affect user interactions requires careful analysis and testing to ensure a balance between effective customization and user experience.

And they agreed to make it public

Well, some of it.

Here is a refined version of the pitch that was presented to the investors, and they seem to have liked it. It’s fully stripped of the parts where you can control everything around you (lights, blinds, tv, room service, laundry, concierge) as they’re looking to patent portions of that.

Yet it’s a good starting point for any rookie AI endeavor. Since you can get it up and running it in some 20min, you’ll know sooner rather than later if it’s of any use for your project.

More challenges

Data Privacy and Security

Be careful with user voice data, as it could contain sensitive information. Encrypt everything in transit, securely store interaction history, and purge it upon checkout. However, if you aim to implement a “Welcome back, Mr. Anderson” kind of greeting, things become more complex.
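The “purge at checkout” part can be as blunt as the sketch below; the storage layout and the checkout hook are assumptions, not the client's actual schema.

import shutil
from pathlib import Path

GUEST_DATA = Path("/srv/assistant/guests")       # hypothetical on-site storage root

def purge_guest_history(room_id: str) -> None:
    """Drop transcripts, audio snippets, and preferences tied to a stay."""
    room_dir = GUEST_DATA / room_id
    if room_dir.exists():
        shutil.rmtree(room_dir)                  # irreversible on purpose

# wired to whatever the property-management system fires on checkout, e.g.:
# purge_guest_history("0412")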

Model Optimization

Optimizing heavy AI models can be tricky, but it’s about making them faster and less demanding on resources. You can tweak them using techniques like compressing data (quantization), trimming unnecessary parts (pruning), or switching to smaller models that still get the job done. It’s all about finding ways to keep things running smoothly without losing too much accuracy.
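For a feel of the numbers: a 7B-parameter model at 16-bit weights needs roughly 7e9 × 2 bytes ≈ 14 GB just for the weights, while a 4-bit quantized build of the same model drops to roughly 4 GB. A minimal sketch of loading the two variants with llama-cpp-python, assuming hypothetical GGUF file names:

from llama_cpp import Llama

# full-precision build: best quality, needs a serious GPU
full = Llama(model_path="models/assistant-7b-f16.gguf")

# 4-bit quantized build: much smaller and faster, with a slight accuracy loss
compact = Llama(model_path="models/assistant-7b-q4_k_m.gguf")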

Latency and User Experience

Reducing response times is essential for user satisfaction. One great feature of many models is their ability to start generating a response before the entire output is finalized. Similarly, when converting text to speech, you should begin generating audio as soon as text starts coming through, allowing for a smoother, more immediate experience.
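A minimal sketch of that hand-off: flush text to the speech synthesizer at sentence boundaries instead of waiting for the full reply. Here generate_tokens and speak are stand-ins for whatever LLM and TTS engines you wire up.

def stream_reply(prompt: str, generate_tokens, speak) -> None:
    """Pipe LLM output into TTS sentence by sentence."""
    buffer = ""
    for token in generate_tokens(prompt):        # tokens arrive as the model produces them
        buffer += token
        if buffer.rstrip().endswith((".", "!", "?")):
            speak(buffer.strip())                # audio for this sentence starts playing now
            buffer = ""
    if buffer.strip():
        speak(buffer.strip())                    # flush whatever is left over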

Adaptation to Multilingual Settings

This is a topic in itself and there’s a lot to be said, but let’s try to stay on point.

  • The availability of English content far outweighs that of other languages, so English models are way better than any other.
  • Training on multilingual data will output a total mess; don't do it.
  • Translating to/from an English model is acceptable up to a point; you will lose coherence, sentiment, and idiomatic expressions that a native model might otherwise convey.
  • In multicultural environments, users often switch between languages in conversation; handling mixed language input adds an extra layer of complexity to model training and deployment.

The agreed approach for the case in point was to just ask the user for the interaction language upfront, much like you see on in-flight entertainment systems. This keeps things simple and avoids traps. Native models are preferred, but translations may be acceptable in some contexts. Extra languages can be added for particular deployments.
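A minimal sketch of that setup: each offered language maps to a native model where one exists, with a translate-through-English fallback otherwise. Model and voice names are placeholders.

LANGUAGE_MODELS = {
    "en": {"llm": "assistant-en.gguf", "tts_voice": "en_female_1"},
    "de": {"llm": "assistant-de.gguf", "tts_voice": "de_female_1"},
    "ro": {"llm": None,                "tts_voice": "ro_male_1"},  # no native model: translate to/from English
}

def config_for(lang_code: str) -> dict:
    # an unknown choice falls back to English, much like the seat-back screen does
    return LANGUAGE_MODELS.get(lang_code, LANGUAGE_MODELS["en"])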

Save face, use green energy

Sooner or later, customers and their customers become aware that this whole AI thing is power-hungry. So we need to preemptively save them from the environmental guilt trip. Be sure to highlight that the tech wizardry is powered by renewable energy, or by whatever other eco-conscious practices you may implement.

User Trust and Transparency

You’ll want to be able to show what tech is in play and exactly how data’s handled—especially if the press or regulators come asking. Keeping things clear and upfront helps people feel secure about their data and builds trust all around.

Hallucinations

It’s happened to you already, I’m sure. A generative AI spitting out nonsensical, fabricated, or downright stupid content.

While models are slowly improving on this too, we kinda learned to live with it and to use common sense and critical thinking (at least I like to hope that's the case) when querying ChatGPT. But when a concierge voice bot confidently puts you in a cab to a restaurant that doesn't exist, it becomes a problem.

To overcome this in the given context, the provided solution rests on 3 pillars:

  • User Awareness – users are informed that the system may occasionally provide incorrect information.
  • Reporting Mechanism – users can easily report any inaccuracies they encounter (the report record is sketched after this list).
  • Supervised Monitoring – supervisors monitor interactions to catch mistakes early, with their view filtered to protect user privacy.
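For illustration only, the report record could look something like this; the fields are assumptions, the point being that room and identity details get stripped before a supervisor ever sees the entry.

from dataclasses import dataclass
from datetime import datetime

@dataclass
class HallucinationReport:
    report_id: str
    created_at: datetime
    question: str               # what the guest asked (transcript)
    answer: str                 # what the assistant replied
    guest_comment: str          # free text: "this was wrong because..."
    room_id: str | None = None  # blanked out before the record reaches the supervisor queue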

Will AI take over?

Eventually. But I wouldn’t worry just yet. At least 2 aspects seriously hold it back.

Data

Being smarter doesn't in itself make one more powerful. There's only so much data an AI can process, and that's the whole of human knowledge up to this point. But there's no easy way for an AI to come up with new data and evolve fast enough to take us by surprise. Soon enough, specialized AIs should be able to distill potential scientific, engineering, and artistic breakthroughs out of existing data. But these will be just ideas 💡 or concepts. Validating them into actual facts or usable new ‘data’ might still require good old experimentation, peer review, and some combination of imaginativeness and insanity that machines don’t yet possess.

Energy

If you manage to play with the demo, be curious about how intensive it is on the CPU or GPU while generating text for just one query. And that's just one side of the coin; training a model takes great amounts of effort (read funding for infrastructure, energy, and human scrutiny). Otherwise, know that any GPT query costs a few cents. If you're getting it for free, it's investor money.

While generative AI is the hype these days and where most of the money flows, there's way more to AI than that. The self-aware Skynet that we're so afraid of will derive from something much more complex than a language model. And while the plot is not implausible, it's hard to think such a singularity would have enough leverage to hijack enough energy to sustain and grow itself, let alone build new power plants to keep growing and stay on top of humankind.

Hardware/software transcoding – a hybrid low-cost approach

So this would be a significant saving: 5X.

ai-generated image [duh]

It’s commonly understood that ‘software encoding’ means encoding video on a general-purpose device and CPU. It is resource intensive but very flexible and easy to deploy.

Conversely, ‘hardware encoding’ requires specialized equipment. Rather than running complex algorithms, it makes use of dedicated chips to either do the whole encoding or offload the most critical jobs like motion estimation and entropy coding.

Without trying to be exhaustive, we’ll want to touch on the differences and trade-offs of one vs the other, promise it won’t take long:

  • Performance: given the same encoding settings, a hardware encoder will almost always be faster and more power efficient; that’s because components are designed specifically for such tasks, making the process highly economical
  • Flexibility: a hardware encoder can usually only output a specific subset of codecs and formats; logic of these has been ‘burned’ into the hardware and can’t be changed by rewriting software
  • Cost: the respective hardware needs to be purchased or rented upfront, so it'll be more expensive to hardware transcode at first; expect to break even and eventually have it pay for itself if you use it long enough
  • Form and size: devices vary a lot, but expect hardware encoders to be smaller, slimmer, and more portable than their software counterparts
  • Reliability: every device is different, yet hardware encoders are designed to run that specific job without interruption, so they should crash less often

But wait, this piece is about transcoding, not encoding. I.e. the kind that you employ for adaptive bitrate, on the server/cloud, for a bunch of streams at the same time. And while the variety of consumer encoder devices way exceeds that of rack-mounted transcoders, there are still a few to choose from.

GPUs

…started to offer video encoding capabilities maybe 15 years ago. Lack of flexibility aside, the ability to offload the encode/transcode workload of a computer from CPU to GPU is very much welcome. That is, when it’s your computer and you’re broadcasting one stream towards Twitch or YouTube Live.

In a server environment the magic quickly fades. The coolest (i.e. most expensive) GPUs can encode 10+ streams at the same time; but that’s not their best asset, it’s just something they can also do. And since they’re also good for many other tasks (including crypto mining), their price matches the hardware complexity and demand.

The best implementations I’ve seen will get you a 3-4x increase in transcoding capacity compared to CPU, by stacking multiple GPUs in the same computer, yet the cost effectiveness of such setups is debatable.

FPGAs

…are devices that perform specialized tasks deeper in the hardware; in the case of video encoding, they can take performance to the next level both in terms of speed and number of streams they can process in parallel. As you can expect, they’re more expensive than GPUs but will eventually be worth it if you need to transcode a lot. And it’s these that Twitch itself transcodes on nowadays, along with Theo and many others.

Sure thing, it’s a whole different scenario depending on whether you:

  • A. need to transcode a fixed number of streams 24/7, or
  • B. have a stream count that fluctuates dramatically from day to night or over weekends

In case B, your mid-term investment in GPUs or FPGAs for transcoding could be wasted if you’re unable to fully utilize them. What if you could rent them by the minute instead? 🙂

Cloud Offerings

Yep, many of the clouds offer GPU-equipped virtual (and real) servers, and lately AWS has instances equipped with FPGAs specialized for video transcoding. And it’s these that we’ll try to squeeze the best ABR transcoding bang for the buck out of by maxing them out; hold tight.

What’s in the box?

The virtual server comes in 3 sizes and is bundled with the drivers and software to start experimenting right away. If you’re at least a bit handy with ffmpeg or GStreamer, it’ll feel like a charm. Take this one for a spin:

ffmpeg -c:v mpsoc_vcu_h264 -i input.mp4 \
  -filter_complex " \
    multiscale_xma=outputs=4: \
    out_1_width=1280: out_1_height=720: out_1_rate=full: \
    out_2_width=848:  out_2_height=480: out_2_rate=half: \
    out_3_width=640:  out_3_height=360: out_3_rate=half: \
    out_4_width=288:  out_4_height=160: out_4_rate=half  \
    [a][b][c][d]; [a]split[aa][ab]; [ab]fps=30[abb]" \
  -map "[aa]"  -b:v 4M    -c:v mpsoc_vcu_h264 -f mp4 -y ./scaled_720p60.mp4 \
  -map "[abb]" -b:v 3M    -c:v mpsoc_vcu_h264 -f mp4 -y ./scaled_720p30.mp4 \
  -map "[b]"   -b:v 2500K -c:v mpsoc_vcu_h264 -f mp4 -y ./scaled_480p30.mp4 \
  -map "[c]"   -b:v 1250K -c:v mpsoc_vcu_h264 -f mp4 -y ./scaled_360p30.mp4 \
  -map "[d]"   -b:v 625K  -c:v mpsoc_vcu_h264 -f mp4 -y ./scaled_288p30.mp4

Cool, huh? In case I forgot to mention, there’s also a hardware decoder and a scaler in there so the above won’t use any CPU. At a glance, it flows like this

As you may see, the raw video (after decode and before re-encode) does not flow through your main computer’s pipeline (but through the FPGA’s) allowing for lower latencies than a common CPU transcode.

Going the extra mile

So we have a computer equipped with a high performance hardware transcoder. That transcoder itself has a limited number of streams and resolutions it can transcode at the same time.

However, that FPGA physically lives in a computer with lots of CPU to waste. Can we use that to software transcode even more? 😈 Like this

Turns out that’s possible. Particularly the lower resolutions won’t take as big of a toll on the CPU, so we can software transcode those and leave the big ones to the hardware. There’s a slight toll on latency as raw video needs to be moved from the FPGA to the CPU pipeline, but let’s assume we can afford that.
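Here’s a minimal sketch of the split, driving ffmpeg from a small wrapper: the two big renditions stay on the hardware encoder from the earlier example (mpsoc_vcu_h264), while the small ones are scaled and encoded with plain libx264 on the CPU. File names and bitrates are placeholders, and the xvbm_convert step for pulling raw frames back to host memory is how I understand the Xilinx SDK handles it; double-check against the current docs before relying on it.

import subprocess

HW = "mpsoc_vcu_h264"   # FPGA encoder from the example above
SW = "libx264"          # stock software encoder running on the CPU

filter_graph = (
    # hardware scaler produces the two big renditions
    "multiscale_xma=outputs=2:"
    "out_1_width=1280:out_1_height=720:out_1_rate=full:"
    "out_2_width=848:out_2_height=480:out_2_rate=half"
    "[hw720][b];"
    # reuse the 480p branch: one copy goes to the hardware encoder,
    # the other is pulled to host memory and downscaled on the CPU
    "[b]split[hw480][to_sw];"
    "[to_sw]xvbm_convert,split[sw_a][sw_b];"
    "[sw_a]scale=640:360[sw360];"
    "[sw_b]scale=288:160[sw160]"
)

cmd = [
    "ffmpeg", "-c:v", HW, "-i", "input.mp4",
    "-filter_complex", filter_graph,
    "-map", "[hw720]", "-b:v", "4M",    "-c:v", HW, "-f", "mp4", "-y", "hybrid_720p.mp4",
    "-map", "[hw480]", "-b:v", "2500K", "-c:v", HW, "-f", "mp4", "-y", "hybrid_480p.mp4",
    "-map", "[sw360]", "-b:v", "1250K", "-c:v", SW, "-f", "mp4", "-y", "hybrid_360p.mp4",
    "-map", "[sw160]", "-b:v", "625K",  "-c:v", SW, "-f", "mp4", "-y", "hybrid_288x160.mp4",
]
subprocess.run(cmd, check=True)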

The nugget

…is here. It spins up a cloud FPGA enabled virtual server and simply transcodes using the outlined technique. As compared to using just the dedicated hardware, it manages to transcode (up to, depending on use case):

  • 25% more streams
  • 56% more renditions
  • 27% more pixels 

…while paying nothing more. 😎

Does it scale?

The solution works best for a fixed number of well-known, live 24/7 input streams. The main point of this research is to optimize cost, and running considerably fewer streams than a given machine supports will compromise that goal.

Nevertheless, a scaling solution can be imagined by making use of autoscaling, load balancing (for RTMP ingress) and an ‘origin resolver’ similar to this.

Is it stable?

I see no good reason for it to be less stable than the hardware-only approach, yet I never ran this in production. The thought of it struck me a couple of years back while working on a large transcoding project, but it was not applicable there. Since then I’ve found no context to apply it; if anyone beats me to the game, I’m waiting for feedback.

Does it fit my project?

By now, you probably have a sense of it, but if not, just give it a shot.

A quick heads-up, though: it won’t qualify if you use/require any of the following:

  • maximum resolution exceeding 4096 x 2160
  • very low latency – muxing alone (from 2 separate encoders) adds in excess of 100ms to the lag
  • user-generated content
    • some video profiles fail to decode accelerated
    • resolutions not multiple of 4 (width and height) are not supported
  • advanced encoding settings
  • variable/unexpected video dimensions