Self contained AI voice assistant – free and open source

We can’t use OpenAI

Ok, we’re peeking outside the box on this one. But hey, we all have to get into AI or get run over by it, right? This post was written by a human.

Long story short, the client was looking for a novel approach to home (read hotel room) automation to impress the investors. It had to make extensive use of AI or it wouldn’t fly.

A couple hours of brainstorming later, we settled on a voice-driven solution that would allow users (read hotel guests) to manage the lights, curtains, and entertainment system, but also serve as a friendly assistant for information or casual conversation, similar to ChatGPT.

The Constraints

This could not use a public service like OpenAI or Anthropic for multiple reasons, one of which being the need to ensure the privacy of user queries. The ability to deploy the entire GPT “brain” on-site (read hotel basement) was also a significant selling point. 😊

Furthermore, there needed to be strict control over which GPT model was utilized for specific deployments, along with the capability to introduce biases (political, religious, commercial) into some instances. Agreed, that’s wildly controversial, but for the scope of this story let’s treat it as just another selling point.

Finally, it needed to be open source or at least not associated with the major industry players, allowing them to take pride in having developed their own solution.

Challenges

Hardware and Resource Demands

Running the voice-text-voice stack requires significant computational power, especially the speech recognition and (text) response generation. While GPUs are ideal for handling these demands, they aren’t always available in every environment. This makes optimizing systems for mixed hardware setups critical.

Balancing performance across varied infrastructure, whether with or without GPUs, involves careful tuning to maintain responsiveness and ensure the assistant works predictably even in resource-constrained scenarios.
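One cheap way to handle mixed hardware is to probe for a GPU at startup and pick inference settings accordingly. The sketch below is a minimal, library-agnostic version: it uses the presence of `nvidia-smi` as the GPU probe, and the knob names (`n_gpu_layers`, `n_threads`) merely mirror llama.cpp-style options for illustration; they are assumptions, not a specific API.

```python
import os
import shutil

def pick_runtime_config() -> dict:
    """Choose inference settings based on the hardware we land on.

    Uses nvidia-smi presence as a cheap GPU probe. The returned knobs
    (n_gpu_layers, n_threads) are illustrative, llama.cpp-style names,
    not tied to any particular library.
    """
    if shutil.which("nvidia-smi") is not None:
        # GPU available: offload the whole model, few CPU threads needed.
        return {"device": "cuda", "n_gpu_layers": -1, "n_threads": 4}
    # CPU-only fallback: no offload, use most cores, expect higher latency.
    return {
        "device": "cpu",
        "n_gpu_layers": 0,
        "n_threads": max(1, (os.cpu_count() or 4) - 1),
    }

config = pick_runtime_config()
print(config["device"])
```

The same idea extends to picking a smaller model variant on CPU-only boxes, so latency stays within bounds even without acceleration.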

Open-source Complexities

Going full open-source is like trading a smooth highway for a bumpy back road. These tools often require extra tinkering and development compared to polished commercial options.

Without the reliability and support of paid alternatives, teams must invest additional time in troubleshooting, integration, and ongoing refinement to ensure smooth operation and high performance.

Deployment and Scalability

Implementing this at scale poses challenges beyond computing power. As more users connect, maintaining seamless performance becomes increasingly complex. The system requires careful tuning to ensure efficiency and responsiveness, turning scalability into a juggling act.

In a self-hosted setup, you don’t have the flexibility of cloud-based scaling to adjust costs dynamically. However, given the hotel environment, the occupancy is well known, and that gives you a good idea of the maximum expected demand. This insight lets you cut down on infrastructure power during low occupancy times, making better use of resources and keeping things efficient.
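The occupancy-driven sizing above can be sketched as a back-of-the-envelope capacity plan. All the rates below (queries per room per hour, worker throughput, headroom factor) are made-up placeholders to be calibrated from real logs:

```python
import math

def workers_needed(occupied_rooms: int,
                   queries_per_room_hour: float = 2.0,
                   queries_per_worker_hour: float = 120.0,
                   headroom: float = 1.5) -> int:
    """Rough inference-worker count from known hotel occupancy.

    The default rates are illustrative assumptions; measure your own.
    headroom > 1 covers evening peaks when most guests are in their rooms.
    """
    expected_load = occupied_rooms * queries_per_room_hour * headroom
    return max(1, math.ceil(expected_load / queries_per_worker_hour))

# Full house vs. a quiet Tuesday:
print(workers_needed(200))  # peak season
print(workers_needed(20))   # low occupancy: scale down to the minimum
```

Running this against the booking system on a schedule lets you power down nodes during low-occupancy periods instead of paying for idle capacity.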

Customization and bias

I won’t go into the ethics of it here. But technically, it can be challenging to determine how strong a bias can be before it becomes disruptive. Understanding how subtle biases affect user interactions requires careful analysis and testing to ensure a balance between effective customization and user experience.
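Technically, the least invasive way to steer a deployment without retraining is a per-instance system prompt prepended to every conversation. A minimal sketch, assuming a ChatML-style message list; the hotel name and persona text are hypothetical:

```python
# Per-deployment "bias" lives in configuration, not in the model weights.
# The property name and instructions below are purely hypothetical.
DEPLOYMENT_PERSONA = (
    "You are the in-room assistant for the Grandview Hotel. "
    "When asked for dining suggestions, mention the hotel restaurant first."
)

def build_messages(history: list[dict], user_text: str) -> list[dict]:
    """Prepend the per-deployment system prompt to the running conversation."""
    return [
        {"role": "system", "content": DEPLOYMENT_PERSONA},
        *history,
        {"role": "user", "content": user_text},
    ]

messages = build_messages([], "Any good pizza nearby?")
```

Because the steering text is plain configuration, each deployment can dial it up or down (and A/B test how disruptive it feels) without touching the model itself.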

And they agreed to make it public

Well, some of it.

Here is a refined version of the pitch that was presented to the investors, and they seem to have liked it. It’s fully stripped of the parts where you can control everything around you (lights, blinds, tv, room service, laundry, concierge) as they’re looking to patent portions of that.

Yet it’s a good starting point for any rookie AI endeavor. Since you can get it up and running in about 20 minutes, you’ll know sooner rather than later if it’s of any use for your project.

More challenges

Data Privacy and Security

Be careful with user voice data, as it could contain sensitive information. Encrypt everything in transit, securely store interaction history, and purge it upon checkout. However, if you aim to implement a “Welcome back, Mr. Anderson” experience, things become more complex.
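The purge-at-checkout rule is worth encoding explicitly rather than leaving it to an ops checklist. A minimal sketch of the retention logic, using an in-memory store as a stand-in (a real deployment would keep this encrypted at rest and hook `purge_on_checkout` into the property-management system’s checkout event):

```python
from datetime import datetime, timezone

class InteractionStore:
    """In-memory stand-in for per-guest interaction history.

    Only the retention rule is modeled here: everything tied to a stay
    is dropped at checkout. Encryption at rest is out of scope for the
    sketch and belongs to whatever storage backend you choose.
    """

    def __init__(self) -> None:
        self._by_guest: dict[str, list[tuple[str, str]]] = {}

    def record(self, guest_id: str, query: str) -> None:
        ts = datetime.now(timezone.utc).isoformat()
        self._by_guest.setdefault(guest_id, []).append((ts, query))

    def purge_on_checkout(self, guest_id: str) -> int:
        """Drop all history for a guest; returns how many records were removed."""
        return len(self._by_guest.pop(guest_id, []))
```

A “Welcome back, Mr. Anderson” feature would need an explicit opt-in and a separate, longer-lived profile store, which is exactly where the complexity creeps in.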

Model Optimization

Optimizing heavy AI models can be tricky, but it’s about making them faster and less demanding on resources. You can tweak them using techniques like compressing data (quantization), trimming unnecessary parts (pruning), or switching to smaller models that still get the job done. It’s all about finding ways to keep things running smoothly without losing too much accuracy.
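To make quantization concrete, here is the toy version of what quantized model formats do per tensor (or per block of weights): store each float as a small integer plus one shared scale factor, trading a little precision for a 4x memory cut (float32 to int8). This is a from-scratch illustration, not any particular library’s scheme:

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric int8 quantization: integers in [-127, 127] plus one scale.

    Real formats do this per tensor or per block; the principle is identical.
    """
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid scale=0 for all-zero input
    return [round(w / scale) for w in weights], scale

def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate float weights at inference time."""
    return [v * scale for v in q]

q, scale = quantize_int8([0.5, -1.0, 0.25])
recovered = dequantize(q, scale)  # close to the originals, 1/4 the storage
```

Pruning is the complementary trick: drop the weights nearest zero entirely, so there is less to store and multiply in the first place.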

Latency and User Experience

Reducing response times is essential for user satisfaction. One great feature of many models is their ability to start generating a response before the entire output is finalized. Similarly, when converting text to speech, you should begin generating audio as soon as text starts coming through, allowing for a smoother, more immediate experience.
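The streaming handoff above can be sketched as a small re-chunking step between the token stream and the TTS engine: buffer tokens until a sentence boundary appears, then release that sentence immediately so speech synthesis starts while the model is still generating. The token list below simulates an LLM stream; the sentence-boundary regex is deliberately naive:

```python
import re

def stream_sentences(token_stream):
    """Re-chunk an LLM token stream into sentences so TTS can start
    speaking the first sentence while the rest is still generating."""
    buf = ""
    for token in token_stream:
        buf += token
        # Flush every complete sentence as soon as it appears.
        while (m := re.search(r"[.!?]\s", buf)):
            yield buf[:m.end()].strip()
            buf = buf[m.end():]
    if buf.strip():
        yield buf.strip()  # whatever remains when the stream ends

# Simulated token stream from the language model:
tokens = ["The pool ", "is open ", "until 10pm. ", "Towels are ", "by the door."]
for sentence in stream_sentences(tokens):
    print(sentence)  # hand each sentence off to the TTS engine here
```

The guest hears the first sentence after roughly one sentence’s worth of generation time instead of waiting for the full reply, which is usually the difference between “snappy” and “broken”.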

Adaptation to Multilingual Settings

This is a topic in itself and there’s a lot to be said, but let’s try to stay on point.

  • The availability of English training content far outweighs that of any other language, so English models are way better than the rest.
  • Training on multilingual data will output a total mess; don’t do it.
  • Translating to/from an English model is acceptable up to a point; you will lose coherence, sentiment, and idiomatic expressions that a native model might otherwise convey.
  • In multicultural environments, users often switch between languages mid-conversation; handling mixed-language input adds an extra layer of complexity to model training and deployment.

The agreed approach for the case in point was to just ask the user for the interaction language upfront, much like you see on flight entertainment systems. This keeps things simple and avoids traps. Native models are preferred but translations may be acceptable for some contexts. Extra languages can be added for particular implementations.
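Picking the language upfront reduces the whole problem to a routing table: native model where one exists, English model plus machine translation otherwise. A minimal sketch; the model names are placeholders, not real releases:

```python
# Per-language routing: native model where we have one, translation
# pipeline otherwise. Model identifiers below are hypothetical.
LANGUAGE_ROUTES = {
    "en": {"model": "assistant-en-native", "translate": False},
    "de": {"model": "assistant-de-native", "translate": False},
    # No native model yet: route through the English model + MT.
    "ro": {"model": "assistant-en-native", "translate": True},
}

def route_for(language_code: str) -> dict:
    """Resolve the pipeline for the language the guest picked upfront."""
    try:
        return LANGUAGE_ROUTES[language_code]
    except KeyError:
        raise ValueError(f"Language {language_code!r} not offered on this deployment")
```

Adding a language for a particular property is then a config change, and upgrading it from translated to native is a one-line swap once a good enough native model exists.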

Save face, use green energy

Sooner or later, customers and their customers become aware that this whole AI is power hungry. So we need to preemptively save them from the environmental guilt trip. Be sure to highlight that tech wizardry is being powered by renewable energy, or other eco-conscious practices you may implement.

User Trust and Transparency

You’ll want to be able to show what tech is in play and exactly how data’s handled—especially if the press or regulators come asking. Keeping things clear and upfront helps people feel secure about their data and builds trust all around.

Hallucinations

It’s happened to you already, I’m sure. A generative AI spitting out nonsensical, fabricated, or downright stupid content.

While models are slowly improving on this too, we kinda learned to live with it and use common sense and critical thinking (at least I like to hope that’s the case) when querying ChatGPT. But when a concierge voice bot confidently gets you on a cab to a restaurant that doesn’t exist, it becomes a problem.

To mitigate this in our context, the solution rests on three pillars:

  • User Awareness – users are informed that the system may occasionally provide incorrect information.
  • Reporting Mechanism – users can easily report any inaccuracies they encounter.
  • Supervised Monitoring – supervisors monitor interactions to catch mistakes early, with their view filtered to protect user privacy.
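The privacy-filtered supervisor view from the third pillar can be as simple as a redaction pass over the transcript before a human sees it. The patterns below (email, room number, card-like digit runs) are illustrative; a real deployment would tune them against its own data:

```python
import re

# Illustrative redaction patterns; tune against real transcripts.
REDACTIONS = [
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[email]"),
    (re.compile(r"\broom\s*\d{1,4}\b", re.IGNORECASE), "[room]"),
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "[card]"),
]

def supervisor_view(transcript: str) -> str:
    """Mask likely-sensitive details before a human supervisor sees the log."""
    for pattern, label in REDACTIONS:
        transcript = pattern.sub(label, transcript)
    return transcript
```

Supervisors still see what the bot claimed (and can catch the non-existent restaurant), but not who asked or from which room.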

Will AI take over?

Eventually. But I wouldn’t worry just yet. At least two aspects seriously hold it back.

Data

Being smarter doesn’t by itself make one more powerful. There’s only so much data an AI can process, and that’s the whole of human knowledge up to this point. But there’s no easy way for an AI to come up with new data and evolve fast enough to take us by surprise. Soon enough, specialized AIs should be able to distill potential scientific, engineering, and artistic breakthroughs out of existing data. But these will be just ideas 💡 or concepts. Validating them into actual facts or usable new ‘data’ might still require good old experimentation, peer review, and some combination of imaginativeness and insanity that machines don’t yet possess.

Energy

If you manage to play with the demo, be curious about how intensive it is on the CPU or GPU while generating text for just one query. And that’s just one side of the coin: training a model takes enormous amounts of effort (read funding for infrastructure, energy, and human scrutiny). Either way, know that any GPT query costs a few cents. If you’re getting it for free, it’s investor money.

While generative AI is the hype these days and where most of the money flows, there’s way more to AI than that. The self-aware Skynet that we’re so afraid of will derive from something much more complex than a language model. And while the plot is not implausible, it’s hard to think such a singularity would have enough leverage to hijack enough energy to sustain and grow itself, let alone build new power plants to keep growing and stay on top of humankind.
