Now, more than ever before, is the time for AI-powered voice-based systems. Consider a call to customer service. Soon all the brittleness and inflexibility will be gone – the stiff robotic voices, the constricting “press one for sales”-style menus, the annoying experiences that have had us all frantically pressing zero in the hope of talking with a human agent instead. (Or, given the long waits that a transfer to a human agent can entail, had us giving up on the call altogether.)
No more. Advances not only in transformer-based large language models (LLMs) but in automatic speech recognition (ASR) and text-to-speech (TTS) systems mean that “next-generation” voice-based agents are here – if you know how to build them.
Today we take a look into the challenges confronting anyone hoping to build such a state-of-the-art voice-based conversational agent.
Before jumping in, let’s take a quick look at the general attractions and relevance of voice-based agents (as opposed to text-based interactions). There are many reasons why a voice interaction might be more appropriate than a text-based one – these can include, in increasing order of severity:
- Preference or habit – speaking pre-dates writing developmentally and historically
- Slow text input – many can speak faster than they can text
- Hands-free situations – such as driving, working out or doing the dishes
- Illiteracy – at least in the language(s) the agent understands
- Disabilities – such as blindness or lack of non-vocal motor control
In an age seemingly dominated by website-mediated transactions, voice remains a powerful conduit for commerce. For example, a recent study by JD Power of customer satisfaction in the hotel industry found that guests who booked their room over the phone were more satisfied with their stay than those who booked through an online travel agency (OTA) or directly through the hotel’s website.
But interactive voice response systems, or IVRs for short, are not enough. A 2023 study by Zippia found that 88% of customers prefer voice calls with a live agent to navigating an automated phone menu. The study also found that the biggest annoyances with phone menus include listening to irrelevant options (69%), being unable to fully describe the issue (67%), inefficient service (33%), and confusing options (15%).
And there is an openness to using voice-based assistants. According to a study by Accenture, around 47% of consumers are already comfortable using voice assistants to interact with businesses and around 31% of consumers have already used a voice assistant to interact with a business.
Whatever the reason, for many, there is a preference and demand for spoken interaction – as long as it is natural and comfortable.
Roughly speaking, a good voice-based agent should respond to the user in a way that is:
- Relevant: Based on a correct understanding of what the user said/wanted. Note that in some cases, the agent’s response will not just be a spoken reply, but some form of action through integration with a backend (e.g., actually causing a hotel room to be booked when the caller says “Go ahead and book it”).
- Accurate: Based on the facts (e.g., only say there is a room available at the hotel on January 19th if there is)
- Clear: The response should be understandable
- Timely: With the kind of latency that one would expect from a human
- Safe: No offensive or inappropriate language, revealing of protected information, etc.
Current voice-based automated systems attempt to meet the above criteria at the expense of being a) very limited and b) very frustrating to use. Part of this is a result of the high expectations that a voice-based conversational context sets, and those expectations only rise as TTS voices become indistinguishable from human ones. But these expectations are dashed in the systems that are widely deployed at the moment. Why?
In a word – inflexibility:
- Limited speech – the user is typically forced to say things unnaturally: in short phrases, in a particular order, without spurious information, etc. This offers little or no advance over the old-school number-based menu system
- Narrow, non-inclusive notion of “acceptable” speech – low tolerance for slang, uhms and ahs, etc.
- No backtracking – if something goes wrong, the user may have little chance of “repairing” or correcting the problematic piece of information, and instead has to start over or wait for a transfer to a human
- Strict turn-taking – no ability to interrupt or speak over the agent
It goes without saying that people find these constraints annoying or frustrating.
The good news is that modern AI systems are powerful and fast enough to vastly improve on the above kinds of experiences, instead approaching (or exceeding!) human-based customer service standards. This is due to a variety of factors:
- Faster, more powerful hardware
- Improvements in ASR (higher accuracy, overcoming noise, accents, etc.)
- Improvements in TTS (natural-sounding or even cloned voices)
- The arrival of generative LLMs (natural-sounding conversations)
That last point is a game-changer. The key insight was that a good predictive model can serve as a good generative model. An artificial agent can get close to human-level conversational performance if it says whatever a sufficiently good LLM predicts to be the most likely thing a good human customer service agent would say in the given conversational context.
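To make the "predict, then say it" idea concrete, here is a deliberately tiny sketch. It stands in a bigram word counter for the LLM, which is a drastic simplification, and the three training sentences are made up for illustration. The point is only that a model built to predict the next word can be turned into a generator by repeatedly emitting its top prediction.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count, for each word, which words follow it (a toy predictive model)."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def generate(counts, max_tokens=20):
    """Generate text by always emitting the most likely next word."""
    token, out = "<s>", []
    for _ in range(max_tokens):
        if token not in counts:
            break
        token = counts[token].most_common(1)[0][0]
        if token == "</s>":
            break
        out.append(token)
    return " ".join(out)

# Hypothetical examples of what human agents said in past calls.
corpus = [
    "how may i assist you today",
    "how may i help you",
    "how may i assist you today",
]
model = train_bigram(corpus)
reply = generate(model)
```

A real LLM conditions on the whole conversation rather than on one previous word, and usually samples rather than always taking the top choice, but the loop has the same shape.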
Cue the arrival of dozens of AI startups hoping to solve the voice-based conversational agent problem simply by selecting, and then connecting, off-the-shelf ASR and TTS modules to an LLM core. On this view, the solution is just a matter of selecting a combination that minimizes latency and cost. And of course, that’s important. But is it enough?
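As a rough illustration of what "selecting a combination that minimizes latency and cost" can look like, the sketch below brute-forces every ASR/LLM/TTS combination and keeps the cheapest one that fits a per-turn latency budget. All module names, latencies and prices are invented for the example; it also assumes the three stages run strictly one after another, which streaming pipelines do not.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Module:
    name: str          # hypothetical product name
    latency_ms: float  # assumed latency per conversational turn
    cost_usd: float    # assumed cost per conversational turn

def cheapest_pipeline(asr_options, llm_options, tts_options, latency_budget_ms):
    """Return (combo, cost, latency) for the cheapest triple within budget, or None."""
    best = None
    for combo in product(asr_options, llm_options, tts_options):
        latency = sum(m.latency_ms for m in combo)
        cost = sum(m.cost_usd for m in combo)
        if latency > latency_budget_ms:
            continue  # too slow to feel like a human reply
        if best is None or cost < best[1]:
            best = (combo, cost, latency)
    return best

ASR = [Module("asr-fast", 150, 0.004), Module("asr-cheap", 400, 0.001)]
LLM = [Module("llm-large", 900, 0.020), Module("llm-small", 300, 0.003)]
TTS = [Module("tts-natural", 250, 0.006), Module("tts-basic", 100, 0.001)]

result = cheapest_pipeline(ASR, LLM, TTS, latency_budget_ms=800)
```

Notice that the search says nothing about whether the chosen pipeline gives correct answers or handles interruptions well.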
There are several specific reasons why that simple approach won’t work, but they derive from two general points:
- LLMs actually can’t, on their own, provide good fact-based text conversations of the sort required for enterprise applications like customer service. So they can’t, on their own, do that for voice-based conversations either. Something else is needed.
- Even if you do supplement LLMs with what is needed to make a good text-based conversational agent, turning that into a good voice-based conversational agent requires more than just hooking it up to the best ASR and TTS modules you can afford.
Let’s look at a specific example of each of these challenges.
Challenge 1: Keeping it Real
As is now widely known, LLMs sometimes produce inaccurate or ‘hallucinated’ information. This is disastrous in the context of many commercial applications, even if it might make for a good entertainment application where accuracy may not be the point.
That LLMs sometimes hallucinate is only to be expected, on reflection. It is a direct consequence of using a model trained on a data set (however huge) that might be a year or more old to answer questions about facts that are neither part of that data set nor entailed by it. When the caller asks “What’s my membership number?”, a simple pre-trained LLM can only generate a plausible-sounding answer, not an accurate one.
The most common ways of dealing with this problem are:
- Fine-tuning: Train the pre-trained LLM further, this time on all the domain-specific data that you want it to be able to answer correctly.
- Prompt engineering: Add the extra data/instructions as an input to the LLM, in addition to the conversational history.
- Retrieval Augmented Generation (RAG): Like prompt engineering, except the data added to the prompt is determined on the fly by matching the current conversational context (e.g., the customer has asked “Does your hotel have a pool?”) to an embedding-encoded index of your domain-specific data (that includes, e.g., a file that says: “Here are the facilities available at the hotel: pool, sauna, EV charging station.”).
- Rule-based control: Like RAG, but what is to be added to (or subtracted from) the prompt is not retrieved by matching a neural memory but is determined through hard-coded (and hand-coded) rules.
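The RAG idea in the list above can be sketched in a few lines. This is a toy under loud assumptions: `embed` is a bag-of-words counter standing in for a real embedding model, the index is a plain Python list rather than a vector store, and the second and third documents are invented (only the facilities sentence comes from the example above).

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

DOCS = [
    "Here are the facilities available at the hotel: pool, sauna, EV charging station.",
    "Check-in starts at 3 pm and check-out is at 11 am.",  # hypothetical
    "Pets are welcome for an additional nightly fee.",      # hypothetical
]
INDEX = [(doc, embed(doc)) for doc in DOCS]  # built once, ahead of the call

def retrieve(query, k=1):
    """Return the k indexed documents most similar to the query."""
    q = embed(query)
    ranked = sorted(INDEX, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def build_prompt(history, user_turn):
    """Put the retrieved facts in front of the conversation before calling the LLM."""
    context = "\n".join(retrieve(user_turn))
    return (f"Use only these facts:\n{context}\n\n"
            f"Conversation so far:\n{history}\nCustomer: {user_turn}\nAgent:")

prompt = build_prompt("", "Does your hotel have a pool?")
```

The resulting `prompt` would then be sent to whichever LLM you use; that call is left out here because it depends entirely on the provider.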
Note that one size does not fit all. Which of these methods will be appropriate will depend on, for example, the domain-specific data that is informing the agent’s answer. In particular, it will depend on whether said data changes frequently (call to call, say – e.g. customer name) or hardly ever (e.g., the initial greeting: “Hello, thank you for calling the Hotel Budapest. How may I assist you today?”). Fine-tuning would not be appropriate for the former, and RAG would be a clumsy solution for the latter. So any working system will have to use a variety of these methods.
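One possible division of labour, sketched below, is to keep the never-changing greeting as a plain constant and to use hand-coded rules to inject per-call facts into the prompt. The record fields (`name`, `membership_number`) and the keyword rule are assumptions made up for this example, not a recommended schema.

```python
GREETING = "Hello, thank you for calling the Hotel Budapest. How may I assist you today?"

def first_agent_turn():
    """The greeting hardly ever changes, so no LLM call or retrieval is needed."""
    return GREETING

def facts_for_turn(user_turn, caller_record):
    """Hand-coded rules decide which per-call facts are added to the prompt."""
    facts = [f"Customer name: {caller_record['name']}"]
    # Only expose the membership number when the caller asks about it.
    if "membership" in user_turn.lower():
        facts.append(f"Membership number: {caller_record['membership_number']}")
    return facts

# Hypothetical record looked up from a backend when the call connects.
record = {"name": "Ada", "membership_number": "M-123"}
facts = facts_for_turn("What's my membership number?", record)
```

Because the record is fetched fresh for every call, the agent answers from the actual value rather than from whatever the model finds plausible.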
What’s more, integrating these methods with the LLM and each other in a way that minimizes latency and cost requires careful engineering. For example, your model’s RAG performance might improve if you fine-tune it to facilitate that method.
It may come as no surprise that each of these methods in turn introduces its own challenges. For example, take fine-tuning. Fine-tuning your pre-trained LLM on your domain-specific data will improve its performance on that data, yes. But fine-tuning modifies the parameters (weights) that are the basis of the pre-trained model’s (presumably fairly good) general performance. This modification can therefore cause an unlearning (or “catastrophic forgetting”) of some of the model’s previous knowledge, which can result in the model giving incorrect or inappropriate (even unsafe) responses. If you want your agent to continue to respond accurately and safely, you need a fine-tuning method that mitigates catastrophic forgetting.
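One widely used family of mitigations is rehearsal: mixing a share of general-purpose examples back into the fine-tuning set so the model keeps seeing what it already knew. The sketch below shows only the data-mixing step and is not necessarily the method any particular system uses; the function name, the 20% default and the example data are all assumptions.

```python
import random

def mix_with_replay(domain_examples, general_examples, replay_fraction=0.2, seed=0):
    """Return a shuffled training set in which about `replay_fraction` is general data."""
    if not 0 <= replay_fraction < 1:
        raise ValueError("replay_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_replay / (n_domain + n_replay) = replay_fraction for n_replay.
    n_replay = round(len(domain_examples) * replay_fraction / (1 - replay_fraction))
    n_replay = min(n_replay, len(general_examples))
    mixed = list(domain_examples) + rng.sample(list(general_examples), n_replay)
    rng.shuffle(mixed)
    return mixed

# Hypothetical placeholder data: 8 domain examples, 50 general ones.
domain = [f"domain-{i}" for i in range(8)]
general = [f"general-{i}" for i in range(50)]
training_set = mix_with_replay(domain, general)
```

How large the replay share should be is an empirical question; the trade-off is between learning the new domain quickly and preserving general behaviour.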
Challenge 2: Endpointing
Determining when a customer has finished speaking (endpointing) is critical for natural conversation flow. Similarly, the system must handle interruptions gracefully, ensuring the conversation remains coherent and responsive to the customer’s needs. Achieving this to a standard comparable to human interaction is a complex task but is essential for creating natural and pleasant conversational experiences.
A solution that works requires the designers to consider questions like these:
- How long after the customer stops speaking should the agent wait before deciding that the customer has stopped speaking?
- Does the above depend on whether the customer has completed a full sentence?
- What should be done if the customer interrupts the agent?
- In particular, should the agent assume that what it was saying was not heard by the customer?
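The first two questions can be made concrete with a small sketch: wait a short time after a silence if the utterance looks finished, and longer if it trails off mid-thought. The word list and both thresholds are guesses for illustration, and a production system would more likely use a trained model than a list of trailing words.

```python
# Words that suggest the speaker is mid-sentence (illustrative, not exhaustive).
INCOMPLETE_ENDINGS = ("and", "but", "or", "so", "because", "um", "uh", "the", "a", "to")

def looks_complete(transcript):
    """Crude check: does the partial transcript end on a word that invites more?"""
    words = transcript.lower().rstrip(".?! ").split()
    return bool(words) and words[-1].strip(",") not in INCOMPLETE_ENDINGS

def turn_has_ended(transcript, silence_ms, short_wait_ms=500, long_wait_ms=1500):
    """Decide whether the agent should start replying after `silence_ms` of silence."""
    threshold = short_wait_ms if looks_complete(transcript) else long_wait_ms
    return silence_ms >= threshold

quick = turn_has_ended("I'd like to book a room", silence_ms=600)        # complete sentence
patient = turn_has_ended("I'd like to book a room and", silence_ms=600)  # trailing "and"
```

Here the same 600 ms pause ends the turn in the first case but not in the second, which is the behaviour a human listener would show.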
These issues, having largely to do with timing, require careful engineering above and beyond that involved in getting an LLM to give a correct response.
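For the interruption questions, one option is to record in the conversation history only the part of the agent's reply that was plausibly heard before the customer cut in. The sketch below estimates that part from playback time, assuming a uniform speaking rate; the function names and the `[interrupted]` marker are hypothetical, and real TTS engines may expose word-level timing that makes the estimate unnecessary.

```python
def heard_portion(agent_text, playback_ms, interrupted_at_ms):
    """Estimate which words were played before the interruption."""
    if interrupted_at_ms >= playback_ms:
        return agent_text  # the whole reply was played
    words = agent_text.split()
    n_heard = int(len(words) * interrupted_at_ms / playback_ms)
    return " ".join(words[:n_heard])

def handle_barge_in(history, agent_text, playback_ms, interrupted_at_ms):
    """Append only the heard part of the agent's reply to the conversation history."""
    heard = heard_portion(agent_text, playback_ms, interrupted_at_ms)
    if heard:
        marker = "" if heard == agent_text else " [interrupted]"
        history.append(("agent", heard + marker))
    return history

reply = "We have a pool a sauna and an EV charging station"
history = handle_barge_in([], reply, playback_ms=4400, interrupted_at_ms=1600)
```

With the history recorded this way, the LLM's next turn can take into account that the customer never heard about the sauna or the charging station.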
The evolution of AI-powered voice-based systems promises a revolutionary shift in customer service dynamics, replacing antiquated phone systems with advanced LLMs, ASR, and TTS technologies. However, overcoming the challenges of hallucinated information and of achieving seamless endpointing will be pivotal for delivering natural and efficient voice interactions.
Automating customer service has the power to become a true game changer for enterprises, but only if done correctly. In 2024, with all these new technologies, we can finally build systems that feel natural and flowing and that robustly understand us. The net effect will be shorter wait times and a better experience than today’s voice bots offer, marking a transformative era in customer engagement and service quality.