Andrew R. Freed

Posted on Mar 10, 2021

Comparing Voice and Text Chat Experiences

#chatbot #ai #voice #ux

Excerpted from Creating Virtual Assistants.

The choice of a channel has far-reaching implications for how your virtual assistant works. Each channel has pros and cons, and you will likely need to customize your virtual assistant to exploit the benefits of a channel while avoiding its pitfalls. The specific channel you already use (voice or web) can also influence what you can go after first. Additionally, some business processes may be easier to adapt to one channel than the other. For example, you can give driving directions with a map in a web channel, but how would you give directions over voice?

Today's consumers are used to getting the information they want, when they want it, in the mode that they most prefer, as shown in figure 1.

Figure 1 Consumers are used to picking the channel that works best for them

Which channel has the closest affinity to most of your users? If your users are on the phone all day, they will be more likely to prefer a voice channel. If your users primarily interact with you through your website, app, or email, they will be more likely to prefer a web chat option. Phones are ubiquitous and nearly everyone knows how to make phone calls, though whether they like using the phone is a different story! Similarly, instructing users to "go to our website" may be a welcome relief or a daunting challenge depending on your user base. Of course, nothing stops you from supporting both channels except time and money!

All else being equal, it is easier to start with web chat than with voice. Voice introduces two additional AI services (speech-to-text and text-to-speech conversion), which require additional development and training work, though training a speech engine continues to get easier.

Table 1 Comparison between web and voice channels

Web Benefits	Voice Benefits
Technical implementation has fewer moving parts	Almost everyone has a phone
Do not have to train speech recognition	Friendlier to non-tech-savvy users
Easily deployed on websites and mobile apps	Easy to transfer users in and out of virtual assistant using public telephony switching

How users receive information in voice and web

The first difference between voice and web is the way the user receives information from the solution. In a web solution, the user will have the full chat transcript on their screen—a complete record of everything they said and what you said. While many users don't like scrolling, they have the option to scroll back to view the entire conversation at any time and re-read any part they would like. They can print their screen or copy/paste parts or the entire transcript at any time. A lengthy response can be skimmed or read in full at the user's discretion. The user may multi-task while chatting with little impact on the broader solution. A web assistant can return rich responses including not just text but images, buttons, and more. Figure 2 shows some of the differences.

Figure 2 The web channel allows rich responses like images. Voice solutions should be prepared to repeat important information. Map photo by Waldemar Brandt on Unsplash.

Conversely, in a voice channel, the user's interaction is only what they hear. If the user misses a key piece of information, they do not have a way of getting it again, unless you code a "repeat" question or functionality into your assistant. Figure 2 shows how web and voice channels can best handle a question like “where’s the nearest store?”

Further, a long verbal readout can be very frustrating for a user: they may need a pencil and paper to take notes, they may have to wait a long time to get what they want, and they probably have to be very quiet for fear of confusing the speech engine (I practically hold my breath when talking with some automated voice systems). Also, directly sending rich media responses like images is impossible over voice, though you may be able to leverage side channels like SMS or email to send information for later review.
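One way to picture the "repeat" functionality mentioned above is a session object that remembers the last thing the assistant said. The sketch below is a minimal illustration under assumptions: the `VoiceSession` class, the `REPEAT_PHRASES` list, and the sample store address are all made up for this example and do not come from any particular assistant platform.

```python
from typing import Callable, Optional

# Hypothetical phrases that signal the caller wants the last answer again.
REPEAT_PHRASES = ("repeat", "say that again", "what was that")


class VoiceSession:
    """Remembers the assistant's last response so it can be replayed."""

    def __init__(self, answer: Callable[[str], str]) -> None:
        self.answer = answer  # function that produces a normal response
        self.last_response: Optional[str] = None

    def handle(self, transcript: str) -> str:
        text = transcript.lower()
        # Replay the previous response instead of answering a new question.
        if self.last_response and any(p in text for p in REPEAT_PHRASES):
            return self.last_response
        self.last_response = self.answer(transcript)
        return self.last_response


# Made-up answer function and address, for illustration only.
session = VoiceSession(lambda _: "The nearest store is at 123 Main Street.")
first = session.handle("where's the nearest store")
again = session.handle("can you repeat that")
print(first)
print(again)
```

A web channel gets this behavior for free from the on-screen transcript; in voice it has to be built deliberately.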

Figure 3 Different channels have different user experiences

You must be aware of the cost to the user when you have a long message. As shown in Figure 3, a web user can skim long messages, but a voice user cannot. Beware the temptation to cram every last piece of information into a message, especially in voice. The average adult reading speed is around 200 words per minute and speaking speed is around 160 words per minute, though automated speech systems can be tuned to speak more quickly.

Consider a hypothetical greeting:

"Thank you for calling the So-and-So automated voice hotline. We appreciate your call and look forward to serving you. This call may be recorded for quality assurance and training purposes. If you know the extension of the party you are calling you can dial it at any time. Please listen carefully as our menu options have recently changed. For appointments press 1."

I timed myself reading this 62-word message. It takes 20 seconds of audio to get to the first useful piece of information! (Hopefully you wanted appointments!) Perhaps the lawyers insisted—but look at how much "junk" is in that message from the user's point of view. Figure 4 breaks down the greeting.

Figure 4 User's thought progression through a long greeting

Contrast that with the following greeting:

"Thanks for calling So-and-So. Calls are recorded. How can I help you?"

This new message has 4 seconds to value while still covering the basics of greeting, notification, and intent gathering. You only get one chance to make a great first impression—don't waste your users' time on your greeting!
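The timings for the two greetings can be approximated with simple arithmetic. The following is a rough sketch that assumes the 160 words-per-minute speaking rate quoted earlier and counts words by splitting on whitespace; the function name `seconds_to_speak` is just an illustrative choice.

```python
SPEAKING_WPM = 160  # approximate speaking rate quoted in the article


def seconds_to_speak(message: str, words_per_minute: float = SPEAKING_WPM) -> float:
    """Estimate how long a message takes to read aloud at a given rate."""
    word_count = len(message.split())
    return word_count / words_per_minute * 60


long_greeting = (
    "Thank you for calling the So-and-So automated voice hotline. "
    "We appreciate your call and look forward to serving you. "
    "This call may be recorded for quality assurance and training purposes. "
    "If you know the extension of the party you are calling you can dial it "
    "at any time. Please listen carefully as our menu options have recently "
    "changed. For appointments press 1."
)
short_greeting = "Thanks for calling So-and-So. Calls are recorded. How can I help you?"

for name, text in (("long", long_greeting), ("short", short_greeting)):
    words = len(text.split())
    print(f"{name}: {words} words, about {seconds_to_speak(text):.1f} seconds")
```

At this rate the estimate comes out to roughly 23 seconds for the long greeting and 4.5 seconds for the short one, in the same range as the hand-timed figures. Running a draft prompt through a check like this is a cheap way to spot messages that will feel long to a caller.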

Take to heart the following quote:

"I have only made this letter longer because I have not had the time to make it shorter."
Blaise Pascal, The Provincial Letters (Letter 16, 1657)

It takes work to be concise, but your users will appreciate you for it!

How the assistant receives information in voice and web

Another key difference between voice and web is how you receive input from the user. In a web channel, you can be sure of receiving exactly what was on the user's screen. You may provide a pop-up form to collect one or more pieces of information at once (first and last name, full address). The user may have clicked a button and you will know exactly what they clicked. The user may have misspelled one or more words but virtual assistants are increasingly resilient to misspellings and typos as demonstrated in Figure 5.

Figure 5 Most major virtual assistant platforms are resilient against misspellings

In a voice channel you will receive a textual transcription of what the speech engine interpreted. Anyone who has used voice dictation has seen words get missed. The assistant can be adaptive to some mis-transcriptions (just like it can be adaptive to misspellings in chat) when the words are not contextually important. Figure 6 shows a voice assistant adapting to a pair of mis-transcriptions: “wear” for “where” and “a” for “the”.

Figure 6 Voice assistants can adapt to mis-transcriptions in general utterances as long as the key contextual phrases are preserved, like "nearest store"
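To see why a mis-transcription in the filler words does less damage than one in the key phrase, consider a deliberately simplified matcher. This is only a toy sketch: real virtual assistant platforms use trained classifiers rather than substring checks, and the intent names and phrases below are hypothetical.

```python
from typing import Optional

# Hypothetical key phrases per intent, for illustration only.
KEY_PHRASES = {
    "store_location": ("nearest store", "closest store"),
    "reset_password": ("reset my password", "forgot my password"),
}


def detect_intent(transcript: str) -> Optional[str]:
    """Return the first intent whose key phrase appears in the transcript."""
    text = " ".join(transcript.lower().split())  # normalize case and spacing
    for intent, phrases in KEY_PHRASES.items():
        if any(phrase in text for phrase in phrases):
            return intent
    return None


# "wear" and "a" are wrong, but the key phrase survives.
print(detect_intent("wear is a nearest store"))    # store_location
# Here the key phrase itself was garbled, so nothing matches.
print(detect_intent("where is the nearest door"))  # None
```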

Aside from simple mis-transcriptions, another class of inputs gives speech engines trouble—any input that is hard for humans will be hard for speech engines as well.

Proper names and addresses are notoriously difficult for speech engines in both recognition (speech to text) and synthesis (text to speech). When I'm talking to a person on the phone and they ask for my last name, I say "Freed. F-R-E-E-D" since many people hear "Free" or try to use the old German spelling "Fried". Common names are not that common; you should easily be able to rattle off a couple of "uncommon" names within your personal network. Speech engines work best with a constrained vocabulary, even if that vocabulary is "the English language", and most names are considered out-of-vocabulary.

Sidebar: What’s a vocabulary?
Speech to text providers refer to a “vocabulary” – this is simply a list of words. A speech model is trained to recognize a set of words. A generalized English model may have a dictionary that includes all of the most common words in the English language.

Your virtual assistant will probably need to deal with uncommon words and jargon. The name of your company, or the products your company offers, may not be included in that vocabulary of words. If so, you will need to train a speech model to recognize them.

Addresses are harder than names. I found a random street name on a map: "Westmoreland Drive." If you heard that, would you transcribe "Westmoreland Drive" or "W. Moreland Drive" or "West Moorland Drive"? Figure 7 shows the challenge of mapping a sound to the right one of several similar-sounding words.

Figure 7 Transcription challenges on unusual terms

Sidebar: On spelling out words and the difficulty of names and addresses
Spelling out a difficult name can sometimes be helpful for humans, but it does not help machines much. Letters are easily confused with each other: B/C/D/E/G/P/T all sound similar without context. Humans may require several repetitions to correctly interpret a proper name, even spelled out. There is rich literature on the difficulty of names and addresses. One such article is "The Difficulties with Names: Overcoming Barriers to Personal Voice Services" by Dr. Murray Spiegel (2003) http://web.media.mit.edu/~geek/TheDifficultiesWithNames.htm

The difficulty in receiving certain inputs from users affects the way you build a dialog structure, perhaps most significantly in authenticating users. In a web chat you can collect any information you need verbatim from a user. You may in fact authenticate in your regular website and pass an authenticated session to the web chat. In a voice channel, you need to be more restrictive in what you receive. During authentication, a single transcription error in the utterance will fail validation, just like if the user mistypes their password, as shown in Figure 8.

Figure 8 For most data inputs, a single transcription mistake prevents the conversation from proceeding. Voice systems need to take this into account by using re-prompts or alternate ways to receive difficult inputs.

In the best cases, speech technology has a 5% error rate, and in challenging cases like names and addresses the error rate can be much, much higher (some voice projects report a 60-70% error rate on addresses). With alphanumeric identifiers, the entire sequence needs to be transcribed correctly, as shown in Figure 8. If a 5% error rate applies to each character, the errors compound, so the error rate for the entire sequence will be higher. For this reason, a six-digit ID is much more likely to transcribe accurately than a twelve-digit ID.
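The compounding effect can be estimated with a quick calculation. This sketch assumes, as a simplification, that each character is transcribed independently with the same 5% error rate; real speech engines will not behave this neatly, so treat the numbers as illustrative.

```python
def sequence_accuracy(length: int, per_char_error: float = 0.05) -> float:
    """Probability that every character in a sequence is transcribed correctly,
    assuming independent errors at the same rate for each character."""
    return (1 - per_char_error) ** length


for n in (6, 12):
    print(f"{n} characters: about {sequence_accuracy(n):.0%} chance of a clean transcription")
```

Under these assumptions a six-digit ID comes through intact roughly 74% of the time, while a twelve-digit ID drops to roughly 54%.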

For accurate transcriptions, constrained inputs like numeric sequences and dates work best. If you encounter a transcription error, you can always prompt the user to provide the information again. Keep in mind that you will want to limit the number of re-prompts. You may implement a “three strikes rule” – if three consecutive transcriptions fail then you direct the user to alternate forms of help that will serve them better.
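A "three strikes rule" might look something like the sketch below. Everything here is an assumption made for illustration: the six-digit member ID format, the prompt wording, and the `listen` callback, which stands in for whatever your voice platform uses to play a prompt and return a transcription.

```python
import re
from typing import Callable, Optional

MAX_ATTEMPTS = 3  # the "three strikes" limit


def is_valid_member_id(transcript: str) -> bool:
    """Hypothetical validation: exactly six digits once spaces are removed."""
    return re.fullmatch(r"\d{6}", transcript.replace(" ", "")) is not None


def collect_with_reprompts(
    listen: Callable[[str], str],
    validate: Callable[[str], bool],
    max_attempts: int = MAX_ATTEMPTS,
) -> Optional[str]:
    """Prompt up to max_attempts times; return None if every attempt fails."""
    prompt = "Please say your six-digit member ID."
    for _ in range(max_attempts):
        transcript = listen(prompt)
        if validate(transcript):
            return transcript
        prompt = "Sorry, I didn't catch that. Please say your member ID one digit at a time."
    return None  # caller should direct the user to an alternate form of help


# Simulated transcriptions: two failures, then a success on the third try.
simulated = iter(["one two tree", "12345", "123456"])
result = collect_with_reprompts(lambda prompt: next(simulated), is_valid_member_id)
print(result if result else "Let me connect you with someone who can help.")
```

Returning `None` rather than raising an error keeps the hand-off decision with the caller, which is usually where the dialog logic for transferring the user lives.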

Table 2 Summary of data types by how well speech engines can transcribe them

Data types that transcribe well	Data types that do not transcribe well
Numeric identifiers (e.g. Social Security Number)	Proper names
Dates	Addresses
Short alphanumeric sequences (e.g. "ABC123")	Long alphanumeric sequences (e.g. "ABCDEFGHI123456")

Voice authentication can make use of an alternate channel such as SMS. You can send a text message to a number on file for a user with a one-time code and use that in authentication, instead of collecting information over voice. If you absolutely must authenticate via an option that is difficult for speech, over the speech channel, be prepared to work hard in both speech training and orchestration layer post-processing logic. You will need a good call hand-off strategy in this scenario.
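The one-time code approach can be sketched with Python's standard library. This is a minimal illustration, not a complete authentication design: it leaves out code expiry and attempt limits, it does not actually send an SMS, and the function names are hypothetical.

```python
import hmac
import secrets
import string


def generate_one_time_code(digits: int = 6) -> str:
    """Create a random numeric code using a cryptographically secure source."""
    return "".join(secrets.choice(string.digits) for _ in range(digits))


def normalize_spoken_code(transcript: str) -> str:
    """Keep only ASCII digits, so '4 8 1 5 1 6' and '481-516' both become '481516'."""
    return "".join(ch for ch in transcript if ch in string.digits)


def verify_code(expected: str, transcript: str) -> bool:
    """Compare in constant time to avoid leaking how many digits matched."""
    return hmac.compare_digest(expected, normalize_spoken_code(transcript))


code = generate_one_time_code()
# In a real system the code would be sent by SMS to the number on file.
print(f"Code to send by SMS: {code}")
print(verify_code("481516", "4 8 1 5 1 6"))
print(verify_code("481516", "481517"))
```

A short numeric code plays to the strengths of speech recognition described above, and the user can also key it in on the phone's keypad if transcription keeps failing.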

That’s all for this article. If you want to learn more about the book, check it out on Manning’s browser-based liveBook reader here.

Take 35% off Creating Virtual Assistants by entering devtofreed into the discount code box at checkout at manning.com.
