DEV Community

Ottomatias Peura

Originally published at speechly.com

What are voice user interfaces, and how can you start adding them to your application?

As long as general artificial intelligence doesn't exist (and it won't happen soon), any computer system needs a way of interacting with its users. The user interface is the part of the system through which the user gives it commands.

If you want an application to calculate the sum of two arbitrary numbers, you need a way to get those numbers from the user into your application. A typical calculator solves this problem by presenting the user with the digits 0 to 9, either as physical keys or as touchable or clickable buttons.

User interfaces can be either good or bad. Good user interfaces are the ones you don't have to think about much. The bad ones, well, they are the ones that make your brain melt when you even think of them. Even if you've used them a million times, you still don't know how they work or why they work the way they do. In my experience, every invoicing system falls into this group.

Good user interfaces are intuitive, concise, simple, and aesthetically pleasing. They give enough feedback to the user, tolerate errors such as misclicks or typos, and use familiar elements so that they are easy to use even for new users.

Voice user interfaces (VUIs) are simply user interfaces that use voice or speech as the primary means of interacting with the system. Instead of tapping and swiping through menus to find the action they want, the user can simply say it. Typical examples of voice user interfaces are voice assistants such as Google Assistant, Alexa, or Siri.

Voice user interfaces are typically highly intuitive and don't require a lot of training to use. For example, if you know that you are interacting with an alarm clock and that it has a voice user interface, you can be pretty confident that a command such as "set an alarm for 8am" works the way you'd expect.

In addition to being intuitive, the benefits of a voice user interface include hands-free operation and the ability to use it from a distance. Voice user interfaces can also convey emotion in a way text-based interfaces can't.

The first applications of voice user interfaces were interactive voice response (IVR) systems, which came into existence back in the 80s. These systems understood simple commands over a telephone call and were used to improve efficiency in call centers. The last time I encountered one was at a customer support call center for home appliances, where a synthesized voice asked me to say the name of the manufacturer of the device I needed help with. Once it had correctly recognized my utterance "Siemens", it directed me to the right person.

Current voice user interfaces can be a lot smarter and can understand complex sentences and even combinations of them. For example, Google Assistant is perfectly fine with something like "Turn off the living room light and turn on the kitchen light".
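
To make the idea of a compound command concrete, here is a toy sketch in Python. It is not how Google Assistant works; production systems rely on trained language-understanding models. The sketch assumes a tiny made-up grammar of "turn on/off the <device>" clauses joined by "and", and the names Command and parse_utterance are hypothetical, chosen only for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Command:
    action: str  # "on" or "off"
    device: str  # e.g. "living room light"


# Hypothetical toy grammar: "turn on|off [the] <device>"
_CLAUSE = re.compile(r"^turn (on|off) (?:the )?(.+)$")


def parse_utterance(utterance: str) -> list[Command]:
    """Split a transcript on 'and' and parse each clause on its own."""
    commands = []
    text = utterance.lower().strip().rstrip(".!?")
    for clause in re.split(r"\s+and\s+", text):
        match = _CLAUSE.match(clause.strip())
        if match:  # clauses outside the toy grammar are silently skipped
            commands.append(Command(action=match.group(1), device=match.group(2)))
    return commands


if __name__ == "__main__":
    for cmd in parse_utterance(
        "Turn off the living room light and turn on the kitchen light"
    ):
        print(cmd)
```

Each recognized clause becomes its own command, so the application can act on them one by one. A real application would also need to decide what to do with the clauses it could not parse.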

While user interfaces that use voice are intuitive once you know what they are supposed to do, the typical problems with them have to do with exploration. Let's say you find yourself shut in a small room that has a mirror on one side and seemingly identical metal panels on every other wall. That might be a cause for panic, or it might be a voice-enabled elevator.

If the user doesn't really know what the system is supposed to do, as is the case with almost every new mobile application you install, the best way to learn is by clicking through menus and dialogs (or by reading the documentation!). While you do, you may find buttons that you weren't expecting but that seem interesting or helpful. Then you try them out and hopefully learn how the system works.

A system that has only a voice user interface (a smart speaker, for example), on the other hand, doesn't support this kind of exploration. You can only try saying things, and the system might or might not do what you expected. You can try saying "can you reserve me a table in a seafood restaurant" to your smart speaker, but the answer is either along the lines of "I don't understand" or, in the best case, something like "You'll need to log in to do that. Please open application XYZ on your mobile phone and set up voice."

And even if that worked, it doesn't necessarily mean that the smart speaker can also get you an Uber or flight tickets. You'll have to try each of these individually and learn. If you are building a voice user interface for your application, you should really think about how you support feature discoverability.
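
One way to support discoverability is to answer "what can I say?" explicitly and to turn failures into suggestions. The Python sketch below illustrates the idea under simple assumptions: the command catalogue, the help phrases, and the respond function are all hypothetical, and matching is done on plain strings with the standard library's difflib rather than with a real speech or language-understanding stack.

```python
import difflib

# Hypothetical catalogue of phrases the application understands.
SUPPORTED_COMMANDS = {
    "reserve a table": "Book a restaurant table",
    "order a ride": "Request a car to your location",
    "book a flight": "Search for and book flights",
}

HELP_PHRASES = ("help", "what can i say", "what can you do")


def respond(utterance: str) -> str:
    """Answer a command, or guide the user toward commands that exist."""
    text = utterance.lower().strip()
    if text in HELP_PHRASES:
        return "You can say: " + ", ".join(SUPPORTED_COMMANDS) + "."
    if text in SUPPORTED_COMMANDS:
        return "OK: " + SUPPORTED_COMMANDS[text] + "."
    # Unknown request: suggest the closest supported phrases instead of
    # replying with a bare "I don't understand".
    close = difflib.get_close_matches(
        text, list(SUPPORTED_COMMANDS), n=2, cutoff=0.4
    )
    if close:
        return "I can't do that yet. Did you mean: " + " or ".join(close) + "?"
    return "I can't do that yet. Say 'help' to hear what I can do."


if __name__ == "__main__":
    print(respond("help"))
    print(respond("reserve a tabel"))
```

The point is the shape of the fallback rather than the matching method: every dead end tells the user something about what the system can do.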

The second issue with voice-only user interfaces is slowness. Let's say you want to know what happened in the last round of NBA games. Reading out the results of each of the 15 games will take some time. While voice is the most natural way for us humans to interact, it's nowhere near the fastest way of consuming information. For input, on the other hand, speaking is faster than typing.

Voice user interfaces should support multi-modality. Multi-modality means that a user interface supports more than one interaction modality, for example vision, touch, and voice. Now getting those NBA results becomes a lot nicer: saying "show me the latest NBA results" gives you a screen that shows the results. They are easy to skim, a sudden loud sound in the distance doesn't break your experience, and you don't have to wait until you hear the result for your favorite team.
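
In code, a multi-modal response can be as simple as returning two things for one spoken command: a short sentence to speak and a fuller payload to draw on screen. The following Python sketch assumes a hypothetical MultimodalResponse type and show_results handler, and uses placeholder team names and scores rather than real results.

```python
from dataclasses import dataclass, field


@dataclass
class MultimodalResponse:
    speech: str  # short spoken confirmation
    screen: list[str] = field(default_factory=list)  # rows to render visually


def show_results(results: list[tuple[str, int, str, int]]) -> MultimodalResponse:
    """Speak one sentence and put the full list on screen.

    Each result is (home_team, home_score, away_team, away_score).
    """
    rows = [
        f"{home} {home_score} - {away_score} {away}"
        for home, home_score, away, away_score in results
    ]
    noun = "game" if len(rows) == 1 else "games"
    speech = f"Showing results for {len(rows)} {noun} on screen."
    return MultimodalResponse(speech=speech, screen=rows)


if __name__ == "__main__":
    # Placeholder data, not real scores.
    response = show_results([("Team A", 101, "Team B", 99),
                             ("Team C", 88, "Team D", 95)])
    print(response.speech)
    for row in response.screen:
        print(row)
```

The spoken part stays the same length whether there are two games or fifteen, while the details go to the modality that is better suited to skimming.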

While adding a voice user interface can improve the user experience, it's not a silver bullet that'll turn a bad application into a good one. Just like any user interface, it needs designing and thinking through.

We have successfully solved issues with voice user interfaces with many of our customers. If you are interested in building voice user interfaces, sign up to our Dashboard or send us a message through the Intercom widget on the bottom left!
