Spoken Language Understanding (SLU) and Natural Language Understanding (NLU) aim to help machines understand human language. The main difference is the input data type: SLU deals with understanding speech, whereas NLU deals with understanding text. NLU is a part of SLU, whether it’s trained independently or not.
Research on NLU started in the 1960s: Bobrow’s Ph.D. dissertation, Weizenbaum’s ELIZA (a mock psychotherapist chatbot), and Winograd’s SHRDLU are the pioneering works in this space. SLU’s popularity started with the recent advances in speech recognition powered by deep learning. The query “spoken language” returns over 1000 studies on both Amazon and Microsoft research publications websites.
Conventional SLU Approach
The conventional SLU processes utterances in two steps: Speech-to-Text (STT) first, then NLU. Once STT transcribes the speech to text, NLU extracts meaning by processing the transcribed text. The performance relies on independently trained STT and NLU modules. If STT returns erroneous output, NLU receives the wrong text and makes incorrect predictions, so the machine fails to capture what the speaker actually said. Many voice applications, including voice assistants such as Alexa, Siri, and Google, use this approach.
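To make the error propagation concrete, here is a minimal sketch of a cascaded pipeline. Everything in it is a toy stand-in invented for illustration: `toy_stt`, `toy_nlu`, the `mishear` map, and the `orderFood` intent are assumptions, not the API of any real engine, and the "audio" is represented by the spoken words themselves.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Intent:
    name: str
    slots: Dict[str, str] = field(default_factory=dict)


def toy_stt(audio: str, mishear: Optional[Dict[str, str]] = None) -> str:
    """Hypothetical STT stage: 'audio' is just the spoken words.

    `mishear` maps a spoken word to the wrong word the recognizer
    outputs, simulating a transcription error.
    """
    mishear = mishear or {}
    return " ".join(mishear.get(word, word) for word in audio.lower().split())


def toy_nlu(text: str) -> Intent:
    """Hypothetical NLU stage: keyword matching on the transcript only."""
    words = text.split()
    if "order" in words and "hamburger" in words:
        return Intent("orderFood", {"item": "hamburger"})
    return Intent("unknown")


def cascaded_slu(audio: str, mishear: Optional[Dict[str, str]] = None) -> Intent:
    """Conventional SLU: STT first, then NLU on whatever STT produced."""
    return toy_nlu(toy_stt(audio, mishear))


# Clean transcription: NLU finds the intent.
print(cascaded_slu("Order a hamburger"))
# One misheard word: NLU never sees "hamburger", so the intent is lost.
print(cascaded_slu("Order a hamburger", mishear={"hamburger": "hamper"}))
```

The NLU stage only ever sees the transcript, so it has no way to recover from a mistake made upstream.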
End-to-End SLU Approach
The modern SLU uses an end-to-end model instead of two distinct components. Developers train STT and NLU jointly, resulting in higher accuracy.
Picovoice calls this Speech-to-Intent as it infers users’ intents directly from speech. Amazon calls it FANS - Fusing ASR and NLU for SLU.
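The shape of an end-to-end interface can be sketched as follows. This is only a toy illustration of "audio features in, intent out, no transcript in between"; it is not how Rhino, FANS, or any production model works. The prototype vectors, the intent names, and the `speech_to_intent` function are all invented for the example, and a real system would learn the mapping with a jointly trained model.

```python
import math
from typing import Dict, List, Optional

# Hypothetical per-intent prototype feature vectors (made-up numbers).
# A real end-to-end model would learn this mapping from labeled speech.
PROTOTYPES: Dict[str, List[float]] = {
    "orderFood": [0.9, 0.1, 0.2],
    "checkBalance": [0.1, 0.8, 0.3],
}


def speech_to_intent(features: List[float], threshold: float = 0.5) -> Optional[str]:
    """Map audio features directly to an intent name.

    Picks the nearest prototype by Euclidean distance and returns None
    when nothing is close enough. No intermediate transcript is produced.
    """
    best_name: Optional[str] = None
    best_distance = math.inf
    for name, prototype in PROTOTYPES.items():
        distance = math.dist(features, prototype)
        if distance < best_distance:
            best_name, best_distance = name, distance
    return best_name if best_distance <= threshold else None


print(speech_to_intent([0.85, 0.15, 0.25]))  # close to the orderFood prototype
print(speech_to_intent([5.0, 5.0, 5.0]))     # far from every prototype
```

Because there is no text stage, there is no transcription error to pass along; the trade-off is that the model only knows the intents it was built for.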
Conventional SLU Approach vs. End-to-End SLU Approach
Which approach is better? The answer is “it depends.” It depends on the availability of corpora and information for the use case. If they are available, then the answer is the modern end-to-end SLU. If not, then the conventional SLU. Text-based understanding (NLU) has been around longer than speech-based understanding (SLU), so it has richer datasets.
For domain-specific applications such as IVR systems, menu navigation on a website, or ordering food at a QSR, the modern end-to-end SLU is preferable. Nobody would discuss the meaning of life with a voice assistant while ordering a hamburger. For open-domain use cases such as voice assistants like Alexa, the conventional (cascaded) SLU works better given the variety of topics they cover. One can discuss the meaning of life with Alexa, although there are better options.
Top SLU and NLU engines in the market
Free and Open-source SLU and NLU Engines:
Rasa: Rasa is an open-source NLU engine that processes text inputs. The core software is free, and Rasa offers paid support and consulting services. Anyone can choose a speech-to-text service, and run Rasa on transcribed text.
Snips: Snips is an open-source SLU engine that uses the conventional method. Snips no longer maintains it after being acquired by Sonos. Yet the repo is still available on GitHub and used by developers.
Wit.ai: Wit.ai is a free platform and now requires a Facebook account after being acquired by Facebook. If one doesn’t (want to) have a Facebook account or deletes it, then they cannot use Wit.
Top paid SLU and NLU Engines:
Dialogflow: Google renamed API.ai to Dialogflow after acquiring it and offers both chatbot and voicebot tools under the same name. It uses the conventional approach. Dialogflow records and sends voice data to Google’s servers for transcription and then processes the transcribed text. It charges based on usage.
Lex: Amazon’s Lex is an AWS offering. Like Dialogflow, Lex offers text and voice capabilities, uses the conventional approach, and performs transcription and understanding separately in its cloud. It charges based on usage.
Rhino: Picovoice’s Rhino is an SLU engine that uses the end-to-end approach and infers intents and intent details directly from speech. Rhino is voice-based and does not support text-based services. It charges based on the number of users and offers unlimited interactions per user.
Top comments (4)
Great article! What do you think about IBM Watson? I used their NLU services for my bachelor's thesis.
Glad you enjoyed it.
You're right! Given the limited space, I didn't include IBM Watson in the article. The answer to this question also is "it depends", unfortunately.
Accuracy, ease of use, and total cost of ownership are the criteria people commonly use. [picovoice.ai/blog/selecting-natura...]
Watson NLU might be preferable if you're familiar with the Watson ecosystem - ease of use. However, if you care about privacy, IBM keeps the voice data for training purposes. So, it can be a deal-breaker.
At Picovoice, we also publish open-source benchmarks. You can reproduce them and evaluate the accuracy across various SNR levels. The target environment, noise, users' accents, and distance from the microphone are all important factors for accuracy. Even just for accuracy, I'd say it depends :)
picovoice.ai/docs/benchmark/nlu/#r...
Wow, I didn't know that! Is there a place where you can see which providers keep your data and which don't? Or do I need to look that up in their terms of service? Thanks for the attached articles, I will check out the benchmarks, too.
Not sure, but the fine print, T&Cs, and FAQs do the job and also explain how to opt out when possible.
Will my data be stored by NLU?
By default, all Watson services log requests and their results.
ibm.com/cloud/watson-natural-langu...