Gilad David Maayan
Building an AI Chatbot with Web Speech API and Node.js

Cover image: Pexels

The Web Speech API allows you to incorporate voice data into web apps. It has two parts: SpeechSynthesis (text-to-speech) and SpeechRecognition (asynchronous speech recognition). This tutorial covers the steps to follow when using the Web Speech API and Node.js to create an AI-based voice chat UI in a web browser. Once completed, the app should be able to understand the user's voice and reply appropriately using a synthetic voice.

Although the Web Speech API has been around for a while in Chrome and Firefox, the functionality is only partially available in Safari and other browsers. Safari, for instance, only supports SpeechSynthesis and not SpeechRecognition. You can view the list of supported browsers on caniuse.com.
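If you want to check support at runtime before wiring anything up, a minimal feature-detection sketch looks like this (the webkit prefix covers Chrome) –

if (!('SpeechRecognition' in window || 'webkitSpeechRecognition' in window)) {
  console.warn('SpeechRecognition is not supported in this browser.');
}
if (!('speechSynthesis' in window)) {
  console.warn('SpeechSynthesis is not supported in this browser.');
}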

Building the web app takes us through three major steps –

  1. Using the Speech Recognition interface in the Web Speech API to listen to the voice of the user
  2. Transmitting the user’s message to a commercial natural language processing API as a text string
  3. Using the Speech Synthesis interface to lend a synthetic voice to the response text received from API.ai.

Additionally, the complete source code for this tutorial can be found on GitHub.

This tutorial uses Node.js extensively. It is important that you are familiar with JavaScript and also possess a basic understanding of Node.js. You need to have Node.js installed on your system.
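You can verify the installation from a terminal –

$ node -v
$ npm -v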

Setting Up Your Node.js Application

To begin, use Node.js to set up a web app framework. You can create your app directory and build your app structure as below –

├── index.js
├── public
│   ├── css
│   │   └── style.css
│   └── js
│       └── script.js
└── views
    └── index.html 

In this article, we will be keeping things simple. You can also try the Express generator, which creates the scaffolding for a full app with numerous JavaScript files, Jade templates, and sub-directories for various purposes.
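For reference, the generator can be run with a single command; this scaffolds a full app, unlike the minimal structure used in this tutorial –

$ npx express-generator myapp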

To initialize your Node.js app, you can now run the following command –

$ npm init -f

The -f flag ensures the default settings are accepted; without it, you can configure the app manually. This generates a package.json file, which contains the basic information about your app. Once done, you can install the dependencies needed to build out your app –

$ npm install express socket.io apiai --save

Once the --save flag is added, your package.json file is automatically updated with the relevant dependencies.

{
 "name": "ai-bot",
 "version": "1.0.0",
 "description": "",
 "main": "index.js",
 "dependencies": {
   "apiai": "^4.0.3",
   "express": "^4.16.3",
   "Socket.io": "^2.1.1"
 },
 "devDependencies": {},
 "scripts": {
   "test": "echo \"Error: no test specified\" && exit 1"
 },
 "keywords": [],
 "author": "",
 "license": "ISC"
}


In order to run the server locally, we are going to use Express, a Node.js web application framework. Express lets you write middleware, which is a great pattern for building servers. To allow for real-time bidirectional communication between the browser and the server, we'll be using Socket.io. For the uninitiated, it is essentially a library that lets you use WebSocket efficiently with Node.js.
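As a quick illustration of the middleware pattern (an example only, not needed for this app), a request logger is just a function that receives the request, the response, and a next callback –

// Example only: log every incoming request, then pass control on
app.use((req, res, next) => {
  console.log(`${req.method} ${req.url}`);
  next();
});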

The final piece of the puzzle is Dialogflow's legacy Node.js client, called apiai, which will make our language bot intelligent. Because it is simple and intuitive, you will not need any natural language processing or deep learning expertise.

By establishing a socket connection between the server and the client, chat messages will be transmitted as soon as text data is returned by the Web Speech API or by the API.ai API.

The next step is to create an index.js file and then instantiate Express.
Here’s the basic version of our express server –

const express = require('express')
const app = express()
const port = 3000

app.use(express.static(__dirname + '/views')); // html
app.use(express.static(__dirname + '/public')); // js, css, images

// Any request to '/' renders index.html
app.get('/', (req, res) => {
  res.sendFile('index.html', { root: __dirname + '/views' });
});

// Start listening on the port, keeping a reference to the
// HTTP server so Socket.io can attach to it later
const server = app.listen(port, () => console.log(`Example app listening on port ${port}!`))
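You can now start the server and open http://localhost:3000 in your browser –

$ node index.js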

Receiving Speech with the Speech Recognition Interface

SpeechRecognition is the main controller interface of the Web Speech API. It is used to receive the user's speech via a microphone.

Creating the User Interface

The app we are attempting to build has a relatively simple UI. It basically consists of a button to trigger voice recognition.

To begin, we will need to set up our 'index.html' file and include the front-end JavaScript file, script.js, along with Socket.io, which will be used later for real-time communication –

<html lang="en">
  <head>...</head>
  <body>
    ...
    <script src="https://cdnjs.cloudflare.com/ajax/libs/Socket.io/2.0.1/Socket.io.js"></script>
    <script src="js/script.js"></script>
  </body>
</html>

This can be followed by adding a button interface in the body of the HTML –

<button>Talk</button>   

Capturing Voice with JavaScript

From within script.js, you can create an instance of SpeechRecognition, the controller interface of the Web Speech API that manages voice recognition –

const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;

const recognition = new SpeechRecognition();

Both prefixed and non-prefixed objects have been included because Chrome currently supports the API only with prefixed properties.

Additionally, ECMAScript 6 syntax is used here, including const and arrow functions, because it is available in the browsers that support both Speech API interfaces, SpeechRecognition and SpeechSynthesis.

Users also have the option of setting different properties to customize speech recognition –

recognition.lang = 'en-US';
recognition.interimResults = false;
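Two other properties worth knowing about are continuous and maxAlternatives; the values shown here are the defaults –

recognition.continuous = false;  // stop listening after a single result
recognition.maxAlternatives = 1; // number of alternative transcripts to return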

Once done, the DOM reference for the button UI can be captured and you can listen for the click event that signals the initiation of speech recognition -

document.querySelector('button').addEventListener('click', () => {
  recognition.start();
});

After the speech recognition has begun, you can use the result event to request the content of the text -

recognition.addEventListener('result', (e) => {
  let last = e.results.length - 1;
  let text = e.results[last][0].transcript;

  console.log('Confidence: ' + e.results[0][0].confidence);

  // We will use the Socket.io here later...
});

The result event returns a SpeechRecognitionResultList object; the transcript of the latest result can be retrieved from the last entry of that list.
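Recognition can also fail, for instance when no speech is detected or microphone access is denied, so it is worth handling the error and end events as well. A minimal sketch –

recognition.addEventListener('error', (e) => {
  console.error('Speech recognition error: ' + e.error);
});

recognition.addEventListener('end', () => {
  console.log('Speech recognition has stopped.');
});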

In the next step, we’ll use Socket.io to transmit the result to the server code.

Real-Time Communication with Socket.io

Socket.io is essentially a library used for web-based real-time applications. It allows real-time bidirectional communication between servers and web clients. Here, we will attempt to use Socket.io to pass the result from the browser to the Node.js code, followed by passing the response back to our browser.

We use WebSocket via Socket.io instead of HTTP or AJAX because sockets are the preferred solution when it comes to bidirectional communication, especially when pushing an event between a browser and a server. Using a continuous connection, we would not need to reload the browser or send AJAX requests at frequent intervals.

First, instantiate Socket.io in script.js –

const socket = io();

Then, insert this line into the result event handler of the SpeechRecognition code above –

socket.emit('chat message', text);
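Putting the two pieces together, the result event handler in script.js now looks like this –

recognition.addEventListener('result', (e) => {
  let last = e.results.length - 1;
  let text = e.results[last][0].transcript;

  // Send the recognized text to the server over the socket
  socket.emit('chat message', text);
});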

Now we return to Node.js to receive the text and use AI to formulate a reply to the user.

Integrating AI with our application

A number of different services and platforms allow for the integration of an app with an AI system via speech-to-text and natural language processing. These include Microsoft’s LUIS, IBM’s Watson and Wit.ai. There are also more sophisticated deep learning platforms for more demanding projects.

To keep things simple for this tutorial, we will use Dialogflow, as it provides a free developer account and lets you set up a small-talk system quickly via its Node.js library and web interface. If you haven't heard of Dialogflow before, here's an excerpt from Wikipedia:

Dialogflow (formerly API.ai, Speaktoit) is a Google-owned developer of human–computer interaction technologies based on natural language conversations. The company is best known for creating the Assistant (by Speaktoit), a virtual buddy for Android, iOS, and Windows Phone smartphones that performs tasks and answers users' questions in a natural language.[1] Speaktoit has also created a natural language processing engine that incorporates conversation context like dialogue history, location and user preferences.

Setting up DialogFlow

To set up Dialogflow, you'll need to create a Dialogflow account.

After creating an account, you need to create an "agent". The Getting Started guide covers all the relevant details.

Fill in the details for the agent.

Rather than opting for the complete customization method and creating entities and intents, you can just click Small Talk in the left menu.

You can then toggle the switch to enable the service.

You can customize your small-talk agent to your preference via the Dialogflow interface.

To use the API with our Node.js application, you'll need to go to the 'General Settings' page (click on the cog icon beside your agent's name in the menu) and retrieve your API key. The 'client access token' will be required by the Node.js SDK. Make sure you choose v1 rather than v2; although the v2 API is maturing, we'll stick with v1 for the purpose of demonstration.

Using the Dialogflow Node.js SDK

To connect your Node.js app to Dialogflow via its Node.js SDK, go back to your 'index.js' file and initialize API.ai with your access token –

const apiai = require('apiai')(APIAI_TOKEN);

If you just want to run the code locally, you can hardcode your API key at this stage. There are a number of different ways to set environment variables; here we use an .env file to hold these variables.
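One common approach is the dotenv package (an extra dependency, so you would need to install it with npm install dotenv --save). Put the credentials in an .env file at the project root; the values below are placeholders, while the variable names match the ones used in the code –

APIAI_TOKEN=your-client-access-token
APIAI_SESSION_ID=any-unique-string

Then load them at the top of index.js –

require('dotenv').config();
const APIAI_TOKEN = process.env.APIAI_TOKEN;
const APIAI_SESSION_ID = process.env.APIAI_SESSION_ID;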

We now use the server-side Socket.io to receive the result from the browser. Once the connection is established and the message has been received, you can use the Dialogflow API to retrieve a reply to the user's message –

// Attach Socket.io to the HTTP server created earlier
const io = require('socket.io')(server);

io.on('connection', function(socket) {
  socket.on('chat message', (text) => {

    // Get a reply from API.ai

    let apiaiReq = apiai.textRequest(text, {
      sessionId: APIAI_SESSION_ID
    });

    apiaiReq.on('response', (response) => {
      let aiText = response.result.fulfillment.speech;
      socket.emit('bot reply', aiText); // Send the result back to the browser!
    });

    apiaiReq.on('error', (error) => {
      console.log(error);
    });

    apiaiReq.end();

  });
});

Once API.ai generates the result, you can use Socket.io’s socket.emit() function to push it to the browser.

Giving The AI A Voice With The SpeechSynthesis Interface

Now it is time to return to ‘script.js’ and complete the app.
Start by creating a function to generate a synthetic voice, using the SpeechSynthesis controller interface of the Web Speech API. The function takes a string as an argument and makes the browser speak the text –

function synthVoice(text) {
  const synth = window.speechSynthesis;
  const utterance = new SpeechSynthesisUtterance();
  utterance.text = text;
  synth.speak(utterance);
}

There are three steps to this function –

  1. Create a reference to the API entry point, window.speechSynthesis. Unlike SpeechRecognition, this does not need a prefixed property because it is more widely supported.
  2. Create a fresh SpeechSynthesisUtterance() instance via its constructor and set the text to be synthesized. You can also set additional properties, including voice, which selects from the set of voices supported by the browser and the OS (see the sketch below).
  3. Finally, call SpeechSynthesis.speak() to make the browser speak.

You can now use Socket.io to retrieve the response from the server. Once the message arrives, call this function –

socket.on('bot reply', function(replyText) {
  synthVoice(replyText);
});
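If you want to experiment with the voice property mentioned in step 2, here is a minimal sketch. The function and parameter names (synthVoiceWith, voiceName) are illustrative, and note that the available voices differ per browser and OS; getVoices() may return an empty list until the browser has finished loading them –

// Example only: use a named voice if the browser provides it
function synthVoiceWith(text, voiceName) {
  const synth = window.speechSynthesis;
  const utterance = new SpeechSynthesisUtterance(text);
  const voice = synth.getVoices().find((v) => v.name === voiceName);
  if (voice) utterance.voice = voice; // otherwise the default voice is used
  synth.speak(utterance);
}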

You should now be ready to give your AI chatbot a test.

It is important to remember that your browser will ask for permission the first time it uses the microphone. As with other Web APIs such as the Notification API and the Geolocation API, the browser does not access sensitive features unless permission is explicitly granted.
