Phil Nash for Twilio

Originally published at twilio.com

Text to speech in the browser with the Web Speech API

The Web Speech API has two functions: speech synthesis, otherwise known as text to speech, and speech recognition. With the SpeechSynthesis API we can command the browser to read out any text in a number of different voices.

From vocal alerts in an application to bringing an Autopilot powered chatbot to life on your website, the Web Speech API has a lot of potential for web interfaces. Read on to find out how to get your web application speaking back to you.

What you'll need

If you want to build this application as we learn about the SpeechSynthesis API then you'll need a couple of things:

Once you're ready, create a directory to work in and download this HTML file and this CSS file to it. Make sure they are in the same folder and the CSS file is named style.css. Open the HTML file in your browser and you should see this:

Let's get started with the API by getting the browser to talk to us for the first time.

The Speech Synthesis API

Before we start work with this small application, we can get the browser to start speaking using the browser's developer tools. On any web page, open up the developer tools console and enter the following code:

speechSynthesis.speak(new SpeechSynthesisUtterance("Hello, this is your browser speaking."));

Your browser will speak the text "Hello, this is your browser speaking." in its default voice. We can break this down a bit though.

We created a SpeechSynthesisUtterance which contained the text we wanted to be spoken. Then we passed the utterance to the speak method of the speechSynthesis object. This queues up the utterance to be spoken and then starts the browser speaking. If you send more than one utterance to the speak method they will be spoken one after another.
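That queueing behaviour can be seen with a small sketch. The helper below is not part of the app we're building; it takes the synthesis object and an utterance factory as parameters (an assumption made here purely so the queueing logic is visible in one place), and in the browser you would pass `window.speechSynthesis` and the real `SpeechSynthesisUtterance` constructor:

```javascript
// Sketch: speechSynthesis.speak() does not interrupt what is already
// playing; each call appends an utterance to a queue and the browser
// speaks them one after another.
function sayAll(phrases, synth, makeUtterance) {
  phrases.forEach(phrase => synth.speak(makeUtterance(phrase)));
}

// Browser usage (assumed globals):
// sayAll(
//   ['First sentence.', 'Second sentence.'],
//   window.speechSynthesis,
//   text => new SpeechSynthesisUtterance(text)
// );
```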

Let's take the starter code we downloaded earlier and turn this into a small app where we can input the text to be spoken and choose the voice that the browser says it in.

Speech Synthesis in a web application

Open up the HTML file you downloaded earlier in your text editor. We'll start by connecting the form up to speak whatever you enter in the text input when you submit. Later, we'll add the ability to choose the voice to use.

Between the <script> tags at the bottom of the HTML we'll start by listening for the DOMContentLoaded event and then selecting some references to the elements we'll need.

<script>
  window.addEventListener('DOMContentLoaded', () => {
    const form = document.getElementById('voice-form');
    const input = document.getElementById('speech');
  });
</script>

We then need to listen to the submit event on the form and when it fires, grab the text from the input. With that text we'll create a SpeechSynthesisUtterance and then pass it to speechSynthesis.speak. Finally, we empty the input box and wait for the next thing to say.

<script>
  window.addEventListener('DOMContentLoaded', () => {
    const form = document.getElementById('voice-form');
    const input = document.getElementById('speech');

    form.addEventListener('submit', event => {
      event.preventDefault();
      const toSay = input.value.trim();
      const utterance = new SpeechSynthesisUtterance(toSay);
      speechSynthesis.speak(utterance);
      input.value = '';
    });
  });
</script>

Open the HTML in your browser and enter some text in the input. You can ignore the <select> box at this point; we'll use that in the next section. Hit "Say it" and listen to the browser read out your words.

It's not much code to get the browser to say something, but what if we want to pick the voice that it uses? Let's populate the dropdown on the page with the available voices and use it to select the one we want.

Picking voices for text to speech

We need to get references to the <select> element on the page and initialise a couple of variables we'll use to store the available voices and the current voice we are using. Add this to the top of the script:

<script>
  window.addEventListener('DOMContentLoaded', () => {
    const form = document.getElementById('voice-form');
    const input = document.getElementById('speech');
    const voiceSelect = document.getElementById('voices');
    let voices;
    let currentVoice;

    form.addEventListener('submit', event => { /* ... */ });
  });
</script>

Next up we need to populate the select element with the available voices. We'll create a new function to do this, as we might want to call it more than once (more on that in a bit). We can call speechSynthesis.getVoices() to return the available SpeechSynthesisVoice objects.

Whilst we are populating the voice options we should also detect the currently selected voice. If we have already chosen a voice we can check against our currentVoice object and if we haven't yet chosen a voice then we can detect the default voice with the voice.default property.

    let voices;
    let currentVoice;

    const populateVoices = () => {
      const availableVoices = speechSynthesis.getVoices();
      voiceSelect.innerHTML = '';

      availableVoices.forEach(voice => {
        const option = document.createElement('option');
        let optionText = `${voice.name} (${voice.lang})`;
        if (voice.default) {
          optionText += ' [default]';
          if (typeof currentVoice === 'undefined') {
            currentVoice = voice;
            option.selected = true;
          }
        }
        if (currentVoice === voice) {
          option.selected = true;
        }
        option.textContent = optionText;
        voiceSelect.appendChild(option);
      });
      voices = availableVoices;
    };

    form.addEventListener('submit', event => { /* ... */ });

We can call populateVoices straight away. Some browsers load the voices on page load and will return their list straight away. Other browsers need to load their list of voices asynchronously and will emit a "voiceschanged" event once they have loaded. Some browsers do not emit this event at all, though.

To account for all the potential scenarios we'll call populateVoices immediately and also set it as the callback to the "voiceschanged" event.

      voices = availableVoices;
    };

    populateVoices();
    speechSynthesis.onvoiceschanged = populateVoices;

    form.addEventListener('submit', event => { /* ... */ });
  });
</script>

Reload the page and you will see the <select> element populated with all the available voices, including the language the voice supports. We haven't hooked up selecting and using the voice yet though, that comes next.

Listen to the "change" event of the select element and whenever it is fired, select the currentVoice using the selectedIndex of the <select> element.

    populateVoices();
    speechSynthesis.onvoiceschanged = populateVoices;

    voiceSelect.addEventListener('change', event => {
      const selectedIndex = event.target.selectedIndex;
      currentVoice = voices[selectedIndex];
    });

    form.addEventListener('submit', event => { /* ... */ });
  });

Now, to use the voice with the speech utterance we need to set the voice on the utterance that we create.

    form.addEventListener('submit', event => {
      event.preventDefault();
      const toSay = input.value.trim();
      const utterance = new SpeechSynthesisUtterance(toSay);
      utterance.voice = currentVoice;
      speechSynthesis.speak(utterance);
      input.value = '';
    });
  });
</script>

Reload the page and play around selecting different voices and saying different things.

Bonus: build a visual speaking indicator

We've built a speech synthesiser that can use different voices, but I wanted to throw one more thing in for fun. Speech utterances emit a number of events that you can use to make your application respond to speech. To finish this little app off we're going to make an animation show as the browser is speaking. I've already added the CSS for the animation so to activate it we need to add a "speaking" class to the <main> element while the browser is speaking.

Grab a reference to the <main> element at the top of the script:

<script>
  window.addEventListener('DOMContentLoaded', () => {
    const form = document.getElementById('voice-form');
    const input = document.getElementById('speech');
    const voiceSelect = document.getElementById('voices');
    let voices;
    let currentVoice;
    const main = document.getElementsByTagName('main')[0];

Now, we can listen to the start and end events of the utterance to add and remove the "speaking" class. But, if we remove the class in the middle of the animation it won't fade out smoothly, so we should listen for the end of the animation's iteration, using the "animationiteration" event, and then remove the class.

    form.addEventListener('submit', event => {
      event.preventDefault();
      const toSay = input.value.trim();
      const utterance = new SpeechSynthesisUtterance(toSay);
      utterance.voice = currentVoice;
      utterance.addEventListener('start', () => {
        main.classList.add('speaking');
      });
      utterance.addEventListener('end', () => {
        main.addEventListener(
          'animationiteration',
          () => main.classList.remove('speaking'),
          { once: true }
        );
      });
      speechSynthesis.speak(utterance);
      input.value = '';
    });
  });
</script>

Now when you start the browser talking the background will pulse blue and when the utterance is over it will stop.
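The "start" and "end" events aren't the only ones utterances emit. As an aside not covered by the app above, many browsers also fire a "boundary" event at each word boundary, with event.charIndex pointing into the utterance text. A hedged sketch, with the logic pulled into plain functions (attachBoundaryLogger and wordAt are names invented here, not part of the API) so it can run outside a browser:

```javascript
// Sketch: word-level progress using the utterance's "boundary" event.
// wordAt returns the word starting at charIndex in the spoken text.
function wordAt(text, charIndex) {
  const match = text.slice(charIndex).match(/^\S+/);
  return match ? match[0] : '';
}

// Wire a callback to each word boundary the engine reports.
function attachBoundaryLogger(utterance, onWord) {
  utterance.addEventListener('boundary', event => {
    onWord(wordAt(utterance.text, event.charIndex));
  });
}

// Browser usage (assumed):
// const u = new SpeechSynthesisUtterance(toSay);
// attachBoundaryLogger(u, word => console.log('speaking:', word));
// speechSynthesis.speak(u);
```

Support for "boundary" varies by browser and voice, so treat it as a progressive enhancement rather than something to rely on.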

Your browser is getting chatty

In this post you've seen how to get started and work with the Speech Synthesis API from the Web Speech API. All the code for this application can be found on GitHub and you can see it in action or remix it on Glitch.

I'm excited about the potential of this API for building my own in browser bots, so look out for more of this in the future.

Have you used the Speech Synthesis API or have any plans for it? I'd love to hear in the comments below, or drop me a note at philnash@twilio.com or on Twitter at @philnash.

Top comments (8)

Jan Küster

Good introduction. The engine is great for making text based UIs more accessible. However, many technical and special terms are just not really understandable and it's worse in other languages than English.

I would love to see a follow-up article on creating or improving voices that can be used as custom voices.

Phil Nash

I'm intrigued, do you have examples of terms which are not understandable when read by a voice like this?

Aaron

Great article Phil! I have heard slight mispronunciations at times in certain languages, but overall I think it's amazing and such an underused feature of the web!

Phil Nash

I would always expect some mispronunciation, given that it is a computer generated voice. I have a couple of roads around my house that Google Maps directions cannot pronounce (they're not even that hard) and that keeps me amused.

It is a cool, and surprisingly well supported, feature. Thanks!

Mohammad Fazel

Sounds gripping! Will take the time to work on it...

Dennis m

Great tutorial! I managed to put together a voice assistant using the first line of code with my existing text based chatbot, thanks!

Younouss 🇸🇳 🇩🇪

Thanks for this great article !
My chrome navigator said nothing after a few uses of the api, I don't know why

Phil Nash

Oh that's weird. The thing with this API is that the browser actually hands off the speech synthesis to the operating system, so there's communication between the two that could break down. Also, there have been some issues on Windows and Linux with Chrome and speech samples that are longer than 15 seconds. Could that be what caused your problem?
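For readers hitting the long-utterance cutoff mentioned above: a widely shared community workaround (not an official fix, and the 14-second interval is an assumption chosen to sit just under the reported cutoff) is to nudge the engine with pause()/resume() while it is speaking. A sketch, with the synthesis object passed in so the logic is testable outside a browser:

```javascript
// One "nudge": only touch the engine while it is actually speaking.
function resumeTick(synth) {
  if (synth.speaking && !synth.paused) {
    synth.pause();
    synth.resume();
    return true;
  }
  return false;
}

// Start nudging on an interval; returns a function that stops it.
function startKeepAlive(synth, intervalMs = 14000) {
  const timer = setInterval(() => resumeTick(synth), intervalMs);
  return () => clearInterval(timer);
}

// Browser usage (assumed): start before speaking a long utterance,
// and stop when the utterance ends.
// const stop = startKeepAlive(window.speechSynthesis);
// utterance.addEventListener('end', stop);
// speechSynthesis.speak(utterance);
```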