DEV Community

Jan Küster

Cross browser speech synthesis - the hard way and the easy way

When I implemented my first speech-synthesis app using the Web Speech API, I was shocked at how hard it was to set up and run with cross-browser support in mind:

  • Some browsers don't support speech synthesis at all, for instance IE (at least I don't care 🤷‍♂️) and Opera (I do care 😠) and a few more mobile browsers (I haven't decided yet, whether I care or not 🤔).
  • On top of that, each browser implements the API differently or with some specific quirks the other browsers don't have.

Just try it yourself: run the MDN speech synthesis example on different browsers and different platforms:

  • Linux, Windows, MacOS, BSD, Android, iOS
  • Firefox, Chrome, Chromium, Safari, Opera, Edge, IE, Samsung Browser, Android Webview, Safari on iOS, Opera Mini

You will realize that this example only works on a subset of these platform-browser combinations. Worse: once you start researching, you'll be shocked at how quirky and underdeveloped this whole API still is in 2021/2022.

To be fair: it is still labeled as experimental technology. However, it has been almost 10 years since it was drafted, and it is still not a living standard.

This makes it much harder to leverage for our applications, and I hope this guide will help you get the most out of it for as many browsers as possible.


Minimal example

Let's approach this topic step-by-step and start with a minimal example that all browsers (that generally support speech synthesis) should run:

if ('speechSynthesis' in window) {
  window.speechSynthesis.speak(
    new SpeechSynthesisUtterance('Hello, world!')
  )
}

You can simply copy that code and execute it in your browser console.

If you have basic support you will hear some "default" voice speaking the text 'Hello, world!' and it may sound natural or not, depending on the default "voice" that is used.


Loading voices

Browsers may detect your current language and select a default voice, if installed. However, this may not represent the desired language you'd like to hear for the text to be spoken.

In that case you need to load the list of voices, which are instances of SpeechSynthesisVoice. This is the first major obstacle, where browsers behave quite differently:

Load voices sync-style

const voices =  window.speechSynthesis.getVoices()
voices // Array of voices or empty if none are installed

Firefox and Safari Desktop load the voices immediately in sync-style. However, this returns an empty array on Chrome Desktop and Chrome Android, and may return an empty array on Firefox Android (see next section).

Load voices async-style

window.speechSynthesis.onvoiceschanged = function () {
  const voices = window.speechSynthesis.getVoices()
  voices // Array of voices or empty if none are installed
}

This method loads the voices asynchronously, so your overall system needs a callback or a Promise wrapper. Firefox Desktop does not support this method, even though onvoiceschanged is defined as a property of window.speechSynthesis; Safari does not define it at all.

In contrast, Firefox Android loads the voices via this method on the first visit and, after a refresh, has them available via the sync-style method.
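The sync-style and async-style approaches can be combined behind a single Promise. The following is only a minimal sketch, not a battle-tested loader: the name loadVoicesAsync and the parameters synth and timeoutMs are made up for this illustration. It tries getVoices() first, then waits for onvoiceschanged, and rejects if nothing shows up in time.

```javascript
// Sketch: resolve with the voice list, whichever way the browser delivers it.
// `synth` and `timeoutMs` are illustrative parameters, not part of the Web Speech API.
const loadVoicesAsync = (synth = globalThis.speechSynthesis, timeoutMs = 2000) =>
  new Promise((resolve, reject) => {
    // sync-style: some browsers answer right away
    const immediate = synth.getVoices()
    if (immediate.length > 0) return resolve(immediate)

    // give up after a while, since some platforms never report any voices
    const timer = setTimeout(() => {
      synth.onvoiceschanged = null
      reject(new Error('no voices loaded within ' + timeoutMs + 'ms'))
    }, timeoutMs)

    // async-style: fired once the voice list changes
    synth.onvoiceschanged = () => {
      const voices = synth.getVoices()
      if (voices.length === 0) return // still empty, keep waiting
      clearTimeout(timer)
      synth.onvoiceschanged = null
      resolve(voices)
    }
  })
```

Passing the synth object in as a parameter is a design choice for this sketch: it keeps the function usable with a stub outside the browser. In a page you would simply call loadVoicesAsync().then(voices => ...).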

Loading using interval

Some users of older Safari versions have reported that their voices are not available immediately (and onvoiceschanged is not available either). For this case we need to check for the voices at a constant interval:

let timeout = 0
const maxTimeout = 2000
const interval = 250

const loadVoices = (cb) => {
  const voices = speechSynthesis.getVoices()

  if (voices.length > 0) {
    return cb(undefined, voices)
  }

  if (timeout >= maxTimeout) {
    return cb(new Error('loadVoices max timeout exceeded'))
  }

  timeout += interval
  setTimeout(() => loadVoices(cb), interval)
}

loadVoices((err, voices) => {
  if (err) return console.error(err)

  voices // voices loaded and available
})

Speaking with a certain voice

There are use cases where the default voice does not match the language of the text to be spoken. In that case we need to change the voice for the "utterance" to speak.

Step 1: get a voice by a given language

// assume voices are loaded, see previous section
const getVoicebyLang = lang => speechSynthesis
  .getVoices()
  .find(voice => voice.lang.startsWith(lang)) // compare the voice's lang code, not the voice object

const german = getVoicebyLang('de')

Note: Voices have standard language codes, like en-GB or en-US or de-DE. However, on Android's Samsung Browser or Android Chrome voices have underscore-connected codes, like en_GB.

On Firefox Android, voices have three characters before the separator, like deu-DEU-f00 or eng-GBR-f00.

However, they all start with the language code so passing a two-letter short-code should be sufficient.
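If a plain prefix check turns out to be too loose (for example, 'fi' is also a prefix of 'fil-PH'), the primary language part can be compared instead. The following is a rough sketch under assumptions: the helper names primaryLang and matchesLang are invented here, and the three-letter table is a tiny sample that you would need to extend for your own languages.

```javascript
// Sketch: compare only the primary language part of a `lang` value.
// Handles 'de-DE', 'de_DE' and the three-letter 'deu-DEU-f00' style.
// The three-letter table is a hypothetical, incomplete sample.
const threeToTwo = { deu: 'de', eng: 'en', fra: 'fr', spa: 'es' }

const primaryLang = (code) => {
  // take everything before the first '-' or '_', lowercased
  const primary = String(code || '').toLowerCase().split(/[-_]/)[0]
  return threeToTwo[primary] || primary
}

const matchesLang = (voice, lang) =>
  primaryLang(voice.lang) === primaryLang(lang)

// usage, assuming voices are loaded:
// speechSynthesis.getVoices().find(voice => matchesLang(voice, 'de'))
```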

Step 2: create a new utterance

We can now pass the voice to a new SpeechSynthesisUtterance and as your precognitive abilities correctly manifest - there are again some browser-specific issues to consider:

const text = 'Guten Tag!'
const utterance = new SpeechSynthesisUtterance(text)

if (utterance.text !== text) {
  // I found no browser yet that does not support text
  // as constructor arg but who knows!?
  utterance.text = text
}

utterance.voice = german // required on iOS
utterance.lang = german.lang // required on Android Chrome
utterance.voiceURI = german.voiceURI // Who knows if required?

utterance.pitch = 1
utterance.volume = 1

// API allows up to 10 but values > 2 break on all Chrome
utterance.rate = 1

We can now pass the utterance to the speak function as a preview:

speechSynthesis.speak(utterance) // speaks 'Guten Tag!' in German

Step 3: add events and speak

This is of course only half of it. We actually want deeper insight into what's happening and what's missing by tapping into some of the utterance's events:

const handler = e => console.debug(e.type)

utterance.onstart = handler
utterance.onend = handler
utterance.onerror = e => console.error(e)

// SSML markup is rarely supported
// See: https://www.w3.org/TR/speech-synthesis/
utterance.onmark = handler

// word boundaries are supported by
// Safari MacOS and on windows but
// not on Linux and Android browsers
utterance.onboundary = handler

// not supported / fired
// on many browsers somehow
utterance.onpause = handler
utterance.onresume = handler

// finally speak and log all the events
speechSynthesis.speak(utterance)

Step 4: Chrome-specific fix

Longer texts on Chrome-Desktop will be cancelled automatically after 15 seconds. This can be fixed either by chunking the texts or by calling a "zero"-latency pause/resume combination at a regular interval. At the same time this fix breaks on Android, since Android devices don't implement speechSynthesis.pause() as pause but as cancel:

let timer

utterance.onstart = () => {
  // Android detection is left to you here,
  // as it is a huge topic of its own
  if (!isAndroid) {
    resumeInfinity(utterance)
  }
}

const clear = () => {  clearTimeout(timer) }

utterance.onerror = clear
utterance.onend = clear

const resumeInfinity = (target) => {
  // prevent memory-leak in case utterance is deleted, while this is ongoing
  if (!target && timer) { return clear() }

  speechSynthesis.pause()
  speechSynthesis.resume()

  timer = setTimeout(function () {
    resumeInfinity(target)
  }, 5000)
}
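The chunking alternative avoids pause/resume entirely, which may make it the safer choice on Android. Here is a minimal sketch, assuming that queueing several short utterances keeps each one below the cutoff; the names chunkText and speakChunked and the maxLength default are made up for this example, and the sentence splitting is deliberately naive.

```javascript
// Sketch: split text into chunks of at most `maxLength` characters,
// preferring sentence boundaries. `maxLength` is an assumed value.
// A single sentence longer than `maxLength` is kept as one chunk.
const chunkText = (text, maxLength = 160) => {
  // a "sentence" is a run of text followed by optional . ! ? and whitespace
  const sentences = text.match(/[^.!?]+[.!?]*\s*/g) || []
  const chunks = []
  let current = ''

  for (const sentence of sentences) {
    // start a new chunk when adding this sentence would exceed the limit
    if (current && (current + sentence).length > maxLength) {
      chunks.push(current.trim())
      current = ''
    }
    current += sentence
  }

  if (current.trim()) chunks.push(current.trim())
  return chunks
}

// speak() appends each utterance to the synthesizer's queue
const speakChunked = (text, synth = globalThis.speechSynthesis) => {
  chunkText(text).forEach(chunk => {
    synth.speak(new SpeechSynthesisUtterance(chunk))
  })
}
```

Note that events such as onend then fire once per chunk, so attach the "everything is done" logic only to the last utterance.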

Furthermore, some browsers don't update the speechSynthesis.paused property when speechSynthesis.pause() is executed (and speech is correctly paused). You then need to manage these states yourself.
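Managing the state yourself could look like the following sketch. It is only an illustration under assumptions: createSpeechState and its state names are invented here, and it does not cover Android, where pause() behaves like cancel.

```javascript
// Sketch: keep our own state instead of trusting speechSynthesis.paused.
// `createSpeechState` and the state names are illustrative, not a browser API.
const createSpeechState = (synth = globalThis.speechSynthesis) => {
  let state = 'idle' // 'idle' | 'speaking' | 'paused'
  const reset = () => { state = 'idle' }

  return {
    getState: () => state,
    speak (utterance) {
      // addEventListener leaves the caller's own on* handlers untouched
      utterance.addEventListener('start', () => { state = 'speaking' })
      utterance.addEventListener('end', reset)
      utterance.addEventListener('error', reset)
      synth.speak(utterance)
    },
    pause () {
      if (state !== 'speaking') return // nothing to pause
      synth.pause()
      state = 'paused'
    },
    resume () {
      if (state !== 'paused') return // nothing to resume
      synth.resume()
      state = 'speaking'
    },
    cancel () {
      synth.cancel()
      reset()
    }
  }
}
```

Your UI would then read getState() instead of speechSynthesis.paused when deciding which button to show.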


Issues that can't be fixed with JavaScript:

All the above fixes rely on JavaScript, but some issues are platform-specific. You need to design your app in a way that avoids these issues, where possible:

  • All browsers on Android actually do a cancel/stop when calling speechSynthesis.pause; pause is simply not supported on Android 👎
  • There are no voices on Chromium-Ubuntu and Ubuntu-derivatives unless the browser is started with a flag 👎
  • On Chromium-Desktop on Ubuntu, if the very first page you open uses speech synthesis, no voices are loaded until the page is refreshed or a new page is opened. This can be worked around with JavaScript, but auto-refreshing the page can lead to very bad UX. 👎
  • If voices are not installed on the host-OS and there are no voices loaded from remote by the browser, then there are no voices and thus no speech synthesis 👎
  • There is no chance to just instant-load custom voices from remote and use them as a shim in case there are no voices 👎
  • If the installed voices are just bad, users have to manually install better voices 👎

Making your life easier with EasySpeech

Now you have seen the worst and, believe me, it takes ages to implement all potential fixes.

Fortunately, I already did this and published a package to NPM with the intent to provide a common API that handles most issues internally and provides the same experience across browsers (that support speechSynthesis):

jankapunkt / easy-speech

Cross browser Speech Synthesis

Easy Speech


Cross browser Speech Synthesis; no dependencies.

This project was created, because it's always a struggle to get the synthesis part of Web Speech API running on most major browsers.

Note: this is not a polyfill package, if your target browser does not support speech synthesis or the Web Speech API, this package is not usable.

Install

Install from npm via

$ npm install easy-speech

Usage

Import EasySpeech and first, detect, if your browser is capable of tts (text to speech):

import EasySpeech from 'easy-speech'
EasySpeech.detect()

It returns an Object with the following information:

{
  speechSynthesis: SpeechSynthesis|undefined,
  speechSynthesisUtterance: SpeechSynthesisUtterance|undefined,
  speechSynthesisVoice: SpeechSynthesisVoice|undefined,
  speechSynthesisEvent: SpeechSynthesisEvent|undefined,
  speechSynthesisErrorEvent: SpeechSynthesisErrorEvent|undefined,
  onvoiceschanged: Boolean,
  onboundary: Boolean,
  onend: Boolean,
  onerror: Boolean,
  onmark: Boolean,

You should give it a try the next time you want to implement speech synthesis. It also comes with a demo page so you can easily test and debug your devices there: https://jankapunkt.github.io/easy-speech/

Let's take a look at how it works:

import EasySpeech from 'easy-speech'

// sync, returns Object with detected features
EasySpeech.detect()

// async, resolves once speech synthesis is ready to use
EasySpeech.init()
  .then(() => {
    EasySpeech.speak({ text: 'Hello, world!' })
  })
  .catch(e => console.error('no speech synthesis:', e.message))

It not only detects which features are available but also loads an optimal default voice, based on a few heuristics.

Of course there is much more to use and the full API is also documented via JSDoc: https://github.com/jankapunkt/easy-speech/blob/master/API.md

If you like it, leave a star, and please file an issue if you find (yet another) browser-specific issue.



Top comments (10)

Balthasar Huber

Hi Jan,
thanks for this post and for creating easy-speech. I am really glad I found this article before diving deep into the Web Speech API.

On iOS, I noticed that only the following german languages are available (see this post).
However, all of them are completely broken and not usable for german language, even though they are tagged with "de". I saw that you are german, so I was wondering if you maybe managed to make a german voice work on iOS? On MacOS, Anna DE seems to work fairly well.

Also, I created typings for easy-speech, so if you are interested I could make a PR.

Best,
Balthasar

Balthasar Huber • Edited

Addition:
If I try this demo and this demo on Chrome-iOS, the german voices seem pretty broken.
However, on this demo, even though they sound not great at least they seem to work. Don't know how and why this is possible.

Jan Küster

Hi Balthasar, sorry for late response, smh I get no notifications using the dev.to app....

Basically, the voices are a system issue we actually can't solve with JavaScript. Some browsers provide additional (remote) voices, such as Chrome with the Google voices. MacOS and iOS are sometimes different even between minor updates and you have to install additional voices on the OS level to reach a decent TTS experience.
I have no iOS device available currently so I can't really tell which voice we used last time.
To compare things:
On most Linux distros you will find only chrome to have decent German voices, basically because it's using the remote google voices.
I really hope the vendors will improve on all this in the near future.
My own mid term goal is to provide my own speech service voices as fallback but I found that I can't implement the SpeechSynthesisVoice interface and simply load my custom voice from my server.
However, any improvement from my end will directly be added to EasySpeech.

Regarding typings I'm curious for your PR! Would this be a types.ts file or a TS rewrite? I have only surface-level knowledge of Typescript so I can't really review a full rewrite. Let's continue on GitHub regarding the TS Integration.

Thanks and all the best

Balthasar Huber

No worries!
Thanks for your insights, really interesting. "Remote google voices" sounds great tbh, maybe it isn't a thing on iOS, as Chrome and all other browsers on iOS are WebKit and not Chromium based, but it's just a guess. Too bad that there isn't much we can do to improve it right now.

I created types for the exported functions and classes in a separate typings file. I will create a PR in the upcoming days and we can further discuss on GitHub :-).

Jan Küster

Great! Looking forward to your PR!

DariBerrie

I know you posted this a while ago, but I ran into the same issue while trying to create a more accessible web app using SpeechSynthesis to help confirm certain actions.

It was a pain realizing that the voices provided in Chrome were not the same as Safari (let alone any other browsers). After reading through the issues you faced, it was a nice surprise to see your EasySpeech package at the end! Thank you! 🚀

Jan Küster

Thank you! If you include EasySpeech in your app and encounter any issues, don't hesitate to open an issue on its GitHub page or reach out to me here on dev.to. All the best

Abel Misiocha

Great project!

SuNcO • Edited

Using your code or the API directly, the iOS voices for es-MX are very ugly 😕

Jan Küster

Thanks, the quality of the voice is unfortunately vendor-specific. You may need to install voices on a system level in order to get a decent output.