Text to Speech + Image — A Talkie in JS

Kostia Palchyk Updated on ・4 min read

In the previous part we created a website where users can generate GIF animations using Emoji, a domain-specific language (DSL), and a Canvas. In this post we'll upgrade our animations to talkies!


example animation we achieved in part I


I thought that it'd be funny to create animations where Emoji can talk. I already had Emoji moving around and displaying phrases as text. Obviously it was missing sound. In this article I'll show you how I added it!

tl;dr: try this animation
⚠️ warning: contains sound!


By accident I stumbled upon the "Text To Speech In 3 Lines Of JavaScript" article (thanks, @asaoluelijah !) and those "3 lines" quickly migrated to my project.

const msg = new SpeechSynthesisUtterance();
msg.text = 'Hello World';
window.speechSynthesis.speak(msg);
// ☝️ You can run this in the console, BTW
example taken from Asaolu Elijah's article. Go read it for more API details

Of course, those "3 lines" turned out to be 80. But I'll get to that later.

Text-to-Speech is the part of the browser's Web Speech API that reads text out loud; the API's other part handles speech recognition.
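Since speech synthesis is not available everywhere, here is a minimal sketch of a guarded call. The `sayPhrase` function and its `options` parameter are hypothetical names for illustration; `lang`, `rate` and `pitch` are standard `SpeechSynthesisUtterance` properties.

```javascript
// Minimal sketch: speak a phrase only where speech synthesis exists.
// `sayPhrase` and `options` are hypothetical, not part of the project.
function sayPhrase(text, options = {}) {
  const hasSpeech =
    typeof window !== 'undefined' &&
    'speechSynthesis' in window &&
    typeof SpeechSynthesisUtterance !== 'undefined';

  if (!hasSpeech) {
    return false; // e.g. Node.js or a browser without the API
  }

  const utterance = new SpeechSynthesisUtterance(text);
  utterance.lang = options.lang || 'en-US'; // BCP 47 language tag
  utterance.rate = options.rate || 1;       // 0.1 to 10, 1 is normal speed
  utterance.pitch = options.pitch || 1;     // 0 to 2, 1 is the default
  window.speechSynthesis.speak(utterance);
  return true;
}

sayPhrase('Hello World', { rate: 1.2 });
```

The return value only reports whether the phrase was queued; it says nothing about when the speech ends.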

But before we go further with adding Text-to-Speech to the animation, I need to show you how I rendered the animation in the first place.

Animation and RxJS

After parsing DSL and rendering it to canvas (see part I), I had an array of frames:

[ { image: 'http://.../0.png'
  , phrases: [ 'Hello!' ]
  , duration: 1000
  }
, { image: 'http://.../1.png'
  , phrases: [ 'Hi!' ]
  , duration: 1000
  }
]

Each frame had a rendered image, phrases within it and frame duration.
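As a small illustration of working with this shape, here is a sketch of two helpers. `totalDuration` and `allPhrases` are hypothetical names, and the image paths are placeholders.

```javascript
// Hypothetical helper: nominal length of an animation in milliseconds,
// i.e. the sum of all frame durations (speech time is not included).
function totalDuration(frames) {
  return frames.reduce((sum, frame) => sum + frame.duration, 0);
}

// Hypothetical helper: every phrase in playback order.
function allPhrases(frames) {
  return frames.flatMap(frame => frame.phrases);
}

const sampleFrames = [
  { image: '0.png', phrases: ['Hello!'], duration: 1000 },
  { image: '1.png', phrases: ['Hi!'], duration: 1000 },
];

console.log(totalDuration(sampleFrames)); // 2000
console.log(allPhrases(sampleFrames));    // [ 'Hello!', 'Hi!' ]
```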

To show the animation I used a React component with an RxJS stream inside:

import React, { useState, useEffect } from 'react';
import { from, timer } from 'rxjs';
import { delayWhen, map } from 'rxjs/operators';

function Animation({ frames }) {
  // state for the current frame
  const [frame, setFrame] = useState(null);

  useEffect(() => {
    // turn the array into a stream of frames
    const sub = from(frames).pipe(
      // with each frame delayed by frame.duration
      delayWhen(frame => timer(frame.duration)),
      // mapped to an Image
      map(frame => <img src={frame.image} />)
    )
    // render every emitted element
    .subscribe(setFrame);

    return () => sub.unsubscribe(); // teardown logic
  }, [frames]);

  return frame;
}
to simplify things, I'll use pseudocodish JS throughout the article

Here I use a useEffect hook to create an RxJS Observable and a subscription to it. The from function will iterate over the rendered frames array, delayWhen will delay each frame by frame.duration and map will turn each frame into a new <img /> element. And I can easily loop the animation by simply adding a repeat() operator.

Note that the subscription has to be cancelled at some point (especially with the endless repeat()): the component might be destroyed or the frames might change. So the function passed to the useEffect hook needs to return a teardown callback. In this case I unsubscribe from the animation observable, effectively terminating the flow.
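One detail about the delayWhen step is worth a quick check: from emits the whole array synchronously, and delayWhen measures each delay from the moment its value is emitted, so frames with equal durations would arrive together. The sketch below models both timings in plain JS, without RxJS; `parallelDelays` and `sequentialDelays` are hypothetical names.

```javascript
// Time (ms) at which each frame is emitted if every delay starts at t = 0,
// which is how delayWhen behaves when the source emits synchronously.
function parallelDelays(frames) {
  return frames.map(frame => frame.duration);
}

// Time (ms) at which each frame is emitted if each delay starts only
// after the previous one has finished (e.g. a concatMap-based chain).
function sequentialDelays(frames) {
  let elapsed = 0;
  return frames.map(frame => (elapsed += frame.duration));
}

const timeline = [{ duration: 1000 }, { duration: 1000 }];

console.log(parallelDelays(timeline));   // [ 1000, 1000 ]
console.log(sequentialDelays(timeline)); // [ 1000, 2000 ]
```

If frames should play strictly one after another, a concatMap-based chain like the one used later in this article is probably the closer fit.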

With that covered, we can now discuss the Text-to-Speech!

Text-to-Speech and RxJS

Now I needed to pronounce the text using the Speech API, but the frame.duration delay I used wouldn't work: I had to wait until the phrase was spoken and only then switch to the next frame. Also, if the user edits the scenario or navigates away, I need to stop the current synthesis. Happily, RxJS is ideal for such things!

First I needed to create an Observable wrapper around the Speech Synthesis API:

import { Observable } from 'rxjs';

export function speak(text) {
  return new Observable((observer) => {
    // create and configure the utterance
    const utterance = new SpeechSynthesisUtterance();
    utterance.text = text;

    // subscribe our observer to utterance events
    utterance.onend = () => observer.complete();
    utterance.onerror = (err) => observer.error(err);

    // start the synthesis
    speechSynthesis.speak(utterance);

    // teardown: stop the synthesis on unsubscribe
    return () => {
      speechSynthesis.cancel();
    };
  });
}
this is a shortened version, see sources for more

When the utterance ends, the Observable completes, which lets us chain syntheses one after another. Also, if we unsubscribe from the Observable, the synthesis is stopped.

I've actually decided to publish this Observable wrapper as an npm package. There's a link in the footer 👇!

Now we can safely compose our phrases and be notified when they end:

concat(speak('Hello!'), speak('Hi!'))
  .subscribe({ complete(){ console.log('done'); } });

Try this code online at https://stackblitz.com/edit/rxjs-tts?file=index.ts

And to integrate the Text-to-Speech back into our Animation component:

from(frames).pipe(
  concatMap(frame => {
    // concat all phrases into a chain
    const phrases$ = concat(
        ...frame.phrases.map(text => speak(text))
    );

    // we'll wait for the phrases to end
    // even if the duration is shorter
    const duration$ = merge(
        phrases$,
        timer(frame.duration)
    );

    // to acknowledge the duration we need to merge it
    // while ignoring its values
    return merge(
        of(<img src={frame.image} />),
        duration$.pipe(ignoreElements())
    );
  })
)

That's it! Now our Emoji can walk and talk!
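To make the timing rule concrete (a frame waits for its phrases even if its duration is shorter), here is a rough plain-JS model. `frameTime` is a hypothetical name and the speaking times are made-up numbers, since real speech length depends on the voice and the browser.

```javascript
// Hypothetical model of one frame's on-screen time in milliseconds.
// `speechTimes` holds made-up speaking times, one per phrase.
function frameTime(duration, speechTimes) {
  // phrases are chained one after another, so their times add up
  const speaking = speechTimes.reduce((sum, ms) => sum + ms, 0);
  // the frame ends only when both the speech and the timer are done
  return Math.max(duration, speaking);
}

console.log(frameTime(1000, [400]));      // 1000: the timer is longer
console.log(frameTime(1000, [900, 700])); // 1600: the speech is longer
console.log(frameTime(1000, []));         // 1000: a silent frame
```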

Turn the volume up and try this "Dancing" animation

a recorded version of "Dancing"

And surely try creating your own 🙂


It was pretty simple, huh?

But there was a hidden catch: previously the web app was hosted on GitHub Pages and users shared their animations as downloaded GIFs. But a GIF cannot contain sound, you know... so I needed another way for users to share animations.

In the next article I'll share details on how I migrated the app from create-react-app to the Next.js/Vercel platform and added MongoDB to it.

Have a question or idea? Please, share your thoughts in the comments!

Thanks for reading this and see you next time!




