Building a Real-time Speech-to-text Web App with Web Speech API


Happy New Year, everyone! In this short tutorial, we will build a simple yet useful real-time speech-to-text web app using the Web Speech API. Feature-wise, it will be straightforward: click a button to start recording, and your speech will be converted to text and displayed on the screen in real time. We'll also play with voice commands: saying "stop recording" will halt the recording. Sound fun? Okay, let's get into it. 😊

Web Speech API Overview

The Web Speech API is a browser technology that enables developers to integrate speech recognition and synthesis capabilities into web applications. It opens up possibilities for creating hands-free and voice-controlled features, enhancing accessibility and user experience.

Some use cases for the Web Speech API include voice commands, voice-driven interfaces, transcription services, and more.
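Because browser support varies (Chrome still exposes recognition under a `webkit` prefix), it's worth feature-detecting before relying on either capability. Here's a minimal sketch; the `speechSupport` helper name is my own, not part of the API, and it takes the global object as a parameter so the logic can be exercised outside a browser:

```javascript
// Reports which Web Speech features a given global object exposes.
// Pass `window` in the browser; passing a plain object makes it testable.
function speechSupport(globalObj) {
  return {
    // Chrome and other Blink browsers ship recognition with a webkit prefix.
    recognition: Boolean(
      globalObj.SpeechRecognition || globalObj.webkitSpeechRecognition
    ),
    synthesis: Boolean(globalObj.speechSynthesis),
  };
}

// In the browser:
// const { recognition, synthesis } = speechSupport(window);
```

If `recognition` comes back `false`, you can hide the record button and show a fallback message instead of failing silently.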

Let's Get Started

Now, let's dive into building our real-time speech-to-text web app. I'm going to use Vite to scaffold the project, but feel free to use any build tool of your choice, or none at all, for this mini demo project.

  1. Create a new Vite project:
   npm create vite@latest
  2. Choose "Vanilla" on the next screen and "JavaScript" on the following one. Use the arrow keys on your keyboard to navigate up and down.

HTML Structure

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <link rel="icon" type="image/svg+xml" href="/vite.svg" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <script type="module" src="/main.js"></script>
    <title>Real-time Speech to Text App</title>
  </head>
  <body>
    <div class="container">
      <h1>Real-time STT App</h1>

      <div class="btn-wrapper">
        <button id="startBtn" class="btn-start">
          <svg viewBox="0 0 100 100" class="hidden">
            <!-- Outer circle -->
            <circle
              cx="50"
              cy="50"
              r="40"
              stroke="#ccc"
              stroke-width="5"
              fill="none"
            />

            <!-- Inner circle indicating recording -->
            <circle
              cx="50"
              cy="50"
              r="30"
              stroke="#ccc"
              stroke-width="5"
              fill="none"
            >
              <animate
                attributeName="r"
                values="30; 25; 30"
                dur="1.5s"
                repeatCount="indefinite"
              />
            </circle>

            <!-- Record icon in the center -->
            <circle cx="50" cy="50" r="5" fill="#ccc" />
          </svg>

          <span> Start Recording </span>
        </button>
        <button id="stopBtn" class="btn-stop" disabled>Stop Recording</button>
      </div>

      <div id="result" class="result"></div>
    </div>
  </body>
</html>

CSS Styling

:root {
  font-family: Inter, system-ui, Avenir, Helvetica, Arial, sans-serif;
  line-height: 1.5;
  font-weight: 400;

  font-synthesis: none;
  text-rendering: optimizeLegibility;
  -webkit-font-smoothing: antialiased;
  -moz-osx-font-smoothing: grayscale;
}

* {
  margin: 0;
  padding: 0;
  box-sizing: border-box;
}

body {
  background: radial-gradient(
      circle at 100%,
      rgba(3, 6, 21, 0.9) 15%,
      rgba(189, 205, 226, 0.5) 5%,
      rgba(7, 9, 22, 0.9) 15%
    ),
    url('/chevron.png') center/cover;

  height: 100vh;
  padding: 40px 0;
}

.container {
  max-width: 1100px;
  margin: 0 auto;
  display: flex;
  flex-direction: column;
  align-items: center;
  padding: 0 15px;
}

h1 {
  color: #fff;
  font-size: 1.5rem;
  text-transform: uppercase;
}

.btn-wrapper {
  margin-top: 20px;
  display: flex;
  flex-wrap: wrap;
  justify-content: center;
  align-items: center;
  gap: 10px;
}

button {
  display: flex;
  align-items: center;
  column-gap: 5px;
  border: none;
  cursor: pointer;
  padding: 12px 24px;
  border-radius: 3px;
  font-weight: 600;
  box-shadow: 0 0 10px rgba(0, 0, 0, 0.3);
  transition: opacity 400ms ease-in-out;
}

button:disabled {
  opacity: 0.47;
  cursor: default;
}

button:hover:not(:disabled) {
  opacity: 0.9;
}

button > svg {
  height: 1rem;
}

.btn-start {
  background-color: #ff2c4f;
  color: #fff;
}

.btn-stop {
  background-color: rgb(7, 2, 44);
  color: #fff;
}

.result {
  background-color: #fff;
  width: 100%;
  min-height: 200px;
  padding: 10px;
  border-radius: 3px;
  margin-top: 20px;
  box-shadow: 0 0 10px rgba(0, 0, 0, 0.3);
  text-transform: capitalize;
}

.result:empty {
  display: none;
}

.hidden {
  display: none !important;
}

@media screen and (min-width: 768px) {
  h1 {
    font-size: 3.125rem;
    text-transform: capitalize;
  }

  .container {
    padding: 0 30px;
  }

  .result {
    padding: 15px;
  }
}

JavaScript Implementation

const resultElement = document.getElementById('result');
const startBtn = document.getElementById('startBtn');
const animatedSvg = startBtn.querySelector('svg');
const stopBtn = document.getElementById('stopBtn');

startBtn.addEventListener('click', startRecording);
stopBtn.addEventListener('click', stopRecording);

const SpeechRecognition =
  window.SpeechRecognition || window.webkitSpeechRecognition;
let recognition = null;

if (SpeechRecognition) {
  recognition = new SpeechRecognition();
  recognition.continuous = true;
  recognition.interimResults = true;
  recognition.lang = 'en-US';

  recognition.onstart = () => {
    startBtn.disabled = true;
    stopBtn.disabled = false;
    animatedSvg.classList.remove('hidden');
    console.log('Recording started');
  };

  recognition.onresult = function (event) {
    let result = '';

    // Rebuild the entire transcript (final + interim) on every event.
    // Looping from 0 rather than event.resultIndex keeps phrases that
    // were finalized in earlier events from dropping off the screen.
    for (let i = 0; i < event.results.length; i++) {
      result += event.results[i][0].transcript;
      if (event.results[i].isFinal) {
        result += ' ';
      }
    }

    resultElement.innerText = result;

    if (result.toLowerCase().includes('stop recording')) {
      resultElement.innerText = result.replace(/stop recording/gi, '');
      stopRecording();
    }
  };

  recognition.onerror = function (event) {
    startBtn.disabled = false;
    stopBtn.disabled = true;
    console.error('Speech recognition error:', event.error);
  };

  recognition.onend = function () {
    startBtn.disabled = false;
    stopBtn.disabled = true;
    animatedSvg.classList.add('hidden');
    console.log('Speech recognition ended');
  };
} else {
  console.error('Speech recognition not supported');
}

function startRecording() {
  if (!recognition) return; // no-op when the API is unsupported
  resultElement.innerText = '';
  recognition.start();
}

function stopRecording() {
  if (recognition) {
    recognition.stop();
  }
}
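The "stop recording" voice command check inside `onresult` can also be factored into a small pure function, which keeps the detection logic easy to unit-test. A sketch, assuming the same case-insensitive matching as the inline check above (the `extractStopCommand` name is mine):

```javascript
// Detects the "stop recording" voice command in a transcript and
// returns the transcript with the command removed, plus a flag
// indicating whether the command was heard.
function extractStopCommand(transcript) {
  const heard = /stop recording/i.test(transcript);
  return {
    heard,
    text: transcript
      .replace(/stop recording/gi, '')
      .replace(/\s{2,}/g, ' ') // collapse the gap left by the removal
      .trim(),
  };
}
```

Inside `onresult`, that would look like `const { heard, text } = extractStopCommand(result); resultElement.innerText = text; if (heard) stopRecording();`.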

Conclusion

This simple web app uses the Web Speech API to convert spoken words into text in real time. Users can start and stop recording with the provided buttons, or by saying "stop recording". Customize the design and functionality further based on your project requirements.

Final demo: https://stt.nixx.dev

Feel free to explore the complete code on the GitHub repository.

Now, you have a basic understanding of how to create a real-time speech-to-text web app using the Web Speech API. Experiment with additional features and enhancements to make it even more versatile and user-friendly. 😊 🙏
