
Google 'Speech' APIs

Google Cloud offers two speech APIs among its AI & Machine Learning products:

  1. Cloud Text-To-Speech
  2. Cloud Speech-To-Text

From the API docs:

'Google Cloud Text-to-Speech converts text into human-like speech in more than 180 voices across 30+ languages and variants. It applies groundbreaking research in speech synthesis (WaveNet) and Google's powerful neural networks to deliver high-fidelity audio.'

'Google Cloud Speech-to-Text enables developers to convert audio to text by applying powerful neural network models in an easy-to-use API.'

Who remembers this?

We've obviously progressed.
There's Siri and Alexa, and APIs like these are available for use right in our applications.

Google's API docs are also very user-friendly compared to other API docs I've used. (Let me know your opinions on this below.)

In the docs*, you can test each API directly with an example JSON body sent to the provided endpoint. Here's a Text-to-Speech request body:


```json
{
  "audioConfig": {
    "audioEncoding": "LINEAR16",
    "pitch": 0,
    "speakingRate": 1
  },
  "input": {
    "text": "Nay, answer me: stand, and unfold yourself."
  },
  "voice": {
    "languageCode": "en-GB",
    "name": "en-GB-Wavenet-A"
  }
}
```
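
To make that concrete, here's a rough sketch of sending that exact body from Python with the requests library. This isn't from Google's docs: the API-key-in-an-environment-variable setup (GOOGLE_API_KEY) is my own placeholder choice, and you'd need a key that's actually allowed to call the API (service-account auth is the other common option).

```python
import os
import requests

# Placeholder setup: an API key stored in an env var I named GOOGLE_API_KEY.
API_KEY = os.environ["GOOGLE_API_KEY"]
URL = f"https://texttospeech.googleapis.com/v1/text:synthesize?key={API_KEY}"

# The same body as the example above.
body = {
    "audioConfig": {"audioEncoding": "LINEAR16", "pitch": 0, "speakingRate": 1},
    "input": {"text": "Nay, answer me: stand, and unfold yourself."},
    "voice": {"languageCode": "en-GB", "name": "en-GB-Wavenet-A"},
}

response = requests.post(URL, json=body)
response.raise_for_status()

# The synthesized audio comes back as a base64-encoded string in "audioContent".
# (More on decoding it into something playable further down.)
audio_b64 = response.json()["audioContent"]
print(audio_b64[:60], "...")
```

And here's the Speech-to-Text request body: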


```json
{
  "audio": {
    "content": "/* Your audio */"
  },
  "config": {
    "enableAutomaticPunctuation": true,
    "encoding": "LINEAR16",
    "languageCode": "en-US",
    "model": "default"
  }
}
```
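
A similar sketch works for the Speech-to-Text body, again with requests and the same assumed API key. The recording.wav file name is a placeholder, and whatever file you send has to actually match the encoding and language in the config:

```python
import base64
import os
import requests

API_KEY = os.environ["GOOGLE_API_KEY"]  # same assumed API key as above
URL = f"https://speech.googleapis.com/v1/speech:recognize?key={API_KEY}"

# Read a local recording and turn its binary data into a base64 text string,
# because JSON can't carry raw bytes. "recording.wav" is a placeholder name.
with open("recording.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

body = {
    "audio": {"content": audio_b64},
    "config": {
        "enableAutomaticPunctuation": True,
        "encoding": "LINEAR16",
        "languageCode": "en-US",
        "model": "default",
    },
}

response = requests.post(URL, json=body)
response.raise_for_status()

# Each result holds one or more alternatives; the first is the most likely transcript.
for result in response.json().get("results", []):
    print(result["alternatives"][0]["transcript"])
```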

Google refers to the Text-to-Speech transformation as speech synthesis: 'The process of translating text input into audio data is called synthesis and the output of synthesis is called synthetic speech.' The reverse transformation, in Speech-to-Text, is called recognition.

These are the three methods Google describes for performing recognition with Speech-to-Text:

Synchronous Recognition (REST and gRPC) sends audio data to the Speech-to-Text API, performs recognition on that data, and returns results after all audio has been processed. Synchronous recognition requests are limited to audio data of 1 minute or less in duration.

Asynchronous Recognition (REST and gRPC) sends audio data to the Speech-to-Text API and initiates a Long Running Operation. Using this operation, you can periodically poll for recognition results. Use asynchronous requests for audio data of any duration up to 480 minutes.

Streaming Recognition (gRPC only) performs recognition on audio data provided within a gRPC bi-directional stream. Streaming requests are designed for real-time recognition purposes, such as capturing live audio from a microphone. Streaming recognition provides interim results while audio is being captured, allowing results to appear, for example, while a user is still speaking.
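
As a sketch of the asynchronous flavor, here's roughly what it looks like with Google's official Python client library (google-cloud-speech) instead of raw REST. It assumes you've set up application default credentials, and the Cloud Storage URI is a placeholder:

```python
from google.cloud import speech  # pip install google-cloud-speech

client = speech.SpeechClient()

# Asynchronous recognition takes a reference to audio in Cloud Storage
# (placeholder URI) and returns a long-running operation you can poll.
audio = speech.RecognitionAudio(uri="gs://your-bucket/your-long-recording.wav")
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)

operation = client.long_running_recognize(config=config, audio=audio)
print("Waiting for the operation to finish...")
response = operation.result(timeout=600)  # blocks here; you could poll instead

for result in response.results:
    print(result.alternatives[0].transcript)
```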

(some notes on encoding)

The Text-to-Speech API returns its generated audio as a base64-encoded string. That string must be decoded back into a playable audio file, something like an MP3.

When using the Speech-to-Text API's example upload, the audio conversion is handled for you in a GUI made by Google, but when sending an API request from code, you'll have to handle the encoding of the audio you receive from the user yourself. See this note and link on encoding: 'Audio data is binary data, so you will need to convert such binary data into text using Base64 encoding.'
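
As a minimal sketch of those two Base64 conversions in Python (the file name is a placeholder), a round trip looks like this:

```python
import base64

# Round trip: raw audio bytes -> base64 text (what you put in a Speech-to-Text
# request), then base64 text -> raw bytes (what you do with Text-to-Speech's
# "audioContent" before writing it to a playable file).
raw_bytes = open("recording.wav", "rb").read()  # placeholder file name

as_text = base64.b64encode(raw_bytes).decode("utf-8")  # binary -> text
back_to_bytes = base64.b64decode(as_text)               # text -> binary

assert back_to_bytes == raw_bytes  # nothing is lost in the conversion
```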

Tbh, I have a surface-level knowledge of encoding formats, so I'll link the Wikipedia article as a jumping-off point for everyone so I don't stick my foot in my ear here.

Watch out for your encoding types. (And leave technical insight in the comments pls.)

What's interesting to me is that:

  1. this is an awesome pair of tools that (look!) we have access to, and
  2. the use cases are great. We already see them in Google Translate and the like, but they could also be used to make our apps more accessible and to build an OS we can fall in love with.

What are your ideas? What have you used this for? Where else have you seen it? What are your experiences?

*Text-to-Speech docs
*Speech-to-Text docs
Wikipedia: audio coding formats
