Rishabh Tatiraju

Posted on May 30, 2020

Speech-to-text and Text-to-speech with Android

#android #kotlin

Have you ever wondered how Google's speech search works, or thought about building an ebook narration app? At first glance it might seem like a complex piece of technology. While it is complicated to implement on your own, Android (via Google Services) thankfully has built-in speech-to-text and text-to-speech APIs which make it extremely easy to set up these features.

How does this work?

For Speech-to-text, Android provides an Intent-based API which launches Google's Speech Recognition service and returns the text result to you. There is a catch though - the device needs the Google Search app for the service to work.

The Text-to-speech API, unlike Speech Recognition, is available without Google Services, and can be found in the android.speech.tts package.

Source code

You can find the source of this tutorial on GitHub.

Let's develop!

Fire up Android Studio and create a project with a Blank Activity.

User interface

The user interface is going to be simple - a LinearLayout as the root view group, inside which there will be a Button that launches the Speech Recognition API, an EditText that shows the Speech Recognition output and also serves as input to the Text-to-speech functionality, and another Button to trigger Text-to-speech output.

The resultant XML file is as follows:



<?xml version="1.0" encoding="utf-8"?>
<LinearLayout xmlns:android="http://schemas.android.com/apk/res/android"
    xmlns:tools="http://schemas.android.com/tools"
    android:layout_width="match_parent"
    android:layout_height="match_parent"
    android:gravity="center"
    android:orientation="vertical"
    android:padding="24dp"
    tools:context=".MainActivity">

    <Button
        android:id="@+id/btn_stt"
        android:layout_width="wrap_content"
        android:layout_height="wrap_content"
        android:text="Speak" />

    <EditText
        android:id="@+id/et_text_input"
        android:layout_width="match_parent"
        android:layout_height="0dp"
        android:layout_marginTop="24dp"
        android:layout_marginBottom="24dp"
        android:layout_weight="1"
        android:gravity="center"
        android:hint="Text from STT or for TTS goes here." />

    <Button
        android:id="@+id/btn_tts"
        android:layout_width="wrap_content"
        android:layout_height="wrap_content"
        android:text="Listen" />

</LinearLayout>

Setting up speech recognition

The Speech Recognition API comes bundled with the Google Search app, and can be launched using an Intent. The result of this Intent holds the recognized text, which can be extracted from the result intent in onActivityResult.

All the code beyond here is in Kotlin.

Firstly, let's define our request code constant.



    companion object {
        private const val REQUEST_CODE_STT = 1
    }

Then, we'll attach an OnClickListener to our button, in which we will construct and launch the Speech Recognition Intent.



    btn_stt.setOnClickListener {
        // Create the Intent with the speech recognition action.
        val sttIntent = Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH)
        // The language model defines the purpose; there are special models for other use cases, like search.
        sttIntent.putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL, RecognizerIntent.LANGUAGE_MODEL_FREE_FORM)
        // Recognition language as an IETF language tag (for example "en-US"); toLanguageTag() needs API 21+.
        sttIntent.putExtra(RecognizerIntent.EXTRA_LANGUAGE, Locale.getDefault().toLanguageTag())
        // Text that shows up on the speech input prompt.
        sttIntent.putExtra(RecognizerIntent.EXTRA_PROMPT, "Speak now!")
        try {
            // Start the intent for a result, and pass in our request code.
            startActivityForResult(sttIntent, REQUEST_CODE_STT)
        } catch (e: ActivityNotFoundException) {
            // Handle the case where no speech recognition activity is available.
            e.printStackTrace()
            Toast.makeText(this, "Your device does not support STT.", Toast.LENGTH_LONG).show()
        }
    }

The above code will launch the Speech Recognition API. But how do we get the result? We'll override the activity's onActivityResult and get the recognized text.



    override fun onActivityResult(requestCode: Int, resultCode: Int, data: Intent?) {
        super.onActivityResult(requestCode, resultCode, data)
        when (requestCode) {
            // Handle the result for our request code.
            REQUEST_CODE_STT -> {
                // Safety checks to ensure data is available.
                if (resultCode == Activity.RESULT_OK && data != null) {
                    // Retrieve the result array.
                    val result = data.getStringArrayListExtra(RecognizerIntent.EXTRA_RESULTS)
                    // Ensure result array is not null or empty to avoid errors.
                    if (!result.isNullOrEmpty()) {
                        // Recognized text is in the first position.
                        val recognizedText = result[0]
                        // Do what you want with the recognized text.
                        et_text_input.setText(recognizedText)
                    }
                }
            }
        }
    }
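The result list can hold more than one candidate transcription, and the recognizer may also attach a parallel array of confidence scores under RecognizerIntent.EXTRA_CONFIDENCE_SCORES. As a minimal sketch (the helper name pickBestCandidate is made up for illustration, and it assumes you have already pulled both extras out of the result intent), choosing a candidate could look like this:

```kotlin
/**
 * Hypothetical helper: picks the most confident transcription.
 * `candidates` would come from RecognizerIntent.EXTRA_RESULTS and `scores`
 * from RecognizerIntent.EXTRA_CONFIDENCE_SCORES (which may be absent).
 */
fun pickBestCandidate(candidates: List<String>, scores: FloatArray?): String? {
    if (candidates.isEmpty()) return null
    // Without usable scores, fall back to the first candidate.
    if (scores == null || scores.size != candidates.size) return candidates[0]
    // Find the index of the highest score; ties keep the earliest candidate.
    var best = 0
    for (i in scores.indices) {
        if (scores[i] > scores[best]) best = i
    }
    return candidates[best]
}
```

Inside onActivityResult, the scores could be read with data.getFloatArrayExtra(RecognizerIntent.EXTRA_CONFIDENCE_SCORES) and passed in together with the result list. Since the array is optional, falling back to the first entry keeps the behaviour of the code above.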

At this point, if you run your code, you will be able to use Speech Recognition.

Setting up Text-to-speech

Unlike the Speech Recognition API, Text-to-speech has its own class and doesn't run on Intents. We'll start off by creating a TextToSpeech object. The TextToSpeech class constructor expects a Context and an OnInitListener.



    private val textToSpeechEngine: TextToSpeech by lazy {
        // Pass in the context and the listener. Because of `lazy`, the engine is only
        // created on first access, and its initialization completes asynchronously.
        TextToSpeech(this,
            TextToSpeech.OnInitListener { status ->
                // Set our locale only if initialization succeeded.
                if (status == TextToSpeech.SUCCESS) {
                    textToSpeechEngine.language = Locale.UK
                }
            })
    }
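Setting the locale can fail quietly if the engine has no voice data for it. A hedged sketch of checking the value returned by setLanguage follows; the function name createTtsEngine and the onReady callback are illustrative and not part of the tutorial's source:

```kotlin
import android.content.Context
import android.speech.tts.TextToSpeech
import java.util.Locale

// Illustrative factory: creates the engine and reports whether `locale` is usable.
fun createTtsEngine(
    context: Context,
    locale: Locale,
    onReady: (Boolean) -> Unit
): TextToSpeech {
    lateinit var engine: TextToSpeech
    engine = TextToSpeech(context) { status ->
        // Only touch the engine when initialization succeeded.
        val usable = status == TextToSpeech.SUCCESS && run {
            val result = engine.setLanguage(locale)
            result != TextToSpeech.LANG_MISSING_DATA &&
                result != TextToSpeech.LANG_NOT_SUPPORTED
        }
        onReady(usable)
    }
    return engine
}
```

A caller could, for example, keep the Listen button disabled until onReady reports true, so that speak is never called on an engine that is not ready yet.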

Then, we'll set an OnClickListener to our TTS button and call the text-to-speech API on our input text.



btn_tts.setOnClickListener {
    // Get the text to be converted to speech from our EditText.
    val text = et_text_input.text.toString().trim()
    // Speak only if the user has entered some text.
    if (text.isNotEmpty()) {
        // Lollipop and above requires an additional utterance ID to be passed.
        if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.LOLLIPOP) {
            // Call the Lollipop+ function.
            textToSpeechEngine.speak(text, TextToSpeech.QUEUE_FLUSH, null, "tts1")
        } else {
            // Call the legacy function.
            textToSpeechEngine.speak(text, TextToSpeech.QUEUE_FLUSH, null)
        }
    } else {
        Toast.makeText(this, "Text cannot be empty", Toast.LENGTH_LONG).show()
    }
}
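For longer passages, such as the ebook narration idea from the introduction, keep in mind that the engine limits how much text a single speak call accepts (see TextToSpeech.getMaxSpeechInputLength()). One possible approach, sketched here with a hypothetical chunkForSpeech helper, is to split the text at whitespace and queue the pieces one after another:

```kotlin
// Hypothetical helper: splits text into pieces of at most maxLength characters,
// breaking at whitespace where possible and hard-cutting words that are too long.
fun chunkForSpeech(text: String, maxLength: Int): List<String> {
    require(maxLength > 0) { "maxLength must be positive" }
    val chunks = mutableListOf<String>()
    var rest = text.trim()
    while (rest.length > maxLength) {
        // Look for the last whitespace within the first maxLength + 1 characters.
        val window = rest.substring(0, maxLength + 1)
        val cut = window.indexOfLast { it.isWhitespace() }
        // No whitespace found: fall back to a hard cut at maxLength.
        val end = if (cut > 0) cut else maxLength
        chunks += rest.substring(0, end).trimEnd()
        rest = rest.substring(end).trimStart()
    }
    if (rest.isNotEmpty()) chunks += rest
    return chunks
}
```

The first chunk could then be spoken with TextToSpeech.QUEUE_FLUSH and the remaining ones with TextToSpeech.QUEUE_ADD, so they play back in order.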

As a safety measure and to prevent memory leaks, we must override the onPause and onDestroy methods and stop or shut down the TextToSpeech object as appropriate.



override fun onPause() {
    textToSpeechEngine.stop()
    super.onPause()
}

override fun onDestroy() {
    textToSpeechEngine.shutdown()
    super.onDestroy()
}

And that's it. Give it a try!

Closing Thoughts

With the standard APIs, Speech Recognition (or Speech-to-text) and Text-to-speech in Android are extremely easy to implement. While this might suffice for most use cases, some advanced use cases would require more sophisticated third-party APIs or a custom implementation in your backend. We'll probably cover that sometime later.

Until then, keep coding, and as always do let me know if you have any questions in the comments section!

Top comments (5)

samartinell • Jan 16 '21

Thx for the article, but I still have some questions:
How do I make speech recognition return the recognized words in a Set?
How do I make voice recognition permanent until the user turns it off?
And how do I remove the Google activity when it recognizes speech? I want to show all the words as text inside my app

Rishabh Tatiraju • Jan 17 '21

Hey @samartinell

By default, speech recognition returns a string; you could do a String.split() call on it with a regex that identifies words as per your preference. This will give you a list of words.
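A rough sketch of that split, assuming words are runs of letters, digits, and apostrophes (wordsOf is just an illustrative name, and lowercase() needs Kotlin 1.5 or newer):

```kotlin
// Hypothetical helper: lowercases the recognized text and returns its distinct words.
fun wordsOf(recognizedText: String): Set<String> =
    recognizedText
        .lowercase()
        // Split on every run of characters that cannot be part of a word.
        .split(Regex("[^\\p{L}\\p{N}']+"))
        // Leading or trailing separators leave empty strings behind; drop them.
        .filter { it.isNotEmpty() }
        .toSet()
```

Calling toSet() at the end removes duplicates, which is what turns the list of words into a Set.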

For permanent voice recognition, you will have to play with Recognition. The trick is to listen to when the speech recognition ends and then restart it. If you don't want the Google Dialog while recognizing speech, and also keep an always on speech recognition feature, check out this answer on StackOverflow, might help: stackoverflow.com/a/45833487
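To give an idea of the restart trick, here is a minimal sketch built on the platform's SpeechRecognizer and RecognitionListener classes. It is not taken from the linked answer. The class name ContinuousRecognizer and the onText callback are made up for illustration, and the sketch assumes the RECORD_AUDIO permission has already been granted and that all calls happen on the main thread:

```kotlin
import android.content.Context
import android.content.Intent
import android.os.Bundle
import android.speech.RecognitionListener
import android.speech.RecognizerIntent
import android.speech.SpeechRecognizer

// Illustrative class: restarts listening after every result or error,
// so recognition appears continuous and no Google dialog is shown.
class ContinuousRecognizer(
    context: Context,
    private val onText: (String) -> Unit
) : RecognitionListener {

    private val recognizer = SpeechRecognizer.createSpeechRecognizer(context)
    private val intent = Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH).putExtra(
        RecognizerIntent.EXTRA_LANGUAGE_MODEL,
        RecognizerIntent.LANGUAGE_MODEL_FREE_FORM
    )
    private var active = false

    fun start() {
        active = true
        recognizer.setRecognitionListener(this)
        recognizer.startListening(intent)
    }

    // Destroys the recognizer; create a new instance to listen again.
    fun stop() {
        active = false
        recognizer.destroy()
    }

    override fun onResults(results: Bundle?) {
        // Deliver the top candidate, then start the next listening session.
        results?.getStringArrayList(SpeechRecognizer.RESULTS_RECOGNITION)
            ?.firstOrNull()
            ?.let(onText)
        if (active) recognizer.startListening(intent)
    }

    override fun onError(error: Int) {
        // Give up when the permission is missing; otherwise simply try again.
        if (error == SpeechRecognizer.ERROR_INSUFFICIENT_PERMISSIONS) active = false
        if (active) recognizer.startListening(intent)
    }

    // Callbacks this sketch does not need.
    override fun onReadyForSpeech(params: Bundle?) {}
    override fun onBeginningOfSpeech() {}
    override fun onRmsChanged(rmsdB: Float) {}
    override fun onBufferReceived(buffer: ByteArray?) {}
    override fun onEndOfSpeech() {}
    override fun onPartialResults(partialResults: Bundle?) {}
    override fun onEvent(eventType: Int, params: Bundle?) {}
}
```

A real app would probably want to inspect more error codes and add a short delay before restarting, since blindly restarting on every error can loop.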

Do let the community here know if it worked for you!

samartinell • Jan 18 '21

Thank you very much! I've already found how to recognize speech without the Google dialog and how to split a string into a set. Now I'm gonna try your way of doing permanent recognition