Have you ever wondered how Google's speech search works, or thought of building an ebook narration app? At first glance, it might seem like a complex piece of technology. While it is complicated to implement on your own, Android (via Google Services) thankfully has built-in speech-to-text and text-to-speech APIs that make it extremely easy to set up these features.
How does this work?
For speech-to-text, Android provides an Intent-based API that launches Google's Speech Recognition service and returns the text result to you. There is a catch, though: the device needs the Google Search app for the service to work.
The text-to-speech API, unlike Speech Recognition, is available without Google Services, and can be found in the android.speech.tts package.
Source code
You can find the source of this tutorial on GitHub.
Let's develop!
Fire up Android Studio and create a project with a Blank Activity.
User interface
The user interface is going to be simple: a LinearLayout as the root view group, inside which there will be a Button that launches the Speech Recognition API, an EditText that shows the Speech Recognition output and also serves as input to the text-to-speech functionality, and another Button to trigger the text-to-speech output.
The resultant XML file is as follows:
<?xml version="1.0" encoding="utf-8"?>
<LinearLayout xmlns:android="http://schemas.android.com/apk/res/android"
    xmlns:tools="http://schemas.android.com/tools"
    android:layout_width="match_parent"
    android:layout_height="match_parent"
    android:gravity="center"
    android:orientation="vertical"
    android:padding="24dp"
    tools:context=".MainActivity">
    <Button
        android:id="@+id/btn_stt"
        android:layout_width="wrap_content"
        android:layout_height="wrap_content"
        android:text="Speak" />
    <EditText
        android:id="@+id/et_text_input"
        android:layout_width="match_parent"
        android:layout_height="0dp"
        android:layout_marginTop="24dp"
        android:layout_marginBottom="24dp"
        android:layout_weight="1"
        android:gravity="center"
        android:hint="Text from STT or for TTS goes here." />
    <Button
        android:id="@+id/btn_tts"
        android:layout_width="wrap_content"
        android:layout_height="wrap_content"
        android:text="Listen" />
</LinearLayout>
Setting up speech recognition
The Speech Recognition API comes bundled with the Google Search app, and can be launched using an Intent. The result of this Intent holds the recognized text, which can be extracted from the result Intent in onActivityResult.
All the code beyond here is in Kotlin.
Firstly, let's define our request code constant.
companion object {
    private const val REQUEST_CODE_STT = 1
}
Then, we'll attach an OnClickListener to our button, in which we will construct and launch the Speech Recognition Intent.
btn_stt.setOnClickListener {
    // Create the Intent with the speech recognition action.
    val sttIntent = Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH)
    // The language model defines the purpose; there are special models for other use cases, like search.
    sttIntent.putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL, RecognizerIntent.LANGUAGE_MODEL_FREE_FORM)
    // Set the recognition language. EXTRA_LANGUAGE expects a language tag string such as "en-US",
    // so convert the Locale first (toLanguageTag() requires API 21+).
    sttIntent.putExtra(RecognizerIntent.EXTRA_LANGUAGE, Locale.getDefault().toLanguageTag())
    // Text that shows up on the speech input prompt.
    sttIntent.putExtra(RecognizerIntent.EXTRA_PROMPT, "Speak now!")
    try {
        // Start the Intent for a result, and pass in our request code.
        startActivityForResult(sttIntent, REQUEST_CODE_STT)
    } catch (e: ActivityNotFoundException) {
        // No activity can handle the Intent, so speech recognition is not available.
        e.printStackTrace()
        Toast.makeText(this, "Your device does not support STT.", Toast.LENGTH_LONG).show()
    }
}
The above code will launch the Speech Recognition API. But how do we get the result? We'll override the activity's onActivityResult and get the recognized text.
override fun onActivityResult(requestCode: Int, resultCode: Int, data: Intent?) {
    super.onActivityResult(requestCode, resultCode, data)
    when (requestCode) {
        // Handle the result for our request code.
        REQUEST_CODE_STT -> {
            // Safety checks to ensure data is available.
            if (resultCode == Activity.RESULT_OK && data != null) {
                // Retrieve the result array.
                val result = data.getStringArrayListExtra(RecognizerIntent.EXTRA_RESULTS)
                // Ensure result array is not null or empty to avoid errors.
                if (!result.isNullOrEmpty()) {
                    // Recognized text is in the first position.
                    val recognizedText = result[0]
                    // Do what you want with the recognized text.
                    et_text_input.setText(recognizedText)
                }
            }
        }
    }
}
At this point, if you run your code, you will be able to use Speech Recognition.
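The handler above simply takes the first entry of the result list. The result Intent may also carry a float array under RecognizerIntent.EXTRA_CONFIDENCE_SCORES, and a small helper can use it when it is present. The sketch below is only an illustration: pickBestCandidate is a hypothetical name, not part of the tutorial's source, and it falls back to the first candidate whenever the scores are missing or do not line up with the candidates.

```kotlin
// Hypothetical helper (not part of the tutorial's source code).
// candidates: the list read from RecognizerIntent.EXTRA_RESULTS
// scores: the optional array read from RecognizerIntent.EXTRA_CONFIDENCE_SCORES
fun pickBestCandidate(candidates: List<String>?, scores: FloatArray?): String? {
    if (candidates.isNullOrEmpty()) return null
    // Without usable scores, keep the tutorial's behaviour: take the first entry.
    if (scores == null || scores.size != candidates.size) return candidates[0]
    // Otherwise pick the candidate with the highest confidence score.
    val bestIndex = candidates.indices.maxByOrNull { scores[it] } ?: 0
    return candidates[bestIndex]
}
```

Inside onActivityResult, this could be called with data.getStringArrayListExtra(RecognizerIntent.EXTRA_RESULTS) and data.getFloatArrayExtra(RecognizerIntent.EXTRA_CONFIDENCE_SCORES).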
Setting up Text-to-speech
Unlike the Speech Recognition API, text-to-speech has its own class and doesn't run on Intents. We'll start off by creating a TextToSpeech object. The TextToSpeech class constructor expects a Context and an OnInitListener.
private val textToSpeechEngine: TextToSpeech by lazy {
    // Pass in context and the listener.
    TextToSpeech(this,
        TextToSpeech.OnInitListener { status ->
            // Set our locale only if init was successful.
            if (status == TextToSpeech.SUCCESS) {
                textToSpeechEngine.language = Locale.UK
            }
        })
}
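Setting the language can fail quietly if the engine has no voice data for the requested locale. TextToSpeech.setLanguage returns a status code, so one option is to check it before speaking. This is a minimal sketch, assuming you want a simple yes/no answer; applyLanguage is a hypothetical helper name and not something the tutorial's source defines.

```kotlin
import android.speech.tts.TextToSpeech
import java.util.Locale

// Hypothetical helper (not part of the tutorial's source code).
// Returns true if the engine accepted the locale, false if the voice data
// is missing (LANG_MISSING_DATA) or the language is not supported (LANG_NOT_SUPPORTED).
fun applyLanguage(engine: TextToSpeech, locale: Locale): Boolean {
    val result = engine.setLanguage(locale)
    // The "available" codes are LANG_AVAILABLE and above; the failure codes are negative.
    return result >= TextToSpeech.LANG_AVAILABLE
}
```

In the OnInitListener, a false return value could be used to show a message or to fall back to Locale.getDefault().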
Then, we'll set an OnClickListener to our TTS button and call the text-to-speech API on our input text.
btn_tts.setOnClickListener {
    // Get the text to be converted to speech from our EditText.
    val text = et_text_input.text.toString().trim()
    // Speak only if the user has entered some text.
    if (text.isNotEmpty()) {
        // Lollipop and above requires an additional utterance ID to be passed.
        if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.LOLLIPOP) {
            // Call the Lollipop+ function.
            textToSpeechEngine.speak(text, TextToSpeech.QUEUE_FLUSH, null, "tts1")
        } else {
            // Call the legacy function.
            textToSpeechEngine.speak(text, TextToSpeech.QUEUE_FLUSH, null)
        }
    } else {
        Toast.makeText(this, "Text cannot be empty", Toast.LENGTH_LONG).show()
    }
}
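For longer input, such as the ebook narration idea from the introduction, keep in mind that a single speak call has an input length limit, which TextToSpeech.getMaxSpeechInputLength() reports. One possible approach is to split the text into chunks at word boundaries and queue them one after another. The function below is a rough sketch under that assumption; splitForSpeech is a hypothetical helper, and it only splits on whitespace, not on sentence boundaries.

```kotlin
// Hypothetical helper (not part of the tutorial's source code).
// Splits text into chunks of at most maxLength characters, breaking at whitespace.
fun splitForSpeech(text: String, maxLength: Int): List<String> {
    require(maxLength > 0) { "maxLength must be positive" }
    val chunks = mutableListOf<String>()
    val current = StringBuilder()
    for (word in text.trim().split(Regex("\\s+"))) {
        if (word.isEmpty()) continue
        // A single word longer than the limit is cut into fixed-size pieces.
        val pieces = if (word.length > maxLength) word.chunked(maxLength) else listOf(word)
        for (piece in pieces) {
            // Length of the current chunk if this piece were appended with a space.
            val needed = if (current.isEmpty()) piece.length else current.length + 1 + piece.length
            if (needed > maxLength) {
                // The piece does not fit, so close the current chunk and start a new one.
                chunks.add(current.toString())
                current.setLength(0)
            }
            if (current.isNotEmpty()) current.append(' ')
            current.append(piece)
        }
    }
    if (current.isNotEmpty()) chunks.add(current.toString())
    return chunks
}
```

The first chunk could then be spoken with TextToSpeech.QUEUE_FLUSH and the remaining ones with TextToSpeech.QUEUE_ADD, so that they play in order.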
As a safety measure and to prevent memory leaks, we must override the onPause and onDestroy methods and stop or shut down the TextToSpeech object as appropriate.
override fun onPause() {
    textToSpeechEngine.stop()
    super.onPause()
}
override fun onDestroy() {
    textToSpeechEngine.shutdown()
    super.onDestroy()
}
And that's it. Give it a try!
Closing Thoughts
With the standard APIs, Speech Recognition (or speech-to-text) and text-to-speech in Android are extremely easy to implement. While this might suffice for most use cases, some advanced use cases would require more sophisticated third-party APIs or a custom implementation in your backend. We'll probably cover that sometime later.
Until then, keep coding, and as always do let me know if you have any questions in the comments section!
Top comments (5)
Thanks for the article, but I still have some questions:
How can I make speech recognition return the recognized words as a Set?
How can I keep voice recognition running permanently until the user turns it off?
And how can I remove the Google activity that appears while it recognizes speech? I want to show all the words as text inside my app.
Hey @samartinell
By default, speech recognition returns a string. You could call String.split() on it with a regex that identifies words as per your preference. This will give you a list of words.
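As a rough illustration of that split, assuming that "words" means runs of letters, digits, and apostrophes and that case should be ignored, it could look like this (wordsOf is just an illustrative name):

```kotlin
// Hypothetical helper: turns recognized text into a set of lower-case words.
fun wordsOf(recognizedText: String): Set<String> =
    recognizedText
        .lowercase()
        // Split on anything that is not a letter, a digit, or an apostrophe.
        .split(Regex("[^\\p{L}\\p{N}']+"))
        // Drop the empty strings left by leading or trailing separators.
        .filter { it.isNotEmpty() }
        .toSet()
```

Adjust the regex if your idea of a word differs, for example to keep hyphenated words together.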
For permanent voice recognition, you will have to play with Recognition. The trick is to listen to when the speech recognition ends and then restart it. If you don't want the Google Dialog while recognizing speech, and also keep an always on speech recognition feature, check out this answer on StackOverflow, might help: stackoverflow.com/a/45833487
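A very rough sketch of that restart idea, using SpeechRecognizer with a RecognitionListener instead of the Intent dialog, might look like the following. Treat it as a starting point rather than a tested solution: ContinuousRecognizer and its onText callback are hypothetical names, the app needs the RECORD_AUDIO permission, and the recognizer has to be created and used on the main thread.

```kotlin
import android.content.Context
import android.content.Intent
import android.os.Bundle
import android.speech.RecognitionListener
import android.speech.RecognizerIntent
import android.speech.SpeechRecognizer

// Hypothetical wrapper (not from the article). Create and use it on the main thread.
class ContinuousRecognizer(context: Context, private val onText: (String) -> Unit) {

    private val recognizer: SpeechRecognizer = SpeechRecognizer.createSpeechRecognizer(context)
    private val sttIntent = Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH).apply {
        putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL, RecognizerIntent.LANGUAGE_MODEL_FREE_FORM)
    }
    private var active = false

    init {
        recognizer.setRecognitionListener(object : RecognitionListener {
            override fun onResults(results: Bundle?) {
                // Forward the first candidate, then start listening again.
                results?.getStringArrayList(SpeechRecognizer.RESULTS_RECOGNITION)
                    ?.firstOrNull()
                    ?.let(onText)
                restart()
            }

            override fun onError(error: Int) {
                // Give up on a missing permission; otherwise (timeout, no match, ...) listen again.
                if (error == SpeechRecognizer.ERROR_INSUFFICIENT_PERMISSIONS) active = false
                restart()
            }

            // The remaining callbacks are not needed for this sketch.
            override fun onReadyForSpeech(params: Bundle?) {}
            override fun onBeginningOfSpeech() {}
            override fun onRmsChanged(rmsdB: Float) {}
            override fun onBufferReceived(buffer: ByteArray?) {}
            override fun onEndOfSpeech() {}
            override fun onPartialResults(partialResults: Bundle?) {}
            override fun onEvent(eventType: Int, params: Bundle?) {}
        })
    }

    fun start() { active = true; recognizer.startListening(sttIntent) }
    fun stop() { active = false; recognizer.stopListening() }
    fun release() { active = false; recognizer.destroy() }

    // Restart only while the user has not turned recognition off.
    private fun restart() { if (active) recognizer.startListening(sttIntent) }
}
```

Call start() when the user turns recognition on, stop() when they turn it off, and release() in onDestroy.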
Do let the community here know if it worked for you!
Thank you very much! I've already found out how to recognize speech without the Google dialog and how to split a string into a set. Now I'm going to try your approach to permanent recognition.
Thank you for the details. Is it possible to use it while recording videos with audio? With Android CameraX?
Can we add custom words to help it recognise words better?