Dilek Karasoy for Picovoice

Speech Recognition in Unity: Adding Voice Input

Voice input elevates the user experience in Unity significantly. However, adding speech recognition to Unity is not easy. The Unity Asset Store offers a few alternatives, but most run on only one or two platforms. For cross-platform applications there are even fewer options, and those that do support cross-platform builds rely on third-party cloud providers. Cloud computing comes at a cost and carries inherent limitations, such as unpredictable latency and the need for a constant network connection, which hinder the user experience. Picovoice overcomes these challenges by processing voice data on the device. That's why on day 33 we'll cover how to add speech recognition to Unity without compromising the user experience.

By the end of this tutorial, we'll be able to control a video player with voice commands like "Porcupine, skip ahead one minute."

In this tutorial, we'll build a hands-free video player for virtual reality applications, where using physical controllers is often inconvenient.

**Virtual Video Screen**
Unity's Render Texture receives the frames of a video and renders them as a texture. This means any surface that can receive a texture can be turned into a screen.

  1. Import a video into your Unity project.
  2. Drag it into your scene.
  3. Click on the Video Player object and change the Render Mode property in the Inspector to Render Texture.
  4. Right-click in your Project panel and select Create > Render Texture. Give this new object a name.
  5. Drag it into the “Target Texture” property of the Video Player.

Congratulations, you've created a video player that generates frames of your video and renders them to a texture. Now, let's create the screen itself.

  1. Create a new material with the shader type “Unlit/Texture”.
  2. Drag the render texture into the empty texture box.
  3. Create a new piece of 3D geometry in the scene to apply the material to.
  4. Drag the material onto this new object and hit the play button.

Right now you should be able to see your video playing on the surface object!
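The editor steps above can also be wired up in code. Here is a minimal sketch (the component and field names are illustrative, not taken from the tutorial's project):

```csharp
using UnityEngine;
using UnityEngine.Video;

// Creates a RenderTexture at runtime and wires the VideoPlayer to a surface.
public class ScreenSetup : MonoBehaviour
{
    public VideoPlayer videoPlayer;   // the Video Player object in the scene
    public Renderer screenRenderer;   // the 3D surface acting as the screen

    void Start()
    {
        var renderTexture = new RenderTexture(1920, 1080, 0);
        videoPlayer.renderMode = VideoRenderMode.RenderTexture;
        videoPlayer.targetTexture = renderTexture;
        screenRenderer.material.mainTexture = renderTexture;
        videoPlayer.Play();
    }
}
```

Doing the wiring in the editor, as described above, achieves the same result without any code.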

Getting your app to understand voice inputs

We'll use the Picovoice Platform Unity SDK, which combines the Porcupine Wake Word and Rhino Speech-to-Intent engines.

  1. Download the Picovoice Unity package.
  2. Import it into your Unity project.
  3. Download the pre-trained models: the "Porcupine" keyword from the Porcupine Wake Word repository and the Video Player context from the Rhino Speech-to-Intent repository. You can also train custom models on Picovoice Console.
  4. Sign up for the Picovoice Console for free and grab your AccessKey.
  5. Drop the Porcupine model (.ppn file) and Rhino model (.rhn file) into your project under the StreamingAssets folder.
  6. Create a script called VideoController.cs and attach it to the video screen. In this script, we’ll initialize a PicovoiceManager with the keyword and context files, as well as a callback for when Porcupine detects the wake word (OnWakeWordDetected) and a callback for when Rhino has finished an inference (OnInferenceResult).
using System;
using System.Collections.Generic;
using System.IO;
using UnityEngine;
using Pv.Unity;

public class VideoController : MonoBehaviour
{
    PicovoiceManager _picovoiceManager;

    void Start()
    {
        string accessKey = "..."; // your Picovoice AccessKey
        string keywordPath = Path.Combine(Application.streamingAssetsPath,
                                          "porcupine.ppn");
        string contextPath = Path.Combine(Application.streamingAssetsPath,
                                          "video_player.rhn");
        try
        {
            _picovoiceManager = PicovoiceManager.Create(
                accessKey,
                keywordPath,
                OnWakeWordDetected,
                contextPath,
                OnInferenceResult);
        }
        catch (Exception ex)
        {
            Debug.LogError("PicovoiceManager was unable to initialize: " + ex);
        }
    }

    private void OnWakeWordDetected()
    {
        // wake word detected!
    }

    private void OnInferenceResult(Inference inference)
    {
        if (inference.IsUnderstood)
        {
            string intent = inference.Intent;
            Dictionary<string, string> slots = inference.Slots;
            // interpret intent and slots
        }
    }
}


(Don't forget to add your own AccessKey and the paths to your models.)

If you've recorded audio in Unity before, you know it can be a challenge. PicovoiceManager handles audio capture automatically: simply call .Start() to begin capturing audio and .Stop() to cease it.
If you have a pre-existing audio pipeline, you can use the Picovoice class instead and pass audio frames to the speech recognition engine yourself.
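A minimal sketch of the capture lifecycle, assuming _picovoiceManager was created in Start() as shown above (tying capture to application focus is my choice here, not something the tutorial prescribes):

```csharp
using UnityEngine;
using Pv.Unity;

public class VideoControllerLifecycle : MonoBehaviour
{
    PicovoiceManager _picovoiceManager;

    void OnApplicationFocus(bool focus)
    {
        if (_picovoiceManager == null) return;
        if (focus)
            _picovoiceManager.Start();  // begin audio capture
        else
            _picovoiceManager.Stop();   // release the microphone
    }

    void OnDestroy()
    {
        _picovoiceManager?.Stop();
    }
}
```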

Integrating Voice Command Interface

  1. Handle wake word detection:
private void OnWakeWordDetected()
{
    // indicate that we're listening, e.g. by lighting up the screen border
    // (_border, picoBlue, and isListening are fields on VideoController)
    isListening = true;
    Debug.Log("Listening...");
    _border.material.SetColor("_EmissionColor", picoBlue * 0.5f);
}
  2. Handle the inference result:
private void OnInferenceResult(Inference inference)
{        
    if (inference.IsUnderstood)
    {
        PrintInference(inference);
        if (inference.Intent == "changeVideoState")
        {
            ChangeVideoState(inference.Slots);
        }
        else if (inference.Intent == "seek")
        {
            SeekVideo(inference.Slots);
        }
        else if (inference.Intent == "changeVolume")
        {
            ChangeVolume(inference.Slots);
        }
        else if (inference.Intent == "changePlaybackSpeed") 
        {
            ChangePlaybackSpeed(inference.Slots);
        }
        else if(inference.Intent == "help")
        {
            ToggleHelp(inference.Slots);
        }
    }
    else
    {
        Debug.Log("Didn't understand the command.\n");
        _notificationText.text = "Didn't understand the command";
        StartCoroutine(FadeNotification());
    }

    isListening = false;
    _border.material.SetColor("_EmissionColor", picoBlue * 0f);
}

The slots are like arguments associated with the intent. When you receive the seek intent, you'll likely get minutes and/or seconds slots that tell you what time to set the video to. Using the slots, our function for seeking through the video looks something like this:

private void SeekVideo(Dictionary<string, string> slots)
{
    // convert the spoken time into a total number of seconds
    int offsetSeconds = 0;
    if (slots.ContainsKey("hours"))
    {
        offsetSeconds += int.Parse(slots["hours"]) * 3600;
    }

    if (slots.ContainsKey("minutes"))
    {
        offsetSeconds += int.Parse(slots["minutes"]) * 60;
    }

    if (slots.ContainsKey("seconds"))
    {
        offsetSeconds += int.Parse(slots["seconds"]);
    }

    if (slots.ContainsKey("direction"))
    {
        // relative seek, e.g. "skip ahead one minute"
        if (slots["direction"] == "forward" ||
            slots["direction"] == "forwards" ||
            slots["direction"] == "ahead")
        {
            _videoPlayer.time += offsetSeconds;
        }
        else
        {
            _videoPlayer.time -= offsetSeconds;
        }
    }
    else
    {
        // absolute seek, e.g. "go to minute two"
        _videoPlayer.time = offsetSeconds;
    }
}
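The other handlers follow the same pattern: read the slots, then apply the change to the VideoPlayer. For example, a volume handler might look like this sketch (the slot name "volumeLevel" and its 0–10 range are assumptions; the actual slots are defined by your Rhino context):

```csharp
private void ChangeVolume(Dictionary<string, string> slots)
{
    if (slots.ContainsKey("volumeLevel"))
    {
        // Rhino delivers slot values as strings; map an assumed 0-10 scale
        // onto Unity's 0-1 audio volume range
        float level = float.Parse(slots["volumeLevel"]) / 10f;
        _videoPlayer.SetDirectAudioVolume(0, Mathf.Clamp01(level));
    }
}
```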

Lastly, connect each intent to a change in the UI and you're done!

Resources:
Open-source code for the tutorial
Picovoice Console
Picovoice Platform SDK
