Dilek Karasoy for Picovoice

Posted on Jan 24, 2023

Day 17: Speech-to-Text with gRPC and Golang

#challenge #100daysofcode #go #microservices

To have a working gRPC microservice, three components are essential:

.proto file to define the gRPC services and messages
server to process the submitted audio and return the transcription
client to talk to the server

.proto file

syntax = "proto3";
package messaging;
option go_package = "go-grpc/messaging";
service LeopardService {
  rpc GetTranscriptionFile(stream Chunk) returns (transcriptResponse) {}
}
message Chunk {
  bytes Content = 1;
}

For simplicity, LeopardService defines only one RPC method, GetTranscriptionFile.

By default, gRPC limits incoming messages to 4 MB, and audio files can easily be larger. GetTranscriptionFile is therefore declared as a client-side streaming RPC (note the stream keyword before Chunk), so the client can send a file as a sequence of byte chunks.

enum StatusCode {
  Unknown = 0;
  Ok = 1;
  Failed = 2;
}
message transcriptResponse {
  string transcript = 1;
  StatusCode Code = 2;
}

Since both the server and the client are written in Go, the next step is to compile the .proto file with protoc, which generates the Go message types and the gRPC client and server stubs.
Client:
First, we need a client for the LeopardService defined in the .proto file.

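func main() {
    // serverAddressArg and inputAudioPathArg are command-line flags defined at package level
    flag.Parse()

    // plaintext connection for local testing; grpc.WithInsecure() is deprecated
    conn, err := grpc.Dial(*serverAddressArg, grpc.WithTransportCredentials(insecure.NewCredentials()))
    if err != nil {
        log.Fatalf("failed to connect to %s: %v", *serverAddressArg, err)
    }
    defer conn.Close()

    client := messaging.NewLeopardServiceClient(conn)
    if err := runTranscriptionFile(client, *inputAudioPathArg); err != nil {
        log.Fatalf("transcription failed: %v", err)
    }
}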

Inside the runTranscriptionFile function, the audio file is read in 1 MB chunks, and each chunk is sent to the server. The whole call is bounded by a 10-second timeout. Finally, CloseAndRecv closes the sending side of the stream and waits for the server's response.

func runTranscriptionFile(client messaging.LeopardServiceClient, filePath string) error {
    file, err := os.Open(filePath)
    if err != nil {
        return err
    }
    defer file.Close()

    // the entire upload and transcription must finish within 10 seconds
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    stream, err := client.GetTranscriptionFile(ctx)
    if err != nil {
        return err
    }

    buf := make([]byte, 1024*1024) // 1 MB, below gRPC's default 4 MB message limit
    for {
        n, readErr := file.Read(buf)
        if n > 0 {
            // send the loaded bytes to the server
            if err := stream.Send(&messaging.Chunk{Content: buf[:n]}); err != nil {
                return err
            }
        }
        if readErr == io.EOF {
            break
        }
        if readErr != nil {
            return readErr
        }
    }

    // signal the server that the upload is done and wait for its response
    reply, err := stream.CloseAndRecv()
    if err != nil {
        return err
    }
    log.Printf("reply: %v", reply)
    return nil
}

Server:
On the server side, a gRPC server is created, and a LeopardService implementation is registered with it to handle incoming calls.

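func main() {
    // port and accessKeyArg are command-line flags defined at package level
    flag.Parse()
    addr := fmt.Sprintf("localhost:%d", *port)
    lis, err := net.Listen("tcp", addr)
    if err != nil {
        log.Fatalf("failed to listen on %s: %v", addr, err)
    }
    grpcServer := grpc.NewServer()
    messaging.RegisterLeopardServiceServer(grpcServer, newServer(*accessKeyArg))
    if err := grpcServer.Serve(lis); err != nil {
        log.Fatalf("server stopped: %v", err)
    }
}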

When a transcription request arrives, the server creates a Leopard instance and keeps reading the incoming chunks until Recv returns io.EOF, which means the client has finished sending. It then writes the collected bytes to a temporary file and passes the file to Leopard. Finally, it sends the transcription back to the client along with a status code.

func (s *leopardServer) GetTranscriptionFile(stream messaging.LeopardService_GetTranscriptionFileServer) error {
    // create an instance of Leopard and initialize it
    engine := leopard.NewLeopard(s.accessKey)
    if err := engine.Init(); err != nil {
        return err
    }
    defer engine.Delete()

    var audio []byte
    for {
        // keep reading chunks from the stream until the client closes it
        audioFileChunk, err := stream.Recv()
        if err == io.EOF {
            break
        }
        if err != nil {
            return err
        }
        audio = append(audio, audioFileChunk.Content...)
    }

    // default values returned if any step below fails
    transcription := ""
    statusCode := messaging.StatusCode_Failed

    // store the received audio in a temporary file
    f, err := os.CreateTemp("", "audio_temp_file")
    if err == nil {
        defer os.Remove(f.Name())
        _, err = f.Write(audio)
        closeErr := f.Close()
        if err == nil {
            err = closeErr
        }
    }
    if err == nil {
        // transcribe the file as-is with Leopard's ProcessFile method
        transcription, _, err = engine.ProcessFile(f.Name())
    }
    if err == nil {
        statusCode = messaging.StatusCode_Ok
    } else {
        log.Printf("transcription failed: %v", err)
    }

    // send back the result and close the stream
    return stream.SendAndClose(&messaging.TranscriptResponse{
        Transcript: transcription,
        Code:       statusCode,
    })
}

We could also have sent the audio as raw PCM and fed it directly to Leopard without writing it to disk, but there are two caveats:

the client would need more preprocessing, since it has to decode the audio file into PCM first.
the amount of data to transfer is significantly larger for raw audio than for compressed formats such as MP3 or OGG.
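To put a rough number on the second caveat, the sketch below compares one minute of raw PCM with the same minute in a compressed format. It assumes 16 kHz, 16-bit, mono PCM (taken here to be the format Leopard expects) and an illustrative 64 kbit/s compressed bitrate; real MP3 or OGG sizes depend on the encoder settings. Both helper functions are hypothetical.

```go
package main

import "fmt"

// rawPCMBytes returns the size in bytes of uncompressed PCM audio.
func rawPCMBytes(seconds, sampleRate, channels, bitsPerSample int) int {
	return seconds * sampleRate * channels * bitsPerSample / 8
}

// compressedBytes returns the size in bytes of audio encoded at a constant
// bitrate given in kbit/s.
func compressedBytes(seconds, kbps int) int {
	return seconds * kbps * 1000 / 8
}

func main() {
	// one minute of 16 kHz, 16-bit, mono PCM (assumed input format)
	raw := rawPCMBytes(60, 16000, 1, 16)
	// the same minute at an assumed 64 kbit/s compressed bitrate
	compressed := compressedBytes(60, 64)
	fmt.Println(raw, compressed, raw/compressed)
}
```

Under these assumptions it prints `1920000 480000 4`: the raw stream is about four times larger, and the gap widens at the lower bitrates often used for speech.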

Learn more about Leopard and check out the open-source demos.
