Andrew Jones
How to Get Audio Transcriptions from Whisper without a File System

Whisper is OpenAI's speech-to-text transcription model. It lets developers submit audio (plus an optional prompt to guide style) and receive transcribed text in response.

However, the official OpenAI Node.js SDK docs show only one way to use Whisper: reading an audio file from disk with fs.

import fs from "fs";
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function main() {
  const transcription = await openai.audio.transcriptions.create({
    file: fs.createReadStream("audio.mp3"),
    model: "whisper-1",
  });

  console.log(transcription.text);
}

main();

That works fine if you have static files... but in any consumer application, we'll be processing audio sent from an end-user client such as an app or web browser. Receiving audio from thousands of users and saving each clip as a file is a major waste of disk space and a huge inefficiency. Plus, serverless deployment is extremely popular today, and serverless environments usually don't offer persistent file storage. I wrote this article because it was surprisingly hard to figure out how to transcribe audio without saving it to a file first.

How to use Whisper without files

On the client side, you'll need to get your audio into a Base64-encoded string. I'm using the "@ricky0123/vad-react" library for this, which ships with utilities to do exactly that:

onSpeechEnd: (audio) => {
  // Encode the raw audio samples as a WAV file, then as Base64
  const wavBuffer = utils.encodeWAV(audio);
  const base64 = utils.arrayBufferToBase64(wavBuffer);
  // POST with a body payload so the Base64 string can't exceed URL length limits
  fetch("/api/transcribe", {
    method: "POST",
    body: JSON.stringify({ audioData: base64 }),
  });
},

Then on the server side, the trick is to decode the Base64 data into a Buffer and pass it to the undocumented toFile helper exported by OpenAI's library.

import OpenAI, { toFile } from "openai";

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

export default async function handler(
  req,
  res
) {
  try {
    // Extract the Base64-encoded data from the request.
    // Next.js pre-parses the body when a JSON Content-Type is sent,
    // so handle both the raw-string and already-parsed cases.
    const bodyData =
      typeof req.body === "string" ? JSON.parse(req.body) : req.body;
    const base64Audio = bodyData.audioData;

    // Decode Base64 to binary
    const audioBuffer = Buffer.from(base64Audio, "base64");

    // Use OpenAI API to transcribe the audio
    const transcription = await openai.audio.transcriptions.create({
      file: await toFile(audioBuffer, "audio.wav", {
        contentType: "audio/wav",
      }),
      model: "whisper-1",
    });

    // Send the transcription text as response
    res.json({ transcription: transcription.text });
  } catch (error) {
    console.error("Error during transcription:", error);
    res.status(500).send("Error during transcription");
  }
}
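As an optional hardening step (my own addition, not from the OpenAI docs), you can sanity-check the decoded buffer before spending an API call on it. WAV files are RIFF containers, so bytes 0-3 spell "RIFF" and bytes 8-11 spell "WAVE":

```javascript
// Hedged sketch: a cheap structural check on a decoded WAV buffer
// before handing it to Whisper.
function looksLikeWav(buf) {
  return (
    buf.length >= 12 &&
    buf.toString("ascii", 0, 4) === "RIFF" &&
    buf.toString("ascii", 8, 12) === "WAVE"
  );
}

// A minimal 12-byte header passes; arbitrary bytes do not
const header = Buffer.concat([
  Buffer.from("RIFF"),
  Buffer.alloc(4), // chunk size (zeroed here just for illustration)
  Buffer.from("WAVE"),
]);
console.log(looksLikeWav(header)); // true
console.log(looksLikeWav(Buffer.from("not audio"))); // false
```

In the handler above, you could return a 400 instead of calling openai.audio.transcriptions.create when this check fails, saving a round trip on malformed uploads.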

Voila! With this approach, you can use Whisper without saving every user's audio as a static file, which makes it usable in a serverless environment.

Top comments (2)

Rom (rdewolff):
Thanks for the tip!

Yash Rao (yrao):
Thank you, this is very helpful. Saved me a lot of hours. Wish OpenAI had better docs.