This week, as I continued diving deeper into ChatCraft - the developer-oriented ChatGPT - I found a few opportunities to meaningfully contribute to the project.
In this post, I'll be sharing those contributions, with the major one focusing on OpenAI's recently released Text-to-Speech API. I'll be referring to it as TTS from now on, so bear with me.
Table of Contents
1. The Requirement 📋
2. Implementation 🐱👤
2.1. Experimenting with the SDK 🛠️
2.2. Integration with App 🔗
a. TTS Toggle Button
b. Audio Queuing
c. Avoiding duplicate announcements
d. Buffering LLM Responses
e. Optimizing Buffering Algorithm
3. More Work 🫡
4. Upcoming
The Requirement 📋
Earlier this week, I received a GitHub Notification from ChatCraft regarding a new issue that was filed by Taras - the project owner.
For a long time, I had been looking for something exciting to work on, and this was it. ChatCraft already supported speech-to-text transcription using Whisper, another one of OpenAI's models with unique capabilities, so integrating Text-to-Speech would essentially turn our application into something like an Amazon Alexa, but with a brain powered by the same LLM that ChatGPT uses.
And the fact that this feature was released not so long ago made this challenge even more exciting.
TTS at 42s ⏲️
Implementation 🐱👤
Without wasting any time, I started exploring the official documentation, where I found samples for getting started with the SDKs for various languages,
different models for audio quality,
and, most exciting for me, different configurable voices with a preview for each.
I also found the ability to stream real-time audio, but I couldn't make it work in Node, a problem many discussions online also ran into.
That's why I crafted my own algorithm for better performance, which I'll discuss later in the post.
Experimenting with the SDK 🛠️
After going through the documentation, it was time to play around and actually get something working before signing up for the task.
And as always, it's NEVER a smooth ride 🥹
I got weird compile errors suggesting that OpenAI did not support any such feature.
After banging my head against the wall for a few minutes, I found that the version of the openai package we were using did not support it.
Thanks to this guy
And so, I impulsively upgraded to the latest version of openai (I guess not anymore) without the fear of getting cut by the cutting edge 😝
and got it working for some random text
export const textToSpeech = async (message: string) => {
  const { apiKey, apiUrl } = getSettings();
  if (!apiKey) {
    throw new Error("Missing API Key");
  }

  const { openai } = createClient(apiKey, apiUrl);

  const mp3 = await openai.audio.speech.create({
    model: "tts-1",
    voice: "onyx",
    input: message,
  });

  const blob = new Blob([await mp3.arrayBuffer()], { type: "audio/mpeg" });
  const objectUrl = URL.createObjectURL(blob);

  // Testing for now
  const audio = new Audio(objectUrl);
  audio.play();
};
and gathered enough confidence to sign up for the issue.
Integration with App 🔗
Getting it working for a random test was fairly easy, but the real deal would be integrating it into a complex application like ChatCraft.
This would mean implementing necessary UI and functionality.
I started thinking of a way to announce the response from LLM as it was being generated and a button that could allow users to enable/disable this behaviour.
TTS Toggle Button
To begin with, I added the toggle control in the prompt send button component.
{isTtsSupported() && (
  <Tooltip label={settings.announceMessages ? "TTS Enabled" : "TTS Disabled"}>
    <IconButton
      type="button"
      size="lg"
      variant="solid"
      aria-label={settings.announceMessages ? "TTS Enabled" : "TTS Disabled"}
      icon={settings.announceMessages ? <AiFillSound /> : <AiOutlineSound />}
      onClick={() =>
        setSettings({ ...settings, announceMessages: !settings.announceMessages })
      }
    />
  </Tooltip>
)}
isTtsSupported simply checks if we're using OpenAI as the provider.
// Audio Recording and Transcribing depends on a bunch of technologies
export function isTtsSupported() {
  return usingOfficialOpenAI();
}
However, this will need to change, as other providers like OpenRouter may also start supporting this feature in the future.
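If that happens, the check might be generalized along these lines. This is just a sketch: Provider and providerSupportsTts are made-up stand-ins, not existing ChatCraft APIs.

// Hypothetical sketch: gate TTS on a per-provider capability flag
// instead of hard-coding the official OpenAI check.
// Provider and providerSupportsTts() do not exist in ChatCraft today.
export function isTtsSupported(currentProvider: Provider): boolean {
  return usingOfficialOpenAI() || providerSupportsTts(currentProvider);
}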
To persist the user preference, I added an announceMessages option to our settings model:
export type Settings = {
  apiKey?: string;
  model: ChatCraftModel;
  apiUrl: string;
  temperature: number;
  enterBehaviour: EnterBehaviour;
  countTokens: boolean;
  sidebarVisible: boolean;
  alwaysSendFunctionResult: boolean;
  customSystemPrompt?: string;
  announceMessages?: boolean;
};
which I would later leverage to determine if responses need to be announced or not!
Audio Queuing
After that, I had to find the code 🔍 that was handling response streaming, which I eventually found in this file.
So I left a comment there, to continue after a short tea break ☕
Okay, I am back!!!
Now it was time to work on the actual logic.
Looking at the entire problem at once was too intimidating, which meant I needed to break it into manageable pieces.
The first thing was to make sure that any audio clips I generated were played in order, and the best thing to use for that purpose is the good old queue data structure. I used ChatCraft itself to help me get started, and it gave me some code for what I wanted to do. That gave me an idea of how I could do it, but I was quite sure that audio operations and queue management belonged in their own separate file.
So I asked ChatCraft to generate a custom hook for me, essentially abstracting away all the implementation logic.
I called it useAudioPlayer.
import { useState, useEffect } from "react";

const useAudioPlayer = () => {
  // Queue of pending audio clip URLs (kept as unresolved Promises) and a
  // flag tracking whether a clip is currently playing
  const [queue, setQueue] = useState<Promise<string>[]>([]);
  const [isPlaying, setIsPlaying] = useState<boolean>(false);

  // Whenever the queue changes and nothing is playing, play the clip at the front
  useEffect(() => {
    if (!isPlaying && queue.length > 0) {
      playAudio(queue[0]);
    }
  }, [queue, isPlaying]);

  const playAudio = async (audioClipUri: Promise<string>) => {
    setIsPlaying(true);
    const audio = new Audio(await audioClipUri);
    // When this clip finishes, drop it from the queue so the effect above
    // picks up the next one
    audio.onended = () => {
      setQueue((oldQueue) => oldQueue.slice(1));
      setIsPlaying(false);
    };
    audio.play();
  };

  const addToAudioQueue = (audioClipUri: Promise<string>) => {
    setQueue((oldQueue) => [...oldQueue, audioClipUri]);
  };

  return { addToAudioQueue };
};

export default useAudioPlayer;
You'll notice that it's managing Promises returned by the textToSpeech function that you might remember from before:
/**
 *
 * @param message The text for which speech needs to be generated
 * @returns The URL of generated audio clip
 */
export const textToSpeech = async (message: string): Promise<string> => {
  const { apiKey, apiUrl } = getSettings();
  if (!apiKey) {
    throw new Error("Missing API Key");
  }

  const { openai } = createClient(apiKey, apiUrl);

  const mp3 = await openai.audio.speech.create({
    model: "tts-1",
    voice: "onyx",
    input: message,
  });

  const blob = new Blob([await mp3.arrayBuffer()], { type: "audio/mp3" });
  const objectUrl = URL.createObjectURL(blob);

  return objectUrl;
};
Previously, I was awaiting this URL before pushing it into the queue. This defeated the whole purpose of queuing, as the order of audio URLs depended on which one finished awaiting first. To get around this, I decided to pass in Promise<string>, i.e. raw promises like in the screenshot above, and await them only when playAudio was called.
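To make the difference concrete, here's a minimal sketch (not the exact ChatCraft code) of the two approaches:

// Problematic: awaiting before enqueueing; the push order now depends on
// which TTS request happens to resolve first, so clips can play out of order
//   const url = await textToSpeech(newWords);
//   addToAudioQueue(url);

// Instead: enqueue the raw Promise right away, so the queue order matches
// the order the text was generated; useAudioPlayer awaits it only at playback time
const audioClipUri = textToSpeech(newWords); // Promise<string>, not awaited
addToAudioQueue(audioClipUri);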
To summarize, any audio URL that is pushed into the queue triggers a side effect that checks if an audio clip is already playing. If not, it converts it to an Audio element and starts playing it. When any audio clip stops playing, the isPlaying state is set to false, triggering that side effect again, which plays the next audio clip in the queue, and so on...
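For reference, consuming the hook from a component looks roughly like this (a hypothetical component with made-up import paths, not ChatCraft's actual code):

import useAudioPlayer from "./hooks/use-audio-player"; // hypothetical path
import { textToSpeech } from "./lib/ai"; // hypothetical path

function AnnounceDemo() {
  const { addToAudioQueue } = useAudioPlayer();

  const announce = (text: string) => {
    // Push the unresolved Promise so clips stay in generation order
    addToAudioQueue(textToSpeech(text));
  };

  return <button onClick={() => announce("Hello from ChatCraft!")}>Announce</button>;
}

export default AnnounceDemo;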
Avoiding duplicate announcements
Okay, now I was confident that my audio clips would play in the order that I push them into the queue.
But I had forgotten to account for the fact that whenever the onData function was called, the entire currentText was passed to the TTS method, leading to speech like:
"I"
"I am"
"I am ChatCraft"
and so on...
That sure was in order, but you get the idea of what was wrong.
To fix this, ChatCraft suggested keeping track of the last processed word and only generating audio for the newWords:
let lastIndex = 0;

const chat = chatWithLLM(messages, {
  model,
  functions,
  functionToCall,
  onPause() {
    setPaused(true);
  },
  onResume() {
    setPaused(false);
  },
  async onData({ currentText }) {
    if (!pausedRef.current) {
      // TODO: Hook tts code here
      const newWords = currentText.split(" ").slice(lastIndex).join(" ");
      lastIndex = currentText.split(" ").length;
      if (newWords.length > 0) {
        const audioClipUri = textToSpeech(newWords);
        addToAudioQueue(audioClipUri);
      }
      setStreamingMessage(
And as you might guess, no more repeated words.
Buffering LLM Responses
Now there were no repeated words, and they played in order. But the problem was that the LLM response stream always delivered only one new word at a time. This meant every audio clip consisted of just one word, and there were as many calls to the TTS API as there were words in the response.
An extremely large number of requests in such a short amount of time is completely unnecessary and leads to this:
Even if there had been no rate limiting, the speech sounded weird: every audio clip takes time to play, and imagine what that's like when each clip is one word long.
It sounded like:
"I ... am ... ChatCraft"
To fix that, I came up with the idea of buffering the LLM response up to a certain maximum number of words before calling the TTS API.
https://stackoverflow.com/questions/648309/what-does-it-mean-by-buffer
Here's the logic:
let lastTTSIndex = 0; // To calculate new words in the AI generated text stream

// Buffer the response stream before calling tts function
// This reduces latency and number of TTS api calls
const TTS_BUFFER_THRESHOLD = 50;
const ttsWordsBuffer: string[] = [];

const chat = chatWithLLM(messages, {
  model,
  functions,
  functionToCall,
  onPause() {
    setPaused(true);
  },
  onResume() {
    setPaused(false);
  },
  async onData({ currentText }) {
    if (!pausedRef.current) {
      // Hook tts code here
      const newWords = currentText.split(" ").slice(lastTTSIndex);
      const newWordsCount = currentText.split(" ").length;
      lastTTSIndex = newWordsCount;
      ttsWordsBuffer.push(...newWords);
      if (
        isTtsSupported() &&
        getSettings().announceMessages &&
        ttsWordsBuffer.length >= TTS_BUFFER_THRESHOLD
      ) {
        const audioClipUri = textToSpeech(ttsWordsBuffer.join(" "));
        addToAudioQueue(audioClipUri);
        // Clear the buffer
        ttsWordsBuffer.splice(0);
      }
      ...
      ...
The following commit has all the changes that went in
https://github.com/tarasglek/chatcraft.org/pull/357/commits/1f828ae5cfbe6ff2a07a647eada96b14023bde4f
And Voila! It was finally working as I expected.
So I opened a Pull Request
There have been many conversations since I opened the Pull Request, and there are many more things I have to work on in the future.
Optimizing Buffering Algorithm
The solution that I mentioned above was working FINE, but the time it took for speech to start was too long, since it took a while for at least 50 words to accumulate in the buffer.
The solution was sentence-based buffering. Instead of waiting for a certain number of words, I could start the speech as soon as there was one full sentence available in the buffer.
Here's the logic I came up with this time:
You can check the entire code in this commit. It took hours to make it work 👀
// Set a maximum number of words in a sentence that we need to wait for.
// This reduces latency and number of TTS api calls
const TTS_BUFFER_THRESHOLD = 25;

// To calculate the current position in the AI generated text stream
let ttsCursor = 0;
let ttsWordsBuffer = "";
const sentenceEndRegex = new RegExp(/[.!?]+/g);

const chat = chatWithLLM(messages, {
  model,
  functions,
  functionToCall,
  onPause() {
    setPaused(true);
  },
  onResume() {
    setPaused(false);
  },
  async onData({ currentText }) {
    if (!pausedRef.current) {
      // Hook tts code here
      ttsWordsBuffer = currentText.slice(ttsCursor);
      if (
        isTtsSupported() &&
        getSettings().announceMessages &&
        sentenceEndRegex.test(ttsWordsBuffer) // Has full sentence
      ) {
        // Reset lastIndex before calling exec
        sentenceEndRegex.lastIndex = 0;
        const sentenceEndIndex = sentenceEndRegex.exec(ttsWordsBuffer)!.index;
        // Pass the sentence to tts api for processing
        const textToBeProcessed = ttsWordsBuffer.slice(0, sentenceEndIndex + 1);
        const audioClipUri = textToSpeech(textToBeProcessed);
        addToAudioQueue(audioClipUri);
        // Update the tts Cursor
        ttsCursor += sentenceEndIndex + 1;
      } else if (ttsWordsBuffer.split(" ").length >= TTS_BUFFER_THRESHOLD) {
        // Flush the entire buffer into tts api
        const audioClipUri = textToSpeech(ttsWordsBuffer);
        addToAudioQueue(audioClipUri);
        ttsCursor += ttsWordsBuffer.length;
      }
      setStreamingMessage(
        new ChatCraftAiMessage({
          id: message.id,
          date: message.date,
          model: message.model,
          text: currentText,
        })
      );
      incrementScrollProgress();
    }
  },
});
Here's the final result 🎉
More Work 🫡
Apart from this, I also worked on improving the Audio Recording UI this week.
The aim was to adopt the "press to start, press to stop" behaviour for the recording button.
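Conceptually it's just a toggle. A rough sketch of the idea (hypothetical component, assuming startRecording/stopRecording helpers exist elsewhere; not ChatCraft's actual code):

import { useState } from "react";

// Hypothetical sketch of "press to start, press to stop" behaviour;
// startRecording/stopRecording are assumed helpers, not real ChatCraft APIs
function RecordButton({
  startRecording,
  stopRecording,
}: {
  startRecording: () => Promise<void>;
  stopRecording: () => Promise<void>;
}) {
  const [isRecording, setIsRecording] = useState(false);

  const handleClick = async () => {
    if (isRecording) {
      await stopRecording();
    } else {
      await startRecording();
    }
    setIsRecording(!isRecording);
  };

  return <button onClick={handleClick}>{isRecording ? "Stop" : "Record"}</button>;
}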
I also helped a Pull Request from Yumei get merged by reviewing and suggesting some changes.
Even though I was supposed to get a Pull Request merged this week and couldn't, I technically got some code in using the suggestion feature.
Don't say that's cheating now 😉
Here's the Pull Request
https://github.com/tarasglek/chatcraft.org/pull/369
Upcoming
In this post, I discussed my various contributions to ChatCraft this week.
There's still a lot of work that needs to be done for TTS support in follow-ups, like the ability to choose between different voices, downloading the speech for a response, cancelling the currently playing audio, and so on...
First, I'll have to redo the sentence tokenizing logic using a library suggested by my professor.
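One possible direction (just an illustration, not necessarily what the suggested library does) is the built-in Intl.Segmenter, which can split text into sentences without hand-rolled regexes:

// Sketch only: sentence splitting with the standard Intl.Segmenter API
const segmenter = new Intl.Segmenter("en", { granularity: "sentence" });

function splitIntoSentences(text: string): string[] {
  return Array.from(segmenter.segment(text), (s) => s.segment.trim()).filter(Boolean);
}

// splitIntoSentences("I am ChatCraft. How can I help?")
// -> ["I am ChatCraft.", "How can I help?"]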
I'll soon post about the work that I do for TTS.
In the meantime, STAY TUNED!