This week, I came across an interesting problem while working on an issue in ChatCraft.
A few weeks ago, I added text-to-speech support to the app, which, when combined with other features, leads to some 😎 use cases.
I have already written a few posts about it. Feel free to take a look for more context.
In this post, I'll talk about a problem I ran into with Text to Speech, how I solved it, and how I used concurrency to speed up the end result.
Table of Contents
1. The Problem
2. Breaking larger messages into Chunks
2.1. Algorithm for Natural Chunks
2.2. Using Text Chunks to generate Audio
2.3. Optimizing Download Speeds
3. Results
4. Release 2.0
The Problem
The Text to Speech feature had been working really well, until I encountered this error a few days ago.
What I was trying to do was download an interview as audio after importing its transcript into ChatCraft using Web Handlers (as simple as pasting the YouTube video's URL into the prompt).
But while implementing the Download and Speak features for chat messages, I forgot to consider that OpenAI's TTS API limits how much text it can process in a single request (4096 characters).
That is why I got this error: I was sending the entire chat content in one request.
Breaking larger messages into Chunks
An obvious solution to this problem was to break larger messages into smaller chunks, send those chunks for audio generation, concatenate the generated audio blobs into a single blob, and finally download the combined audio to the user's file system.
On the surface, this may seem like a very straightforward problem to solve, but there is another aspect to it.
For the TTS model to generate natural audio, the source it ingests also needs to be natural. In other words, I couldn't just cut the text at a fixed maximum length without considering scenarios that could lead to unnatural results.
For example,
- Parts of a single word could end up in different chunks.
- Parts of the same sentence could end up in different chunks.
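To make those two failure cases concrete, here is a minimal sketch of the naive approach. The helper name `naiveChunks` and the sample text are made up for illustration and are not part of ChatCraft.

```typescript
// Hypothetical helper (not ChatCraft code): cut the text every `maxChars`
// characters without looking at word or sentence boundaries.
function naiveChunks(text: string, maxChars: number): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += maxChars) {
    chunks.push(text.slice(i, i + maxChars));
  }
  return chunks;
}

const sample =
  "Text to speech works best on whole sentences. Broken words sound odd.";
const naive = naiveChunks(sample, 20);

// The second chunk ends in the middle of the word "sentences",
// and the first sentence is spread over three chunks.
console.log(naive);
```

Each chunk respects the length limit, but a TTS model would read "sente" and "nces." as two separate fragments, which is exactly what the sentence-aware approach tries to avoid.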
Algorithm for Natural Chunks
To get around this problem, I leveraged the Natural Language Processing library we are using in the project (Compromise) to make sure that each chunk was shorter than a specified character length and that each chunk ended with a full sentence (as long as the sentence itself wasn't longer than the limit).
Here's the algorithm:
- First, I try to break the entire text into sentences using the NLP library mentioned above.
- Then I initialize a text buffer variable and start iterating over each sentence.
- If adding the current sentence to the text buffer does not exceed the preferred chunk length, I add it to the buffer.
- If it does exceed the limit, I push the buffer content as a new chunk to the chunks array and assign the current sentence's value to the text buffer.
- There is still one exception here. If the current sentence's length itself is greater than the preferred chunk length, I have to force break it into smaller chunks, as there is no way to get a natural, meaningful chunk at that point.
- Lastly, I return the list of generated chunks.
The code looks like this:
import nlp from "compromise/one";

export function tokenize(text: string) {
  const sentences: string[] = nlp(text)
    .json()
    .map((s: { text: string }) => s.text);
  const terms = nlp(text).terms().out("array");
  return { sentences, terms };
}
/**
*
* Tries to split the provided text into
* an array of text chunks where
* each chunk is composed of one or more sentences.
*
* The function attempts to limit each chunk to maximum
* preferred characters.
* If a single sentence exceeds preferred character length,
* that sentence will be force broken into chunks of preferred length
* with no guarantee that individual chunks make sense.
*
* @param text The text content that needs to be split into Chunks
* @param maxCharsPerSentence Maximum number of characters preferred per chunk
* @returns Array of text chunks
*/
export function getSentenceChunksFrom(text: string, maxCharsPerSentence: number = 4096): string[] {
  const { sentences } = tokenize(text);
  const chunks: string[] = [];
  let currentText = "";
  for (const sentence of sentences) {
    if (sentence.length >= maxCharsPerSentence) {
      // If the sentence itself is greater than maxCharsPerSentence
      // Flush existing text buffer as a chunk
      if (currentText.length) {
        chunks.push(currentText);
        currentText = "";
      }
      // Force break the long sentence without caring
      // about natural language
      const sentencePieces =
        sentence.match(new RegExp(`.{1,${maxCharsPerSentence}}\\b`, "g")) || [];
      chunks.push(...sentencePieces);
    } else {
      // Check if adding the new sentence to the buffer
      // exceeds the allowed limit.
      // If not, add another sentence to the buffer
      if (currentText.length + sentence.length < maxCharsPerSentence) {
        // Add a separating space only when the buffer already has content,
        // so a chunk never starts with a stray space
        currentText = currentText.length
          ? `${currentText} ${sentence.trim()}`
          : sentence.trim();
      } else {
        // Flush the buffer as a chunk
        if (currentText.length) {
          chunks.push(currentText);
        }
        currentText = sentence;
      }
    }
  }
  // Flush whatever is left in the buffer as the last chunk
  if (currentText.length) {
    chunks.push(currentText);
    currentText = "";
  }
  return chunks;
}
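The force-break step is the least obvious part of the function, so here is a small standalone sketch of what that regular expression does. The `forceBreak` helper is hypothetical and only mirrors the `match` call above. It does not depend on Compromise.

```typescript
// Hypothetical helper mirroring the force-break step (not a ChatCraft function).
// The pattern greedily takes up to `maxChars` characters, then backtracks
// until the piece ends on a word boundary (\b).
function forceBreak(sentence: string, maxChars: number): string[] {
  const pattern = new RegExp(`.{1,${maxChars}}\\b`, "g");
  return sentence.match(pattern) ?? [];
}

const pieces = forceBreak("alpha beta gamma delta", 10);
// pieces: ["alpha beta", " gamma ", "delta"]
console.log(pieces);
```

Two caveats are worth keeping in mind with this pattern. Without the `s` flag, `.` does not match line breaks, so a long sentence containing newlines would lose them. And a long run with no word characters at all (for example, a row of exclamation marks) contains no word boundary, so `match` returns `null` and that text would be skipped.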
Using Text Chunks to generate Audio
Now that I was able to break large messages into natural language chunks, it was time to generate audio clips from them and weave those clips together into a full audio message for the user.
Here's what I did:
const handleDownloadAudio = useCallback(async () => {
  if (messageContent.current) {
    const text = messageContent.current.textContent;
    if (text) {
      const { loading, closeLoading } = await utilizeAlert();
      const alertId = loading({
        title: "Downloading...",
        message: "Please wait while we prepare your audio download.",
      });
      try {
        const textChunks = getSentenceChunksFrom(text, 4096);
        const audioClips: Blob[] = [];
        for (const textChunk of textChunks) {
          const audioClipUrl = await textToSpeech(
            textChunk,
            settings.textToSpeech.voice,
            "tts-1-hd"
          );
          audioClips.push(await fetch(audioClipUrl).then((r) => r.blob()));
        }
        const audioClip = new Blob(audioClips, { type: audioClips[0].type });
        download(
          audioClip,
          `${settings.currentProvider.name}_message.${audioClip.type.split("/")[1]}`,
          audioClip.type
        );
        closeLoading(alertId);
        info({
          title: "Downloaded",
          message: "Message was downloaded as Audio",
        });
      } catch (err: any) {
        console.error(err);
        closeLoading(alertId);
        error({ title: "Error while downloading audio", message: err.message });
      }
    }
  }
}, [error, info, settings.currentProvider.name, settings.textToSpeech.voice]);
If you are wondering about the textToSpeech function, I wrote about it in this post.
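The weaving step itself is just the `Blob` constructor, which accepts an array of parts and joins their bytes in order. Below is a minimal sketch of that idea in isolation. The `concatAudioClips` helper and the `"audio/mpeg"` fallback type are my assumptions for illustration, and the byte arrays stand in for real audio data. Plain byte concatenation is only expected to play back correctly for stream-like formats such as MP3.

```typescript
// Hypothetical helper (not ChatCraft code): join audio clips into one Blob.
// Falls back to an assumed MIME type when there are no clips or the first
// clip has no type, instead of reading `clips[0].type` unconditionally.
function concatAudioClips(clips: Blob[], fallbackType = "audio/mpeg"): Blob {
  const type = clips.length > 0 && clips[0].type ? clips[0].type : fallbackType;
  return new Blob(clips, { type });
}

// Placeholder bytes standing in for two generated audio clips
const firstClip = new Blob([new Uint8Array([1, 2, 3])], { type: "audio/mpeg" });
const secondClip = new Blob([new Uint8Array([4, 5])], { type: "audio/mpeg" });

const merged = concatAudioClips([firstClip, secondClip]);
console.log(merged.size, merged.type);
```

The order of the array is the order of the audio, which is why the clips have to be collected in the same order as the text chunks.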
After this change, ChatCraft could download messages as audio regardless of their length.
But... there was still one more problem with this.
Optimizing Download Speeds
Even though breaking large messages into chunks solved our problem, the user experience was still not optimal, as downloads of long messages were painfully slow.
The reason was that the current algorithm only processed one chunk at a time.
for (const textChunk of textChunks) {
  const audioClipUrl = await textToSpeech(
    textChunk,
    settings.textToSpeech.voice,
    "tts-1-hd"
  );
  audioClips.push(await fetch(audioClipUrl).then((r) => r.blob()));
}
It had never occurred to me until my professor suggested that we could run these promises concurrently!
He also suggested using a library called p-limit to cap the number of promises running at the same time.
I am not that knowledgeable about concurrency, so I asked ChatCraft why we would need to limit concurrent execution.
This is what I got.
After understanding the nuances of this approach, I changed my implementation to generate up to 8 audio chunks concurrently, and limited each chunk's length to a maximum of 500 characters.
Here's my implementation.
const textChunks = getSentenceChunksFrom(text, 500);
const audioClips: Blob[] = new Array<Blob>(textChunks.length);
// Limit the number of concurrent tasks
const pLimit = (await import("p-limit")).default;
const limit = pLimit(8); // Adjust the concurrency limit as needed
const tasks = textChunks.map((textChunk, index) => {
  return limit(async () => {
    const audioClipUrl = await textToSpeech(
      textChunk,
      settings.textToSpeech.voice,
      "tts-1-hd"
    );
    const audioClip = await fetch(audioClipUrl).then((r) => r.blob());
    // Store by index so the clips stay in the original order
    audioClips[index] = audioClip;
  });
});
// Wait for all the tasks to complete
await Promise.all(tasks);
const audioClip = new Blob(audioClips, { type: audioClips[0].type });
download(
  audioClip,
  `${settings.currentProvider.name}_message.${audioClip.type.split("/")[1]}`,
  audioClip.type
);
You'll notice that I am using a dynamic import for p-limit, to avoid unnecessarily increasing the initial bundle size of the application.
I also had to make sure the generated clips stayed in order by keeping track of indices, now that the execution was no longer sequential.
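To show the two ideas (a cap on in-flight work, and results stored by index) without any dependency, here is a minimal hand-rolled sketch. It is not how p-limit is implemented and not the code ChatCraft uses. `mapWithConcurrency` and `fakeTextToSpeech` are hypothetical names, and the timer-based worker stands in for the real network call.

```typescript
// Hypothetical helper: run `worker` over `items` with at most `limit`
// calls in flight, returning results in the same order as `items`.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0; // index of the next item nobody has picked up yet

  // Each runner keeps pulling the next unclaimed item until none are left.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      // Writing by index keeps the output ordered even when a later
      // item finishes before an earlier one.
      results[index] = await worker(items[index], index);
    }
  }

  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}

// Stand-in for the real TTS request: resolves after a delay that depends
// on the chunk length, so completion order differs from input order.
async function fakeTextToSpeech(chunk: string): Promise<string> {
  await new Promise<void>((resolve) => setTimeout(resolve, chunk.length));
  return `audio(${chunk})`;
}

async function demo(): Promise<string[]> {
  const chunks = ["a long first chunk", "short", "medium one"];
  return mapWithConcurrency(chunks, 2, (chunk) => fakeTextToSpeech(chunk));
}

demo().then((clips) => console.log(clips));
```

In the demo, "short" finishes before "a long first chunk", but the returned array still follows the input order because every result is written to its own slot.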
Results
With all these changes, I was able to download an audio clip more than 30 minutes long in just under a minute!
I guess Taras is excited about downloading an Audio Book next 😝
Release 2.0
We have been building on ChatCraft 1.0 for about 4 months now. After 9 minor releases, it was finally time to do a major release this week and I am glad this work could go in that 🥳