AI Integration with streamtasks

Streamtasks empowers you to configure real-time data pipelines with ease. By leveraging these capabilities, you can build your video production pipelines in software and seamlessly integrate tools, such as AI models, that would otherwise be hard to combine.

With streamtasks, the possibilities for incorporating AI into your workflows are vast and varied. I'll showcase some of the ways to harness the power of AI with streamtasks.

The AI tasks are not included in the MSI and Flatpak installers, since they would use up too much space and make the installers explode in size. To use the AI tasks, you must install streamtasks manually with pip, as described in the Documentation.

pip install "streamtasks[media,inference]"

Improving Speech

Two tasks are available to enhance the quality of your audio data: Spectral Mask Enhancement and Waveform Speech Enhancement. Both tasks remove background noise from the audio you supply to them, improving its overall quality.

  1. Spectral Mask Enhancement applies a spectral masking technique to boost speech clarity. It effectively removes background noise and produces consistent results, but may occasionally introduce additional noise.
  2. Waveform Speech Enhancement is skilled at eliminating background noise, but sometimes truncates the speech signal.

These tasks only accept mono, floating-point audio data at 16 kHz. Speech enhancement is powered by the SpeechBrain library. If you would like to use different models or explore how it works, I recommend visiting SpeechBrain on Hugging Face.
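If you want to see what these tasks do under the hood, here is a minimal standalone sketch using the pretrained SpeechBrain checkpoints published on Hugging Face. I'm assuming the two tasks wrap models of this family; the file names are placeholders, and depending on your SpeechBrain version the classes may live under speechbrain.inference instead of speechbrain.pretrained.

```python
import torch
import torchaudio
from speechbrain.pretrained import SpectralMaskEnhancement, WaveformEnhancement

# Spectral mask enhancement (MetricGAN+ checkpoint from Hugging Face).
spectral = SpectralMaskEnhancement.from_hparams(
    source="speechbrain/metricgan-plus-voicebank",
    savedir="pretrained_models/metricgan-plus-voicebank",
)
# load_audio converts the file to the mono 16 kHz float format the model expects.
noisy = spectral.load_audio("noisy.wav").unsqueeze(0)
enhanced = spectral.enhance_batch(noisy, lengths=torch.tensor([1.0]))
torchaudio.save("enhanced_spectral.wav", enhanced.cpu(), 16000)

# Waveform-domain enhancement (MTL-Mimic checkpoint).
waveform_model = WaveformEnhancement.from_hparams(
    source="speechbrain/mtl-mimic-voicebank",
    savedir="pretrained_models/mtl-mimic-voicebank",
)
enhanced = waveform_model.enhance_file("noisy.wav")
torchaudio.save("enhanced_waveform.wav", enhanced.unsqueeze(0).cpu(), 16000)
```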

In the following example, we take an audio input, such as a microphone, resample its audio to mono 16 kHz, and apply speech enhancement before sending it to an audio output.

Warning: do not do this if there is a feedback loop between your audio output and input (between your speakers and your microphone).

In addition, we place an audio switch and a radio button UI in front of our audio output, allowing us to switch between the enhancement methods and hear what the different options sound like.

[Image: switching between speech enhancements]

Live Transcription

In addition to speech enhancement, SpeechBrain also provides us with great ways of integrating automatic speech recognition (ASR).

We can use the Asr Speech Recognition task to transform our audio data into a text transcription. Like the speech enhancement tasks, it only accepts mono, floating-point audio data at 16 kHz.
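As a point of reference, transcribing a file with a pretrained SpeechBrain recognizer looks roughly like this. The CRDNN LibriSpeech checkpoint below is just one example from the SpeechBrain Hugging Face page; I'm not claiming it is the exact model the task uses.

```python
from speechbrain.pretrained import EncoderDecoderASR

# A CRDNN encoder-decoder model with an RNN language model, trained on LibriSpeech.
asr_model = EncoderDecoderASR.from_hparams(
    source="speechbrain/asr-crdnn-rnnlm-librispeech",
    savedir="pretrained_models/asr-crdnn-rnnlm-librispeech",
)
# transcribe_file loads the audio and resamples it to mono 16 kHz for us.
print(asr_model.transcribe_file("speech.wav"))
```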

In this example, we take data from an audio input, resample it, transcribe it, and then display the resulting transcription using a text display.

[Image: asr to text display]

To boost the performance of our speech recognition system, we enhance the quality of the speech first. We then employ a switch to toggle through the various enhancement methods and determine which one yields the best results.
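Expressed directly in SpeechBrain, chaining the two stages just means feeding the enhanced waveform into the recognizer. A minimal sketch, reusing the spectral and asr_model objects from the snippets above:

```python
import torch

# Enhance first, then transcribe the cleaned-up waveform.
noisy = spectral.load_audio("noisy.wav").unsqueeze(0)
clean = spectral.enhance_batch(noisy, lengths=torch.tensor([1.0]))
words, tokens = asr_model.transcribe_batch(clean, wav_lens=torch.tensor([1.0]))
print(words[0])
```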

[Image: speech enhancement deployment]

I create a dashboard and then arrange the radio buttons and text display in a way that makes it easy to test. In my case, it looks something like this:

[Image: speech enhancement dashboard]

Text to Speech

We can also utilize SpeechBrain's models for text-to-speech functionality.
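Under the hood, this pairs an acoustic model with a vocoder. Here is a minimal sketch using SpeechBrain's published FastSpeech 2 and HiFi-GAN checkpoints; I'm assuming the Tts Fast Speech 2 task builds on this model family, and the exact import paths again depend on your SpeechBrain version.

```python
import torchaudio
from speechbrain.pretrained import FastSpeech2, HIFIGAN

# FastSpeech 2 predicts a mel spectrogram from text...
fastspeech2 = FastSpeech2.from_hparams(
    source="speechbrain/tts-fastspeech2-ljspeech",
    savedir="pretrained_models/tts-fastspeech2-ljspeech",
)
# ...and HiFi-GAN turns that spectrogram into a waveform.
hifi_gan = HIFIGAN.from_hparams(
    source="speechbrain/tts-hifigan-ljspeech",
    savedir="pretrained_models/tts-hifigan-ljspeech",
)

mel_output, durations, pitch, energy = fastspeech2.encode_text(
    ["Hello from streamtasks!"], pace=1.0, pitch_rate=1.0, energy_rate=1.0
)
waveforms = hifi_gan.decode_batch(mel_output)
# The LJSpeech models produce audio at 22.05 kHz.
torchaudio.save("tts_output.wav", waveforms.squeeze(1), 22050)
```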

Let's give it a try!

In this example, I'm using a text input - a text field with a send button - and connecting it to the Tts Fast Speech 2 task. This task will then generate audio data, which we'll output through an audio output.

[Image: tts deployment simple]

LLaMA.cpp

We can utilize LLaMA.cpp (GGUF) models alongside SpeechBrain models. The Llama.cpp Chat task processes text data, generating text outputs as a chat assistant: user messages serve as input, and the assistant's responses are emitted as output. To test this functionality, we'll employ the Text Input and Text Display tasks as input and output.
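Outside of streamtasks, the same chat exchange can be reproduced with the llama-cpp-python bindings. Whether the task uses these bindings internally is my assumption, and the model path below is a placeholder for whatever GGUF file you have on disk:

```python
from llama_cpp import Llama

# The path is a placeholder; point it at any chat-tuned GGUF model.
llm = Llama(model_path="models/chat-model.gguf", n_ctx=2048, verbose=False)

reply = llm.create_chat_completion(messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
print(reply["choices"][0]["message"]["content"])
```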

[Image: llamacpp chat simple]

After sending the text "Hello!" to our Llama.cpp Chat task, we get this output:

[Image: llamacpp chat simple dashboard]

It works!

LLaMA.cpp + ASR + TTS

Let's now try building something more complicated and put it all together to make a speaking and listening LLaMA chat bot.

In order to create a proper message for our chat bot, we have to concatenate the fragments of text that the ASR task outputs. We use a String Concatenator task to do this. It receives our text on "input" and a "control" signal: when the control signal goes from low to high (from below 0.5 to above 0.5), it outputs the concatenated text and clears its buffer.

We use a Message Detector task and a Calculator task to create this control signal. The Message Detector is configured with a timeout of 2 seconds: while it is receiving text it outputs 1, and after not receiving any message for 2 seconds it outputs 0. Using the Calculator, we invert this signal to produce the desired control signal, which goes from 0 to 1 once no message has arrived for 2 seconds.

We then use the output of the String Concatenator as the input for our LLaMA chat bot. The chat bot's output is transformed into speech by the TTS task, which is then played using an audio output.

In addition, we have three text displays: a live display that appends and shows all incoming text fragments, one for the actual chat bot input, and one for the output of our chat bot.
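To make the edge-triggered behavior concrete, here is a small framework-free Python sketch of the logic described above. It is not the streamtasks API, just an illustration of how the inverted detector signal flushes the buffer:

```python
class StringConcatenator:
    """Buffers text fragments and emits them on a rising control edge."""

    def __init__(self, on_emit):
        self.fragments = []
        self.prev_control = 0.0
        self.on_emit = on_emit  # called with the concatenated message

    def input(self, text):
        self.fragments.append(text)

    def control(self, value):
        # Rising edge: the signal crosses 0.5 from below.
        if self.prev_control <= 0.5 < value and self.fragments:
            self.on_emit("".join(self.fragments))
            self.fragments.clear()
        self.prev_control = value


concat = StringConcatenator(on_emit=lambda msg: print("to chat bot:", msg))
concat.control(0.0)           # ASR is producing text, the inverted signal stays low
concat.input("Hello, ")
concat.input("how are you?")
concat.control(1.0)           # 2 s without messages -> emits "Hello, how are you?"
```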

[Image: llama asr tts deployment]

We then create a dashboard to arrange our text displays.

[Image: llama asr tts dashboard]

We've finally achieved our goal of creating a conversational AI - a chat bot that we can have a discussion with.

The Future

I'm planning to implement more advanced models in streamtasks, including Whisper and Bark, as well as image generation and segmentation models, which are currently missing. These new models will expand what is possible with the system.

To further integrate AI-generated data with multimedia content, I plan to add support for subtitles to the Output Container task, allowing users to include live transcriptions from ASR tasks in their live streams and video files.

Portability is a priority, as machine learning frameworks are often too large to be included in prebuilt installers. However, emerging frameworks like tinygrad are small enough to be reasonably included, making installation easier for less technical users. I'm planning to eventually switch all inference tasks to one framework, which will further simplify the installation process.

Try streamtasks!
GitHub: https://github.com/leopf/streamtasks
Documentation: https://streamtasks.3-klicks.de
X: https://x.com/leopfff
