Ogbotemi Ogungbamila

Posted on Nov 25, 2024 • Edited on Nov 26, 2024

Streaming voice to SQL with AssemblyAI: Execute the generated SQL, use Ollama, RAG templates and vector embeddings

#devchallenge #assemblyaichallenge #ai #api

This is a submission for the AssemblyAI Challenge: Sophisticated Speech-to-Text.

What I Built

I built an online Voice to SQL environment that converts users' recorded speech into SQL statements, with the following features:

Voice to SQL

Converts user speech to text, preferably SQL.
Optionally streams the in-progress recording to the server to display the equivalent SQL statement.
Replaces words in the converted SQL statement with the glyphs they stand for (e.g. 'less than' becomes '<'), using a customizable and extensible widget.
Shows an audio visualizer during recording, with options to pause and play.
Lets power users specify the recording bitrate for optimum results.

SQL statement execution

Provides an interface for switching between MySQL and PostgreSQL databases on the fly
Displays error details for every failed database interaction

Generation of vector embeddings, using a RAG widget and PostgreSQL databases (Timescale, Neon.tech)

Provides a widget for obtaining embeddings of custom prompts, text, or messages from Ollama models running locally
Provides SQL templates (SELECT and INSERT) for applying generated embeddings, along with their metadata, to PostgreSQL databases that support them
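
To give a feel for what such templates could look like, here is a minimal Python sketch. It assumes the pgvector extension and psycopg-style %s placeholders; the documents table, its columns, and the helper names are hypothetical and not taken from the app itself.

```python
import json

# Hypothetical table (assumption, not from the app):
#   documents(content text, metadata jsonb, embedding vector)
INSERT_TEMPLATE = (
    "INSERT INTO documents (content, metadata, embedding) "
    "VALUES (%s, %s::jsonb, %s::vector)"
)

# "<=>" is pgvector's cosine-distance operator; smaller means more similar.
SELECT_TEMPLATE = (
    "SELECT content, metadata FROM documents "
    "ORDER BY embedding <=> %s::vector LIMIT %s"
)


def to_vector_literal(embedding):
    """Format a sequence of numbers as a pgvector text literal like '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def insert_params(content, metadata, embedding):
    """Parameters for INSERT_TEMPLATE: text, JSON metadata, vector literal."""
    return (content, json.dumps(metadata), to_vector_literal(embedding))


def select_params(query_embedding, limit=5):
    """Parameters for SELECT_TEMPLATE: query vector literal and row limit."""
    return (to_vector_literal(query_embedding), limit)
```

Passing the embedding as a text literal and casting it with ::vector keeps the sketch independent of any driver-specific vector adapter.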

Downloads

{query, result} object from executed queries
Recorded audio.
Option to upload {query, result} object to Pinata

Demo

Node.js server on Vercel

https://voice-sql-ai.vercel.app/

Python server for POST requests

https://voice-ai-sql-python.vercel.app/

https://voice-ai-sql-python.vercel.app/upload with {recording: <base64data>} in the POST request body
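
As a rough illustration of that request body, the Python sketch below builds the {recording: <base64data>} JSON and decodes it back into bytes on the receiving side. Only the recording key comes from the description above; the function names are hypothetical, and stripping a "data:...;base64," prefix is an assumption about what a browser might send.

```python
import base64
import json


def build_upload_body(audio_bytes):
    """Encode raw audio bytes as the JSON body {"recording": "<base64data>"}."""
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return json.dumps({"recording": encoded})


def read_upload_body(body):
    """Decode the JSON body back into raw audio bytes."""
    value = json.loads(body)["recording"]
    # Assumption: the client may send a data URL such as
    # "data:audio/webm;base64,AAAA..."; keep only the part after the comma.
    if value.startswith("data:") and "," in value:
        value = value.split(",", 1)[1]
    return base64.b64decode(value)
```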

Psst: GET requests to the Python server still serve the page I copied from https://developer.mozilla.org/en-US/docs/Web/API/MediaStream_Recording_API/Using_the_MediaStream_Recording_API. It was a great, simple demo which I used to learn how to handle base64-encoded and binary data in Python, as well as how to POST it to AssemblyAI's API.

Screenshots

Enabled dark mode via browser devtools

Expanded view of widget for Voice-to-SQL

View of the other widgets for creating and using vector embeddings

Journey

Falling back to Python

Curiously enough, the Python code examples in AssemblyAI's docs worked, while the JavaScript ones either crashed with "Not allowed" errors (via the Node.js SDK) or returned {error: null} as a response (via the API).

AssemblyAI's Speech-to-Text API

The API was very straightforward and, for my use case, more flexible than the Python SDK. I used the following workflow:

Upload the binary data decoded from the base64 string to AssemblyAI to obtain a URL.
Use the received URL, along with my API key, to request a transcription of the audio and receive the resulting JSON.
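
The two steps above can be sketched in Python roughly as follows. This is a minimal illustration using only the standard library against AssemblyAI's v2 upload and transcript endpoints, not the code running on the server; the function names, the injectable send parameter, and the polling interval are assumptions made for the example.

```python
import json
import time
import urllib.request

API_BASE = "https://api.assemblyai.com/v2"


def http_json(method, url, api_key, data=None, content_type=None):
    """Send one request with the API key and return the parsed JSON reply."""
    headers = {"authorization": api_key}
    if content_type:
        headers["content-type"] = content_type
    request = urllib.request.Request(url, data=data, headers=headers, method=method)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def transcribe(audio_bytes, api_key, send=http_json, poll_seconds=3.0):
    """Upload audio, request a transcript, and poll until it is ready."""
    # Step 1: upload the raw bytes; the reply carries an "upload_url".
    upload = send("POST", f"{API_BASE}/upload", api_key,
                  audio_bytes, "application/octet-stream")
    # Step 2: ask for a transcript of the uploaded audio.
    body = json.dumps({"audio_url": upload["upload_url"]}).encode("utf-8")
    job = send("POST", f"{API_BASE}/transcript", api_key, body, "application/json")
    # Transcription is asynchronous, so poll the job until it finishes.
    while job["status"] not in ("completed", "error"):
        time.sleep(poll_seconds)
        job = send("GET", f"{API_BASE}/transcript/{job['id']}", api_key)
    if job["status"] == "error":
        raise RuntimeError(job.get("error") or "transcription failed")
    return job["text"]
```

Making send injectable is only a convenience here: it lets the flow be exercised with a stand-in transport without spending any credits.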

Usage

I used AssemblyAI's Speech-to-Text to convert users' recorded speech to SQL statements, which are then refined further as follows:

Words in the received text are replaced with the glyphs they represent in SQL.
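
A minimal Python sketch of this kind of replacement is shown below. The phrase table and the function name are hypothetical; the actual widget lets users customize and extend its own mapping.

```python
import re

# Hypothetical phrase-to-glyph table (keys must be lowercase).
GLYPHS = {
    "less than or equal to": "<=",
    "greater than or equal to": ">=",
    "not equal to": "<>",
    "less than": "<",
    "greater than": ">",
    "equals": "=",
    "asterisk": "*",
}


def replace_glyphs(text, glyphs=GLYPHS):
    """Replace spoken phrases in the transcript with their SQL glyphs."""
    # Try longer phrases first so "less than or equal to" is not
    # cut short by the shorter "less than".
    phrases = sorted(glyphs, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(p) for p in phrases) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: glyphs[m.group(1).lower()], text)
```

For example, replace_glyphs("select asterisk from users where age less than 30") would yield "select * from users where age < 30".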

This submission doesn't quite qualify for the additional prompts since I didn't use them, but I did implement something similar to the other two in the web app I created.

Issues that thwarted the work

Credits issue with real-time and LeMUR

I was not allowed to use the other tools, LeMUR and real-time streaming, with the free credits: I was advised to buy credits despite having over $40 worth of free credits. That is why I implemented something similar to them, along with speech-to-text, in this web app: https://voice-sql-ai.vercel.app/

Real-time streaming

I was going to implement voice to SQL as a real-time stream, but the credits issue got in the way, so I got creative and implemented it in code on top of Speech-to-Text instead.

Final Thoughts

This was a fun project that broadened my knowledge of using Python as a server alongside Node.js. It also pushed me to add more functionality to the SQL playground I had built.
Finally, it made me explore how to get creative with handling and sending binary media data in browsers.

Thank you for reading!
