This is a submission for the AssemblyAI Challenge: Sophisticated Speech-to-Text.
What I Built
I built an online Voice to SQL environment that converts users' recorded speech into SQL statements, with the following features:
Voice to SQL
- Converts user speech to text, preferably SQL.
- Optionally streams the voice currently being recorded to the server to display the equivalent SQL statement.
- Applies intelligence by replacing words in the converted SQL statements with the glyphs they stand for, e.g. 'less than' gets replaced with '<', in a customizable and extensible widget.
- Audio visualizer during recording, with options to pause and play.
- For geeks: users can specify the bitrate for optimum results.
SQL statements execution
- Provides an interface for switching between MySQL and PostgreSQL databases on the fly
- Displays details of errors for every database interaction gone wrong
Generation of Vector embeddings, using a RAG widget and PostgreSQL databases: Timescale, Neon.tech
- Provides a widget for obtaining embeddings for custom prompts or text, messages from Ollama models running locally
- Provides SQL templates, SELECT and INSERT, for applying generated embeddings along with their metadata on PostgreSQL databases that support them
Downloads
- {query, result} object from executed queries
- Recorded audio.
- Option to upload {query, result} object to Pinata
Demo
Node.js server on Vercel
https://voice-sql-ai.vercel.app/
Python server for POST requests
https://voice-ai-sql-python-c2v573s4k.vercel.app/
https://voice-ai-sql-python-c2v573s4k.vercel.app/upload with {recording: <base64data>} in the POST request body
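To illustrate what an endpoint receiving that body has to do first, here is a minimal sketch of turning the JSON payload back into raw audio bytes. The function name decode_recording is hypothetical, and the handling of a data-URL prefix is an assumption about what the browser might send, not something taken from the project's code.

```python
import base64
import json


def decode_recording(body: str) -> bytes:
    """Extract raw audio bytes from a JSON body like {"recording": "<base64data>"}."""
    payload = json.loads(body)
    recording = payload["recording"]
    # Assumption: the browser may send a data URL such as
    # "data:audio/webm;base64,AAAA..."; keep only the part after the comma.
    if recording.startswith("data:") and "," in recording:
        recording = recording.split(",", 1)[1]
    return base64.b64decode(recording)
```

The returned bytes are what would then be forwarded to the transcription service.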
Psst: GET requests to the Python server still serve the page I copied from https://developer.mozilla.org/en-US/docs/Web/API/MediaStream_Recording_API/Using_the_MediaStream_Recording_API. It was a great, simple demo that I used to learn how to handle base64-encoded and binary data in Python, as well as how to POST it to AssemblyAI's API.
Screenshots
Enabled dark mode via browser devtools
Expanded view of widget for Voice-to-SQL
View of the other widgets for creating and using vector embeddings
Journey
Falling back to Python
Curiously enough, the Python code examples in AssemblyAI's docs worked, while the JavaScript ones in Node.js either crashed with "Not allowed" errors or returned {error: null} as a response, via the Node.js SDK and the API respectively.
AssemblyAI's Speech-to-Text API
The API was very straightforward and more flexible than the Python SDK for my use case, with the following workflow:
- Upload the binary data decoded from the base64 string to AssemblyAI to obtain a URL.
- Use the received URL along with my API key to request transcription of the audio to text, and receive the resulting JSON.
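The two steps above can be sketched roughly as follows, using only the Python standard library against AssemblyAI's v2 REST endpoints. This is not the project's actual server code: the helper names are hypothetical, and the polling loop and its interval are my assumption about how the finished transcript gets retrieved.

```python
import json
import time
import urllib.request

API_BASE = "https://api.assemblyai.com/v2"


def build_upload_request(api_key: str, audio: bytes) -> urllib.request.Request:
    """Step 1: POST the raw audio bytes; the JSON reply carries an upload_url."""
    return urllib.request.Request(
        f"{API_BASE}/upload",
        data=audio,
        headers={"authorization": api_key, "content-type": "application/octet-stream"},
        method="POST",
    )


def build_transcript_request(api_key: str, audio_url: str) -> urllib.request.Request:
    """Step 2: ask for a transcript of the audio behind audio_url."""
    body = json.dumps({"audio_url": audio_url}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/transcript",
        data=body,
        headers={"authorization": api_key, "content-type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send a request and parse the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def transcribe(api_key: str, audio: bytes, poll_seconds: float = 3.0) -> dict:
    """Upload, request a transcript, then poll until it completes or fails."""
    upload_url = send(build_upload_request(api_key, audio))["upload_url"]
    job = send(build_transcript_request(api_key, upload_url))
    while job["status"] not in ("completed", "error"):
        time.sleep(poll_seconds)  # assumed interval, tune as needed
        status_request = urllib.request.Request(
            f"{API_BASE}/transcript/{job['id']}",
            headers={"authorization": api_key},
        )
        job = send(status_request)
    return job
```

Calling transcribe(api_key, audio_bytes) would return the final JSON; on success its "text" field holds the transcript that is then turned into SQL.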
Usage
I used AssemblyAI's Speech-to-Text to convert users' recorded speech to SQL statements, which are then refined further as follows:
- Words in the received text are replaced with the glyphs they represent in SQL.
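A minimal sketch of that replacement step is shown below. The phrase table is hypothetical (the widget's real mapping is customizable and is not listed in this post), and apply_glyphs is an illustrative name rather than the app's own function.

```python
import re

# Hypothetical phrase-to-glyph table; the real widget lets users edit and extend it.
GLYPHS = {
    "less than or equal to": "<=",
    "greater than or equal to": ">=",
    "not equal to": "<>",
    "less than": "<",
    "greater than": ">",
    "equals": "=",
    "asterisk": "*",
    "semicolon": ";",
}


def apply_glyphs(text: str, glyphs: dict = GLYPHS) -> str:
    """Replace spoken phrases in a transcript with their SQL glyphs."""
    # Longest phrases first so "less than or equal to" wins over "less than".
    phrases = sorted(glyphs, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(p) for p in phrases) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: glyphs[m.group(1).lower()], text)


print(apply_glyphs("select asterisk from users where age less than 30"))
# select * from users where age < 30
```

Sorting the phrases by length before building the pattern is the one design choice that matters here: without it, a shorter phrase could match first and leave "or equal to" stranded in the statement.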
This submission doesn't quite qualify for the additional prompts since I didn't use them, but I did implement something similar to the other two in the web app I created.
Issues that thwarted the work
Credits issue with real-time and LeMUR
I was not allowed to use the other tools, LeMUR and real-time streaming, with the free credits: I was advised to buy credits despite having over $40 worth of free credits. That is why I implemented something similar to them, along with speech-to-text, in this web app: https://voice-sql-ai.vercel.app/
Real-time streaming
I was going to implement voice to SQL as a stream, but the credits issue got in the way, so I got creative and implemented it in code on top of Speech-to-Text instead.
Final Thoughts
This was a fun project that broadened my knowledge of using Python as a server alongside Node.js. It also pushed me to add more functionality to the SQL playground I had built.
Finally, it made me explore creative ways of handling and sending binary media data in browsers.
Thank you for reading!