*This is a submission for the AssemblyAI Challenge: Really Rad Real-Time and No More Monkey Business.*
## What I Built
I developed EchoSense, a portable hardware device that captures spoken content in settings like meetings, classes, brainstorming sessions, and conferences. It serves a web interface with a live transcription of everything picked up by its microphone. Users can ask questions about the discussion or generate summaries in real time, making it an invaluable tool for live events.
The device operates on a modest 240MHz SoC with 4MB of RAM. It’s lightweight, efficient, and can run on a tiny lithium battery, making it highly portable.
## Tech Used
- Vue, TypeScript, shadcn/ui
- ESP32, Rust, Espressif IoT Development Framework (IDF)
- WebSocket, SendGrid, AssemblyAI
## Demo
Since this is a hardware device, providing a link to a demo isn’t feasible. However, I’ve recorded a video showcasing it in action, along with instructions on how to build one yourself.
And here is the GitHub repository with the source code:
**milewski/echosense-challenge**: Portable device for real-time audio transcription and interactive summaries.

This is the main repository for my submission to the AssemblyAI Challenge. It contains two parts:

- Esp32: The firmware source code for the ESP32 device.
- Frontend: The UI that communicates with the device via WebSocket.

Each subfolder includes instructions for running the project locally.
## Screenshots
## Journey
When powered on, the device automatically connects to the configured Wi-Fi network and requests a temporary token from AssemblyAI, valid for one hour. It establishes a real-time transcription WebSocket connection and generates a local network URL, displayed as a QR code on the OLED screen.
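For illustration, here is a minimal sketch of that token exchange. On the device this goes through esp-idf's HTTP client; the `ureq` crate is used here purely to keep the example host-runnable, and the function name is mine, not the project's:

```rust
use serde::Deserialize;

#[derive(Deserialize)]
struct TokenResponse {
    token: String,
}

/// Exchange the permanent API key for a short-lived realtime token.
fn fetch_temporary_token(api_key: &str) -> Result<String, Box<dyn std::error::Error>> {
    let response: TokenResponse = ureq::post("https://api.assemblyai.com/v2/realtime/token")
        .set("authorization", api_key)
        // Ask for a token that stays valid for one hour.
        .send_json(ureq::json!({ "expires_in": 3600 }))?
        .into_json()?;

    Ok(response.token)
}
```

The returned token is then appended to the realtime WebSocket URL when the transcription connection is opened, so the permanent API key never has to leave the device.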
The QR code directs users to the device’s IP address, where a web server runs on port 80. The server hosts a Vue.js-based interface, with all assets (CSS, JS, images) inlined into a single minified and mangled HTML file.
This optimization ensures minimal memory usage—essential in a resource-constrained environment where every byte counts.
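As a rough sketch of what serving that single file can look like with `esp-idf-svc` in Rust (the asset path, function name, and handler shape below are illustrative assumptions, not the project's actual code):

```rust
use embedded_svc::{http::Method, io::Write};
use esp_idf_svc::http::server::{Configuration, EspHttpServer};

// Assumed path: the Vue build output, inlined into one minified HTML file
// and baked into the firmware binary at compile time.
static INDEX_HTML: &[u8] = include_bytes!("../frontend/dist/index.html");

fn start_web_server() -> anyhow::Result<EspHttpServer<'static>> {
    // `Configuration::default()` listens on port 80.
    let mut server = EspHttpServer::new(&Configuration::default())?;

    server.fn_handler("/", Method::Get, |request| -> anyhow::Result<()> {
        let mut response =
            request.into_response(200, Some("OK"), &[("Content-Type", "text/html")])?;
        response.write_all(INDEX_HTML)?;
        Ok(())
    })?;

    // The server stops when the returned handle is dropped, so keep it alive.
    Ok(server)
}
```

Because the file is embedded with `include_bytes!`, the UI is served straight from flash, with no filesystem lookups at request time; that is one way to get the "every byte counts" behavior described above.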
As the user speaks, audio is streamed in ~500ms chunks, sampled at 16000Hz in PCM16 format, via the WebSocket connection to AssemblyAI. Transcriptions are returned and displayed live to any user who scans the QR code. Simultaneously, the audio is saved locally on the device’s SD card for further use.
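For a sense of scale: at 16,000 samples per second and 2 bytes per PCM16 sample, a 500 ms chunk is 16,000 × 2 × 0.5 = 16,000 bytes. Below is a minimal host-side sketch of that send loop, using `tungstenite` (with TLS enabled) plus `base64` and `serde_json`; the firmware itself would use esp-idf's WebSocket client instead:

```rust
use base64::{engine::general_purpose::STANDARD, Engine as _};
use tungstenite::{connect, Message};

const SAMPLE_RATE: usize = 16_000; // Hz
const BYTES_PER_SAMPLE: usize = 2; // PCM16
const CHUNK_BYTES: usize = SAMPLE_RATE * BYTES_PER_SAMPLE / 2; // ~500 ms = 16,000 bytes

fn stream_audio(token: &str, pcm: &[u8]) -> anyhow::Result<()> {
    let url = format!(
        "wss://api.assemblyai.com/v2/realtime/ws?sample_rate={SAMPLE_RATE}&token={token}"
    );
    let (mut socket, _response) = connect(url.as_str())?;

    for chunk in pcm.chunks(CHUNK_BYTES) {
        // The v2 realtime API accepts PCM16 audio as base64 in an `audio_data` field.
        let message = serde_json::json!({ "audio_data": STANDARD.encode(chunk) });
        socket.send(Message::text(message.to_string()))?;
    }

    // Tell AssemblyAI the session is over so it can flush final transcripts.
    socket.send(Message::text(r#"{"terminate_session": true}"#))?;
    Ok(())
}
```

Transcripts come back as JSON text frames on the same socket, partial results first and then finals, which the device relays to every browser that scanned the QR code.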
The following diagram illustrates this functionality:
## Prompts Qualification
My submission qualifies for 2 prompts:
- Really Rad Real-Time
- No More Monkey Business
## Incomplete Features
The SD card was initially intended to store recordings that would later be attached to emails. However, I realized the files would grow too large: an hour of raw 16 kHz PCM16 audio comes to 16,000 samples/s × 2 bytes × 3,600 s ≈ 115 MB, far beyond typical email attachment limits. Working around that would require a backend to receive the recordings and convert them from raw PCM16 to MP3, and since building and hosting a backend wasn't the main focus of the challenge, I left this feature unfinished.
Currently, there's no way to configure Wi-Fi credentials, API keys, or recording options from the web UI; everything is injected at compile time. Ideally, users would set the device up over a direct Wi-Fi connection between their phone and the device, but that flow would require additional work.
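Concretely, build-time injection in Rust boils down to the `env!` macro, which resolves environment variables while compiling (the variable names below are illustrative):

```rust
// Resolved at compile time, not at runtime: changing any of these
// requires rebuilding and reflashing the firmware.
const WIFI_SSID: &str = env!("WIFI_SSID");
const WIFI_PASSWORD: &str = env!("WIFI_PASSWORD");
const ASSEMBLYAI_API_KEY: &str = env!("ASSEMBLYAI_API_KEY");
```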
I had planned to design and 3D print a case, possibly as a cube, to align with names like MeetingBox or MetaCube. Unfortunately, I didn’t have time to complete this, so the prototype was built and presented on a breadboard.
If anyone has any questions, feel free to ask below or open an issue on GitHub; I'll be happy to help!