Wesley Chun (@wescpy)
Generate audio clips with Gemini 2.0 Flash

TL;DR:

Happy holidays! Google recently "gifted" us the new Gemini 2.0 Flash model, expanding on what's available in the original 1.x models. One of the new features is the ability to generate audio clips from text prompts. Good ol' fashioned predictive AI text-to-speech is certainly useful, but this takes it to the next level, giving genAI users "idea-to-speech" capabilities. Learn how to access this new feature from Python today!

Build with Gemini

Introduction

Welcome to the blog focusing on using Google APIs from Python (and sometimes Node.js). Today's post focuses on Gemini, but the blog covers plenty of content beyond Gemini.

This post takes a break from the flow of the previous posts in this series covering the Gemini API to explore one new feature. While some users may be content using ChatGPT or Gemini online or via their apps, the Gemini API brings generative AI abilities to your own apps, so if you're new or just exploring, check out the other posts to learn how to get started and see some of the API's basic capabilities. This post looks at just one feature of the Gemini 2.0 Flash model: text-based audio clip generation.

Prerequisites

New client library improves user experience (UX)

You need a client library to talk to Gemini from code. While several client libraries already exist for Gemini, Google has recently introduced a new one. The new library features an improved UX, so I have to give Google some credit. In the first Gemini post in the series, I lamented that making the API available from two different platforms confuses developers:

  1. Google AI
  2. GCP Vertex AI

Differing client libraries, numerous code samples, documentation in different locations under different web domains, etc., all add up to a less-than-optimal UX. A replacement client library that works across platforms allows users to get started and experiment on Google AI, then "upgrade" to Vertex AI when ready for production, without changing their code.
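To make that concrete, here's a minimal sketch of the cross-platform story with the new library (the project and location values are placeholders you'd swap for your own):

from google import genai

# Google AI: get started and experiment with just an API key...
client = genai.Client(api_key='YOUR_API_KEY')

# ...then "upgrade" to Vertex AI for production by changing only the
# constructor arguments; the rest of your code stays the same.
client = genai.Client(vertexai=True, project='YOUR_GCP_PROJECT',
                      location='us-central1')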

💡 Yes, it's an "ifdef"
If you're like me and like to dig around in code, you may be curious about how the new client library works across both Google AI and Vertex AI. It's not magic, so you'll find if-else blocks where it matters, like an "ifdef" (C/C++). In the new client library, any time you see mldev, think Google AI, and as expected, vertex is Vertex AI.

One example is found in the Live API code while another is in the models code. (NOTE: these links will probably break when a new version is pushed, but I'll update them once the library has an official release.)
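To illustrate the pattern (this is not the library's actual code, just a hypothetical sketch of the branching described above):

# Hypothetical illustration of the "ifdef"-style branch; the real library
# code differs, but the mldev/vertex split conceptually looks like this:
def model_path(client, model: str) -> str:
    if client.vertexai:    # 'vertex' code path: GCP Vertex AI
        return f'publishers/google/models/{model}'
    else:                  # 'mldev' code path: Google AI
        return f'models/{model}'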

At the time of this writing, the new client library is only available in Python and Go (Java and JS/Node.js are next); keep checking the Gemini API SDKs page for the latest releases. The sample app is only available in Python^, but I'm happy to review a Golang PR if you get to an equivalent port before I do.

^ -- Python 3 only; Python 2 support is not available for the Gemini API

Installation and setup

Follow these steps to install the client library and get set up (a quick sanity check follows the list):

  1. Install the new client library: pip install -U google-genai
  2. Create an API key (if you don't already have one)
  3. Save the API key as a string to settings.py as API_KEY = 'YOUR_API_KEY' (and follow the suggestions in the sidebar below to protect it)
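If all three steps succeeded, a quick sanity check like this should run without errors (a minimal sketch, assuming the settings.py from step 3 sits alongside the script):

# Minimal sanity check: import the new library and build a client.
from google import genai
from settings import API_KEY

client = genai.Client(api_key=API_KEY)
print('Client ready:', client is not None)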

⚠️ WARNING: Keep API keys secure
Storing API keys in files (or hard-coding them in source, or even assigning them to environment variables) is for prototyping and learning purposes only. When going to production, put them in a secrets manager such as GCP Secret Manager. Files like settings.py or .env containing API keys are susceptible to leaks. Under no circumstances should you upload files like those to any public (or even private) repo, put sensitive data like that in Terraform config files, add such files to Docker layers, etc., because once your API key leaks, everyone in the world can use it.

If you're new to Google developer tools, API keys are one of the credential types supported by Google APIs, and they're the only type supported by the Maps APIs. Other credential types include OAuth client IDs, mostly used by GWS APIs, and service accounts, mostly used by Google Cloud (GCP) APIs. While this post doesn't cover Google Maps, the Maps team put together a great guide on API key best practices, so check it out!

The app

The sample app sends a prompt of Describe a cat in a few sentences to Gemini and requests an audio clip in response, so the app's functionality is pretty brief: make the request, get the response, and save the audio file.

The code

import asyncio
import contextlib
import wave

from google import genai
from settings import API_KEY

CLIENT = genai.Client(api_key=API_KEY, http_options={'api_version': 'v1alpha'})
MODEL = 'gemini-2.0-flash-exp'
CONFIG = {'generation_config': {'response_modalities': ['AUDIO']}}
PROMPT = 'Describe a cat in a few sentences'
FILENAME = 'whatacatis.wav'

@contextlib.contextmanager
def wave_file(filename, channels=1, rate=24000, sample_width=2):
    'set up .wav file writer'
    with wave.open(filename, 'wb') as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)
        wf.setframerate(rate)
        yield wf

async def request_audio(prompt=PROMPT, filename=FILENAME):
    'request LLM generate audio file given prompt'
    print(f'\n** LLM prompt: "{prompt}"')
    async with CLIENT.aio.live.connect(model=MODEL, config=CONFIG) as session:
        with wave_file(filename) as f:
            await session.send(prompt, end_of_turn=True)
            async for response in session.receive():
                if response.data:
                    f.writeframes(response.data)
    print(f'** Saved audio to "{filename}"')

asyncio.run(request_audio())
[CODE] gem20-audio.py: Audio "Hello World!" sample

 

App components

There are four major chunks to this script:

  1. Imports
  2. Constants
  3. Audio file writer
  4. Core functionality

Imports

From the Python standard library, asyncio is required because the Multimodal Live API is only available asynchronously. The contextlib.contextmanager decorator is needed so we can wrap the audio file writer for use with Python's with statement. The last "stdlib" package used is wave, which processes WAVE audio files. These are followed by the import of Google's new "genAI" client library.

Like in previous code samples in this series, the API key is saved to settings.py. Alternatively, you can save your API key to the GOOGLE_API_KEY environment variable, or use the python-dotenv package, storing the API key in .env to more closely mirror working in a Node.js environment. There's also the GCP Secret Manager as yet another option.
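Here's a minimal sketch of those two alternatives (it assumes you've either exported GOOGLE_API_KEY in your shell or installed python-dotenv and put the key in .env):

import os

# Option 1: read the key from an environment variable
# (export GOOGLE_API_KEY=... in your shell first)
API_KEY = os.environ.get('GOOGLE_API_KEY')

# Option 2: python-dotenv (pip install python-dotenv) loads a .env file
# containing GOOGLE_API_KEY=... into the environment, Node.js-style
from dotenv import load_dotenv
load_dotenv()
API_KEY = os.environ.get('GOOGLE_API_KEY')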

Constants and audio file writer

Constants for the API client, generative large language model (genAI LLM), and model configuration follow. The last pair of constants are the user's prompt and filename to save the generated audio to.

The WAV file writer (wave_file()) just sets up the basic audio parameters; written as a generator and wrapped by the contextmanager decorator, it can be used with the with statement. You'll find nearly-identical code in various samples and notebooks in the Gemini 2.0 cookbook repo.

Core functionality

All of the "real work" takes place in request_audio(). It opens a single session with the Gemini 2.0 Multimodal Live API, kicks things off by opening the WAV file for writing and sending the prompt to the LLM, then continuously awaits server responses, writing out each chunk of audio data received until the stream is exhausted and the session terminates.

This is the minimal code required to do the job. In other examples from Google, you'll find references to server_content, inline_data, and writing out parts, as sketched below. Most of that relates to supporting a multi-turn conversation, but for a single request-response "cycle," less code is less confusing.
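For reference, here's a hedged sketch of that more verbose receive loop, modeled on patterns in the Gemini 2.0 cookbook at the time of this writing (field names may change as the v1alpha API evolves):

# More verbose, multi-turn-friendly receive loop from Google's examples;
# a sketch modeled on cookbook patterns, not guaranteed stable.
async for response in session.receive():
    server_content = response.server_content
    if server_content and server_content.model_turn:
        for part in server_content.model_turn.parts:
            if part.inline_data:              # audio bytes arrive here
                f.writeframes(part.inline_data.data)
    if server_content and server_content.turn_complete:
        break                                 # model finished its turn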

Running the script

Running the script produces an audio file along with the expected output:

$ python3 gem20-audio.py

** LLM prompt: "Describe a cat in a few sentences"
** Saved audio to "whatacatis.wav"

Your mileage may vary, but this is the audio track I got from Gemini:

Summary

Developers are eager to jump into the world of AI/ML, especially GenAI & LLMs, and accessing Google's Gemini models via API is part of that picture. The previous posts in the series got your foot in the door, and today, we explored a new feature available from Gemini 2.0 Flash. Next, we'll continue the journey from the previous post (link below) and show you how to deploy basic genAI web apps to Google Cloud!

If you find errors or have suggestions on content you'd like to see in future posts, leave a comment below, and if your organization needs help integrating Google technologies via their APIs, reach out to me by submitting a request at https://cyberwebconsulting.com. Thanks for reading, and I hope to meet you if I come through your community... you'll find my travel calendar at the bottom of that page as well. Season's greetings and see you next year!

PREV POST: Part 3: Gemini API 102a... Putting together basic GenAI web apps

NEXT POST: Part 5: Deploying basic GenAI web apps to Google Cloud (coming soon)

References

Below are various links relevant to this post:

Code samples

Gemini API (Google AI)

Gemini 2.0 Flash

Other Generative AI and Gemini resources

Other Gemini API content by the author



WESLEY CHUN, MSCS, is a Google Developer Expert (GDE) in Google Cloud (GCP) & Google Workspace (GWS), author of Prentice Hall's bestselling "Core Python" series, co-author of "Python Web Development with Django", and has written for Linux Journal & CNET. He runs CyberWeb specializing in GCP & GWS APIs and serverless platforms, Python & App Engine migrations, and Python training & engineering. Wesley was one of the original Yahoo!Mail engineers and spent 13+ years on various Google product teams, speaking on behalf of their APIs, producing sample apps, codelabs, and videos for serverless migration and GWS developers. He holds degrees in Computer Science, Mathematics, and Music from the University of California, is a Fellow of the Python Software Foundation, and loves to travel to meet developers worldwide at conferences, user group events, and universities. Follow he/him @wescpy & his technical blog. Find this content useful? Contact CyberWeb for professional services or buy him a coffee (or tea)!
