TL;DR:
Happy holidays! Google recently "gifted" us the new Gemini 2.0 Flash model, expanding on what's available in the original 1.x models. One of the new features is the ability to generate text-based audio clips. Sure, good ol' fashioned predictive AI text-to-speech is useful, but generative AI takes it to the next level, giving users "idea-to-speech" capabilities. Learn how to access this new feature from Python today!
Introduction
Welcome to the blog focusing on using Google APIs from Python and sometimes Node.js. Today's post focuses on Gemini, but there's plenty of content beyond Gemini:
- Google Cloud/GCP: serverless and AI/ML
- Google Workspace/GWS: Google Docs (actually Google Drive) and Google Sheets (also Drive!)
- YouTube
- Google Maps/GMP
- Common topics across all APIs like auth & security: API keys and OAuth client IDs
This post takes a break from the flow of the previous posts in this series covering the Gemini API by exploring one new feature. While some users may be content using ChatGPT or Gemini online or via their apps, the Gemini API brings generative AI abilities to your own apps, so if you're new or still exploring, check out the other posts to learn how to get started and to see some of the API's basic capabilities. This post looks at just one feature from the Gemini 2.0 Flash model: text-based audio clip generation.
Prerequisites
New client library improves user experience (UX)
You need a client library to talk to Gemini from code. While several client libraries already exist for Gemini, Google has recently introduced a new one. The new library features an improved UX, so I have to give Google some credit. In the first Gemini post in the series, I lamented that making the API available from two different platforms confuses developers:
Differing client libraries, numerous code samples, documentation in different locations under different web domains, etc., all add to a less-than-optimal UX. A replacement client library with the ability to work across platforms allows users to get started and experiment on Google AI then "upgrade" to Vertex AI, when ready for production, without changing their code.
💡 Yes, it's an "ifdef"
If you're like me and like to dig around in code, you may be curious about how the new client library works across both Google AI and Vertex AI. It's not magic, so you'll find if-else blocks where it matters, like an "ifdef" (C/C++). In the new client library, any time you see `mldev`, think Google AI, and as expected, `vertex` is Vertex AI. One example is found in the Live API code while another is in the models code. (NOTE: these links will probably be wrong when a new version is pushed, but I'll update them once the library has an official release.)
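To make the split concrete, note how the same `Client` class fronts both platforms. Here's a minimal sketch (the `YOUR_*` placeholder values are hypothetical):

```python
from google import genai

# Google AI ("mldev" code path): authenticate with an API key
client = genai.Client(api_key='YOUR_API_KEY')

# Vertex AI ("vertex" code path): authenticate w/GCP project & location
client = genai.Client(vertexai=True, project='YOUR_PROJECT_ID',
                      location='us-central1')
```

It's the same client object either way, so code written against one platform ports to the other with a one-line change.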
At the time of this writing, the new client library is only available in Python and Go (Java and JS/Node.js are next). Keep checking the Gemini API SDKs page for the latest releases. The sample app is only available in Python^, but I'm happy to explore a Golang PR if you get to an equivalent port before I do.
^ -- Python 3 only; Python 2 support is not available for the Gemini API
Installation and setup
Follow these steps to install the client library and get set up:
- Install the new client library: `pip install -U google-genai`
- Create an API key (if you don't already have one)
- Save the API key as a string to `settings.py` as `API_KEY = 'YOUR_API_KEY'` (and follow the suggestions in the sidebar below to protect it)
⚠️ WARNING: Keep API keys secure
Storing API keys in files (or hard-coding them in actual code, or even assigning them to environment variables) is for prototyping and learning purposes only. When going to production, put them in a secrets manager. Files like `settings.py` or `.env` containing API keys are susceptible to leaking. Under no circumstances should you upload files like those to any public or private repo, have sensitive data like that in Terraform config files, add such files to Docker layers, etc., because once your API key leaks, everyone in the world can use it.

If you're new to Google developer tools, API keys are one of the credentials types supported by Google APIs, and they're the only type supported by the Maps APIs. Other credentials types include OAuth client IDs, mostly used by GWS APIs, and service accounts, mostly used by Google Cloud (GCP) APIs. While this post doesn't cover Google Maps, the Maps team put together a great guide on API key best practices, so check it out!
The app
The sample app sends the prompt `Describe a cat in a few sentences` to Gemini and requests an audio clip in response, so the app's functionality is pretty brief: make the request, get the response, and save the audio file.
The code
App components
There are 4 major chunks to this script:
- Imports
- Constants
- Audio file writer
- Core functionality
Imports
From the Python standard library, `asyncio` is required because the Multimodal Live API is only available asynchronously. The `contextlib.contextmanager` decorator is needed so we can wrap and use the audio file writer with Python's `with` statement. The last "stdlib" package used is `wave`, which processes WAVE audio files. This is followed by importing Google's new "genAI" client library.
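If you want to see it all in one place, the imports amount to something like this (a minimal sketch; `settings` is the local module described below):

```python
import asyncio                          # Live API is async-only
from contextlib import contextmanager   # wrap audio writer for 'with'
import wave                             # WAVE audio file processing

from google import genai                # Google's new genAI client library
from settings import API_KEY            # local file holding the API key
```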
Like in previous code samples in this series, the API key is saved to `settings.py`. Alternatively, you can save your API key to the `GOOGLE_API_KEY` environment variable, or use the `python-dotenv` package, storing the API key in `.env` to more closely mirror working in a Node.js environment. There's also GCP Secret Manager as yet another option.
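For example, the environment variable and `python-dotenv` alternatives might look like this (a sketch, assuming you've run `pip install python-dotenv` and created a `.env` file containing `GOOGLE_API_KEY=...`):

```python
import os
from dotenv import load_dotenv

load_dotenv()                           # read .env into the environment
API_KEY = os.getenv('GOOGLE_API_KEY')   # works for either alternative
```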
Constants and audio file writer
Constants for the API client, generative large language model (genAI LLM), and model configuration follow. The last pair of constants are the user's prompt and the filename to save the generated audio to.

The WAV file writer (`wave_file()`) just sets up the basic parameters as a generator and wraps it in a context manager, allowing it to be used with the `with` statement. You'll find nearly-identical code in various samples and Notebooks in the Gemini 2.0 cookbook repo.
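Assembled from that description and the cookbook pattern, the constants and writer might look like the sketch below; the exact model name and config dictionary shape may differ from the official samples as the library evolves:

```python
CLIENT = genai.Client(api_key=API_KEY,
                      http_options={'api_version': 'v1alpha'})
MODEL = 'gemini-2.0-flash-exp'
CONFIG = {'generation_config': {'response_modalities': ['AUDIO']}}
PROMPT = 'Describe a cat in a few sentences'
FILENAME = 'whatacatis.wav'

@contextmanager
def wave_file(fname, channels=1, rate=24000, sample_width=2):
    'set up WAV file writer as a context manager'
    with wave.open(fname, 'wb') as wav:
        wav.setnchannels(channels)       # mono audio
        wav.setsampwidth(sample_width)   # 16-bit samples
        wav.setframerate(rate)           # 24kHz audio output
        yield wav
```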
Core functionality
All of the "real work" takes place in request_audio()
. It's a single session using the Gemini 2.0 Multimodal Live API, kicking it off by opening the WAV file for write and sending the prompt to the LLM. The rest of it continuously waits for a server response, writing the chunks of audio data received until it's been exhausted, terminating the session.
This is minimal code required to do the job. In other examples from Google, you'll find reference to server_content
, inline_data
and writing out parts
. Most of this relates to supporting a multi-turn conversation, but for a single request-response "cycle," less code is less confusing.
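Putting those pieces together, a `request_audio()` along the lines described might look like this sketch; the pre-release `send()`/`receive()` signatures may shift, so check the current SDK docs:

```python
async def request_audio():
    'send prompt to LLM and save generated audio to WAV file'
    async with CLIENT.aio.live.connect(model=MODEL, config=CONFIG) as session:
        with wave_file(FILENAME) as wav:
            print(f'** LLM prompt: "{PROMPT}"')
            await session.send(PROMPT, end_of_turn=True)  # close user turn
            async for response in session.receive():      # stream results
                if response.data:                         # audio chunks only
                    wav.writeframes(response.data)
    print(f'** Saved audio to "{FILENAME}"')

asyncio.run(request_audio())
```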
Running the script
Running the script produces an audio file along with the expected output:
$ python3 gem20-audio.py
** LLM prompt: "Describe a cat in a few sentences"
** Saved audio to "whatacatis.wav"
Your mileage may vary, but this is the audio track I got from Gemini:
Summary
Developers are eager to jump into the world of AI/ML, especially GenAI & LLMs, and accessing Google's Gemini models via API is part of that picture. The previous posts in the series got your foot in the door, and today, we explored a new feature available from Gemini 2.0 Flash. Next, we'll continue the journey from the previous post (link below) and show you how to deploy basic genAI web apps to Google Cloud!
If you find errors or have suggestions on content you'd like to see in future posts, leave a comment below, and if your organization needs help integrating Google technologies via its APIs, reach out to me by submitting a request at https://cyberwebconsulting.com. Thanks for reading, and I hope to meet you if I come through your community... you'll find my travel calendar at the bottom of that page as well. Season's greetings and see you next year!
PREV POST: Part 3: Gemini API 102a... Putting together basic GenAI web apps
NEXT POST: Part 5: Deploying basic GenAI web apps to Google Cloud (coming soon)
References
Below are various links relevant to this post:
Code samples
Gemini API (Google AI)
Gemini 2.0 Flash
Other Generative AI and Gemini resources
- General GenAI docs
- Gemini home page
- Gemini models overview
- Gemini models information (& quotas)
- Gemini whitepaper (PDF)
Other Gemini API content by the author
WESLEY CHUN, MSCS, is a Google Developer Expert (GDE) in Google Cloud (GCP) & Google Workspace (GWS), author of Prentice Hall's bestselling "Core Python" series, co-author of "Python Web Development with Django", and has written for Linux Journal & CNET. He runs CyberWeb specializing in GCP & GWS APIs and serverless platforms, Python & App Engine migrations, and Python training & engineering. Wesley was one of the original Yahoo!Mail engineers and spent 13+ years on various Google product teams, speaking on behalf of their APIs, producing sample apps, codelabs, and videos for serverless migration and GWS developers. He holds degrees in Computer Science, Mathematics, and Music from the University of California, is a Fellow of the Python Software Foundation, and loves to travel to meet developers worldwide at conferences, user group events, and universities. Follow he/him @wescpy & his technical blog. Find this content useful? Contact CyberWeb for professional services or buy him a coffee (or tea)!