Transforming PDFs into Audio

In this guide, I will walk you through converting PDF content into real-time audio playback using a combination of Python libraries. Each chunk of the document is summarized by a local language model and then read aloud with text-to-speech, which is particularly useful for anyone who prefers to consume information audibly or needs an accessible format. The code also handles user interruptions gracefully.

Part 1: Importing the Necessary Libraries

To begin, we need to import several Python libraries that will assist in loading PDFs, processing text, generating audio, and managing user interactions.

from gtts import gTTS
from io import BytesIO
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.chat_models import ChatOllama
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from pydub import AudioSegment
from pydub.playback import play
import signal
import threading

Overview of the Libraries:

  • gTTS: Google Text-to-Speech for converting text to audio.
  • BytesIO: In-memory binary stream for handling audio data.
  • LangChain: Tools for splitting text and processing it using language models.
  • PyPDFLoader: Specialized loader for extracting text from PDFs.
  • pydub: For audio manipulation and playback (it relies on ffmpeg or libav to decode the MP3 data produced by gTTS).
  • signal and threading: To manage user interruptions during playback.

Part 2: Handling User Interruption

We want the user to be able to interrupt audio playback gracefully. To achieve this, we set up a signal handler that listens for Ctrl+C (SIGINT) and sets a flag that the playback loop checks.

# Flag to control the loop
stop_playback = False

def signal_handler(sig, frame):
    global stop_playback
    print("\nGracefully stopping playback...")
    stop_playback = True

# Assign the signal handler to SIGINT (Ctrl+C)
signal.signal(signal.SIGINT, signal_handler)

In addition to handling Ctrl+C, we also create a separate thread that listens for the Enter key press, providing an alternative way to stop playback.

# Function to listen for an Enter key press in a separate thread
def listen_for_stop():
    global stop_playback
    input("Press Enter to stop playback...\n")
    print("\nStopping playback...")
    stop_playback = True

# Start the listener thread
listener_thread = threading.Thread(target=listen_for_stop)
listener_thread.daemon = True
listener_thread.start()

Part 3: Loading and Splitting the PDF Document

Next, we load the PDF document using PyPDFLoader and split it into manageable chunks using RecursiveCharacterTextSplitter.

# Load and split the PDF document
loader = PyPDFLoader("/path/to/your/document.pdf")
pages = loader.load_and_split()

# Split text into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2048, chunk_overlap=100)
all_splits = text_splitter.split_documents(pages)

This approach allows us to process the document piece by piece, making it easier to generate and play audio incrementally.
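
Before moving on, it can be worth sanity-checking the splits. Here is a minimal sketch, assuming the loader and splitter above have already run:

# Quick sanity check: how many chunks were produced, and what does the first look like?
print(f"Loaded {len(pages)} pages and produced {len(all_splits)} chunks.")
print(all_splits[0].page_content[:200])  # preview the first 200 characters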

Part 4: Generating Text Summaries

We use the ChatOllama model to generate summaries of the text chunks. The model is initialized with specific parameters, and a prompt template is created to guide the model's responses.

# Initialize the ChatOllama model
llm = ChatOllama(model="llama3:instruct", temperature=0.6)

# Create a prompt template
prompt = ChatPromptTemplate.from_template("Summarize the findings of: {page_content}")

# Define the chain
chain = prompt | llm | StrOutputParser()
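
Before wiring the chain into playback, it helps to try it on a single chunk. A minimal sketch, assuming the splits from Part 3 are available and an Ollama server is running locally with the llama3:instruct model pulled:

# Test the summarization chain on the first chunk
sample_summary = chain.invoke({"page_content": all_splits[0].page_content})
print(sample_summary)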

Text Generation Function

We define a generator function that summarizes a chunk and yields the summary one sentence at a time, which will be used later for incremental audio playback.

def generate_text_chunks(page_content):
    try:
        text = chain.invoke({"page_content": page_content})
        sentences = text.split('. ')
        for sentence in sentences:
            yield sentence + '.'
    except Exception as e:
        print(f"Error generating text: {e}")
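
Iterating this generator streams the summary sentence by sentence, which is exactly what the playback loop below relies on. For example (assuming all_splits from Part 3):

# Stream the summary of the first chunk sentence by sentence
for sentence in generate_text_chunks(all_splits[0].page_content):
    print(sentence)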

Part 5: Converting Text to Speech and Playing Audio

Once we have the text chunks, the next step is converting these chunks into speech and playing them.

# Function to play audio from a text chunk
def play_audio_chunk(text_chunk):
    if not text_chunk.strip():
        return
    try:
        tts = gTTS(text=text_chunk, lang='en')
        with BytesIO() as audio_fp:
            tts.write_to_fp(audio_fp)
            audio_fp.seek(0)
            audio_segment = AudioSegment.from_file(audio_fp, format="mp3")
            play(audio_segment)
    except Exception as e:
        print(f"Error generating or playing audio: {e}")

This function uses Google Text-to-Speech (gTTS) to generate audio from text and pydub to play the audio in real-time.
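
As a quick standalone test (assuming ffmpeg and a working audio output device are available), you can play a short phrase before running the full pipeline:

# Standalone test of text-to-speech playback
play_audio_chunk("This is a quick test of the text-to-speech pipeline.")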

Part 6: Real-Time Text Generation and Audio Playback

Finally, we combine everything into a single function that handles real-time text generation and audio playback. It also respects user interruptions: because the stop flag is checked before each sentence, playback stops once the current sentence finishes rather than cutting off mid-audio.

# Function to generate and play text in real-time
def generate_and_play():
    global stop_playback
    for split in all_splits:
        for chunk in generate_text_chunks(split.page_content):
            if stop_playback:
                print("Playback stopped by user.")
                return
            print(".", end="", flush=True)  # Visual feedback
            play_audio_chunk(chunk)
    print("\nPlayback finished.")

Starting the Process

To start the generation and playback process, simply call the generate_and_play() function.

generate_and_play()

Conclusion

With this approach, you can convert lengthy PDF documents into summarized audio that is played back in real time. This method is particularly useful for those who prefer auditory learning or need accessible formats for consuming information. The integration of text-to-speech with user-interruption handling makes this solution robust and user-friendly.

By following the steps outlined in this guide, you can develop a custom tool that turns text into audio, providing an alternative way to engage with content.
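
If you would rather keep the narration for later instead of playing it live, a small variation of the playback loop (a sketch, not part of the script below; the output filename is just an example) can export everything to a single MP3 with pydub:

# Hypothetical variant: accumulate the narration and export it to one MP3 file
def generate_and_save(output_path="summary.mp3"):
    combined = AudioSegment.empty()
    for split in all_splits:
        for chunk in generate_text_chunks(split.page_content):
            if not chunk.strip():
                continue
            tts = gTTS(text=chunk, lang='en')
            with BytesIO() as audio_fp:
                tts.write_to_fp(audio_fp)
                audio_fp.seek(0)
                combined += AudioSegment.from_file(audio_fp, format="mp3")
    combined.export(output_path, format="mp3")
    print(f"Saved narration to {output_path}")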

Final Code:

from gtts import gTTS
from io import BytesIO
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.chat_models import ChatOllama
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from pydub import AudioSegment
from pydub.playback import play
import signal
import threading

# Flag to control the loop
stop_playback = False

def signal_handler(sig, frame):
    global stop_playback
    print("\nGracefully stopping playback...")
    stop_playback = True

# Assign the signal handler to SIGINT (Ctrl+C)
signal.signal(signal.SIGINT, signal_handler)

# Function to listen for an Enter key press in a separate thread
def listen_for_stop():
    global stop_playback
    input("Press Enter to stop playback...\n")
    print("\nStopping playback...")
    stop_playback = True

# Start the listener thread
listener_thread = threading.Thread(target=listen_for_stop)
listener_thread.daemon = True
listener_thread.start()

# Load and split the PDF document
loader = PyPDFLoader("/home/roomal/Downloads/Stephen Mulhall/1 - Mulhall, Stephen - Heidegger and Being and Time - Scepticism, Cognition And Agency.pdf")
pages = loader.load_and_split()

# Split text into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2048, chunk_overlap=100)
all_splits = text_splitter.split_documents(pages)

# Initialize the ChatOllama model
llm = ChatOllama(model="llama3:instruct", temperature=0.6)

# Create a prompt template
prompt = ChatPromptTemplate.from_template("Summarize the findings of: {page_content}")

# Define the chain
chain = prompt | llm | StrOutputParser()

# Function to generate text in chunks
def generate_text_chunks(page_content):
    try:
        text = chain.invoke({"page_content": page_content})
        sentences = text.split('. ')
        for sentence in sentences:
            yield sentence + '.'
    except Exception as e:
        print(f"Error generating text: {e}")

# Function to play audio from a text chunk
def play_audio_chunk(text_chunk):
    if not text_chunk.strip():
        return
    try:
        tts = gTTS(text=text_chunk, lang='en')
        with BytesIO() as audio_fp:
            tts.write_to_fp(audio_fp)
            audio_fp.seek(0)
            audio_segment = AudioSegment.from_file(audio_fp, format="mp3")
            play(audio_segment)
    except Exception as e:
        print(f"Error generating or playing audio: {e}")

# Function to generate and play text in real-time
def generate_and_play():
    global stop_playback
    for split in all_splits:
        for chunk in generate_text_chunks(split.page_content):
            if stop_playback:
                print("Playback stopped by user.")
                return
            print(".", end="", flush=True)  # Visual feedback
            play_audio_chunk(chunk)
    print("\nPlayback finished.")

# Start the generation and playback process
generate_and_play()

Until next time.

Best,

Roomal
