Mark Edosa
How to Extract Content From Word Documents Using BeautifulSoup

Introduction

This article shows you how to extract content or specific sections from Word documents using BeautifulSoup and regular expressions. I assume you are comfortable with Python programming and regular expressions.

A Word Document

Behind the scenes, a Word (docx) document is a zip file containing a collection of XML files. Using Python, you can easily view these files. For example:

import zipfile

f = zipfile.ZipFile('test.docx')
print(f.namelist())
# [..., 'word/document.xml', ..., 'word/settings.xml', ...]

I created a Word document with "Introduction to Python Programming" as the title and several paragraphs of "Lorem Ipsum" text as the body.

The main XML file, word/document.xml, contains the text content you would normally see, while the remaining XML files relate to settings, styling, and so on.

Viewing the Content

To view the content of document.xml, you can use the BeautifulSoup() class from the bs4 package.

Here, I defined a function read_doc_to_bsoup() to make this easier.

Note that you must install the lxml package alongside the beautifulsoup4 package (pip install beautifulsoup4 lxml) for BeautifulSoup() to parse XML correctly.

import zipfile

from bs4 import BeautifulSoup


def read_doc_to_bsoup(filename: str) -> BeautifulSoup:
    with zipfile.ZipFile(filename) as file:
        document = file.read('word/document.xml')
    return BeautifulSoup(document, 'xml')


soup = read_doc_to_bsoup('test.docx')
print(soup.prettify())

A sample output is shown below.

<?xml version="1.0" encoding="utf-8"?>
<w:document mc:Ignorable="w14 w15 w16se w16cid w16 w16cex w16 ...

   ...

   <w:r w:rsidRPr="0002422F">
    <w:rPr>
     <w:rFonts w:ascii="Times New Roman" w:cs="Times New Roman" w:hAnsi="Times New Roman"/>
     <w:b/>
     <w:bCs/>
     <w:sz w:val="24"/>
     <w:szCs w:val="24"/>
     <w:lang w:val="en-US"/>
    </w:rPr>
    <w:t>
     Introduction to Python Programming
    </w:t>
   </w:r>
  </w:p>

  ...

  <w:sectPr w:rsidR="00F152B4" w:rsidRPr="0002422F">
   <w:pgSz w:h="16838" w:w="11906"/>
   <w:pgMar w:bottom="1440" w:footer="708" w:gutter="0" w:header="708" w:left="1440" w:right="1440" w:top="1440"/>
   <w:cols w:space="708"/>
   <w:docGrid w:linePitch="360"/>
  </w:sectPr>
 </w:body>
</w:document>


The text we're interested in is usually within a w:t XML tag. Notice that the text "Introduction to Python Programming" is wrapped within a w:t XML tag.

Extracting The Content

To get the first text node, use BeautifulSoup's .find() method like so:

first_text = soup.find('w:t')

print(first_text) # <w:t>Introduction to Python Programming</w:t>

# To extract the text without the tags
print(first_text.get_text()) # Introduction to Python Programming

To get all text tags, use the .find_all() method.

content = soup.find_all('w:t')
print(content)
[<w:t>Introduction to Python Programming</w:t>, 
<w:t xml:space="preserve">Lorem ipsum </w:t>, <w:t>dolor</w:t>, 
<w:t xml:space="preserve"> sit </w:t>, <w:t>amet</w:t>, 
<w:t xml:space="preserve">, </w:t>, <w:t>consectetur</w:t>, 
<w:t xml:space="preserve"> </w:t>, <w:t>adipiscing</w:t>, <w:t xml:space="preserve"> </w:t>, 
<w:t>elit</w:t>, <w:t xml:space="preserve">. </w:t>, 
<w:t>Pellentesque</w:t>, <w:t xml:space="preserve"> </w:t>, <w:t>metus</w:t>, ...]


To extract the text without the tags, you can use a list comprehension. For example

content = [node.get_text() for node in soup.find_all('w:t')]
print(content)

# ['Introduction to Python Programming', 'Lorem ipsum ', 'dolor', ' sit ', 
# 'amet', ', ', 'consectetur', ' ', 'adipiscing', ' ', 'elit', '. ', 
# 'Pellentesque', ' ', 'metus', ' ', 'elit', ', ', 'consectetur', ' id ', 
# 'mollis', ' non, ', 'fringilla', ' in eros. Mauris ', 'aliquam', ' ', 
# 'quis', ' ', 'odio', ' id tempus. ', 'Aliquam', ' ', 'erat', ' ', 
# 'volutpat', '. Donec id ', 'iaculis', ' ipsum. In ', 'tincidunt', ' ', 
# 'massa', ' non ', 'aliquam', ' ', 'dignissim', '. Donec semper ', ...]


Processing The Content

While you have successfully extracted the content, you'll notice that in some cases the text does not form complete sentences. Some text runs are within <w:t xml:space="preserve"> tags while others are within plain <w:t> tags. This fragmentation makes working with Word documents tricky.

The solution to the problem is entirely dependent on the structure of the document. Therefore, you'll have to spend time understanding the structure of your document(s).

In the simplest case, you could choose to combine all text into a single string and then split by period (.).
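Here is a minimal sketch of that simplest case, using made-up plain strings to stand in for the extracted <w:t> texts:

```python
import re

# Hypothetical runs standing in for the extracted <w:t> texts
runs = ['Lorem ipsum ', 'dolor', ' sit ', 'amet', '. ',
        'Pellentesque', ' ', 'metus', '.']

# Join everything, collapse repeated whitespace, then split on periods
text = re.sub(r'\s+', ' ', ''.join(runs)).strip()
sentences = [s.strip() for s in text.split('.') if s.strip()]
print(sentences)  # ['Lorem ipsum dolor sit amet', 'Pellentesque metus']
```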

Example: Extracting Research Objectives from Research Proposals

Thankfully, research proposals have a fairly regular structure across various fields. Just for fun, I wanted to extract the objectives from a series of research proposals.

In cases where all the objectives are on a single line, the function below just works!

import re

def process_simple(filename: str, target: str = 'To determine'):
    """Extract objectives using a target keyword."""
    soup = read_doc_to_bsoup(filename)
    wt = soup.find_all('w:t')

    results = [re.sub(r'\d+', '', t.get_text()).strip()
               for t in wt if t.get_text()]

    results = list(filter(lambda x: x, results))

    return list(filter(lambda x: x.startswith(target), results))
The function performs three steps:
  • Extract the content as you saw above
  • Remove digits and empty strings from the text array
  • Collect the texts that start with "To determine".
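These filtering steps can be sketched on plain strings (a made-up sample, not a real document):

```python
import re

# Made-up sample standing in for the extracted <w:t> texts
texts = ['3 OBJECTIVES', '1 To determine the effect of A.', '  ',
         '2 To determine the effect of B.']

# Strip digits, drop strings that become empty, keep lines
# starting with the target keyword
results = [re.sub(r'\d+', '', t).strip() for t in texts if t]
results = list(filter(lambda x: x, results))
objectives = list(filter(lambda x: x.startswith('To determine'), results))
print(objectives)
# ['To determine the effect of A.', 'To determine the effect of B.']
```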

In a more complicated scenario, the objectives spanned multiple lines, section numbers were different, and so on. The above function failed woefully. Luckily, I found that across objectives spanning multiple lines, each text within the <w:t> tag was part of the preceding <w:t xml:space="preserve"> tag.

The code below walks through the XML soup and combines all <w:t> tags with the previous <w:t xml:space="preserve"> tag. It also works for cases where the objectives are on single lines.

def process_xml_soup(soup: BeautifulSoup):
    # extract content
    wt = soup.find_all('w:t')

    results = []

    # Keep track of cursor positions
    current = 0
    next = current + 1
    stop = len(wt) - 1

    while current < stop:
        cur_elem = wt[current]
        next_elem = wt[next]

        # get the current text
        current_text = cur_elem.get_text().strip() or ''

        # Are we between <w:t xml:space="preserve"> and <w:t>?
        if cur_elem.has_attr('xml:space') and not next_elem.has_attr('xml:space'):

            # Join subsequent <w:t> to the preceding <w:t xml:space="preserve">
            while not next_elem.has_attr('xml:space') and next < stop:
                current_text += f' {next_elem.get_text().strip() or ""}'

                # Advance the cursor until
                # we see another <w:t xml:space="preserve">
                next += 1
                next_elem = wt[next]

        # Remove too many spaces
        current_text = re.sub(r'\s+', ' ', current_text.strip())

        # make objectives title consistent
        # That is, if you find 3.1 objectives make it 3.1 OBJECTIVES
        current_text = re.sub(r'(\d\.\d\.?\d?) objectives',
                              r'\1 OBJECTIVES', current_text, flags=re.I)

        # save the text
        results.append(current_text)

        # Move the cursor forward
        current = next
        next = current + 1

    results = [c.strip() for c in results]

    return ' '.join(list(filter(lambda x: x, results)))

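You can see the xml:space distinction that the walk relies on with a tiny hand-written snippet (a made-up fragment with a placeholder namespace URI, not a real document.xml):

```python
from bs4 import BeautifulSoup

# Made-up XML fragment; the w namespace URI here is a placeholder
xml = ('<w:body xmlns:w="http://example.com/w">'
       '<w:t xml:space="preserve">To determine </w:t>'
       '<w:t>weight loss.</w:t>'
       '</w:body>')

soup = BeautifulSoup(xml, 'xml')
nodes = soup.find_all('w:t')

# Only the first node carries the xml:space attribute
print([n.has_attr('xml:space') for n in nodes])  # [True, False]
print(''.join(n.get_text() for n in nodes))      # To determine weight loss.
```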

Research objectives are usually sandwiched between different sections depending on the field/department. For example, between "AIM" and "HYPOTHESIS".

The following functions complete the extraction.



class NoneObject:
    """Create a NoneObject to avoid return None from find_last()."""
    def start(self):
        return None

    def end(self):
        return None


# find_last() uses NoneObject to maintain API consistency
def find_last(text: str, target: str):
    """Find the last occurrence of a target string."""

    # finditer returns an iterator of match objects (possibly empty)
    result = list(re.finditer(target, text))

    # Avoid None checks in any function that uses find_last()
    # by returning a NoneObject instead
    return result[-1] if len(result) >= 1 else NoneObject()


# Use find last to extract a section
def extract_section(text: str, start: str, end: str, verbose: bool = True):
    """Extract a section of a text."""
    start = find_last(text, start)
    end = find_last(text, end)

    if verbose:
        print(start)
        print(end)

    return text[start.end():end.start()].strip()


# Account for different headers after the list of objectives
def gen_end_regex():
    """Generate a series of regexes to match the section after the objectives."""

    end_titles = ['HYPOTHESES', 'HYPOTHESIS',
                 'RESEARCH QUESTIONS', 'NULL AND ALTERNATE HYPOTHESIS',
                 'CHAPTER TWO LITERATURE REVIEW', 'CHAPTER THREE LITERATURE REVIEW']
    fdgts = r'\d?\.\d?\.?\s*'
    regex = ''

    for i, w in enumerate(end_titles):
        regex += (fdgts + w)
        if i != len(end_titles) - 1:
            regex += '|'

    return regex


def extract_objectives(text: str, verbose: bool = True):
    """Extract the objectives section"""
    start = r'\d?\.?\d?\.?\s*OBJECTIVES[:;]?|\d?\.?\d?\.?\s*Objectives|OBJECTIVES OF THE STUDY'
    end = gen_end_regex()

    return extract_section(text, start, end, verbose)


def separate_objectives(objectives: str):
    """Separate or split the objectives."""
    objectives = re.sub(r'\d\.?', '', objectives).removeprefix('OBJECTIVES ')
    objectives = list(filter(lambda x: x.strip(), objectives.split('To ')))
    objectives = list(
        map(lambda x: 'To ' + x.strip().replace('.', '').strip(), objectives))

    return objectives


def process(filename: str, verbose: bool = True):
    """Run all functions on a Word document."""
    soup = read_doc_to_bsoup(filename)
    document = process_xml_soup(soup)
    objectives = extract_objectives(document, verbose)
    objectives = separate_objectives(objectives)

    return objectives

A few notes on the code above:
  • In research proposals, each section title or heading usually appears in two places: the table of contents and the body of the document. find_last() finds the last occurrence, which is the one in the body.
  • re.finditer() returns an iterator of match objects. Each match object has two methods, .start() and .end(), which return the start and end indices of the match. In the absence of a match, instead of returning None, find_last() returns a NoneObject() that mimics a match object.
  • extract_section() uses find_last() to do its job.
  • extract_objectives() and separate_objectives() functions are fairly obvious, I think :)
  • Everything is combined within the process() function.
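A small demonstration of why NoneObject() is safe inside slices: with no match, .start() and .end() return None, and text[None:None] is simply the whole string. This sketch restates find_last() with a trimmed NoneObject on a toy string:

```python
import re

class NoneObject:
    def start(self): return None
    def end(self): return None

def find_last(text, target):
    matches = list(re.finditer(target, text))
    return matches[-1] if matches else NoneObject()

# Toy document: the heading appears once in the "TOC" and once in the body
doc = 'OBJECTIVES in TOC ... OBJECTIVES To do X. HYPOTHESIS ...'

m = find_last(doc, 'OBJECTIVES')
print(doc[m.end():m.end() + 9])  # ' To do X.'

# No match: slicing with None indices returns the whole string
miss = find_last(doc, 'MISSING')
print(doc[miss.end():miss.start()] == doc)  # True
```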

Testing

Create a Word document objectives.docx containing the following content:

OBJECTIVES
- To determine the effect of exercise on weight loss in elderly people.
- To determine if the effect of exercise on weight loss in elderly people is influenced by factors such as location and gender.
- To determine the effect of dieting on weight loss in elderly people.
- To determine if the effect of dieting on weight loss in elderly people is influenced by factors such as location and gender.
from pprint import pp

pp(process('objectives.docx'))
['To determine the effect of exercise on weight loss in elderly people',      
 'To determine if the effect of exercise on weight loss in elderly people is '
 'influenced by factors such as location and gender',
 'To determine the effect of dieting on weight loss in elderly people',       
 'To determine if the effect of dieting']

Summary

In this article, you saw how to extract text from a Word document. You also saw a sample use case that combined several techniques, including regular expressions, to extract the objectives section of a research proposal.

Thank you for reading.
