Helping pandoc generate a correct table of contents from HTML input

#pandoc #python

TL;DR:

pandoc expects chapter headers to be placed directly inside the body node. No <div> wrappers allowed.
pandoc takes the book title from the last <title> tag it sees (e.g. the one in the last file on the command line).

It's been a long time since I last had to convert an HTML ebook to EPUB. Last time I did, I couldn't make calibre put the chapters in the correct order ¹ and got so angry that I tried to hand-craft the file with bash and a bunch of regexen. It was certainly an interesting experiment and I've learned much about EPUB internals in the process. I've also learned that while it's OK to parse a limited, known set of HTML with regex, it's much more convenient to use an actual HTML parser.

Since then, I fell in love with pandoc and have been using it extensively for various projects ². So when I recently wanted to read the F# Programming Wikibook on my Kindle, I knew I would use pandoc for conversion.

My enthusiasm dropped somewhat when I examined the resulting file and found out that the generated table of contents consisted of a single entry, named after the last chapter of the book. And this was not just a problem of broken navigation.

An EPUB file is essentially a zip archive with chapters stored in separate HTML files. Thanks to that, ebook readers can open them one by one, which means quicker load times and a lower memory footprint. Because pandoc didn't know how to split the book into chapters, it put them all into a single file, so my reader had to slurp and format the entire text before displaying anything - grinding it to a halt for over a minute each time the book was opened.
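Because an EPUB is just a zip archive, a few lines of Python are enough to peek inside one and see whether the content was split into several files. Here's a minimal sketch using only the standard library. The function name list_epub_documents is made up for illustration, and filtering on the .xhtml and .html extensions is an assumption about the usual layout, not something pandoc guarantees.

```python
import zipfile


def list_epub_documents(epub_path):
    """Return (name, uncompressed size) pairs for the HTML files in an EPUB."""
    with zipfile.ZipFile(epub_path) as archive:
        return [
            (info.filename, info.file_size)
            for info in archive.infolist()
            # Assumption: content documents end in .xhtml or .html.
            if info.filename.lower().endswith(('.xhtml', '.html'))
        ]


# Example usage (the path is hypothetical):
# for name, size in list_epub_documents('the_book.epub'):
#     print(f'{size:>10}  {name}')
```

On a book that was not split into chapters, you would expect one entry to be far larger than the rest.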

For HTML input, pandoc is supposed to generate the TOC automatically from the <h1>, <h2>, ... <h6> markup. After some experimentation, it turned out that pandoc expects chapter headers to be placed directly inside the body node. While this makes sense for documents written for the sole purpose of being packaged as EPUB, this is rarely the case with HTML pages on the Internet, where you will often find the actual content wrapped in several layers of divs (or tables, if you are unfortunate enough to roam such dangerous, god-forgotten places).
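Before converting anything, it can help to check whether a page has this problem at all. The sketch below uses the standard library's html.parser to report every heading whose parent is not body. The names HeadingPathFinder and nested_headings are hypothetical, and the check is rough: it assumes reasonably well-formed markup, so pages with unclosed tags may produce false positives.

```python
from html.parser import HTMLParser

HEADINGS = {'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}
# Void elements never get a closing tag, so they are not pushed on the stack.
VOID = {'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
        'link', 'meta', 'source', 'track', 'wbr'}


class HeadingPathFinder(HTMLParser):
    """Record each heading together with the tag of its parent element."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open elements
        self.headings = []  # (heading tag, parent tag) pairs

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            parent = self.stack[-1] if self.stack else None
            self.headings.append((tag, parent))
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop up to and including the matching open tag, if there is one.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def nested_headings(html):
    """Return the (heading, parent) pairs that are not direct children of body."""
    finder = HeadingPathFinder()
    finder.feed(html)
    finder.close()
    return [(tag, parent) for tag, parent in finder.headings if parent != 'body']
```

An empty result suggests the headings already sit where pandoc wants them. Anything else is a list of headings that need to be moved up the tree.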

Here's a test case. Say, we have a book titled The Book, which consists of four chapters:

seq 1 4 | while read idx; do
    cat > "ch$idx.html" <<EOF
<html>
    <head>
        <title>Chapter $idx - The Book</title>
    </head>
    <body>
        <div>
            <div>
                <div id="content">
                    <h1>Chapter $idx</h1>
                    <p>Lorem ipsum, dolor sit amet.</p>
                </div>
            </div>
        </div>
    </body>
</html>    
EOF

done

When we feed them to pandoc, we get a broken TOC with a title page and a single chapter, spanning all the input files:

pandoc -o the_book.epub ch*.html

To fix the table of contents, we have to help pandoc a little and move the <h1>s up the tree until they are children of body. Here's how we can do this with Python and Beautiful Soup:

fix.py:

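import bs4

# Input files, in reading order.
filenames = [
    'ch1.html', 'ch2.html', 'ch3.html', 'ch4.html'
]

for filename in filenames:
    with open(filename, 'r') as f:
        soup = bs4.BeautifulSoup(f, 'lxml')  # the 'lxml' parser needs the lxml package

    # Start at the node holding the actual content...
    current = soup.find(id='content')
    # ...and dissolve every wrapper between it and <body>.
    while current.name != 'body':
        parent = current.parent
        current.unwrap()  # replace the node with its children
        current = parent

    # ch1.html -> ch1-flat.html
    out_filename = filename.replace('.', '-flat.')
    with open(out_filename, 'w') as f:
        f.write(soup.prettify())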

For each file specified, the script builds the DOM and finds the node with the actual content - in this case, the one with the id content (if your div doesn't have an id assigned but has a specific class, you can get it with soup.find(class_='...') instead). The call to the unwrap method replaces the node with its children, and we move up the tree to the parent of the deleted node. This is repeated until the body node is reached. Finally, the DOM is saved to a file with -flat appended to its name.

python fix.py

pandoc \
    -o the_book.epub \
    ch1-flat.html \
    ch2-flat.html \
    ch3-flat.html \
    ch4-flat.html

That's better. The chapters were detected and split correctly, but the top two entries, which are supposed to be the title of the book, are incorrectly captioned Chapter 4 - The Book. You might have noticed that this is the text in the <title> of the last file on the command line.

pandoc takes the book title from the contents of the <title> node. When it is invoked with multiple input files and there is more than one <title> tag, pandoc uses the last one seen. But for ebooks spanning several HTML documents, the <title>s usually denote the chapter names and shouldn't have any impact on the title of the book.
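To see which title would win for a given set of files, you can collect every <title> in command-line order and look at the last one. This is a sketch built on the behaviour observed above, not on pandoc's own code. TitleCollector and collect_titles are hypothetical names, and only the standard library is used.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collect the text of every <title> element in a document."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True
            self.titles.append('')

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles[-1] += data


def collect_titles(documents):
    """Return all titles found in the given HTML strings, in input order."""
    titles = []
    for html in documents:
        collector = TitleCollector()
        collector.feed(html)
        collector.close()
        titles.extend(title.strip() for title in collector.titles)
    return titles


# Example usage with the test files (read in the order passed to pandoc):
# docs = [open(name).read() for name in ['ch1.html', 'ch2.html']]
# print(collect_titles(docs)[-1])  # the title the book would likely get
```

If the list holds more than one entry, the book will most likely be named after the last of them.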

To fix that, we have to dive into the HTML once again ³ and make sure that there is only one <title> tag across our input files, and that it is set to the desired book title:

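actual_title = 'The Book'

# The rest goes inside the loop in fix.py, before the file is written out.
title_node = soup.find('title')
if filename == filenames[0]:
    # The first file carries the title of the whole book.
    title_node.string = actual_title
else:
    # Every other file loses its <title>, so nothing can override the first one.
    title_node.extract()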

Finally, the table of contents looks as expected:

pandoc \
    -o the_book.epub \
    ch1-flat.html \
    ch2-flat.html \
    ch3-flat.html \
    ch4-flat.html

¹ This was supposedly controlled by the breadth-first order toggle in Preferences → Plugins → HTML to ZIP plugin but setting it on seemed to have no effect at all.

² My thesis being the obvious one, but also static website generation - both this blog and ninjastyles.tznvy.eu run on pandoc and some Python magic.

³ This actually sounds like a good use case for a regex. Or $EDITOR, if there are only a few files. But let's do this in Python, just to be consistent.

This post was originally published on blog.tznvy.eu
