While there are numerous ways to handle PDF documents with Python, I find generating or editing HTML far easier and more reliable than trying to figure out the intricacies of the PDF format. Sure, there is the venerable ReportLab, and if HTML is not your cup of tea, I encourage you to look into that option. There is also PyPDF2. Or maybe PyPDF3? No, perhaps PyPDF4! Hmmm... see the problem? My best guess is PyPDF3, for what that is worth.
So many choices...
But there is an easy choice if you are comfortable with HTML.
Enter WeasyPrint. It takes HTML and CSS, and converts it to a usable and potentially beautiful PDF document.
The code samples in this article can be accessed in the associated GitHub repo. Feel free to clone and adapt.
Installation
To install WeasyPrint, I recommend you first set up a virtual environment with the tool of your choice.
Then, installation is as simple as performing something like the following in an activated virtual environment:
pip install weasyprint
Alternatives to the above, depending on your tooling:
poetry add weasyprint
conda install -c conda-forge weasyprint
pipenv install weasyprint
You get the idea.
If you only want the weasyprint command-line tool, you could even use pipx and install it with pipx install weasyprint. While that would not make it very convenient to access as a Python library, if you just want to convert web pages to PDFs, that may be all you need.
A command line tool (Python usage optional)
Once installed, the weasyprint command line tool is available. You can convert an HTML file or a web page to PDF. For instance, you could try the following:
weasyprint \
"https://en.wikipedia.org/wiki/Python_(programming_language)" \
python.pdf
The above command will save a file python.pdf in the current working directory, converted from the HTML of the Python programming language article on English Wikipedia. It ain't perfect, but it gives you an idea, hopefully.
You don't have to specify a web address, of course. Local HTML files work fine, and they provide necessary control over content and styling.
weasyprint sample.html out/sample.pdf
Feel free to download a sample.html and an associated sample.css stylesheet with the contents of this article.
See the WeasyPrint docs for further examples and instructions regarding the standalone weasyprint command line tool.
Utilizing WeasyPrint as a Python library
The Python API for WeasyPrint is quite versatile. It can be used to load HTML when passed appropriate file pointers, file names, or the text of the HTML itself.
Here is an example of a simple makepdf() function that accepts an HTML string, and returns the binary PDF data.
from weasyprint import HTML


def makepdf(html):
    """Generate a PDF file from a string of HTML."""
    htmldoc = HTML(string=html, base_url="")
    return htmldoc.write_pdf()
The main workhorse here is the HTML class. When instantiating it, I found I needed to pass a base_url parameter in order for it to load images and other assets from relative URLs, as in <img src="somefile.png">.
With HTML and write_pdf(), not only is the HTML parsed, but so is any associated CSS, whether it is embedded in the head of the HTML (in a <style> tag) or included as a stylesheet (with a <link href="sample.css" rel="stylesheet"> tag).
I should note that HTML can load straight from files, and write_pdf() can write to a file, by specifying filenames or file pointers. See the docs for more detail.
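As a minimal sketch of that file-to-file route: the helper names pdf_path_for() and convert_file() below are my own, not part of WeasyPrint, and the import is done inside the function only so the path helper can be used without WeasyPrint installed. When HTML is given a filename, WeasyPrint resolves relative URLs against that file's location, so no explicit base_url should be needed here.

```python
from pathlib import Path


def pdf_path_for(infile, outdir):
    """Derive an output path such as out/sample.pdf from sample.html.

    Hypothetical helper for this sketch, not a WeasyPrint function.
    """
    return Path(outdir) / (Path(infile).stem + ".pdf")


def convert_file(infile, outdir="out"):
    """Convert an HTML file on disk to a PDF file on disk."""
    # Imported lazily so pdf_path_for() works without WeasyPrint installed.
    from weasyprint import HTML

    outpath = pdf_path_for(infile, outdir)
    # Create the output directory if it does not exist yet.
    outpath.parent.mkdir(parents=True, exist_ok=True)
    # filename= loads the HTML from disk; target= writes the PDF to disk.
    HTML(filename=str(infile)).write_pdf(target=str(outpath))
    return outpath
```

Calling convert_file("sample.html") would then be expected to write out/sample.pdf and return its path.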
Here is a more full-fledged version of makepdf(), with primitive command line handling added:
from pathlib import Path
import sys

from weasyprint import HTML


def makepdf(html):
    """Generate a PDF file from a string of HTML."""
    htmldoc = HTML(string=html, base_url="")
    return htmldoc.write_pdf()


def run():
    """Command runner."""
    infile = sys.argv[1]
    outfile = sys.argv[2]
    html = Path(infile).read_text()
    pdf = makepdf(html)
    Path(outfile).write_bytes(pdf)


if __name__ == "__main__":
    run()
You may download the above file directly, or browse the GitHub repo.
A note about Python types: the string parameter when instantiating HTML is a normal (Unicode) str, but makepdf() outputs bytes.
Assuming the above file is in your working directory as weasyprintdemo.py and that a sample.html and an out directory are also there, the following should work well:
python weasyprintdemo.py sample.html out/sample.pdf
Try it out, then open out/sample.pdf with your PDF reader. Are we close?
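The sys.argv handling in the script is deliberately primitive: a missing argument raises an IndexError. One possible refinement, sketched here with argparse from the standard library, gives usage messages for free. The function names parse_args() and run() are my own choices, and passing the input file's URI as base_url is an assumption about how you want relative assets resolved (relative to the HTML file rather than the working directory).

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the input and output paths from the command line."""
    parser = argparse.ArgumentParser(description="Convert an HTML file to PDF.")
    parser.add_argument("infile", type=Path, help="path to the source HTML file")
    parser.add_argument("outfile", type=Path, help="path of the PDF to write")
    return parser.parse_args(argv)


def run(argv=None):
    """Read the HTML file, convert it, and write the PDF bytes."""
    args = parse_args(argv)
    # Imported lazily so argument parsing works without WeasyPrint installed.
    from weasyprint import HTML

    html = args.infile.read_text(encoding="utf-8")
    # Assumption: resolve relative URLs (images, CSS) against the input file.
    base_url = args.infile.resolve().as_uri()
    pdf = HTML(string=html, base_url=base_url).write_pdf()
    args.outfile.write_bytes(pdf)
```

To use it as a script, call run() from an if __name__ == "__main__": guard, as in the original file.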
Styling HTML for print
As is probably apparent, using WeasyPrint is easy. The real work with HTML to PDF conversion, however, is in the styling. Thankfully, CSS has pretty good support for printing.
Some useful CSS print resources:
This simple stylesheet demonstrates a few basic tricks:
body {
  font-family: sans-serif;
}
@media print {
  a::after {
    content: " (" attr(href) ") ";
  }
  pre {
    white-space: pre-wrap;
  }
  @page {
    margin: 0.75in;
    size: Letter;
    @top-right {
      content: counter(page);
    }
  }
  @page :first {
    @top-right {
      content: "";
    }
  }
}
First, use media queries. This allows you to use the same stylesheet for both print and screen, using @media print and @media screen respectively. In the example stylesheet, I assume that the defaults (such as seen in the body declaration) apply to all formats, and that @media print provides overrides. Alternatively, you could include separate stylesheets for print and screen, using the media attribute of the <link> tag, as in <link rel="stylesheet" href="print.css" media="print" />.
Second, use @page CSS rules. While browser support is pretty abysmal in 2020, WeasyPrint does a pretty good job of supporting what you need. Note the margin and size adjustments above, and the page numbering, in which we first display the page counter in the top-right margin box, then override it with :first to make it blank on the first page only. In other words, page numbers only show from page 2 onward.
Also note the a::after trick to explicitly display the href attribute when printing. This is either clever or annoying, depending on your goals.
Another hint, not demonstrated above: within the @media print block, set display: none on any elements that don't need to be printed, and set background: none where you don't want backgrounds printed.
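If you would rather not touch the page's own stylesheet, one option is to pass such overrides from Python. The following is a sketch, assuming WeasyPrint's CSS class and the stylesheets option of write_pdf(); the selectors (nav, .no-print, header) and the function name makepdf_with_overrides() are hypothetical and would need to match your own markup.

```python
# Print-only overrides. The selectors are placeholders for this sketch.
PRINT_OVERRIDES = """
@media print {
  nav, .no-print {
    display: none;
  }
  body, header {
    background: none;
  }
}
"""


def makepdf_with_overrides(html, extra_css=PRINT_OVERRIDES):
    """Generate PDF bytes from an HTML string, applying extra print CSS."""
    # Imported lazily so the CSS constant can be reused without WeasyPrint.
    from weasyprint import CSS, HTML

    htmldoc = HTML(string=html, base_url="")
    # Stylesheets passed here are applied in addition to the document's own.
    return htmldoc.write_pdf(stylesheets=[CSS(string=extra_css)])
```

Because WeasyPrint renders with the print media type by default, rules inside @media print should apply without further configuration.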
Django and Flask support
If you write Django or Flask apps, you may benefit from the convenience of the respective libraries for generating PDFs within these frameworks:
- django-weasyprint provides a WeasyTemplateView view base class or a WeasyTemplateResponseMixin mixin on a TemplateView.
- Flask-WeasyPrint provides a special HTML class that works just like WeasyPrint's, but respects Flask routes and WSGI. Also provided is a render_pdf function that can be called on a template or on the url_for() of another view, setting the correct mimetype.
Generate HTML the way you like
WeasyPrint encourages the developer to make HTML and CSS, and the PDF just happens. If that fits your skill set, then you may enjoy experimenting with and utilizing this library.
How you generate HTML is entirely up to you. You might:
- Write HTML from scratch, and use Jinja templates for variables and logic.
- Write Markdown and convert it to HTML with cmarkgfm or another CommonMark implementation.
- Generate HTML Pythonically, with Dominate or lxml's E factory.
- Parse, modify, and prettify your HTML (or HTML written by others) with BeautifulSoup.
Then generate the PDF using WeasyPrint.
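To make that pipeline concrete without pulling in any of the libraries listed, here is a dependency-free stand-in that uses string.Template and html.escape from the standard library in place of Jinja. The template, the PAGE constant, and the build_html() name are all illustrative assumptions; a real project would likely use one of the tools above.

```python
from html import escape
from string import Template

# A tiny page template; $title and $items are filled in by build_html().
PAGE = Template("""<!DOCTYPE html>
<html>
<head><meta charset="utf-8"><title>$title</title></head>
<body>
<h1>$title</h1>
<ul>
$items
</ul>
</body>
</html>
""")


def build_html(title, items):
    """Return an HTML document with an escaped title and bullet list."""
    # Escape user-supplied text so it cannot break the markup.
    rows = "\n".join(f"<li>{escape(item)}</li>" for item in items)
    return PAGE.substitute(title=escape(title), items=rows)
```

The resulting string can be handed straight to the makepdf() function shown earlier, for example makepdf(build_html("Groceries", ["eggs", "milk"])).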
Anything I missed? Feel free to leave comments!