
ryanrosello-og
Verify PDF contents using Playwright and pdf2json

In this tutorial we will use Playwright in conjunction with pdf2json to validate the contents of a PDF file. This is a very common task that you will normally encounter when creating end-to-end automated tests.

The PDF file we will use for this example is a plain, text-based PDF containing 6 pages. For simplicity, I have stored this file, pdf_sample.pdf, in the root folder of the project.

PDF file to validate

Our goals are:

  • validate the meta information (keywords: "Standard Fees and Charges, 003-750, 3-750") contained within the file
  • ensure the pdf file indeed has 6 pages
  • assert whether the PDF file contains the correct text "When we may charge fees"

First up, you will need to add pdf2json to your project using yarn (or npm):

yarn add pdf2json -D

Import pdf2json into your spec file and create the initial scaffolding for our tests:

import PDFParser from 'pdf2json';
import { test, expect } from '@playwright/test';

test.describe('assert PDF contents using Playwright', () => {
  test.beforeAll(async () => {
  })

  test('pdf file should have 6 pages', async () => {
  });

  test('contains the correct subheading text', async () => {
  });

  test('shows the correct meta information (keywords)', async () => {
  });
});

Create a simple helper function that does the heavy lifting of parsing and loading the pdf contents into a variable:

async function getPDFContents(pdfFilePath: string): Promise<any> {
  let pdfParser = new PDFParser();
  return new Promise((resolve, reject) => {
    pdfParser.on('pdfParser_dataError', (errData: {parserError: any}) =>
      reject(errData.parserError)
    );
    pdfParser.on('pdfParser_dataReady', (pdfData) => {
      resolve(pdfData);
    });

    pdfParser.loadPDF(pdfFilePath);
  });
}

Create a variable called pdfContents, scoped within the describe block:

let pdfContents: any

Update the beforeAll hook to read the contents of the PDF into the variable:

  test.beforeAll(async ({}) => {
    pdfContents = await getPDFContents('./pdf_sample.pdf')
  })

If you debug and inspect the shape of pdfContents, you will notice that two of the tests (the page count and the keywords) are quite easy to assert.

Easy assertion
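
To make that shape concrete, here is a minimal sketch of the fields this tutorial reads. The interface names (PdfContents, PdfPage, PdfText, PdfTextRun) are my own, made up for illustration and not types exported by pdf2json, and the real parsed output contains many more fields. The sample object is hand-written, not real parser output.

```typescript
// Hypothetical types covering only the fields used in this tutorial.
interface PdfTextRun {
  T: string; // the text itself, URI-encoded
}

interface PdfText {
  R: PdfTextRun[]; // text runs that make up this entry
}

interface PdfPage {
  Texts: PdfText[]; // text entries found on the page
}

interface PdfContents {
  Meta: { Keywords?: string }; // document metadata
  Pages: PdfPage[]; // one entry per page
}

// Hand-written stand-in for parsed output, used only to show the nesting.
const sample: PdfContents = {
  Meta: { Keywords: 'Standard Fees and Charges, 003-750, 3-750' },
  Pages: [
    { Texts: [{ R: [{ T: 'When%20we%20may%20charge%20fees' }] }] },
  ],
};

console.log(sample.Pages.length);
console.log(sample.Meta.Keywords);
console.log(sample.Pages[0].Texts[0].R[0].T);
```

Declaring pdfContents with a type like this instead of any would let the compiler catch a mistyped property name, at the cost of maintaining the type yourself.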

  test('pdf file should have 6 pages', async () => {
    expect(pdfContents.Pages.length, 'The pdf should have 6 pages').toEqual(6);
  });

  test('shows the correct meta information (keywords)', async () => {
    expect(pdfContents.Meta.Keywords, 'PDF keyword was incorrect').toEqual('Standard Fees and Charges, 003-750, 3-750');
  });

However, the last test (asserting that "When we may charge fees" is contained in the file) is a little more convoluted. You will need to expand the Pages array and find the page where you expect the text to exist, then inspect that page's Texts array to find the text you are looking for. In our example it was found on the first page, on the fourth line. Since both arrays are zero-indexed, this equates to pdfContents.Pages[0].Texts[3].R[0].T

Raw text for assertion

One last complication remains: the raw text that we require, "When%20we%20may%20charge%20fees", is URI-encoded. We can easily decode it using the decodeURI function.

  test('contains the correct subheading text', async () => {
    const rawText = pdfContents.Pages[0].Texts[3].R[0].T
    expect(decodeURI(rawText), 'The subheading text was incorrect').toEqual('When we may charge fees');    
  });

Our final test
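
Hard-coding Pages[0].Texts[3] ties the test to the exact layout of the file. As a possible alternative, here is a minimal sketch that decodes every text entry and reports which pages contain the expected string. The names ParsedPdf, pageTexts and pagesContaining are hypothetical helpers of mine, not part of pdf2json, and the sketch assumes the text you are looking for sits in a single Texts entry, as it does in the example above. It uses decodeURIComponent rather than decodeURI because decodeURI leaves escapes for reserved characters, such as %2C for a comma, untouched.

```typescript
// Hypothetical minimal shape: only the fields read below.
type ParsedPdf = { Pages: { Texts: { R: { T: string }[] }[] }[] };

// Decode every text entry on one page (zero-based page index).
function pageTexts(pdf: ParsedPdf, pageIndex: number): string[] {
  const page = pdf.Pages[pageIndex];
  if (!page) return []; // no such page
  return page.Texts.map((text) =>
    text.R.map((run) => decodeURIComponent(run.T)).join('')
  );
}

// Zero-based indexes of the pages that contain the expected text exactly.
function pagesContaining(pdf: ParsedPdf, expected: string): number[] {
  return pdf.Pages
    .map((_, i) => i)
    .filter((i) => pageTexts(pdf, i).some((t) => t === expected));
}

// Hand-written demo data, not real parser output.
const demo: ParsedPdf = {
  Pages: [
    { Texts: [{ R: [{ T: 'Introduction' }] }] },
    {
      Texts: [
        { R: [{ T: 'When%20we%20may%20charge%20fees' }] },
        { R: [{ T: 'Fees%2C%20charges' }] },
      ],
    },
  ],
};

console.log(pagesContaining(demo, 'When we may charge fees'));
console.log(pageTexts(demo, 1));
```

Inside a spec, the assertion could then read expect(pagesContaining(pdfContents, 'When we may charge fees')).toContain(0), which keeps passing if the heading moves to a different line on the first page.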

Conclusion

I have demonstrated how you can easily verify contents of a pdf using Playwright and pdf2json. We have worked with a very basic pdf containing textual information. Unfortunately, pdf2json may not be able to handle more complex PDF files. YMMV 🥳🚀

Top comments (2)

Amit Rawat

Is it working with typescript for you?

Yarome Haber

Hi,
You didn't convert the pdf to a json file