DEV Community

Update on my PDF file manipulation Script

George Moustakas (gemgr) ・1 min read

Hi again. In my previous post I had a problem: how to search a large PDF file for a keyword that can appear on multiple pages of the file and, in some cases, more than once on a single page.

I've used PyPDF2 to open a given PDF file, extract the text page by page, and search that text for the given keyword. The script then records which pages contain the keyword and how many times it appears on each, splits those pages from the original file, and merges them into a final file. That way only the useful data gets printed, without the rest of the original file.
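Roughly, that workflow could look like the sketch below. It is not the actual script: the function names `count_hits_per_page` and `extract_matching_pages` are made up for illustration, matching is assumed to be case-insensitive, and the PyPDF2 calls use the legacy (pre-3.0) method names such as `extractText()`.

```python
def count_hits_per_page(page_texts, keyword):
    """Return {page_index: hit_count} for every page whose text contains keyword.

    Matching is case-insensitive (an assumption, not something the post specifies).
    """
    needle = keyword.casefold()
    hits = {}
    for index, text in enumerate(page_texts):
        count = (text or "").casefold().count(needle)
        if count:
            hits[index] = count
    return hits


def extract_matching_pages(src_path, dst_path, keyword):
    """Copy the pages of src_path that mention keyword into dst_path."""
    # Imported here so the helper above also works without PyPDF2 installed.
    # These are the legacy PyPDF2 (< 3.0) class and method names.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    with open(src_path, "rb") as src:
        reader = PdfFileReader(src)
        texts = [reader.getPage(i).extractText()
                 for i in range(reader.getNumPages())]
        hits = count_hits_per_page(texts, keyword)

        writer = PdfFileWriter()
        for index in sorted(hits):
            writer.addPage(reader.getPage(index))
        # Write while the source file is still open: pages are read lazily.
        with open(dst_path, "wb") as dst:
            writer.write(dst)
    return hits
```

Keeping the counting step separate from the PDF handling makes it easy to try the search logic on plain strings first, before worrying about what the extractor returns.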

Everything works fine with test/dummy data in English characters, but the original file is in Greek, and the

PdfPageObj.extractText()

function of PyPDF2 returns an empty string.

So how would you approach this problem?
Any suggestions?

Discussion

Fredrik Bengtsson (fronkan)

According to the documentation, the extractText method works poorly for some PDF files, depending on the generator.
Some questions I would think about:

  • Are the PDFs generated using the same tool?
  • If not, can you access the tool used to generate the Greek PDF and use it to generate a test case?
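One way to check the first question is to compare the producer recorded in each file's metadata. The sketch below assumes the legacy PyPDF2 (pre-3.0) API, and the helper names `same_generator` and `read_producer` are hypothetical. Not every PDF records a producer, so a missing value is treated as "can't tell".

```python
def same_generator(producer_a, producer_b):
    """True when both producer strings are present and match,
    ignoring case and surrounding whitespace."""
    if not producer_a or not producer_b:
        return False
    return producer_a.strip().casefold() == producer_b.strip().casefold()


def read_producer(pdf_path):
    """Return the producer string from the PDF metadata, or None if absent."""
    # Legacy PyPDF2 (< 3.0) names; imported lazily so same_generator()
    # can be used on its own.
    from PyPDF2 import PdfFileReader

    with open(pdf_path, "rb") as handle:
        info = PdfFileReader(handle).getDocumentInfo()
        return info.producer if info is not None else None
```

For example, `same_generator(read_producer("dummy.pdf"), read_producer("greek.pdf"))` would hint at whether the two files came out of the same tool (both file names are placeholders).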

If you are unlucky, the extract method just doesn't work for your PDF, and in that case you may have to look for another library. I hope it works out for you.

Babis (babis30322853)

In my opinion, pdfminer.six is the most accurate for reading PDFs. I would also suggest adjusting laparams to get the text output as close as possible to your needs.
Also, I have used pdfminer.six on Greek PDFs and the text was extracted (before that I had tried PyPDF2 without success).
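A minimal sketch of per-page extraction with pdfminer.six and a custom `LAParams` follows. The helper names `pages_with_keyword` and `extract_page_texts` are made up for illustration, and the `char_margin` and `line_margin` values shown are just the library defaults, meant as a starting point to tune rather than a recommendation.

```python
def pages_with_keyword(page_texts, keyword):
    """Return [(page_number, count)] with 1-based page numbers,
    matching case-insensitively."""
    needle = keyword.casefold()
    result = []
    for number, text in enumerate(page_texts, start=1):
        count = text.casefold().count(needle)
        if count:
            result.append((number, count))
    return result


def extract_page_texts(pdf_path, char_margin=2.0, line_margin=0.5):
    """Return one text string per page of pdf_path using pdfminer.six."""
    # Imported lazily so pages_with_keyword() works without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LAParams, LTTextContainer

    laparams = LAParams(char_margin=char_margin, line_margin=line_margin)
    texts = []
    for page_layout in extract_pages(pdf_path, laparams=laparams):
        # Keep only layout elements that carry text (boxes and lines).
        texts.append("".join(
            element.get_text()
            for element in page_layout
            if isinstance(element, LTTextContainer)
        ))
    return texts
```

The page numbers returned by `pages_with_keyword(extract_page_texts("greek.pdf"), "λέξη")` (placeholder file name and keyword) could then be handed to whatever library does the splitting and merging.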