Abayomi Ogunnusi

Posted on Mar 11, 2022

Extract texts from PDFs.

#javascript #beginners #tutorial #node

Hiring managers and human resources (HR) teams are responsible for finding, screening, recruiting, and training job applicants, as well as administering employee-benefit programs.
At times, that process calls for extracting applicants' information from documents in as automated a way as possible.

In this short post, we'll learn how to extract text from a PDF using the pdf-parse npm library.

Setup

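Run `npm init -y` to start your Node project.
Run `npm i pdf-parse` to install the library.
Add your PDF file (test.pdf in the code below) and create an index.js file for the code.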

This is how your folder structure should look.

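Here's the code, saved as index.js:

const fs = require("fs");
const pdfParse = require("pdf-parse");

// Read the whole PDF into a Buffer; pdf-parse accepts a Buffer as input
const pdfFile = fs.readFileSync("test.pdf");

pdfParse(pdfFile)
  .then(function (data) {
    console.log(data.numpages); // number of pages
    console.log(data.text); // extracted text
    console.log(data.info); // PDF info
  })
  .catch(function (err) {
    // Report the failure instead of leaving an unhandled rejection
    console.error("Could not parse PDF:", err.message);
  });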
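
Before handing a buffer to pdfParse, it can help to check cheaply that the file looks like a PDF at all. The following is a minimal sketch, assuming the file begins with the standard `%PDF-` header. Some real-world PDFs carry a few stray bytes before that header, so treat a `false` result as a hint rather than proof. `looksLikePdf` is a hypothetical helper name and is not part of pdf-parse.

```javascript
// Hypothetical helper: returns true when the buffer starts with "%PDF-",
// the header that PDF files normally begin with.
function looksLikePdf(buffer) {
  if (!Buffer.isBuffer(buffer) || buffer.length < 5) {
    return false;
  }
  // latin1 maps each byte to one character, so binary content cannot
  // break the comparison.
  return buffer.subarray(0, 5).toString("latin1") === "%PDF-";
}

// Usage sketch (assumes test.pdf exists next to the script):
// const fs = require("fs");
// const pdfFile = fs.readFileSync("test.pdf");
// if (!looksLikePdf(pdfFile)) {
//   console.error("test.pdf does not look like a PDF");
// }
```

This is a sanity check, not validation: a file can start with the right header and still be damaged further in.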

Other properties available on the result object

    // number of pages
    console.log(data.numpages);
    // number of rendered pages
    console.log(data.numrender);
    // PDF info
    console.log(data.info);
    // PDF metadata
    console.log(data.metadata); 
    // PDF.js version
    // check https://mozilla.github.io/pdf.js/getting_started/
    console.log(data.version);
    // PDF text
    console.log(data.text);
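
Since `data.text` is a plain string, ordinary string tools can pull structured details out of it. As a rough sketch in the spirit of the HR use case from the introduction, the function below collects email addresses from extracted text. `extractEmails` is a hypothetical helper, and the regular expression is a deliberately simple approximation that will miss unusual addresses.

```javascript
// Hypothetical helper: find email-like strings in text extracted from a PDF.
function extractEmails(text) {
  // Simple approximation: local part, "@", domain, dot, 2+ letter TLD.
  const pattern = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
  const matches = text.match(pattern) || [];
  // Lowercase and de-duplicate while keeping first-seen order.
  return [...new Set(matches.map((match) => match.toLowerCase()))];
}

// Usage sketch inside the pdfParse callback:
// pdfParse(pdfFile).then((data) => {
//   console.log(extractEmails(data.text));
// });
```

Text extracted from a PDF may split or join words differently from how the page looks, so results like these are worth spot-checking.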

Run your code with this command: `node index`

Result:

The 2 highlighted in green is the number of pages in the PDF, printed by `console.log(data.numpages)` in our code.

Basic Usage with HTTP

We will install three additional packages, all of which the code below requires: express, multer, and crawler-request (`npm i express multer crawler-request`).

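const express = require("express");
const pdf = require("pdf-parse");
const crawler = require("crawler-request"); // imported, but not used in this example
const multer = require("multer");

// With no options, multer keeps uploaded files in memory (req.file.buffer)
const upload = multer();

const app = express();
const port = process.env.PORT || 3434;

// Body parser middleware
app.use(express.json());
app.use(express.raw());

app.post("/upload-pdf", upload.single("file"), (req, res) => {
  // req.file is undefined when the request has no "file" field
  if (!req.file) {
    return res.status(400).send({ error: 'No file uploaded in the "file" field' });
  }
  // Log the name and size only; stringifying req.file would dump the whole buffer
  console.log(`Request File: ${req.file.originalname} (${req.file.size} bytes)`);

  pdf(req.file.buffer)
    .then((data) => {
      // PDF text
      console.log(data.text);
      res.send({ pdfText: data.text });
    })
    .catch((err) => {
      // Without this handler, a failed parse would leave the request hanging
      res.status(500).send({ error: `Could not parse PDF: ${err.message}` });
    });
});

app.listen(port, () => {
  console.log(`app started on localhost:${port}`);
});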
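
The route above accepts an upload of any type and size. One way to tighten that is multer's `limits` and `fileFilter` options, sketched below. The 5 MB cap and the names `MAX_PDF_BYTES`, `pdfOnlyFilter`, and `uploadOptions` are assumptions made for illustration, not part of the original example.

```javascript
// Assumed limit for illustration; pick a value that suits your documents.
const MAX_PDF_BYTES = 5 * 1024 * 1024;

// multer calls fileFilter(req, file, cb) for each uploaded file.
// cb(null, true) accepts the file, cb(null, false) skips it.
function pdfOnlyFilter(req, file, cb) {
  // The MIME type is reported by the client, so this is a first-line
  // filter and not a guarantee that the bytes are a PDF.
  cb(null, file.mimetype === "application/pdf");
}

const uploadOptions = {
  limits: { fileSize: MAX_PDF_BYTES },
  fileFilter: pdfOnlyFilter,
};

// Usage sketch, replacing the bare multer() call:
// const multer = require("multer");
// const upload = multer(uploadOptions);
```

When the filter skips a file, `req.file` stays undefined, so the route should return early in that case instead of reading `req.file.buffer`.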

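Let's test with Postman: send a POST request to http://localhost:3434/upload-pdf with a form-data body whose `file` field holds the PDF.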

Result:
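
The same request can also be sent from a script instead of Postman. This is a minimal sketch, assuming Node 18 or later (for the built-in `fetch`, `FormData`, and `Blob`) and the server listening on port 3434. `buildUploadForm` and `uploadPdf` are hypothetical helper names.

```javascript
const fs = require("fs");
const path = require("path");

// Wrap the PDF bytes in multipart form data under the "file" field,
// the field name the /upload-pdf route reads via upload.single("file").
function buildUploadForm(pdfBuffer, filename) {
  const form = new FormData();
  const blob = new Blob([pdfBuffer], { type: "application/pdf" });
  form.append("file", blob, filename);
  return form;
}

// POST the file at filePath to the given URL and return the parsed JSON body.
async function uploadPdf(filePath, url) {
  const form = buildUploadForm(fs.readFileSync(filePath), path.basename(filePath));
  const response = await fetch(url, { method: "POST", body: form });
  if (!response.ok) {
    throw new Error(`Upload failed with status ${response.status}`);
  }
  return response.json(); // the route responds with { pdfText: "..." }
}

// Usage sketch, with the server running:
// uploadPdf("test.pdf", "http://localhost:3434/upload-pdf")
//   .then((body) => console.log(body.pdfText))
//   .catch(console.error);
```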

Discuss

What other ways can you use to extract text from PDFs besides the one covered here?

Resources

pdf-parse
Dev Odyssey

Top comments (1)

sara john • Dec 6 '23

I tried using it in a next.js api route and it wouldn't even load when I wrote "const pdfParse = require("pdf-parse");".

This library was last updated 5 years ago. You should delete this article. snyk.io/advisor/npm-package/pdf-parse

DEV Community
