Hiring managers and human resources (HR) teams are responsible for finding, screening, recruiting, and training job applicants, as well as administering employee-benefit programs.
At times, that work calls for extracting applicants' information from documents in as automated a way as possible.
In this short post, we'll learn how to extract text from a PDF using the pdf-parse npm library.
Setup
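Run npm init -y to start your Node project.
Run npm i pdf-parse to install the library.
Add your PDF file to the project folder (the code below expects it to be named test.pdf).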
Your project folder should now look roughly like this: index.js, test.pdf, package.json, and node_modules.
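- Here's the code (index.js)
const fs = require("fs");
const pdfParse = require("pdf-parse");

// Read the whole PDF into a Buffer; pdf-parse takes a Buffer, not a file path.
const pdfFile = fs.readFileSync("test.pdf");

pdfParse(pdfFile)
  .then(function (data) {
    console.log(data.numpages); // number of pages
    console.log(data.text); // extracted text
    console.log(data.info); // PDF info
  })
  .catch(function (err) {
    console.error("Failed to parse PDF:", err);
  });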
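The same call can also be written with async/await. The sketch below is only one way to wrap it: extractPdfText is a hypothetical helper name, not something pdf-parse provides, and the parser is passed in as an argument so the helper can be tried with a stub. The max option in the commented example is pdf-parse's documented setting for limiting how many pages are parsed.

```javascript
const fs = require("fs");

// Hypothetical helper (not part of pdf-parse): reads a PDF from disk and
// returns a small summary object. `parse` is the function exported by
// pdf-parse; it is injected so the helper does not hard-code the dependency.
async function extractPdfText(filePath, parse, options = {}) {
  const buffer = await fs.promises.readFile(filePath);
  const data = await parse(buffer, options);
  return { pages: data.numpages, text: data.text.trim() };
}

// Example call, assuming pdf-parse is installed and test.pdf exists:
//   const pdfParse = require("pdf-parse");
//   extractPdfText("test.pdf", pdfParse, { max: 1 }) // parse the first page only
//     .then((result) => console.log(result.pages, result.text))
//     .catch((err) => console.error("Could not parse PDF:", err.message));
```

Reading the file with fs.promises.readFile keeps the event loop free, which matters more once the code runs inside a server than in a one-off script.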
- Other fields available on the result (data)
// number of pages
console.log(data.numpages);
// number of rendered pages
console.log(data.numrender);
// PDF info
console.log(data.info);
// PDF metadata
console.log(data.metadata);
// PDF.js version
// check https://mozilla.github.io/pdf.js/getting_started/
console.log(data.version);
// PDF text
console.log(data.text);
Run your code with this command: node index
The 2 highlighted in green is the number of pages (data.numpages), which is the first value our code logs.
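Once data.text is available, it is an ordinary string, so pulling applicant details out of it is plain string processing. As a rough sketch tied to the HR scenario from the introduction, the hypothetical findEmails helper below collects e-mail addresses from the extracted text. The regular expression is intentionally simple and the sample text is made up for illustration; neither comes from pdf-parse.

```javascript
// Hypothetical helper: pulls e-mail addresses out of extracted PDF text.
// The pattern is deliberately simple and will miss unusual addresses.
function findEmails(text) {
  const pattern = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
  const matches = text.match(pattern) || [];
  // Remove duplicates while keeping first-seen order.
  return [...new Set(matches)];
}

// Made-up sample standing in for data.text returned by pdf-parse.
const sampleText =
  "Jane Doe\njane.doe@example.com\nReferences: hr@example.org, jane.doe@example.com";

console.log(findEmails(sampleText));
```

In a real script you would call findEmails(data.text) inside the then callback instead of using the sample string.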
Basic Usage with HTTP
Besides express, we will install 2 additional packages, multer and crawler-request: npm i express multer crawler-request
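const express = require("express");
const pdf = require("pdf-parse");
const crawler = require("crawler-request"); // imported here but not used by this route
const multer = require("multer");

// With no options, multer keeps uploaded files in memory (req.file.buffer).
const upload = multer();
const app = express();
const port = process.env.PORT || 3434;

// Body parser middleware
app.use(express.json());
app.use(express.raw());

app.post("/upload-pdf", upload.single("file"), (req, res) => {
  // multer leaves req.file undefined when no "file" field was sent.
  if (!req.file) {
    return res.status(400).send({ error: 'No file uploaded in the "file" field' });
  }
  console.log(`Request File: ${req.file.originalname} (${req.file.size} bytes)`);
  const buff = req.file.buffer;
  pdf(buff)
    .then((data) => {
      // PDF text
      console.log(data.text);
      res.send({ pdfText: data.text });
    })
    .catch((err) => {
      console.error(err);
      res.status(500).send({ error: "Could not parse the uploaded PDF" });
    });
});

app.listen(port, () => {
  console.log(`app started on localhost:${port}`);
});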
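The route hands whatever was uploaded straight to pdf-parse. One optional safeguard is to check that the buffer starts with the %PDF- signature before parsing. The looksLikePdf function below is a hypothetical helper, not part of multer or pdf-parse, and the check is strict: some PDF readers tolerate a few stray bytes before the header, which this sketch would reject.

```javascript
// Hypothetical guard: returns true when the buffer begins with the
// "%PDF-" signature that PDF files normally start with.
function looksLikePdf(buffer) {
  if (!Buffer.isBuffer(buffer) || buffer.length < 5) {
    return false;
  }
  return buffer.subarray(0, 5).toString("latin1") === "%PDF-";
}

// Possible use inside the /upload-pdf handler, before calling pdf(buff):
//   if (!looksLikePdf(req.file.buffer)) {
//     return res.status(400).send({ error: "Uploaded file is not a PDF" });
//   }

console.log(looksLikePdf(Buffer.from("%PDF-1.7\n")));
console.log(looksLikePdf(Buffer.from("hello world")));
```

Rejecting non-PDF uploads early gives the client a clear 400 response instead of a parser error.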
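Let's test with Postman: send a POST request to http://localhost:3434/upload-pdf with a form-data body whose file field is named file.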
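If you would rather test from a script, the same request can be sketched in Node itself. This assumes Node 18 or newer, where fetch, FormData, and Blob are available as globals. buildUploadForm and uploadPdf are hypothetical names chosen for this example, and the URL assumes the server is running locally on the default port 3434.

```javascript
// Builds the multipart body. The field name "file" must match
// upload.single("file") in the server code.
function buildUploadForm(pdfBuffer, filename) {
  const form = new FormData();
  const blob = new Blob([pdfBuffer], { type: "application/pdf" });
  form.append("file", blob, filename);
  return form;
}

// Hypothetical client: posts the form and returns the parsed JSON reply.
async function uploadPdf(url, pdfBuffer, filename) {
  const response = await fetch(url, {
    method: "POST",
    body: buildUploadForm(pdfBuffer, filename),
  });
  if (!response.ok) {
    throw new Error(`Upload failed with status ${response.status}`);
  }
  return response.json();
}

// Example call, assuming the server is running on port 3434:
//   const fs = require("fs");
//   uploadPdf("http://localhost:3434/upload-pdf", fs.readFileSync("test.pdf"), "test.pdf")
//     .then((body) => console.log(body.pdfText))
//     .catch((err) => console.error(err.message));
```

The Content-Type header is left unset on purpose so that fetch can generate the multipart boundary itself.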
Discuss
What other ways can you use to extract text from a PDF, besides the ones mentioned above?
Top comments (1)
I tried using it in a next.js api route and it wouldn't even load when I wrote "const pdfParse = require("pdf-parse");".
This library was last updated 5 years ago. You should delete this article. snyk.io/advisor/npm-package/pdf-parse