I love programming-related memes and jokes, and I'm sure you do as well. @ben's weekly "Meme Monday" posts are a great source of humor that I always look forward to.
What we're building
We will build a simple project that outputs a markdown file containing all the memes on a Meme Monday thread. Each meme will be output alongside the text detected in it by OCR (optical character recognition).
OCR detection will be done with Tesseract.
Setup
Spin up a Node.js Repl on Replit.
Installing Tesseract
If you run tesseract in the shell, you will notice that the command does not exist, since it isn't installed.
In the top-right corner of the file tree, click the three dots and select "Show hidden files".
Navigate to the replit.nix configuration file and add pkgs.tesseract4 to the package dependency list.
{ pkgs }: {
deps = [
pkgs.tesseract4
pkgs.nodejs-18_x
pkgs.nodePackages.typescript-language-server
pkgs.yarn
pkgs.replitPackages.jest
];
}
Run tesseract in the shell. It should show some options now.
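If you would rather confirm the install from Node.js than from the shell, the check can be sketched as below. The helper name isTesseractInstalled is made up for this example, and it assumes that `tesseract --version` exits with status 0 when the binary is present.

```javascript
const { spawnSync } = require("child_process");

// Hypothetical helper (not part of the original project): returns true when
// the `tesseract` binary can be started from the current PATH.
function isTesseractInstalled() {
  // spawnSync does not throw when the binary is missing; it reports the
  // failure in `result.error` instead.
  const result = spawnSync("tesseract", ["--version"], { encoding: "utf8" });
  return !result.error && result.status === 0;
}

console.log(
  isTesseractInstalled()
    ? "tesseract is available"
    : "tesseract was not found on PATH"
);
```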
Dependencies
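Install node-tesseract-ocr and node-fetch. The code below loads node-fetch with require, which needs version 2 of the package, since version 3 is ESM-only.
npm install node-tesseract-ocr node-fetch@2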
We're all set, let's get coding.
Building the thing
Navigate to index.js.
Require/import the following dependencies at the top of the file.
const tesseract = require("node-tesseract-ocr");
const fetch = require("node-fetch");
const fs = require("fs");
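On Node.js 18 and later, fetch is available as a global, so the node-fetch import is optional there. A minimal sketch of an import that works either way could look like this. It would take the place of the node-fetch require line, and it assumes that node-fetch version 2 is what gets loaded on older Node.js versions.

```javascript
// Prefer the built-in fetch (a global in Node.js 18+) and fall back to the
// node-fetch package only when no global fetch exists.
// Assumption: the fallback is node-fetch v2, which supports require().
const fetch =
  typeof globalThis.fetch === "function"
    ? globalThis.fetch
    : require("node-fetch");
```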
Fetching article comments
Create an asynchronous function fetchArticleComments that takes a slug argument.
const fetchArticleComments = async (slug) => {
}
Let's hit the dev.to API and get an article by its slug. If the request fails, let's throw an error.
const articleRes = await fetch("https://dev.to/api/articles/" + slug)
if (!articleRes.ok) throw new Error("Failed to fetch article")
const article = await articleRes.json();
Derive the article's ID and fetch the article comments with it. Return the comments if the response is successful.
const fetchArticleComments = async (slug) => {
const articleRes = await fetch("https://dev.to/api/articles/" + slug)
if (!articleRes.ok) throw new Error("Failed to fetch article")
const article = await articleRes.json();
const commentsRes = await fetch("https://dev.to/api/comments?a_id=" + article.id);
if (!commentsRes.ok) throw new Error("Failed to fetch comments")
return await commentsRes.json();
}
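One thing to keep in mind is that the loop later in this post only looks at the comments in the returned array. If replies are nested inside each comment, which this sketch assumes is done through a children array, memes posted as replies would be skipped. A hypothetical flattenComments helper, not part of the original project, could walk the tree first:

```javascript
// Hypothetical helper: walk a comment tree and return every comment in one
// flat array. Assumes each comment may carry its replies in `children`.
function flattenComments(comments) {
  const flat = [];
  for (const comment of comments) {
    flat.push(comment);
    if (Array.isArray(comment.children)) {
      flat.push(...flattenComments(comment.children));
    }
  }
  return flat;
}

// Made-up sample data shaped like the assumption above
const sampleComments = [
  {
    id_code: "a",
    body_html: "<p>top</p>",
    children: [{ id_code: "b", body_html: "<p>reply</p>", children: [] }],
  },
  { id_code: "c", body_html: "<p>other</p>", children: [] },
];

console.log(flattenComments(sampleComments).map((c) => c.id_code));
// [ 'a', 'b', 'c' ]
```

With this in place, the result of fetchArticleComments could be passed through flattenComments before looping over it.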
Extracting URLs
Create and call an asynchronous main function at the end of the file.
async function main() {
}
main();
Within the main function, fetch the comments of a dev.to article and create a urls array in which we'll store the extracted URLs.
const comments = await fetchArticleComments("ben/meme-monday-59gk");
// Embedded Image URLs found in the comments
const urls = [];
Create a for loop and iterate through the comments. For each comment, let's use a regular expression to match an image URL from an image src prop and push it to urls.
for (const comment of comments) {
// Get embedded images from the comment
const images = comment.body_html.match(/src=\"[^\"]+\.(jpg|png|webp|jpeg)\"/g);
// Extract the image URLs from the embedded images
if (images?.length) {
const imageUrls = images.map(str => str.replace(/src="/, "").replace(/"/, ""));
urls.push(...imageUrls);
}
}
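To try the matching logic on its own, it can be pulled into a small function. The sketch below uses a capture group and matchAll instead of the two replace calls. The name extractImageUrls and the sample HTML are made up for illustration.

```javascript
// Hypothetical helper: return the image URLs found in src attributes of an
// HTML string.
function extractImageUrls(html) {
  // Same pattern as the loop above, with a capture group around the URL itself
  const pattern = /src="([^"]+\.(?:jpg|png|webp|jpeg))"/g;
  return [...html.matchAll(pattern)].map((match) => match[1]);
}

const sampleHtml =
  '<p>lol</p><img src="https://example.com/a.png" alt="x">' +
  '<img src="https://example.com/b.gif">';

console.log(extractImageUrls(sampleHtml));
// [ 'https://example.com/a.png' ]
```

The .gif image is skipped because the pattern only allows the jpg, png, webp, and jpeg extensions.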
OCR Text Extraction
Create an array variable images for storing URLs and the extracted OCR text.
const images = [];
Create a for loop to iterate through urls. Use fetch and res.ok to ensure that the image exists.
for (const i in urls) {
const url = urls[i];
// Make sure the image exists
const res = await fetch(url);
if (res.ok) {
}
}
Within the if (res.ok) statement, use await tesseract.recognize(url) to get the text from the respective URL and push it to images.
if (res.ok) {
const text = await tesseract.recognize(url);
images.push({
url,
text
});
console.log("Finished Processing URL", Number(i) + 1, "of", urls.length);
}
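As written, a single image that Tesseract cannot read would reject the promise and stop the whole run. One possible variation, sketched here with a hypothetical recognizeAll helper, wraps each OCR call in try/catch and skips the failures. The OCR function is passed in as a parameter so the loop can be tried without Tesseract installed.

```javascript
// Hypothetical helper: run OCR over each URL and skip the ones that fail.
// `recognize` is any async function that takes a URL and resolves to text,
// for example (url) => tesseract.recognize(url, { lang: "eng" }).
async function recognizeAll(urls, recognize) {
  const results = [];
  for (const [index, url] of urls.entries()) {
    try {
      const text = await recognize(url);
      results.push({ url, text });
    } catch (err) {
      // Log the failure and carry on with the next image
      console.warn("Skipping", url, "-", err.message);
    }
    console.log("Finished processing URL", index + 1, "of", urls.length);
  }
  return results;
}
```

Inside main, this could be called as `const images = await recognizeAll(urls, (url) => tesseract.recognize(url));`.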
Finally, at the end of the main function, use fs.writeFileSync to write the results to a file named memes.md.
fs.writeFileSync(
"memes.md",
images
.map(({ url, text }) => {
// Sanitize the text for use as image alt text by escaping brackets and quotes and removing newlines
const sanitizedText = text.replace(/\[|\]|\"/g, c => "\\" + c).replaceAll("\n", "");
// Return the text followed by a markdown-formatted image
return `${text}\n\n![${sanitizedText}](${url})`
})
.join("\n\n")
);
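The formatting step can also be checked without running OCR by moving it into a function. The sketch below is a variation rather than the exact code above: it replaces newlines in the alt text with spaces instead of removing them, so words on separate lines do not run together, and it trims surrounding whitespace. The name toMarkdown and the sample data are made up.

```javascript
// Hypothetical helper: turn { url, text } entries into markdown, one meme
// per block.
function toMarkdown(images) {
  return images
    .map(({ url, text }) => {
      // Escape brackets and double quotes, and turn each newline (with any
      // surrounding whitespace) into a single space, for use as alt text
      const alt = text
        .replace(/[\[\]"]/g, (c) => "\\" + c)
        .replace(/\s*\n\s*/g, " ")
        .trim();
      // The OCR text, followed by a markdown-formatted image
      return `${text.trim()}\n\n![${alt}](${url})`;
    })
    .join("\n\n");
}

console.log(
  toMarkdown([
    { url: "https://example.com/a.png", text: "IT WORKS\non my machine\n" },
  ])
);
```

The alt text for this sample comes out as "IT WORKS on my machine", while the text above the image keeps its original line break.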
Run the Repl. You should see each image being processed, and at the end you will find a memes.md file full of the memes along with the OCR-extracted text.
If you use the Markdown tool, you can preview the output markdown file.
And that's it! Thanks for reading
Top comments (7)
Btw, from node 18 and above you don't need to install node-fetch, it's already part of Node.js
Oh right, I often forget about that since I used Node 16 so often in the past.