Kim Kiragu

Building Twitter Image Downloader using Puppeteer

Scraping data from KPLC flow
This is the first part of my Analyzing Kenya Power Interruption Data project. In this part, we build a Twitter Image Downloader using Puppeteer.

Over the past two years, Puppeteer has become my go-to choice for web scraping and automation. It is written in JavaScript, which is my main stack, and it has a few other advantages in my opinion:

  • It is easy to configure and execute.
  • It is really fast because it drives headless Chrome.
  • It is easy to take screenshots and PDFs of pages for UI testing.

Tool

Twitter Image Downloader is the tool I built to scrape images from Twitter accounts, for educational purposes of course. I know several such tools exist, but I decided to expand my Puppeteer and JS skills by building one myself.

The main libraries I used to build this tool are:

  • Puppeteer - Node.js library which provides a high-level API to control headless Chrome or Chromium or to interact with the DevTools Protocol. I use it for web crawling and scraping in this project.
  • Request - Simplified HTTP request client
  • Inquirer - An easily embeddable and beautiful command line interface for Node.js
  • Chalk - A library that provides a simple and easy to use interface for applying ANSI colors and styles to your command-line output

Puppeteer Launch

This article is not a step-by-step guide to building the tool, but rather informal documentation of my thought process while building it. The instructions for running the tool can be found in the README.md here.

The code below is my Puppeteer config. I set headless to false in my normal development environment so that I can see what is happening, especially whether the scroll is effective.

const browser = await puppeteer.launch({
  headless: false,
  args: ["--disable-notifications"],
});
const page = await browser.newPage();
await page.setViewport({
  width: 1366,
  height: 768,
});

args: ["--disable-notifications"] disables any notifications that might overlay and hide elements we want to click or read data from.

The main file is twitter.js.

The URL accessed to scrape the images is built on line 67, where username is the Twitter account username entered when running the script:

const pageUrl = `https://twitter.com/${username.replace("@", "")}`;
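As a quick sanity check, here is that template wrapped in a small standalone function (the handle is made up for illustration). A leading @ is stripped, so both forms resolve to the same profile URL:

```javascript
// Build the profile URL the scraper visits. The replace("@", "")
// call strips a leading "@" if the user typed one.
function profileUrl(username) {
  return `https://twitter.com/${username.replace("@", "")}`;
}

console.log(profileUrl("@example")); // https://twitter.com/example
console.log(profileUrl("example")); // https://twitter.com/example
```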

The script opens a new tab in the Chromium-based browser that Puppeteer launches and collects the URLs of all images:

// This snippet runs inside a page.on("response") listener,
// where `url` holds the response URL.
if (response.request().resourceType() === "image") {
  /**
   * Filter to only collect tweet images and ignore profile pictures and banners.
   */
  if (url.match("(https://pbs.twimg.com/media/(.*))")) {
    /**
     * Convert twitter image urls to high quality
     */
    const urlcleaner = /(&name=([a-zA-Z0-9_]*$))\b/;
    let cleanurl = url.replace(urlcleaner, "&name=large");

    try {
      const imageDetails = cleanurl.match(
        "https://pbs.twimg.com/media/(.*)?format=(.*)&name=(.*)"
      );
      const imageName = imageDetails[1];
      const imageExtension = imageDetails[2];
      console.log(chalk.magenta("Downloading..."));
      await downloader(cleanurl, imageName, imageExtension, username);
    } catch (error) {
      // Ignore URLs that do not match the expected pattern.
    }
  }
}

The response.request().resourceType() === "image" check ensures we only process image responses, because that is all we are currently interested in.

Regex

We see a lot of regex matching here, so let me explain what is going on.

1.

   url.match("(https://pbs.twimg.com/media/(.*))")

A normal Twitter user profile contains many types of images:

  • Their profile picture and header
  • Images posted/retweeted
  • Other retweeted users' profile pictures

Each of these images has a URL, and one of my main headaches when starting out was filtering for only the images in the second category.
Luckily, I found out that images posted in tweets follow the pattern https://pbs.twimg.com/media/.., and that is what the url.match call checks for. We ignore all other types of images and only work with posted images.
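To make the filter concrete, here is a small standalone check with one URL from each category (the profile picture and banner URLs are made up for illustration):

```javascript
// Only URLs under https://pbs.twimg.com/media/ survive the filter.
const urls = [
  "https://pbs.twimg.com/profile_images/123456/avatar.jpg", // profile picture
  "https://pbs.twimg.com/profile_banners/123456/1600000000", // header
  "https://pbs.twimg.com/media/FDSOZT9XMAIo6Sv?format=jpg&name=900x900", // posted image
];

const tweetImages = urls.filter((url) =>
  url.match("(https://pbs.twimg.com/media/(.*))")
);

console.log(tweetImages.length); // 1 — only the posted image remains
```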

2.

   const urlcleaner = /(&name=([a-zA-Z0-9_]*$))\b/;
   let cleanurl = url.replace(urlcleaner, "&name=large");

Posted images all follow the same pattern except for the &name= part, which specifies the dimensions of the image. For example, in https://pbs.twimg.com/media/FDSOZT9XMAIo6Sv?format=jpg&name=900x900, 900x900 is the dimension of the image.

I needed high-quality images because my use case involves extracting data from the text in them, which is why I replace the &name=... part of every image URL with &name=large to get the best quality, using the urlcleaner regex to match all the possible size suffixes.
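A quick standalone run of the cleaner shows the size suffix being swapped out:

```javascript
// Replace whatever size suffix the URL ends with (&name=900x900,
// &name=small, etc.) with &name=large for the highest quality.
const urlcleaner = /(&name=([a-zA-Z0-9_]*$))\b/;

const original =
  "https://pbs.twimg.com/media/FDSOZT9XMAIo6Sv?format=jpg&name=900x900";
const cleanurl = original.replace(urlcleaner, "&name=large");

console.log(cleanurl);
// https://pbs.twimg.com/media/FDSOZT9XMAIo6Sv?format=jpg&name=large
```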

3.

   const imageDetails = cleanurl.match(
     "https://pbs.twimg.com/media/(.*)?format=(.*)&name=(.*)"
   );
   const imageName = imageDetails[1];
   const imageExtension = imageDetails[2];

The 3rd part matches against the cleaned URL and returns an array from which I can access the image name and extension.

   ["https://pbs.twimg.com/media/FDSOZT9XMAIo6Sv?format=jpg&name=large", "FDSOZT9XMAIo6Sv?", "jpg", "large"]

This is what the typical imageDetails will look like.
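Running the match on a cleaned URL confirms the layout, including the quirk that group 1 keeps the trailing "?" (the literal format= in the pattern lets the preceding .* absorb it):

```javascript
// The string passed to match() is treated as a regex; the three (.*)
// groups capture the name, extension, and size suffix respectively.
const cleanurl =
  "https://pbs.twimg.com/media/FDSOZT9XMAIo6Sv?format=jpg&name=large";
const imageDetails = cleanurl.match(
  "https://pbs.twimg.com/media/(.*)?format=(.*)&name=(.*)"
);

console.log(imageDetails[1]); // "FDSOZT9XMAIo6Sv?" (trailing "?" included)
console.log(imageDetails[2]); // "jpg"
console.log(imageDetails[3]); // "large"
```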

Autoscroll

Twitter uses infinite scroll, where only the tweets in the current page view are loaded and you have to keep scrolling to load more. This is why I needed an autoScroll function, so the browser could automatically keep scrolling until it could not load any more tweets.

async function autoScroll(page) {
  await page.evaluate(async () => {
    await new Promise((resolve) => {
      let totalHeight = 0;
      const distance = 100;
      const timer = setInterval(() => {
        // scrollHeight grows as new tweets load, so we keep scrolling
        // until no more content is added.
        const scrollHeight = document.body.scrollHeight;
        window.scrollBy(0, distance);
        totalHeight += distance;
        if (totalHeight >= scrollHeight) {
          clearInterval(timer);
          resolve();
        }
      }, 300);
    });
  });
}


Download Images

The function that downloads the images can be found in downloader.js:

function download(uri, name, extension, twitterUsername) {
  return new Promise((resolve, reject) => {
    request.head(uri, function (err, res, body) {
      if (err) return reject(err);
      // Create ./images/<twitterUsername> if it does not exist yet;
      // { recursive: true } also creates the intermediate images folder.
      const twitterUsernamePath = `./images/${twitterUsername}`;
      if (!fs.existsSync(twitterUsernamePath)) {
        fs.mkdirSync(twitterUsernamePath, { recursive: true });
      }
      const filePath = path.resolve(
        twitterUsernamePath,
        `${name}.${extension}`
      );
      request(uri).pipe(fs.createWriteStream(filePath)).on("close", resolve);
    });
  });
}

The function takes in a uri, name, extension, and twitterUsername. These parameters are passed in from line 61 of twitter.js.

A folder named after the Twitter username is created here. The images are then downloaded into that folder one by one.

The images are named using the passed name and extension, the ones we extracted in Regex part 3.

Conclusion

Several images will be downloaded, but for the purposes of the Analyzing Kenya Power Interruption project, we are interested in the images that look like this.

The code and instructions for running this tool can be found on https://github.com/Kimkykie/twitter-image-downloader

This is still a work in progress, and I am open to corrections, ideas, and enhancements.

The next part will be extracting text from our images and converting them into txt files. Thank you.
