Web scraping is a vast topic. In this article we are going to look at what it is, where we can use it, and a basic example of how to go about it.
What is it?
Web scraping is a method used by developers to extract large amounts of data from any given website. It mostly saves time: if you want to run calculations on massive amounts of data from a website, scraping means you don't have to visit the pages and manually log all the data yourself.
Web scraping a web page involves fetching it and extracting from it. Fetching is the downloading of a page (which is what a browser does when you view it). Web crawling is therefore a main component of web scraping: it fetches pages for later processing. Once fetched, extraction can take place: the content of a page may be parsed, searched, reformatted, its data copied into a spreadsheet or saved to a server, and so on. Web scrapers typically take something out of a page to make use of it for another purpose somewhere else.
There are methods that some websites use to prevent web scraping, such as detecting and disallowing bots from crawling (viewing) their pages. In response, there are web scraping systems that rely on techniques like DOM parsing, computer vision, and natural language processing to simulate human browsing and gather web page content for offline parsing. For example, GitHub has a rate limiting mechanism to control incoming and outgoing traffic.
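As an aside, a common way to stay on the right side of such limits is simply to pause between requests. Here is a minimal sketch of that idea; the URLs are placeholders, and the global fetch assumes Node 18+:

```js
// Pause between requests so we don't hammer a rate-limited site.
const delay = ms => new Promise(resolve => setTimeout(resolve, ms));

(async () => {
  const urls = ['https://example.com/page/1', 'https://example.com/page/2'];
  for (const url of urls) {
    const res = await fetch(url); // global fetch: Node 18+
    console.log(url, res.status);
    await delay(1000); // wait 1s before the next request
  }
})();
```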
Use Cases
Now that we are learning about web scraping, one might think: it sounds cool and all, but what do I do with it?
Most use cases involve automation of some kind. It could be any of the following:
- Online price monitoring
- Research
- Market analysis
- Building large data sets for machine learning
- End-to-end testing
- Gathering real estate listings
- Product comparison websites
Of course, the use case doesn't have to be gigantic. Online you can find examples of developers getting creative and automating small things to help their day-to-day lives: one developer built a small script to log in and check her loan due amount every day; others scrape when they are not happy with the data representation a UI provides and need some special kind of filter.
Our use case for today is that we need a list of emojis saved to a JSON file, with each emoji's unicode and name (because who doesn't love emojis). There is an official list of all emoji unicodes on unicode.org.
Note: A more up-to-date version of that list lives here, but we want to learn scraping, so we will stick to the HTML chart.
Tools that can be used
Let's go hunting for a tool that can help us do that. The two most commonly used JS libraries for scraping are Cheerio and Puppeteer. Let's look at each of them briefly.
Cheerio
Cheerio is probably the most popular one. According to their website, Cheerio is a fast, flexible, and lean implementation of core jQuery designed specifically for the server. Its jQuery-like API is what makes it a darling of devs, and it has a massive list of selectors, again with syntax borrowed from jQuery. Because I am not as familiar with jQuery syntax, I decided to go with Puppeteer.
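To give a feel for that jQuery-like API, here is a minimal sketch; the HTML snippet and the .emoji class are made up for illustration:

```js
const cheerio = require('cheerio');

// Load an HTML string and query it with jQuery-style selectors
const $ = cheerio.load('<ul><li class="emoji">grinning face</li><li class="emoji">party popper</li></ul>');

$('.emoji').each((i, el) => {
  console.log($(el).text()); // "grinning face", "party popper"
});
```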
Puppeteer
Puppeteer is a Node API for Headless Chrome, a way of running the Chrome browser without a GUI. It is usually used for automating things, which is what we need. Under the hood it uses the DevTools Protocol, which is really cool in case you want to check it out.
Puppeteer has an event-driven architecture, which removes a lot of potential flakiness: there's no need for sleep(1000) calls in Puppeteer scripts. You can play around with Puppeteer here. And since it drives an actual Chromium browser, it is much more powerful than Cheerio. It can do things like generating PDFs, taking screenshots, capturing a timeline trace, and much more.
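To give a taste of those extras, here is a minimal sketch using Puppeteer's screenshot() and pdf() page methods; example.com and the output file names are just placeholders:

```js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch(); // headless by default
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await page.screenshot({ path: 'example.png' });        // capture the viewport
  await page.pdf({ path: 'example.pdf', format: 'A4' }); // PDF generation needs headless mode
  await browser.close();
})();
```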
Show me the code
- Install puppeteer

Start a new project with npm init -y, then install Puppeteer with npm install puppeteer --save.
Note: When installed, Puppeteer downloads a version of Chromium, which it then drives using puppeteer-core. If you install puppeteer-core instead, it doesn't download Chromium. Puppeteer requires Node >= v6.4.0, but our example below uses async/await, which is only supported in Node >= v7.6.0.
- Launch the browser and navigate to the webpage
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto('https://unicode.org/emoji/charts/full-emoji-list.html');
  // ...
  await browser.close();
})();
When you launch Puppeteer, you get an instance of a browser back. It takes a whole bunch of options; by default Puppeteer launches a headless browser, but for debugging purposes you can set headless: false, so you can actually watch everything the script does. Note, though, that headless mode is faster. At the end you want to close the browser, because if you don't, you are going to have memory leaks, and you don't want that.
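If you do run with headless: false, it can also help to slow the script down so you can follow along. This is a minimal sketch using Puppeteer's slowMo launch option; the 250ms value is arbitrary:

```js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    headless: false, // show the browser window
    slowMo: 250,     // slow each Puppeteer action down by 250ms
  });
  const page = await browser.newPage();
  await page.goto('https://unicode.org/emoji/charts/full-emoji-list.html');
  await browser.close(); // always close to avoid leaking the browser process
})();
```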
- Search and get the data we need
const puppeteer = require('puppeteer');

let scrape = (async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto('https://unicode.org/emoji/charts/full-emoji-list.html');

  const result = await page.evaluate(() => {
    let data = [];
    document.querySelectorAll('table tr').forEach(node => {
      const code = node.querySelector('.code a');
      const name = node.querySelector('.name');
      if (code) {
        data.push({
          // e.g. "U+1F62E U+200D U+1F4A8" -> "1f62e_200d_1f4a8"
          code: code.innerHTML.replace(/\s/g, '').split('U+').filter(Boolean).join('_').toLowerCase(),
          name: name.innerHTML
        });
      }
    });
    return data;
  });

  await browser.close();
  return result;
});

scrape().then(data => {
  console.log(data); // success
});
If the function passed to page.evaluate returns a Promise, then page.evaluate will wait for the promise to resolve and return its value.
The function isn't executed in the Node process; it actually runs inside the page, so you have access to the whole DOM. Here we searched the document for all the emoji unicodes and their names, and returned the data.
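To make that boundary concrete, here is a small self-contained sketch showing that you can also pass arguments from Node into page.evaluate; the 'table tr' selector is the same one we used above:

```js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://unicode.org/emoji/charts/full-emoji-list.html');

  // Arguments are serialized into the page; the return value is
  // serialized back out to Node.
  const rowCount = await page.evaluate(selector => {
    return document.querySelectorAll(selector).length; // runs in the page
  }, 'table tr');

  console.log(`Found ${rowCount} rows`);
  await browser.close();
})();
```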
- Save the data
const puppeteer = require('puppeteer');
const fs = require('fs');

let scrape = (async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto('https://unicode.org/emoji/charts/full-emoji-list.html');

  const result = await page.evaluate(() => {
    let data = [];
    document.querySelectorAll('table tr').forEach(node => {
      const code = node.querySelector('.code a');
      const name = node.querySelector('.name');
      if (code) {
        data.push({
          code: code.innerHTML.replace(/\s/g, '').split('U+').filter(Boolean).join('_').toLowerCase(),
          name: name.innerHTML
        });
      }
    });
    return data;
  });

  await browser.close();
  return result;
});

scrape().then(data => {
  // write the scraped list to disk as JSON
  fs.writeFile('emoji-list.json', JSON.stringify(data), 'utf8', () => {
    console.log('DONE!!');
  });
});
Here we just saved the returned data to a JSON file. And there you have it, the list of emojis.
That's it! Now run the script with node index.js.
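Assuming the chart keeps its current markup, the saved emoji-list.json should look roughly like this (the two entries shown are real names from the Unicode chart):

```json
[
  { "code": "1f600", "name": "grinning face" },
  { "code": "1f603", "name": "grinning face with big eyes" }
]
```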
End Note
Web scraping is certainly a fun experience. As I mentioned, it is a broad field, and you have now finished a brief tour of it. You can get pretty far using Puppeteer for scraping.
I hope this post helps you get started with web scraping and that you enjoyed it!
If you have any questions or comments, please let me know in the comments below and I will get back to you.
Photo by Nick Fewings on Unsplash