Introduction
In this tutorial, we will learn how to scrape Google Events Results using Node JS with Puppeteer JS.
Requirements:
Web Parsing with CSS selectors
Searching for tags in raw HTML is both difficult and time-consuming. It is better to use the CSS Selector Gadget to pick the right tags and make your web scraping journey easier.
This gadget can help you come up with the right CSS selector for your needs. Here is the link to the tutorial, which will teach you how to use this gadget to select the best CSS selectors.
User Agents
The User-Agent header identifies the application, operating system, vendor, and version of the requesting user agent. Sending a realistic one can help your scraper's visit to Google look like it comes from a real user.
You can also rotate User Agents. Read more about this in this article: How to fake and rotate User Agents using Python 3.
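As a minimal sketch of rotation in Node JS, the helper below picks a random User-Agent string from a small pool. The pool contents and the name pickUserAgent are illustrative assumptions and not part of this tutorial's scraper.

```javascript
// Hypothetical pool of desktop User-Agent strings (illustrative values only).
const USER_AGENTS = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
];

// Pick one entry at random. `random` is injectable so the choice can be tested.
const pickUserAgent = (pool = USER_AGENTS, random = Math.random) =>
  pool[Math.floor(random() * pool.length)];

// Possible usage inside the scraper (assumes a Puppeteer `page` object):
//   await page.setUserAgent(pickUserAgent());
```

Calling page.setUserAgent() before page.goto() would make the chosen string apply to the navigation request.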
If you want to further safeguard your IP from being blocked by Google, you can try these 10 Tips to avoid getting Blocked while Scraping Google.
Install Libraries
Before we begin, install Puppeteer so we can move forward and prepare our scraper.
Type the below command in your project terminal to install the library:
npm i puppeteer
Target:
Process:
Open the below URL in your browser, so we can start the scraping process.
https://www.google.com/search?q=events+in+delhi&ibp=htl;events&hl=en&gl=in
Our query is "Events in Delhi". Then, we have the Google Search parameter for displaying events results, ibp=htl;events. After that, we have hl as the language parameter. You can set any language. I have used English here because it is common to most of us reading the tutorial. Then we have the geolocation parameter, gl, which can have a different value for each country. For example, for the USA, it would be gl=us.
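The parameters described above can be assembled programmatically. This is a small sketch using the built-in URLSearchParams class; buildEventsUrl is a hypothetical helper name, and its defaults simply mirror the URL used in this tutorial.

```javascript
// Build a Google Events search URL from its parts.
// `buildEventsUrl` is a hypothetical helper, not part of the original scraper.
const buildEventsUrl = (query, { hl = "en", gl = "in" } = {}) => {
  const params = new URLSearchParams({
    q: query,          // search query, e.g. "events in delhi"
    ibp: "htl;events", // switches the results page to the events view
    hl,                // interface language
    gl,                // country used for geolocation
  });
  return `https://www.google.com/search?${params.toString()}`;
};

console.log(buildEventsUrl("events in delhi", { gl: "us" }));
```

Note that URLSearchParams percent-encodes the semicolon as %3B and turns spaces into +, so the printed URL looks slightly different from the handwritten one while carrying the same parameter values.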
We will use Puppeteer with an infinite scrolling method to scrape the Google Events Search Results. So, let us start preparing our scraper.
First, let us create a main function which will launch the browser and navigate to the target URL.
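const getEventsData = async () => {
  // Launch Chromium with a visible window (non-headless mode)
  const browser = await puppeteer.launch({
    headless: false,
    args: ["--disable-setuid-sandbox", "--no-sandbox"],
  });
  // Reuse the tab that Chromium opens on launch
  const [page] = await browser.pages();
  await page.goto(
    "https://www.google.com/search?q=events+in+delhi&ibp=htl;events&hl=en&gl=in",
    {
      waitUntil: "domcontentloaded",
      timeout: 60000,
    }
  );
  // Give the events panel 5 seconds to render.
  // Note: newer Puppeteer releases removed page.waitForTimeout(); if it is
  // unavailable, use `await new Promise((r) => setTimeout(r, 5000));` instead.
  await page.waitForTimeout(5000);
  // Scroll the events list until at least 20 items are collected
  let data = await scrollPage(page, ".UbEfxe", 20);
  console.log(data);
  await browser.close();
};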
Step-by-step explanation:
- puppeteer.launch() - This will launch the Chromium browser with the options we have set in our code. In our case, we are launching our browser in non-headless mode.
- browser.pages() - This returns the pages (tabs) that are already open in the browser. We reuse the first one instead of opening a new tab with browser.newPage().
- page.setExtraHTTPHeaders() - Optional and not called in the code above. It can be used to pass HTTP headers with every request the page initiates.
- page.goto() - This will navigate the page to the specified target URL.
- page.waitForTimeout() - It will cause the page to wait for 5 seconds before doing further operations.
- scrollPage() - At last, we call our infinite scroller to extract the data we need, passing the page, the selector for the scroller div, and the number of items we want as parameters.
Now, let us prepare the infinite scroller.
const scrollPage = async (page, scrollContainer, itemTargetCount) => {
  let items = [];
  let previousHeight = await page.evaluate(`document.querySelector("${scrollContainer}").scrollHeight`);
  while (itemTargetCount > items.length) {
    items = await extractItems(page);
    // Scroll the container to its current bottom to trigger loading of more events
    await page.evaluate(`document.querySelector("${scrollContainer}").scrollTo(0, document.querySelector("${scrollContainer}").scrollHeight)`);
    // Give newly loaded events time to render
    await page.waitForTimeout(2000);
    const currentHeight = await page.evaluate(`document.querySelector("${scrollContainer}").scrollHeight`);
    // Stop when the container did not grow, so the loop cannot run forever
    if (currentHeight <= previousHeight) break;
    previousHeight = currentHeight;
  }
  return items;
};
Step-by-step explanation:
- previousHeight - Scroll height of the container.
- extractItems() - Function to parse the scraped HTML.
- In the next step, we scroll the container down to its current scroll height, that is, to the bottom of the list.
- In the last step, we compare the container's new height with the previous height and wait 2 seconds for more results to load.
And, at last we will talk about our parser.
const extractItems = async (page) => {
  let events_results = await page.evaluate(() => {
    // Every event card is an li.PaEvOc element
    return Array.from(document.querySelectorAll("li.PaEvOc")).map((el) => {
      return {
        title: el.querySelector(".YOGjf")?.textContent,
        timings: el.querySelector(".cEZxRc")?.textContent,
        date: el.querySelector(".gsrt")?.textContent,
        address: Array.from(el.querySelectorAll(".zvDXNd")).map((el) => {
          return el.textContent;
        }),
        link: el.querySelector(".zTH3xc")?.getAttribute("href"),
        thumbnail: el.querySelector('.wA1Bge')?.getAttribute("src"),
        location_link: el.querySelector(".ozQmAd") ? "https://www.google.com" + el.querySelector(".ozQmAd")?.getAttribute("data-url") : "",
        tickets: Array.from(el.querySelectorAll('.RLN0we[jsname="CzizI"] div[data-domain]')).map((el) => {
          return {
            source: el?.getAttribute("data-domain"),
            link: el.querySelector(".SKIyM")?.getAttribute("href"),
          };
        }),
        venue_name: el.querySelector(".RVclrc")?.textContent,
        venue_rating: el.querySelector(".UIHjI")?.textContent,
        venue_reviews: el.querySelector(".z5jxId")?.textContent,
        venue_link: el.querySelector(".pzNwRe a") ? "" + el.querySelector(".pzNwRe a").getAttribute("href") : "",
      };
    });
  });
  // Drop keys whose value is missing (undefined or null), an empty string, or an empty array
  for (let i = 0; i < events_results.length; i++) {
    Object.keys(events_results[i]).forEach((key) =>
      events_results[i][key] == null || events_results[i][key] === "" || events_results[i][key].length === 0
        ? delete events_results[i][key]
        : {}
    );
  }
  return events_results;
};
Step-by-step explanation:
- document.querySelectorAll() - It will return all the elements that match the specified CSS selector. In our case, it is li.PaEvOc.
- getAttribute() - This will return the attribute value of the specified element.
- textContent - It returns the text content inside the selected HTML element.
The parser above does not call the next three methods, but they are useful when cleaning the scraped text:
- split() - Used to split a string into substrings with the help of a specified separator and return them as an array.
- trim() - Removes the spaces from the start and end of the string.
- replaceAll() - Replaces every occurrence of the specified pattern in the string.
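The loop at the end of extractItems removes fields that came back empty. The same idea can be written as a small standalone function, sketched here under the assumption that "empty" means undefined, null, an empty string, or an empty array. The name removeEmptyFields is hypothetical.

```javascript
// Hypothetical helper: return a copy of `item` without its empty fields.
// A field counts as empty when it is undefined, null, "" or an empty array.
const removeEmptyFields = (item) =>
  Object.fromEntries(
    Object.entries(item).filter(([, value]) => {
      if (value === undefined || value === null) return false;
      if (typeof value === "string" || Array.isArray(value)) return value.length > 0;
      return true;
    })
  );

const cleaned = removeEmptyFields({
  title: "Armaan Malik",
  venue_rating: undefined,
  location_link: "",
  tickets: [],
  address: ["New Delhi, Delhi"],
});
console.log(cleaned);
```

In this sample only title and address survive. Returning a new object instead of deleting keys in place keeps the original scraped record available for debugging.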
As you can see in the above image, all the data is under this parent tag li.PaEvOc.
return Array.from(document.querySelectorAll("li.PaEvOc")).map((el) => {
The below piece of data can be scraped easily with the help of the CSS selector gadget and some basic parsing skills.
title: el.querySelector(".YOGjf")?.textContent,
timings: el.querySelector(".cEZxRc")?.textContent,
date: el.querySelector(".gsrt")?.textContent,
address: Array.from(el.querySelectorAll(".zvDXNd")).map((el) => {
return el.textContent
}),
link: el.querySelector(".zTH3xc")?.getAttribute("href"),
thumbnail: el.querySelector('.wA1Bge')?.getAttribute("src"),
The address property contains more than one element in its container, so these elements are stored as a list in an array. A similar process is followed for scraping tickets.
I have also prepended the string https://www.google.com to location_link because the scraped URL is relative and would otherwise be incomplete.
location_link: el.querySelector(".ozQmAd") ? "https://www.google.com" + el.querySelector(".ozQmAd")?.getAttribute("data-url") : "",
That ternary operator says that if the element exists, we scrape it with the target selector; otherwise, we leave the value blank.
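An alternative to string concatenation is to resolve the scraped path with the built-in URL class. This is a minimal sketch; toAbsoluteUrl is a hypothetical helper, and the sample path is made up for illustration.

```javascript
// Hypothetical helper: resolve a possibly relative path against Google's origin.
// Returns "" when the value is missing, mirroring the ternary in the parser.
const toAbsoluteUrl = (path, base = "https://www.google.com") =>
  path ? new URL(path, base).href : "";

console.log(toAbsoluteUrl("/maps/place/abc?sa=X&hl=en"));
console.log(toAbsoluteUrl(null));
```

Functions defined in Node JS are not visible inside the page.evaluate() callback, so a helper like this would either be applied to the returned data or be redefined inside the callback.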
Similarly, with the above explanation, you can now scrape the venue details. Here is the complete code:
const puppeteer = require('puppeteer');

const extractItems = async (page) => {
  let events_results = await page.evaluate(() => {
    // Every event card is an li.PaEvOc element
    return Array.from(document.querySelectorAll("li.PaEvOc")).map((el) => {
      return {
        title: el.querySelector(".YOGjf")?.textContent,
        timings: el.querySelector(".cEZxRc")?.textContent,
        date: el.querySelector(".gsrt")?.textContent,
        address: Array.from(el.querySelectorAll(".zvDXNd")).map((el) => {
          return el.textContent;
        }),
        link: el.querySelector(".zTH3xc")?.getAttribute("href"),
        thumbnail: el.querySelector('.wA1Bge')?.getAttribute("src"),
        location_link: el.querySelector(".ozQmAd") ? "https://www.google.com" + el.querySelector(".ozQmAd")?.getAttribute("data-url") : "",
        tickets: Array.from(el.querySelectorAll('.mi3HuEAU05x__visible-container div')).map((el) => {
          return {
            source: el?.getAttribute("data-domain"),
            link: el.querySelector(".SKIyM")?.getAttribute("href"),
          };
        }),
        venue_name: el.querySelector(".RVclrc")?.textContent,
        venue_rating: el.querySelector(".UIHjI")?.textContent,
        venue_reviews: el.querySelector(".z5jxId")?.textContent,
        venue_link: el.querySelector(".pzNwRe a") ? "" + el.querySelector(".pzNwRe a").getAttribute("href") : "",
      };
    });
  });
  // Drop keys whose value is missing (undefined or null), an empty string, or an empty array
  for (let i = 0; i < events_results.length; i++) {
    Object.keys(events_results[i]).forEach((key) =>
      events_results[i][key] == null || events_results[i][key] === "" || events_results[i][key].length === 0
        ? delete events_results[i][key]
        : {}
    );
  }
  return events_results;
};

const scrollPage = async (page, scrollContainer, itemTargetCount) => {
  let items = [];
  let previousHeight = await page.evaluate(`document.querySelector("${scrollContainer}").scrollHeight`);
  while (itemTargetCount > items.length) {
    items = await extractItems(page);
    // Scroll the container to its current bottom to trigger loading of more events
    await page.evaluate(`document.querySelector("${scrollContainer}").scrollTo(0, document.querySelector("${scrollContainer}").scrollHeight)`);
    // Give newly loaded events time to render.
    // Note: newer Puppeteer releases removed page.waitForTimeout(); if it is
    // unavailable, use `await new Promise((r) => setTimeout(r, 2000));` instead.
    await page.waitForTimeout(2000);
    const currentHeight = await page.evaluate(`document.querySelector("${scrollContainer}").scrollHeight`);
    // Stop when the container did not grow, so the loop cannot run forever
    if (currentHeight <= previousHeight) break;
    previousHeight = currentHeight;
  }
  return items;
};

const getEventsData = async () => {
  // Launch Chromium with a visible window (non-headless mode)
  const browser = await puppeteer.launch({
    headless: false,
    args: ["--disable-setuid-sandbox", "--no-sandbox"],
  });
  // Reuse the tab that Chromium opens on launch
  const [page] = await browser.pages();
  await page.goto("https://www.google.com/search?q=events+in+delhi&ibp=htl;events&hl=en&gl=in", {
    waitUntil: 'domcontentloaded',
    timeout: 60000,
  });
  // Give the events panel 5 seconds to render
  await page.waitForTimeout(5000);
  // Scroll the events list until at least 20 items are collected
  let data = await scrollPage(page, ".UbEfxe", 20);
  console.log(data);
  await browser.close();
};

getEventsData();
Results:
Our result should look like this:
{
title: 'Armaan Malik',
timings: 'Sun, 7–10 pm',
date: '27Nov',
address: [
'DLF Avenue Saket, A4, Press Enclave Marg, Saket District Centre, District Centre, Sector 6, Pushp Vihar',
'New Delhi, Delhi'
],
link: 'https://insider.in/steppinout-presents-armaan-malik-next-2-you-india-tour-delhi-nov27-2022/event',
thumbnail: 'https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRHJ2mFEDeFEz-J7OqksfK1TBg_HTNtwKYPnscewHm1gQ&s=10',
location_link: 'https://www.google.com/maps/place//data=!4m2!3m1!1s0x390ce1f4d9f62005:0x3aee569514ba9326?sa=X&hl=en&gl=in',
tickets: [
{
source: 'Songkick.com',
link: 'http://www.songkick.com/concerts/40751584-armaan-malik-at-dlf-avenue-saket?utm_medium=organic&utm_source=microformat'
},
{
source: 'Insider.in',
link: 'https://insider.in/steppinout-presents-armaan-malik-next-2-you-india-tour-delhi-nov27-2022/event'
}
],
venue_name: 'DLF Avenue Saket',
venue_rating: '4.4',
venue_reviews: '39,064 reviews',
venue_link: 'https://www.google.com/search?hl=en&gl=in&q=DLF+Avenue+Saket&ludocid=4246426696954843942&ibp=gwp%3B0,7'
}
......
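All scraped values are strings. If numeric venue fields are needed, a post-processing step along these lines could convert them. This is only a sketch: normalizeVenue is a hypothetical name, and it assumes the rating and review count always arrive in the formats shown in the result above.

```javascript
// Hypothetical post-processing step: turn the scraped venue strings into numbers.
const normalizeVenue = (event) => ({
  ...event,
  // "4.4" becomes 4.4; stays undefined when the rating is missing
  venue_rating: event.venue_rating ? parseFloat(event.venue_rating) : undefined,
  // "39,064 reviews" becomes 39064; every non-digit character is stripped first
  venue_reviews: event.venue_reviews
    ? parseInt(event.venue_reviews.replace(/\D/g, ""), 10)
    : undefined,
});

const sample = normalizeVenue({
  venue_name: "DLF Avenue Saket",
  venue_rating: "4.4",
  venue_reviews: "39,064 reviews",
});
console.log(sample);
```

Abbreviated counts such as "1.2K reviews" would not survive this conversion correctly, so the helper would need extending if Google returns them.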
Conclusion:
In this tutorial, we learned to scrape Google Events Results using Node JS. Feel free to message me if I missed something. Follow me on Twitter. Thanks for reading!
Additional Resources
- Web Scraping Google With Node JS - A Complete Guide
- Web Scraping Google Images
- Scrape Google News Results
- Scrape Google Maps Reviews
Author:
My name is Darshan and I am the founder of serpdog.io. I love to create scrapers. I am currently working for several MNCs to provide them Google Search Data through a seamless data pipeline.