This post was originally published on the Crawlbase Blog.
Forbes is a business and financial news site covering industries, companies, and people around the world. It gets millions of visits every month and publishes billionaire rankings, business trends, and analysis. Forbes loads much of its content dynamically with JavaScript, so it is tricky to scrape with tools that only fetch static HTML.
This tutorial will show you how to scrape Forbes data using Puppeteer, a Node.js library that controls a headless browser. Once you get the basics down, we'll cover how to use the Crawlbase Crawling API to optimize your data extraction. With these tools, you can collect Forbes data for research, analysis, or personal projects.
Why Scrape Data from Forbes?
Forbes holds a wealth of business, financial, and lifestyle information. Scraping it lets you track anything from current business trends to changes in billionaires' wealth. Here are some key reasons to scrape data from Forbes:
- Billionaire Rankings: Forbes is best known for its global billionaire rankings. This data can be scraped to see how wealth has evolved over time.
- Company Information: Forbes publishes detailed company profiles that help you assess how a business is doing.
- Industry Insights: Forbes provides articles on various sectors, including technology, finance, and healthcare. Scrape this data to follow specific industries and trends.
- Financial News: Forbes publishes real-time news and updates on the world economy and markets. Scrape this data to keep track of significant financial events.
Key Data Points to Scrape from Forbes
When scraping Forbes, there are many data points you may want to extract. Some of the essential ones are:
- Billionaire Profiles: Forbes provides in-depth biographies of the wealthiest individuals on the planet. These profiles contain wealth source, industry, net worth, and country of origin.
- Company Profiles: Forbes provides comprehensive data about businesses, such as revenue, headcount, and sector. Use this data to compare businesses or keep an eye on particular industries over time.
- Top Lists: Forbes is well-known for its "Top" lists, which include the top 100 billionaires, the top multinational corporations, and the top startups.
- Articles and News: Forbes features breaking news and in-depth articles on business, finance, and lifestyle. To keep up with the most recent news, trends, and expert opinions from the sector, scrape Forbes articles.
- Market Data: Financial information such as stock prices, market trends, and economic projections is available on the website. To keep track of the financial markets and gain real-time insights, scrape Forbes market data.
Setting Up Your Scraping Environment
To scrape Forbes data, we first need to set up the project environment: install Node.js, Puppeteer, and the other required libraries. Follow the steps below.
Installing Puppeteer
Puppeteer is a Node.js library that provides a high-level API to control Chrome or Chromium, perfect for scraping dynamic content like Forbes. To install Puppeteer, follow these steps:
- Make sure Node.js is installed on your system. You can download it from the official Node.js website.
- Once you have Node.js, open your terminal and run the following command to install Puppeteer:
npm install puppeteer
This command installs Puppeteer along with a compatible browser build (Chromium or Chrome for Testing, depending on the Puppeteer version), which Puppeteer uses to run a headless browser for scraping websites.
Setting Up Your Project
Puppeteer is installed. Now set up your project folder and initialize Node.js. Follow these steps:
- Create a new directory for your project:
mkdir forbes-scraper
cd forbes-scraper
- Initialize a new Node.js project by running the following command:
npm init -y
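This command creates a package.json file, which manages your project dependencies. If you installed Puppeteer before creating this folder, run npm install puppeteer again inside it so the dependency is recorded in this project.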
This completes the setup for your Forbes scraping environment. Next, we’ll dive into writing the Puppeteer scraper.
Scraping Forbes with Puppeteer
Now that we have our environment set up, we’ll start scraping Forbes with Puppeteer. In this section, we’ll inspect the HTML, write the scraper, handle dynamic content, and store the scraped data in a JSON file. For this example, we’ll be scraping the Forbes World's Billionaires List 2024.
Inspecting the HTML Structure
Before we write the scraper, let’s inspect the Forbes website’s HTML. This will help us identify the key elements that contain the data.
Inspecting the Billionaires List Page
- Visit the Page: Go to the Forbes World's Billionaires List.
- Open Developer Tools: Right-click anywhere on the page and select "Inspect", or press Ctrl+Shift+I, to open Developer Tools.
- Look for Key Elements:
  - Billionaire Names/Links: Typically contained in <a> tags with classes like color-link. This is where you get the link to each billionaire's profile.
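Because a class like color-link can also match links that are not profiles, it may help to filter the collected hrefs by URL shape rather than by position. The sketch below is a hypothetical helper, not part of the original scraper, and it assumes that Forbes profile URLs use a /profile/ path; verify that against the links you actually see in Developer Tools.

```javascript
// Hypothetical helper: keep only links that look like Forbes profile pages.
// Assumption: profile URLs live under a "/profile/" path on forbes.com.
function filterProfileLinks(hrefs) {
  const seen = new Set();
  const result = [];
  for (const href of hrefs) {
    let url;
    try {
      url = new URL(href);
    } catch {
      continue; // skip malformed or relative hrefs
    }
    const onForbes = url.hostname === 'www.forbes.com' || url.hostname === 'forbes.com';
    const isProfile = onForbes && url.pathname.startsWith('/profile/');
    // De-duplicate by path so the same profile is visited only once
    if (isProfile && !seen.has(url.pathname)) {
      seen.add(url.pathname);
      result.push(url.origin + url.pathname); // query string is dropped
    }
  }
  return result;
}

const sampleLinks = [
  'https://www.forbes.com/billionaires/',
  'https://www.forbes.com/profile/bernard-arnault/?list=billionaires',
  'https://www.forbes.com/profile/bernard-arnault/',
  'https://www.forbes.com/profile/elon-musk/',
];
console.log(filterProfileLinks(sampleLinks));
```

Filtering by URL is less brittle than skipping a fixed number of leading links, but it still depends on the site's URL scheme staying the same.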
Scraping Each Billionaire’s Profile
- Navigate to a Profile: Click on a link from the list to open the billionaire’s profile page.
- Open Developer Tools: Right-click anywhere on the page and select "Inspect", or press Ctrl+Shift+I, to open Developer Tools.
- Key Elements to Look For:
  - Rank: Look for the rank, typically inside a <div> or <span> with a class like listuser-item__list--rank.
  - Name: Usually inside a header tag, like <h1> with a class like listuser-header__name.
  - Organization: Found in either an <a> or <span> element with organization-related classes.
  - Net Worth: Typically inside a <div> with classes like profile-info__item-value.
  - Biography: Often found inside an unordered list (<ul>) element.
  - Additional Data: Titles and texts could be found in elements with classes like profile-stats__title and profile-stats__text.
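Since these class names can change whenever Forbes updates its markup, one option is to keep them in a single map instead of scattering them through the scraper. The following is a minimal sketch: PROFILE_SELECTORS and extractFields are illustrative names, and getText stands in for whatever lookup your scraper uses (Puppeteer, cheerio, or a test stub).

```javascript
// Selectors observed during inspection, kept in one place so they are easy to update.
const PROFILE_SELECTORS = {
  rank: '.listuser-item__list--rank',
  name: 'h1.listuser-header__name',
  netWorth: 'div.profile-info__item-value',
};

// Hypothetical helper: getText(selector) returns the element text, or null if nothing matched.
function extractFields(getText, selectors, fallback = 'N/A') {
  const record = {};
  for (const [field, selector] of Object.entries(selectors)) {
    const text = getText(selector);
    record[field] = text ? text.trim() : fallback;
  }
  return record;
}

// Stand-in for a real page: maps selectors to text, with netWorth deliberately missing.
const fakePage = {
  '.listuser-item__list--rank': ' #1 ',
  'h1.listuser-header__name': 'Bernard Arnault & family',
};
const profile = extractFields((selector) => fakePage[selector] ?? null, PROFILE_SELECTORS);
console.log(profile);
```

Missing elements fall back to 'N/A', so one absent field does not discard the whole record.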
Writing the Puppeteer Scraper
Now, we can write the Puppeteer scraper. The following code demonstrates how to launch Puppeteer, open the Forbes page, and scrape key data points.
Example Code:
const puppeteer = require('puppeteer');
const fs = require('fs');

async function scrapeBillionaires() {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Go to Forbes Billionaires list (timeout: 0 disables the navigation timeout)
  await page.goto(
    'https://www.forbes.com/sites/chasewithorn/2024/04/02/forbes-worlds-billionaires-list-2024-the-top-200/?sh=67b3016430a7',
    { timeout: 0 },
  );

  // Collect the link targets; slice(2) skips the first two matches
  const links = await page.$$eval('a.color-link', (links) => links.slice(2).map((link) => link.href));
  const billionaireList = [];

  for (const link of links) {
    try {
      await page.goto(link, { timeout: 0 });

      // Get rank
      const rank = await page.$eval('.listuser-item__list--rank', (el) => el.innerText.trim()).catch(() => 'N/A');

      // Get name
      const name = await page.$eval('h1.listuser-header__name', (el) => el.innerText.trim()).catch(() => 'N/A');

      // Get title
      const title = await page
        .$eval('div.listuser-header__headline-default', (el) => el.innerText.trim())
        .catch(() => 'N/A');

      // Get organization
      const organization = await page
        .$eval('a.listuser-header__organization', (el) => el.innerText.trim())
        .catch(() => 'N/A');

      // Get net worth
      const netWorth = await page.$eval('div.profile-info__item-value', (el) => el.innerText.trim()).catch(() => 'N/A');

      // Get biography text (the first <ul> on the page)
      const bio = await page.$eval('ul', (el) => el.innerText.trim()).catch(() => 'N/A');

      // Get additional stack data (title/text pairs from the stats section)
      const stackData = await page.evaluate(() => {
        const data = {};
        const titles = Array.from(document.querySelectorAll('.profile-stats__title'));
        const texts = Array.from(document.querySelectorAll('.profile-stats__text'));
        titles.forEach((title, i) => {
          // Guard against a title that has no matching text element
          data[title.innerText.trim()] = texts[i] ? texts[i].innerText.trim() : 'N/A';
        });
        return data;
      });

      // Push data to billionaireList
      billionaireList.push({
        Rank: rank,
        Name: name,
        Title: title,
        Organization: organization,
        NetWorth: netWorth,
        Stack: stackData,
        Bio: bio,
      });
    } catch (err) {
      console.log(`Error scraping ${link}: ${err}`);
    }
  }

  await browser.close();
  return billionaireList;
}

scrapeBillionaires().then((data) => {
  console.log(data); // Output data to console
});
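To handle dynamic content more explicitly, one approach is to wait for a selector that signals the page has rendered before reading any fields. The sketch below wraps Puppeteer's page.goto and page.waitForSelector in a hypothetical helper called gotoAndWait. The 60-second timeout and the 'networkidle2' setting are illustrative choices, and a stub page object is used here so the snippet runs without launching a browser.

```javascript
// Hypothetical helper: navigate, then wait until `selector` exists in the DOM.
// Returns true if the selector appeared, false if the wait timed out.
async function gotoAndWait(page, url, selector, timeoutMs = 60000) {
  // 'networkidle2' resolves once there are no more than 2 network connections for 500 ms
  await page.goto(url, { waitUntil: 'networkidle2', timeout: timeoutMs });
  try {
    await page.waitForSelector(selector, { timeout: timeoutMs });
    return true;
  } catch (err) {
    return false; // selector never appeared; the caller can skip this page
  }
}

// Stub standing in for a Puppeteer page, so the example runs on its own.
const stubCalls = [];
const stubPage = {
  goto: async (url, options) => { stubCalls.push(`goto ${url} ${options.waitUntil}`); },
  waitForSelector: async (selector) => { stubCalls.push(`wait ${selector}`); },
};

const waitResult = gotoAndWait(stubPage, 'https://www.forbes.com/billionaires/', 'h1.listuser-header__name');
waitResult.then((ok) => console.log(ok, stubCalls));
```

In a real scraper you would pass the page returned by browser.newPage() and skip the profile when the helper returns false, instead of disabling timeouts altogether.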
Storing Data in a JSON File
Once the data is scraped, we need to save it in a structured format like JSON for later use.
Example Code:
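function saveDataToFile(data, filename = 'forbes_billionaires.json') {
  // Pretty-print with 2-space indentation so the file is easy to read
  fs.writeFileSync(filename, JSON.stringify(data, null, 2), 'utf-8');
  console.log(`Data saved to ${filename}`);
}

// Use this call in place of the console.log version so the scraper runs only once
scrapeBillionaires().then((data) => {
  saveDataToFile(data);
});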
This stores all the scraped profiles in a forbes_billionaires.json file, making the data easy to access and use in the future.
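If you also want to open the results in a spreadsheet, the same records can be flattened to CSV. This is a minimal sketch using only built-in JavaScript: toCsv and csvEscape are illustrative helpers, and the column list is an assumption based on the field names used by the scraper.

```javascript
// Quote a value if it contains a comma, quote, or line break (double any inner quotes).
function csvEscape(value) {
  const text = String(value ?? '');
  return /[",\n\r]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Build a CSV string from an array of records, using the given column order.
function toCsv(records, columns) {
  const header = columns.map(csvEscape).join(',');
  const rows = records.map((record) => columns.map((col) => csvEscape(record[col])).join(','));
  return [header, ...rows].join('\n') + '\n';
}

const sampleRecords = [
  { Rank: '#1', Name: 'Bernard Arnault & family', NetWorth: '$219.2B' },
  { Rank: '#2', Name: 'Elon Musk', NetWorth: '$189.2B' },
];
const csv = toCsv(sampleRecords, ['Rank', 'Name', 'NetWorth']);
console.log(csv);
// To save it: require('fs').writeFileSync('forbes_billionaires.csv', csv, 'utf-8');
```

Nested fields such as Stack are left out here; you would need to pick individual keys from them and add those as columns.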
Complete Code Example
Here’s the complete code that combines all the steps:
const puppeteer = require('puppeteer');
const fs = require('fs');

async function scrapeBillionaires() {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Go to Forbes Billionaires list (timeout: 0 disables the navigation timeout)
  await page.goto(
    'https://www.forbes.com/sites/chasewithorn/2024/04/02/forbes-worlds-billionaires-list-2024-the-top-200/?sh=67b3016430a7',
    { timeout: 0 },
  );

  // Collect the link targets; slice(2) skips the first two matches
  const links = await page.$$eval('a.color-link', (links) => links.slice(2).map((link) => link.href));
  const billionaireList = [];

  for (const link of links) {
    try {
      await page.goto(link, { timeout: 0 });

      // Get rank
      const rank = await page.$eval('.listuser-item__list--rank', (el) => el.innerText.trim()).catch(() => 'N/A');

      // Get name
      const name = await page.$eval('h1.listuser-header__name', (el) => el.innerText.trim()).catch(() => 'N/A');

      // Get title
      const title = await page
        .$eval('div.listuser-header__headline-default', (el) => el.innerText.trim())
        .catch(() => 'N/A');

      // Get organization
      const organization = await page
        .$eval('a.listuser-header__organization', (el) => el.innerText.trim())
        .catch(() => 'N/A');

      // Get net worth
      const netWorth = await page.$eval('div.profile-info__item-value', (el) => el.innerText.trim()).catch(() => 'N/A');

      // Get biography text (the first <ul> on the page)
      const bio = await page.$eval('ul', (el) => el.innerText.trim()).catch(() => 'N/A');

      // Get additional stack data (title/text pairs from the stats section)
      const stackData = await page.evaluate(() => {
        const data = {};
        const titles = Array.from(document.querySelectorAll('.profile-stats__title'));
        const texts = Array.from(document.querySelectorAll('.profile-stats__text'));
        titles.forEach((title, i) => {
          // Guard against a title that has no matching text element
          data[title.innerText.trim()] = texts[i] ? texts[i].innerText.trim() : 'N/A';
        });
        return data;
      });

      // Push data to billionaireList
      billionaireList.push({
        Rank: rank,
        Name: name,
        Title: title,
        Organization: organization,
        NetWorth: netWorth,
        Stack: stackData,
        Bio: bio,
      });
    } catch (err) {
      console.log(`Error scraping ${link}: ${err}`);
    }
  }

  await browser.close();
  return billionaireList;
}

function saveDataToFile(data, filename = 'forbes_billionaires.json') {
  // Pretty-print with 2-space indentation so the file is easy to read
  fs.writeFileSync(filename, JSON.stringify(data, null, 2), 'utf-8');
  console.log(`Data saved to ${filename}`);
}

scrapeBillionaires().then((data) => {
  saveDataToFile(data);
});
Example Output:
[
{
"Rank":"#1",
"Name":"Bernard Arnault & family",
"Title":"Chairman And CEO, LVMH Moët Hennessy Louis Vuitton",
"Organization":"LVMH Moët Hennessy Louis Vuitton",
"NetWorth":"$219.2B",
"Stack":{
"Age":"75",
"Source of Wealth":"LVMH",
"Residence":"Paris, France",
"Citizenship":"France",
"Marital Status":"Married",
"Children":"5",
"Education":"Bachelor of Arts/Science, Ecole Polytechnique de Paris"
},
"Bio":"Bernard Arnault oversees the LVMH empire of 75 fashion and cosmetics brands, including Louis Vuitton and Sephora.\nLVMH acquired American jeweler Tiffany & Co in 2021 for $15.8 billion, believed to be the biggest luxury brand acquisition ever.\nArnault's holding company Agache backs venture capital firm Aglaé Ventures, which has investments in businesses such as Netflix and TikTok parent company ByteDance.\nHis father made a small fortune in construction; Arnault got his start by putting up $15 million from that business to buy Christian Dior in 1984.\nArnault's five children all work at LVMH; in July 2022, he proposed a reorganization of his holding company Agache to give them equal stakes."
},
{
"Rank":"#2",
"Name":"Elon Musk",
"Title":"CEO, Tesla",
"Organization":"Tesla",
"NetWorth":"$189.2B",
"Stack":{
"Age":"52",
"Source of Wealth":"Tesla, SpaceX, Self Made",
"Self-Made Score":"8",
"Philanthropy Score":"1",
"Residence":"Austin, Texas",
"Citizenship":"United States",
"Marital Status":"Single",
"Children":"11",
"Education":"Bachelor of Arts/Science, University of Pennsylvania"
},
"Bio":"Elon Musk cofounded six companies, including electric car maker Tesla, rocket producer SpaceX and tunneling startup Boring Company.\nHe owns about 12% of Tesla excluding options, but has pledged more than half his shares as collateral for personal loans of up to $3.5 billion.\nIn early 2024, a Delaware judge voided Musk's 2018 deal to receive options equaling an additional 9% of Tesla. Forbes has discounted the options by 50% pending Musk's appeal.\nSpaceX, founded in 2002, is worth nearly $180 billion after a December 2023 tender offer of up to $750 million; SpaceX stock has quintupled its value in four years.\nMusk bought Twitter in 2022 for $44 billion, after later trying to back out of the deal. He owns an estimated 74% of the company, now called X.\nForbes estimates that Musk's stake in X is now worth nearly 70% less than he paid for it based on investor Fidelity's valuation of the company as of December 2023."
},
{
"Rank":"#3",
"Name":"Jeff Bezos",
"Title":"Chairman And Founder, Amazon",
"Organization":"Amazon",
"NetWorth":"$202.4B",
"Stack":{
"Age":"60",
"Source of Wealth":"Amazon, Self Made",
"Self-Made Score":"8",
"Philanthropy Score":"2",
"Residence":"Miami, Florida",
"Citizenship":"United States",
"Marital Status":"Engaged",
"Children":"4",
"Education":"Bachelor of Arts/Science, Princeton University"
},
"Bio":"Jeff Bezos founded e-commerce giant Amazon in 1994 out of his Seattle garage.\nBezos stepped down as CEO to become executive chairman in 2021. He owns a bit less than 10% of the company.\nHe and his wife MacKenzie divorced in 2019 after 25 years of marriage and he transferred a quarter of his then-16% Amazon stake to her.\nBezos donated more than $1.1 million worth of stock to nonprofits in 2023, though it's unclear which organizations received those shares\nHe owns The Washington Post and Blue Origin, an aerospace company developing rockets; he briefly flew to space in one in July 2021.\nBezos said in a November 2022 interview with CNN that he plans to give away the majority of his wealth in his lifetime, without disclosing specific details."
},
{
"Rank":"#4",
"Name":"Mark Zuckerberg",
"Title":"Cofounder, Meta Platforms",
"Organization":"Meta Platforms",
"NetWorth":"$184.3B",
"Stack":{
"Age":"39",
"Source of Wealth":"Facebook, Self Made",
"Self-Made Score":"8",
"Philanthropy Score":"2",
"Residence":"Palo Alto, California",
"Citizenship":"United States",
"Marital Status":"Married",
"Children":"3",
"Education":"Drop Out, Harvard University"
},
"Bio":"A 19-year-old Mark Zuckerberg started Facebook in 2004 for students to match names with photos of classmates.\nZuckerberg took Facebook public in 2012; he now owns about 13% of the company's stock.\nFacebook changed its name to Meta in 2021 to shift the company's focus to the metaverse.\nIn 2015, Zuckerberg and his wife, Priscilla Chan, pledged to give away 99% of their Meta stake over their lifetimes."
},
.... more
]
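The net worth values are scraped as display strings such as "$219.2B", which cannot be sorted or summed directly. The sketch below shows one way to convert them to plain numbers. parseNetWorth is an illustrative helper, and the suffix table assumes Forbes uses K, M, B, and T abbreviations, so adjust it if the site formats values differently.

```javascript
// Assumed suffixes for thousands, millions, billions, and trillions.
const MULTIPLIERS = { K: 1e3, M: 1e6, B: 1e9, T: 1e12 };

// Convert a string like "$219.2B" to a number of dollars; returns null if it cannot be parsed.
function parseNetWorth(text) {
  const match = /^\$?\s*([\d,]+(?:\.\d+)?)\s*([KMBT])?$/i.exec(String(text).trim());
  if (!match) return null;
  const amount = Number(match[1].replace(/,/g, ''));
  const multiplier = match[2] ? MULTIPLIERS[match[2].toUpperCase()] : 1;
  // Round to whole dollars to avoid floating-point noise
  return Math.round(amount * multiplier);
}

console.log(parseNetWorth('$219.2B')); // 219200000000
console.log(parseNetWorth('N/A')); // null

// Example: sort scraped records by numeric net worth, largest first.
const scraped = [
  { Name: 'Elon Musk', NetWorth: '$189.2B' },
  { Name: 'Bernard Arnault & family', NetWorth: '$219.2B' },
];
const sorted = [...scraped].sort((a, b) => parseNetWorth(b.NetWorth) - parseNetWorth(a.NetWorth));
console.log(sorted.map((entry) => entry.Name));
```

Records whose net worth is 'N/A' return null, so filter them out before sorting or aggregating.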
In the next section, we’ll discuss how to optimize Forbes scraping using Crawlbase Crawling API.
Optimize Forbes Scraping with Crawlbase Crawling API
Puppeteer works well for scraping dynamic websites, but it is slow when you need to collect large amounts of data from JavaScript-heavy pages like Forbes. To improve scraping performance, we can use the Crawlbase Crawling API, which simplifies handling JavaScript-rendered content and gives more control and efficiency.
Introduction to Crawlbase Crawling API
Crawlbase Crawling API handles common web scraping challenges like CAPTCHAs, dynamic content loading, and complex HTML structures. For scraping Forbes, Crawlbase offers a streamlined solution by handling JavaScript rendering directly, making it a more efficient alternative to Puppeteer for large scraping projects.
Why use Crawlbase for Forbes scraping?
- Handles dynamic content: Optimized for JavaScript-heavy pages like Forbes.
- Improved speed and scalability: No need to run headless browsers yourself, so scraping is faster.
- Simplifies the process: Simple API calls to scrape data, with built-in handling of CAPTCHAs and anti-scraping mechanisms.
How to Use Crawlbase with Forbes
To scrape Forbes using Crawlbase, you need to sign up and get your API token. Here’s how to get started:
- Sign up for Crawlbase: Create an account on Crawlbase and get your API token. You need the JS token for Forbes, because its pages are rendered with JavaScript.
- Install the Libraries: In your Node.js environment, install the Crawlbase Crawling API library, along with cheerio, which the example below uses to parse HTML:
npm install crawlbase cheerio
- Set up your request: Initialize the Crawlbase API with your token and make GET requests to scrape Forbes data.
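To make it clearer what a GET request through the API involves, the sketch below builds the request URL by hand. It assumes the Crawling API endpoint is https://api.crawlbase.com/ and takes token and url query parameters plus optional ones such as ajax_wait and page_wait. Treat that as an assumption and confirm the details in the Crawlbase documentation. buildCrawlbaseUrl is a hypothetical helper, and in practice the crawlbase library builds this request for you.

```javascript
// Hypothetical helper: build a Crawling API request URL.
// Assumption: the endpoint and parameter names match the public Crawlbase docs.
function buildCrawlbaseUrl(token, targetUrl, options = {}) {
  // URLSearchParams percent-encodes the target URL so it survives as a single parameter
  const params = new URLSearchParams({ token, url: targetUrl, ...options });
  return `https://api.crawlbase.com/?${params.toString()}`;
}

const requestUrl = buildCrawlbaseUrl('CRAWLBASE_JS_TOKEN', 'https://www.forbes.com/billionaires/', {
  ajax_wait: 'true',
  page_wait: '5000',
});
console.log(requestUrl);
```

The target URL has to be encoded. Otherwise its own query string would be read as parameters of the API call.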
Code Example with Crawlbase
Here’s a code example using the Crawlbase JavaScript library to scrape Forbes data more efficiently:
Example Code:
const { CrawlingAPI } = require('crawlbase');
const cheerio = require('cheerio');
const fs = require('fs');

// Initialize Crawlbase API with your access token
const api = new CrawlingAPI({ token: 'CRAWLBASE_JS_TOKEN' });

async function fetchForbesHTML(url) {
  try {
    const response = await api.get(url, {
      ajax_wait: 'true', // Wait for AJAX requests to complete
      page_wait: '5000', // Wait 5000 ms before the HTML is captured
    });
    if (response.statusCode === 200) {
      return response.body;
    }
    console.log(`Failed to fetch data. Status code: ${response.statusCode}`);
    return null;
  } catch (error) {
    console.error(`Error fetching data: ${error}`);
    return null;
  }
}

async function parseForbesData(html) {
  const $ = cheerio.load(html);
  const billionaireList = [];

  // Collect the links first. cheerio's .each() does not wait for async callbacks,
  // so the detail pages are fetched in a for...of loop that awaits each request.
  const links = $('.color-link')
    .slice(2)
    .map((i, el) => $(el).attr('href'))
    .get();

  for (const link of links) {
    try {
      const detailPageHtml = await fetchForbesHTML(link);
      if (!detailPageHtml) continue; // Fetch failed and was already logged
      const $page = cheerio.load(detailPageHtml);

      const rank = $page('.listuser-item__list--rank').text().trim() || 'N/A';
      const name = $page('h1.listuser-header__name').text().trim() || 'N/A';
      const title = $page('div.listuser-header__headline-default').text().trim() || 'N/A';
      const organization = $page('a.listuser-header__organization').text().trim() || 'N/A';
      const netWorth = $page('div.profile-info__item-value').text().trim() || 'N/A';
      const bio = $page('ul').text().trim() || 'N/A';

      // Pair each stat title with the text element at the same index
      const stackData = {};
      $page('.profile-stats__title').each((i, el) => {
        const statTitle = $page(el).text().trim();
        const statText = $page('.profile-stats__text').eq(i).text().trim();
        stackData[statTitle] = statText;
      });

      billionaireList.push({
        Rank: rank,
        Name: name,
        Title: title,
        Organization: organization,
        NetWorth: netWorth,
        Stack: stackData,
        Bio: bio,
      });
    } catch (err) {
      console.error(`Error parsing data for ${link}: ${err}`);
    }
  }

  return billionaireList;
}

function saveToFile(data, filename = 'forbes_billionaires.json') {
  fs.writeFileSync(filename, JSON.stringify(data, null, 2), 'utf-8');
  console.log(`Data saved to ${filename}`);
}

(async function () {
  const url =
    'https://www.forbes.com/sites/chasewithorn/2024/04/02/forbes-worlds-billionaires-list-2024-the-top-200/?sh=67b3016430a7';
  const html = await fetchForbesHTML(url);
  if (html) {
    const data = await parseForbesData(html);
    saveToFile(data);
  }
})();
Explanation of the Code:
- Initialize Crawlbase: CrawlingAPI is initialized with your Crawlbase token to access the API for scraping.
- GET request: We use api.get() to fetch the Forbes URL. We use ajax_wait and page_wait to make sure all dynamic content loads.
- HTML Parsing: We use cheerio to parse the HTML and extract key data points.
- Data Storage: The extracted data is saved to a JSON file.
This approach makes scraping Forbes more efficient because Crawlbase handles JavaScript rendering and complex content structures for you.
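Because each profile is an independent API call, the detail pages do not have to be fetched strictly one after another. The sketch below shows a small concurrency limiter that could wrap a fetch function. mapWithConcurrency is an illustrative helper, and the limit of 2 is an arbitrary example, so choose one that fits your Crawlbase plan's rate limits. A fake fetch function is used so the snippet runs on its own.

```javascript
// Run `worker` over `items` with at most `limit` calls in flight; results keep input order.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  // Each runner repeatedly claims the next unprocessed index until none are left.
  async function run() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const runners = Array.from({ length: Math.min(limit, items.length) }, run);
  await Promise.all(runners); // rejects if any worker call throws
  return results;
}

// Stand-in for a real fetch function, resolving after a short delay.
const fakeFetch = (link) =>
  new Promise((resolve) => setTimeout(() => resolve(`html for ${link}`), 10));

const profileLinks = ['/profile/a/', '/profile/b/', '/profile/c/'];
const fetched = mapWithConcurrency(profileLinks, 2, fakeFetch);
fetched.then((pages) => console.log(pages));
```

If the worker should never reject (for example, a fetch function that returns null on failure), one failed page will not abort the whole batch.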
Optimize Forbes Scraping with Crawlbase
Whether you’re analyzing business trends, financial news, or top company rankings, scraping data from Forbes can be very useful. While tools like Puppeteer are great for handling JavaScript-rendered pages, they are time-consuming and resource-heavy. Using the Crawlbase Crawling API simplifies the process and makes scraping dynamic content faster.
Follow this guide to scrape Forbes data and scale your projects with Crawlbase. If you're looking to expand your web scraping capabilities, consider exploring our guides on scraping other important websites:
📜 How to Scrape Monster.com
📜 How to Scrape Groupon
📜 How to Scrape TechCrunch
📜 How to Scrape X.com Tweet Pages
📜 How to Scrape Clutch.co
If you have any questions or feedback, our support team is always available to assist you on your web scraping journey. Happy scraping!
Frequently Asked Questions
Q. Is scraping Forbes legal?
Scraping any website, including Forbes, should be done in compliance with their terms of service. Always check the website's robots.txt file and ensure you are not violating any terms regarding data extraction. Using APIs like Crawlbase helps you scrape efficiently while adhering to best practices.
Q. Why should I use Crawlbase Crawling API instead of Puppeteer for scraping Forbes?
While Puppeteer is a powerful tool for handling JavaScript rendering, it can be slow and resource-intensive. Crawlbase Crawling API simplifies the process by offering pre-configured options for handling dynamic content, which speeds up scraping and reduces the effort needed to manage browser sessions manually.
Q. How can I handle dynamic content on Forbes when scraping?
Forbes uses JavaScript to load much of its content dynamically. With Puppeteer, the page is rendered in a real browser before you read it; with the Crawlbase Crawling API, options like ajax_wait and page_wait make the API wait for the content to load. Either way, you can ensure the content is fully loaded before scraping, so you capture all relevant data from the page.