Jonathan Geiger

Originally published at capturekit.dev

How to Extract All Links from a Website Using Puppeteer

Extracting all links from a website is a common task in web scraping and automation. Whether you're building a crawler, analyzing a website's structure, or gathering data, having access to all links can be invaluable. In this guide, we'll explore two approaches: using Puppeteer for manual extraction and using CaptureKit API for a simpler solution.

Method 1: Using Puppeteer

Puppeteer is a powerful Node.js library that provides a high-level API to control Chrome or Chromium over the DevTools Protocol. If you haven't installed it yet, add it to your project with npm:
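
npm install puppeteer

Once it's installed, here's how you can use it to extract all URLs from a website: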

const puppeteer = require('puppeteer');

async function extractLinks(url) {
    // Launch the browser
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    try {
        // Navigate to the URL
        await page.goto(url, { waitUntil: 'networkidle0' });

        // Extract all links
        const links = await page.evaluate(() => {
            const anchors = document.querySelectorAll('a');
            return Array.from(anchors).map((anchor) => anchor.href);
        });

        // Remove duplicates
        const uniqueLinks = [...new Set(links)];

        return uniqueLinks;
    } catch (error) {
        console.error('Error:', error);
        throw error;
    } finally {
        await browser.close();
    }
}

// Usage example
async function main() {
    const url = 'https://example.com';
    const links = await extractLinks(url);
    console.log('Found links:', links);
}

main().catch(console.error);

This code will:

  1. Launch a headless browser using Puppeteer
  2. Navigate to the specified URL
  3. Extract all <a> tags from the page
  4. Get their href attributes
  5. Remove any duplicate links
  6. Return the unique list of URLs

Handling Dynamic Content

If you're dealing with a website that loads content dynamically, you might need to wait for the content to load:

// Wait for specific elements to load
await page.waitForSelector('a');

// Or wait for network to be idle
await page.waitForNetworkIdle();
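
For example, you can add a wait right after page.goto(...) in extractLinks so that links rendered by client-side JavaScript exist in the DOM before the extraction runs. This is a minimal sketch; the 5-second timeout and the choice to continue when no links appear are assumptions you may want to adjust:

// Inside extractLinks, after page.goto(url, { waitUntil: 'networkidle0' }):
try {
    // Give client-side rendering up to 5 seconds to produce at least one <a> element
    await page.waitForSelector('a', { timeout: 5000 });
} catch (error) {
    // No links appeared in time; continue and extract whatever is on the page
    console.warn('No links appeared within 5 seconds, extracting anyway');
}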

Filtering Links

You can also filter links based on specific criteria:

const links = await page.evaluate(() => {
    const anchors = document.querySelectorAll('a');
    return Array.from(anchors)
        .map((anchor) => anchor.href)
        .filter((href) => {
            // Filter out external links
            return href.startsWith('https://example.com');
            // Or filter by specific patterns
            // return href.includes('/blog/');
        });
});
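
If you'd rather not hardcode the domain, a variation is to compare each link against the page's own origin inside the evaluate call. The sketch below assumes you only want absolute http(s) URLs on the same origin; invalid or non-web links (mailto:, javascript:, and so on) are skipped:

const internalLinks = await page.evaluate(() => {
    const origin = window.location.origin;
    return Array.from(document.querySelectorAll('a[href]'))
        .map((anchor) => anchor.href)
        .filter((href) => {
            try {
                // Keep only links whose origin matches the current page
                return new URL(href).origin === origin;
            } catch {
                return false; // Not a valid absolute URL
            }
        });
});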

Method 2: Using CaptureKit API (Recommended)

While Puppeteer is powerful, setting up and maintaining a web scraping solution can be time-consuming and complex. That's where CaptureKit API comes in. Our API provides a simple, reliable way to extract all links from any website, with additional features like link categorization and metadata extraction.

Here's how to use CaptureKit API:

curl "https://api.capturekit.dev/content?url=https://tailwindcss.com&access_key=YOUR_ACCESS_KEY"

The API response includes categorized links and additional metadata:

{
    "success": true,
    "data": {
        "links": {
            "internal": ["https://tailwindcss.com/", "https://tailwindcss.com/docs"],
            "external": ["https://tailwindui.com", "https://shopify.com"],
            "social": [
                "https://github.com/tailwindlabs/tailwindcss",
                "https://x.com/tailwindcss"
            ]
        },
        "metadata": {
            "title": "Tailwind CSS - Rapidly build modern websites without ever leaving your HTML.",
            "description": "Tailwind CSS is a utility-first CSS framework.",
            "favicon": "https://tailwindcss.com/favicons/favicon-32x32.png",
            "ogImage": "https://tailwindcss.com/opengraph-image.jpg"
        }
    }
}
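
If you're calling the API from Node.js rather than curl, the same request can be made with the built-in fetch (Node 18+). This is a sketch that assumes the endpoint, query parameters, and response shape shown above:

async function extractLinksWithCaptureKit(url, accessKey) {
    // Same endpoint and query parameters as the curl example above
    const endpoint = `https://api.capturekit.dev/content?url=${encodeURIComponent(url)}&access_key=${accessKey}`;

    const response = await fetch(endpoint);
    if (!response.ok) {
        throw new Error(`Request failed with status ${response.status}`);
    }

    const { data } = await response.json();
    return data.links; // { internal: [...], external: [...], social: [...] }
}

// Usage example
extractLinksWithCaptureKit('https://tailwindcss.com', 'YOUR_ACCESS_KEY')
    .then((links) => console.log(links))
    .catch(console.error);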

Benefits of Using CaptureKit API

  1. Categorized Links: Links are automatically categorized into internal, external, and social links
  2. Additional Metadata: Get website title, description, favicon, and OpenGraph image
  3. Reliability: No need to handle browser automation, network issues, or rate limiting
  4. Speed: Results are returned in seconds, not minutes
  5. Maintenance-Free: No need to update code when websites change their structure

Conclusion

While Puppeteer provides a powerful way to extract URLs programmatically, it requires significant setup and maintenance. For most use cases, using CaptureKit API is the recommended approach, offering a simpler, more reliable solution with additional features like link categorization and metadata extraction.

Choose the method that best fits your needs:

  • Use Puppeteer if you need full control over the scraping process or have specific requirements
  • Use CaptureKit API if you want a quick, reliable solution with additional features
