AI+Node.js x-crawl crawler: Why are traditional crawlers no longer the first choice for data crawling?

AI and Node.js crawler combination

When AI is paired with Node.js crawlers, data collection becomes smarter and more efficient. AI can help a Node.js crawler locate its targets more accurately. Traditional crawlers often rely on fixed rules or templates to capture data, and that approach is often powerless in the face of complex and changeable web structures.

Why do we need AI-assisted crawlers

With the rapid development of network technology, websites are updated more and more frequently, and changes to class names or page structure pose no small challenge to crawlers that depend on those elements. Against this backdrop, crawlers combined with AI technology have become a powerful weapon for meeting that challenge.

First, a change in class names or structure after a website update can render traditional crawling strategies ineffective. Crawlers typically rely on fixed class names or structures to locate and extract the information they need; once those elements change, the crawler may no longer find the required data, which hurts both the effectiveness and the accuracy of data fetching.
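
As a minimal illustration of this brittleness (a hypothetical page, parsed here with the cheerio library), a crawler keyed to a fixed class name silently breaks the moment that class is renamed:

import * as cheerio from 'cheerio'

// Yesterday's markup: <span class="price-tag">$120</span>
// After a site redesign the class name changes:
const html = '<span class="listing-price">$120</span>'

const $ = cheerio.load(html)

// The selector is hard-coded to the old class name...
const price = $('.price-tag').text()

// ...so it now returns an empty string instead of the price.
console.log(price === '' ? 'Selector broke after the update' : price)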

However, crawlers that incorporate AI technology cope with such changes far better. Through technologies such as natural language processing, the AI can understand and parse the semantic information of a web page and extract the required data more accurately.

In summary, crawlers combined with AI technology are better equipped to handle class-name or structural changes after website updates.

What is x-crawl?

x-crawl is a flexible Node.js AI-assisted crawler library. Its flexible usage and powerful AI-assisted features make crawler work more efficient, intelligent, and convenient.

It consists of two parts:

  • Crawler: composed of the crawler API and various supporting functions; it works properly even without AI.
  • AI: currently based on the large AI models provided by OpenAI; the AI simplifies many tedious operations.

x-crawl GitHub: https://github.com/coder-hxl/x-crawl

x-crawl Documentation: https://coder-hxl.github.io/x-crawl/cn/
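
Before looking at the AI features, here is a minimal setup sketch (assuming Node.js and npm install x-crawl; createCrawl and crawlPage are the same APIs used in the examples below):

import { createCrawl } from 'x-crawl'

// Create a crawler application; it works without any AI configuration
const crawlApp = createCrawl({ maxRetry: 3 })

// crawlPage drives a headless browser and exposes the page and browser objects
crawlApp.crawlPage('https://example.com').then(async (res) => {
  const { page, browser } = res.data
  console.log(await page.title())
  await browser.close()
})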

Features

🤖 AI Assistance - Powerful AI assistance makes crawler work more efficient, intelligent, and convenient.
🖋️ Flexible writing - A single crawling API accepts multiple configuration styles, each with its own advantages.
⚙️ Multiple uses - Supports crawling dynamic pages, static pages, interface data, and file data.
⚒️ Page control - When crawling dynamic pages, supports automated operations, keyboard input, event handling, and more.
👀 Device fingerprinting - Zero-config or custom configuration to evade fingerprint-based identification and tracking across different locations.
🔥 Async/sync - Choose asynchronous or synchronous crawling mode without switching the crawling API.
⏱️ Interval crawling - No interval, a fixed interval, or a random interval; decide whether to crawl with high concurrency.
🔄 Failed retry - Customize the number of retries to avoid crawl failures caused by transient problems.
➡️ Proxy rotation - Automatic proxy rotation combined with failed retry; customize the error counts and HTTP status codes that trigger a switch.
🚀 Priority queue - Give an individual crawl target a priority so it is crawled ahead of the others.
🧾 Crawl information - Controllable crawl logging, output as colored strings in the terminal.
🦾 TypeScript - Ships with its own types, implemented completely through generics.
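
Several of these features are plain configuration options. The sketch below combines interval crawling, failed retry, proxy rotation, and per-target priority. maxRetry and intervalTime are taken from the example later in this article; the priority and proxy option shapes are assumptions drawn from the x-crawl documentation, so verify them there before relying on them:

import { createCrawl } from 'x-crawl'

const crawlApp = createCrawl({
  maxRetry: 3, // failed retry: retry each target a few times before giving up
  intervalTime: { max: 2000, min: 1000 } // random 1-2s pause between crawls
})

crawlApp.crawlPage({
  targets: [
    'https://example.com/page-a',
    // assumed shape: a per-target priority lets this URL jump the queue
    { url: 'https://example.com/page-b', priority: 5 }
  ],
  // assumed shape: rotate proxies after repeated errors or these status codes
  proxy: {
    urls: ['http://localhost:7890', 'http://localhost:7891'],
    switchByErrorCount: 2,
    switchByHttpStatus: [401, 403]
  }
})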

Example: combining AI with the x-crawl crawler

Combining the crawler with AI lets us fetch images of highly rated vacation rentals simply by describing what we want:

import { createCrawl, createCrawlOpenAI } from 'x-crawl'

// Create a crawler application
const crawlApp = createCrawl({
  maxRetry: 3,
  intervalTime: { max: 2000, min: 1000 }
})

// Create an AI application
const crawlOpenAIApp = createCrawlOpenAI({
  clientOptions: { apiKey: process.env['OPENAI_API_KEY'] },
  defaultModel: { chatModel: 'gpt-4-turbo-preview' }
})

// crawlPage is used to crawl pages
crawlApp.crawlPage('https://www.airbnb.cn/s/select_homes').then(async (res) => {
  const { page, browser } = res.data

  // Wait for the element to appear on the page and get the HTML
  const targetSelector = '[data-tracking-id="TOP_REVIEWED_LISTINGS"]'
  await page.waitForSelector(targetSelector)
  const highlyHTML = await page.$eval(targetSelector, (el) => el.innerHTML)

  // Let the AI get the image link and de-duplicate it (the more detailed the description, the better)
  const srcResult = await crawlOpenAIApp.parseElements(
    highlyHTML,
    `Get the image link, don't source it inside, and de-duplicate it`
  )

  await browser.close()

  // crawlFile is used to crawl file resources
  crawlApp.crawlFile({
    targets: srcResult.elements.map((item) => item.src),
    storeDirs: './upload'
  })
})

Even if a later website update changes the class names or structure, the crawler can still fetch the data normally, because we no longer rely on fixed class names or structures to locate and extract the required information; instead, we let the AI understand and parse the semantic information of the web page, extracting the required data more efficiently, intelligently, and conveniently.

You can even pass the entire HTML to the AI and let it do the work, but keep in mind that the more complex the page content is, the more precisely you need to describe what to extract, and the more tokens the call will consume.

Note: if you want to see the HTML the AI had to process, or the srcResult (image URLs) the AI returned after parsing that HTML according to our instructions, the fragments are too long to include in this example; you can find them in the x-crawl documentation.

Intelligent on-demand element analysis

There is no need to manually analyze the HTML page structure to extract element attributes or values. Simply pass the HTML to the AI and tell it which elements you want information about; the AI will automatically analyze the page structure and extract the corresponding attributes or values.

import { createCrawlOpenAI } from 'x-crawl'

const crawlOpenAIApp = createCrawlOpenAI({
  clientOptions: { apiKey: 'Your API Key' }
})

const HTMLContent = `
   <div class="scroll-list">
     <div class="list-item">Women's hooded sweatshirt</div>
     <div class="list-item">Men's sweatshirts</div>
     <div class="list-item">Women's sweatshirt</div>
     <div class="list-item">Men's hooded sweatshirt</div>
   </div>
   <div class="scroll-list">
     <div class="list-item">Men's pure cotton short sleeves</div>
     <div class="list-item">Men's pure cotton short sleeves</div>
     <div class="list-item">Women's pure cotton short sleeves</div>
     <div class="list-item">Men's ice silk short sleeves</div>
     <div class="list-item">Men's round neck short sleeves</div>
   </div>
`

crawlOpenAIApp
  .parseElements(HTMLContent, `Take all men's clothing and remove duplicates`)
  .then((res) => {
    console.log(res)
    /*
      res:
      {
        elements: [
          { content: "Men's hooded sweatshirt" },
          { content: "Men's sweatshirts" },
          { content: "Men's pure cotton short sleeves" },
          { content: "Men's ice silk short sleeves" },
          { content: "Men's round neck short sleeves" }
        ],
        type: 'multiple'
      }
    */
  })

As above, you can pass the entire HTML to the AI, but the more complex the content, the more precise your description must be, and the more tokens it will consume.

Intelligent generation of element selectors

Selectors help us quickly locate specific elements on a page. Just pass the HTML to the AI and tell it which elements you need selectors for; the AI will automatically generate appropriate selectors based on the page structure, greatly simplifying the tedious process of writing selectors by hand.

import { createCrawlOpenAI } from 'x-crawl'

const crawlOpenAIApp = createCrawlOpenAI({
  clientOptions: { apiKey: 'Your API Key' }
})

const HTMLContent = `
   <div class="scroll-list">
     <div class="list-item">Women's hooded sweatshirt</div>
     <div class="list-item">Men's sweatshirts</div>
     <div class="list-item">Women's sweatshirt</div>
     <div class="list-item">Men's hooded sweatshirt</div>
   </div>
   <div class="scroll-list">
     <div class="list-item">Men's pure cotton short sleeves</div>
     <div class="list-item">Men's pure cotton short sleeves</div>
     <div class="list-item">Women's pure cotton short sleeves</div>
     <div class="list-item">Men's ice silk short sleeves</div>
     <div class="list-item">Men's round neck short sleeves</div>
   </div>
`

crawlOpenAIApp
  .getElementSelectors(HTMLContent, `all Women's wear`)
  .then((res) => {
    console.log(res)
    /*
      res:
      {
        selectors: '.scroll-list:nth-child(2) .list-item:nth-child(3)',
        type: 'multiple'
      }
    */
  })

As above, you can pass the entire HTML to the AI, but the more complex the content, the more precise your description must be, and the more tokens it will consume.

Intelligent reply to crawler questions

The AI can also provide intelligent answers and suggestions. Whether your question is about crawling strategy, anti-crawling techniques, or data processing, you can ask the AI, and it will draw on its learning and reasoning capabilities to give professional answers and suggestions that help you complete your crawling tasks.

import { createCrawlOpenAI } from 'x-crawl'

const crawlOpenAIApp = createCrawlOpenAI({
  clientOptions: { apiKey: 'Your API Key' }
})

crawlOpenAIApp.help('What is x-crawl').then((res) => {
  console.log(res)
  /*
    res:
    x-crawl is a flexible Node.js AI-assisted web crawling library. It offers powerful AI-assisted features that make web crawling more efficient, intelligent, and convenient. You can find more information and the source code on x-crawl's GitHub page: https://github.com/coder-hxl/x-crawl.
   */
})

crawlOpenAIApp
  .help('Three major things to note about crawlers')
  .then((res) => {
    console.log(res)
    /*
      res:
      There are several important aspects to consider when working with crawlers:

      1. **Robots.txt:** It's important to respect the rules set in a website's robots.txt file. This file specifies which parts of a website can be crawled by search engines and other bots. Not following these rules can lead to your crawler being blocked or even legal issues.

      2. **Crawl Delay:** It's a good practice to implement a crawl delay between your requests to a website. This helps to reduce the load on the server and also shows respect for the server resources.

      3. **User-Agent:** Always set a descriptive User-Agent header for your crawler. This helps websites identify your crawler and allows them to contact you if there are any issues. Using a generic or misleading User-Agent can also lead to your crawler being blocked.

      By keeping these points in mind, you can ensure that your crawler operates efficiently and ethically.
   */
  })

Summary

In the latest version of x-crawl, we have introduced powerful AI-assisted features that make crawler work more efficient, intelligent, and convenient. This innovation is mainly reflected in the following areas:

1. Intelligent on-demand element analysis

Traditional crawler work often requires manual analysis of the HTML page structure to extract the desired element attributes or values. Now, with the AI assistance of x-crawl, you can easily implement intelligent on-demand analysis of elements. Just tell the AI which elements you want to get information about, and the AI will automatically analyze the page structure and extract the corresponding element attributes or values.

2. Intelligent generation of element selectors

Selectors are an integral part of the crawler's work, helping us quickly locate specific elements on the page. Now, x-crawl's AI assistant can intelligently generate element selectors for you. Simply input the HTML code into the AI, and the AI will automatically generate the appropriate selector for you based on the page structure, greatly simplifying the tedious process of determining the selector.

3. Intelligent replies to crawler questions

In crawling work we inevitably encounter all kinds of problems and challenges, and x-crawl's AI assistance can provide intelligent answers and suggestions. Whether your question concerns crawling strategy, anti-crawling techniques, or data processing, you can ask the AI, and it will draw on its strong learning and reasoning abilities to provide professional answers and suggestions that help you complete your crawling tasks.

4. User-defined AI functions

To meet the individual needs of different users, x-crawl also lets you customize the AI. This means you can tune and optimize the AI to your own needs so that it better fits your crawling work: whether adjusting the AI's analysis strategy, optimizing selector generation, or adding new functional modules, a few simple settings make the AI match your habits and workflow.
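
As a small sketch of that customization, the clientOptions and defaultModel options shown earlier in this article already let you swap the model or point the client at a different endpoint. Note that baseURL is a standard option of the OpenAI Node client and is an assumption here; check the x-crawl documentation for exactly what clientOptions passes through:

import { createCrawlOpenAI } from 'x-crawl'

// Customize the AI application: clientOptions and defaultModel both appear
// earlier in this article; baseURL (for a proxy or self-hosted endpoint)
// is an assumption borrowed from the OpenAI Node client.
const crawlOpenAIApp = createCrawlOpenAI({
  clientOptions: {
    apiKey: process.env['OPENAI_API_KEY'],
    baseURL: 'https://your-openai-proxy.example.com/v1'
  },
  defaultModel: { chatModel: 'gpt-4-turbo-preview' }
})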


x-crawl GitHub: https://github.com/coder-hxl/x-crawl

x-crawl Documentation: https://coder-hxl.github.io/x-crawl

If you find x-crawl helpful, or if you like x-crawl, you can star the x-crawl repository on GitHub. Your support is our motivation for continuous improvement. Thank you!
