What is AI-powered web scraping?
The internet is currently saturated with terms like "AI-driven" and "AI-powered" web scrapers. But is AI web scraping really a thing?
While there are certainly some excellent web scraping tools for AI out there, most of those advertised as AI-powered web scrapers are, well, just web scrapers. AI has been stuffed in there because that's what everyone's obsessed with at the moment.
So, I did my due diligence and tried out a few of these so-called AI scraping tools, identified two that are arguably worthy of the name, and explored the possibility of using GPT models to do web data extraction, as well.
Here are my findings.
AI-powered web scrapers
BrowseAI
First up is BrowseAI. It's basically an Apify-like SaaS platform. The Chrome extension/web app lets you record user actions in the browser, and you can then upload the recording to the platform and run it there.
You provide the URL of a page you want to scrape, like so:
Naturally, I chose the Apify blog because the content is awesome!
After a while, the web page opens in your browser:
You can then extract data with point-and-click tools that automatically recognize repeating components:
Now you can pick parts of those repeating components (things like title and author in this case), select them by clicking, and name the columns in the resulting table:
Pros of BrowseAI
The recorder has intuitive controls and a smart UI for selecting data to scrape.
It's a no-code solution, so it's easy for non-developers to use.
Cons of BrowseAI
The performance dips when recording.
Because it's a no-code solution, there's little room for customization.
Is BrowseAI an AI web scraping tool?
BrowseAI is basically Apify plus a recorder. It provides Prebuilt Robots (which are essentially what Apify Actors are) and a platform to run the bots on (just like the Apify platform).
While BrowseAI is a pretty neat no-code web scraping tool, I wouldn't go so far as to call it an AI-powered web scraper. And if you're a dev who wants more customization, anti-blocking features, proxies, datasets, and other things crucial for serious data extraction projects, web scraping with Apify is an alternative solution you should consider.
Kadoa
Next up is Kadoa.com, an online service that uses generative AI models for automated data extraction.
With Kadoa Playground, you input a URL, and the service will analyze the page using AI models to extract data automatically.
You can then select which data you want to scrape, making the process quick and efficient.
This can be especially useful for those who need to collect large amounts of data from websites for research or business purposes.
Again, I went with the Apify blog. Did I mention how awesome it is?
After analyzing the page, the service asks what data you want to extract. In this case, it found out that blog.apify.com contains links to blog posts and articles, so it offered to scrape these:
After picking Blog posts, Kadoa gave me the option to customize the scrape even more:
What's cool is that it didn't ask for CSS/XPath selectors but let me give the commands in plain natural language (English, in this case).
As I wanted to scrape the titles of the blog posts, their respective authors, and the publication dates, I just added three fields named title, author, and pub_date.
There's no required syntax, as Kadoa uses generative AI models to handle that.
After a while, the service gave me the result as a neatly formatted JSON array:
Pros of Kadoa.com
- Fast and easy to use.
Cons of Kadoa.com
- The whole project is still in the early phase, so it has some limitations:
The playground doesn't work for generic homepages, sites behind a login, sites with scraping preventions, or sites that require click automation.
If you're a developer who needs to scrape those things (and frankly, for any large-scale scraping task, you really do), then Website Content Crawler is an alternative you should consider.
Is Kadoa an AI web scraping tool?
I think Kadoa is worthy of the "AI" in "AI-powered web scraping". The AI is what makes Kadoa so easy for non-developers to use.
Using GPT models for data extraction
So, those are two ready-made AI web scraping products you could try, but another possibility is to use AI (LLMs, in this case) directly.
For example, you can build a scraper with Crawlee that extracts the text from a page, feeds it to an LLM, and says, "Make a JSON out of this."
"Here follows body.innerText for a blog listing page. Extract article names, descriptions, author names, and creation dates, and format those as a JSON array..."
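As a minimal sketch, the prompt above could be assembled in code before being sent to the model. The helper name and the attribute names are my own choices (picked to match the sample response below), and the commented-out API call is an assumption about the OpenAI Python SDK rather than a recipe:

```python
def build_extraction_prompt(page_text: str) -> str:
    """Wrap a page's visible text (body.innerText) in an extraction instruction."""
    return (
        "Here follows body.innerText for a blog listing page. "
        "Extract article names, descriptions, author names, and creation dates, "
        "and format those as a JSON array of objects with the keys "
        "articleName, description, authorName, and creationDate.\n\n"
        + page_text
    )

# The completed prompt would then go to an LLM, e.g. (hypothetical call):
# response = client.chat.completions.create(
#     model="gpt-4o-mini",
#     messages=[{"role": "user", "content": build_extraction_prompt(text)}],
# )
```

Anything that can hand you a page's text works as the front end here; the LLM only ever sees a plain string.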
You can see an example in the OpenAI playground:
And here's the response:
[
{
"articleName": "How web scraping and AI are helping to find missing children",
"description": "The Missing Children initiative began with a Facebook page. Web scraping Facebook for data labeling has taken it to a whole new level and the initiative is now reuniting families all over Egypt.",
"authorName": "Theo Vasilis",
"creationDate": "Aug 7, 2023"
},
{
"articleName": "Google Maps scraping manual: how to extract reviews, images, restaurants, and more",
"description": "Welcome to your comprehensive guide to extracting valuable data from Google Maps. In this manual, we will walk you through various techniques and tools to help you scrape images, extract restaurant data, gather contact details, scrape reviews, and much more.",
"authorName": "Natasha Lekh",
"creationDate": "Jul 26, 2023"
},
{
"articleName": "What is generative AI?",
"description": "What you need to know about generative AI and what it means for developers.",
"authorName": "Theo Vasilis",
"creationDate": "Jun 23, 2023"
},
{
"articleName": "Apify tutorial contest",
"description": "Create guides on how to use the Apify platform or Crawlee to build web scrapers, and win up to $1,000 of free platform credits for your own projects!",
"authorName": "Theo Vasilis",
"creationDate": "May 19, 2023"
},
{
"articleName": "Web scraping for AI: how to collect data for LLMs",
"description": "A tutorial that shows you how to crawl, extract, and process web data to feed, fine-tune, or train large language models.",
"authorName": "Theo Vasilis",
"creationDate": "Aug 10, 2023"
},
{
"articleName": "Puppeteer tutorial: submitting forms, clicking buttons, and handling inputs",
"description": "Find out how to use Puppeteer to handle forms, buttons, and inputs. Learn about type method, click method, and how to deal with text fields, dropdowns, and checkboxes.",
"authorName": "Ayodele Aransiola",
"creationDate": "Aug 9, 2023"
},
{
"articleName": "How web scraping and AI are helping to find missing children",
"description": "The Missing Children initiative began with a Facebook page. Web scraping Facebook for data labeling has taken it to a whole new level and the initiative is now reuniting families all over Egypt.",
"authorName": "Theo Vasilis",
"creationDate": "Aug 7, 2023"
},
{
"articleName": "Selenium Grid: what it is and how to set it up",
"description": "Learn about the Selenium Grid architecture and explore its use in scenarios like large test suites, cross-browser testing, and continuous integration. This comprehensive guide also includes steps for setting up Selenium Grid and practical tips for efficient parallel test execution.",
"authorName": "Percival Villalva",
"creationDate": "Aug 3, 2023"
},
{
"articleName": "Is web scraping legal?",
"description": "Web scraping is legal if you scrape data publicly available on the internet. But you should be careful when scraping personal data or intellectual property. We cover the confusion surrounding the legality of web scraping and give you tips for compliant and ethical scrapers.",
"authorName": "Ondra Urban",
"creationDate": "Aug 3, 2023"
},
{
"articleName": "Traditional NLP techniques and the rise of LLMs",
"description": "The field of NLP has changed with the rise of LLMs, but NLP still has a role to play. Apply NLP techniques to scraped data and learn about tokenization, stemming, lemmatization, removing stop words, and more NLP techniques.",
"authorName": "Usama Jamil",
"creationDate": "Aug 2, 2023"
},
{
"articleName": "10 reasons tourists hate European landmarks (according to data from Google Maps)",
"description": "A small data project to visualize and analyze bad Google Maps reviews of popular European landmarks.",
"authorName": "Natasha Lekh",
"creationDate": "Aug 1, 2023"
},
{
"articleName": "Python and machine learning",
"description": "Learn how Python and machine learning intersect to solve complex problems that defeat traditional programming methods. Find out about Pandas, TensorFlow, Scikit-learn, and how they can transform data.",
"authorName": "Percival Villalva",
"creationDate": "Jul 31, 2023"
},
{
"articleName": "Top 5 books on AI",
"description": "Explore the world of AI through a comprehensive selection of books recommended by business leaders. These reads provide an in-depth understanding of AI's history, machine learning, generative AI, diversity in AI, and AI for cybersecurity.",
"authorName": "Guest Author",
"creationDate": "Jul 29, 2023"
}
]
This approach is more resilient to page changes than regular scraping approaches because it doesn't rely on CSS selectors, which can stop working after a redesign or a change to the page layout.
Data consistency depends heavily on the prompt you provide. Be as specific as possible and always describe the schema (attribute names such as title, author_name, and publication_date).
Also, keep in mind that the GPT model only remembers the current context and will not keep references to articles, authors, or concepts it has seen on previous pages.
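Because the model's reply is not guaranteed to follow the schema you described, it's worth validating it before storing anything. Here's a small sketch (not part of any tool above; the key names are the ones used in the sample response in this article):

```python
import json

# The attribute names we told the model to use in the prompt.
EXPECTED_KEYS = {"articleName", "description", "authorName", "creationDate"}

def parse_llm_articles(raw: str) -> list[dict]:
    """Parse the model's reply and keep only records matching the schema.

    json.loads raises if the reply isn't valid JSON at all, which is
    better surfaced than silently ignored.
    """
    records = json.loads(raw)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array")
    return [
        r for r in records
        if isinstance(r, dict) and EXPECTED_KEYS <= r.keys()
    ]
```

Records with missing attributes are dropped rather than patched, so downstream code can rely on every field being present.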
As a result, while it will transform pages to JSON arrays just fine, it will struggle to transform the data to RDF (Resource Description Framework).
Example of an RDF graph (taken from Stardog). Using GPT to create these graphs from web content can be difficult because of the limited context size (e.g., I'm scraping the Love Me Do page, but I don't know how to refer to the Beatles node because its definition was on a different page).
RDF models are cool because they can answer questions like, "Which friends of John Lennon live in Liverpool and have worked with him on at least two albums?", which is something regular databases struggle with because you have to label all the entities and their relations manually.
Connecting a large language model to a web crawler may seem like the go-to solution for parsing webpages and building RDF graphs from them, but it's tricky because of the limited context memory of today's LLMs.
Can AI do web scraping?
So, can you use AI to do web scraping, and, more to the point, should you?
As with other uses of GPT models, AI tools are most helpful to those who know their field well enough to moderate and correct them.
If you don't know how to code, you shouldn't trust an AI to do it for you.
If you're a developer, you may find GPT models helpful for certain aspects of web scraping, especially if you're good at prompt engineering, but I don't think they're ready to steal your job just yet.