Emirhan Akdeniz for SerpApi

Originally published at serpapi.com

Real World Example of AI Powered Parsing

At SerpApi, we're constantly exploring innovative advancements to enhance our services and provide our users with the most cutting-edge solutions. That's why we're excited to introduce a potential new AI feature called AI Powered Parsing, designed to revolutionize how we extract valuable insights from Real Time Search Engine Results using AI Tools.

Disclaimer

The inclusion of this AI feature in SerpApi's offerings is contingent upon official confirmation. The plans outlined here are based on the assumption of approval.

For starters, we plan to utilize AI Powered Parsing for Local Results in a Google Search served by SerpApi's Google Local Pack API and SerpApi's Google Local API. AI Powered Parsers take advantage of an open-source model called bert-base-local-results and an open-source Ruby gem called google-local-results-ai-parser. We believe in transparency, which is why both the model and the gem composing the core of this AI technology are publicly available for anyone to examine.

To stay at the forefront of leveraging data gathered from Search Engine Results Pages, you may register to claim free credits.

Open Sourced Materials in this Blog Post: the bert-base-local-results model on Hugging Face, the google-local-results-ai-parser Ruby gem, and the google-local-results-ai-server repository.

What is AI Powered Parsing?

AI-powered parsing for HTML refers to the use of artificial intelligence techniques to extract and understand the content and structure of HTML documents. HTML (Hypertext Markup Language) is the standard language for creating web pages, and parsing refers to the process of analyzing the HTML code to extract meaningful information from it.

Traditionally, HTML parsing has been performed using rule-based algorithms or template-based approaches, where specific patterns or rules are defined to identify and extract desired elements from the HTML. However, these methods can be limited in their ability to handle complex or inconsistent HTML structures, as they rely on predefined rules that may not adapt well to variations in coding styles or new web page designs, resulting in a disruption of the customer experience.
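
To make the contrast concrete, here is a minimal sketch of the traditional, rule-based style of extraction using the nokolexbor gem (which the parser examples later in this post also require). The HTML snippet and class names below are made up for illustration; real Google markup uses obfuscated, shifting class names, which is exactly why hard-coded selectors like these go stale.

require 'nokolexbor'

# Rule-based extraction: pick fields out of a local result via fixed CSS selectors.
html = <<~HTML
  <div class="result">
    <span class="rating">4.0</span>
    <span class="reviews">(418)</span>
    <span class="type">Coffee shop</span>
  </div>
HTML

doc = Nokolexbor::HTML(html)
puts({
  "rating"  => doc.at_css(".rating")&.text,
  "reviews" => doc.at_css(".reviews")&.text,
  "type"    => doc.at_css(".type")&.text
})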

AI-powered parsing offers several advantages over traditional parsing approaches. It can handle a wider range of HTML structures, adapt to different coding styles and variations, and improve accuracy and robustness. Furthermore, AI models can learn from large datasets, which enables their parsing capabilities to improve continuously as they are exposed to more diverse HTML documents.

How is it useful to the user?

One of the key advantages of AI-Powered Parsing is the ability to trade a little extra processing time for greater precision in extracting data from search results. This means you can obtain more accurate and reliable information, resulting in improved decision-making and insights.

We understand that search engines frequently evolve, and keeping up with these changes can be challenging. That's why we continuously update our standard parsers to ensure they remain compatible. However, in the rare event that a deprecation goes unnoticed, our cutting-edge AI Powered Parsers serve as a reliable backup solution, ensuring that you're always covered and can access the data you need.

Accessing this game-changing feature is as simple as including a single parameter in your API requests. We've designed it to be user-friendly and hassle-free, so you can start benefiting from AI-Powered Parsing without any complex setup or configuration.

At SerpApi, we value our users' needs and strive to provide the most reliable and up-to-date solutions to improve user experience. While we continually update our parsers, we understand the importance of offering a backup solution using the power of AI systems for those rare cases when changes occur unexpectedly.

How does it work?

Below, you'll find a general flowchart illustrating the basic functioning. Further sections will provide more specific information.

flowchart-of-ai-powered-parser

BERT-Based Classification Model for Google Local Listings

We are excited to present the BERT-Based Classification Model for Google Local Listings, an open-source model available on Hugging Face. This powerful model, developed using the Hugging Face library and a dataset gathered by our own APIs, is a key component of our AI-Powered Parsing feature.

The BERT-based model excels in differentiating difficult semantic similarities, handling partial texts that can be combined later, and effortlessly handling vocabulary from diverse areas. It also returns an Assurance Score for after-correction and demonstrates strong performance against grammatical mistakes.

Additionally, we want to emphasize that the model's flaws and limitations are systematically documented in the model card. While the model offers robust performance, it does have certain constraints and potential misclassifications in specific scenarios. These limitations are diligently addressed to provide users with a transparent understanding of the model's capabilities.

You can play with the model using Free Inference API at the Repository:

bert-base-local-results-inference-api
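
For reference, here is a minimal sketch of querying the model programmatically over Hugging Face's Inference API with the http gem. The endpoint follows the standard Inference API convention; the token is your own, and the exact label set and response shape are best verified against the model card. A self-hosted google-local-results-ai-server deployment, which mimics these endpoints, should accept a similar request.

require 'http'
require 'json'

# Query serpapi/bert-base-local-results over the Hugging Face Inference API.
# The response is expected to be a list of label/score pairs; for the text
# below, the "description" label should score highest.
token = "Huggingface Token"
url = "https://api-inference.huggingface.co/models/serpapi/bert-base-local-results"

response = HTTP.auth("Bearer #{token}")
               .post(url, json: { inputs: "Iconic Seattle-based coffeehouse chain" })

puts JSON.pretty_generate(JSON.parse(response.body.to_s))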

Google Local Results AI Server

We're thrilled to announce that we have also open-sourced a simple server code for deploying the bert-base-local-results model. The server code can be found in SerpApi's Github Repository.

The repository contains the code for a server that mimics the Inference API endpoints provided by Hugging Face. The server offers a straightforward interface to perform text classification using BERT-based models. It has been specifically designed by SerpApi to cater to heavy-load prototyping and production tasks, enabling the implementation of the google-local-results-ai-parser gem, which utilizes the serpapi/bert-base-local-results model.

By open-sourcing this server code, we aim to provide developers with a convenient and efficient way to deploy the bert-base-local-results model in their own environments. It offers flexibility and control, allowing you to customize the deployment according to your specific requirements.

Feel free to explore the repository, leverage the server code, and adapt it to suit your needs. We're excited to see how you integrate the bert-base-local-results model into your projects using this server code.

Google Local Results AI Parser

The google-local-results-ai-parser is a gem developed by SerpApi that allows you to extract and categorize structured data from Google Local Search Results using natural language techniques. It uses the serpapi/bert-base-local-results transformer model to parse Google Local Results Listings in English and extract important information, categorizing it into different sections such as ratings, reviews, descriptions, etc.

You may visit SerpApi's Github Repository for an in-depth look at its capabilities. I'll provide a basic explanation of the usage and functions of this next-generation parser.

Let's say you want to parse the following Google Local Result:

Google Local Result

You can use the following simple code:

require 'google-local-results-ai-parser'
require 'nokolexbor'
require 'http'
require 'parallel'

html = "HTML of an Individual Local Result"
bearer_token = 'Huggingface Token or Private Server Key'
result = GoogleLocalResultsAiParser.parse(html: html, bearer_token: bearer_token)

The output will be:


{
  "address" => "Nicosia",
  "description" => "Iconic Seattle-based coffeehouse chain",
  "price" => "€€",
  "reviews" => "418",
  "rating" => "4.0",
  "type" => "Coffee shop"
}


You can also utilize it for multiple results:

require 'google-local-results-ai-parser'
require 'nokolexbor'
require 'http'
require 'parallel'

html_parts = [
  "HTML of an Individual Local Result",
  "HTML of another Individual Local Result",
  ...
]
bearer_token = 'Huggingface Token or Private Server Key'
results = GoogleLocalResultsAiParser.parse_multiple(html_parts: html_parts, bearer_token: bearer_token)


In this case, the output will be an array of dictionaries:


[
  {
    "address" => "Nicosia",
    "description" => "Iconic Seattle-based coffeehouse chain",
    "price" => "€€",
    "reviews" => "418",
    "rating" => "4.0",
    "type" => "Coffee shop"
  },
  ...
]


The AI-Powered Parser achieves this by calling the model and then applying rule-based after-corrections for the model's known flaws. This way, maximum precision can be provided. There are also other advanced parameters to cope with unexpected cases; you may refer to the documentation for these capabilities.

SerpApi's Potential New Parameter

The AI Powered Parser Search operates by simply incorporating a single parameter named ai_parser. By integrating this AI-powered solution, the search results are not only enhanced but also become more precise. While the conventional parsing methods will still function as usual, the AI Powered Parser supercharges the parsing process, effectively refining and streamlining the precision of the results.

We favor solutions that are "no-brainers" - easy to implement, efficient, and notably improving outcomes. This AI parser fits right into that category, offering an innovative and straightforward way to enhance data extraction from Google Local Search Results or any other part that would be supported.

Imagine the following search:
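
An illustrative request (any parameter beyond q is an assumption about the exact query):

https://serpapi.com/search?engine=google&q=Coffee&api_key=YOUR_API_KEY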

This is searching for the query Coffee on Google. Imagine the Google Local Pack feature is messing up and giving the address for the price, the price for the number of reviews, etc.

You can open an issue at SerpApi's Public Roadmap and/or contact our Customer Success Engineers to notify us about the deprecation. While the issue is being handled by our engineers, you may use the following query:
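
Again illustrative; ai_parser=true is the proposed parameter, and the rest of the request mirrors the one above:

https://serpapi.com/search?engine=google&q=Coffee&ai_parser=true&api_key=YOUR_API_KEY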

This searches for the query Coffee with our AI Powered Parsers, giving precise results until the issue is fixed. By trading extra time for superior precision, you can maintain your production capabilities as if the issue never happened. We want to secure your channel of the data supply chain from us as much as we can.

Here's an example image showcasing the use of the parameter in the playground:
ai-parser-parameter-in-playground

Unit Tests for Maintenance

At SerpApi, we value test-driven development. In order to ensure the quality and integrity of our AI Powered Parser Search feature, we have undertaken comprehensive unit testing.

I sifted through all the previous unit test examples available in the stack and enriched the test cases with some recent examples as well. To facilitate the unit testing process, I crafted a Rake task designed to automatically save resulting JSONs as unit test examples in their respective parts.
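
To give an idea of what such a task can look like, here is a minimal sketch; the task name, fixture paths, and the run_local_search helper are hypothetical stand-ins rather than the actual internal implementation.

namespace :ai_parser do
  desc "Save search result JSONs as unit test fixtures"
  task :save_fixtures do
    require 'json'
    require 'fileutils'

    FileUtils.mkdir_p("test/fixtures/ai_parser")

    ["Coffee", "Plumber", "Dentist"].each do |query|
      # `run_local_search` stands in for the internal call that runs a
      # Google Local search with ai_parser enabled and returns a Hash.
      json = run_local_search(query: query, ai_parser: true)
      File.write("test/fixtures/ai_parser/#{query.downcase}.json", JSON.pretty_generate(json))
    end
  end
end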

As a part of the validation, I manually examined these JSONs for any abnormalities or inconsistencies. Once this manual check was completed, I employed Large Language Model (LLM) prompts for a more in-depth verification.

Here's an example context prompt:


Do you think there are any keys that may contain falsely parsed information in the following array at the following possible keys:

"rating", "type", "years_in_business", "service_options", "hours", "reviews_original", "reviews", "address", "description", "price", "phone"

[
  {
    "position": 1,
    "title": "xxx xxxx Ltd",
    "place_id": "9149508700925112934",
    "place_id_search": "http://localhost:3000/search.json?ai_parser=true&device=desktop&engine=google_local&gl=uk&google_domain=google.com&hl=en&location=United+Kingdom&ludocid=9149508700925112934&q=xxx+xxxxx.Ltd",
    "lsig": "AB86z5W38iEx_9mjnRFzmp68DR6h",
    "gps_coordinates": {
      "latitude": 53.9177131,
      "longitude": -2.1785891
    },
    "links": {
      "website": "http://xxx-xxxxxxx.com/"
    },
    "reviews_original": "No reviews",
    "reviews": 0,
    "address": "No ReviewsBarnoldswick"
  }
]

The Answer: "address" key possibly contains a residue. The correct value is probably "Barnoldswick". The residue is probably "No Reviews"

Wait for my instructions.


Followed up by the query on the actual JSON:


Do you think there are any keys that may contain falsely parsed information in the following array at the following possible keys:

"rating", "type", "years_in_business", "service_options", "hours", "reviews_original", "reviews", "address", "description", "price", "phone"

[JSON I want to check out for oddities]


These models, given their generative capabilities, provide helpful feedback about the accuracy of the parsed data in keys such as "rating", "type", "address", "price", and the other fields listed above.

Given their generative nature, language models like OpenAI's GPT, chatbots like ChatGPT, or Google's Bard (in beta at the time of writing) might produce insufficient information in their answers, necessitating multiple follow-up prompts or some extra interpretation of their output. Despite this, the approach proved to be an efficient alternative to manual inspection, facilitating the process of identifying potential errors or inconsistencies in our JSONs.

After this preliminary check with the JSONs from various older examples that were already available in the stack, we leveraged these in the unit tests. Anytime there was a modification in the responsible parts, these vetted examples proved extremely helpful for comparison.

The overarching objective of these unit tests, as with any other unit tests, was to scrutinize different behaviors in different fixes. The fixes could be localized within the google-local-results-ai-parser gem or be more broadly based within the stack. By implementing this rigorous testing procedure, we ensure our feature's robustness and reliability, enabling it to consistently deliver enhanced and precise results.

Dedicated Pages and Documentation Examples

At SerpApi, we firmly believe in the importance of clear and effective communication with our users. We recognize that the ease of understanding our features and APIs is fundamental to their optimal utilization. Therefore, we've dedicated ourselves to improving our user-facing documentation to facilitate a seamless integration experience.

We have designed two new dedicated pages to serve this purpose. The first page provides a comprehensive overview of our new feature – the AI Powered Parser Search. It outlines the feature's capabilities, details its operation, and presents potential use cases. We believe this will allow our users to grasp the feature's value and envision how it can best fit into their unique workflows. Here are some visual examples:

Disclaimer

These pages are a work in progress. The final version might differ from what is showcased here.

ai-powered-parser-feature-page-1

ai-powered-parser-feature-page-2

ai-powered-parser-feature-page-3

The second page is a complete table of all AI Powered Parsers. It serves as a handy reference for users to understand the different parsers available and the unique functionalities they offer. With this table, users can quickly locate the most suitable AI Powered Parser for their specific needs and applications.

ai-powered-parser-table

Furthermore, we have augmented our documentation with new examples in two of our key APIs: SerpApi's Google Local Pack Results API, and SerpApi's Google Local Results API to help our users understand how to leverage these APIs effectively.

ai-powered-parser-documentation

In conclusion, we continually strive to keep our documentation up-to-date and user-friendly, knowing that it empowers our users to leverage our offerings to their full potential. Whether you are new to SerpApi or an experienced user, these resources will provide valuable guidance in your journey of leveraging AI Powered Parser Search or any of our extensive array of APIs in general.

Brief Overview of the Previous Attempt

In my previous attempt at tackling this problem, I tried to develop a local model in Rails for text classification. I followed a similar method to the one mentioned here. However, I encountered some unexpected challenges along the way. Here are the key lessons I learned from that experience.

You can find some of the older blog posts from the series around this problem at:

Key Lessons from Past Mistakes

Externalize the Model

One of the main problems with my previous approach was training the model locally. I had to take a ton of optimization problems into account. At first sight, containing the whole situation within the stack might seem like an excellent way to keep the problem at bay, and it might give you access to better benchmarking and debugging capabilities. But it turned out to be the wrong call.

Since the model was local, I had to strive for a more time- and performance-efficient model. This limitation alone was enough to create more confusion in the development of the model. For example, since it was an RNN model, it became slower as I increased the dataset size. When I tried to cover the cases in the expanded dataset with less performance available, I lost precision. It was a vicious cycle, not to mention the lack of documentation on Machine Learning in Rails.

In this solution, I have externalized the model to a server, giving more room to play with different models at different capacities. This was an important detail that turned this approach into a working solution.

Don't Reinvent the Wheel

In the previous approach, I tried to train the model from scratch, playing with different structures, different hyperparameters, different datasets, etc. This might seem as if I had more tools in hand.

But in reality, it just gave me more ways to complicate the situation. I had to run many grid-search-like attempts with different structures and hyperparameters that might or might not bear good results. Many times I got lost in the details, and I was limited by my computing power when trying to make subtle observations and deductions about model types.

In this approach, I used a transformer model called bert-base-uncased. It was already trained on a relatively large English corpus, and all I had to do was fine-tune it to our needs. This significantly boosted not only the simplicity of training the model but also its precision. The resulting model, bert-base-local-results, was more capable, easier to implement, and easier to understand.

In conclusion, building on solutions that already handle part of your problem well, and that are easily distributable and applicable, is the better choice in most cases.

Anticipate Conflicts with Other Parts

Another problem I faced with the previous solution was the amount of parallel progress happening elsewhere in the stack during the review process. Those moving parts ended up creating more trouble than they solved.

For example, I attempted to follow the Rails principle of DRY (Don't Repeat Yourself) and extracted the parts shared with the old parsers into a separate function to be used by both. By the time I revisited the previous solution, even the location of the files had changed, which blocked any attempt to demonstrate how it was performing.

In this solution, I have decoupled the necessary parts even if it means repeating the same code in two files. This way, I can ensure that changes made to other parts won't affect the reviewing process or the integration of the code.

In conclusion, sometimes repeating yourself is actually good for maintenance and for keeping maintenance problems at bay. Knowing that an AI Parser problem is most likely caused by a specific part of the code is an absolute time saver.

Improve the Product Instead of Replacing it

I made the mistake of thinking I could replace the whole parsing process with an AI solution. At first sight, it might sound great. But the unknown is always fearsome, so this meant more review time, more details to be careful about in the review, and concepts that were harder to explain since they were replacing a known method.

In this solution, I have used it as a parallel backup solution. Being able to switch from a traditional parser to an AI Powered Parser meant less complexity with better control. I safeguarded the code so that if there was any unknown problem, the AI Powered Parsers would be bypassed and the API would fall back to the traditional parsers.

This way, I was able to provide the same feature without giving up on the old method. Instead of measuring how much performance or precision I had added with this feature, I was able to serve it as a definite bonus.
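
As a rough sketch of that safeguard idea (illustrative only; traditional_parse and the surrounding structure are hypothetical stand-ins for the existing parser code, not SerpApi's actual implementation):

def parse_local_result(html, ai_parser: false)
  # Default path: the traditional rule-based parser.
  return traditional_parse(html) unless ai_parser

  begin
    GoogleLocalResultsAiParser.parse(html: html, bearer_token: ENV["AI_PARSER_TOKEN"])
  rescue StandardError
    # Any unexpected failure falls back to the traditional parser,
    # so the AI path can only add value, never remove it.
    traditional_parse(html)
  end
end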

Refrain from Automation of Every Aspect

In the previous solution, I attempted to create a completely automated pipeline from the ground up: gathering data with SerpApi from within the Rails stack, feeding it directly into the training dataset loader and the training process, creating a table for training debug values such as loss and success rates, etc.

At first sight, it might seem like an all-encompassing solution. But the amount of complexity it presented to anyone not informed about the process was huge, and I had to take unconventional approaches in many parts.

For example, I didn't want us to install the Torch library on every server. So I devised a system in which you could develop a model in Torch with the PT format and then convert it to an ONNX file. Rails had some lightweight options for running ONNX models. But since I wasn't well versed in writing code for converting PT to ONNX, I had to resort to a Python script within the Rails stack.

In this solution, I only attempted to fully automate the parsing process with the google-local-results-ai-parser gem.

I have also offered a solution to automate serving the model with google-local-results-ai-server, but this is optional. People can still deploy their own solutions on Microsoft Azure, Amazon SageMaker, or any new startup they think fits them better.

For training the model, people are free to choose which dataset to use or which framework to train with if they wish to improve or replicate bert-base-local-results.

The essential takeaway is to give people more flexibility in how they want to resolve the problem while providing at least one working path. Automating every aspect with limited built-in options will likely cause confusion in the long run.

Handle Flaws of the Model Systematically

In my previous approach, because of other complexities, I wasn't able to do a full systematic breakdown of the model's flaws. I only provided the necessary fixes for them. Solutions to these flaws created secondary flaws in the results, and things got out of control.

In addition, I employed some hardcoded regexes for after-corrections. While this improved the overall precision, it blocked me from implementing better solutions that would encompass a wider range of flaws.

Also, I didn't separate the information returned by the model from the after-corrections. This made it harder to distinguish flaws in the model's behavior from flaws in the after-corrections.

In this solution, I have employed a fully systematic breakdown of the model's known flaws. Also, instead of using dense regexes, I've handled after-corrections mostly based on the position of the elements in the HTML.

Furthermore, since the model was hosted on a separate server, I could easily check small entries and the classifications the model made for them. Even using Hugging Face's Free Inference API on the model page was extremely helpful in debugging different flaws.

Ensuring that the solution could be easily transferred to other languages and other search results was a crucial part of this attempt. Until we create a new model for a new language or a new part, we cannot know for sure how effective it was. However, a few details here are likely to improve the transferability of the solution.

Compartmentalize Different Parts and Open Source Most of Them

This is closely related to refraining from automating every aspect, but it is a subtly different topic. In my previous attempt, I tried to solve the problem in one place because I thought it would be easier to work with.

In this solution, I have broken it into four parts, as lightly mentioned above: the bert-base-local-results model, the google-local-results-ai-parser gem, the google-local-results-ai-server, and the ai_parser integration within SerpApi's stack.

I could've easily gone for a server solution that also handles the parsing on a remote server, or used a custom hosting solution to combine multiple parts, etc., and had fewer steps to achieve the same result. But such an approach would ultimately fail in the long run.

To exemplify this with a similar experience from the old project: I wasn't able to open-source most of that work because it was intended for our codebase, and I had to mitigate the risk of sharing bits of code to explain the solution. This led me to write more blog posts to explain the idea.

Although I am grateful to all the readers of my previous work, this situation caused a secondary problem: balancing the spreading of the solution against doing the actual work. Many times I had to rephrase myself in other blog posts, which kept me from developing.

In this approach, the different components are open-sourced, and only a small amount of code is intended for the stack. This way, I can cover the intricate details of how to use google-local-results-ai-parser in its documentation, the detailed flaws of bert-base-local-results in the model card, etc. This still creates a larger corpus for the audience, but it also helps me keep the risks at bay while having better control over the different parts.

Also, in the event that this work is not integrated, most of the materials will still be out there for public use, instead of bits and pieces that interested people must compile themselves.

In conclusion, open-sourcing parts of your projects, especially ones that are not certain to be accepted, is a good way to balance spreading the idea against doing the actual work. Also, good compartmentalization of the open-source parts will reduce the risk of exposing delicate parts of your codebase to the public.

Conclusion

I'm grateful for the reader's time and attention, and for any attention given to the open-sourced materials that made this work possible. I hope this solution communicates the idea of AI Powered Parsing to interested readers, and that the key takeaways from my previous mistakes benefit people in the most potent way. I would be even more grateful if the open-sourced materials prove useful in your projects.
