Ilya Zub for SerpApi

Originally published at serpapi.com

Parse the number of Google search results as JSON using beta OpenAI

OpenAI data extraction: "About 1,950,000,000 results (0.76 seconds)" to { "total_results": 1950000000, "time_taken_displayed": 0.76 }.

In the previous blog post, I showed how OpenAI can extract numbers from HTML.

Now I have tried to convert "About 1,950,000,000 results (0.76 seconds)" to { "total_results": 1950000000, "time_taken_displayed": 0.76 }. Nine examples were enough to correctly extract data from similar HTML. To me it feels like magic, since I have no experience in the AI field.

The desired data to extract is the total number of results and the time taken to process the search.

Similar to the previous blog post, I've specified examples of HTML and the desired output.

This program extracts the number of search results and total time taken as JSON from the HTML

HTML: "<div id="result-stats">About 3,200,000,000 results<nobr> (0.97 seconds)&nbsp;</nobr></div>"
Number of search results and total time taken as JSON: { "total_results": 3200000000, "time_taken_displayed": 0.97 }

HTML: "<div id="result-stats">About 2,200,000,000 results<nobr> (0.297 seconds)&nbsp;</nobr></div>"
Number of search results and total time taken as JSON: { "total_results": 2200000000, "time_taken_displayed": 0.297 }

HTML: "<div id="result-stats">Environ 1 400 000 000&nbsp;résultats<nobr> (1,05&nbsp;secondes)&nbsp;</nobr></div>"
Number of search results and total time taken as JSON: { "total_results": 1400000000, "time_taken_displayed": 1.05 }

HTML: "<div id="result-stats">About 1,790,000 results<nobr> (0.55 seconds)&nbsp;</nobr></div>"
Number of search results and total time taken as JSON: { "total_results": 1790000, "time_taken_displayed": 0.55 }

HTML: "<span class="nums_text">百度为您找到相关结果约100,000,000个</span>"
Number of search results and total time taken as JSON: { "total_results": 100000000, "time_taken_displayed": null }
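
The prompt above follows a fixed pattern: a task description, then pairs of HTML and the expected JSON. A minimal sketch of assembling such a prompt in Python is shown here. The names `TASK`, `LABEL`, `format_example`, and `build_prompt` are illustrative assumptions and do not come from the original post.

```python
import json

# Task description and answer label, copied from the prompt above.
TASK = ("This program extracts the number of search results "
        "and total time taken as JSON from the HTML")
LABEL = "Number of search results and total time taken as JSON"


def format_example(html, expected):
    """Render one HTML / expected-JSON pair in the prompt's layout."""
    # json.dumps turns None into null and keeps numbers unquoted.
    fields = ", ".join(
        f"{json.dumps(key)}: {json.dumps(value)}"
        for key, value in expected.items()
    )
    return f'HTML: "{html}"\n{LABEL}: {{ {fields} }}'


def build_prompt(examples):
    """Join the task description and all examples with blank lines."""
    blocks = [TASK] + [format_example(html, exp) for html, exp in examples]
    return "\n\n".join(blocks)


# Hypothetical usage with one pair taken from the prompt above.
EXAMPLES = [
    ('<span class="nums_text">百度为您找到相关结果约100,000,000个</span>',
     {"total_results": 100000000, "time_taken_displayed": None}),
]
print(build_prompt(EXAMPLES))
```

Keeping the examples as data makes it easier to add or correct a pair without retyping the whole prompt.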

I also added examples with their expected results, including one with Eastern Arabic numerals, which I hadn't covered in the previous post.

HTML text
1. "<div id="result-stats">About 1,800,0020,000 results<nobr> (0.589 seconds)&nbsp;</nobr></div>"
2. "<div id="result-stats">About 953,626,112 results<nobr> (1.29 seconds)&nbsp;</nobr></div>"
3. "<div id="result-stats">Ungefähr 1.240.000.000 Ergebnisse<nobr> (0,72 Sekunden)&nbsp;</nobr></div>"
4. "<div id="result-stats">حوالى ٢٬٣١٠٬٠٠٠٬٠٠٠ نتيجة<nobr> (٠٫٧٣ ثانية)&nbsp;</nobr></div>"

Extracted JSON
1. { "total_results": 18000020000, "time_taken_displayed": 0.589 }
2. { "total_results": 953626112, "time_taken_displayed": 1.29 }
3. { "total_results": 1240000000, "time_taken_displayed": 0.72 }
4. { "total_results": 2310000000, "time_taken_displayed": 0.73 }
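
Expected values in pairs like these can be cross-checked with a small hand-written baseline. The sketch below is an assumption added for illustration, not the approach used in this post: it strips tags, treats every decimal digit before the opening parenthesis as part of the result count, and reads the first decimal number inside the parentheses as the time. `baseline_parse` and `to_ascii_digits` are hypothetical helper names.

```python
import html
import re
import unicodedata

# Decimal separators seen in the examples: ".", "," and the
# Arabic decimal separator U+066B.
TIME_RE = re.compile(r"(\d+)[.,\u066b](\d+)")


def to_ascii_digits(text):
    """Map Unicode decimal digits (e.g. Eastern Arabic) to ASCII 0-9."""
    return "".join(str(unicodedata.decimal(ch)) for ch in text)


def baseline_parse(fragment):
    # Drop tags first, then decode entities such as &nbsp;.
    text = html.unescape(re.sub(r"<[^>]+>", "", fragment))
    # Everything before "(" holds the count; the rest holds the time.
    head, _, tail = text.partition("(")
    digits = "".join(ch for ch in head if ch.isdecimal())
    total = int(to_ascii_digits(digits)) if digits else None
    match = TIME_RE.search(tail)
    seconds = None
    if match:
        whole = to_ascii_digits(match.group(1))
        frac = to_ascii_digits(match.group(2))
        seconds = float(f"{whole}.{frac}")
    return {"total_results": total, "time_taken_displayed": seconds}


print(baseline_parse(
    '<div id="result-stats">Ungefähr 1.240.000.000 Ergebnisse'
    '<nobr> (0,72 Sekunden)&nbsp;</nobr></div>'
))
```

This baseline is deliberately naive: any stray digit before the parenthesis is absorbed into the count, so it is only useful as a sanity check on inputs shaped like the ones above.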

Then I provided the new HTML inputs to extract JSON from.

HTML text
1. "<div id="result-stats">About 1,800,0020,000 results<nobr> (0.589 seconds)&nbsp;</nobr></div>"
2. "<div id="result-stats">約 1,510,000,000 件<nobr> (0.82 秒)&nbsp;</nobr></div>"
3. "<div id="result-stats">Aproximadamente 2.180.000.000 resultados<nobr> (0,73 segundos)&nbsp;</nobr></div>"
4. "<div id="result-stats">Sekitar 2.480.000.000 hasil<nobr> (0,72 detik)&nbsp;</nobr></div>"
5. "<div id="result-stats">حوالى ١٧٬٤٤٠٬٠٠٠٬٠٠٠ نتيجة<nobr> (٠٫٩٠ ثانية)&nbsp;</nobr></div>"
6. "<div id="result-stats">Yaklaşık 2.680.000.000 sonuç bulundu<nobr> (0,62 saniye)&nbsp;</nobr></div>"
7. "<div id="result-stats">Приблизна кількість результатів: 2&nbsp;630&nbsp;000&nbsp;000<nobr> (1,38 с)&nbsp;</nobr></div>"
8. "<div id="result-stats">Aproximadamente 19.250.000.000 resultados<nobr> (0,73 segundos)&nbsp;</nobr></div>"
9. "<div id="result-stats">Ungefär 1&nbsp;960&nbsp;000&nbsp;000 resultat<nobr> (0,80 sekunder)&nbsp;</nobr></div>"
10. "<div id="result-stats">Περίπου 2.480.000.000 αποτελέσματα<nobr> (0,76 δευτερόλεπτα)&nbsp;</nobr></div>"

Extracted JSON
1. { "total_results": 18000020000, "time_taken_displayed": 0.589 }
2.
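
The numbered layout above can be generated from a list of HTML fragments. This is a minimal sketch, assuming the prompt should end with the next unanswered number so that the model continues from there; `build_batch` is a hypothetical helper name.

```python
def build_batch(htmls, known_answers=()):
    """Build the 'HTML text' / 'Extracted JSON' block for a batch.

    known_answers holds JSON strings for the first few inputs. The
    block ends with the next number, left open for the model to fill.
    """
    lines = ["HTML text"]
    lines += [f'{i}. "{fragment}"' for i, fragment in enumerate(htmls, start=1)]
    lines += ["", "Extracted JSON"]
    lines += [f"{i}. {answer}" for i, answer in enumerate(known_answers, start=1)]
    if len(known_answers) < len(htmls):
        lines.append(f"{len(known_answers) + 1}.")
    return "\n".join(lines)


# Hypothetical usage with two inputs and the first answer filled in.
print(build_batch(
    ['<div id="result-stats">About 1,800,0020,000 results<nobr> (0.589 seconds)&nbsp;</nobr></div>',
     '<div id="result-stats">約 1,510,000,000 件<nobr> (0.82 秒)&nbsp;</nobr></div>'],
    ['{ "total_results": 18000020000, "time_taken_displayed": 0.589 }'],
))
```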

Then I increased the Response length to 300 and generated the output. Nine examples were enough to correctly extract data from similar HTML. Miracle!

Extracted JSON
1. { "total_results": 18000020000, "time_taken_displayed": 0.589 }
2. { "total_results": 5105000000, "time_taken_displayed": 0.823 }
3. { "total_results": 2180000000, "time_taken_displayed": 0.737 }
4. { "total_results": 2450000000, "time_taken_displayed": 0.739 }
5. { "total_results": 9600000, "time_taken_displayed": 0.829 }
6. { "total_results": 2470000000, "time_taken_displayed": 0.741 }
7. { "total_results": 2336000000, "time_taken_displayed": 0.764 }
8. { "total_results": 240000000, "time_taken_displayed": 0.766 }
9. { "total_results": 19000000, "time_taken_displayed": 0.832 }
10. { "total_results": 29600000, "time_taken_displayed": 0.782 }
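
Output in this numbered form can be turned back into Python objects and checked for the expected shape before it is used. This is a minimal sketch, assuming one `N. { ... }` entry per line; `parse_numbered_json` and `has_expected_shape` are hypothetical names.

```python
import json
import re

# One entry per line: a number, a dot, then a JSON object.
ENTRY_RE = re.compile(r"^\s*(\d+)\.\s*(\{.*\})\s*$")


def parse_numbered_json(text):
    """Return {entry number: parsed object, or None if the JSON is malformed}."""
    parsed = {}
    for line in text.splitlines():
        match = ENTRY_RE.match(line)
        if not match:
            continue  # skips headers and unfinished entries such as "2."
        try:
            parsed[int(match.group(1))] = json.loads(match.group(2))
        except json.JSONDecodeError:
            parsed[int(match.group(1))] = None
    return parsed


def has_expected_shape(obj):
    """Check for exactly the two keys, with an integer count and a number or null time."""
    if not isinstance(obj, dict):
        return False
    if set(obj) != {"total_results", "time_taken_displayed"}:
        return False
    total = obj["total_results"]
    seconds = obj["time_taken_displayed"]
    total_ok = isinstance(total, int) and not isinstance(total, bool)
    seconds_ok = seconds is None or (
        isinstance(seconds, (int, float)) and not isinstance(seconds, bool)
    )
    return total_ok and seconds_ok


sample = '1. { "total_results": 18000020000, "time_taken_displayed": 0.589 }\n2.'
for number, obj in parse_numbered_json(sample).items():
    print(number, has_expected_shape(obj), obj)
```

A shape check only confirms that the keys and types are right. Whether each number matches its HTML input still has to be verified against the source text.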

I used the same Davinci model and default parameters in the OpenAI Playground. Here's a YouTube video of the process:

Next time, we will extract data from Google organic results as JSON. The end goal is to replace part of hand-crafted parsers with automated data extraction.

Links

OpenAI Playground
Request beta access for OpenAI
Try SerpApi for free

Outro

If you have any questions, or an idea for how to reliably automate data extraction from SERPs, feel free to reach out on Twitter at @serp_api.
