DEV Community πŸ‘©β€πŸ’»πŸ‘¨β€πŸ’»

Cover image for 13 ways to scrape any public data from any website
Dmitriy Zub β˜€οΈ for SerpApi

Posted on • Updated on • Originally published at serpapi.com

13 ways to scrape any public data from any website

πŸ“ƒTable of Contents

Intro

This ongoing blog post is about understanding ways of extracting data from any website, either if the website is static/fully JavaScript driven or if reverse engineering needs to be applied.

After this blog post, you can understand and apply this knowledge to create structured data from a messy data from any website.

At SerpApi we're extracting data from all sorts of websites and each one of them requires a specific approach that will able us to parse data fast and efficient, and most importantly without browser automation. We want to share a little of our knowledge gained while building our APIs.

πŸ“Œ Some of the techniques will not be mentioned intentionally in order not to disclose some of the methods we use.

Shown tips should be plenty enough to get you going or understand where to potentially look to solve the problem.

🐍 This blog post uses Python as language to show code examples.

Browser Dev Tools

Dev tools is possibly one of very first thing that needs to be checked before writing actual code.

I myself, wrote quite a lot of code before looking at dev tools with realization that instead of parsing data, I can made a simple HTTP request, either GET or POST to website API/server and get a JSON response with all the necessary data that being already structured.

Under dev tools, you can find a Elements, Network and Source/Application tabs.

The follow-up sections will go from easiest to hardest methods with tools that can be used to make the job done.

image

CSS Selectors or/and XPath

CSS Selectors

CSS selectors are patterns used to select match the element(s) you want to style extract from HTML page.

While for building websites CSS selectors help to style the website, in webscraping those selectors used to match certain HTML nodes (elements) from which we want data to be extracted.

We've covered types of selectors with actual code examples in the web scraping with CSS selectors blog post and which selector works best for a particular task, and how to test them in the browser.

Type of Selectors

There're a few often used type of selectors:

Selector Type Syntax Purpose
Type Selectors element_name Matches HTML node based on element
Class Selectors .class_name Matches HTML node based on class
ID Selectors #id_value Matches HTML node based on ID
Attribute Selectors [attribute=attribute_value]/[attribute] Extracts attribute data from HTML node
Selector List element, element, element, .../selector, selector, selector, ... Matches all nodes based on given selector(s). Great for handling different HTML layouts
Descendant combinator selector1 selector2 Matches data from a second selector if they have an ancestor

XPath

XPath is a query language that used to navigate DOM and could be is more flexible than CSS selectors.

It's mostly helpful when HTML element doesn't have a selector name, or has a very weird location in a messy HTML. XPath uses expressions to create a query:

//div[@class='content']//text()[not(ancestor::div/@class[contains(., 'code')])][normalize-space()]
Enter fullscreen mode Exit fullscreen mode

SelectorGadget Chrome Extension

SelectorGadget extension allows to quickly grab CSS selector(s) by clicking on desired element in your browser. It returns a CSS selector(s).

Example of selecting certain HTML elements and returned CSS selector(s):

GIF

We've shown how SelectorGadget works in a separate web scraping with CSS selectors blog post.

Scraping from Meta Elements

Selectors are not the only place where valuable data could be extracted. Some of the data can be directly in the HTML <meta> elements. <meta> element loads no matter if the page is static or dynamic.

For example, on Medium you can extract basic profile info from <meta> elements:

import requests
from bs4 import BeautifulSoup

html = requests.get('https://medium.com/@dimitryzub').text
soup = BeautifulSoup(html, 'html.parser')

description = soup.select('meta')[13]['content']
link = soup.select('meta')[16]['content']
profile_picture = soup.select('meta')[21]['content']
print(description, link, profile_picture, sep='\n')

'''
Read writing from Dmitriy Zub β˜€οΈ on Medium. Developer Advocate at SerpApi. I help to make structured data from a pile of mess.
https://dimitryzub.medium.com
https://miro.medium.com/max/2400/1*he-C8WKaIjHIe58F4aMyhw.jpeg
'''
Enter fullscreen mode Exit fullscreen mode
Code Explanation
select() to select all 'meta' elemets
[<number>] is to access a certain index number from a list of extracted elements (nodes)
['content'] to extract attribute from the HTML element (node)

Scraping From Tables

Scraping tables is an additional separate thing that can be done either with parsel or bs4 web scraping libraries. However, pandas simplifies this task a lot by providing a read_html() method that can parse data from the <table>.

Installation:

$ pip install pandas
Enter fullscreen mode Exit fullscreen mode

A basic example of extracting table data from Wikipedia:

import pandas as pd

table = pd.read_html('https://en.wikipedia.org/wiki/History_of_Python')[0] # [0] = first table
df = pd.DataFrame(data=table[['Latest micro version', 'Release date']])    # grabs 2 columns
# df.set_index('Latest micro version', inplace=True)                       # drops default pandas DataFrame indexes, but can't be used in a for loop

print(df)

for data in df['Latest micro version']:
    print(data)
Enter fullscreen mode Exit fullscreen mode

Outputs:

                                 Latest micro version                                       Release date
0                                            0.9.9[2]                                      1991-02-20[2]
1                                            1.0.4[2]                                      1994-01-26[2]
2                                            1.1.1[2]                                      1994-10-11[2]
3                                                 NaN                                      1995-04-13[2]
4                                                 NaN                                      1995-10-13[2]
5                                                 NaN                                      1996-10-25[2]
6                                           1.5.2[42]                                      1998-01-03[2]
7                                           1.6.1[42]                                     2000-09-05[43]
8                                           2.0.1[44]                                     2000-10-16[45]
9                                           2.1.3[44]                                     2001-04-15[46]
10                                          2.2.3[44]                                     2001-12-21[47]
11                                          2.3.7[44]                                     2003-06-29[48]
12                                          2.4.6[44]                                     2004-11-30[49]
13                                          2.5.6[44]                                     2006-09-19[50]
14                                          2.6.9[27]                                     2008-10-01[27]
15                                         2.7.18[32]                                     2010-07-03[32]
16                                          3.0.1[44]                                     2008-12-03[27]
17                                          3.1.5[52]                                     2009-06-27[52]
18                                          3.2.6[54]                                     2011-02-20[54]
19                                          3.3.7[55]                                     2012-09-29[55]
20                                         3.4.10[56]                                     2014-03-16[56]
21                                         3.5.10[58]                                     2015-09-13[58]
22                                         3.6.15[60]                                     2016-12-23[60]
23                                         3.7.13[61]                                     2018-06-27[61]
24                                         3.8.13[62]                                     2019-10-14[62]
25                                         3.9.14[63]                                     2020-10-05[63]
26                                         3.10.7[65]                                     2021-10-04[65]
27                                      3.11.0rc2[66]                                     2022-10-24[66]
28                                                NaN                                        2023-10[64]
29                                            Legend:  Old versionOlder version, still maintainedLate...
30  Italics indicates the latest micro version of ...  Italics indicates the latest micro version of ...
Enter fullscreen mode Exit fullscreen mode

for loop output:

0.9.9[2]
1.0.4[2]
1.1.1[2]
nan
nan
nan
1.5.2[42]
1.6.1[42]
2.0.1[44]
2.1.3[44]
2.2.3[44]
2.3.7[44]
2.4.6[44]
2.5.6[44]
2.6.9[27]
2.7.18[32]
3.0.1[44]
3.1.5[52]
3.2.6[54]
3.3.7[55]
3.4.10[56]
3.5.10[58]
3.6.15[60]
3.7.13[61]
3.8.13[62]
3.9.14[63]
3.10.7[65]
3.11.0rc2[66]
nan
Legend:
Italics indicates the latest micro version of currently supported versions as of 2022-07-11[needs update].
Enter fullscreen mode Exit fullscreen mode

Keep in mind that those are just examples and additional data cleaning needs to be applied to make this data usable πŸ™‚

Have a look at the gotchas that could happen with read_html():

image

Scraping with Regular Expression

Scraping with regular expression in Python is possible by re module.

Why scrape data with regular expressions in the first place?

  1. if the HTML structure is very, very messy.
  2. if there are not CSS selectors and XPath didn't work also.
  3. if the data you want is directly in the text string.
  4. similar reasons to mention above.

There're a few main methods that could be used:

Method Purpose
re.findall() Returns a list of matches. To find all occurrences.
re.search() Returns a first match. To find the first occurrence.
re.match() To find match and the beginning of the string. search() vs match()
group() Returns one or more subgroups of the match.
import re

dummy_text = '''
Contrary to popular belief, Lorem Ipsum is not simply random text. 
It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. 
Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure
Latin words, consectetur, from a Lorem Ipsum passage, and going through the cites of the word in classical literature,
discovered the undoubtable source. Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of 
"de Finibus Bonorum et Malorum" (The Extremes of Good and Evil) by Cicero, written in 45 BC. 
This book is a treatise on the theory of ethics, very popular during the Renaissance.
The first line of Lorem Ipsum, "Lorem ipsum dolor sit amet..", 
comes from a line in section 1.10.32.
'''

dates = re.findall(r'\d{1}\.\d{2}\.\d{2}', dummy_text) # https://regex101.com/r/VKYiA9/1
years_bc = re.findall(r'\d+\s?\bBC\b', dummy_text)     # https://regex101.com/r/ApypoB/1

print(dates)
print(years_bc)

# ['1.10.32', '1.10.33', '1.10.32']
# ['45 BC', '45 BC']
Enter fullscreen mode Exit fullscreen mode

Here's a visualization of what is being matched by the regular expressions above:

image

The above regular expressions read like this:

image

Python Web Scraping Tools

In this section, we'll go over most popular Python web scraping tools that can extract data from static and dynamic websites.

Python Parsing Libraries

There're a few Python web scraping packages/libraries to parse data from websites that are not JavaScript driven as such packages are designed to scrape data from static pages.

Parsel

Parsel is a library built to extract data from XML/HTML documents with XPath and CSS selectors support, and could be combined with regular expressions. It uses lxml parser under the hood by default.

The great thing I like about parsel (apart from XPath support) is that it returns None if certain data is not present, so there's no need to create a lot of try/except blocks to the same thing that looks ugly.

Installation:

$ pip install parsel
Enter fullscreen mode Exit fullscreen mode

A few examples of extraction methods:

variable.css(".X5PpBb::text").get()                    # returns a text value
variable.css(".gs_a").xpath("normalize-space()").get() # https://github.com/scrapy/parsel/issues/192#issuecomment-1042301716
variable.css(".gSGphe img::attr(srcset)").get()        # returns a attribute value
variable.css(".I9Jtec::text").getall()                 # returns a list of strings values
variable.xpath('th/text()').get()                      # returns text value using xpath 
Enter fullscreen mode Exit fullscreen mode
Code Explanation
css() parse data from the passed CSS selector(s). Every CSS query traslates to XPath using csselect package under the hood.
::text or ::attr(<attribute>) extract textual or attribute data from the node.
get() get actual data returned from parsel
getall() get all a list of matches.
.xpath('th/text()') grabs textual data from <th> element

Practical example using parsel:

# https://serpapi.com/blog/scrape-naver-related-search-results-with-python/#full_code

import requests, json
from parsel import Selector

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "query": "minecraft",  # search query
    "where": "web"         # web results. works with nexearch as well
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36"
}

html = requests.get("https://search.naver.com/search.naver", params=params, headers=headers, timeout=30)
selector = Selector(html.text)

related_results = []

# https://www.programiz.com/python-programming/methods/built-in/enumerate
for index, related_result in enumerate(selector.css(".related_srch .keyword"), start=1):
    keyword = related_result.css(".tit::text").get().strip()
    link = f'https://search.naver.com/search.naver{related_result.css("a::attr(href)").get()}'

    related_results.append({
        "position": index,    # 1,2,3..
        "title": keyword,
        "link": link
    })

print(json.dumps(related_results, indent=2, ensure_ascii=False))
Enter fullscreen mode Exit fullscreen mode

BeautifulSoup

BeautifulSoup is also a library that build to extract data from HTML/XML documents. It's also could be combined with lxml parser and also can be used in combo with regular expressions.

Unlike parsel, BeautifulSoup don't have support for XPath which would be pretty handy is some situations. Additionally, it lacks getall() method that returns a list of matches which is shorthand of list comprehension, and it needs a lot of try/except to handle missing data.

However, it can create new HTML nodes, for example, using wrap() method or other methods for similar things.

It's very handy if parts of the data you want to extract not properly structured e.g. HTML table without <table>...</table> element.

You can create this element and then easily parse table data with pandas read_html() method.

Installation:

$ pip install bs4
Enter fullscreen mode Exit fullscreen mode

A few examples of extraction methods using select() and select_one():

variable.select('.gs_r.gs_or.gs_scl')    # return a list of matches
variable.select_one('.gs_rt').text       # returns a single text value match
variable.select_one('.gs_rt a')['href']  # returns a single attribute value match
Enter fullscreen mode Exit fullscreen mode

Practical example using BeautifulSoup:

# https://stackoverflow.com/a/71237540/15164646

from bs4 import BeautifulSoup
import requests, lxml

params = {
    "user": "VxOmZDgAAAAJ", # user-id, in this case Masatoshi Nei
    "hl": "en",             # language
    "gl": "us",             # country to search from
    "cstart": 0,            # articles page. 0 is the first page
    "pagesize": "100"       # articles per page
}

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36"
}

all_articles = []

html = requests.post("https://scholar.google.com/citations", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")

for index, article in enumerate(soup.select("#gsc_a_b .gsc_a_t"), start=1):
    article_title = article.select_one(".gsc_a_at").text
    article_link = f'https://scholar.google.com{article.select_one(".gsc_a_at")["href"]}'
    article_authors = article.select_one(".gsc_a_at+ .gs_gray").text
    article_publication = article.select_one(".gs_gray+ .gs_gray").text

    all_articles.append({
        "title": article_title,
        "link": article_link,
        "authors": article_authors,
        "publication": article_publication
    })
Enter fullscreen mode Exit fullscreen mode

Python Browser Automation

Browser automation is handy when you need to perform some sort of interaction with the webiste, for example, scrolling, clicks and similar things.

Such things could be done without browser automation, this is how we tend to do at SerpApi, however, it could be very complicated but on the flip side, the reward is much faster data extraction.

Playwright

playwright is a modern alternative to selenium. It can perform pretty much all interactions as user would do i.e clicks, scrolls and many more.

Installation:

$ pip install playwright
Enter fullscreen mode Exit fullscreen mode

A practical example of website interaction using playwright and parsel to extract the data.

The following script scrolls through all Google Play app reviews and then extract data:

# https://serpapi.com/blog/scrape-all-google-play-app-reviews-in-python/#full_code

import time, json, re
from parsel import Selector
from playwright.sync_api import sync_playwright


def run(playwright):
    page = playwright.chromium.launch(headless=True).new_page()
    page.goto("https://play.google.com/store/apps/details?id=com.collectorz.javamobile.android.books&hl=en_GB&gl=US")

    user_comments = []

    # if "See all reviews" button present
    if page.query_selector('.Jwxk6d .u4ICaf button'):
        print("the button is present.")

        print("clicking on the button.")
        page.query_selector('.Jwxk6d .u4ICaf button').click(force=True)

        print("waiting a few sec to load comments.")
        time.sleep(4)

        last_height = page.evaluate('() => document.querySelector(".fysCi").scrollTop')  # 2200

        while True:
            print("scrolling..")
            page.keyboard.press("End")
            time.sleep(3)

            new_height = page.evaluate('() => document.querySelector(".fysCi").scrollTop')

            if new_height == last_height:
                break
            else:
                last_height = new_height

    selector = Selector(text=page.content())
    page.close()

    print("done scrolling. Extracting comments...")
    for index, comment in enumerate(selector.css(".RHo1pe"), start=1):
        user_comments.append({
            "position": index,
            "user_name": comment.css(".X5PpBb::text").get(),
            "app_rating": re.search(r"\d+", comment.css(".iXRFPc::attr(aria-label)").get()).group(),
            "comment_date": comment.css(".bp9Aid::text").get(),
            "developer_comment": {
                "dev_title": comment.css(".I6j64d::text").get(),
                "dev_comment": comment.css(".ras4vb div::text").get(),
                "dev_comment_date": comment.css(".I9Jtec::text").get()
            }
        })

    print(json.dumps(user_comments, indent=2, ensure_ascii=False))


with sync_playwright() as playwright:
    run(playwright)
Enter fullscreen mode Exit fullscreen mode

Selenium

selenium is very similar to playwright but a little older with slightly different approaches of doing things.

$ pip install selenium
Enter fullscreen mode Exit fullscreen mode

The following script performs a scroll until hits the very bottom of Google Play search and then extracts each section with games:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from parsel import Selector
import json, time

google_play_games = {
    'Top charts': {
        'Top free': [],
        'Top grossing': [],
        'Top paid': []
    },
}

def scroll_page(url):
    service = Service(ChromeDriverManager().install())

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    options.add_argument("--lang=en")
    options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36")

    driver = webdriver.Chrome(service=service, options=options)
    driver.get(url)

    while True:
        try:
            scroll_button = driver.find_element(By.CSS_SELECTOR, '.snByac')
            driver.execute_script("arguments[0].click();", scroll_button)
            WebDriverWait(driver, 10000).until(EC.visibility_of_element_located((By.TAG_NAME, 'body')))
            break
        except:
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            WebDriverWait(driver, 10000).until(EC.visibility_of_element_located((By.TAG_NAME, 'body')))

    selector = Selector(driver.page_source)
    driver.quit()

    return selector

def scrape_all_sections(selector):  
    for section in selector.css('.Ubi8Z section'):
        section_title = section.css('.kcen6d span::text').get()
        time.sleep(2)
        google_play_games[section_title] = []

        for game in section.css('.TAQqTe'):
            title = game.css('.OnEJge::text').get()
            link = 'https://play.google.com' + game.css('::attr(href)').get()
            category = game.css('.ubGTjb .sT93pb.w2kbF:not(.K4Wkre)::text').get()
            rating = game.css('.CKzsaf .w2kbF::text').get()
            rating = float(rating) if rating else None

            google_play_games[section_title].append({
                'title': title,
                'link': link,
                'category': category,
                'rating': rating,
            })

    print(json.dumps(google_play_games, indent=2, ensure_ascii=False))


def scrape_google_play_games():
    params = {
        'device': 'phone',  
        'hl': 'en_GB',      # language 
        'gl': 'US',         # country of the search
    }

    URL = f"https://play.google.com/store/games?device={params['device']}&hl={params['hl']}&gl={params['gl']}"

    result = scroll_page(URL)
    scrape_all_sections(result)


if __name__ == "__main__":
    scrape_google_play_games()
Enter fullscreen mode Exit fullscreen mode

Python Web Scraping Frameworks

Scrapy

scrapy is a high-level webscraping framework designed to scrape data at scale and can be used to create a whole ETL pipeline.

However, you have to keep in mind that it's bulky, and could be quite confusing, and while it provides a lot of things for you, most of those things you may not need.

Installation:

$ pip install scrapy
Enter fullscreen mode Exit fullscreen mode

Very simple scrapy script:

import scrapy


class ScholarAuthorTitlesSpider(scrapy.Spider):
    name = 'google_scholar_author_titles'

    def scrapy_request(self):
        params = {
            "user": "cp-8uaAAAAAJ",  # user-id
            "hl": "en",              # language
            "gl": "us",              # country to search from
            "cstart": 0,             # articles page. 0 is the first page
            "pagesize": "100"        # articles per page
        }

        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36"
        }

        yield scrapy.Request(url="https://scholar.google.com/citations", method="GET", headers=headers, meta=params, callback=self.parse)

    def parse(self, response):
        for index, article in enumerate(response.css("#gsc_a_b .gsc_a_t"), start=1):
            yield {
                "position": index,
                "title": article.css(".gsc_a_at::text").get(),
                "link": f'https://scholar.google.com{article.css(".gsc_a_at::attr(href)").get()}',
                "authors": article.css(".gsc_a_at+ .gs_gray::text").get(),
                "publication": article.css(".gs_gray+ .gs_gray::text").get()
            }
Enter fullscreen mode Exit fullscreen mode

XHR Requests

XHR request allows to talk to the server by making a request and getting data back in response. It's one of the first things that you can check before writing actual code. Those requests can be used to get data directly from website's "source" without the need to use parsing libraries/frameworks.

To find a certain XHR request you need:

  1. Open browser dev tools (F12)
  2. Network
  3. Fetch/XHR
  4. Refresh page as data may come on page update.
  5. Click through every request and see if there's any data you want.

If you find the request with the data you want, you can preview the data (example from the Poocoin.app):

image

How extract data from XHR request

When making XHR request, we need to pass URL parameters that server can understand and "reply" to us.

Here's a simple illustration of that process:

image

To find those headers and URL query parameters, we need to go to the certain URL and look at Headers and Payload tabs and make sure we see what request method is used (GET, POST, etc).

We can do it like so:

  1. Copy URL as CURL (Bash) and use it with online CURL runner or tools such as Insomnia.
  2. Copy request URL under headers tab.

image

From Insomnia (URL copied from the XHR->Headers tab).

πŸ“ŒKeep in mind that some of the passed URL parameters need to be scraped and passed to the URL beforehand (before making request to the server/api). URL can have some sort of a unique token or something different and can't be worked without it.

image

If the response is successful and you want to make an exact request in the script, those parameters could be automatically generated with tools such as Insomnia (or other alternatives) where you can test different types of requests with different parameters and headers.

image

Simple example but same approach will be on other websites with or without passing URL parameters and headers:

import requests

# https://requests.readthedocs.io/en/latest/user/quickstart/#json-response-content
html = requests.get('https://api.chucknorris.io/jokes/random').json()
print(html['value'])

# Once, due to an engine stall of his F-22 Raptor during a Dessert Storm sorte', Chuck Norris had to ejaculate over the Red Sea.
Enter fullscreen mode Exit fullscreen mode

Page Source

This is the next thing that could be checked after Dev Tools -> XHR. It's about looking at page source and trying to find the data there, that are either hidden in the rendered HTML or can't be scraped with selectors because it's being rendered by JavaScript.

One of the ways to find if there's the data you want is in the inline JSON or not:

  1. select and copy any type of data you want to extract (title, name, etc.)
  2. open page source CTRL + U
  3. find the data CTRL + F, if some of the occurrences will be inside <script> elements then congratulation, you found inline JSON or something similar πŸ™‚

Inline JSON

Here's a visual example of how inline JSON could look like (from the TripAdvisor):

trip-advisor-inline-json-example

When we found that there's a data we want locates in the inline JSON, there're a few ways to extract it:

  • using regular expression to extract parts of the inline JSON.
  • (if needed) once again using regular expression(s) to extract the portion of the data: links, dates, emails, etc.
  • using regular expression to parse inline JSON and covert it to usable JSON that could be used as a Python dict.

Extract inline JSON data with Regex

The following example is from our Google Arts & Culture - Artists blog post where we select() all <script> elements and then extract other data from matched element(s) using regular expressions:

from bs4 import BeautifulSoup
import requests, json, re, lxml

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'
}

params = {
    'tab': 'pop'
}

html = requests.get(f'https://artsandculture.google.com/category/artist', params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, 'lxml') 

# πŸ‘‡πŸ‘‡πŸ‘‡ data extraction
all_script_tags = soup.select('script')

# https://regex101.com/r/JQXUBm/1
portion_of_script_tags = re.search('("stella\.common\.cobject",".*)\[\[\"A",\["stella\.pr","PrefixedAssets', str(all_script_tags)).group(1)

# https://regex101.com/r/c9m9B0/1
authors = re.findall(r'"stella\.common\.cobject","(.*?)","\d+', str(portion_of_script_tags))
Enter fullscreen mode Exit fullscreen mode
Code Explanation
re.findall() Returns a list of matches.
re.search() Returns a first match.
group() Returns one or more subgroups of the match.

Extract inline JSON data with Regex and convert usable JSON

The following example is from one of our Scrape Google Play Store App in Python blog posts.

The flow would be to also parse <script> elements with regex and then turn it into dict using json.loads():

json.loads(
    re.findall(
        r"<script nonce=\"\w+\" type=\"application/ld\+json\">({.*?)</script>", # regular expression
        str(soup.select("script")[11]),                                         # input from where to search data
        re.DOTALL,                                                              # match any character: https://docs.python.org/3/library/re.html#re.DOTALL
    )[0]                                                                        # access `list` from re.finall()
)                                                                               # convert to `dict` using json.loads()
Enter fullscreen mode Exit fullscreen mode

After that, we can access it as a dictionary:

app_data["basic_info"]["name"] = basic_app_info.get("name")
app_data["basic_info"]["type"] = basic_app_info.get("@type")
app_data["basic_info"]["url"] = basic_app_info.get("url")
Enter fullscreen mode Exit fullscreen mode

Full example:

from bs4 import BeautifulSoup
import requests, lxml, re, json

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"
}

params = {
    "id": "com.nintendo.zara",     # app name
    "gl": "US",                    # country of the search
    "hl": "en_GB"                  # language of the search
}

# make a request and pass response to BeautifulSoup
html = requests.get("https://play.google.com/store/apps/details", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")

# where all app data will be stored
app_data = {
    "basic_info": {}
}


# πŸ‘‡πŸ‘‡πŸ‘‡ data extraction
# [11] index is a basic app information
# https://regex101.com/r/zOMOfo/1
basic_app_info = json.loads(re.findall(r"<script nonce=\"\w+\" type=\"application/ld\+json\">({.*?)</script>", str(soup.select("script")[11]), re.DOTALL)[0])

app_data["basic_info"]["name"] = basic_app_info.get("name")
app_data["basic_info"]["type"] = basic_app_info.get("@type")
app_data["basic_info"]["url"] = basic_app_info.get("url")
Enter fullscreen mode Exit fullscreen mode

Reverse engineering & Debugging

The great examples of reverse engineering at our blog:

Make sure to check them both as here we're not going to duplicate the same information.

πŸ“ŒInformation about Source and Application tabs is kinda introductory information as it's a big topics with a lot of steps to reproduce and it will be out of the scope of this blog post.

Sources tab

One of the approaches, when something complex needs to be extracted, could be done under the Source tab.

It could be done by debugging website JS source code from certain files with debugger breakpoints (Dev tools -> sources -> debugger) by trying to trace what is going on in the code and how can we intercept/create by ourselves data and use it the parser.

image

We've also done a SerpApi Podcast about scraping dynamic websites which is essentially the topic about using Source tab.

In this video Ilya told about his approach about investigating and reproducing Google Maps data URL parameter which defines data about certain Google Maps place. This parameter hold data about place id, latitude and longitude.

Application tab

A similar thing could be done in the Dev tools -> Application tab where we see, for example, cookies data and either intercept it on reverse engineer it by understanding how this cookie was built.

image

Ilya, one of the engineers at SerpApi has written in more detail about reverse engineering Location cookies from Walmart and his approach for such task.

Join us on Twitter | YouTube

Top comments (4)

Collapse
 
santiagojordi profile image
Jordi

Thank you very much sincerely, gpt would never be able to took yr job! lov u!

Collapse
 
dmitryzub profile image
Dmitriy Zub β˜€οΈ

πŸ™‚πŸ™‚πŸ™‚
Thank you, Jordi πŸ‘

Collapse
 
akash211 profile image
AKASH KUMAR

Thank you for so much details. I did not know about Parsel and never tried Playright. Looking at the code, it looks simple. Will try in my next project.

Collapse
 
dmitryzub profile image
Dmitriy Zub β˜€οΈ • Edited on

🧑

There're some of the methods I haven't wrote yet. Since this is an ongoing blog post, it will be recieving updates from time to time πŸ™‚

Yes, parsel is great! Full support of XPath is just yummy! Playwright is also great and straightforward πŸ™‚

Thank you so much!

Become a Moderator Can you help us make DEV a better place?

Fill out this survey and help us by becoming a tag moderator here at DEV.