DEV Community

Dmitriy Zub ☀️

Posted on • Originally published at dimitryzub.Medium

Pagination Techniques to Scrape Data from any Website in Python

Intro

This blog post goes over the most frequent pagination techniques that can be applied to perform dynamic pagination on any website. The post is ongoing and will be updated as new techniques are discovered.

Dynamic vs Hardcoded Pagination

What the heck is dynamic pagination?

Well, it's simply a way to paginate through all available pages without knowing in advance how many there are. The loop just goes through them all:

while True:
    requests.get('<website_url>')
    # data extraction code

    # condition to paginate to the next page or to exit pagination

The hardcoded approach differs by explicitly stating the number of pages N we want to paginate over:

# hardcoded way to paginate from 1 to 25th page
for page_num in range(1, 26):
    requests.get('<website_url>')
    # data extraction code

It's an easy approach if we need to extract data from a known number of pages. But what if we need to extract all pages from several categories, say, on the same website, and each category contains a different number of pages?

The thing is, with the hardcoded approach we'll come to the point where we need to update the page numbers for every category, which is not particularly satisfying :-)
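To make the difference concrete, here is a minimal sketch of a dynamic loop that handles categories with different page counts without hardcoding any of them. The `FAKE_SITE` data and the `fetch_page()` helper are hypothetical stand-ins for real requests, so the snippet runs without network access:

```python
# Hypothetical in-memory "website": category -> pages, each page is a list of items.
FAKE_SITE = {
    "books": [["b1", "b2"], ["b3"]],        # 2 pages
    "games": [["g1"], ["g2"], ["g3"]],      # 3 pages
}


def fetch_page(category, page_num):
    """Stand-in for requests.get(): returns the page items, or [] past the last page."""
    pages = FAKE_SITE[category]
    return pages[page_num - 1] if page_num <= len(pages) else []


def scrape_category(category):
    items, page_num = [], 1
    while True:
        page_items = fetch_page(category, page_num)
        if not page_items:   # exit condition: an empty page means there are no more pages
            break
        items.extend(page_items)
        page_num += 1
    return items


results = {category: scrape_category(category) for category in FAKE_SITE}
print(results)
```

The same `scrape_category()` works for both categories even though one has two pages and the other has three.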

Dynamic pagination exit condition

Let's also stop for a second and look at another difference between the dynamic while True pagination and the for page_num in range(...) approach.

Have you noticed the comment in the dynamic pagination: "condition to paginate to the next page or to exit pagination"?

This means that whenever we use dynamic pagination, we always need some condition to exit the infinite loop. It could be: an element disappeared, the current page no longer differs from the previous one, the height of an element stopped changing, etc.
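As a rough illustration of one such exit condition, the sketch below stops when the current page is identical to the previous one. The `fetch_page()` helper is hypothetical: it imitates a site that keeps returning its last page when the page number is too high. The `max_pages` cap is an extra safety net I added, not something from the original approach:

```python
def fetch_page(page_num):
    """Hypothetical stand-in for a request: this fake site repeats its last page forever."""
    pages = [["a", "b"], ["c", "d"], ["e"]]
    return pages[min(page_num, len(pages)) - 1]


def scrape_all(max_pages=1000):
    collected, previous, page_num = [], None, 1
    while page_num <= max_pages:      # safety cap so a broken condition cannot loop forever
        current = fetch_page(page_num)
        if current == previous:       # exit condition: the page did not change
            break
        collected.extend(current)
        previous = current
        page_num += 1
    return collected


print(scrape_all())
```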

Prerequisites

If you want to follow along, let's create a separate environment first.

If you're on Linux:

python -m venv env && source env/bin/activate

If you're on Windows and using Git Bash:

python -m venv env && source env/Scripts/activate

Next, install the libraries needed to follow along:

$ pip install requests bs4 lxml parsel playwright

  • requests: makes a request to a website.
  • bs4: HTML parser.
  • lxml: parser backend that the bs4 examples select with "lxml".
  • parsel: another HTML parser, faster than bs4, used in a few examples.
  • playwright: modern browser automation.

For playwright, if you're on Linux, you also need to install additional dependencies:

$ sudo apt-get install -y libnss3 libnspr4 libatk1.0-0 libatk-bridge2.0-0 libcups2 libdrm2 libxkbcommon0 libxcomposite1 libxdamage1 libxfixes3 libxrandr2 libgbm1 libatspi2.0-0 libwayland-client0

After that we need to install chromium (or other browsers):

$ playwright install chromium

Types of pagination

There are four common types of pagination:

  1. Token pagination, which uses a unique token.
    • For example: SAOxcijdaf#Ad21
  2. Non-token pagination, which uses digits.
    • For example: 1, 2, 3, 4, 5.
  3. Click pagination.
    • For example: clicking the next page button until the button disappears.
  4. Scroll or JavaScript evaluation pagination.
    • For example: scrolling the page until no more reviews are left. The same can be done by evaluating JS code.

📌 These types of pagination can be combined with one another.

For example, non-token and token pagination values need to be updated at the same time to paginate to the next page. This is how Google Scholar Profile pagination works without using browser automation.

Another example of combined pagination is combining scrolls with clicks as they become needed (when a certain button appears).

Token Pagination

Token pagination is when a website generates a token responsible for retrieving the next page of data. It could look something like this: 27wzAGn-__8J.

This token will most likely be passed as a URL parameter. For example, on the Google Scholar Profiles page it looks like this:

#                                                     ▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼
https://scholar.google.com/citations?mauthors=biology&after_author=27wzAGn-__8J

In some cases this token needs to be combined with other parameters. For example, the Google Scholar Profiles page also has an astart parameter:

#                                                     ▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼
https://scholar.google.com/citations?mauthors=biology&after_author=27wzAGn-__8J&astart=10
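A small sketch of how such a URL could be assembled from its parts with the standard library. The `scholar_profiles_url()` helper is a hypothetical name used only for illustration. The parameter names come from the URLs above and from the `view_op` value used later in this post:

```python
from urllib.parse import urlencode


def scholar_profiles_url(query, after_author=None, astart=0):
    """Build a Google Scholar Profiles URL; the token and offset are added together."""
    params = {"view_op": "search_authors", "mauthors": query}
    if after_author:
        params["after_author"] = after_author   # next page token
        params["astart"] = astart               # offset that goes along with the token
    return "https://scholar.google.com/citations?" + urlencode(params)


print(scholar_profiles_url("biology"))
print(scholar_profiles_url("biology", after_author="27wzAGn-__8J", astart=10))
```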

Dynamic Pagination with Token based Websites

Dynamic pagination on token-based websites works by parsing the next page token out of the current page. In the snippet below, the token is located in the onclick attribute of the next page button.

Here's a code snippet from my StackOverflow answer that paginates through all Google Scholar Profiles in Python:

from bs4 import BeautifulSoup
import requests, lxml, re

params = {
    "view_op": "search_authors", # profiles tab
    "mauthors": "blizzard",      # search query
    "hl": "en",                  # language of the search
    "astart": 0                  # result offset: 0 is the first page, 10 the second
}

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
}

authors_is_present = True
while authors_is_present:

    html = requests.get("https://scholar.google.com/citations", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, "lxml")

    for author in soup.select(".gs_ai_chpr"):
        name = author.select_one(".gs_ai_name a").text
        link = f'https://scholar.google.com{author.select_one(".gs_ai_name a")["href"]}'
        affiliations = author.select_one(".gs_ai_aff").text
        email = author.select_one(".gs_ai_eml").text
        try:
            cited_by = re.search(r"\d+", author.select_one(".gs_ai_cby").text).group() # Cited by 17143 -> 17143
        except AttributeError:  # no "Cited by" element, or no digits in it
            cited_by = None

        print(f"extracting authors at page #{params['astart']}.",
                name,
                link,
                affiliations,
                email,
                cited_by, sep="\n")

    # if the next page button has an `onclick` attribute, extract the next page token from it
    # and increment the `astart` parameter by 10; `.get()` returns None instead of raising
    # KeyError when the attribute is missing
    next_button = soup.select_one("button.gs_btnPR")
    onclick = next_button.get("onclick") if next_button else None
    token = re.search(r"after_author\\x3d(.*)\\x26", onclick) if onclick else None
    if token:
        params["after_author"] = token.group(1)  # -> XB0HAMS9__8J
        params["astart"] += 10
    else:
        authors_is_present = False
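To look at the token extraction step in isolation, here is a minimal sketch that runs without any request. The `onclick` string is a hypothetical sample written to resemble the escaped value the regular expression above expects. The real attribute may look different. `extract_after_author()` is a helper name I made up:

```python
import re

# Hypothetical sample of the button's `onclick` value (note the literal \x3d and \x26 escapes).
onclick = (
    r"window.location='/citations?view_op\x3dsearch_authors\x26hl\x3den"
    r"\x26mauthors\x3dblizzard\x26after_author\x3dXB0HAMS9__8J\x26astart\x3d10'"
)


def extract_after_author(onclick_value):
    """Return the next page token, or None when there is no onclick value or no token in it."""
    if not onclick_value:
        return None
    match = re.search(r"after_author\\x3d(.*?)\\x26", onclick_value)
    return match.group(1) if match else None


print(extract_after_author(onclick))  # XB0HAMS9__8J
print(extract_after_author(None))     # None
```

Returning None for a missing value lets the calling loop use the result directly as its exit condition.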

Non Token Pagination

Non-token pagination is simply when you increment the page number by some number N. It could be incremented by 1, 10 (Google Search), 11 (Bing Search), 100 (Google Scholar Author Articles), or another number depending on how pagination works on a given website.

As mentioned above, non-token pagination could be combined with token in order to perform pagination.

Dynamic Pagination with Non Token based Websites

Identifying whether a website uses non-token pagination is simple. Keep an eye on the URL parameters, see if any of them hold digits, and see if those digits change from page to page.

For example, Google Search has a start parameter:

# first page (no start parameter, or could be manually set to 0, first page)
https://www.google.com/search?q=sushi

# second page                         ▼▼▼▼▼▼▼▼
https://www.google.com/search?q=sushi&start=10

# third page                          ▼▼▼▼▼▼▼▼
https://www.google.com/search?q=sushi&start=20
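The mapping from a page number to the `start` value can be sketched like this. The `google_search_url()` helper and its `results_per_page` argument are hypothetical names for illustration. The step of 10 is taken from the URLs above:

```python
from urllib.parse import urlencode


def google_search_url(query, page_num, results_per_page=10):
    """Build the URL for a 1-based page number; page 1 has no `start` parameter."""
    start = (page_num - 1) * results_per_page
    params = {"q": query}
    if start:
        params["start"] = start
    return "https://www.google.com/search?" + urlencode(params)


for page_num in range(1, 4):
    print(google_search_url("sushi", page_num))
```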

A code example of non-token pagination is Google Search results. The following code scrapes all results from all pages.

from bs4 import BeautifulSoup
import requests, lxml

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "sushi",       # search query
    "hl": "en",         # language
    "gl": "us",         # country of the search, US -> USA
    "start": 0,         # result offset, 0 is the first page
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
}

page_num = 0

while True:
    page_num += 1
    print(f"{page_num} page:")

    html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, 'lxml')

    for result in soup.select(".tF2Cxc"):
        title = f'Title: {result.select_one("h3").text}'
        link = f'Link: {result.select_one("a")["href"]}'
        try:
            description = f'Description: {result.select_one(".VwiC3b").text}'
        except AttributeError:  # the result has no description element
            description = None

        print(title, link, description, sep="\n", end="\n\n")

    # if arrow button with attribute 'pnnext' is present -> paginate
    if soup.select_one('.d6cvqb a[id=pnnext]'):
        params["start"] += 10
    else:
        break

Click Pagination

As the name suggests, click pagination performs a click on an element. It requires browser automation such as playwright or selenium, because these libraries provide a method to click on a given element.

Dynamic Pagination with Clicks

All we need to do is find the button, or whatever element is responsible for moving to the next page, via a CSS selector or XPath.

After that, we call the click() method on it:

# page = playwright.chromium.launch(headless=True).new_page()
# page.goto('<website_url>')
page.query_selector('.Jwxk6d .u4ICaf button').click(force=True)
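A single click is rarely enough, so here is a rough sketch of clicking until the button is gone. `click_until_gone()` is a hypothetical helper. It expects a Playwright sync `page` and only uses `page.query_selector()`, `is_visible()`, `click()` and `page.wait_for_timeout()`. The `max_clicks` cap and the pause length are assumptions to tune per website:

```python
def click_until_gone(page, selector, max_clicks=100, pause_ms=1000):
    """Click the element matching `selector` until it disappears; return the click count."""
    clicks = 0
    while clicks < max_clicks:                 # safety cap against an endless loop
        button = page.query_selector(selector)
        if button is None or not button.is_visible():
            break                              # exit condition: the button is gone
        button.click()
        page.wait_for_timeout(pause_ms)        # give the next chunk of data time to load
        clicks += 1
    return clicks
```

With a real page it could be called as `click_until_gone(page, "<next_button_selector>")` after `page.goto('<website_url>')`.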

Scroll or JavaScript Evaluation

The scroll pagination technique loads more data as the page is scrolled. Scrolls can be either top to bottom or side to side, depending on how a website works.

Dynamic Pagination with Scrolls

There are three methods I frequently use to perform scrolls with either playwright or selenium:

  1. page.keyboard.press(<key>)
  2. page.evaluate(<JS_code>)
  3. page.mouse.wheel(<scrollX>, <scrollY>)

Pressing a keyboard button to perform a scroll down:

# page = playwright.chromium.launch(headless=True).new_page()
# page.goto('<website_url>')
page.keyboard.press('End') # scrolls to possible end of the page

Evaluating JavaScript code to perform a scroll down:

# page = playwright.chromium.launch(headless=True).new_page()
# page.goto('<website_url>')
page.evaluate("""let scrollingElement = (document.scrollingElement || document.body);
              scrollingElement.scrollTop = scrollingElement.scrollHeight;""")
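The third method from the list, the mouse wheel, could look roughly like this. `wheel_scroll()` is a hypothetical helper around Playwright's `page.mouse.wheel(delta_x, delta_y)`. The step count, distance and pause are assumptions and depend on the website:

```python
def wheel_scroll(page, steps=5, delta_y=1000, pause_ms=500):
    """Scroll down `steps` times with the mouse wheel, pausing between scrolls."""
    for _ in range(steps):
        page.mouse.wheel(0, delta_y)      # (delta_x, delta_y) in pixels
        page.wait_for_timeout(pause_ms)   # give lazy-loaded content time to appear
    return steps * delta_y                # total vertical distance requested
```

A positive `delta_y` scrolls down. To scroll sideways, pass the distance as the first argument to `page.mouse.wheel()` instead.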

📌 We have to keep in mind that whenever we use scroll pagination, we always need a condition that compares the scroll position (or height) of a certain element before and after the scroll.

If the value is the same before and after the scroll, that is a signal that there is nothing left to scroll through and the loop can exit:

# requires `import time`; `.fysCi` is the element being scrolled
last_height = page.evaluate('() => document.querySelector(".fysCi").scrollTop')  # 2200

while True:
    print("scrolling..")
    page.keyboard.press("End")
    time.sleep(3)

    new_height = page.evaluate('() => document.querySelector(".fysCi").scrollTop') # 2800

    if new_height == last_height:
        break  # position did not change, so the end is reached
    else:
        last_height = new_height
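The same before/after check can be wrapped in a reusable function that does not depend on a particular browser library. This is only a sketch: `scroll_until_stable()`, its two callables and the `FakeContainer` class are hypothetical names, and the fake container exists just to make the snippet runnable without a browser:

```python
import time


def scroll_until_stable(get_position, scroll, pause=0.0, max_rounds=200):
    """Scroll until the position stops changing; return how many scrolls made progress."""
    last = get_position()
    for rounds in range(max_rounds):      # safety cap against an endless loop
        scroll()
        time.sleep(pause)                 # wait for new content to load
        new = get_position()
        if new == last:                   # exit condition: nothing moved
            return rounds
        last = new
    return max_rounds


class FakeContainer:
    """Stand-in for a scrollable element that moves 600px per scroll, up to `max_top`."""

    def __init__(self, max_top):
        self.top, self.max_top = 0, max_top

    def scroll(self):
        self.top = min(self.top + 600, self.max_top)


container = FakeContainer(max_top=1500)
progress = scroll_until_stable(lambda: container.top, container.scroll)
print(progress, container.top)
```

With Playwright, `get_position` could be a lambda around the `page.evaluate(...)` call above, and `scroll` a lambda around `page.keyboard.press("End")`, with `pause` set to a few seconds.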

Here's a complete example from one of my blog posts, with a step-by-step explanation, about scraping all Google Play App Reviews in Python. It uses a click, a keyboard press, and evaluate to check the current scroll position:

import time, json, re
from parsel import Selector
from playwright.sync_api import sync_playwright


def run(playwright):
    page = playwright.chromium.launch(headless=True).new_page()
    page.goto("https://play.google.com/store/apps/details?id=com.collectorz.javamobile.android.books&hl=en_GB&gl=US")

    user_comments = []

    # if "See all reviews" button present
    if page.query_selector('.Jwxk6d .u4ICaf button'):
        print("the button is present.")

        print("clicking on the button.")
        page.query_selector('.Jwxk6d .u4ICaf button').click(force=True)

        print("waiting a few sec to load comments.")
        time.sleep(4)

        last_height = page.evaluate('() => document.querySelector(".fysCi").scrollTop')  # 2200

        while True:
            print("scrolling..")
            page.keyboard.press("End")
            time.sleep(3)

            new_height = page.evaluate('() => document.querySelector(".fysCi").scrollTop')

            if new_height == last_height:
                break
            else:
                last_height = new_height

    selector = Selector(text=page.content())
    page.close()

    print("done scrolling. Extracting comments...")
    for index, comment in enumerate(selector.css(".RHo1pe"), start=1):

        comment_likes = comment.css(".AJTPZc::text").get()   

        user_comments.append({
            "position": index,
            "user_name": comment.css(".X5PpBb::text").get(),
            "user_avatar": comment.css(".gSGphe img::attr(srcset)").get().replace(" 2x", ""),
            "user_comment": comment.css(".h3YV2d::text").get(),
            "comment_likes": comment_likes.split("people")[0].strip() if comment_likes else None,
            "app_rating": re.search(r"\d+", comment.css(".iXRFPc::attr(aria-label)").get()).group(),
            "comment_date": comment.css(".bp9Aid::text").get(),
            "developer_comment": {
                "dev_title": comment.css(".I6j64d::text").get(),
                "dev_comment": comment.css(".ras4vb div::text").get(),
                "dev_comment_date": comment.css(".I9Jtec::text").get()
            }
        })

    print(json.dumps(user_comments, indent=2, ensure_ascii=False))


with sync_playwright() as playwright:
    run(playwright)

An example from another blog post of mine that shows how to scrape all Naver Video results:

from playwright.sync_api import sync_playwright
import json

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto("https://search.naver.com/search.naver?where=video&query=minecraft")

    video_results = []

    not_reached_end = True
    while not_reached_end:
        page.evaluate("""let scrollingElement = (document.scrollingElement || document.body);
                                 scrollingElement.scrollTop = scrollingElement.scrollHeight;""")

        if page.locator("#video_max_display").is_visible():
            not_reached_end = False

    for index, video in enumerate(page.query_selector_all(".video_bx"), start=1):
        title = video.query_selector(".text").inner_text()
        link = video.query_selector(".info_title").get_attribute("href")
        thumbnail = video.query_selector(".thumb_area img").get_attribute("src")
        channel = None if video.query_selector(".channel") is None else video.query_selector(".channel").inner_text()
        origin = video.query_selector(".origin").inner_text()
        video_duration = video.query_selector(".time").inner_text()
        views = video.query_selector(".desc_group .desc:nth-child(1)").inner_text()
        date_published = None if video.query_selector(".desc_group .desc:nth-child(2)") is None else \
            video.query_selector(".desc_group .desc:nth-child(2)").inner_text()

        video_results.append({
            "position": index,
            "title": title,
            "link": link,
            "thumbnail": thumbnail,
            "channel": channel,
            "origin": origin,
            "video_duration": video_duration,
            "views": views,
            "date_published": date_published
        })

    print(json.dumps(video_results, indent=2, ensure_ascii=False))

    browser.close()

In the Naver pagination example, have a look at the if condition that exits the infinite loop:

if page.locator("#video_max_display").is_visible():
    not_reached_end = False

Conclusion

  1. Keep an eye on URL parameters. If something changes when pagination is performed, it could be a sign that those parameters can be used to paginate programmatically.
  2. Try to find the next page tokens in the page source.
  3. If nothing can be found from the points above, use either click or scroll pagination, or both.

Hope you found it useful. Let me know if something is still confusing.
