Dmitriy Zub ☀️

Posted on • Originally published at serpapi.com

Web Scraping All Google Play App Reviews in Python

What will be scraped

image

Prerequisites

Basic knowledge of scraping with CSS selectors

CSS selectors declare which part of the markup a style applies to, which makes it possible to extract data from matching tags and attributes.

If you haven't scraped with CSS selectors before, there's a dedicated blog post of mine about how to use CSS selectors when web scraping. It covers what they are, their pros and cons, and why they matter from a web-scraping perspective.

Separate virtual environment

In short, a virtual environment is an isolated set of installed libraries (and, optionally, a specific Python version). Several such environments can coexist on the same system, which prevents conflicts between library or Python versions.

If you haven't worked with a virtual environment before, have a look at my dedicated Python virtual environments tutorial using Virtualenv and Poetry to get familiar.

📌Note: this is not a strict requirement for this blog post.

Install libraries:

pip install playwright parsel

You also need to install chromium for playwright to work and operate the browser:

playwright install chromium

After that, if you're on Linux, you might need to install additional system dependencies (playwright will prompt you in the terminal if something is missing):

sudo apt-get install -y libnss3 libnspr4 libatk1.0-0 libatk-bridge2.0-0 libcups2 libdrm2 libxkbcommon0 libxcomposite1 libxdamage1 libxfixes3 libxrandr2 libgbm1 libatspi2.0-0 libwayland-client0

Reduce the chance of being blocked

There's a chance that a request might be blocked. Have a look at how to reduce the chance of being blocked while web scraping: there are eleven methods to bypass blocks from most websites, and some of them will be covered in this blog post.

Full Code

import time, json, re
from parsel import Selector
from playwright.sync_api import sync_playwright


def run(playwright):
    page = playwright.chromium.launch(headless=True).new_page()
    page.goto("https://play.google.com/store/apps/details?id=com.collectorz.javamobile.android.books&hl=en_GB&gl=US")

    user_comments = []

    # if "See all reviews" button present
    if page.query_selector('.Jwxk6d .u4ICaf button'):
        print("the button is present.")

        print("clicking on the button.")
        page.query_selector('.Jwxk6d .u4ICaf button').click(force=True)

        print("waiting a few sec to load comments.")
        time.sleep(4)

        last_height = page.evaluate('() => document.querySelector(".fysCi").scrollTop')  # 2200

        while True:
            print("scrolling..")
            page.keyboard.press("End")
            time.sleep(3)

            new_height = page.evaluate('() => document.querySelector(".fysCi").scrollTop')

            if new_height == last_height:
                break
            else:
                last_height = new_height

    selector = Selector(text=page.content())
    page.close()

    print("done scrolling. Extracting comments...")
    for index, comment in enumerate(selector.css(".RHo1pe"), start=1):

        comment_likes = comment.css(".AJTPZc::text").get()   

        user_comments.append({
            "position": index,
            "user_name": comment.css(".X5PpBb::text").get(),
            "user_avatar": comment.css(".gSGphe img::attr(srcset)").get().replace(" 2x", ""),
            "user_comment": comment.css(".h3YV2d::text").get(),
            "comment_likes": comment_likes.split("people")[0].strip() if comment_likes else None,
            "app_rating": re.search(r"\d+", comment.css(".iXRFPc::attr(aria-label)").get()).group(),
            "comment_date": comment.css(".bp9Aid::text").get(),
            "developer_comment": {
                "dev_title": comment.css(".I6j64d::text").get(),
                "dev_comment": comment.css(".ras4vb div::text").get(),
                "dev_comment_date": comment.css(".I9Jtec::text").get()
            }
        })

    print(json.dumps(user_comments, indent=2, ensure_ascii=False))


with sync_playwright() as playwright:
    run(playwright)

Code Explanation

Import libraries:

import time, json, re
from parsel import Selector
from playwright.sync_api import sync_playwright

  • time to set sleep() intervals between each scroll.
  • json just for pretty printing.
  • re to extract the rating number with a regular expression.
  • Selector from parsel to parse the HTML with CSS selectors.
  • sync_playwright for the synchronous API. playwright has an asynchronous API as well, built on the asyncio module.

Declare a function:

def run(playwright):
    # further code..

Initialize playwright, launch() a chromium browser, open a new_page(), and goto() a given URL:

page = playwright.chromium.launch(headless=True).new_page()
page.goto("https://play.google.com/store/apps/details?id=com.collectorz.javamobile.android.books&hl=en_GB&gl=US")

user_comments = [] # temporary list for all extracted data

Next, we need to check if the button responsible for showing all reviews is present and click on it if present:

if page.query_selector('.Jwxk6d .u4ICaf button'):
    print("the button is present.")

    print("clicking on the button.")
    page.query_selector('.Jwxk6d .u4ICaf button').click(force=True)

    print("waiting a few sec to load comments.")
    time.sleep(4)

  • query_selector is a function that accepts a CSS selector and returns the first matching element, or None if nothing matches.
  • click clicks on the button, and force=True bypasses the actionability checks so the click happens immediately.

Scroll to the bottom of the comments window:

last_height = page.evaluate('() => document.querySelector(".fysCi").scrollTop')  # 2200

while True:
    print("scrolling..")
    page.keyboard.press("End")
    time.sleep(3)

    new_height = page.evaluate('() => document.querySelector(".fysCi").scrollTop')

    if new_height == last_height:
        break
    else:
        last_height = new_height

  • page.evaluate() runs JavaScript code in the browser context; here it reads the scrollTop of the element matching the .fysCi selector. scrollTop is the number of pixels that an element's content has been scrolled vertically.
  • page.keyboard.press("End") presses the End key to scroll to the bottom of the comments loaded so far.
  • time.sleep(3) pauses code execution for 3 seconds so that more comments can load.
  • Then it measures new_height after the scroll by running the same JavaScript code.
  • Finally, it checks if new_height == last_height and, if so, exits the while loop with break.
  • Otherwise, it sets last_height to new_height and runs the next iteration (scroll).
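
The loop above stops only when the scroll position stops changing. As a hedged variation (not part of the original script), the same logic can be wrapped in a helper with an upper bound on the number of scrolls, so a page that keeps loading forever can't hang the scraper. The function name scroll_until_stable and its parameters are hypothetical; the two callables stand in for the page.evaluate() and page.keyboard.press("End") calls.

```python
import time
from typing import Callable


def scroll_until_stable(
    get_position: Callable[[], float],
    scroll: Callable[[], None],
    pause: float = 3.0,
    max_scrolls: int = 200,
) -> int:
    """Scroll until the measured position stops changing.

    Returns the number of scrolls performed (at most max_scrolls).
    """
    last_position = get_position()
    for scroll_count in range(1, max_scrolls + 1):
        scroll()
        time.sleep(pause)  # give new comments time to load
        new_position = get_position()
        if new_position == last_position:
            return scroll_count  # nothing new was loaded
        last_position = new_position
    return max_scrolls


# Dry run with a fake "page": the position grows three times, then stops.
fake_positions = iter([0, 100, 200, 300, 300])
count = scroll_until_stable(lambda: next(fake_positions), lambda: None, pause=0)
print(count)  # prints 4
```

With playwright, get_position would be a lambda that calls page.evaluate() with the same scrollTop snippet, and scroll would be a lambda that calls page.keyboard.press("End").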

After that, pass the scrolled HTML content to parsel and close the page:

selector = Selector(text=page.content())
page.close()

Iterate over all results after the while loop is done:

for index, comment in enumerate(selector.css(".RHo1pe"), start=1):

    comment_likes = comment.css(".AJTPZc::text").get()   

    user_comments.append({
        "position": index,
        "user_name": comment.css(".X5PpBb::text").get(),
        "user_avatar": comment.css(".gSGphe img::attr(srcset)").get().replace(" 2x", ""),
        "user_comment": comment.css(".h3YV2d::text").get(),
        "comment_likes": comment_likes.split("people")[0].strip() if comment_likes else None,
        "app_rating": re.search(r"\d+", comment.css(".iXRFPc::attr(aria-label)").get()).group(),
        "comment_date": comment.css(".bp9Aid::text").get(),
        "developer_comment": {
            "dev_title": comment.css(".I6j64d::text").get(),
            "dev_comment": comment.css(".ras4vb div::text").get(),
            "dev_comment_date": comment.css(".I9Jtec::text").get()
        }
    })
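
Note that .get() returns None when a selector matches nothing, so chained calls such as .get().replace(...) or re.search(...).group() raise an exception if Google changes the markup or a field is missing. Below is a minimal sketch of None-safe helpers for the three post-processed fields. The function names and the sample strings are hypothetical; the real label texts on Google Play may differ.

```python
import re
from typing import Optional


def parse_likes(text: Optional[str]) -> Optional[str]:
    """Return the part of the likes label before 'people', or None."""
    if not text:
        return None
    return text.split("people")[0].strip()


def parse_rating(aria_label: Optional[str]) -> Optional[str]:
    """Return the first number found in the rating label, or None."""
    match = re.search(r"\d+", aria_label or "")
    return match.group() if match else None


def parse_avatar(srcset: Optional[str]) -> Optional[str]:
    """Strip the ' 2x' density descriptor from a srcset value, or return None."""
    return srcset.replace(" 2x", "") if srcset else None


# Hypothetical sample values, only to show the behavior.
print(parse_likes("20 people found this review helpful"))
print(parse_rating("Rated 5 stars out of five stars"))
print(parse_avatar("https://example.com/avatar.png 2x"))
print(parse_rating(None))
```

Inside the loop, these would wrap the existing comment.css(...).get() calls, for example parse_rating(comment.css(".iXRFPc::attr(aria-label)").get()).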

Print the data:

print(json.dumps(user_comments, indent=2, ensure_ascii=False))
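
To see why ensure_ascii=False is passed, here is a small sketch with made-up review data (the names and text are hypothetical). By default, json.dumps() escapes non-ASCII characters, which makes reviews in other languages hard to read in the terminal.

```python
import json

# Made-up review with non-ASCII characters.
review = {"user_name": "Zoë", "user_comment": "Très bien"}

escaped = json.dumps(review)                       # non-ASCII becomes \uXXXX escapes
readable = json.dumps(review, ensure_ascii=False)  # characters are kept as-is

print(escaped)
print(readable)
```

Both strings decode back to the same Python dict, so the choice only affects readability of the printed output.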

Run your code using a context manager:

with sync_playwright() as playwright:
    run(playwright)

Output

[
  {
    "position": 1,
    "user_name": "JazzTripp",
    "user_avatar": "https://play-lh.googleusercontent.com/a-/ACNPEu8THUUDL3yzcd0bHSDRR4OegOWLmfbFi70On0HbRg",
    "user_comment": "This app takes a bit if getting used to at first, but the catalogue is extensive, and most bar codes and isbn numbers can be used to autofill a good chuck of a collection. I personally use this app for manga, and while its only correct about 70% of the time, its still easy to update and change as you see fit. The 'add to core' option makes me feel like im actually helping out the app, so i add data whenever i can. Keep up the good work guys!",
    "comment_likes": "20",
    "app_rating": "5",
    "comment_date": "May 06, 2022",
    "developer_comment": null
  }, ... other results
  {
    "position": 875,
    "user_name": "Originalbigguy",
    "user_avatar": "https://play-lh.googleusercontent.com/a/ALm5wu3dYTOHvlG8SUqgyTbRnjv9I49JtxgySY-RwTJU=s64-rw-mo",
    "user_comment": "Not free",
    "comment_likes": null,
    "app_rating": "1",
    "comment_date": "9 April 2021",
    "developer_comment": {
      "dev_title": "Collectorz.com",
      "dev_comment": "The app is never advertised as free anywhere. The app information clearly states this is a paid subscription app.\n",
      "dev_comment_date": "10 April 2021"
    }
  }
]

Using Google Play Product Reviews API

As we support extracting reviews data from Google Play apps, this section compares the DIY solution with our solution.

The biggest difference is that you don't need to use browser automation to scrape results, create the parser from scratch, and maintain it.

Keep in mind that there's also a chance that the request might be blocked by Google at some point (or met with a CAPTCHA); we handle that on our backend.

Install google-search-results from PyPI:

pip install google-search-results

Extract all reviews, page by page:

import json
from urllib.parse import parse_qsl, urlsplit

from serpapi import GoogleSearch

params = {
  "api_key": "...",                                        # your serpapi api key
  "engine": "google_play_product",                         # serpapi parsing engine
  "store": "apps",                                         # app results
  "gl": "us",                                              # country of the search
  "hl": "en",                                              # language of the search
  "product_id": "com.collectorz.javamobile.android.books"  # app id
}

search = GoogleSearch(params)                              # where data extraction happens on the backend

reviews = []

while True:
    results = search.get_dict()                            # JSON -> Python dict

    for review in results["reviews"]:
        reviews.append({
            "title": review.get("title"),
            "avatar": review.get("avatar"),
            "rating": review.get("rating"),
            "likes": review.get("likes"),
            "date": review.get("date"),
            "snippet": review.get("snippet"),
            "response": review.get("response")
        })

    # pagination
    if "next" in results.get("serpapi_pagination", {}):
        search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination", {}).get("next")).query)))
    else:
        break

print(json.dumps(reviews, indent=2, ensure_ascii=False))
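
The pagination line is dense, so here is a standalone sketch of what it does, using only the standard library. The next_url value and the next_page_token parameter name are hypothetical placeholders; the real "next" link comes from the serpapi_pagination field of the response.

```python
from urllib.parse import parse_qsl, urlsplit

# Hypothetical "next" link, shaped like a URL with query parameters.
next_url = (
    "https://serpapi.com/search.json"
    "?engine=google_play_product"
    "&product_id=com.example.app"
    "&next_page_token=abc123"
)

# Parameters of the current request (api_key elided on purpose).
params = {
    "api_key": "...",
    "engine": "google_play_product",
    "product_id": "com.example.app",
}

# urlsplit() isolates the query string, parse_qsl() turns it into
# (key, value) pairs, and dict() makes them usable with update().
next_params = dict(parse_qsl(urlsplit(next_url).query))
params.update(next_params)

print(next_params)
print(params)
```

Because update() only overwrites keys present in the link, parameters that are not part of it (such as api_key) are kept for the next request.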

Output:

[
  {
    "title": "JazzTripp",
    "avatar": "https://play-lh.googleusercontent.com/a-/ACNPEu8THUUDL3yzcd0bHSDRR4OegOWLmfbFi70On0HbRg",
    "rating": 5.0,
    "likes": 20,
    "date": "May 06, 2022",
    "snippet": "This app takes a bit if getting used to at first, but the catalogue is extensive, and most bar codes and isbn numbers can be used to autofill a good chuck of a collection. I personally use this app for manga, and while its only correct about 70% of the time, its still easy to update and change as you see fit. The 'add to core' option makes me feel like im actually helping out the app, so i add data whenever i can. Keep up the good work guys!",
    "response": null
  }, ... other reviews
  {
    "title": "Originalbigguy",
    "avatar": "https://play-lh.googleusercontent.com/a/ALm5wu3dYTOHvlG8SUqgyTbRnjv9I49JtxgySY-RwTJU=mo",
    "rating": 1.0,
    "likes": 0,
    "date": "April 09, 2021",
    "snippet": "Not free",
    "response": {
      "title": "Collectorz.com",
      "snippet": "The app is never advertised as free anywhere. The app information clearly states this is a paid subscription app.",
      "date": "April 10, 2021"
    }
  }
]

Join us on Reddit | Twitter | YouTube
