Artur Chukhrai

Scrape Google Daily Search Trends with Python

What will be scraped

[Image: what will be scraped]

📌Note: For now, we don't have an API that supports extracting data from Google Realtime Search Trends.

This blog post shows how you can do it yourself in the meantime, while we work on releasing a proper API. We'll post an update on our Twitter once this API is released.

Full Code

If you don't need an explanation, have a look at the full code example in the online IDE.

import time, json
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from parsel import Selector


def scroll_page(url):
    service = Service(ChromeDriverManager().install())

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    options.add_argument("--lang=en")
    options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36")
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")

    driver = webdriver.Chrome(service=service, options=options)
    driver.get(url)

    WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.TAG_NAME, 'body')))

    flag = True

    while flag:
        try:
            search_input = driver.find_element(By.CSS_SELECTOR, 'div[class*="feed-load-more-button"]')
            driver.execute_script("arguments[0].click();", search_input)
            time.sleep(2)
        except Exception:
            flag = False

    selector = Selector(driver.page_source)
    driver.quit()

    return selector


def scrape_daily_search(selector):  
    daily_search_trends = {}

    for date in selector.css('.feed-list-wrapper'):
        date_published = date.css('.content-header-title::text').get()
        daily_search_trends[date_published] = []

        for item in date.css('.feed-item-header'):
            index = item.css('.index::text').get().strip()
            title = item.css('.title span a::text').get().strip()
            title_link = f"https://trends.google.com{item.css('.title span a::attr(href)').get()}"
            subtitle = item.css('.summary-text a::text').get()
            subtitle_link = item.css('.summary-text a::attr(href)').get()
            source = item.css('.source-and-time span::text').get().strip()
            time_published = item.css('.source-and-time span+ span::text').get().strip()
            searches = item.css('.subtitles-overlap div::text').get().strip()
            image_source = item.css('.image-text::text').get()
            image_source_link = item.css('.image-link-wrapper a::attr(href)').get()
            thumbnail = item.css('.feed-item-image-wrapper img::attr(src)').get()

            daily_search_trends[date_published].append({
                'index': index,
                'title': title,
                'title_link': title_link,
                'subtitle': subtitle,
                'subtitle_link': subtitle_link,
                'source': source,
                'time_published': time_published,
                'searches': searches,
                'image_source': image_source,
                'image_source_link': image_source_link,
                'thumbnail': thumbnail,
            })

    print(json.dumps(daily_search_trends, indent=2, ensure_ascii=False))


def main():
    GEO = "US"
    URL = f"https://trends.google.com/trends/trendingsearches/daily?geo={GEO}"
    result = scroll_page(URL)
    scrape_daily_search(result)


if __name__ == "__main__":
    main()

Preparation

Install libraries:

pip install parsel selenium webdriver_manager

Basic knowledge of scraping with CSS selectors

CSS selectors declare which part of the markup a style applies to, which makes it possible to extract data from matching tags and attributes.

If you haven't scraped with CSS selectors, there's a dedicated blog post of mine about how to use CSS selectors when web-scraping that covers what they are, their pros and cons, and why they matter from a web-scraping perspective.

Reduce the chance of being blocked

Make sure you pass a user-agent in the request headers so that the request looks like a "real" user visit. The default requests user-agent is python-requests, so websites can tell that the request most likely comes from a script. Check what's your user-agent.
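As a minimal sketch of the idea, the snippet below attaches a browser-like user-agent to a request object. It uses the standard library's urllib instead of requests so that it runs without extra installs, and build_request is a hypothetical helper name, not something from the original code. No network call is made.

```python
from urllib.request import Request

# Same user-agent string as the one passed to Chrome in this post.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
)


def build_request(url, user_agent=USER_AGENT):
    """Return a Request object that carries a browser-like User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


request = build_request("https://trends.google.com/trends/trendingsearches/daily?geo=US")
# The object is only prepared here; nothing is sent over the network.
# urllib stores header names capitalized, hence "User-agent".
print(request.get_header("User-agent"))
```

With Selenium, the same string is passed through the user-agent Chrome argument, as shown in the code of this post.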

There's a blog post on how to reduce the chance of being blocked while web scraping that can get you familiar with basic and more advanced approaches.

Code Explanation

Import libraries:

import time, json
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from parsel import Selector

| Library | Purpose |
|---|---|
| time | to work with time in Python. |
| json | to convert extracted data to a JSON object. |
| webdriver | to drive a browser natively, as a user would, either locally or on a remote machine using the Selenium server. |
| Service | to manage the starting and stopping of the ChromeDriver. |
| By | a set of supported locator strategies (By.ID, By.TAG_NAME, By.XPATH, etc.). |
| WebDriverWait | to wait only as long as required. |
| expected_conditions | a set of predefined conditions to use with WebDriverWait. |
| Selector | an XML/HTML parser with full XPath and CSS selector support. |

Top-level code environment

This code uses the generally accepted rule of using the __name__ == "__main__" construct:

def main():
    GEO = "US"
    URL = f"https://trends.google.com/trends/trendingsearches/daily?geo={GEO}"
    result = scroll_page(URL)
    scrape_daily_search(result)


if __name__ == "__main__":
    main()

The code under this check runs only when the file is executed directly. If the file is imported into another module, the condition is false and main() is not called.

You can watch the video Python Tutorial: if __name__ == '__main__' for more details.

A small description of the main function:

[Image: daily search]
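The URL that main() builds from the GEO value can also be assembled with urlencode, which keeps the query string correctly escaped. This is only a sketch: build_daily_trends_url is a hypothetical helper name, and whether the page accepts country codes other than "US" (the only one used in this post) is an assumption.

```python
from urllib.parse import urlencode

BASE_URL = "https://trends.google.com/trends/trendingsearches/daily"


def build_daily_trends_url(geo="US"):
    """Build the daily trends URL for a country code (hypothetical helper)."""
    return f"{BASE_URL}?{urlencode({'geo': geo})}"


# "US" is the value used in this post; other codes are an assumption.
print(build_daily_trends_url("US"))
```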

Scroll page

The function takes the URL and returns a full HTML structure.

First, let's understand how pagination works on the daily search trends page. To download more information, you must click on the LOAD MORE button below:

[Image: the LOAD MORE button]

📌Note: To get all the data, you need to press the button until the data runs out.

In this case, the selenium library is used, which lets you simulate user actions in the browser. For selenium to work, you need ChromeDriver, which can be downloaded manually or from code. Here the second method is used: ChromeDriverManager downloads the driver binary under the hood, and Service controls the start and stop of ChromeDriver:

service = Service(ChromeDriverManager().install())

You should also add options so that the browser works correctly:

options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_argument('--lang=en')
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36")

| Chrome options | Explanation |
|---|---|
| --headless | to run Chrome in headless mode. |
| --lang=en | to set the browser language to English. |
| user-agent | to act as a "real" user request from the browser by passing it to request headers. Check what's your user-agent. |

Now we can start webdriver and pass the url to the get() method.

driver = webdriver.Chrome(service=service, options=options)
driver.get(url)

It is often hard to predict how long a page will take to load, since it depends on the internet speed, the power of the computer, and other factors. The method described below is better than a fixed delay in seconds because the wait ends as soon as the condition is met, here when the body element becomes visible:

WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.TAG_NAME, 'body')))

📌Note: The timeout is given in seconds, so this allows up to 10 seconds for the page to load. If it loads earlier, the wait ends right away.
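To make the difference from a fixed delay concrete, here is a rough pure-Python sketch of the polling idea behind an explicit wait. It is an illustration only, not Selenium's implementation; wait_until and page_is_ready are hypothetical names, and the fake "page" simply becomes ready on the third check.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it is truthy or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            # Return immediately instead of sleeping for the full timeout.
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for a page that becomes ready on the third check.
checks = {"count": 0}


def page_is_ready():
    checks["count"] += 1
    return checks["count"] >= 3


wait_until(page_is_ready, timeout=2, poll_interval=0.01)
print(checks["count"])
```

The wait returns as soon as the condition holds, so a generous timeout costs nothing when the page loads quickly.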

When the page has loaded, the next step is to find the LOAD MORE button. Selenium can find an element by a CSS selector.

The button is clicked by passing JavaScript code to the execute_script() method. The sleep() method then gives the data time to load. These actions repeat for as long as the button exists and more data can be loaded.

flag = True

while flag:
    try:
        search_input = driver.find_element(By.CSS_SELECTOR, 'div[class*="feed-load-more-button"]')
        driver.execute_script("arguments[0].click();", search_input)
        time.sleep(2)
    except Exception:
        flag = False
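The loop above relies on an exception to stop. A possible variation, sketched below under stated assumptions, checks for the button explicitly and caps the number of clicks so the loop cannot run forever. The function click_until_gone, its parameters, and the fake page are hypothetical and are written against plain callables so the sketch runs without a browser.

```python
import time


def click_until_gone(find_button, click, pause=2.0, max_clicks=50):
    """Click the button until it is gone or max_clicks is reached.

    find_button: callable returning the button element, or None when absent.
    click: callable that clicks the given element.
    Returns the number of clicks performed.
    """
    clicks = 0
    while clicks < max_clicks:
        button = find_button()
        if button is None:
            break  # no button left, so there is no more data to load
        click(button)
        clicks += 1
        time.sleep(pause)  # give the newly requested data time to load
    return clicks


# Stand-in page with three more batches of data to load.
remaining = {"batches": 3}


def fake_find_button():
    return "button" if remaining["batches"] > 0 else None


def fake_click(_button):
    remaining["batches"] -= 1


print(click_until_gone(fake_find_button, fake_click, pause=0))
```

With Selenium, find_button could wrap driver.find_elements(By.CSS_SELECTOR, ...), which returns an empty list when nothing matches, and click could wrap driver.execute_script("arguments[0].click();", element).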

Now we will use the Selector from the Parsel library, passing it the HTML structure with all the data loaded through pagination.

Parsel extracts the data much faster at this point because there is no longer a network component or real-time interaction with the page and its elements. Only HTML parsing is involved.

After all the operations are done, stop the driver:

selector = Selector(driver.page_source)
# extracting code from HTML
driver.quit()

The function looks like this:

def scroll_page(url):
    service = Service(ChromeDriverManager().install())

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    options.add_argument("--lang=en")
    options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36")
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")

    driver = webdriver.Chrome(service=service, options=options)
    driver.get(url)

    WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.TAG_NAME, 'body')))

    flag = True

    while flag:
        try:
            search_input = driver.find_element(By.CSS_SELECTOR, 'div[class*="feed-load-more-button"]')
            driver.execute_script("arguments[0].click();", search_input)
            time.sleep(2)
        except Exception:
            flag = False

    selector = Selector(driver.page_source)
    driver.quit()

    return selector

In the gif below, I demonstrate how this function works:

[GIF: clicking the LOAD MORE button on the daily search trends page]

Scrape daily search

The scrape_daily_search function takes the full HTML structure and prints all results in JSON format.

The data is extracted so that each item is grouped under its publication date.

First of all, you need to find the container that holds the data for a specific publication date. We iterate over each date in a loop using the .feed-list-wrapper container selector. Each publication date has its own number of items, which are also iterated in a loop using the .feed-item-header item selector.

The complete function to scrape all data would look like this:

def scrape_daily_search(selector):  
    daily_search_trends = {}

    for date in selector.css('.feed-list-wrapper'):
        date_published = date.css('.content-header-title::text').get()
        daily_search_trends[date_published] = []

        for item in date.css('.feed-item-header'):
            index = item.css('.index::text').get().strip()
            title = item.css('.title span a::text').get().strip()
            title_link = f"https://trends.google.com{item.css('.title span a::attr(href)').get()}"
            subtitle = item.css('.summary-text a::text').get()
            subtitle_link = item.css('.summary-text a::attr(href)').get()
            source = item.css('.source-and-time span::text').get().strip()
            time_published = item.css('.source-and-time span+ span::text').get().strip()
            searches = item.css('.subtitles-overlap div::text').get().strip()
            image_source = item.css('.image-text::text').get()
            image_source_link = item.css('.image-link-wrapper a::attr(href)').get()
            thumbnail = item.css('.feed-item-image-wrapper img::attr(src)').get()

            daily_search_trends[date_published].append({
                'index': index,
                'title': title,
                'title_link': title_link,
                'subtitle': subtitle,
                'subtitle_link': subtitle_link,
                'source': source,
                'time_published': time_published,
                'searches': searches,
                'image_source': image_source,
                'image_source_link': image_source_link,
                'thumbnail': thumbnail,
            })

    print(json.dumps(daily_search_trends, indent=2, ensure_ascii=False))

| Code | Explanation |
|---|---|
| daily_search_trends | a temporary dictionary that maps each publication date to the list of items extracted for it. |
| css() | to access elements by the passed selector. |
| ::text or ::attr(<attribute>) | to extract textual or attribute data from the node. |
| get() | to actually extract the textual data. |
| strip() | to return a copy of the string with the leading and trailing whitespace removed. |
| daily_search_trends[date_published].append({}) | to append extracted data to the list for that date as a dictionary. |
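One caveat about the function above: get() returns None when a selector matches nothing, so a chained .get().strip() raises AttributeError on items that lack a field. A small helper, sketched here under the hypothetical name clean, keeps the extraction going in that case.

```python
def clean(value, default=None):
    """Strip surrounding whitespace from a string; return default for None."""
    if value is None:
        return default
    return value.strip()


# Values as get() might return them: padded text, or None when nothing matched.
print(clean("  Len Dawson \n"))
print(clean(None))
print(clean(None, default=""))
```

In the scraper this would be used as clean(item.css('.index::text').get()) instead of item.css('.index::text').get().strip().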

Output

{
  "Wednesday, August 24, 2022": [
    {
      "index": "1",
      "title": "Len Dawson",
      "title_link": "https://trends.google.com/trends/explore?q=Len+Dawson&date=now+7-d&geo=US",
      "subtitle": "Len Dawson, Kansas City Chiefs quarterback and broadcasting ...",
      "subtitle_link": "https://www.npr.org/2022/08/24/1117595982/hall-of-fame-kansas-city-chiefs-quarterback-len-dawson-died-kmbc-kc-mvp",
      "source": "NPR",
      "time_published": "6h ago",
      "searches": "100K+ searches",
      "image_source": "NPR",
      "image_source_link": "https://www.npr.org/2022/08/24/1117595982/hall-of-fame-kansas-city-chiefs-quarterback-len-dawson-died-kmbc-kc-mvp",
      "thumbnail": "https://t1.gstatic.com/images?q=tbn:ANd9GcThi6H-kzkAMrCuhaFZN06AKg24SZAeRr8Wy_tWw_oxJSSO3aqkaR9O3OAJFEmSqPR-lgz-sS__"
    },
    ... other results
    {
      "index": "9",
      "title": "DualSense Edge",
      "title_link": "https://trends.google.com/trends/explore?q=DualSense+Edge&date=now+7-d&geo=US",
      "subtitle": "Everything we know about Sony's modular DualSense Edge ...",
      "subtitle_link": "https://www.inverse.com/gaming/ps5-dualsense-edge-release-date-price-features",
      "source": "Inverse",
      "time_published": "2h ago",
      "searches": "20K+ searches",
      "image_source": "Inverse",
      "image_source_link": "https://www.inverse.com/gaming/ps5-dualsense-edge-release-date-price-features",
      "thumbnail": "https://t0.gstatic.com/images?q=tbn:ANd9GcQ_G2RoFCa44x7fGSpUVCsu5yDzWUq1OdCYIg9qnr5H1oVVWzp6byVRAd4D553fee90tukWKxBl"
    }
  ],
  ... other dates
  "Tuesday, July 26, 2022": [
    {
      "index": "1",
      "title": "Mega Millions",
      "title_link": "https://trends.google.com/trends/explore?q=Mega+Millions&date=today+1-m&geo=US",
      "subtitle": "Mega Millions Friday Jackpot Over a BILLION, What Are The Odds?",
      "subtitle_link": "https://www.focusdailynews.com/mega-millions-crosses-a-billion-what-are-the-odds/",
      "source": "Focusdailynews",
      "time_published": "4w ago",
      "searches": "5M+ searches",
      "image_source": "Focusdailynews",
      "image_source_link": "https://www.focusdailynews.com/mega-millions-crosses-a-billion-what-are-the-odds/",
      "thumbnail": "https://t3.gstatic.com/images?q=tbn:ANd9GcQsSZUkbfXa7ODuRdREyqlZzXU8-_uOnrRybGgPLsvimItLolXz2QCtVIqe6Z1VMITZ_aJJsCAg"
    },
    ... other results
    {
      "index": "16",
      "title": "Alex Jones",
      "title_link": "https://trends.google.com/trends/explore?q=Alex+Jones&date=today+1-m&geo=US",
      "subtitle": "Second day of Alex Jones trial ends in turmoil as lawyers square off",
      "subtitle_link": "https://www.statesman.com/story/news/local/2022/07/27/alex-jones-trial-sandy-hook-defamation-day-2-recap/65385054007/",
      "source": "Austin American-Statesman",
      "time_published": "3w ago",
      "searches": "20K+ searches",
      "image_source": "Austin American-Statesman",
      "image_source_link": "https://www.statesman.com/story/news/local/2022/07/27/alex-jones-trial-sandy-hook-defamation-day-2-recap/65385054007/",
      "thumbnail": "https://t2.gstatic.com/images?q=tbn:ANd9GcT-vdMBiHFcQiLklLHCMZujakKrhEPH7eLmnzlN70b1cn7rJk8MIfpw8UFG2xkEddC7_qs8Wj60"
    }
  ]
}
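The searches field in the output is a string such as "100K+ searches". If numbers are needed for sorting or filtering, it can be converted to a lower-bound integer. The sketch below is one way to do it; parse_searches is a hypothetical helper, and support for decimals and a "B" suffix is an assumption, since only the K and M forms appear in the output above.

```python
import re

# Only K and M appear in the sample output; B is included as an assumption.
MULTIPLIERS = {"": 1, "K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_searches(text):
    """Convert a string such as '100K+ searches' to its lower-bound integer."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)([KMB]?)\+?", text)
    if match is None:
        raise ValueError(f"unrecognized searches value: {text!r}")
    number, suffix = match.groups()
    return int(float(number) * MULTIPLIERS[suffix])


print(parse_searches("100K+ searches"))
print(parse_searches("5M+ searches"))
```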

