Kyle Knapp

Mastering Web Automation: Building a Selenium Crawler Bot

Before we get started, let's talk about robots. Not the ones from movies like WALL-E or Terminator; I'm talking about the ones that live on the internet, known as bots. Here is a quick reference guide before we dive in:

bot vs robot

According to cybersecurity firm Imperva, an impressive 47.4 percent of all internet traffic in 2022 originated from bots. These digital entities assume a variety of roles, each pivotal to the seamless functioning of the online realm. From enhancing user experiences to streamlining tasks and fostering communication, bots have become indispensable. Search engine bots meticulously traverse web pages, ensuring content is efficiently indexed for delivering relevant search results. Social media bots, on the other hand, deftly automate posts, likes, and interactions, thereby optimizing content visibility. In real time, chatbots engage users by providing immediate support or guiding them through websites. It's worth noting, however, that malicious bots lurk in the shadows, engaging in activities such as spamming, phishing, or launching cyber attacks. Beyond this, web scraping bots adeptly extract data from websites for in-depth analysis, and it's this type of bot we'll discuss in deeper detail through a project I worked on earlier this month.

The Goal

Before I dive into the project itself, let's first discuss what I was trying to accomplish. To keep things simple, here's the scenario: around Christmas there was a gift I wanted to buy for a family member. The problem was that the item was only available online, in limited quantity, and would be released at a random time during the day. A recipe for disaster for the average person; luckily for me, I had a tool on my side that gave me a shot at getting this high-ticket item. That tool is called Selenium.

What's Selenium?

So what is Selenium? Selenium is an open-source framework primarily used for automating web applications. It provides a suite of tools for web browser automation, allowing developers to interact with web pages programmatically. Selenium supports various programming languages, including Python, Java, C#, and more, making it versatile for different development environments. It enables the simulation of user actions such as clicking buttons, filling forms, and navigating through web pages, facilitating the testing of web applications. Selenium WebDriver, a key component, directly communicates with browsers, enabling seamless automation and testing across different browsers and platforms. Overall, Selenium is a powerful tool for automating repetitive tasks, conducting testing, and ensuring the robustness and reliability of web applications.
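To make that concrete, here is a minimal sketch of what driving a browser with Selenium looks like in Python (assuming Chrome and a matching chromedriver are available on your PATH; the URL is just a placeholder):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# start a Chrome session and load a page
driver = webdriver.Chrome()
driver.get('https://example.com')

# wait up to 10 seconds for an element to appear, then read its text
heading = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, 'h1'))
)
print(heading.text)

driver.quit()

The same pattern of waiting for an element and then interacting with it is what the crawler below is built on.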

The Project

Now that we have the goal and the tool understood, let's talk about implementation. To kickstart this project I needed to choose a programming language and a solid IDE to pair with Selenium. I ended up choosing Python as my language and VSCode as my dev environment. With the stage set, I began coding away. I kept the code relatively simple; it contains just two core functions, described below.
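If you want to follow along, the third-party packages the script imports can be installed with pip (Python 3 on Windows is assumed, since win10toast is Windows-only):

pip install selenium tqdm win10toast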

main()

To summarize the code in main(): it first reads a file named pings.txt to get the list of URLs to scrape, then reads a pickled file named previous.pickle to get the previous HTML states of the tracked websites. It then iterates over the URLs, retrieves the current HTML for each, compares it with the previous state, and records any changes. Finally, it writes the change report to a file named updates.log. If changes are detected and the script is running on a Windows system, it shows a toast notification using the win10toast library.
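Based on the write format in main(), an entry in updates.log ends up looking roughly like this (the URL, timestamp, and content here are just placeholders):

2023-12-18 09:41:03.123456, URL https://example.com/product
    Before:    Out of stock
    After:     Add to cart
----------------------------------------------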

parse_pings()
parse_pings() does just two things, both very important. It first parses ping_file, which holds everything the script needs to know about each page to track: the URL, any click actions to perform, the target element to read, and trigger strings to look for. It then packs this data into a list of tuples for main() to iterate over.
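For reference, each non-comment line in pings.txt is pipe-separated into url | click actions | target element | trigger strings, so a file might look something like this (the URLs and XPaths are placeholders; #EBAY_KL refers to the click macro defined in the script):

# url | clicks (XPaths, #MACRO, or empty) | target (XPath, #MACRO, or empty for //body) | triggers (separated by ;)
https://example.com/product | | //div[@id="availability"] | in stock; add to cart
https://example.com/other | #EBAY_KL | | preorder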

The Code

import argparse
import datetime
import pickle
import os
import sys
from pathlib import Path
import platform
from collections import defaultdict


from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import WebDriverException
from tqdm import tqdm
from win10toast import ToastNotifier


# pre-defined sets of clicks needed before the target element is reachable,
# keyed by a macro name that can be referenced from pings.txt
CLICK_MACROS = {
    '#EBAY_KL': ['//*[@id="edit-birthmonth"]']
}

# same idea for target elements; empty by default
TARGET_MACROS = {
}



class Spider:
    def __init__(self, headless=True):
        chrome_options = Options()
        if headless:
            chrome_options.add_argument("--disable-gpu")
            chrome_options.add_argument("--headless")
            chrome_options.add_argument("--silent")
            chrome_options.add_argument("--log-level=3")

        try:
            driver = webdriver.Chrome(options=chrome_options)
        except WebDriverException:
            decision = input('add bundled chromedriver to path? y/[n] ')
            if decision != 'y':
                sys.exit(1)
            # put the folder containing the bundled chromedriver.exe on PATH
            # (not sys.path) and retry, since selenium locates the driver via PATH
            dir_path = Path(os.path.dirname(os.path.realpath(__file__)))
            os.environ['PATH'] = str(dir_path) + os.pathsep + os.environ.get('PATH', '')
            print('Warning: supplied chromedriver is intended for Windows and Chrome version 83.')
            driver = webdriver.Chrome(options=chrome_options)

        self.driver = driver

    def get_current_html(self, url, clicks, target, html_or_content='content'):
        print(url)
        self.driver.get(url)

        # perform actions
        for click in clicks:
            element = WebDriverWait(self.driver, 10).until(
                EC.presence_of_element_located((By.XPATH, click))
            )
            element.click()

        # get html
        element = WebDriverWait(self.driver, 10).until(
            EC.presence_of_element_located((By.XPATH, target))
        )

        if html_or_content == 'html':
            # more accurate, but problematic due to js content etc.
            html = element.get_attribute('innerHTML')
        else:
            # more reliable:
            html = element.text

        return html

    def __del__(self):
        # quit() closes the browser windows and shuts down the chromedriver process
        if hasattr(self, 'driver'):
            self.driver.quit()


def parse_pings(ping_file):

    pings = []

    with open(ping_file) as inf:
        for line in inf.readlines():
            if line.strip() == '' or line.startswith('#'):
                continue
            url, actions, target, triggers = [x.strip() for x in line.split('|')]
            triggers = triggers.split(';')
            triggers = [x.strip() for x in triggers]

            if actions.startswith('#'):
                print('Reading click macros')
                actions = CLICK_MACROS[actions]
            elif actions == '':
                actions = []
            else:
                # split actions
                actions = actions.split(';')
            if target.startswith('#'):
                target = TARGET_MACROS[target]
            elif target == '':
                target = '//body'

            ping = (url, actions, target, triggers)
            pings.append(ping)

    return pings


def main():
    # parse args
    parser = argparse.ArgumentParser(description='Crawl a number of sites and compare them to a previous known state.')
    parser.add_argument('--show_driver', action='store_true', default=False, help='Disables headless mode for webdriver.')
    parser.add_argument('--no_notify', action='store_true', default=False, help='Disable toast notifications.')
    parser.add_argument('--html_or_content', choices=['html', 'content'], default='content', help='Whether to compare html or content.')
    args = parser.parse_args()

    # read pings
    pings = parse_pings('./pings.txt')

    previous_htmls = defaultdict(lambda: {
        'time': None,
        'html': ''
    })

    # read previous htmls
    if os.path.isfile('./previous.pickle'):
        with open('./previous.pickle', 'rb') as inf:
            previous_htmls.update(pickle.load(inf))

    # store new htmls separately in order to discard no longer tracked sites
    new_previous_htmls = defaultdict(lambda: {
        'time': None,
        'html': ''
    })

    spider = Spider(headless=(not args.show_driver))

    stuff_changed = False
    changes = []
    for url, clicks, target, triggers in tqdm(pings):
        try:
            current_content = spider.get_current_html(url, clicks, target, args.html_or_content)
        except Exception:
            tqdm.write(f'Retrieving html failed for: {url}')
            continue

        # do nothing if none of the given triggers are fired
        if len(triggers) > 0 and not any([x in current_content for x in triggers]):
            continue

        # store "new previous"
        new_previous_htmls[url]['html'] = current_content
        new_previous_htmls[url]['time'] = datetime.datetime.now()

        if previous_htmls[url]['html'] is not None \
            and current_content is not None \
            and current_content != previous_htmls[url]['html']:
            tqdm.write(f'Changed: {url}')

            stuff_changed = True

            # save differences
            changes.append({
                'url': url,
                'previous_time': previous_htmls[url]['time'],
                'previous_html': ' '.join(previous_htmls[url]['html'].split()),
                'current_time': datetime.datetime.now(),
                'current_html': ' '.join(current_content.split())
            })

    # write change report
    with open('./updates.log', 'a+', encoding='utf-8') as out:
        for change in changes:
            out.write(f'{change["current_time"]}, URL {change["url"]}\n')
            out.write(f'\tBefore:\t{change["previous_html"]}\n')
            out.write(f'\tAfter:\t {change["current_html"]}\n')
            out.write('----------------------------------------------\n')

    if not args.no_notify and stuff_changed and platform.system() == 'Windows':
        # create a ToastNotifier object to show a Windows toast notification
        n = ToastNotifier()
        n.show_toast("ProjectPing", f"{len(changes)} change(s) detected. Check log for details.", duration=5,
                     icon_path="./assets/spider.ico")

    # write current htmls as new previous
    with open('./previous.pickle', 'wb') as out:
        pickle.dump(dict(new_previous_htmls), out)


if __name__ == '__main__':
    # execute only if run as the entry point into the program
    main()
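With pings.txt in place, the tracker can be run from the command line. Assuming the script is saved as projectping.py (the filename here is just a guess), a run with a visible browser window and no toast notifications would look like:

python projectping.py --show_driver --no_notify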



The Results

The script efficiently parsed the file of URLs, retrieved previous HTML states, and dynamically compared changes, recording a detailed change report in the updates.log file. When the page state changed, the script fired a toast notification, which allowed me to complete the transaction successfully. This project not only showcased the efficacy and reliability of Selenium in automating repetitive tasks but also highlighted its versatility beyond testing, serving as a valuable tool in the dynamic landscape of online commerce.
