David Newberry

Polling for profit

I recently started a job where work appears on an online dashboard. I just want to know when a certain kind of job (i.e. one that pays a certain amount) becomes available.

This isn’t the kind of thing there’s an API for, so I decided to just make my script mimic the behavior of the browser. To send HTTP requests from a Python script, the requests module came recommended on StackOverflow, so that’s the way I went.

Firefox's developer tools (and, I assume, the other major browsers' as well) let you inspect HTTP requests and responses, including their headers, in the Network pane.

The copy menu offers a few different formats, all similar but none exactly what I needed. So I started from the cURL version and used it to construct the requests code.

[Screenshot: popup menu in Firefox's developer tools' Network pane for copying the request in different formats]

(Eventually I took out the “If-None-Match” header that cURL included, not because of any issues but just to simplify the code. I may have accidentally removed a couple of others as well; it still works fine.)

In retrospect I probably went further than I needed to, converting the Cookies header string into a Python dictionary when it’s only going to be converted back into string form to be sent over the network.

But I did write a neat little function to turn the string into a dictionary automatically, and learned about Python’s str.partition method along the way. Good times.
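For reference, partition splits a string on the first occurrence of the separator and returns a three-element tuple of (before, separator, after), which is handy here because cookie values can themselves contain an “=” (the cookie below is made up):

print("sessionid=abc=123".partition("="))
# ('sessionid', '=', 'abc=123') -- only the first "=" is treated as the split point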

import requests

cookies = "(cookie data copied from Firefox)"

### you could remove this ###
cookies = cookies.split("; ")

def l(c):
    # split a single "key=value" pair on the first "="
    p = c.partition("=")
    k = p[0]
    v = p[2]
    return (k, v)

cookies = dict(map(l, cookies))
### ^ this could be removed ^ ###

r = requests.get('https://somejob.site/projects',
                 headers={
                    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:128.0) Gecko/20100101 Firefox/128.0",
                    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/png,image/svg+xml,*/*;q=0.8",
                    "Accept-Language": "en-US,en;q=0.5",
                    "Upgrade-Insecure-Requests": "1",
                    "Sec-Fetch-Dest": "document",
                    "Sec-Fetch-Mode": "navigate",
                    "Sec-Fetch-Site": "same-origin",
                    "Priority": "u=0, i"
                 },
                 cookies=cookies)

(These headers are probably unnecessary, but I just wanted to pretend to be the browser.)
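In fact, something like this would probably still work on its own, since requests fills in reasonable defaults for everything else (an untested sketch, using the same placeholder URL):

r = requests.get('https://somejob.site/projects', cookies=cookies)
print(r.status_code)  # expect 200 if the cookies are still valid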

When working on this, I used IDLE at first. It's a terrible text editor, but it makes it easy to make little changes and then run the code. You can also run the script from the terminal using the python executable:

python "polling script.py"

Once the request returned the source code for the webpage, it was time to tackle the HTML. First I just looked through the source to find out where the data was and for hooks that the program could use to get to it.

I found that all the information about jobs was stored inside a “data-react-props” attribute. That attribute appeared more than once in the page, but the tag I wanted to get at also had a “data-react-class” attribute with the value “workers/WorkerProjectsTable”.

This was all very brittle (they could easily break my script by changing the layout of the HTML slightly), but it was a start. (As it happens it did break after working for some time. I fixed it, and it worked again; then the page went back to the first layout, so it’ll be easy to revert or fix again. That’s also why the complete script at the end checks slightly different attribute names than the ones described here.)

For the Python side of things, correctly parsing HTML is non-trivial. Rather than trying to get all the data I needed directly from the source code string, I opted to use an HTML parsing module to help.

HTMLParser fit the bill. Starting with an example, I stripped it down to just the code that is triggered every time a start tag is read in.

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        pass  # handle the tag and its attributes here

parser = MyHTMLParser()
parser.feed(r.text)
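To see what handle_starttag receives, here’s a tiny standalone demo (the tag is made up): attrs arrives as a list of (name, value) tuples, which is what the loop below iterates over.

from html.parser import HTMLParser

class DemoParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print(tag, attrs)

DemoParser().feed('<div data-react-class="workers/WorkerProjectsTable" data-react-props="{}">')
# div [('data-react-class', 'workers/WorkerProjectsTable'), ('data-react-props', '{}')]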

The code assumes that the data-react-class attribute comes before the data-react-props attribute; importantly, both are attributes of the same tag. A for-loop iterates over all of the tag’s attributes, and a Boolean variable is set to True when the right data-react-class value is found; once it is, the data-react-props attribute that follows can be read.

found_tag = False
for key, val in attrs:
    if key=="data-react-class" and val=="workers/WorkerProjectsTable":
        found_tag = True
    elif found_tag and key=="data-react-props":
        d = json.loads(val)

The value of the data-react-props attribute is in JSON format. I think I started working with this by iterating through it looking for the data I wanted, which led to some laughably convoluted code when I was looking back over it just now:

d = json.loads(val)
# oh no, did I really do this?
for k in d:
    if k == "dashboardMerchTargeting":
        v = d[k]

OK, let’s simplify away the loop and conditional and just access the data.

d = json.loads(val)
v = d["dashboardMerchTargeting"]

At this point, v is a Dictionary, and one of its keys is “projects.” Contained therein is a List of Dictionaries that each contain information about a project.
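To make that concrete, the relevant slice of the data looks roughly like this (the field names are the ones used below; the values are invented):

v = {
    "projects": [
        {"id": 101, "name": "Label product images", "pay": "$27/hr", "isCoding": False},
        {"id": 102, "name": "Review Python snippets", "pay": "$40/hr", "isCoding": True},
    ],
    # ...plus other keys I don't use
}
pl = v["projects"]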

The code iterates over the list, skipping over items whose pay is less than a threshold number.

for p in pl:
    if p["pay"] < "$25/hr": continue # skip jobs under $25 an hour
    print(p) # dangerous, apparently

(Yes, I am using string comparison for this. It’s fragile, but it works as far as it needs to.)
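Specifically, the comparison is lexicographic (character by character), so pay strings with different digit counts can sort the wrong way round:

print("$9/hr" < "$25/hr")    # False -- a $9/hr job would not be skipped
print("$100/hr" < "$25/hr")  # True  -- a $100/hr job would be skipped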

The items can be printed out directly, as I did at first. In the process I ran into a problem with IDLE freezing up, and it turned out to be because one of the project titles had an emoji in it.

My solution was to use encode with errors set to “ignore” and then decode back to a string. This uses the default encoding of utf-8.
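In isolation, the pattern looks like the sketch below. (Note that here I encode to "ascii" so the emoji is guaranteed to be dropped; the actual script sticks with the default utf-8.)

name = "Fix the landing page \U0001F680"  # a made-up title with an emoji in it
safe = name.encode("ascii", errors="ignore").decode()
print(safe)  # "Fix the landing page " -- the emoji is gone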

Since I wanted to change, omit, and add some fields, I chose to make a new object based on p (a Dict with project info). I could have almost as easily modified p and used it to store my information, but I did this instead.

{
    "name": p["name"].encode(errors='ignore').decode(),
    "coding?": str(p["isCoding"]),
    "pay": p["pay"],
    "added": datetime.datetime.now(),
}

So far I only use the name and added fields, but I included the other fields in case I decide I want them later.

So what do I actually do with the above object? Well, I put it in a Dictionary, using the project id as the key.

live_projects[p["id"]] = {
    # seen above
}

The live_projects Dictionary is defined early on as an empty Dictionary. The request and parsing code is moved into a function to make it easy to call repeatedly.

live_projects = {}

def do_it():

    # all that code

    t = Timer(60.0 * 5, do_it) # every 5 minutes
    t.start()

do_it()

I added the live_projects Dictionary so that I could keep track of what projects were available, and not keep announcing the same one each time it polled.

To keep track of which projects were added or removed, I used the Set class. Above the loop, an empty set is created.

current_projects = set()

Inside the loop that adds projects to the live_projects Dictionary, I added this line.

current_projects.add(p["id"])

This builds up a set containing the id of each project found during that poll that matches my criteria.

To compare this to the previous state (as contained in the live_projects Dictionary), at the top of the handle_starttag method I save the current set of keys as “lpks” (live_project keys).

lpks = set(live_projects.keys())

Now, after going through the current list of projects, I can compare the sets; just after polling, the live_projects Dictionary may be out of sync with what’s actually on the dashboard. To find projects that have been taken down, I take the difference between the lpks set (the keys of the live_projects Dictionary) and the current_projects set (built from polling).

Python allows you to take the difference of two sets (i.e. set a but without any elements from set b) using the minus operator. However, keep in mind that this is not regular subtraction.

not_yet_live = current_projects - lpks
still_live = current_projects - not_yet_live
no_longer_live = lpks - current_projects

(Algebraic substitution suggests that still_live should just be lpks, but that identity doesn’t hold for set difference.)
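A quick throwaway example shows why: a - (a - b) gives the elements of a that are also in b (the intersection), not b itself.

a = {1, 2, 3}
b = {3, 4}
print(a - b)        # {1, 2}
print(a - (a - b))  # {3} -- the intersection of a and b, not b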

Finally, this code prints out the new state if there have been any changes. It also takes care of removing elements from the live_projects Dictionary that are no longer available.

if len(not_yet_live) > 0 or len(no_longer_live) > 0:
    ctime = datetime.datetime.now()
    print()
    print(ctime)

    if len(no_longer_live) > 0:
        for k in no_longer_live:
            print("- (live " + str((ctime - live_projects[k]["added"])) + ") " + live_projects[k]["name"])
            del live_projects[k]

    if len(still_live) > 0:
        for k in still_live:
            print("  " + live_projects[k]["name"])

    if len(not_yet_live) > 0:
        for k in not_yet_live:
            print("+ " + live_projects[k]["name"])

That covers just about everything, except for a few lines of debugging code. I was playing around with using bit masks as flags for different kinds of debugging messages.
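The idea is that each kind of message gets its own bit, so categories can be switched on and off independently by OR-ing them into DEBUG and testing with a bitwise AND:

DEBUG_SETS = 2**0   # 0b01
DEBUG_ATTRS = 2**1  # 0b10

DEBUG = DEBUG_SETS | DEBUG_ATTRS  # enable both kinds of messages; 0 silences everything

if DEBUG & DEBUG_ATTRS:
    print("attribute-level debug output is on")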

Here is the complete code.

import requests
from html.parser import HTMLParser
import json
from threading import Timer
import datetime
import os

cookies = "your cookies"

cookies = cookies.split("; ")

#print(cookies)

def l(c):
    # split a single "key=value" pair on the first "="
    p = c.partition("=")
    k = p[0]
    v = p[2]
    return (k, v)

cookies = dict(map(l, cookies))

#print(cookies)

DEBUG_SETS = 2**0  # i.e. 00000001
DEBUG_ATTRS = 2**1 #      00000010

DEBUG = 0

live_projects = {}

def do_it():

    r = requests.get('https://fake-url.job.tech/projects',
                     headers={
                         "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:128.0) Gecko/20100101 Firefox/128.0",
                        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/png,image/svg+xml,*/*;q=0.8",
                        "Accept-Language": "en-US,en;q=0.5",
                        "Upgrade-Insecure-Requests": "1",
                        "Sec-Fetch-Dest": "document",
                        "Sec-Fetch-Mode": "navigate",
                        "Sec-Fetch-Site": "same-origin",
                        "Priority": "u=0, i"
                     },
                     cookies=cookies)

    #print(r.text)

    class MyHTMLParser(HTMLParser):
        def handle_starttag(self, tag, attrs):
            global live_projects

            lpks = set(live_projects.keys())

            if DEBUG & DEBUG_ATTRS: print("Encountered a start tag:", tag)

            found_tag = False
            for key, val in attrs:

                if DEBUG & DEBUG_ATTRS: print(f"debug: {key}={val}")

                if key=="id" and val=="workers/WorkerProjectsTable-hybrid-root":
                    found_tag = True
                elif found_tag and key=="data-props":
                    d = json.loads(val)
                    v = d["dashboardMerchTargeting"]
                    pl = v["projects"]

                    current_projects = set()
                    for p in pl:
                        if p["pay"] < "$22/hr": continue


                        if p["id"] not in live_projects:
                            live_projects[p["id"]] = {
                                "name": p["name"].encode(errors='ignore').decode(),
                                "coding?": str(p["isCoding"]),
                                "pay": p["pay"],
                                "id": p["id"],
                                "added": datetime.datetime.now(),
                            }

                        # record every matching project from this poll, new or not
                        current_projects.add(p["id"])

                    not_yet_live = current_projects - lpks
                    still_live = current_projects - not_yet_live
                    no_longer_live = lpks - current_projects

                    if DEBUG & DEBUG_SETS:
                        print(f"\ndebug current_projects (a): {current_projects}")
                        print(f"debug lpks (b): {lpks}")
                        print(f"debug not_yet_live (c = a - b): {not_yet_live}")
                        print(f"debug still_live (a - c): {still_live}")
                        print(f"debug no_longer_live (b - a): {no_longer_live}")

                    if len(not_yet_live) > 0 or len(no_longer_live) > 0:
                        ctime = datetime.datetime.now()
                        print()
                        print(ctime)

                        note = ""

                        if len(no_longer_live) > 0:
                            for k in no_longer_live:
                                print("- (live " + str((ctime - live_projects[k]["added"])) + ") " + live_projects[k]["name"])
                                del live_projects[k]

                        if len(still_live) > 0:
                            for k in still_live:
                                print("  " + live_projects[k]["name"])
                                note += "  " + live_projects[k]["name"]

                        if len(not_yet_live) > 0:
                            for k in not_yet_live:
                                print("+ " + live_projects[k]["name"])
                                note += "+ " + live_projects[k]["name"]

                    if len(not_yet_live) > 0:
                        os.system(f"osascript -e 'display notification \"{note}\" with title \"{ctime}\"'")


    parser = MyHTMLParser()
    parser.feed(r.text)

    t = Timer(60.0 * 5, do_it) # every 5 minutes
    t.start()

do_it()
