Kalob Taulien

Posted on Dec 30, 2021

Building a scraping tool with Python and storing it in Airtable (with real code)

#python

A startup often needs extremely custom tools to achieve its goals.

At Arbington.com we've had to build scraping tools, data analytics tools, and custom email functions.

None of this required a database. We used files as our "database" but mostly we used Airtable.

Scrapers

Nobody wants to admit it, but scraping is pretty important for gathering huge amounts of useful data.

It's frowned upon, but frankly, everyone does it. Whether they use an automated tool, or manually sift through thousands of websites to collect email addresses - most organizations do it.

In fact, scraping is what made the world's best search engine: Google.

And in Python, this is REALLY easy.

The hardest part is reading through various forms of HTML, but even then, we have a tool for that. Let's take a look at an example that I've adjusted so you can scrape my website.

We'll use https://kalob.io/teaching/ as the example and get all the courses I teach.

First, we look for a pattern in the DOM. Open up that page, right click, inspect element, and look for all the blue buttons.

You'll see they all have class="btn btn-primary". Interesting, we've found a pattern. Great! We can work with that.

Now let's jump right into the code. And if you're a Python dev, feel free to paste this into a Python shell in your terminal.

import requests 

response = requests.get("https://kalob.io/teaching/")
print(response.content)

You'll see the HTML for my website. Now, all we need to do is parse the HTML.
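
One thing that snippet skips is error handling. Here's a minimal sketch, assuming you'd rather have the script stop on a failed request than quietly parse an error page. The fetch_page function and its get parameter are names I made up for this example (get is only there so the function can be tried without touching the network); timeout and raise_for_status() are regular parts of requests.

```python
import requests


def fetch_page(url, get=requests.get, timeout=10):
    """Download a page and return its raw bytes, failing loudly on errors.

    `get` is a hypothetical hook for swapping in a fake during testing.
    """
    response = get(url, timeout=timeout)  # give up instead of hanging forever
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx replies
    return response.content
```

In the script you'd call it as fetch_page("https://kalob.io/teaching/") and carry on with the returned bytes.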

Note: response.content is raw bytes, not text. UTF-8 is the most commonly used encoding on the internet, so we'll decode those bytes as UTF-8 to get the HTML as one giant string.

Our code now looks like this:

import requests 

response = requests.get("https://kalob.io/teaching/")

html = response.content.decode("utf-8")
print(html)

And you'll see the HTML looks a little nicer now.
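
If the bytes-versus-string difference is new to you, here's a small sketch you can run without any network access. The decode_html helper is a made-up name, and using errors="replace" is my assumption about what you'd want when a page turns out not to be valid UTF-8 (a strict decode would raise UnicodeDecodeError instead).

```python
def decode_html(raw: bytes, encoding: str = "utf-8") -> str:
    """Turn raw response bytes into a str, replacing bytes that won't decode."""
    return raw.decode(encoding, errors="replace")


raw = "<h1>Café</h1>".encode("utf-8")  # stand-in for response.content
html = decode_html(raw)
print(type(raw).__name__, "->", type(html).__name__)  # bytes -> str

# "café" encoded as Latin-1 is not valid UTF-8, so the last byte gets
# replaced with U+FFFD (the replacement character) instead of crashing.
latin1_bytes = b"caf\xe9"
print(ascii(decode_html(latin1_bytes)))  # 'caf\ufffd'
```

requests also offers response.text, which does the decoding for you using the encoding it detects for the response.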

Now here's a big hairy problem: parsing HTML. Some people use attr="" some people use attr='' some people use XHTML and some don't.

So how do we get around this?

Introducing: Beautiful Soup 4.

In your Python environment pip install this package:

pip install beautifulsoup4

And your code now looks like this:

import requests 

response = requests.get("https://kalob.io/teaching/")

html = response.content.decode("utf-8")

import bs4  # You'll need to `pip install beautifulsoup4`
soup = bs4.BeautifulSoup(html, "html.parser")
print(soup)  # Shows the parsed HTML
print(type(soup))  # Returns <class 'bs4.BeautifulSoup'>

So our soup variable is no longer a string, but an object. This means we can use object methods on it - like looking for certain elements in the HTML we scraped.
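
To see why this fixes the quoting problem, here's a small sketch on a made-up HTML snippet (the file names are invented for the example). The same link is written three ways, and the parsed objects all answer the same question the same way.

```python
import bs4

# The same kind of link written with double quotes, single quotes, and no quotes.
html = """
<a class="btn" href="a.html">A</a>
<a class='btn' href='b.html'>B</a>
<a class=btn href=c.html>C</a>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# Once parsed, every tag is an object and the quoting style no longer matters.
links = [a.get("href") for a in soup.find_all("a")]
print(links)  # ['a.html', 'b.html', 'c.html']
```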

Let's put together a list of all the links on this page.

import requests 

response = requests.get("https://kalob.io/teaching/")

html = response.content.decode("utf-8")

import bs4  # You'll need to `pip install beautifulsoup4`
soup = bs4.BeautifulSoup(html, "html.parser")

courses = soup.find_all("a", {"class": ["btn btn-primary"]})
print(courses)

Look at that... now we have a list of buttons from the page we scraped at the beginning of this article.
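
One gotcha with that search: as far as I can tell from the Beautiful Soup docs, searching for the full string "btn btn-primary" matches the class attribute exactly as written. A button with the classes in a different order, or with an extra class, gets skipped. Here's a sketch on made-up HTML comparing that with a CSS selector. It uses the class_ keyword, which is the documented shorthand for the same kind of search, and the hrefs are invented for the example.

```python
import bs4

html = """
<a class="btn btn-primary" href="/one">One</a>
<a class="btn-primary btn" href="/two">Two</a>
<a class="btn btn-primary btn-lg" href="/three">Three</a>
<a class="btn btn-secondary" href="/four">Four</a>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# Exact-string match: only the first link has class="btn btn-primary" verbatim.
exact = soup.find_all("a", class_="btn btn-primary")
print([a["href"] for a in exact])  # ['/one']

# CSS selector: any <a> carrying both classes, in any order, with any extras.
flexible = soup.select("a.btn.btn-primary")
print([a["href"] for a in flexible])  # ['/one', '/two', '/three']
```

For my page the exact match is enough, but the selector version is the safer habit when you don't control the markup.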

Lastly, let's loop through them to get the button text and the link:

for course in courses:
    print(course.get("href"))
    print(course.text.strip())
    print("\n")

Listen, I wrote 3 print statements to make this clear - but typically I'd write this in a single line.

Now we have something to work with! We have the entire HTML element, the href attribute, and the innerText with the surrounding whitespace stripped.
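
Depending on the site, the href values you get back may be relative (like /courses/python/) rather than full URLs. Here's a small sketch using the standard library's urljoin to turn them into absolute links. The absolute_link helper and the example paths are made up for illustration.

```python
from urllib.parse import urljoin

PAGE_URL = "https://kalob.io/teaching/"  # the page we scraped


def absolute_link(href, base=PAGE_URL):
    """Resolve an href against the page it was found on."""
    return urljoin(base, href)


print(absolute_link("/courses/python/"))       # https://kalob.io/courses/python/
print(absolute_link("intro/"))                 # https://kalob.io/teaching/intro/
print(absolute_link("https://example.com/x"))  # https://example.com/x
```

Links that are already absolute pass through unchanged, so it's safe to run every href through it.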

The entire script is 8 lines of code and looks like this:

import requests
import bs4  # You'll need to `pip install beautifulsoup4`

response = requests.get("https://kalob.io/teaching/")
html = response.content.decode("utf-8")
soup = bs4.BeautifulSoup(html, "html.parser")
courses = soup.find_all("a", {"class": ["btn btn-primary"]})

for course in courses:
    print(f"{course.get('href')} -> {course.text.strip()}")

Moving this data somewhere useful

You know me, I'm a HUGE fan of Airtable.

And instead of using a local database or a cloud-based database, I like to use Airtable so my team and I can work with the data and easily expand the tables if we need to. For example, we might add a column to track whether a course meets our criteria to be on Arbington.com.

For this we use Airtable's API and the Python package airtable-python-wrapper.

Go ahead and install this through pip.

pip install airtable-python-wrapper

Now before we continue, you'll need a free Airtable account 👈 that's our referral link. No need to use it, it's just a nice kickback for us for constantly promoting Airtable 😂

Once you have an account, you need to dig up three things: your base ID (it starts with app), the name of the table you want to write to, and your API key (it starts with key). It would look something like this in Python:

from airtable.airtable import Airtable

airtable = Airtable('appXXXXXXXXX', 'Links', 'keyXXXXXXXXXX')
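
If the script is going into a shared repo, you probably don't want the real keys sitting in the code. Here's a rough sketch that reads them from environment variables instead. The variable names (AIRTABLE_BASE_ID, AIRTABLE_TABLE_NAME, AIRTABLE_API_KEY) and the load_airtable_settings helper are my own invention, not something Airtable or the package defines.

```python
import os


def load_airtable_settings(env=os.environ):
    """Return (base_id, table_name, api_key) from environment variables.

    The variable names are hypothetical; pick whatever suits your setup.
    """
    try:
        return (
            env["AIRTABLE_BASE_ID"],
            env.get("AIRTABLE_TABLE_NAME", "Links"),  # default to the table used here
            env["AIRTABLE_API_KEY"],
        )
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from None
```

The tuple comes back in the same order the constructor expects, so you can write airtable = Airtable(*load_airtable_settings()).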

Lastly, all we need to do is create a dictionary of Airtable Column Names, and insert the record.

import requests
import bs4  # You'll need to `pip install beautifulsoup4`
from airtable.airtable import Airtable

response = requests.get("https://kalob.io/teaching/")
html = response.content.decode("utf-8")
soup = bs4.BeautifulSoup(html, "html.parser")
courses = soup.find_all("a", {"class": ["btn btn-primary"]})

airtable = Airtable('appXXXXXXXXX', 'Links', 'keyXXXXXXXXXX')

for course in courses:
    new_record = {
        "Link": course.get('href'),
        "Text": course.text.strip(),
    }
    airtable.insert(new_record)

Assuming you set up your Airtable columns, table and API keys properly, you should see my website's link text and URLs appear in your Airtable.

Now you can scrape webpages and store the data in Airtable for the rest of your team to use!
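
One thing to watch: every insert creates a new row, so a page that links to the same course twice gives you duplicate rows. Here's a rough sketch of one way to tidy things up before saving, assuming the link is what makes a course unique. The build_records and save_records helpers are made-up names, and the insert function is passed in so the example runs without an Airtable account.

```python
def build_records(pairs):
    """Turn (href, text) pairs into Airtable-style dicts, skipping repeat links."""
    seen = set()
    records = []
    for href, text in pairs:
        if not href or href in seen:  # skip buttons without a link, and repeats
            continue
        seen.add(href)
        records.append({"Link": href, "Text": text.strip()})
    return records


def save_records(records, insert):
    """Call `insert` once per record and return how many were saved."""
    for record in records:
        insert(record)
    return len(records)


# Made-up scraped data; a plain list stands in for the Airtable table.
scraped = [("/a", " Course A "), ("/b", "Course B"), ("/a", "Course A again"), (None, "No link")]
stored = []
count = save_records(build_records(scraped), stored.append)
print(count)  # 2
```

In the real script the pairs would come from the courses loop, and you'd pass airtable.insert instead of stored.append. Note this only removes duplicates within a single run; running the script again would still insert the same courses a second time.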

Pulling data out to work with it

Now that all the data we want is in Airtable, we can use the same Python package to pull the data out, work with it, scrape more data, and update each record.

But that's for another day 😉
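
That said, here's a rough sketch of the shape of it. As far as I know, the package's get_all() hands back a list of dicts, each with an id and a fields dict, and empty cells are left out of fields. The sample data, the "Approved" column, and the unreviewed helper are all made up for this example.

```python
# Made-up records in the shape I'd expect airtable.get_all() to return.
sample_records = [
    {"id": "recAAA", "fields": {"Link": "/a", "Text": "Course A", "Approved": True}},
    {"id": "recBBB", "fields": {"Link": "/b", "Text": "Course B"}},
]


def unreviewed(records, column="Approved"):
    """Return (record_id, link) pairs for rows where `column` is empty or unchecked."""
    return [
        (record["id"], record["fields"].get("Link"))
        for record in records
        if not record["fields"].get(column)
    ]


print(unreviewed(sample_records))  # [('recBBB', '/b')]
```

With a real table you'd swap sample_records for airtable.get_all(), and once a course passes review, airtable.update(record_id, {"Approved": True}) would tick the box.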