zchtodd

Posted on Jun 27, 2021 • Edited on Jul 5, 2021 • Originally published at theparsedweb.com

Building a SaaS App: Beyond the Basics (Part III)

#python #javascript #docker

Once you've finished this post, you'll have a foundation on which to build the data model of your applications, using SQLAlchemy and Postgres.

If you haven't read the first post in the series, this is a step by step guide on building a SaaS app that goes beyond the basics, showing you how to do everything from accept payments to manage users. The example project is a Google rank tracker that we'll build together piece by piece, but you can apply these lessons to any kind of SaaS app.

In the last post, we set up NGINX and Flask using Docker, with both a local development version, as well as a version suitable for production deployment. In this post, we'll set up SQLAlchemy and explore a few of the performance pitfalls that lurk behind the scenes. Then we'll move to setting up our first real route handler, so that the scraper we built in part one can report its results.

You can find the complete code on GitHub.

Setting up SQLAlchemy and Postgres

Back in the first post, we built a working Google search scraper, but we didn't have anywhere to put the results. We're going to fix that problem now with the help of SQLAlchemy – by far the most popular ORM library for Python.

If you haven't used one before, using an ORM will allow us to work in terms of objects, instead of working with messy raw SQL strings in the Python code. Luckily, setting up SQLAlchemy to work with a Flask application is very straightforward, thanks to the Flask-SQLAlchemy package.

The app/__init__.py file contains all of the configuration necessary to get started.

from flask import Flask
from flask_sqlalchemy import SQLAlchemy

from app.util import create_db_uri

db = SQLAlchemy()

def init_app(app):
    db.init_app(app)
    return app

def create_app():
    app = Flask(__name__)
    app.config["SQLALCHEMY_DATABASE_URI"] = create_db_uri()
    app.config["SQLALCHEMY_ECHO"] = False

    return app

from app.models import *  # noqa

This is a reduced version of the init file containing just the minimum needed to set up Flask-SQLAlchemy. The config value SQLALCHEMY_DATABASE_URI tells Flask-SQLAlchemy how to connect to the database. This ultimately depends on the environment variables we saw in Part 2, such as POSTGRES_USER and POSTGRES_HOST.

The SQLALCHEMY_ECHO value is useful when debugging – when set to true, every SQL statement is logged, so you can see what's happening at every step. We'll see a lot of the global db variable throughout the application, because we'll import it wherever we need to interact with the database.

You might also notice the seemingly odd import at the bottom of the file, but it serves an important purpose. As you'll see soon, each of our models resides in its own file. Until a model is imported, SQLAlchemy won't know that it exists, even though we created the definition. Thus, the wildcard import at the bottom ensures that all of our models are imported at runtime.

Defining a model is easy. Model classes inherit from db.Model and define the columns, indexes, and constraints that belong to that model.

from app import db

class ProxyConnection(db.Model):
    __tablename__ = "proxyconn"

    id = db.Column(db.Integer, primary_key=True)

    proxy_url = db.Column(db.String, nullable=False)
    username = db.Column(db.String, nullable=False)
    password = db.Column(db.String, nullable=False)

    allow_parallel = db.Column(
        db.Boolean, default=False, server_default="f", nullable=False
    )

    usage_count = db.Column(db.Integer, default=0, server_default="0")
    block_count = db.Column(db.Integer, default=0, server_default="0")
    consecutive_fails = db.Column(db.Integer, default=0, server_default="0")

    engaged = db.Column(db.Boolean, default=False, server_default="f")

    min_wait_time = db.Column(db.Integer, default=0, server_default="0", nullable=False)
    random_delay = db.Column(db.Integer, default=0, server_default="0", nullable=False)
    last_used = db.Column(db.DateTime, index=True, nullable=True)

As we discussed in the first part of the series, we'll need to use proxy connections for the scraper – we'll keep track of those proxies in the database, as well as how they are performing. We can set a threshold, for instance, so that if a proxy has a certain number of consecutive_fails we take it out of rotation. The project will eventually have many tables, with a model for everything from users to ranking results.

At the moment, however, the database is empty. We need to create the tables defined in our models. To do that, we can use manage.py to create an interactive shell. This shell session is almost the same as an interactive Python session, but within the context of the Flask application.

docker exec -it openranktracker_app_1 python manage.py shell
>>> db.create_all()

The shell makes the db global available, and the create_all function will initialize the entire schema within Postgres. Once that step is complete, you can verify that the tables were created using a psql session.

docker exec -it openranktracker_database_1 psql -U pguser -d openranktracker

psql (11.4 (Debian 11.4-1.pgdg90+1))
Type "help" for help.

openranktracker=# \d
               List of relations
 Schema |       Name       |   Type   | Owner  
-------------+------------------+----------+--------
 public | domain           | table    | pguser
 public | domain_id_seq    | sequence | pguser
 public | keyword          | table    | pguser
 public | keyword_id_seq   | sequence | pguser
 public | proxyconn        | table    | pguser
 public | proxyconn_id_seq | sequence | pguser
(6 rows)

openranktracker=#

The tables have been created! Now we just need to put some data in there.

SQLAlchemy performance pitfalls to avoid

We want our users to have a fast and responsive experience with the application. In my experience, the most common source of slowness is mistakes made with the ORM. SQLAlchemy allows for great convenience and speed of development, but it's easy to lose track of what's happening behind the scenes.

So before moving on, let's cover some of the biggest performance killers, and how to avoid them in our application.

The N+1 Problem

Relationship loading is one of the killer features of any ORM. Instead of manually writing SQL, we can treat data more like objects and object properties. To take an example, think of artists, their albums, and the songs that make up those albums.

This would be a fairly typical way to represent such a problem.

class Artist(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    name = db.Column(db.String)

class Album(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    name = db.Column(db.String)

    artist_id = db.Column(
        db.Integer,
        db.ForeignKey("artist.id", ondelete="CASCADE"),
        index=True,
        nullable=False,
    )

    artist = db.relationship(
        Artist,
        backref=db.backref(
            "albums", order_by="Album.name", cascade="delete-orphan,all"
        ),
    )

class Song(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    name = db.Column(db.String)

    album_id = db.Column(
        db.Integer,
        db.ForeignKey("album.id", ondelete="CASCADE"),
        index=True,
        nullable=False,
    )    

    album = db.relationship(
        Album,
        backref=db.backref(
            "songs", order_by="Song.name", cascade="delete-orphan,all"
        ),
    )

Now say we wanted to show a big list of all artists, their albums, and songs in one place. Because of the relationships that we have defined, this would be one possible way of doing that.

result = []
for artist in Artist.query.all():
    for album in artist.albums:
        for song in album.songs:
            result.append((artist.name, album.name, song.name))

This may seem innocent enough, and in fact might work perfectly well with a small amount of data, but there is a lot happening behind the scenes. After setting SQLALCHEMY_ECHO to true, we see a surprising number of queries sent to the database for such a simple program.

[2021-06-26 17:03:27,602] INFO in log: SELECT artist.id AS artist_id, artist.name AS artist_name 
FROM artist
[2021-06-26 17:03:27,607] INFO in log: SELECT album.id AS album_id, album.name AS album_name, album.artist_id AS album_artist_id 
FROM album 
WHERE %(param_1)s = album.artist_id ORDER BY album.name
[2021-06-26 17:03:27,610] INFO in log: SELECT song.id AS song_id, song.name AS song_name, song.album_id AS song_album_id 
FROM song 
WHERE %(param_1)s = song.album_id ORDER BY song.name

And this is just with one artist and a single album! For each new artist or album, you can expect to see another query. The N+1 problem refers to the idea that an ORM is constantly issuing yet another query for each related object that you want to load.

This is a real problem, because each database round-trip quickly adds up, and that means our user is growing impatient as they stare at a loading spinner.

There are many variations on this problem, but the general idea is that we should aim to accomplish the task with fewer queries to the database. This isn't always worth obsessing over, especially if we're certain that the amount of data is always going to remain limited. When we know that a serious amount of data is expected, however, it's worth thinking in terms of how many queries are involved.

How can we speed up the code we saw above?

There is no one absolute best answer, but instead solutions that fit different scenarios. If we know, for instance, that we're almost always going to fetch artists, albums, and songs together, then we can change how the relationships are loaded.

    artist = db.relationship(
        Artist,
        backref=db.backref(
            "albums", order_by="Album.name", lazy="joined", cascade="delete-orphan,all"
        ),
    )

Adding lazy="joined" instructs SQLAlchemy to always load the related data upfront by issuing a query with a JOIN. This means fewer queries to the database, because the data is already available when accessing the relationship attributes.

[2021-06-26 17:21:44,224] INFO in log: SELECT artist.id AS artist_id, artist.name AS artist_name, album_1.id AS album_1_id, album_1.name AS album_1_name, album_1.artist_id AS album_1_artist_id, song_1.id AS song_1_id, song_1.name AS song_1_name, song_1.album_id AS song_1_album_id 
FROM artist LEFT OUTER JOIN album AS album_1 ON artist.id = album_1.artist_id LEFT OUTER JOIN song AS song_1 ON album_1.id = song_1.album_id ORDER BY album_1.name, song_1.name

Now all of the albums and songs are loaded alongside the artist data, and in a single query, instead of separate queries that waste round-trip time. Of course, if the relationships are rarely loaded, this becomes wasteful, because we're asking the database to do this extra work regardless.

You can also do the join yourself, which makes sense when always joining up-front is overkill.

result = Artist.query.join(Album, Song).all()

Calling commit too many times

Understanding when to call commit is also important. Performance and data integrity are two key reasons to call commit at the proper point in the code. Commit marks all of your changes as permanent (i.e. visible outside of your current transaction), and does so by forcing all of your updates to disk.

You want your changes persisted to disk, but calling commit multiple times repeats this process unnecessarily. Call commit only once, typically once you're done making all changes. This is usually simple in a web application, where you should expect to see commit() called near the end of a route handler.

Great convenience, but with caution required

SQLAlchemy brings great convenience, but also the ability to shoot yourself in the foot. Be mindful of what's happening in the background, and just because some code works fine now, doesn't ensure it won't grind to a halt later when flooded with real data.

Setting up our first API route handler

Now that SQLAlchemy is set up, we're almost ready to start storing some real data. The scraper agent from part one is already collecting ranking data, so let's build an API endpoint that can store those results.

First we'll make a slight modification of app/__init__.py to register the Flask blueprint that represents the API. A blueprint is a Flask concept that allows for endpoints sharing a common prefix (i.e. /api in our case) to be grouped together.

def init_app(app):
    db.init_app(app)

    from app.api import api_blueprint

    app.register_blueprint(api_blueprint)

    return app

Importing the api_blueprint object within the init_app function prevents circular import issues, since the code within those endpoints will need to import the global db object.

We'll need a place to wire up routes to their respective handlers, and app/api/__init__.py is where that happens. We'll start off with just one route, for handling data coming back from the Puppeteer search scraper.

from flask import Blueprint
from flask_restful import Api

from app.api.keywords.scan_callback import ScanCallbackView

api_blueprint = Blueprint("main", __name__, url_prefix="/api")
api = Api(api_blueprint)

api.add_resource(ScanCallbackView, "/keywords/<int:keyword_id>/callback/")

The <int:keyword_id> in the URL path is a placeholder for a variable that we expect to receive. The value will be passed along to the handler method, as we'll see in the next snippet of code.

from flask import request, abort
from flask import current_app as app

from flask_restful import Resource
from app.services.keyword import handle_scraper_response


class ScanCallbackView(Resource):
    def post(self, keyword_id):
        data = request.get_json()

        app.logger.debug("Keyword scan callback initiated")

        if data.get("secret_key") != app.config["SECRET_KEY"]:
            app.logger.warning(
                "Scan callback did not provide correct secret key: {}".format(
                    data.get("secret_key")
                )
            )
            abort(403)

        handle_scraper_response(keyword_id, data)
        return "", 201

This project uses Flask-RESTful, so the handlers are class based instead of functions – this allows us to handle GET, PUT, POST, and DELETE a little more elegantly. The scraper sends a POST request, and the handler extracts the data via request.get_json() before processing the results within the handle_scraper_response function.

We haven't covered users or authentication yet, so how do we prevent abuse of this endpoint? The app has a SECRET_KEY config value, which it will pass to scraper sessions. When a scraper sends its POST request, it will include that same secret key for verification.

That's all it takes to add a route handler! At this point we have some real functionality: the scraper runs, collects results, and now has a place to report those results.

What's next?

In part four, we'll break ground on the user interface for OpenRankTracker using React. We'll start off with the user sign-up and login flow and build from there! I hope you'll join me for part four.