zchtodd

Posted on Sep 8, 2019

SQLAlchemy Performance Anti-Patterns and Their Fixes

#python #database #sql #sqlalchemy

An ORM makes life easier for application developers. SQLAlchemy offers a mature ORM that brings with it the benefit of model relationships, a powerful query construction paradigm, easy serialization, and much more. As a result of this ease of use, however, it can be easy to forget what is going on behind the scenes. Seemingly small choices made when using SQLAlchemy can have important peformance implications.

In this article I'll go over some of the top performance issues that developers encounter when using SQLAlchemy.

Retrieving an entire result set when only the count is needed

Sometimes a developer will only need a count of results, but instead of utilizing a database count, all of the results are fetched and the count is done via len in Python.

count = len(User.query.filter_by(acct_active=True).all())

Using the SQLAlchemy count method instead will cause the count to be performed server-side, resulting in far less data sent to the client. Calling all() in the prior example also results in the instantiation of model objects, which can become expensive quickly given enough rows.

Unless more than the count is required, opt for using the count method.

count = User.query.filter_by(acct_active=True).count()

Retrieving entire models when only a few columns are needed

In many cases only a few columns are really needed when issuing a query. Instead of returning entire model instances, SQLAlchemy can fetch only the columns we're interested in. This not only reduces the amount of data sent, but also avoids the need to instantiate entire objects. Working with tuples of column data instead of models can be quite a bit faster.

result = User.query.all()
for user in result:
    print(user.name, user.email)

Opt instead to select only what is needed using the with_entities method.

result = User.query.with_entities(User.name, User.email).all()
for (username, email) in result:
    print(username, email)

Updating one object at a time inside a loop

Avoid using loops to update collections one at a time. While the database may execute a single update very quickly, the roundtrip time between the application and database servers will quickly add up. In general, strive for fewer queries where reasonable.

for user in users_to_update:
  user.acct_active = True
  db.session.add(user)

Use the bulk update method instead.

query = User.query.filter(user.id.in_([user.id for user in users_to_update]))
query.update({"acct_active": True}, synchronize_session=False)

Triggering cascading deletes

The ORM allows easy configuration of relationships on models, but there are some subtle behaviors that can be surprising. Most databases maintain relational integrity through foreign keys and various cascade options. SQLAlchemy allows you to define models with foreign keys and cascade options, but the ORM has its own cascade logic that can preempt the database.

Consider the following models.

class Artist(Base):
    __tablename__ = "artist"

    id = Column(Integer, primary_key=True)
    songs = relationship("Song", cascade="all, delete")

class Song(Base):
    __tablename__ = "song"

    id = Column(Integer, primary_key=True)
    artist_id = Column(Integer, ForeignKey("artist.id", ondelete="CASCADE"))

Deleting artists will cause the ORM to issue delete queries on the Song table, thus preventing the deletes from happening as a result of the foreign key. This behavior can become a bottleneck with complex relationships and a large enough number of records.

Include the passive_deletes option to ensure that the database is managing relationships. Be sure, however, that your database is capable of this. SQLite, for example, does not manage foreign keys by default.

songs = relationship("Song", cascade="all, delete", passive_deletes=True)

Relying on lazy loading when eager loading should be used

Lazy loading is the default SQLAlchemy approach to relationships. This implies that, building from the last example, loading an artist does not simultaneously load his or her songs. This is usually a good idea, but the separate queries can be wasteful if certain relationships always need to be loaded.

Popular serialization frameworks like Marshmallow can trigger a cascade of queries if relationships are allowed to load in a lazy fashion.

There are a few ways to control this behavior. The simplest method is through the relationship function itself.

songs = relationship("Song", lazy="joined", cascade="all, delete")

This will cause a left join to be added to any query for artists, and as a result, the songs collection will be immediately available. Although more data is returned to the client, there are potentially far fewer roundtrips.

SQLAlchemy offers finer grained control for situations where such a blanket approach can't be taken. The joinedload() function can be used to toggle joined loading on a per query basis.

from sqlalchemy.orm import joinedload

artists = Artist.query.options(joinedload(Artist.songs))
print(artists.songs) # Does not incur a roundtrip to load

Using the ORM for a bulk record import

The overhead of constructing full model instances becomes a major bottleneck when importing thousands of records. Imagine, for example, loading thousands of song records from a file where each song has first been converted to a dictionary.

for song in songs:
    db.session.add(Song(**song))

Instead bypass the ORM and use just the parameter binding functionality of core SQLAlchemy.

batch = []
insert_stmt = Song.__table__.insert()
for song in songs:
    if len(batch) > 1000:
       db.session.execute(insert_stmt, batch)
       batch.clear()
    batch.append(song)
if batch:
    db.session.execute(insert_stmt, batch)

Keep in mind that this method naturally skips any client side ORM logic that you might depend on, such as Python based column defaults. While this method is faster than loading objects as full model instances, your database may have bulk loading methods that are faster. PostgreSQL, for example, has the COPY command that offers perhaps the best performance for loading large numbers of records.

Calling commit or flush prematurely

There are many occasions when you will need to associate a child record to its parent, or vice versa. One obvious way of doing this is to flush the session so that the record in question will be assigned an ID.

artist = Artist(name="Bob Dylan")
song = Song(title="Mr. Tambourine Man")

db.session.add(artist)
db.session.flush()

song.artist_id = artist.id

Committing or flushing more than once per request is usually unnecessary and undesireable. A database flush involves forcing disk writes on the database server, and in most circumstances the client will block until the server can acknowledge that the data has been written.

SQLAlchemy can track relationships and manage keys behind the scenes.

artist = Artist(name="Bob Dylan")
song = Song(title="Mr. Tambourine Man")

artist.songs.append(song)

Wrapping up

I hope this list of common pitfalls can help you avoid these issues and keep your application running smoothly. As always, when diagnosing a performance problem, measurement is key. Most databases offer performance diagnostics that can help you pinpoint issues, such as the PostgreSQL pg_stat_statements module.

Top comments (4)

ss153g • Oct 4 '20 • Edited

Great article! I was searching for if it is faster to search int and string together vs. just string. My use case is at follows: I'm getting brand & product information together however, I need to store brand separately and use it's primary_key as a foreign_key in my products table. Instead of looking up brand by its name, I'm creating an int equivalent of the brand name by calculating the sum of ascii value of each character of the brand. Then when I get a brand, I'm doing a search against the int sum value and brand text. My assumption is that the query will first search int first which will be faster and since ascii sum of the different text values can be the same, the query may have more than one record back and then the query uses the string value to further filter down the table.
I am also getting data from the taxonomy from here: google.com/basepages/producttype/t...