DEV Community

Jimmy Yeung
Jimmy Yeung

Posted on

Performant script in updating a large table🚀

From time to time, it's inevitable to update some fields in the database via some manual scripts, no matter it's data patching or some random requirements from the business side.

I often have a hard time in writing a script that updates a large table. It just takes so much time in running once, not to mention if you need to debug/optimise after each run.

Use Case

In one of my sprint, I need to append a json object into a jsonb column, in which the json object is computed differently for each row. There are millions of records in the Postgres table and it takes hours to run the script for testing. In order to minimise the time taken for each test so that I can finish my work on time, I tried to compare different ways in order to acquire the fastest result:

(In the examples below, I'm using 10000 records for each case).

  1. Naive Django ORM .save() way

    for record in records:
        json_value = <computation on json_value>
        record.json_array_field.append(json_value)
        record.save()
    

    with cProfile: 6951470 function calls (6891465 primitive calls) in 31.840 seconds

    This is definitely the slowest way as it hits the database and executes the update query for every row. Just the first try, we could definitely make things faster.

  2. Django Django ORM bulk_update way

    for record in records:
        json_value = <computation on json_value>
        record.json_array_field.append(json_value)
    MyModel.objects.bulk_update(records, fields=. ["json_array_field"])
    

    with cProfile: 5072220 function calls (4612182 primitive calls) in 20.213 seconds

    Slightly better. bulk_update is actually suggested by Django for optimization as bulk_update just executes one query. But it's still a lot of time, I tried to find if there's better way to improve.

  3. Django bulk_update with batch_size

    for record in records:
        json_value = <computation on json_value>
        record.json_array_field.append(json_value)
    MyModel.objects.bulk_update(
        records, 
        fields=["json_array_field"], 
        batch_size=1000
    )
    

    with cProfile: 5077900 function calls (4617619 primitive calls) in 11.765 seconds

    Django: If updating a large number of columns in a large number of rows, the SQL generated can be very large. The batch_size parameter controls how many objects are saved in a single query.

    Performance are further improved by dividing the updates into 10 batches. Not bad, but again, we could make things faster.

  4. SQL way

    values_to_be_updated = [(json.dumps(<computation on json_value>), i.id) for i in records]
    with connection.cursor() as cursor:
        execute_values(
            cursor,
            "UPDATE my_table "
            "SET json_array_field = json_array_field::jsonb || "
            "(v.new_json_value)::jsonb FROM (VALUES %s) as v (new_json_value, bid) "
            "WHERE id = v.bid::uuid;",
            values_to_be_updated,
        )
    

    with cProfile: 802769 function calls (802756 primitive calls) in 3.603 seconds

    That calls it a day since it reduces down to just ~3 seconds for updating! It's one tenth of using the .save() one. I just feel so dumbed on trying different ORM methods at first and spent hours waiting it finish running. I should definitely use the SQL way at first. (Imagine if we are updating millions or billions of rows, how huge the difference would be.)

Key Takeaway (for me)

Maybe it's well-known already, but just want to say it again: when writing a one-off bulk insert/update script in a table with large amount of records, do try consider running a SQL at the very first beginning. That could save you a lot of time on waiting and optimising.

But if you needs to consider readability or you do not care about performance, ORM could be a better choice. Just remember in terms of performance, SQL always wins ;D

Discussion (0)