DEV Community

Sascha
Sascha

Posted on

Aggregation in Django JSONFields

This is my first post on dev.to and I think about it more as a note for myself and perhaps someone finds this as interesting as I do.

What I am doing (going to do) with Django and Postgres

I am currently working with some Twitter JSON data that I analyze. I use different libraries and tools for that, e.g. pandas and networkx for looking at the data and creating networks, also Kibana and Elasticsearch for some more quantitative analysis and visualization (Tweets per day and things like that).

However, I identified some users that I want to concentrate my analysis on, read some content, look at some user level visualizations, things like that. It's quite cumbersome to do these things in a jupyter notebook because it is lacking the easy usage of a webapplication that concentrates on these things.

Why JSONField and why Postgres

First thing is, Django works very well with Postgres and I don't want to integrate an extra NoSQL database to the project. Also, I have no idea, how well Django works with e.g. MongoDB. Second, I have no idea, how NoSQL databases "work". I work with MySQL and Postgres for years and even if I am no expert, I usually get along quite well. The third point (which is a little mixed up with the first one), is, I want to use the ORM and don't want to learn some new syntax just for data retrieval.

Data preparation

I don't want to go into detail on that part, because I have a quite large and self-made toolchain. The most important part is that at the end I have a pandas dataframe that I can export to a flat json structure. I decided not to create a Django model that can be used with this json structure because I am quite sure that the datastructure will change quite often in the future and I don't want to write migrations and touch my core application just because I added some fields to my datastructure.

The datastructure looks like this:

{
    "created_at": 15588051070000,
    "favorite_count": 4,
    "id": <unique_id>,
    "is_quote": null,
    "is_reply": null,
    "is_retweet": null,
    "quote_status_id": 0,
    "quote_text": null,
    "quote_user_id": null,
    "quote_user_name": null,
    "reply_count": 0,
    "reply_status_id": 0,
    "reply_user_id": null,
    "reply_user_name": null,
    "retweet_count": 5,
    "retweet_status_id": 0,
    "retweet_text": null,
    "retweet_user_id": null,
    "retweet_user_name": null,
    "text": "<tweet_text>",
    "text_is_extended": false,
    "user_id": <user_id>,
    "user_name": "<user_name>"
}

and I am mostly interested on the text, the creation time and the amount of retweets. There is some annotating and filtering necessary in order to calculate some of these values.

My first important query

Of course, I want my database to do the heavy lifting when it comes to data aggregation/grouping/filtering etc. If I wouldn't want the database to do this, I could just use json flatfiles and load them into python dictionaries, pandas dataframes or javascript objects.

The information I need on the first page of my webapp is the following:

All usernames with their respective amount of tweets and the sum of their retweets ordered by this sum.

That's not that hard to do in a "normal" Django application, I think you would do something like this:

# I wrote this from the top of my head so if it is wrong,
# please correct me in the comments

tweets = Tweet.objects.values("user_name")
    .annotate(
        tweet_count=Count("user_name"),
        retweets=Sum("retweet_count")
    )
    .order_by("-retweets")

Well, of course, I cannot do this, because my Tweet model has no Tweet.user_name or something like this. It just looks like this:

class Tweet(models.Model):
    name = models.CharField(max_length=200)
    data = JSONField()
    created_at = models.DateTimeField(null=True)

and all of the data I need are json formatted inside Tweet.data. I read about JSONField in the Django docs (https://docs.djangoproject.com/en/2.2/ref/contrib/postgres/fields/#key-index-and-path-lookups) and I knew already that I can do things like

Tweet.objects.filter(data__user_name="whoever")

which is really cool because that is more or less exactly the same syntax you use when you have a normal relational database model. So I thought, I could to the same in my annotations. Just change the Sum("retweet_count") to Sum("data__retweet_count") and I would be fine.

It turns out that this is not the case, because these aggregation functions only "see" the data thing of the object and not its content. So while it works to count these things, summing the numbers inside data__retweet_count is not doable with this approach.

After reading some really annoyingly complex solutions on stackoverflow that used RAWSql queries and such, I found the solution that I am using and that works really good:

Tweet.objects.annotate(
    user_name=Cast(
        KeyTextTransform("user_name", "data"), models.TextField()
    )
)
.values("user_name")
.annotate(
    tweet_count=Count("data"),
    retweets=Sum(
        Cast(
            KeyTextTransform("retweet_count", "data"), models.IntegerField()
        )
    ),
)
.order_by("-retweets")

So while that looks a lot more complicated than the query above, in reality it is not and I will explain the things that happen there one after the other.

The most important thing here is KeyTextTransform that you have to import from django.contrib.postgres.fields.jsonb. What it does is extracting the specified key from the specified json object for you (on an ORM level, I have no idea, how that works).

So the first annotation with the user_name is basically just Get the user_name from the json thing as a string (Textfield) and name it user_name. The only reason I do that is because if I would not, that value would be called data__user_name in the resulting data structure and I don't want that.

The next annotation is first the tweet count which is just a counter of the occurence of the username inside my database so I could put everything in there. The retweets annotation is basically the same as the user_name thing but with a cast to Integer instead of Text and a summation of the numbers inside there.

So, to summarize this query: The only things that are added on top are the key lookup in the json datastructure and the casting of that value to the datatype I need. That's all.

Final results

I put this query in a Manager and my Tweet model looks like this:

import logging
import pytz

from django.conf import settings
from django.contrib.postgres.fields import JSONField
from django.contrib.postgres.fields.jsonb import KeyTextTransform
from django.db import models
from django.db.models import Count, Sum
from django.db.models.functions import Cast

from django.utils import timezone

logger = logging.getLogger(__name__)


class TweetManager(models.Manager):
    def counts_only(self):
        return (
            Tweet.objects.annotate(
                user_name=Cast(
                    KeyTextTransform("user_name", "data"), models.TextField()
                )
            )
            .values("user_name")
            .annotate(
                tweet_count=Count("data"),
                retweets=Sum(
                    Cast(
                        KeyTextTransform("retweet_count", "data"), models.IntegerField()
                    )
                ),
            )
            .order_by("-retweets")
        )


class Tweet(models.Model):
    name = models.CharField(max_length=200)
    data = JSONField()
    created_at = models.DateTimeField(null=True)
    objects = TweetManager()

    def __str__(self):
        return self.name

    def save(self, *args, **kwargs):
        try:
            self.name = f"{self.data['user_name']}_{self.data['id']}"
            # the creation datetime is stored as miliseconds timestamp
            naive_created_at = timezone.datetime.fromtimestamp(
                self.data["created_at"] / 1000
            )
            self.created_at = pytz.timezone(settings.TIME_ZONE).localize(
                naive_created_at, is_dst=None
            )
        except Exception as e:
            logger.error(e)
        super().save(*args, **kwargs)

I now can call it in my view:

Tweet.objects.counts_only()

and get a result that I can pass to the frontend and work with it in my template or javascript without recalculating anything:

[{'user_name': 'user_0', 'tweet_count': 25, 'retweets': 2760},
{'user_name': 'user_1', 'tweet_count': 1, 'retweets': 891},
{'user_name': 'user_2', 'tweet_count': 165, 'retweets': 2265},
{'user_name': 'user_3', 'tweet_count': 12, 'retweets': 1769},
{'user_name': 'user_4', 'tweet_count': 59, 'retweets': 1663},
{'user_name': 'user_5', 'tweet_count': 6, 'retweets': 1657},
{'user_name': 'user_6', 'tweet_count': 8, 'retweets': 1420},
{'user_name': 'user_7', 'tweet_count': 26, 'retweets': 1186},
{'user_name': 'user_8', 'tweet_count': 32, 'retweets': 1076},
{'user_name': 'user_9', 'tweet_count': 473, 'retweets': 932}]

Discussion (6)

Collapse
reaganscofield profile image
Reagan Scofield • Edited

I spend couple hours trying to figure the right ORM Algorithms but I find your article so clean however I followed step by step my code doesn't work the way I expected it to work, the COUNT return result and SUM return None type it does seems like it doesn't getting a value of JSON keys can you please assist ? and what version of Django are you using for this article ?

Collapse
saschalalala profile image
Sascha Author

I am using Django in its most recent version which is 2.2.5.

Can you describe your data structure in more detail in order for me to understand your problem better?

Perhaps you could also provide some example json data?

Collapse
reaganscofield profile image
Reagan Scofield • Edited

Here are my code and I used Django 2.2.2, when I query data JSON value does come as None type and the count does bring value

from django.db import models
import uuid
from django.contrib.postgres.fields import JSONField
from django.contrib.postgres.fields.jsonb import KeyTextTransform
from django.db.models import Count, Sum
from django.db.models.functions import Cast
from django.db.models import FloatField
from django.db.models.functions import TruncDay       
from django.db.models import Q
from django.db.models.expressions import RawSQL


class Manager(models.Manager):
    def counts_and_sums(self, date_from, date_to):
        return (
            DataPoint.objects
                .annotate(day=TruncDay('datetime')).values('day')
                .annotate(count=Count('datetime'))
                .annotate(observed_irradiation_total = Sum(Cast(KeyTextTransform('irradiation', 'observed'), FloatField())) )
                .annotate(observed_energy_total = Sum(Cast(KeyTextTransform("energy", 'observed'), FloatField())))
                .annotate(expected_irradiation_total = Sum(Cast(KeyTextTransform('irradiation', 'expected'), FloatField())) )
                .annotate(expected_energy_total = Sum(Cast(KeyTextTransform('energy', 'expected'), FloatField())) )
                .values('day', 'count', 'observed_irradiation_total', 'observed_energy_total', 'expected_irradiation_total', 'expected_energy_total')
                .filter(Q(datetime__range=[date_from, date_to]))
        )

class DataPoint(models.Model):
    id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
    plant_id = models.ForeignKey(Plant, blank=True, null=True, on_delete=models.CASCADE)

    datetime = models.DateTimeField()
    expected = JSONField(blank=True, null=True)
    observed = JSONField(blank=True, null=True)
    objects = Manager()

    def __str__(self):
        return f"{self.datetime}"

query

        objectData = DataPoint.objects.counts_and_sums(date_from, date_to)

        for data in objectData:
            print(data)

my response

{'day': datetime.datetime(2019, 9, 8, 0, 0, tzinfo=<UTC>), 'count': 747, 'observed_irradiation_total': None, 'observed_energy_total': None, 'expected_irradiation_total': None, 'expected_energy_total': None}

and my object looks like this when I query it with normal DataPoint.objects.all()

 {
        "id": "742372b1-56b3-4d97-a7b1-a82eb205bcdf",
        "datetime": "2019-09-08T19:02:00Z",
        "expected": "{\"energy\": 90.78559770167864, \"irradiation\": 30.085498370965905}",
        "observed": "{\"energy\": 90.78559770167864, \"irradiation\": 30.085498370965905}",
        "plant_id": null
    },
Thread Thread
saschalalala profile image
Sascha Author • Edited

Unfortunately, all I can say is, it works for me.

Here is my example code that I used to create some example data:

import datetime as dt # don't care about timezones right now

expected = {"energy": 90.78559770167864, "irradiation": 30.085498370965905}
observed = {"energy": 10.78559770167864, "irradiation": 20.085498370965905}

today = dt.date.today()

# Just create some identical objects, I don't think that this matters right now

DataPoint.objects.create(datetime=today, expected=expected, observed=observed)
DataPoint.objects.create(datetime=today, expected=expected, observed=observed)
DataPoint.objects.create(datetime=today, expected=expected, observed=observed)
DataPoint.objects.create(datetime=today, expected=expected, observed=observed)

date_from = today - dt.timedelta(days=4)
date_to = today + dt.timedelta(days=3)

object_data = DataPoint.objects.counts_and_sums(date_from, date_to)

print(object_data)
<QuerySet [{'day': datetime.datetime(2019, 9, 10, 0, 0, tzinfo=<DstTzInfo 'Europe/Berlin' CEST+2:00:00 DST>), 'count': 4, 'observed_irradiation_total': 80.3419934838636, 'observed_energy_total': 43.1423908067146, 'expected_irradiation_total': 120.341993483864, 'expected_energy_total': 363.142390806715}]>

And just for the records, my version, taken from the same shell:

import django
print(django.VERSION)
(2, 2, 2, 'final', 0)

The only thing I changed in your code, is that I removed the plant_id ForeignKey

Collapse
arnababstentia profile image
arnab-abstentia

Facing a similar kind of problem in my office project. Just going through your code and implemented it and everything work's fine. Thanks for sharing !!!

Collapse
saschalalala profile image
Sascha Author

I love JSONFields. There are so many things that you can do with them (ArrayFields, also) which are somewhat counter intuitive compared to "classical rdbms thinking" but make some problems so much easier.