
Furkan Kalkan


Fully automated metadata objects with Python 3.7's brand new dataclass library.


Dataclasses are a brand new feature introduced in Python 3.7. @btaskaya recently wrote a great article about them. If you haven't read it yet, you can read it here.

Dataclasses have promising features for creating reusable, self-verifying and automated metadata objects. Before that, I used plain dicts to create metadata objects, but copying and pasting the same object all the time is boring and conflicts with the DRY (Don't Repeat Yourself) rule.

It was like this:

Metadata = {}
Metadata["id"] = id
Metadata["url"] = url

if something:
    Metadata["some_field"] = some_data

Metadata["media"] = {}
Metadata["media"]["id"] = media_id 
...

I could have used a NamedTuple or something similar instead of a dict, but those have some limitations, and I really didn't have enough time to implement something fancier in the early days of the project. While refactoring the code, I realized that a dataclass fits my needs better.

In this article, I will show you how to implement fully automated metadata objects with dataclasses step by step.

Part 1: Implement metadata fields that don't need calculation

This step is straightforward. It's just a standard dataclass definition.

from dataclasses import dataclass


@dataclass
class Metadata:
    title: str
    url: str
    created_at: str = None    # Fields may have default value
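Here is a quick sketch of what the decorator gives us for free. It redefines the same class and exercises the generated `__init__`, `__repr__` and `__eq__`. The sample values are made up, and `Optional[str]` is my own addition to make the `None` default explicit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Metadata:
    title: str
    url: str
    created_at: Optional[str] = None  # default value, so the argument is optional


# Sample values below are illustrative only.
first = Metadata(title="Some Article", url="https://example.com/article.html")
second = Metadata("Some Article", "https://example.com/article.html")

print(first)            # generated __repr__ lists every field
print(first == second)  # generated __eq__ compares field by field
```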

Part 2: Add fields that need calculation and calculate them automatically.

These fields get their values only after a calculation. In our case, post_id should be a random number followed by an underscore and the url.

import random
from dataclasses import dataclass, field


@dataclass
class Metadata:
    # Normal fields
    title: str
    url: str
    created_at: str = None
    # Calculated fields
    post_id: str = field(init=False)

    def __post_init__(self):
        random_number = random.randint(100000, 999999)
        self.post_id = f"{random_number}_{self.url}"

The __post_init__ method calculates our post_id field right after initialization.

Let's call it:

>>> Metadata(
...  title="Some Article",
...  url="https://example.com/article.html",
...  created_at="2018-09-23"
... )
Metadata(title='Some Article', url='https://example.com/article.html', created_at='2018-09-23', post_id='696953_https://example.com/article.html')


Gotcha!
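One consequence worth seeing in isolation: a field declared with `field(init=False)` is left out of the generated `__init__`, so callers cannot supply it themselves. The sketch below is a reduced version of the class (only `url` and `post_id`), with made-up values.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Metadata:
    url: str
    post_id: str = field(init=False)  # excluded from the generated __init__

    def __post_init__(self):
        # Called automatically at the end of the generated __init__.
        random_number = random.randint(100000, 999999)
        self.post_id = f"{random_number}_{self.url}"


m = Metadata(url="https://example.com/article.html")
print(m.post_id)

rejected = False
try:
    # post_id is not an accepted constructor argument.
    Metadata(url="https://example.com/article.html", post_id="custom")
except TypeError as exc:
    rejected = True
    print("rejected:", exc)
```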

Part 3: Get our hands dirtier; add __post_init__-only pseudo fields

We may want to build complex structures automatically. For instance, when certain values are passed in, the dataclass can build a whole substructure for us. In our case, we use the additional fields author_names and author_ids to construct the authors field as a list of dicts. If no author information is provided for the article, the value of the authors field should be None.

import random
from dataclasses import InitVar, dataclass, field


@dataclass
class Metadata:
    # Normal fields
    title: str
    url: str
    created_at: str = None
    authors: list = None
    # Calculated fields
    post_id: str = field(init=False)
    # Non-nullable Pseudo fields
    author_names: InitVar[list]
    author_ids: InitVar[list]

    def __post_init__(self, author_names, author_ids): # You have to pass pseudo fields as the parameter.
        random_number = random.randint(100000, 999999)
        self.post_id = f"{random_number}_{self.url}"
        self.authors = []
        for i in range(0, len(author_names)):
            self.authors.append({"id": author_ids[i], "name": author_names[i]})


Let's call it:

>>> Metadata(
...  title="Some Article",
...  url="https://example.com/article.html",
...  created_at="2018-09-23"
... )

TypeError: non-default argument 'author_names' follows default argument

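It didn't work :( In fact, the error is raised as soon as the class is defined, before we even get to call it.

Important Note: You have to group default and non-default fields. Fields without a default value must be declared before fields with a default value.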
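The rule can be reproduced with a two-field class. This is a minimal sketch with hypothetical class names (`Broken` and `Fixed`); the `TypeError` is raised while `@dataclass` processes the class body, before any instance is created.

```python
from dataclasses import InitVar, dataclass

error_message = None

try:
    @dataclass
    class Broken:  # hypothetical class, only for demonstration
        created_at: str = None       # has a default
        author_names: InitVar[list]  # no default, declared too late
except TypeError as exc:
    error_message = str(exc)
    print(error_message)


@dataclass
class Fixed:  # same fields, the one without a default comes first
    author_names: InitVar[list]
    created_at: str = None


print(Fixed(author_names=["Furkan Kalkan"]))
```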

Try again:

import json
import random
from dataclasses import InitVar, asdict, dataclass, field


@dataclass
class Metadata:
    # Non-nullable Pseudo fields
    author_names: InitVar[list]
    author_ids: InitVar[list]
    # Normal fields
    title: str
    url: str
    created_at: str = None
    authors: list = None
    # Calculated fields
    post_id: str = field(init=False)

    def __post_init__(self, author_names, author_ids): # You have to pass pseudo fields as the parameter.
        random_number = random.randint(100000, 999999)
        self.post_id = f"{random_number}_{self.url}"
        self.authors = []
        for i in range(0, len(author_names)):
            self.authors.append({"id": author_ids[i], "name": author_names[i]})

    def to_json(self):
        return json.dumps(asdict(self))
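As an aside, the index-based loop in `__post_init__` can also be written with `zip`, which pairs the two lists directly. This is an alternative sketch, not the code used in this article, and the helper name `build_authors` is hypothetical. Keep in mind that `zip` stops at the shorter list, so mismatched lengths are silently truncated, whereas the index loop raises `IndexError` when `author_ids` is the shorter one.

```python
def build_authors(author_ids, author_names):
    """Pair ids and names into a list of {"id": ..., "name": ...} dicts."""
    return [
        {"id": author_id, "name": name}
        for author_id, name in zip(author_ids, author_names)
    ]


authors = build_authors(["1", "2"], ["Furkan Kalkan", "John Doe"])
print(authors)
```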

Let's call it again:

>>> Metadata(
... title="Some Article",
... url="https://example.com/article.html",
... created_at="2018-09-23",
... author_names=["Furkan Kalkan", "John Doe"],
... author_ids=["1", "2"]
... )
Metadata(title='Some Article', url='https://example.com/article.html', created_at='2018-09-23', authors=[{'id': '1', 'name': 'Furkan Kalkan'}, {'id': '2', 'name': 'John Doe'}], post_id='692728_https://example.com/article.html')

Yeah!

But wait... Where have author_names and author_ids gone?

Note: Pseudo fields declared as InitVar are only passed to __post_init__() as parameters; they are not part of the object.

>>> Metadata.author_names
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: type object 'Metadata' has no attribute 'author_names'
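
The same point can be checked programmatically. The sketch below uses a trimmed-down class (hypothetical, with only `title` and one pseudo field, plus an `author_count` field I made up to give `__post_init__` something to do) and `dataclasses.fields()`, which reports the real fields only.

```python
from dataclasses import InitVar, dataclass, fields


@dataclass
class Metadata:
    author_names: InitVar[list]  # pseudo field: init-only
    title: str
    author_count: int = 0        # hypothetical field derived from the pseudo field

    def __post_init__(self, author_names):
        self.author_count = len(author_names)


m = Metadata(author_names=["Furkan Kalkan", "John Doe"], title="Some Article")

field_names = [f.name for f in fields(m)]
print(field_names)                 # ['title', 'author_count']
print(hasattr(m, "author_names"))  # False
```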

Part 4: We don't always need to pass author_names.

Pseudo fields can be optional, too.

import json
import random
from dataclasses import InitVar, asdict, dataclass, field


@dataclass
class Metadata:
    # Non-nullable Pseudo fields
    author_ids: InitVar[list]
    # Normal fields
    title: str
    url: str
    created_at: str = None
    authors: list = None
    # Nullable Pseudo fields
    author_names: InitVar[list] = field(default=None)
    # Calculated fields
    post_id: str = field(init=False)

    # Pseudo fields are passed positionally, in the order they are declared above.
    def __post_init__(self, author_ids, author_names):
        random_number = random.randint(100000, 999999)
        self.post_id = f"{random_number}_{self.url}"
        if author_names:
            self.authors = []
            for i in range(0, len(author_names)):
                self.authors.append({"id": author_ids[i], "name": author_names[i]})

    def to_json(self):
        return json.dumps(asdict(self))

Call it:

>>> Metadata(
... title="Some Article",
... url="https://example.com/article.html",
... created_at="2018-09-23",
... author_ids=["1", "2"]
... )
Metadata(title='Some Article', url='https://example.com/article.html', created_at='2018-09-23', authors=None, post_id='692728_https://example.com/article.html')
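One detail to watch when pseudo fields are reordered: dataclasses passes `InitVar` values to `__post_init__` positionally, in the order the pseudo fields are declared in the class, so the parameter order of `__post_init__` should mirror the declaration order. Here is a minimal sketch with a hypothetical class (`Probe`) that just records what it receives:

```python
from dataclasses import InitVar, dataclass, field


@dataclass
class Probe:  # hypothetical class, only for demonstration
    author_ids: InitVar[list]           # declared first
    author_names: InitVar[list] = None  # declared second, optional
    received: tuple = field(init=False)

    def __post_init__(self, author_ids, author_names):
        # Parameters follow the declaration order above.
        self.received = (author_ids, author_names)


p = Probe(author_ids=["1", "2"])
print(p.received)  # (['1', '2'], None)
```

If the two parameters were swapped in the signature, `author_names` would receive the ids and `author_ids` would receive `None`.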

Part 5: We need JSON.

Python objects are nice, but we need to dump them as JSON to POST them to web services, message queues (MQs), etc. The dataclasses library has a built-in function, asdict(), which converts our object to a dict.

Let's write the wrapper for our object.

import json
import random
from dataclasses import InitVar, asdict, dataclass, field


@dataclass
class Metadata:
    # Non-nullable Pseudo fields
    author_names: InitVar[list]
    author_ids: InitVar[list]
    # Normal fields
    title: str
    url: str
    created_at: str = None
    authors: list = None
    # Calculated fields
    post_id: str = field(init=False)

    def __post_init__(self, author_names, author_ids): # You have to pass pseudo fields as the parameter.
        random_number = random.randint(100000, 999999)
        self.post_id = f"{random_number}_{self.url}"
        if author_names:
            self.authors = []
            for i in range(0, len(author_names)):
                self.authors.append({"id": author_ids[i], "name": author_names[i]})

    def to_json(self):
        return json.dumps(asdict(self))

Check it:

>>> m = Metadata(
... title="Some Article",
... url="https://example.com/article.html",
... created_at="2018-09-23",
... author_names=["Furkan Kalkan", "John Doe"],
... author_ids=["1", "2"]
... )
>>> print(m.to_json())
{"title": "Some Article", "url": "https://example.com/article.html", "created_at": "2018-09-23", "authors": [{"id": "1", "name": "Furkan Kalkan"}, {"id": "2", "name": "John Doe"}], "post_id": "466969_https://example.com/article.html"}

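To check that the dump is valid JSON, it can be parsed back with `json.loads`. The sketch below uses a reduced class (hypothetical, without the calculated and pseudo fields) and also shows that `asdict()` copies nested lists and dicts instead of sharing them with the object.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Metadata:
    title: str
    url: str
    authors: list = None

    def to_json(self):
        return json.dumps(asdict(self))


m = Metadata(
    title="Some Article",
    url="https://example.com/article.html",
    authors=[{"id": "1", "name": "Furkan Kalkan"}],
)

round_trip = json.loads(m.to_json())
print(round_trip == asdict(m))  # True

# asdict() copies containers, so editing the dict leaves the object alone.
as_dict = asdict(m)
as_dict["authors"][0]["name"] = "changed"
print(m.authors[0]["name"])  # Furkan Kalkan
```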

Part 6: Remove unnecessary fields from JSON.

We want to remove the fields whose value is None from the JSON, except the url field. It only takes a small change:


def to_json(self):
    metadata = asdict(self)
    # Drop every None-valued field except "url"
    for key in list(metadata):
        if key != "url" and metadata[key] is None:
            del metadata[key]
    return json.dumps(metadata)
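For reference, here is how the filter behaves on a reduced class. This is a sketch, not the full Metadata class: it keeps only three fields, gives `url` a `None` default so the exception is visible, and uses a dict comprehension as an equivalent alternative to the `del` loop.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Metadata:
    title: str
    url: str = None
    created_at: str = None

    def to_json(self):
        # Keep "url" unconditionally; drop every other field whose value is None.
        metadata = {
            key: value
            for key, value in asdict(self).items()
            if key == "url" or value is not None
        }
        return json.dumps(metadata)


dumped = Metadata(title="Some Article").to_json()
print(dumped)  # {"title": "Some Article", "url": null}
```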

Top comments (2)

Gregory Wendel

Hello - thanks for the helpful article and code examples. Pylint suggested I use enumerate instead of for ... . Here is the code I changed to follow the advice. I think I am getting the same response, but am curious if you see any issues with it.

Original:
for i in range(0, len(author_names)):
    self.authors.append({"id": author_ids[i], "name": author_names[i]})

My changes:
for count, _ in enumerate(author_names):
    self.authors.append({"id": author_ids[count], "name": author_names[count]})

Furkan Kalkan

enumerate() is ok.