The dataclass is a brand new data structure introduced in Python 3.7. @btaskaya recently wrote a great article about it. If you haven't read it yet, you can read it here.
Dataclasses have promising features for creating reusable, self-validating, and automated metadata objects. Before that, I used the dict format to create metadata objects, but copying and pasting the same object all the time is boring and violates the DRY (Don't Repeat Yourself) principle.
It was like this:
Metadata = {}
Metadata["id"] = id
Metadata["url"] = url
if something:
    Metadata["some_field"] = some_data
Metadata["media"] = {}
Metadata["media"]["id"] = media_id
...
I could have used NamedTuple or something similar instead of dict, but those have some limitations, and I really didn't have enough time to implement something fancier in the early days of the project. When I refactored the code, I realized that dataclasses suit my needs better.
In this article, I will show you how to implement fully automated metadata objects with dataclasses step by step.
Part 1: Implement metadata fields that don't need calculation
There is no problem at all in this step. It's just standard implementation.
from dataclasses import dataclass

@dataclass
class Metadata:
    title: str
    url: str
    created_at: str = None  # Fields may have a default value
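As a quick check of what the decorator generates, the sketch below instantiates the Part 1 class and relies on the auto-generated `__repr__` and `__eq__`. The sample values are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Metadata:
    title: str
    url: str
    created_at: str = None  # optional, defaults to None


# Sample values are illustrative only.
first = Metadata(title="Some Article", url="https://example.com/article.html")
second = Metadata(title="Some Article", url="https://example.com/article.html")

print(first)            # uses the generated __repr__
print(first == second)  # the generated __eq__ compares field by field
```

Two instances with the same field values compare equal, which a plain class would not give you without writing `__eq__` by hand.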
Part 2: Add some fields that need calculation and calculate them automatically.
These fields get their values only after a calculation. In our case, post_id should be a random number followed by the url.
import random
from dataclasses import dataclass, field

@dataclass
class Metadata:
    # Normal fields
    title: str
    url: str
    created_at: str = None
    # Calculated fields
    post_id: str = field(init=False)

    def __post_init__(self):
        random_number = random.randint(100000, 999999)
        self.post_id = f"{random_number}_{self.url}"
The __post_init__ function calculates our post_id field after initialization.
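One consequence of `field(init=False)` that may be worth seeing: the calculated field is left out of the generated `__init__`, so callers cannot pass it in. The following is a minimal sketch with a trimmed-down version of the class; the `rejected` flag is only there for illustration.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Metadata:
    title: str
    url: str
    post_id: str = field(init=False)  # not accepted by __init__

    def __post_init__(self):
        # Runs right after the generated __init__ has set title and url.
        self.post_id = f"{random.randint(100000, 999999)}_{self.url}"


m = Metadata(title="Some Article", url="https://example.com/article.html")
print(m.post_id.endswith("_https://example.com/article.html"))

try:
    Metadata(title="t", url="u", post_id="manual")
    rejected = False
except TypeError:
    rejected = True  # __init__ has no post_id parameter
print(rejected)
```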
Let's call it:
>>> Metadata(
... title="Some Article",
... url="https://example.com/article.html",
... created_at="2018-09-23"
... )
Metadata(title='Some Article', url='https://example.com/article.html', created_at='2018-09-23', post_id='696953_https://example.com/article.html')
Gotcha!
Part 3: Make our hands dirtier; add __post_init__-only pseudo fields
We may want to build autonomous, complex structures. For instance, if a field is annotated accordingly, the dataclass can build the whole substructure for us. In our case, we use the additional fields author_names and author_ids to construct the authors field as a list of dict. If no author information is provided for the article, the value of the authors field should be None.
import random
from dataclasses import InitVar, dataclass, field

@dataclass
class Metadata:
    # Normal fields
    title: str
    url: str
    created_at: str = None
    authors: list = None
    # Calculated fields
    post_id: str = field(init=False)
    # Non-nullable Pseudo fields
    author_names: InitVar[list]
    author_ids: InitVar[list]

    def __post_init__(self, author_names, author_ids):  # You have to pass pseudo fields as parameters.
        random_number = random.randint(100000, 999999)
        self.post_id = f"{random_number}_{self.url}"
        self.authors = []
        for i in range(0, len(author_names)):
            self.authors.append({"id": author_ids[i], "name": author_names[i]})
Let's call it:
>>> Metadata(
... title="Some Article",
... url="https://example.com/article.html",
... created_at="2018-09-23"
... )
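TypeError: non-default argument 'author_names' follows default argument
It didn't work :(
Important Note: You have to group default and non-default fields. Fields without a default value must come before fields with a default value, and the dataclass decorator raises this error as soon as the class is defined, before any call is made.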
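The ordering rule can be reproduced in isolation. The sketch below uses a hypothetical `Broken` class with just two fields and captures the error that `@dataclass` raises while the class is being defined. The exact wording of the message may differ between Python versions, so only the field name is worth relying on.

```python
from dataclasses import InitVar, dataclass

error_message = None
try:
    @dataclass
    class Broken:  # hypothetical class, only here to trigger the error
        created_at: str = None       # has a default
        author_names: InitVar[list]  # no default, declared too late
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

Moving `author_names` above `created_at` is enough to make the definition succeed.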
Try again:
import json
import random
from dataclasses import InitVar, asdict, dataclass, field

@dataclass
class Metadata:
    # Non-nullable Pseudo fields
    author_names: InitVar[list]
    author_ids: InitVar[list]
    # Normal fields
    title: str
    url: str
    created_at: str = None
    authors: list = None
    # Calculated fields
    post_id: str = field(init=False)

    def __post_init__(self, author_names, author_ids):  # You have to pass pseudo fields as parameters.
        random_number = random.randint(100000, 999999)
        self.post_id = f"{random_number}_{self.url}"
        self.authors = []
        for i in range(0, len(author_names)):
            self.authors.append({"id": author_ids[i], "name": author_names[i]})

    def to_json(self):
        return json.dumps(asdict(self))
Let's call it again:
>>> Metadata(
... title="Some Article",
... url="https://example.com/article.html",
... created_at="2018-09-23",
... author_names=["Furkan Kalkan", "John Doe"],
... author_ids=["1", "2"]
... )
Metadata(title='Some Article', url='https://example.com/article.html', created_at='2018-09-23', authors=[{'id': '1', 'name': 'Furkan Kalkan'}, {'id': '2', 'name': 'John Doe'}], post_id='692728_https://example.com/article.html')
Yeah!
But wait... Where did author_names and author_ids go?
Note: Pseudo fields declared as InitVar are only passed to __post_init__() as parameters; they are not part of the object.
>>> Metadata.author_names
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: type object 'Metadata' has no attribute 'author_names'
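To make the point concrete, here is a small sketch with a hypothetical, trimmed-down `Article` class. It checks that an `InitVar` pseudo field shows up neither in `fields()` nor in `asdict()`, and that the instance has no such attribute.

```python
from dataclasses import InitVar, asdict, dataclass, fields


@dataclass
class Article:  # hypothetical, trimmed-down class
    author_names: InitVar[list]
    title: str
    authors: list = None

    def __post_init__(self, author_names):
        # The pseudo field is only available here, as a parameter.
        self.authors = [{"name": name} for name in author_names]


article = Article(author_names=["John Doe"], title="Some Article")

print([f.name for f in fields(article)])  # InitVar pseudo fields are not listed
print(asdict(article))                    # and they are not serialized
print(hasattr(article, "author_names"))   # nor stored on the instance
```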
Part 4: We don't need to define author_names.
We can make pseudo fields optional, too.
import json
import random
from dataclasses import InitVar, asdict, dataclass, field

@dataclass
class Metadata:
    # Non-nullable Pseudo fields
    author_ids: InitVar[list]
    # Normal fields
    title: str
    url: str
    created_at: str = None
    authors: list = None
    # Nullable Pseudo fields
    author_names: InitVar[list] = field(default=None)
    # Calculated fields
    post_id: str = field(init=False)

    # Pseudo fields are passed in declaration order: author_ids first, then author_names.
    def __post_init__(self, author_ids, author_names):
        random_number = random.randint(100000, 999999)
        self.post_id = f"{random_number}_{self.url}"
        if author_names:
            self.authors = []
            for i in range(0, len(author_names)):
                self.authors.append({"id": author_ids[i], "name": author_names[i]})

    def to_json(self):
        return json.dumps(asdict(self))
Call it:
>>> Metadata(
... title="Some Article",
... url="https://example.com/article.html",
... created_at="2018-09-23",
... author_ids=["1", "2"]
... )
Metadata(title='Some Article', url='https://example.com/article.html', created_at='2018-09-23', authors=None, post_id='692728_https://example.com/article.html')
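One detail to watch when reordering pseudo fields: the generated `__init__` passes `InitVar` values to `__post_init__` positionally, in the order the pseudo fields are declared in the class, so the parameter list should follow that order. The sketch below uses a hypothetical `Example` class and a made-up `received` attribute purely to show which value lands in which parameter.

```python
from dataclasses import InitVar, dataclass


@dataclass
class Example:  # hypothetical class used only to show the ordering
    author_ids: InitVar[list]           # declared first
    title: str
    author_names: InitVar[list] = None  # declared second, optional

    def __post_init__(self, author_ids, author_names):
        # Parameters follow declaration order: ids first, then names.
        self.received = {"author_ids": author_ids, "author_names": author_names}


example = Example(title="Some Article", author_ids=["1", "2"])
print(example.received)
```

If the two parameters were swapped in the signature, `author_names` would silently receive the list of ids.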
Part 5: We need JSON.
Python objects are good, but we need to dump them as JSON to POST them to web services, MQs, etc. The dataclasses module has a built-in function, asdict(), which converts our object to a dict.
Let's write a wrapper for our object.
import json
import random
from dataclasses import InitVar, asdict, dataclass, field

@dataclass
class Metadata:
    # Non-nullable Pseudo fields
    author_names: InitVar[list]
    author_ids: InitVar[list]
    # Normal fields
    title: str
    url: str
    created_at: str = None
    authors: list = None
    # Calculated fields
    post_id: str = field(init=False)

    def __post_init__(self, author_names, author_ids):  # You have to pass pseudo fields as parameters.
        random_number = random.randint(100000, 999999)
        self.post_id = f"{random_number}_{self.url}"
        if author_names:
            self.authors = []
            for i in range(0, len(author_names)):
                self.authors.append({"id": author_ids[i], "name": author_names[i]})

    def to_json(self):
        return json.dumps(asdict(self))
Check it:
>>> m = Metadata(
... title="Some Article",
... url="https://example.com/article.html",
... created_at="2018-09-23",
... author_names=["Furkan Kalkan", "John Doe"],
... author_ids=["1", "2"]
... )
>>> m.to_json()
'{"title": "Some Article", "url": "https://example.com/article.html", "created_at": "2018-09-23", "authors": [{"id": "1", "name": "Furkan Kalkan"}, {"id": "2", "name": "John Doe"}], "post_id": "466969_https://example.com/article.html"}'
Part 6: Remove unnecessary fields from JSON.
We want to remove fields whose value is None from the JSON, except for the url field. It's possible with a small change:
    def to_json(self):
        metadata = asdict(self)
        for key in list(metadata):
            if key != "url" and metadata[key] is None:
                del metadata[key]
        return json.dumps(metadata)
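An alternative worth considering is the `dict_factory` parameter of `asdict()`, which receives the list of (name, value) pairs for each dataclass instance and returns the dict to use. The sketch below is one possible way to apply the same filtering rule; the helper name `drop_none_except_url` and the trimmed-down class are assumptions made for illustration, not part of the original code.

```python
import json
from dataclasses import asdict, dataclass


def drop_none_except_url(pairs):
    # asdict() calls the factory with a list of (name, value) tuples.
    return {key: value for key, value in pairs if key == "url" or value is not None}


@dataclass
class Metadata:  # trimmed-down version of the article's class
    title: str
    url: str = None
    created_at: str = None

    def to_json(self):
        return json.dumps(asdict(self, dict_factory=drop_none_except_url))


result = Metadata(title="Some Article").to_json()
print(result)
```

Note that the factory is applied to dataclass instances, including nested ones, but not to plain dicts stored inside a field, such as the entries of authors.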
Top comments (2)
Hello - thanks for the helpful article and code examples. Pylint suggested I use enumerate instead of for ... . Here is the code I changed to follow the advice. I think I am getting the same response, but am curious if you see any issues with it.
Original:
for i in range(0, len(author_names)):
    self.authors.append({"id": author_ids[i], "name": author_names[i]})
My changes:
for count, name in enumerate(author_names):
    self.authors.append({"id": author_ids[count], "name": name})
enumerate() is ok.
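For comparison, the same loop can also be written without any index by pairing the two lists with zip(). This is only a sketch using the sample author data from the article, written as standalone code rather than inside __post_init__.

```python
author_ids = ["1", "2"]
author_names = ["Furkan Kalkan", "John Doe"]

# zip() pairs the two lists element by element, so no index is needed.
authors = [
    {"id": author_id, "name": name}
    for author_id, name in zip(author_ids, author_names)
]
print(authors)
```

One behavioral difference to keep in mind: zip() stops at the shorter list, whereas the index-based loop raises an IndexError when author_ids has fewer entries than author_names.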