DEV Community
Meeshkan

TypedDict vs dataclasses in Python — Epic typing BATTLE!

Mike Solomon (mikesol) ・ Originally published at meeshkan.com ・ 9 min read

We recently migrated our Meeshkan product from Python TypedDict to dataclasses. This article explains why. We'll start with a general overview of types in Python. Then, we'll walk through the difference between the two typing strategies with examples. By the end, you should have the information you need to choose the one that's the best fit for your Python project.

Types in Python

PEP 484, co-authored by Python's creator Guido van Rossum, gives a rationale for types in Python. It proposes:

A standard syntax for type annotations, opening up Python code to easier static analysis and refactoring, potential runtime type checking, and (perhaps, in some contexts) code generation utilizing type information.

For me, static analysis is the strongest benefit of types in Python.

It takes code like this:

# exp.py
def exp(a, b):
 return a ** b

exp(1, "result")

Which raises this error at runtime:

$ python exp.py
Traceback (most recent call last):
  File "./exp.py", line 4, in <module>
    exp(1, "result")
  File "./exp.py", line 2, in exp
    return a ** b
TypeError: unsupported operand type(s) for ** or pow(): 'int' and 'str'

And allows you to do this:

# exp.py
def exp(a: int, b: int) -> int:
  return a ** b

exp(1, "result")

Which raises this error at type-check time, before the code runs:

$ mypy exp.py # pip install mypy to install mypy
exp.py:4: error: Argument 2 to "exp" has incompatible type "str"; expected "int"
Found 1 error in 1 file (checked 1 source file)

Types help us catch bugs earlier and reduce the number of unit tests to maintain.

Classes and dataclasses

Python typing works for classes as well. Let's see how static typing with classes can move two errors from runtime to type-check time.

Setting up our example

The following area.py file contains a function that calculates the area of a shape using the data provided by two classes:

# area.py
class RangeX:
  left: float
  right: float

class RangeY:
  up: float
  down: float

def area(x, y):
  return (x.right - x.lefft) * (y.right- y.left)

x = RangeX(); x.left = 1; x.right = 4
y = RangeY(); y.down = -3; y.up = 6
print(area(x, y))

The first runtime error this produces is:

$ python area.py
Traceback (most recent call last):
  File "./area.py", line 14, in <module>
    print(area(x, y))
  File "./area.py", line 10, in area
    return (x.right - x.lefft) * (y.right- y.left)
AttributeError: 'RangeX' object has no attribute 'lefft'

Yikes! Bitten by a spelling mistake in the area function. Let's fix that by changing lefft to left.

We run again, and:

$ python area.py
Traceback (most recent call last):
  File "./area.py", line 14, in <module>
    print(area(x, y))
  File "./area.py", line 10, in area
    return (x.right - x.left) * (y.right- y.left)
AttributeError: 'RangeY' object has no attribute 'right'

Oh no! In the definition of area, we have used right and left for y instead of up and down. This is a common copy-and-paste error.

Let's change the area function again so that the final function reads:

def area(x, y):
  return (x.right - x.left) * (y.up - y.down)

After running our code again, we get the result of 27. This is what we would expect the area of a 9x3 rectangle to be.

Adding type definitions

Now let's see how mypy would have caught both of these errors using types, before the code runs.

We first add type definitions to the area function:

# area.py
class RangeX:
 left: float
 right: float

class RangeY:
 up: float
 down: float

def area(x: RangeX, y: RangeY) -> float:
 return (x.right - x.lefft) * (y.right - y.left)

x = RangeX(); x.left = 1; x.right = 4
y = RangeY(); y.down = -3; y.up = 6
print(area(x, y))

Then we can run our area.py file using mypy, a static type checker for Python:

$ mypy area.py
area.py:10: error: "RangeX" has no attribute "lefft"; maybe "left"?
area.py:10: error: "RangeY" has no attribute "right"
area.py:10: error: "RangeY" has no attribute "left"
Found 3 errors in 1 file (checked 1 source file)

It spots both errors we hit at runtime, plus the y.left access we never reached, before we even run our code.

Working with dataclasses

In our previous example, you'll notice that the assignment of attributes like x.left and x.right is clunky. Instead, what we'd like to do is RangeX(left = 1, right = 4). The dataclass decorator makes this possible. It takes a class and turbocharges it with a constructor and several other useful methods.

Let's take our area.py file and use the dataclass decorator.

# area.py
from dataclasses import dataclass

@dataclass # <----- check this out
class RangeX:
  left: float
  right: float

@dataclass # <----- and this
class RangeY:
  up: float
  down: float

def area(x: RangeX, y: RangeY) -> float:
  return (x.right - x.left) * (y.up - y.down)

x = RangeX(left = 1, right = 4)
y = RangeY(down = -3, up = 6)

print(area(x, y))

According to mypy, our file is now error-free:

$ mypy area.py
Success: no issues found in 1 source file

And it gives us the expected result of 27:

$ python area.py
27
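
The constructor is not the only thing the decorator adds. As a minimal sketch, here are two of the other generated methods, `__repr__` and `__eq__`, using the same RangeX class (the variable names a and b are only for illustration):

```python
from dataclasses import dataclass

@dataclass
class RangeX:
    left: float
    right: float

a = RangeX(left=1, right=4)
b = RangeX(left=1, right=4)

# __repr__ is generated from the field names and values.
print(a)        # RangeX(left=1, right=4)

# __eq__ compares field by field, so two distinct objects can be equal.
print(a == b)   # True
print(a is b)   # False
```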

class and dataclass are nice ways to represent objects as types. They suffer from several limitations, though, that TypedDict solves.

TypedDict

But first...

Brief introduction to duck typing

In the world of types, there is a notion called duck typing. Here's the idea: If an object looks like a duck and quacks like a duck, it's a duck.

For example, take the following JSON:

{
  "name": "Stacey O'Hara",
  "age": 42
}

In a language with duck typing, we would define a type with the attributes name and age. Then, any object with these attributes would correspond to the type.

In Python, classes aren't duck typed, which leads to the following situation:

# person_vs_comet.py
from dataclasses import dataclass

@dataclass
class Person:
  name: str
  age: int

@dataclass
class Comet:
  name: str
  age: int

Person(name="Haley", age=42000) == Comet(name="Haley", age=42000) # False

This example should return False. But without duck typing, JSON or dict versions of Comet and Person would be the same.

We can see this when we check our example with asdict:

from dataclasses import asdict

asdict(Person(name="Haley", age=42000)) == asdict(Comet(name="Haley", age=42000)) # True

Duck typing helps us encode classes to another format without losing information. That is, we can create a field called type that represents a "person" or a "comet".
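
One way to picture the type field is the sketch below. The encode helper and the lowercase tag values are assumptions made for illustration, not part of the Meeshkan codebase:

```python
from dataclasses import asdict, dataclass

@dataclass
class Person:
    name: str
    age: int

@dataclass
class Comet:
    name: str
    age: int

def encode(obj) -> dict:
    """Hypothetical helper: dict form of a dataclass plus a "type" tag."""
    return {"type": type(obj).__name__.lower(), **asdict(obj)}

person = encode(Person(name="Haley", age=42000))
comet = encode(Comet(name="Haley", age=42000))

print(person)           # {'type': 'person', 'name': 'Haley', 'age': 42000}
print(person == comet)  # False: the tag keeps the two apart
```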

Working with TypedDict

TypedDict brings duck typing to Python by allowing dicts to act as types.

# person_vs_comet.py
from typing import TypedDict

class Person(TypedDict):
  name: str
  age: int

class Comet(TypedDict):
  name: str
  age: int

Person(name="Haley", age=42000) == Comet(name="Haley", age=42000) # True

An extra advantage of this approach is that optional fields can simply be left out, rather than being carried around as None.

Let's imagine, for example, that we extended Person like so:

# person.py
from dataclasses import dataclass
from typing import Optional

@dataclass
class Person:
  name: str
  age: int
  car: Optional[str] = None
  bike: Optional[str] = None
  bank: Optional[str] = None
  console: Optional[str] = None

larry = Person(name="Larry", age=25, car="Kia Spectra")
print(larry)

If we print a Person, we'll see that the None values are still present:

Person(name='Larry', age=25, car='Kia Spectra', bike=None, bank=None, console=None)

This feels a bit off - it has lots of explicit None fields and gets verbose as you add more optional fields. Duck typing avoids this by only adding existing fields to an object.

So let's rewrite our person.py file to use TypedDict:

# person.py
from typing import TypedDict

class _Person(TypedDict, total=False):
  car: str
  bike: str
  bank: str
  console: str

class Person(_Person):
  name: str
  age: int

larry: Person = dict(name="Larry", age=25, car="Kia Spectra")
print(larry)

Now when we print our Person, we only see the fields that exist:

{'name': 'Larry', 'age': 25, 'car': 'Kia Spectra'}
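
It can help to see what such a value is at runtime. In this sketch, the TypedDict call returns an ordinary dict, and a field that was never set is absent rather than None. The Person here is trimmed down for illustration, and total=False makes every key optional, which is a simplification:

```python
from typing import TypedDict

class Person(TypedDict, total=False):
    name: str
    car: str
    bike: str

larry = Person(name="Larry", car="Kia Spectra")

# At runtime there is no Person object, only a plain dict.
print(type(larry) is dict)   # True

# A key that was never provided does not exist at all.
print("bike" in larry)       # False
print(larry.get("bike"))     # None (the default from dict.get)
```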

Migrating from TypedDict to dataclasses

You may have guessed by now, but generally, we prefer duck typing over classes. For this reason, we're very enthusiastic about TypedDict. That said, in Meeshkan, we migrated from TypedDict to dataclasses. The rest of this article explains why we made the move.

The two reasons we migrated from TypedDict to dataclasses are matching and validation:

  • Matching means determining an object's class when there's a union of several classes.
  • Validation means making sure that unknown data structures, like JSON, will map to a class.

Matching

Let's use the person_vs_comet.py example from earlier to see why class is better at matching in Python.

# person_vs_comet.py
from dataclasses import dataclass
from typing import Union

@dataclass
class Person:
 name: str
 age: int

@dataclass
class Comet:
 name: str
 age: int

def i_am_old(obj: Union[Person, Comet]) -> bool:
  return obj.age > 120 if isinstance(obj, Person) else obj.age > 1000000000

print(i_am_old(Person(name="Spacey", age=1000))) # True
print(i_am_old(Comet(name="Spacey", age=1000))) # False

In Python, isinstance can discriminate between union types. This is critical for most real-world programs that support several types.

In Meeshkan, we work with union types all the time in OpenAPI. For example, most object specifications can be a Schema or a Reference to a schema. All over our codebase, you'll see isinstance(r, Reference) to make this distinction.

TypedDict doesn't work with isinstance - and for good reason. Under the hood, isinstance checks the class of the Python object. That's a very fast operation. With duck typing, you'd have to inspect the whole object to see if "it's a duck." While this is fast for small objects, it is too slow for large objects like OpenAPI specifications. The isinstance pattern has sped up our code a lot.
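
A small sketch of what this looks like in practice. The Reference shape and the is_reference helper are hypothetical stand-ins, not the real OpenAPI types. Calling isinstance with a TypedDict raises TypeError, so the only option left is to inspect keys and values by hand:

```python
from typing import TypedDict

class Reference(TypedDict):
    ref: str

def is_reference(obj: dict) -> bool:
    # Structural check: look at the keys and value types by hand.
    return isinstance(obj.get("ref"), str)

candidate = {"ref": "#/components/schemas/Pet"}

try:
    isinstance(candidate, Reference)
    isinstance_worked = True
except TypeError:
    # TypedDict classes refuse instance and class checks.
    isinstance_worked = False

print(isinstance_worked)        # False
print(is_reference(candidate))  # True
```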

Validation

Most code receives input from an external source, like a file or an API. In these cases, it's important to verify that the input is usable by the program. This often requires mapping the input to an internal class. With duck typing, after the validation step, this requires a call to cast.

The problem with cast is that it allows incorrect validation code to slip through. In the following person.py example, there is an intentional mistake. It asks if isinstance(d['age'], str) even though age is an int. cast, because it's so permissive, won't catch this error:

# person.py
from typing import cast, TypedDict, Optional

class Person(TypedDict):
  name: str
  age: Optional[int]

def to_person(d: dict) -> Person:
  if ('name' in d) and isinstance(d['name'], str) and (('age' not in d) or (('age' in d) and (isinstance(d['age'], str)))):
    return cast(Person, d) # this will work at runtime even though it shouldn't
  raise ValueError('d is not a Person')
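
To see how permissive cast is, here is a minimal sketch with a deliberately wrong value (the raw dict is made up for illustration). cast returns its argument unchanged and performs no check at runtime:

```python
from typing import TypedDict, cast

class Person(TypedDict):
    name: str
    age: int

raw = {"name": "Abraham", "age": "100"}  # age is a str, not an int

# cast only informs the type checker; nothing is converted or verified.
person = cast(Person, raw)

print(person is raw)                 # True
print(type(person["age"]).__name__)  # str
```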

However, a class will only ever work with a constructor. So this will catch the error at the moment of construction:

# person.py
from typing import Optional
from dataclasses import dataclass

@dataclass
class Person:
 name: str
 age: Optional[int] = None

def to_person(d: dict) -> Person:
  if ('name' in d) and isinstance(d['name'], str) and (('age' not in d) or (('age' in d) and (isinstance(d['age'], str)))):
    # will raise a runtime error for age when age is a str
    # because it is `int` in `Person`
    return Person(**d)

  raise ValueError('d is not a Person')

The above to_person will raise an error, whereas the TypedDict version won't. This means that, when an error arises, it will happen later down the line. These types of errors are much harder to debug.

When we changed from TypedDict to dataclasses in Meeshkan, some tests started to fail. Looking them over, we realized that they never should have succeeded. Their success was due to the use of cast, whereas the class approach surfaced several bugs.

Conclusion

While we love the idea of TypedDict and duck typing, it has practical limitations in Python. This makes it a poor choice for most large-scale applications. We would recommend using TypedDict in situations where you're already using dicts. In these cases, TypedDict can add a degree of type safety without having to rewrite your code. For a new project, though, we'd recommend using dataclasses. They work better with Python's type system and lead to more resilient code.

Disagree with us? Are there any strengths or weaknesses of either approach that we're missing? Leave us a comment!

Discussion (1)

Travis Jungroth (travisjungroth) • Edited

There are a few really important things in this post that are incorrect or misleading. There seems to be a fundamental misunderstanding about duck typing and static type checking.

TypedDict didn't bring duck typing to Python. Python was already the classic example of duck typing. You even linked to a Python lesson about it that doesn't include TypedDict.

A Person and a Comet with the same attributes not being equal isn't because Python isn't duck typed (since it is). For a plain class, the default for == (the __eq__ method) is just an identity check, and you're free to define that method however you want for your classes. A dataclass already generates an __eq__ for you (eq=True is the default) that checks that all the attributes match and the classes are the same.

Duck typing is the idea that a type is just defined by its methods and attributes. Like:

def first_initial(thing):
    return thing.name[0]

That will work on a Person or a Comet, since they both have the name attribute.

Your example about cast is misleading. cast doesn't change the return value. It's a way to tell the type checker that you know something it doesn't.

And the example about switching to using the Person constructor is just wrong. Passing a str instead of an int will not cause a runtime error, unless you also have installed some extra runtime typing package. Python type hinting is static. Check for yourself by running this:

from typing import Optional
from dataclasses import dataclass

@dataclass
class Person:
    name: str
    age: Optional[int] = None

def to_person(d: dict) -> Person:
    # Fixed the parens
    if ('name' in d) and isinstance(d['name'], str) and (('age' not in d) or (('age' in d) and (isinstance(d['age'], str)))):
        return Person(**d)  # fixed the variable
    raise ValueError('d is not a Person')

to_person({'name': 'Abraham', 'age': '100'})  # Works fine, no error
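
If the goal is for construction itself to fail, the check has to be written by hand or supplied by a runtime validation package. A minimal sketch, assuming a hand-written __post_init__ (the error messages are made up for illustration):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Person:
    name: str
    age: Optional[int] = None

    def __post_init__(self) -> None:
        # Hand-written check; dataclasses do not enforce annotations themselves.
        if not isinstance(self.name, str):
            raise TypeError("name must be a str")
        if self.age is not None and not isinstance(self.age, int):
            raise TypeError("age must be an int or None")

ok = Person(name="Abraham", age=100)  # passes the checks

try:
    Person(name="Abraham", age="100")
    message = None
except TypeError as err:
    message = str(err)

print(message)  # age must be an int or None
```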

The fact that you've got isinstance(r, Reference) all over your codebase is a hint that the relationship between these types or your dependence on it is broken. If a function is able to get either, then it should be able to use either and let the instance handle the difference (polymorphism and/or duck typing).

You did end up at a better place, with dataclasses over TypedDicts. It sounds like you could also make good use of Protocols since what you seem to want is static duck typing, anyway.
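
A rough sketch of what static duck typing with a Protocol could look like, reusing Person, Comet and first_initial. The Named protocol is a name invented for this example:

```python
from dataclasses import dataclass
from typing import Protocol

class Named(Protocol):
    # Anything with a str attribute called name satisfies this protocol.
    name: str

@dataclass
class Person:
    name: str
    age: int

@dataclass
class Comet:
    name: str
    age: int

def first_initial(thing: Named) -> str:
    return thing.name[0]

# Neither class inherits from Named; a type checker matches them by shape.
print(first_initial(Person(name="Stacey", age=42)))    # S
print(first_initial(Comet(name="Haley", age=42000)))   # H
```

The classes stay ordinary dataclasses, so isinstance and the generated constructor keep working, while functions that only need name can accept both.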