
Everton Tenorio


Validating and Sanitizing Data with Pydantic

Since I started programming, I've mostly used structured and procedural paradigms, as my tasks required more practical and direct solutions. When working with data extraction, I had to shift to new paradigms to write more organized code.

An example of this need came up during scraping tasks: I had to capture specific data that was initially of a type I knew how to handle, but then, suddenly, it either didn't exist or appeared as a different type during the capture.

Consequently, I had to add several if checks and try/except blocks to test whether the data was an int or a string, only to discover later that nothing had been captured at all (None, etc.). With dictionaries, I ended up saving some uninteresting "default data" in situations like:

data.get(values, 0)

Well, the confusing error messages certainly had to stop appearing.
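To make the "default data" problem concrete, here is a minimal sketch using only plain dictionaries. The records and field names are hypothetical, invented for illustration; the point is that a default value can make "no data" look like real data.

```python
# Hypothetical scraped records: the second one has no "price" key at all,
# and the third one carries the key with a None value.
records = [
    {"name": "keyboard", "price": 120},
    {"name": "mouse"},
    {"name": "monitor", "price": None},
]

# The default hides the missing key: a stored 0 now looks like a real price,
# while the explicit None still slips through untouched.
prices_with_default = [record.get("price", 0) for record in records]
print(prices_with_default)  # [120, 0, None]

# Keeping the gap visible instead of storing a fake value.
missing = [record["name"] for record in records if record.get("price") is None]
print(missing)  # ['mouse', 'monitor']
```

Note that the default only applies when the key is absent, so a key that is present but holds None still needs its own check.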

That's how dynamic Python is: a variable's type can change at any time, which is fine until you need more clarity about the types you are working with. That need led me to read about data validation, about the type hints my IDE uses to help me, and about the interesting pydantic library.

Now, in tasks like data manipulation, and with a new paradigm, I can work with objects whose types are explicitly declared, together with a library that validates those types. If something goes wrong, the more descriptive error information makes debugging easier.


Pydantic

The Pydantic documentation is always worth consulting for further questions.

As usual, we start by installing the library:

pip install pydantic

Now suppose we want to capture emails from a source where most of them look like "xxxx@xxxx.com", but some arrive as "xxxx@" or "xxxx". There is no doubt about the format a captured email should have, so we validate the email string with Pydantic:


from pydantic import BaseModel, EmailStr

class Consumer(BaseModel):
    email: EmailStr
    account_id: int

consumer = Consumer(email="teste@teste", account_id=12345)

print(consumer)

Note that EmailStr relies on an optional dependency, "email-validator", installed with: pip install "pydantic[email]". When you run the code, it fails as expected, because "teste@teste" is not a valid email address:

Traceback (most recent call last):
  ...
    consumer = Consumer(email="teste@teste", account_id=12345)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  ...: 1 validation error for Consumer
email
  value is not a valid email address: The part after the @-sign is not valid. It should have a period. [type=value_error, input_value='teste@teste', input_type=str]

Using optional dependencies to validate data is useful, and so is writing our own validations, which Pydantic supports via field_validator. Suppose account_id must be greater than zero. For any other value, we want Pydantic to report a value error. The code would then be:


from pydantic import BaseModel, EmailStr, field_validator

class Consumer(BaseModel):
    email: EmailStr
    account_id: int

    @field_validator("account_id")
    @classmethod
    def validate_account_id(cls, value):
        """Custom Field Validation"""
        if value <= 0:
            raise ValueError(f"account_id must be positive: {value}")
        return value

consumer = Consumer(email="teste@teste.com", account_id=0)

print(consumer)

$ python capture_emails.py
Traceback (most recent call last):
...
    consumer = Consumer(email="teste@teste.com", account_id=0)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

...: 1 validation error for Consumer
account_id
  Value error, account_id must be positive: 0 [type=value_error, input_value=0, input_type=int]
    For further information visit https://errors.pydantic.dev/2.8/v/value_error
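In a scraping loop, one bad record usually should not stop the whole run. The following is a minimal sketch, assuming Pydantic v2, that catches ValidationError and keeps the rejected rows for later inspection. The Account model and the rows list are hypothetical and reduced to the account_id field, so the sketch does not need the optional email dependency.

```python
from pydantic import BaseModel, ValidationError, field_validator


class Account(BaseModel):
    # Hypothetical reduced model: only the field validated above.
    account_id: int

    @field_validator("account_id")
    @classmethod
    def validate_account_id(cls, value: int) -> int:
        if value <= 0:
            raise ValueError(f"account_id must be positive: {value}")
        return value


# Hypothetical scraped rows; three of them are deliberately invalid.
rows = [{"account_id": 12345}, {"account_id": 0}, {"account_id": "abc"}, {}]

valid, rejected = [], []
for row in rows:
    try:
        valid.append(Account(**row))
    except ValidationError as exc:
        # errors() returns one dict per failing field ("type", "loc", "msg", ...).
        rejected.append((row, [error["type"] for error in exc.errors()]))

print(valid)
for row, error_types in rejected:
    print(row, error_types)
```

Each rejected row keeps its error types (a custom value error, an unparsable integer, a missing field), which is easier to act on than a generic exception.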

Now, running the code with the correct values:


from pydantic import BaseModel, EmailStr, field_validator

class Consumer(BaseModel):
    email: EmailStr
    account_id: int

    @field_validator("account_id")
    @classmethod
    def validate_account_id(cls, value):
        """Custom Field Validation"""
        if value <= 0:
            raise ValueError(f"account_id must be positive: {value}")
        return value

consumer = Consumer(email="teste@teste.com", account_id=12345)

print(consumer)

$ python capture_emails.py
email='teste@teste.com' account_id=12345

Right?!
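Scraped values often arrive as strings, so it may help to see how Pydantic treats a numeric string. This is a minimal sketch, assuming Pydantic v2 and its default (lax) mode. The ScrapedAccount model is hypothetical, and email is typed as a plain str here only to avoid the optional dependency.

```python
from pydantic import BaseModel, ValidationError


class ScrapedAccount(BaseModel):
    # Hypothetical model for illustration; email is a plain str on purpose.
    email: str
    account_id: int


raw = {"email": "teste@teste.com", "account_id": "12345"}

# Default (lax) mode converts the numeric string into an int.
account = ScrapedAccount(**raw)
print(account.model_dump())  # {'email': 'teste@teste.com', 'account_id': 12345}

# Strict mode refuses the conversion and reports a validation error instead.
strict_rejected = False
try:
    ScrapedAccount.model_validate(raw, strict=True)
except ValidationError:
    strict_rejected = True
print(strict_rejected)  # True
```

Whether lax conversion is desirable depends on the source: it is convenient for scraped text, while strict mode is the safer choice when a string in an int field signals a real problem.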

I also read something about the native "dataclasses" module, which is a bit simpler and has some similarities with Pydantic. However, dataclasses do not validate field types at runtime, so Pydantic is better for handling more complex data models that require validations. The dataclasses module ships with Python's standard library (since Python 3.7), while Pydantic is a third-party package, at least for now.
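The difference can be sketched with the standard library alone. The class names below are hypothetical mirrors of the Consumer model. A plain dataclass accepts any value regardless of its annotations, and validation has to be written by hand, for example in __post_init__.

```python
from dataclasses import dataclass


@dataclass
class ConsumerRecord:
    # Hypothetical dataclass mirroring the Consumer model above.
    email: str
    account_id: int


# Annotations are not enforced at runtime, so this does not raise.
record = ConsumerRecord(email="teste@teste", account_id="not a number")
print(record)


@dataclass
class CheckedConsumerRecord:
    email: str
    account_id: int

    def __post_init__(self):
        # With dataclasses, every check has to be written manually.
        if not isinstance(self.account_id, int) or self.account_id <= 0:
            raise ValueError(f"account_id must be a positive int: {self.account_id!r}")


try:
    CheckedConsumerRecord(email="teste@teste.com", account_id=0)
except ValueError as exc:
    print(exc)  # account_id must be a positive int: 0
```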
