Nazli Ander

Posted on • Originally published at nander.cc

Using Pydantic as a Parsing and Data Validation Tool

Pydantic provides a BaseModel class that can be subclassed to model data as typed fields, including nested models and collections. It also supports Enum types, configurable JSON serialization, and URL parsing through types such as HttpUrl.

Of course, we need a reason to use all these features. Since every home-made project needs a storyline, I built a database use case around NASA's APIs. I will first briefly explain the use case, then walk through the concepts that I experimented with in my toy project.

Introduction

NASA has a bunch of cool, publicly available APIs. All we need to do as developers is request an API key. Then we can request data about NASA's latest innovations, the picture of the day, or space weather notifications.

For this application, I chose the space weather notifications (DONKI), because they have text and datetime fields that I could make use of in this parsing case.

I wanted to parse the requested notification data nicely with Pydantic, then insert it into a document database (MongoDB).

A raw response looks like the following:

{
    "messageType":"RBE",
    "messageID":"20201007-AL-001",
    "messageURL":"https://kauai.ccmc.gsfc.nasa.gov/DONKI/view/Alert/15920/1",
    "messageIssueTime":"2020-10-07T17:18Z",
    "messageBody":"## NASA Goddard Space Flight Center, Space Weather Research Center ( SWRC )\n## Message Type: Space Weather Notification - Radiation Belt Enhancement\n##\n## Message Issue Date: 2020-10-07T17:18:51Z\n## Message ID: 20201007-AL-001\n##\n## Disclaimer: NOAA's Space Weather Prediction Center (http://swpc.noaa.gov) is the United States Government official source for space weather forecasts. This \"Experimental Research Information\" consists of preliminary NASA research products and should be interpreted and used accordingly.\n\n\n## Summary:\n\nSignificantly elevated energetic electron fluxes in the Earth's outer radiation belt. GOES \"greater than 2.0 MeV\" integral electron flux is above 1000 pfu starting at 2020-10-07T14:05Z. \n\nThe elevated energetic electron flux levels are caused by an S-type CME with ID 2020-09-30T12:09:00-CME-001.\n\nNASA spacecraft at GEO, MEO and other orbits passing through or in the vicinity of the Earth's outer radiation belt can be impacted.\n\nActivity ID: 2020-10-07T14:05:00-RBE-001."
}

The main goal of the toy project is to take this raw data and insert it into MongoDB, with all the mandatory fields present and in the correct data formats. An example record is expected to look like the following:

{ 
    "_id" : ObjectId("5fd917b83ecc12560ee43ef1"),
    "insertion_date" : ISODate("2020-12-15T20:08:23.091Z"),
    "message_type_abbreviation" : "RBE",
    "message_url" : "https://kauai.ccmc.gsfc.nasa.gov/DONKI/view/Alert/15920/1",
    "message_body" : 
        {
            "message_type" : "Space Weather Notification - Radiation Belt Enhancement",
            "message_issue_date" : ISODate("2020-10-07T17:18:51Z"),
            "message_id" : "20201007-AL-001",
            "disclaimer" : "NOAA's Space Weather Prediction Center (http://swpc.noaa.gov) is the United States Government official source for space weather forecasts. This \"Experimental Research Information\" consists of preliminary NASA research products and should be interpreted and used accordingly.",
            "summary" : "Significantly elevated energetic electron fluxes in the Earth's outer radiation belt. GOES \"greater than 2.0 MeV\" integral electron flux is above 1000 pfu starting at 2020-10-07T14:05Z. The elevated energetic electron flux levels are caused by an S-type CME with ID 2020-09-30T12:09:00-CME-001. NASA spacecraft at GEO, MEO and other orbits passing through or in the vicinity of the Earth's outer radiation belt can be impacted. Activity ID: 2020-10-07T14:05:00-RBE-001.",
            "notes" : null
        }
}

Some Remarks About the Dataset:

  • Message Type can be only one of the following categories: FLR, SEP, CME, IPS, MPC, GST, RBE, and Report.
  • Message Body has text fields separated with the following characters: \n##.
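As a rough illustration of the second remark, the message body can be split into key/value pairs before any model sees it. The sketch below uses only the standard library and is not the project's actual parser: the function name split_message_body, the shortened sample text, and the choice to split on ## alone (so that the very first field, which has no leading newline, is also caught) are all assumptions made for this example.

```python
import re
from typing import Dict

# Shortened, hypothetical sample laid out like the raw messageBody above.
RAW_MESSAGE_BODY = (
    "## NASA Goddard Space Flight Center, Space Weather Research Center ( SWRC )\n"
    "## Message Type: Space Weather Notification - Radiation Belt Enhancement\n"
    "##\n"
    "## Message Issue Date: 2020-10-07T17:18:51Z\n"
    "## Message ID: 20201007-AL-001\n"
    "##\n"
    "## Disclaimer: NOAA's Space Weather Prediction Center is the official source.\n"
    "\n\n"
    "## Summary:\n\n"
    "Elevated electron fluxes in the outer radiation belt.\n\n"
    "Activity ID: 2020-10-07T14:05:00-RBE-001."
)


def split_message_body(raw_body: str) -> Dict[str, str]:
    """Split a message body into snake_case keys and whitespace-cleaned values."""
    fields: Dict[str, str] = {}
    for section in raw_body.split("##"):
        # Everything before the first colon is treated as the field name.
        key, separator, value = section.partition(":")
        key = key.strip()
        # Skip blank sections and header lines that carry no "Key: value" pair.
        if not separator or not key or "\n" in key:
            continue
        field_name = re.sub(r"\s+", "_", key.lower())
        # Collapse newlines and repeated spaces into single spaces.
        fields[field_name] = " ".join(value.split())
    return fields


if __name__ == "__main__":
    for name, text in split_message_body(RAW_MESSAGE_BODY).items():
        print(f"{name}: {text}")
```

The resulting keys (message_type, message_issue_date, message_id, disclaimer, summary) line up with the field names of the target record, so the dictionary can be handed to a model as-is.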

Some Nice Concepts From Pydantic

Pydantic comes with a BaseModel. With a BaseModel subclass, we can parse dictionaries into correctly typed fields, or change how a certain type is serialized when the model is transformed into JSON.

When dealing with nested data, working from the inside out tends to make our lives easier. Hence, the Message Body is a good place to start for this use case. The following example can be used for modeling the Message Body. The typing module helps with the type hints, for example to declare optional fields:

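from typing import Optional
from datetime import datetime
from pydantic import BaseModel


class NotificationMessageBody(BaseModel):
    # Required fields: validation fails if any of them is missing
    message_type: str
    message_issue_date: datetime
    message_id: str
    # Explicit None defaults keep these optional in both Pydantic v1 and v2
    disclaimer: Optional[str] = None
    summary: Optional[str] = None
    notes: Optional[str] = None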

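We can transform a NotificationMessageBody instance into JSON with the following code (.json() is the Pydantic v1 method name; Pydantic v2 deprecates it in favor of model_dump_json()):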

parsed_and_cleaned_message_body_dict = {
    "message_type": "Space Weather Notification - Radiation Belt Enhancement",
    "message_issue_date": "2020-10-07T17:18:51Z",
    "message_id": "20201007-AL-001",
    "disclaimer": "NOAA's Space Weather Prediction Center (http://swpc.noaa.gov) is the United States Government official source for space weather forecasts. This \"Experimental Research Information\" consists of preliminary NASA research products and should be interpreted and used accordingly.",
    "summary": "Significantly elevated energetic electron fluxes in the Earth's outer radiation belt. GOES \"greater than 2.0 MeV\" integral electron flux is above 1000 pfu starting at 2020-10-07T14:05Z. The elevated energetic electron flux levels are caused by an S-type CME with ID 2020-09-30T12:09:00-CME-001. NASA spacecraft at GEO, MEO and other orbits passing through or in the vicinity of the Earth's outer radiation belt can be impacted. Activity ID: 2020-10-07T14:05:00-RBE-001."
}

print(
    NotificationMessageBody(
        **parsed_and_cleaned_message_body_dict
    ).json()
)

This will print all the fields in the default format. The datetime field is first converted from a string to a Python datetime, then the default JSON transformation outputs it as an ISO 8601 datetime string. The notes field has no input in the example, so it is serialized as null:

{
    "message_type": "Space Weather Notification - Radiation Belt Enhancement",
    "message_issue_date": "2020-10-07T17:18:51+00:00",
    "message_id": "20201007-AL-001",
    "disclaimer": "NOAA's Space Weather Prediction Center (http://swpc.noaa.gov) is the United States Government official source for space weather forecasts. This \"Experimental Research Information\" consists of preliminary NASA research products and should be interpreted and used accordingly.",
    "summary": "Significantly elevated energetic electron fluxes in the Earth's outer radiation belt. GOES \"greater than 2.0 MeV\" integral electron flux is above 1000 pfu starting at 2020-10-07T14:05Z. The elevated energetic electron flux levels are caused by an S-type CME with ID 2020-09-30T12:09:00-CME-001. NASA spacecraft at GEO, MEO and other orbits passing through or in the vicinity of the Earth's outer radiation belt can be impacted. Activity ID: 2020-10-07T14:05:00-RBE-001.",
    "notes": null
}
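The same model also rejects input that does not fit. Below is a minimal sketch, assuming a hypothetical bad_message_body dictionary with a missing message_id and an unparseable issue date. The model is repeated here, with explicit None defaults, so that the snippet runs on its own.

```python
from datetime import datetime
from typing import Optional

from pydantic import BaseModel, ValidationError


class NotificationMessageBody(BaseModel):
    message_type: str
    message_issue_date: datetime
    message_id: str
    disclaimer: Optional[str] = None
    summary: Optional[str] = None
    notes: Optional[str] = None


# Hypothetical bad input: message_id is missing and the date cannot be parsed.
bad_message_body = {
    "message_type": "Space Weather Notification - Radiation Belt Enhancement",
    "message_issue_date": "not-a-date",
}

failed_fields = []
try:
    NotificationMessageBody(**bad_message_body)
except ValidationError as error:
    # errors() returns one dict per failing field; "loc" holds the field path.
    failed_fields = sorted(item["loc"][0] for item in error.errors())

print(failed_fields)  # ['message_id', 'message_issue_date']
```

Both problems are reported in a single exception, which is handy when cleaning a batch of messy records.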

Lastly, we can model the whole NotificationMessage. We can use an Enum so that only the values listed in MessageTypeAbbreviationEnum are accepted as a message type abbreviation. We can also use the HttpUrl type from Pydantic, which ensures that the URL string uses the http or https scheme. In Pydantic v1, whose API these examples use, it also breaks the URL into scheme, host, tld, host_type, and path fields:

from datetime import datetime
from pydantic import BaseModel, HttpUrl
from enum import Enum


class MessageTypeAbbreviationEnum(str, Enum):
    FLR = "FLR"
    SEP = "SEP"
    CME = "CME"
    IPS = "IPS"
    MPC = "MPC"
    GST = "GST"
    RBE = "RBE"
    Report = "Report"


class NotificationMessage(BaseModel):
    insertion_date: datetime
    message_type_abbreviation: MessageTypeAbbreviationEnum
    message_url: HttpUrl
    message_body: NotificationMessageBody

    class Config:
        json_encoders = {
            datetime: lambda v: v.strftime("%Y-%m-%d %H:%M:%S")
        }


The example contains an additional configuration for the JSON encoders. Unlike the NotificationMessageBody example, the json_encoders setting ensures that every datetime is formatted as %Y-%m-%d %H:%M:%S during the JSON transformation.

An example input:

parsed_and_cleaned_message_body_dict = {
    "message_type": "Space Weather Notification - Radiation Belt Enhancement",
    "message_issue_date": "2020-10-07T17:18:51Z",
    "message_id": "20201007-AL-001",
    "disclaimer": "NOAA's Space Weather Prediction Center (http://swpc.noaa.gov) is the United States Government official source for space weather forecasts. This \"Experimental Research Information\" consists of preliminary NASA research products and should be interpreted and used accordingly.",
    "summary": "Significantly elevated energetic electron fluxes in the Earth's outer radiation belt. GOES \"greater than 2.0 MeV\" integral electron flux is above 1000 pfu starting at 2020-10-07T14:05Z. The elevated energetic electron flux levels are caused by an S-type CME with ID 2020-09-30T12:09:00-CME-001. NASA spacecraft at GEO, MEO and other orbits passing through or in the vicinity of the Earth's outer radiation belt can be impacted. Activity ID: 2020-10-07T14:05:00-RBE-001."
}

parsed_and_cleaned_notification_message_dict = {
    "insertion_date": "2020-12-15T20:08:23.091Z",
    "message_type_abbreviation": "RBE",
    "message_url": "https://kauai.ccmc.gsfc.nasa.gov/DONKI/view/Alert/15920/1",
    "message_body": parsed_and_cleaned_message_body_dict
}

print(
    NotificationMessage(
        **parsed_and_cleaned_notification_message_dict
    ).json()
)

The example gives the following output. Both datetime fields (insertion_date and message_issue_date) follow the format given in the Config class, %Y-%m-%d %H:%M:%S, so 2020-12-15T20:08:23.091Z becomes 2020-12-15 20:08:23. All the message_body fields are parsed as the NotificationMessageBody model specifies:

{
    "insertion_date": "2020-12-15 20:08:23",
    "message_type_abbreviation": "RBE",
    "message_url": "https://kauai.ccmc.gsfc.nasa.gov/DONKI/view/Alert/15920/1",
    "message_body": 
        {
            "message_type": "Space Weather Notification - Radiation Belt Enhancement",
            "message_issue_date": "2020-10-07 17:18:51",
            "message_id": "20201007-AL-001",
            "disclaimer": "NOAA's Space Weather Prediction Center (http://swpc.noaa.gov) is the United States Government official source for space weather forecasts. This \"Experimental Research Information\" consists of preliminary NASA research products and should be interpreted and used accordingly.",
            "summary": "Significantly elevated energetic electron fluxes in the Earth's outer radiation belt. GOES \"greater than 2.0 MeV\" integral electron flux is above 1000 pfu starting at 2020-10-07T14:05Z. The elevated energetic electron flux levels are caused by an S-type CME with ID 2020-09-30T12:09:00-CME-001. NASA spacecraft at GEO, MEO and other orbits passing through or in the vicinity of the Earth's outer radiation belt can be impacted. Activity ID: 2020-10-07T14:05:00-RBE-001.",
            "notes": null
        }
}      
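It can also help to see what the Enum and HttpUrl fields refuse. The following is a small sketch rather than part of the project: NotificationLink is a hypothetical, reduced model, the Enum is shortened to two members, and failing_fields is a helper invented for this example.

```python
from enum import Enum
from typing import List

from pydantic import BaseModel, HttpUrl, ValidationError


class MessageTypeAbbreviationEnum(str, Enum):
    # Shortened list, for illustration only.
    FLR = "FLR"
    RBE = "RBE"


class NotificationLink(BaseModel):
    # Hypothetical reduced model holding only the two fields of interest.
    message_type_abbreviation: MessageTypeAbbreviationEnum
    message_url: HttpUrl


def failing_fields(**values: str) -> List[str]:
    """Return the sorted names of the fields that fail validation."""
    try:
        NotificationLink(**values)
    except ValidationError as error:
        return sorted(item["loc"][0] for item in error.errors())
    return []


VALID_URL = "https://kauai.ccmc.gsfc.nasa.gov/DONKI/view/Alert/15920/1"

# A known abbreviation and an https URL pass validation.
print(failing_fields(message_type_abbreviation="RBE", message_url=VALID_URL))
# An abbreviation outside the Enum is rejected.
print(failing_fields(message_type_abbreviation="XYZ", message_url=VALID_URL))
# A URL with a scheme other than http or https is rejected.
print(failing_fields(message_type_abbreviation="RBE", message_url="ftp://example.com/alert"))
```

The first call prints an empty list, while the other two name the offending field.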

Insertion into the DB

After parsing and validating the nested dictionaries against the NotificationMessage model, we can insert the resulting notifications into a database. A document database is a natural fit, as its records have a structure similar to JSON. The project uses MongoDB.

The popular package for MongoDB, pymongo, has a nice method for inserting many records in one batch. Unsurprisingly, it is called insert_many, and it expects a list of dictionaries. I used the dictionary transformation method from Pydantic for this purpose:

notifications = donki_parser.create_message_dictionary()
# .dict() turns each parsed model into a plain dictionary for pymongo
notifications_as_dict = [n.dict() for n in notifications]

notifications_repository = NotificationsRepository(
    host=MONGO_HOST,
    port=MONGO_PORT
)

notifications_repository.insert_many(notifications_as_dict)
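The NotificationsRepository class itself lives in the project and is not shown in this post. The sketch below is only a guess at what such a wrapper could look like. Its shape is an assumption: the collection is passed in directly (which makes it easy to try without a running database), while a from_host helper mirrors the host and port arguments used above. The database and collection names ("donki", "notifications") are made up for the example.

```python
from typing import Any, Dict, List


class NotificationsRepository:
    """Thin wrapper around a MongoDB collection (hypothetical sketch)."""

    def __init__(self, collection: Any) -> None:
        # Any object with a pymongo-style insert_many method works here.
        self._collection = collection

    @classmethod
    def from_host(
        cls,
        host: str,
        port: int,
        database: str = "donki",  # assumed name
        collection: str = "notifications",  # assumed name
    ) -> "NotificationsRepository":
        # Imported lazily so the class can be used without pymongo installed.
        from pymongo import MongoClient

        client = MongoClient(host=host, port=port)
        return cls(client[database][collection])

    def insert_many(self, notifications: List[Dict[str, Any]]) -> int:
        """Insert the notifications in one batch and return how many were stored."""
        if not notifications:
            # pymongo's insert_many rejects an empty list, so skip the call.
            return 0
        result = self._collection.insert_many(notifications)
        return len(result.inserted_ids)
```

With this shape, NotificationsRepository.from_host(MONGO_HOST, MONGO_PORT) would play the role of the constructor call shown above.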

Last Words

It is enjoyable to learn more about data parsing and validation in Python, and Pydantic is a handy tool for this purpose. The examples that I gave require quite a bit of preprocessing. Perhaps that is because of the example data API that I have chosen: the large number of text fields in the message body makes the data hard to parse. Once this puzzle was solved, Pydantic made sure that all fields were correct and ready to be inserted into a document database.

You can read more about the functionality Pydantic provides in its documentation.

For the whole project, please refer to the GitHub repository.
