Kimmo Sääskilahti

Posted on Dec 10, 2022 • Originally published at kimmosaaskilahti.fi

Tips for building a clean REST API in Django

#python #django #webdev

Two and a half years ago we started developing a software application for creating training data for ML applications. The heart of this annotation tool is a REST API built with Django. The API serves as the backend for a Vue front-end and a Python SDK.

Before starting the project, I personally did not have any experience of using Django. In this post, I'd like to share some of the lessons learned from creating and maintaining a REST API built with Django.

I highly recommend reading Tips for Building High-Quality Django Apps at Scale by DoorDash. Many of the tips below are inspired by the article and have proved to be invaluable for keeping the codebase maintainable.

If you're familiar with domain-driven design, some of the concepts below like services and repositories will sound familiar. This is no coincidence, because I'm a big fan of the book Architecture Patterns With Python. However, the terms used here are not directly related to domain-driven design. For example, the "service layer" mentioned below is a mix of the "service layer" and "domain services" discussed in the book. Similarly, the concepts of repositories are related but used a bit differently here.

Designing the API

Putting effort into thinking about the interface between the backend and clients is key to keeping the codebase maintainable and API operations reusable across clients.

The recommended background reading for this section is RESTful API design by Microsoft. The tips discussed in this section are not specific to Django.

Document your API with OpenAPI

Unless you're creating a very small API, you need to document your API. The industry standard is to use the OpenAPI specification, formerly known as Swagger.

When starting development, we searched for tools that could auto-generate the API documentation from code, similar to FastAPI. We could not find anything for Django, so we decided to start maintaining openapi.yaml in the repository by hand.

In hindsight, this turned out to be a good decision. We have many developers contributing to the codebase, with varying knowledge of API design or how the existing API is structured. Having a separate openapi.yaml allows us to have discussions about API design in pull requests before diving into technical implementation. This helps us, for example, to keep the database models decoupled from the REST API resources and keep pull requests smaller.

Always return objects

When designing what to return from the API, always return objects that can be extended.

For example, consider an API operation GET /users returning the list of users and having another endpoint GET /users/:id for getting details about a single user by user ID. The minimal payload to return from the endpoint would be

[
    "user-id-1",
    "user-id-2",
    "user-id-3"
]

This gets the job done but is impossible to extend without breaking the schema. For example, we might find our API too chatty and want to add user names to the payload. The following structure is a step in the right direction:

[
    { "id": "user-id-1", "name": "User 1"},
    { "id": "user-id-2", "name": "User 2"},
    { "id": "user-id-3", "name": "User 3"}
]

Now, we can extend our objects with more information freely without breaking existing clients.

But we can do better. What would happen if we had thousands of users and needed to add pagination? We could add pagination information to headers like in GitHub, but we want to retain the flexibility to add that information in the returned payload. In fact, that's what we do now in the API. This is possible if the returned payload contains a separate key for every entity returned:

{
    "users": [
        { "id": "user-id-1", "name": "User 1"},
        { "id": "user-id-2", "name": "User 2"},
        { "id": "user-id-3", "name": "User 3"}
    ],
    "pagination": {
        "page": 1,
        "prev": null,
        "next": "/users?page=2",
        "per_page": 3
    }
}

I have paid the price of using too strict payload formats before, having to update all clients when migrating to a more flexible format. Always keep extensibility in mind when designing.
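As a rough illustration of the envelope above, here is a minimal, framework-free sketch of a helper that wraps one page of results under its own key next to a "pagination" object. The function name paginate and its parameters are assumptions made for this example, not code from our API.

```python
import math
from typing import Any, Optional, Sequence


def paginate(items: Sequence[Any], key: str, route: str, page: int, per_page: int) -> dict:
    """Wrap one page of `items` in an extensible envelope under `key` (hypothetical helper)."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be positive")
    total_pages = max(1, math.ceil(len(items) / per_page))
    start = (page - 1) * per_page

    def link(target: int) -> Optional[str]:
        # Only emit links to pages that actually exist.
        return f"{route}?page={target}" if 1 <= target <= total_pages else None

    return {
        key: list(items[start:start + per_page]),
        "pagination": {
            "page": page,
            "prev": link(page - 1),
            "next": link(page + 1),
            "per_page": per_page,
        },
    }


users = [{"id": f"user-id-{i}", "name": f"User {i}"} for i in range(1, 7)]
first_page = paginate(users, key="users", route="/users", page=1, per_page=3)
second_page = paginate(users, key="users", route="/users", page=2, per_page=3)
```

Because every entity lives under its own key, new top-level keys can be added later without breaking clients that only read "users".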

Note that this does not apply to request payloads. For example, it's perfectly fine to use request payloads such as

{
    "user_id": "user-id-1",
    "organization_id": "organization-id-1"
}

The backend can easily query for more information if needed. It is also easier to keep the backend backward compatible than keeping clients forward compatible in case of breaking schema changes.

Keep API resources decoupled from database models

This is so important that I'll quote the best practices document mentioned above directly:

"Avoid introducing dependencies between the web API and the underlying data sources. For example, if your data is stored in a relational database, the web API doesn't need to expose each table as a collection of resources. In fact, that's probably a poor design. Instead, think of the web API as an abstraction of the database. If necessary, introduce a mapping layer between the database and the web API. That way, client applications are isolated from changes to the underlying database scheme."

For basic API resources such as User, you will have a corresponding database table users and Django model User. But keep in mind that not all API resources need to expose all four CRUD operations. Not all database models need to be exposed as API resources. Not all API resources correspond to some database table.

Separate the concerns between the API and the database. This gives you as an architect a lot of flexibility in both how you design your database and what resources you expose to the outside world.

Defining the data and service layer

Keep your models lean

When we first started developing the annotation tool, my only source of best practices for Django was Tips for Building High-Quality Django Apps at Scale. The article recommended avoiding "fat models" that include business logic inside model methods. We have followed this approach and, based on my experience, it was a very good decision.

Getting the data layer right is a difficult task. There are lots of models in the data layer and the models may be coupled in complex ways. You can keep your model code much more readable by keeping the number of model methods to the minimum. Do not mix the data layer with the service layer (discussed below).

As a practical example, here is a slightly modified example from our codebase, the Django model for AnnotationGuideline:

# models.py

import uuid

from django.db import models, transaction


class AnnotationGuideline(ModelBase):
    id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
    project = models.ForeignKey(
        Project, on_delete=models.SET_NULL, null=True, blank=False
    )
    version = models.PositiveIntegerField(null=False, blank=False)
    ui = models.ForeignKey(
        AnnotationUI, on_delete=models.SET_NULL, null=True, blank=False
    )
    text = models.CharField(max_length=1024, null=False, blank=True)

    @transaction.atomic
    def save(self, *args, **kwargs):
        # Custom save logic such as validation
        ...

    class Meta:
        db_table = "annotation_guidelines"
        indexes = [...]

Every model inherits from a custom abstract ModelBase model that adds fields such as created_at and updated_at. Models also include an id field used as the primary key. Using UUIDs for primary keys has worked very well for us.

The model includes two foreign keys, representing the project and annotation UI that the annotation guideline belongs to. We also keep an incremental version field to keep track of versions. This model also implements its own save() method to customize the saving logic. In this case, the save() method ensures that the version number is always incremented by one (code not shown).

Add a separate layer for business logic

If the business logic does not belong to models, where should it go? I recommend creating a separate module for "services". Under services, add all functions that you use to create, update or delete models in your data layer.

The pattern mentioned in the Tips for Building High-Quality Django Apps at Scale article under the section "Avoid using the ORM as the main interface to your data" is closely related.

Here's an example function used for creating new organizations:

# services.py

def create_organization(creating_user_email: str, name: str):

    if not can_create_organization(user_email=creating_user_email):
        logger.warning(f"User {creating_user_email} prevented from creating organization")
        raise Forbidden()

    user = get_user(email=creating_user_email)
    new_organization = Organization.objects.create(name=name, created_by=user)

    add_member_to_organization(added_by=None, user=user, organization=new_organization)

    add_role_binding(
        user=user,
        created_by=user,
        target_obj=new_organization,
        organization=new_organization,
        role=roles.Admin,
    )
    return new_organization

The function takes two input arguments: the e-mail of the creating user and organization name. The function then takes care of the full business logic, including: (1) checking that the user can create organizations, (2) creating the organization, (3) adding the user as a member to the organization, and (4) making the user an administrator in the organization.

We can use this function whenever we want to create new organizations. Functions like this are usually called from Django HTTP views, but they might also be called from unit tests (to set up tests, for example) or from non-HTTP "views" like Kafka consumers.
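To make the reuse concrete, here is a minimal, framework-free sketch in which the same service function is called from an HTTP-style handler and from a message consumer. Everything in it (InMemoryStore, http_view, message_consumer and the simplified create_organization) is a hypothetical stand-in for illustration. It is not our actual Django code and uses no Django or Kafka APIs.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


class Forbidden(Exception):
    """Raised when the acting user may not perform the operation."""


@dataclass
class InMemoryStore:
    """Hypothetical stand-in for the data layer (not Django)."""
    allowed_creators: List[str]
    organizations: Dict[str, str] = field(default_factory=dict)  # name -> creator e-mail


def create_organization(store: InMemoryStore, creating_user_email: str, name: str) -> str:
    """Service function: the only place that knows the business rules."""
    if creating_user_email not in store.allowed_creators:
        raise Forbidden()
    store.organizations[name] = creating_user_email
    return name


def http_view(store: InMemoryStore, user_email: str, body: str) -> Tuple[int, dict]:
    """HTTP-style entry point: parse the request, call the service, map errors to statuses."""
    payload = json.loads(body)
    try:
        name = create_organization(store, user_email, payload["name"])
    except Forbidden:
        return 403, {"error": "forbidden"}
    return 201, {"organization": {"name": name}}


def message_consumer(store: InMemoryStore, message: dict) -> None:
    """Non-HTTP entry point (for example, a queue consumer) reusing the same service."""
    create_organization(store, message["user_email"], message["name"])


store = InMemoryStore(allowed_creators=["admin@example.com"])
created = http_view(store, "admin@example.com", '{"name": "Mega Corp."}')
rejected = http_view(store, "guest@example.com", '{"name": "Acme Corp."}')
message_consumer(store, {"user_email": "admin@example.com", "name": "Acme Corp."})
```

Both entry points stay thin: they translate their own input and output formats, while the permission check lives in one place.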

Notice how the services pattern separates the concerns. If the business logic changes, we usually do not need to modify the data layer. The drawback is that it might sometimes be difficult to track where models are being managed, because these functions are outside of the models.

This pattern also helps us mentally avoid the coupling between the data layer and the user-facing entities exposed by the REST API. If we added all business logic in model methods, that would encourage a mental pattern where modifications in API entities would be mapped 1-to-1 to modifications in the data layer.

The service layer introduced in the book Architecture Patterns With Python is defined as the layer that drives the application by running a bunch of simple steps like getting data, updating the domain model and persisting the changes. The actual business logic is contained in domain services. In our case, we do not have separate domain models containing business logic, so the service layer is responsible for both the "mundane tasks" and business logic.

Writing views

In Django views, we respond to HTTP requests with HTTP responses. The request has a method like POST and targets a specific route such as /organizations. The request contains additional parameters either in a payload (typically encoded as JSON) or as query parameters. As a response, the API sends a payload typically corresponding to some entity.

Let's say that the user wants to query all organizations that they belong to. This could be implemented by operation GET /me/organizations. The response could be a list of organizations such as

{ 
    "organizations": [
        {
            "id": "f360a209-c9ac-43d3-9b9c-ad1a3cb5bd0b",
            "name": "Mega Corp."
        },
        {
            "id": "e4aa065d-9b6a-450c-ac6d-936e04f25448",
            "name": "Acme Corp."
        }
    ]
}

In this example, our view would first query the organizations to which the user belongs. These would be represented by Django models of type Organization. Then we need to serialize the models, converting each of them to an object like the ones above that includes the fields id and name. I call these objects "transport" models, because they represent the models transported between systems such as the backend and the frontend.

Note that the data model and the client-facing transport model may be closely related yet still differ in important ways. The data model could have fields such as created_by and created_at that are either never exposed to clients or are only returned in specific queries. The transport models returned from the API might have fields not present in the model directly, such as the number of members in the organization.

How do we convert the list of Django models to such "transport" objects? My recommendation is to create modules for repositories and transports.

Create a separate layer for transport objects

I recommend defining the transport models in their own module. Every model returned from the API then has a corresponding definition in this transport layer.

For the example above, we would add the following transport:

# transports.py

@dataclass(frozen=True)
class CompactOrganization:
    id: str
    name: str

This model would correspond to the "compact" organization returned as part of list queries such as above.

To implement a query asking for more detailed information about an organization (most likely implemented in an operation such as GET /organizations/:organizationId), we would add a separate model Organization that might include fields such as created_by and created_at:

# transports.py

@dataclass(frozen=True)
class Organization:
    id: str
    name: str
    created_by: CompactUser
    created_at: datetime

Dataclasses are great, because they work nicely together with Python typing and are very simple to serialize to JSON. To serialize, we would create a function serialize_dataclass like this:

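import typing
from dataclasses import asdict, is_dataclass

# ValidationError is assumed to be imported elsewhere,
# e.g. from django.core.exceptions.

def serialize_dataclass(val: typing.Any):
    # Reject non-dataclasses and dataclass types (only instances are accepted)
    if not is_dataclass(val) or isinstance(val, type):
        raise ValidationError(f"Not a dataclass, got type: {type(val)}")
    return asdict(val)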
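As a usage sketch, the standalone example below shows how asdict recurses into nested dataclasses and how a value like datetime can be encoded to JSON. The fields of CompactUser and the helper encode_extra are assumptions made for illustration, and TypeError replaces ValidationError to keep the example free of Django imports.

```python
import json
import typing
from dataclasses import asdict, dataclass, is_dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class CompactUser:
    # Field names are assumed for this example.
    id: str
    email: str


@dataclass(frozen=True)
class Organization:
    id: str
    name: str
    created_by: CompactUser
    created_at: datetime


def serialize_dataclass(val: typing.Any) -> dict:
    if not is_dataclass(val) or isinstance(val, type):
        raise TypeError(f"Not a dataclass instance, got type: {type(val)}")
    # asdict converts nested dataclasses to nested dicts recursively.
    return asdict(val)


def encode_extra(value: typing.Any) -> str:
    """Fallback for json.dumps: encode datetimes as ISO 8601 strings."""
    if isinstance(value, datetime):
        return value.isoformat()
    raise TypeError(f"Cannot serialize value of type {type(value)}")


org = Organization(
    id="org-1",
    name="Mega Corp.",
    created_by=CompactUser(id="user-id-1", email="user1@example.com"),
    created_at=datetime(2022, 12, 10, tzinfo=timezone.utc),
)
payload = serialize_dataclass(org)
encoded = json.dumps(payload, default=encode_extra)
```

In a Django view the custom fallback is usually unnecessary, because JsonResponse uses DjangoJSONEncoder by default, which already handles datetime and UUID values.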

Before we can write a view returning the list of organizations, we need to learn about repositories.

FastAPI makes it very natural to create transport models using pydantic. See the tutorial.

Repositories

In domain-driven design, repositories are an abstraction over data storage, allowing one to decouple the domain model layer from the storage layer. This way, we can keep our models independent of implementation details (like the database), similarly to the hexagonal architecture. It also makes the system more testable by hiding away the complexity of interacting with a database.

In our case, we want to abstract away the complexities of the underlying data layer from our views. For example, a view responsible for fetching the list of organizations should not need to know whether we're reading them from Django or from some other source like a NoSQL database. The view should only interact with transport objects. We therefore introduce repositories as an abstraction layer for getting data.

As an example, here's a repository for Organization objects and a static method for fetching the list of organizations by user:

# repositories.py

class Organizations:
    @staticmethod
    def _make_membership_queryset() -> QuerySet[models.OrganizationMembership]:
        return models.OrganizationMembership.objects.select_related(
            "organization"
        )

    @staticmethod
    def _make_transport(obj: models.Organization):
        return transports.CompactOrganization(
            id=str(obj.id),
            name=obj.name
        )

    @staticmethod
    def get_organizations_for_user(
        user_id: uuid.UUID
    ) -> typing.Sequence[transports.CompactOrganization]:
        queryset = Organizations._make_membership_queryset().filter(user_id=user_id)
        organizations = (obj.organization for obj in queryset)
        return [Organizations._make_transport(org) for org in organizations]

The static method get_organizations_for_user takes user_id as input argument and returns a list of transports.CompactOrganization objects. The helper method _make_membership_queryset sets up the Django queryset and uses select_related() to follow the foreign key organization in the query.

Optimizations like select_related and prefetch_related are very important for performance, because they minimize the number of database queries. Django is very good at hiding away complexity such as querying the database, so it's very important to keep the code that builds queries and the code that accesses attributes as close to each other as possible.

In the case above, it's easy to see that accessing the organization attribute of the membership object does not incur any performance penalty from extra database queries. If the "serialization" function were located in some other module, it would be hard to keep the queries and attribute access in sync.

Finally, the helper method _make_transport converts the Django models to transport objects. In this simple case, this method does not need to access any nested attributes of the model object. But if you need to access a nested attribute such as obj.created_by.email, ensure that the corresponding columns are already fetched as part of the original query.

Putting it together

Here's an example of a view used for listing a user's organizations:

from django.contrib.auth.mixins import LoginRequiredMixin
from django.http import JsonResponse
from django.views import View


class MeOrganizations(LoginRequiredMixin, View):
    def get(self, request):
        # List of `transports.CompactOrganization` objects
        organizations = repositories.Organizations.get_organizations_for_user(
            user_id=request.user.id
        )
        return JsonResponse(
            {"organizations": [serialize_dataclass(org) for org in organizations]}
        )

In this example, the view's get method calls repositories.Organizations.get_organizations_for_user, which returns a list of objects of type transports.CompactOrganization. We then serialize them and return the response to the user encoded as JSON.

Note how the view never needs to interact with any Django models. We have decoupled views from the data layer by introducing the transport layer and repositories. For a view responsible for creating, updating or deleting Django models, we would use the service layer to keep the concerns separated.
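One payoff of this decoupling is testability. The sketch below is a variation on the pattern, not our production code: it passes the repository in as an argument instead of calling static methods, so that a test can swap in an in-memory fake without a database. The names OrganizationRepository, FakeOrganizations and list_my_organizations are hypothetical.

```python
import typing
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CompactOrganization:
    id: str
    name: str


class OrganizationRepository(typing.Protocol):
    """The only interface the view logic depends on."""

    def get_organizations_for_user(
        self, user_id: str
    ) -> typing.Sequence[CompactOrganization]: ...


class FakeOrganizations:
    """In-memory repository for tests; no database involved."""

    def __init__(self, by_user: typing.Dict[str, typing.List[CompactOrganization]]):
        self._by_user = by_user

    def get_organizations_for_user(
        self, user_id: str
    ) -> typing.Sequence[CompactOrganization]:
        return self._by_user.get(user_id, [])


def list_my_organizations(repository: OrganizationRepository, user_id: str) -> dict:
    """View logic: fetch transport objects and build the response payload."""
    organizations = repository.get_organizations_for_user(user_id)
    return {"organizations": [asdict(org) for org in organizations]}


fake = FakeOrganizations(
    {"user-id-1": [CompactOrganization(id="org-1", name="Mega Corp.")]}
)
response = list_my_organizations(fake, "user-id-1")
empty_response = list_my_organizations(fake, "user-id-2")
```

A Django-backed repository with the same method could be passed in production, and the view logic would not change.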

Why not use Django REST framework?

Django REST framework is a great toolkit for building Web APIs. It is hugely popular and simplifies building REST APIs in Django, offering tooling for model serialization, registering routes and even adding support for authentication. In future projects, I would consider using it.

The main reason for not using the framework was to reduce the learning curve for me and other developers. Django itself is a huge framework with a lot to learn, and adopting another framework on top of this seemed like a risk.

We also wanted to keep maximum flexibility. We wanted to be able to customize how to implement features such as user authentication, role-based access control, and how to serve big data sets. Django REST framework probably can handle all this, but it seemed easier for us to build such custom features directly on top of vanilla Django.

Finally, it seemed that Django REST framework could encourage some bad practices such as exposing database models directly as API resources. As mentioned in the beginning of the article, we wanted to avoid falling into the trap of too tightly coupling data models to API resources.

Conclusion

That's it, let me know what you think in the comments below! Thanks for reading!

Top comments (3)

cloutierjo • Dec 10 '22 • Edited

I'll get back to reading all the details later, as they might be useful for our own project. But while I agree that using DRF makes it too easy to tie our API to our models, I'm wondering why using Django at all if its mainly for an api. Some lighter Python framework could be easier to learn and have a more flexible API design.

I'm saying that because after 3 years on a project where we went the fat model way and then backtracked to a more DDD architecture, we ask ourselves whether Django is really the right solution for us. And frankly drf and ddd are not the cleanest or at least we haven't found yet a clean way to work every bits and pieces of that collaboration.

At the same time, there is no way we are rewriting 3 years of work just to change the base framework. But for someone looking at a new project, I believe the question should be asked.

Kimmo Sääskilahti • Dec 12 '22

"I'm wondering why using Django at all if its mainly for an api. Some lighter Python framework could be easier to learn and have a more flexible API design."

I totally agree! I would definitely consider using a more modern framework like FastAPI if starting from scratch today.

Then again, I think Django is very mature, robust and altogether well-executed framework. Django ORM works like a charm. So if I needed to build a large backend application, Django would be a strong contender.

"And frankly drf and ddd are not the cleanest or at least we haven't found yet a clean way to work every bits and pieces of that collaboration."

I would love to hear more about that! I love the idea of DDD but haven't yet managed to think of a good way to apply it to practical backend development. Maybe my applications just haven't been large enough to justify the overhead?

cloutierjo • Dec 12 '22

About the DDD part, I find that to use DRF to its full potential and thus have a lean, simple and clean code base, you need to stay somewhat close to your DB model, since the serializers and even the CRUD views can be fully created in only a few lines of code. But when you use DDD or, like you suggest, decouple your API from the database, your serializers have to be fully written by hand and they look very close to the model definitions. You also have to manually write every view action, since DRF doesn't know how to build the model behind it. In the end, we end up fighting DRF to create a fully custom API.

The other issue we have with ddd and Django orm, is the complexity of our query set, but i believe we just have to revisit that part to create a better architecture. We started missing time after converting a large chunk of our business logic to ddd.

Finally, framework excluded, I must say I was reluctant to go the DDD way at first. But we had an ugly piece of business logic (the kind where you tell your client, "we are not touching that!") that needed to be completely rewritten, and we decided to give it a go. The ingredients for making the move were there: a part of the code with complex business logic that needed to be rewritten and will for sure evolve on a regular basis. It's not a very large part of the code; the legacy version was around 1000 LOC. Starting to do DDD is a challenge in itself: understanding the concepts has been far more complex for me than learning any framework, language or other code pattern. We did that rework in pair programming, had a lot of discussions, and probably rewrote our domain a few times, but in the end the whole team is quite pleased with the result. We have already had to make some changes based on feature requests, and it's now a joy to update that code. Bonus point: we forced ourselves to do it with TDD, so it's fully tested.

We used Cosmic Python (available online) as a guide to start. While I believe it's based on Flask, there is a chapter on applying it to Django which discusses mostly the cons but enables you to make it work when you're already using it.

If I had to do it again, I'd probably try to start with an even smaller chunk of business logic to convert. But you still need to have business rules; if you're only doing CRUD, it's clearly not worth it.
