DEV Community

loading...
Cover image for Making The Invalid Impossible - Choosing The Right Data Model

Making The Invalid Impossible - Choosing The Right Data Model

Frank Rosner
My professional interests are cloud and big data technologies, machine learning, and software development. I like to read source code and research papers to understand how stuff works. Pronoun: He
Updated on ・6 min read

Introduction

In my previous post I talked about complexity in software design. We discussed an example on how picking a simple data structure can make your code easier to understand and maintain.

When it comes to applications dealing with data, complexity is also hugely influenced by the amount of different states your applications can be in. The more combinations of values are allowed, the higher are the chances that some of them are invalid. Invalid state might, if you are lucky, raise an error message. If you are unlucky it might lead to undetected and/or unexpected behaviour.

In this blog post we are going to take a look at how to make invalid state impossible by choosing the right data model. The examples will be written in Scala but they can easily be ported to any other programming language.

State Complexity

Building stateless applications is a good practice. However there will always be state somewhere, even if it is only in your message queue connecting your stateless services. State is usually implemented based on your domain specific representation of the real world, i.e. as domain objects. These domain objects are typically stored in some data structure and represented as a type.

The more domain objects you have and the more complex they get, the more possible value combinations exist. If you are storing a set of customers with their name, address, and gender, you already have an almost unlimited amount of possible combinations. And here we only have three fields in one object. Think about a domain model that you have worked with in the past. How many different combinations of possible values exist? How many of them are valid? If you think about the combinations that are invalid, did you take them into account when interacting with your data? Are you sure that you caught all of the possible invalid combinations?

Managing state is not an easy task. If your domain is complex you will have a complex domain model. There is no way around this state explosion most of the time. However there is something you can do to decrease the chance of running into an invalid state: Avoiding it!

Modeling Your State

Let's look at an example. Imagine that you are a professor developing an application to store the course results of your students. Each student will have a student ID and a name. Students enrolled into the course should take the exam in the end where they will receive a grade. A grade can be anything from A to F. Each exam will have an ID. Students who participated in the course do not necessarily have to take the exam.

A first model you might come up with could be:

case class Student(
  id: String,
  name: String
)

case class CourseResult(
  id: String,
  students: Seq[Student],
  grades: Seq[String]
)
Enter fullscreen mode Exit fullscreen mode

Looks usable. We can store all the information we need. Both students taking the exam as well as their grades are stored in a list, connected by the index. The grade of the first student is the first grade in the grade sequence.

Avoiding Invalid State

Looking at the first implementation above can you already tell what might go wrong? What are possible instances of our exam result type? Let's create an exam taken by two students:

CourseResult(
  id = "2017-CS101",
  students = Seq(Student("1", "Frank"), Student("2", "Ralph")),
  grades = Seq("B", "A")
)
Enter fullscreen mode Exit fullscreen mode

Alright. What about the following example:

CourseResult(
  id = "2017-CS101",
  students = Seq(Student("1", "Frank"), Student("2", "Ralph")),
  grades = Seq("B")
)
Enter fullscreen mode Exit fullscreen mode

It appears that Ralph does not have a grade. Is this because he enrolled but did not participate in the exam? Or is it because there was a programming error and a grade got lost?

Maybe we need to add a requirement when constructing the exam result instance, throwing an exception if the size of the students and grades list differ:

case class CourseResult(
  id: String,
  students: Seq[Student],
  grades: Seq[String]
) {
  require(students.size == grades.size,
    "There need to be as many grades as there are students.")
}
Enter fullscreen mode Exit fullscreen mode

Trying again to construct an invalid instance would give the following error:

java.lang.IllegalArgumentException: requirement failed:
There need to be as many grades as there are students.
Enter fullscreen mode Exit fullscreen mode

This is nice but wouldn't it be better if trying to create the invalid state would not even compile? How could we change the implementation such that we don't need to rely on the runtime exception? A good way to do it is to have a list of student and grade pairs, or a dictionary to look up grades for each student. Choosing a dictionary also makes sure that each student can enroll into the course only once. If you used a sequence of pairs you would have to check for duplicates.

case class CourseResult(
  id: String,
  participations: Map[Student, String]
)
Enter fullscreen mode Exit fullscreen mode

Problem solved, you cannot have grades without any students or the other way around. But wait, we had the situation that some students enrolled into the course but did not take the exam. What we can do is to simply store them anyway and just set the grade to null. But who prevents you from also setting the exam ID to null. Or the student name?

This is a common problem with languages which permit null values. It adds more possible state that most of the time does not make sense in the domain. A better way to deal with optional grading is to use the Option type (also called Maybe in other languages). An Option is just a container for another value that might also be empty. I recommend to avoid using null values or switching to a programming language which does not have null.

case class CourseResult(
  id: String,
  participations: Map[Student, Option[String]]
)
Enter fullscreen mode Exit fullscreen mode

Now we can allow Ralph to not take the exam but still enroll:

CourseResult(
  id = "2017-CS101",
  participations = Map(
    Student("1", "Frank") -> Some("B"),
    Student("2", "Ralph") -> None
  )
)
Enter fullscreen mode Exit fullscreen mode

Now let's look at the grades. Grades can be from A to F. However it is anyway possible to construct a participation like this:

Student("1", "Frank") -> Some("X")
Enter fullscreen mode Exit fullscreen mode

How can we avoid that? My preferred way over adding preconditions and requirements is to make the compiler help you. As there is a finite (and reasonably small) number of possible grades, we can simply enumerate them.

sealed trait Grade
case object A extends Grade
case object B extends Grade
case object C extends Grade
case object D extends Grade
case object E extends Grade
case object F extends Grade

case class CourseResult(
  id: String,
  grades: Map[Student, Option[Grade]]
)
Enter fullscreen mode Exit fullscreen mode

Now we can write a function to calculate whether a student has passed the course using pattern matching:

def passed(g: Grade): Boolean = g match {
  case F => false
  case _ => true
}
Enter fullscreen mode Exit fullscreen mode

If we now removed the case _ => true branch (or forgot to implement it in the first place) the compiler will even warn us:

warning: match may not be exhaustive.
It would fail on the following inputs: A, B, C, D, E
       def passed(g: Grade): Boolean = g match {
Enter fullscreen mode Exit fullscreen mode

When implementing enumerations in Scala I recommend to take a look at Enumeratum, a very powerful enumeration implementation that integrates with many serialization libraries.

After a couple of changes to our initial draft we ended up with the following first final version:

case class Student(
  id: String
  name: String
)

sealed trait Grade
case object A extends Grade
case object B extends Grade
case object C extends Grade
case object D extends Grade
case object E extends Grade
case object F extends Grade

case class CourseResult(
  id: String,
  grades: Map[Student, Option[Grade]]
)
Enter fullscreen mode Exit fullscreen mode

Of course this might not be the perfect solution. If your requirements change you might also have to change your model. Changing the model is always associated with refactoring work. Depending on the usage scope you might have to touch many other services when making a change to the CourseResult definition. While in this example we only focused on changes to avoid invalid state it is of course important to also think about extensibility.

Conclusion

In this post we discussed the problem of state complexity. We saw a few examples on how adjusting only a few details like variable types can already reduce the complexity a lot. Making invalid state impossible leads to less errors at runtime because the compiler has your back.

Of course the problem we were looking at was simplified and you might already do many of the things I mentioned automatically. Nevertheless I believe that it is important to be aware of situations where your model makes invalid state possible. Even if it is only a small detail as choosing Set over List if you don't want to have duplicates.

Most of the time there are no perfect solutions. Choosing a model to represent the real world will always be an approximation. There will be trade-offs in terms of usability, maintainability, and extensibility. It is important to find the right balance based on your functional and non-functional requirements.

Now I am curious about your opinion. Which implementation would you choose if you were the professor? Would you implement it yourself or give it to a working student ;)? Let me know in the comments!

Discussion (8)

Collapse
maphengg profile image
maphengg

This is fine for synchronous method calls but will fail in distributed systems where asynchronous message passing means no service can determine whether or not some other service has updated a value (assuming we're not using update in place to a database record, for instance). For this particular problem, you need conflict-free replicated data types (CRDT).

If you had multiple instructors submitting grades for the same student (say in a class with Teaching Assistants), and you were the Dean facing a couple of anxious parents out in the hallway who are concerned that their child is going to fail out and cost them 10s of thousands of dollars, you would need to be able to tell that any particular update from an instructor was authoritative, or at least your data type would need to record any discrepancy so that a human (or Dean) could arbitrate the definitive value, or the system would need to implement a strategy for determining which conflicting write was the definitive one.

Collapse
gerrrr profile image
Aleksandr Sorokoumov • Edited

Hey @maphengg ,

Even in the presence of concurrent record modification, the last data model @frosnerd proposed would not lead to an inconsistent state.
For example, there is no sequence of transactions that would lead to CourseResult containing a different number of students and records.

Could you please elaborate how would CRDTs help in the situation you describe? It is not clear to me how different grades for a single student can be converged to a conflict-free state.

Collapse
maphengg profile image
maphengg • Edited

If you have multiple instances of the service and multiple users hitting different instances of the service at the same time in different data centers, you will have conflict. CRDTs are designed for this very situation. When I found this article I thought it would be about event sourcing, version vectors, or something related to that topic. All I wanted to do was point out that there's a narrow band of applications that this solution is appropriate for.

Thread Thread
frosnerd profile image
Frank Rosner Author

Thanks for sharing! I think there is a misconception here. What I presented is not a solution. I was discussing ways of avoiding invalid state by choosing the right data model. It has nothing to do with distributed systems, multiple data centers, etc.

Sorry that you didn't find what you expected. Would you like to share some more information about event sourcing, version vectors, etc.? I am very interested :)

Despite the fact that my post did not present a solution, I disagree that there is a narrow band of applications where making invalid state impossible is appropriate. In fact I believe that choosing a data model where you cannot even represent invalid state is appropriate in every application out there. May it be distributed or not.

Thread Thread
maphengg profile image
maphengg

The narrow band for this solution is a single-threaded application. That's totally fine; I just thought I would point out that it applies to that set of applications. What I expected to get from the title has to do with my own interests. I misread. Case closed.

Collapse
frosnerd profile image
Frank Rosner Author

Thanks for the comment! I agree with you that the solution is certainly not meant to work in a distributed system. The focus was more about how you can reduce the amount of possible state, especially by avoiding invalid state. I believe that this is true for both local and distributed systems.

In order to avoid conflicts with concurrent access you can use locking. In this case I believe that optimistic locking would be a good fit as it is probably not so likely to have a conflict when updating. In case you do, you'd have to retry.

If you need to synchronize replicas of this data structure you can store the data in a distributed data store that has mechanism to find consensus. Raft, e.g., is a very simple consensus algorithm that would allow you to have many replicas of the data but still be consistent.

Did I understand what you were refering to? How would you implement the solution using CRDTs?

Collapse
maphengg profile image
maphengg

Locking is fine in a single instance of an application but doesn't help in a distributed system where availability is the priority over consistency of data.

Thread Thread
frosnerd profile image
Frank Rosner Author

I agree. You have to understand whether you want availability or consistency if there is a network partition. Based on your requirements you should select the coordination algorithm that fits your needs.