Krste Šižgorić

Posted on May 11, 2024 • Originally published at krste-sizgoric.Medium on May 11, 2024

Assumptions in Software Development (with EF Core as example)

#softwareengineering #csharp #efcore #softwaredevelopment

Assumptions are the mother of all f**k ups. This is a beautiful and unbelievably true statement.


As software developers, we create (or, if you prefer, develop) software. That software is just a set of instructions that executes a specific task. Of course, it is a little more complex than that, but at its core, that is basically it.

We write one instruction followed by another, followed by yet another, until we eventually reach the desired result. As our software grows more complex, the number of instructions grows. To reduce that number and avoid reinventing the wheel every time we need some functionality, we delegate some of those instructions to frameworks, tools, or at least language features.

This is where problems can arise. Our software relies on us to know what that delegated set of instructions is doing. We could argue that the API we are using should be intuitive enough for us to understand what will happen, and that is true. A lot of learning materials focus on this topic. The whole principle of “a method should not have side effects” is built upon it.

When we buy a car, we do not read instructions on how to drive it. We sit behind the wheel, turn the car on, and drive it. We might spend some time checking where the reverse is in that specific model, but basically, every car is intuitive enough for us to be able to use it straight away.

But sometimes the behavior hidden behind a method simply cannot be expressed with a method name. Some tools require us to know what we are doing to operate them correctly. An airplane is more complex than a car because it has more features and needs to handle more complex scenarios. In software development we often assume that we know how to fly a plane because we know how to drive a car. And then we get angry because we crashed that plane.

EF Core — misunderstood and misused tool

One of those tools that is often used, but rarely truly understood, is Entity Framework Core (EF Core). And with good reason. EF Core is huge, and has a lot of conventions and specific functionality. Knowing all of it is often not needed, because most of the default behaviors are the best choice in the majority of cases. So developers do not feel the need to know what is going on under the hood.

Many developers assume how it works based on the API that is exposed. Or based on the produced effect of action that they tested or used before. With those conclusions as base, whole software systems are built. Often that is good enough.

But EF Core is much more than some popular ORM. It has many good features that could improve our code. It could reduce so much manual work or make our software more maintainable and decoupled. It is based on the Unit of work pattern and has a change tracker, which is a great foundation for DDD.

Based on which way we choose to go, it can scaffold entities from database tables (database first), or the other way around, create migrations based on changes in our entities (code first). And all of this without (or with minimal) configuration.

And one of the most important features, without which using EF Core would not make sense, is the ability to translate LINQ to SQL. If we want, we can write type-safe queries that look almost like SQL.

But all of those could be concluded from the use of EF Core. And that is not the purpose of this article. I would like to cover some small things that are often overlooked but could make a difference.

Configuration — Table names

EF Core uses a lot of conventions. Let’s presume we are using code first, and we generate our migrations from our entities. We probably assume that our table names are generated from entity names. But that is not the case. Or at least not entirely.

If we just create our entities and configuration for them, then the name of our tables will be the same as the name of our entities. For example, if we have an entity Movie, the table in the database will be named Movie. But if we add the DbSet property to our DbContext, the name of our table will be the same as that property.

That is the reason why often our migration generates plural names for table names. Not because there is some mechanism to decide plural names of entities, but because our IDE suggested the plural name for property, and once we create that property, migration will copy it.

If we name property differently, for example Films, the table in the database will be named Films, even though our entity is named Movie.

And of course, we can always set the table name explicitly, either in our configuration or with an attribute on the entity. An explicit name takes priority and “overrides” the convention set by EF Core.
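As a minimal sketch of the naming rules described here, assuming a relational provider is configured elsewhere and using hypothetical Movie and Director entities and a hypothetical MovieContext:

```csharp
using Microsoft.EntityFrameworkCore;

public class Movie
{
    public int Id { get; set; }
    public string Title { get; set; } = "";
}

public class Director
{
    public int Id { get; set; }
    public string Name { get; set; } = "";
}

// Hypothetical context; provider configuration is omitted on purpose.
public class MovieContext : DbContext
{
    // Convention: the table takes the name of the DbSet property, so "Films".
    public DbSet<Movie> Films { get; set; } = null!;

    protected override void OnModelCreating(ModelBuilder modelBuilder)
    {
        // No DbSet property exists for Director,
        // so the table takes the entity name: "Director".
        modelBuilder.Entity<Director>();

        // Explicit configuration wins over the convention.
        // Uncommenting this line names the table "Movies" instead of "Films".
        // modelBuilder.Entity<Movie>().ToTable("Movies");
    }
}
```

Generating a migration for a context like this and reading the table names in it is a quick way to confirm which rule applied.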

Configuration — Primary key generation

All entities in EF Core need to have a primary key. It could have one or multiple properties (composite key), but it needs to be unique. Once the primary key is set, EF Core convention kicks in. Non-composite primary keys of type short, int, long, or GUID will have new values generated automatically on insert.

We probably assume that this means the value will be generated in the database once the new row is inserted, but that is not necessarily true. How the primary key is generated depends on the provider. For SQL Server, a numeric primary key is automatically set up as an IDENTITY column, as expected. But for a GUID, the primary key value is generated in memory by the provider, using an algorithm similar to SQL Server's NEWSEQUENTIALID function.

We need to be aware that those GUID keys are generated sequentially, so insert behavior stays the same as with database-generated keys. This is important, because the primary key is a clustered index (unless we explicitly configure it differently), and a clustered index sorts and stores the data rows of the table based on their key values. With sequential keys, each new row is “appended” to the end of the table, which avoids fragmentation of the data.
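A small sketch can make the client-side generation visible. It assumes the SQL Server provider (or another provider that generates GUID keys in memory) and uses a hypothetical Keyword entity and KeywordContext:

```csharp
using System;
using Microsoft.EntityFrameworkCore;

public class Keyword
{
    public Guid Id { get; set; }              // primary key by convention
    public string Name { get; set; } = "";
}

// Hypothetical context; the provider is supplied through the options.
public class KeywordContext : DbContext
{
    public KeywordContext(DbContextOptions<KeywordContext> options) : base(options) { }

    public DbSet<Keyword> Keywords { get; set; } = null!;
}

public static class KeyGenerationDemo
{
    public static Guid AddKeyword(KeywordContext context, string name)
    {
        var keyword = new Keyword { Name = name };
        // At this point keyword.Id is still Guid.Empty.

        context.Keywords.Add(keyword);
        // With client-side generation the key is expected to be assigned here,
        // before SaveChanges and before any roundtrip to the database.
        return keyword.Id;
    }
}
```

If database-side generation is preferred, the key property can be configured with a default value SQL expression instead, but that is a separate design choice.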

Configuration — Global query filters

Global query filters are a great way to prevent sensitive data from leaking to users that should not have access to it. We simply create a global filter that references user-specific data (like TenantId) in the OnModelCreating method, and all queries for that user are limited by default. This is the “weapon of choice” for many who are implementing a simple multitenant application.

We might assume that this filter is built on the fly, when an instance of DbContext is created, but that is not true. Building the model is expensive, so EF Core caches it under a key, and by default that key is the type of the DbContext. Since the type of the DbContext never changes, the model is built once, and that same model is used for all instances of that DbContext. In the current version, values read from the DbContext instance are no longer baked into the filter as constants (I believe since v5), so they are evaluated against the current instance for each query, and that is why this works.
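A minimal sketch of such a tenant filter, assuming a hypothetical TenantContext that receives the tenant id through its constructor (how that id is resolved per request is left out):

```csharp
using Microsoft.EntityFrameworkCore;

public class Movie
{
    public int Id { get; set; }
    public int TenantId { get; set; }
    public string Title { get; set; } = "";
}

public class TenantContext : DbContext
{
    // Instance field: its value is read from the current context instance
    // when a query runs, even though the model itself is built only once.
    private readonly int _tenantId;

    public TenantContext(DbContextOptions<TenantContext> options, int tenantId)
        : base(options)
    {
        _tenantId = tenantId;
    }

    public DbSet<Movie> Movies { get; set; } = null!;

    protected override void OnModelCreating(ModelBuilder modelBuilder)
    {
        // Applied to every query on Movies unless IgnoreQueryFilters() is called.
        modelBuilder.Entity<Movie>()
            .HasQueryFilter(movie => movie.TenantId == _tenantId);
    }
}
```

The filter expression is the same for every instance; only the value of _tenantId differs, which is why a single cached model is enough here.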

So why should we care how configuration is built if this works for us? Well, we might have different requirements. Let’s say we need to build role based limits, and each role has a different way of limiting data access for each table.

We could add a switch statement and for each role add different filters. But it would never work that way. Configuration is built once. That means that OnModelCreating method is run only once.

For all roles the same limits would be applied. To be precise, the global query filter would be set to the role limits of the first user that initialized the DbContext. If a regular user was first one that called Web API, it would be regular user limits. If admin was first, then it would be admin limits.

But there is a way to force DbContext to generate a different configuration. This can be achieved by implementing a custom IModelCacheKeyFactory. Once implemented, we can replace the existing service with our custom one by calling the ReplaceService method on the optionsBuilder in the OnConfiguring method (OnModelCreating only receives the ModelBuilder, so the replacement cannot be done there).
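A sketch of what such a factory could look like, assuming EF Core 6 or later (where the interface has the designTime overload) and a hypothetical RoleAwareContext that exposes the current role as a property. ReplaceService lives on DbContextOptionsBuilder, so the sketch calls it from OnConfiguring:

```csharp
using Microsoft.EntityFrameworkCore;
using Microsoft.EntityFrameworkCore.Infrastructure;

public class Movie
{
    public int Id { get; set; }
    public bool IsPublic { get; set; }
}

public class RoleAwareContext : DbContext
{
    public string Role { get; }

    public RoleAwareContext(DbContextOptions<RoleAwareContext> options, string role)
        : base(options)
    {
        Role = role;
    }

    public DbSet<Movie> Movies { get; set; } = null!;

    protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
        => optionsBuilder.ReplaceService<IModelCacheKeyFactory, RoleModelCacheKeyFactory>();

    protected override void OnModelCreating(ModelBuilder modelBuilder)
    {
        // Runs once per distinct cache key, which now means once per role.
        if (Role != "admin")
        {
            modelBuilder.Entity<Movie>().HasQueryFilter(movie => movie.IsPublic);
        }
    }
}

public class RoleModelCacheKeyFactory : IModelCacheKeyFactory
{
    // The role becomes part of the cache key, so each role gets its own model.
    public object Create(DbContext context, bool designTime)
        => context is RoleAwareContext roleContext
            ? (context.GetType(), roleContext.Role, designTime)
            : (object)context.GetType();

    public object Create(DbContext context)
        => Create(context, false);
}
```

Every distinct key value keeps a full model alive in the cache, so the set of possible roles should be small and bounded.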

This will generate new configuration for each role. But we should be careful. In this case it might not be the problem, because there should not be that many roles. But in essence, we just created a potential memory leak.

If, for example, users can define their own roles, there is an unlimited number of potential configurations that could be generated. That could fill memory and crash our program, and that is the reason why we should know how configuration is built.

Fetching data — cartesian explosion

When retrieving data, we can include related entities to be returned as well. Those navigation properties can reference a single element or a list of them, depending on how the relationship between the entities is defined.

For example, one director could have multiple movies, but let’s say one movie can have only one director (bear with me). This relationship is one to many. We assume that this would be retrieved by fetching data from the Directors table, then gathering primary keys, and making another request to retrieve Movies for directors’ ids. Many other ORMs work this way, so this is a legitimate assumption. But that is not the case for EF Core. It will join tables and retrieve all data in one round trip to the database.

For example, if we want to return a list of 10 directors, we will receive 10 rows from the database. If we decide to include movies for those directors, the number of rows would grow (in this example it would be 15 rows). And by including movie genres, for each movie we would get additional rows (in this case a total of 42 rows).

Photo by Krste Šižgorić

This case is not that critical. All of this can easily be handled in memory. Entity Framework would decide if a new entity needed to be created, or related entities need to be connected to existing one.

But imagine we include 10 related entities, and each of those entities could have navigation properties that we could include. This is called cartesian explosion and it could easily become a huge performance hit.

There is already a simple solution for this. We can call AsSplitQuery() on the query, and it will behave the way we originally expected. It will be split into multiple queries, and each entity will be returned only once.

Split queries could be set as the default behavior, but one roundtrip to the database is in most cases faster than 10, 15, or however many navigation properties we are including. Now that we know how this works, we can decide which option is best for each case.
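The two loading strategies can be compared side by side. This is a sketch that assumes a relational provider, EF Core 5 or later, and hypothetical Director, Movie, and Genre entities:

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;

public class Director
{
    public int Id { get; set; }
    public string Name { get; set; } = "";
    public List<Movie> Movies { get; set; } = new();
}

public class Movie
{
    public int Id { get; set; }
    public int DirectorId { get; set; }
    public List<Genre> Genres { get; set; } = new();
}

public class Genre
{
    public int Id { get; set; }
    public List<Movie> Movies { get; set; } = new();
}

public class MovieContext : DbContext
{
    public MovieContext(DbContextOptions<MovieContext> options) : base(options) { }

    public DbSet<Director> Directors { get; set; } = null!;
}

public static class DirectorQueries
{
    // One roundtrip: a single SQL statement with JOINs, where director and
    // movie columns are repeated on every joined row.
    public static Task<List<Director>> LoadSingleQueryAsync(MovieContext context)
        => context.Directors
            .Include(director => director.Movies)
            .ThenInclude(movie => movie.Genres)
            .ToListAsync();

    // Several roundtrips: one SQL statement per included collection,
    // without the repeated parent columns.
    public static Task<List<Director>> LoadSplitQueryAsync(MovieContext context)
        => context.Directors
            .Include(director => director.Movies)
            .ThenInclude(movie => movie.Genres)
            .AsSplitQuery()
            .ToListAsync();
}
```

Both methods return the same object graph; the difference is only in how many statements are sent and how much duplicated data travels over the wire.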

Fetching data — subqueries

Writing queries sometimes requires slightly more complex logic than a single expression on IQueryable seems to support. If we were writing raw SQL, we would probably write a subquery. Developers often assume that this is impossible, but EF Core actually does support it.

As long as we have not called any method that materializes the result, like FirstOrDefault, ToList, Count, or something similar, we are golden. One IQueryable can be used as a condition inside another IQueryable. This will generate a subquery, and it is a completely legitimate query for EF Core to execute.

var movieCondition = context.Movies
    .Where(movie => movie.Ranking <= 10)
    .Select(movie => movie.Id);

var directors = await context.Directors
    .Where(director => director.Movies
        .Any(movie => movieCondition.Contains(movie.Id)))
    .ToListAsync();

But our subquery does not need to be limited only to where conditions. We could write subquery in select statements too. This gives us full flexibility of SQL with type safety of strongly typed language. But for safety we should always check what the generated query looks like, just to make sure that EF Core did not generate something that will have poor performance.

var directors = await context.Directors
    .Select(director => new 
    {
        DirectorName = director.Name,
        Keywords = context.Keywords
            .Where(keyword => keyword.Movies
                .Any(movie => movie.DirectorId == director.Id))
            .ToList()
    })
    .ToListAsync();
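One way to check the generated SQL is ToQueryString(), available since EF Core 5. The helper below is only a sketch; the QueryDebug class and the DumpSql name are hypothetical and not part of EF Core:

```csharp
using System;
using System.Linq;
using Microsoft.EntityFrameworkCore;

public static class QueryDebug
{
    // Writes the SQL that EF Core would generate for the query to the console
    // and returns the query unchanged, so the call can sit inside a LINQ chain.
    public static IQueryable<T> DumpSql<T>(this IQueryable<T> query)
    {
        Console.WriteLine(query.ToQueryString());
        return query;
    }
}
```

It can be placed just before the materializing call, for example context.Directors.Where(...).DumpSql().ToListAsync(), and removed once the query shape looks reasonable. Logging configured on the DbContext is an alternative that does not require touching the query.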

Fetching data — change tracker

EF Core relies heavily on the change tracker. And this change tracker is a special kind of beast. When an entity is retrieved from the database, by default it is tracked. But the change tracker does not track only changes on entities. It tracks relationships between entities too. If we retrieve one entity from the database, and in a later query on the same DbContext retrieve some other entity that is related to the original one, the navigation property in the original entity will be filled with the newly retrieved entities. This is called navigation fixup.

Let’s explain this through an example. If we retrieve the Director entity from the database, and do not include any navigation property, all navigation properties will be empty for that director. Let’s say it is Peter Jackson. Now, if we retrieve the list of the first 10 movies, the change tracker will do its magic.

In the list of 10 movies that are retrieved there are all three The Lord of the Rings movies. If we now take a look at the Director entity, we will see that the Movies navigation property contains 3 items. Change tracker detected relation between those entities, and connected references.

And if we take a look into the movie entity, even though we did not include the Director property when we were retrieving movies, the navigation property is filled with Peter Jackson entity.

If we decide to return this Peter Jackson entity to the user without mapping it to a DTO, our operation will fail, because JSON serialization will fail. This is a circular reference, and after X number of LoTR -> Peter Jackson -> LoTR -> Peter Jackson references the serializer will throw an exception.

Now, if we decide to fetch the next 10 movies, and one of those movies is The Hobbit, that movie will be added to navigation property on Peter Jackson’s Director entity, because Peter Jackson directed that movie.

// director.Movies.Count == 0
var director = await context.Directors
    .FirstOrDefaultAsync(x => x.Name == "Peter Jackson");

// director.Movies.Count == 3
var movies = await context.Movies.Take(10).ToListAsync();

// director.Movies.Count == 4
var movies2 = await context.Movies.Skip(10).Take(10).ToListAsync();

If we do not want our code to behave this way because it is unpredictable to us, or we want to squeeze additional performance, we should disable the change tracker for those queries. This can be achieved by calling AsNoTracking() on DbSet, and then writing the rest of our query. Or we can simply use projection.

Projection is achieved by calling the Select() method on our query. As long as the projected result does not contain entity instances, nothing is tracked. With Select, eager loading through Include becomes irrelevant too (unless the Include carries additional conditions), because we explicitly select the properties that we want to return.
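Both options can be sketched as follows, assuming hypothetical Movie and Director entities and a hypothetical MovieSummary record used as the projection target:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;

public class Director
{
    public int Id { get; set; }
    public string Name { get; set; } = "";
    public List<Movie> Movies { get; set; } = new();
}

public class Movie
{
    public int Id { get; set; }
    public string Title { get; set; } = "";
    public int DirectorId { get; set; }
    public Director Director { get; set; } = null!;
}

// Hypothetical read model; it is not an entity, so it is never tracked.
public record MovieSummary(string Title, string DirectorName);

public class MovieContext : DbContext
{
    public MovieContext(DbContextOptions<MovieContext> options) : base(options) { }

    public DbSet<Movie> Movies { get; set; } = null!;
}

public static class MovieQueries
{
    // Entities are returned, but the change tracker ignores them,
    // so no navigation fixup happens against previously loaded entities.
    public static Task<List<Movie>> ReadOnlyMoviesAsync(MovieContext context)
        => context.Movies
            .AsNoTracking()
            .OrderBy(movie => movie.Id)
            .Take(10)
            .ToListAsync();

    // Projection: only the selected columns are read, no Include is needed
    // to reach Director.Name, and nothing is tracked.
    public static Task<List<MovieSummary>> SummariesAsync(MovieContext context)
        => context.Movies
            .OrderBy(movie => movie.Id)
            .Take(10)
            .Select(movie => new MovieSummary(movie.Title, movie.Director.Name))
            .ToListAsync();
}
```

The projection variant also sidesteps the circular reference problem, because the returned objects have no navigation properties at all.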

Modifying data — change tracker

Let’s continue with the change tracker, but this time let’s talk about modifying data. As already mentioned, EF Core is using the change tracker. By default, it creates a snapshot of every entity’s property values when it is first tracked by a DbContext instance.

Those values are later compared against the current values to determine which property values have changed. All those changes will be saved to the database on the next call to SaveChanges or SaveChangesAsync. If the database provider supports it, all those changes will be applied in a transaction. Developers are often not aware of this.

For example, let’s say we have an endpoint on our Web API, and that endpoint calls SaveChanges only once. In this case SaveChanges guarantees that either all changes are saved, or everything fails. There is no need to explicitly start a transaction. And it doesn’t matter if we are saving one, or we are saving 100 entities.

But let’s go back to the fact that ALL changes on entities are saved when the SaveChanges method is called. In combination with the fixup of navigation properties, this gives us a nice way to reduce the number of lines of code.

Adding a new entity works as we would assume. Once we call Add() method, the entity is tracked by the change tracker, and on SaveChanges it will be stored in the database. If that new entity has some navigation properties filled, those entities will be stored too.

So, if we add a new director, and add new movies to Movies property, both director and movies will be saved. Nice thing about this is that we do not need to explicitly set DirectorId for each movie we are adding. By convention, EF Core presumes that DirectorId needs to be set to Id of related Director.

If this was not the case, we would be forced to save director first, retrieve new id, and only then we would be able to set DirectorId and save those new movies to the database.
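A sketch of saving a new director together with new movies in one call, assuming hypothetical Director and Movie entities with database-generated integer keys:

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;

public class Director
{
    public int Id { get; set; }
    public string Name { get; set; } = "";
    public List<Movie> Movies { get; set; } = new();
}

public class Movie
{
    public int Id { get; set; }
    public string Title { get; set; } = "";
    public int DirectorId { get; set; }   // foreign key by convention
}

public class MovieContext : DbContext
{
    public MovieContext(DbContextOptions<MovieContext> options) : base(options) { }

    public DbSet<Director> Directors { get; set; } = null!;
}

public static class DirectorCommands
{
    public static async Task<Director> AddDirectorWithMoviesAsync(MovieContext context)
    {
        var director = new Director
        {
            Name = "Peter Jackson",
            // DirectorId is intentionally not set on the movies.
            Movies =
            {
                new Movie { Title = "The Fellowship of the Ring" },
                new Movie { Title = "The Two Towers" }
            }
        };

        // Adding the director also starts tracking the movies it references.
        context.Directors.Add(director);

        // One call saves the director and both movies; EF Core fills in
        // director.Id and each movie's DirectorId as part of the save.
        await context.SaveChangesAsync();
        return director;
    }
}
```

After the call returns, the key and foreign key values on the in-memory objects are expected to match what was written to the database.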

And it even works the other way around. If we set the DirectorId property to the id value of the existing tracked Director, and add that new movie entity to the change tracker by calling Add method, that movie will be automatically added to the Director’s Movies property.

// director.Movies.Count == 0
var director = await context.Directors.FirstOrDefaultAsync();

// director.Movies.Count == 0
var newMovie = new Movie { DirectorId = director.Id }; 

// director.Movies.Count == 1
context.Movies.Add(newMovie);

Now all we need to do is to call SaveChanges method, and that entity will be saved to the database. But we do not need to be this explicit to be able to save a related entity. If we simply add a new movie to the Director, we will have the same effect.

var director = await context.Directors.FirstOrDefaultAsync();

director.Movies.Add(new Movie());

await context.SaveChangesAsync();

There is one scenario when this will not work. If we set the ID property ourselves, EF Core is not able to recognize that this is a new entity. In that case, if we do not call the Add() method explicitly, SaveChanges will throw an exception.

Considering that EF Core is saving entities based on their state in the change tracker, we can conclude that Update method on DbContext is misleading. Many developers assume that it is mandatory to call Update() method to update an entity, but it couldn’t be further from the truth.

It is even recommended that we do not do this. Since the entity is already tracked, EF Core will detect if it needs to be saved. But if we explicitly call Update() method we are marking it as changed, even though nothing changed.

var director = await context.Directors.FirstOrDefaultAsync();

// marked entity as modified
context.Directors.Update(director); 

// update row in database
await context.SaveChangesAsync();

If we have audit columns on our entities (like ModifiedAt and ModifiedBy) that are filled automatically, this will update those columns. Since no data was actually modified, the value of these columns is greatly reduced, because we lose the true information about the last change.

By not using Update method explicitly we can utilize one more nice feature of EF Core. When we are mapping our command or DTO to our entity, the change tracker compares new and old values, and if they are the same, properties won’t be marked as modified. Meaning, once we call SaveChanges, the change tracker will look at the state of entities, see that there is nothing to save, and there will be no roundtrips to the database. Simple.

When an entity has been modified, EF Core will detect which properties are modified, and update only those properties. If we explicitly use Update method, that will not be the case. All properties will be marked as modified, and the SQL update statement will contain all columns in the table (except primary key).
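The per-property detection can be observed through the entry API. This is a sketch with a hypothetical Director entity and a hypothetical RenameAsync method:

```csharp
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;

public class Director
{
    public int Id { get; set; }
    public string Name { get; set; } = "";
    public string Country { get; set; } = "";
}

public class MovieContext : DbContext
{
    public MovieContext(DbContextOptions<MovieContext> options) : base(options) { }

    public DbSet<Director> Directors { get; set; } = null!;
}

public static class DirectorUpdates
{
    // Returns true only when the new name differs from the stored one.
    public static async Task<bool> RenameAsync(MovieContext context, int id, string newName)
    {
        var director = await context.Directors.FirstAsync(d => d.Id == id);

        // Plain assignment, no call to Update().
        director.Name = newName;

        // The entry compares the current value with the snapshot taken on load.
        bool nameChanged = context.Entry(director)
            .Property(d => d.Name)
            .IsModified;

        // If nothing changed, no UPDATE is expected to be sent.
        // If the name changed, only the Name column should be in the UPDATE.
        await context.SaveChangesAsync();
        return nameChanged;
    }
}
```

Calling context.Directors.Update(director) before SaveChangesAsync would instead mark Name and Country as modified regardless of their values.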

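In some cases we can avoid explicitly calling the Remove method when deleting items, too. If we remove an item from a loaded navigation collection and the relationship is required (the foreign key is not nullable), that item will be marked as deleted, and SaveChanges will delete its row in the database. If the relationship is optional, the foreign key is set to null instead.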

// Movies must be loaded, otherwise the collection is empty
var director = await context.Directors
    .Include(d => d.Movies)
    .FirstOrDefaultAsync();

var movieToDelete = director.Movies.First();
director.Movies.Remove(movieToDelete);

await context.SaveChangesAsync();

By now we should understand some of the concepts of how the change tracker works. By understanding that EF Core is applying changes to the database based on state in the change tracker, we can do things like deleting an entity without retrieving it from the database. Something like this:

context.Movies.Remove(new Movie() { Id = 1 });

await context.SaveChangesAsync();

If there is no row with provided ID, an exception will be thrown (The database operation was expected to affect 1 row(s), but actually affected 0 row(s)), so we should be careful when and how we use this. Same goes for updates. We can create an entity, set the primary key of that entity, and attach it by calling Update method. And this is finally a valid case for using Update method on DbContext.
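A sketch of updating a row without loading it first, assuming a hypothetical Movie entity. The missing-row case surfaces as a DbUpdateConcurrencyException, which the sketch turns into a boolean result:

```csharp
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;

public class Movie
{
    public int Id { get; set; }
    public string Title { get; set; } = "";
    public int Ranking { get; set; }
}

public class MovieContext : DbContext
{
    public MovieContext(DbContextOptions<MovieContext> options) : base(options) { }

    public DbSet<Movie> Movies { get; set; } = null!;
}

public static class MovieCommands
{
    // Returns false when no row with the given id exists.
    public static async Task<bool> OverwriteAsync(
        MovieContext context, int id, string title, int ranking)
    {
        // The entity is built in memory; nothing is read from the database.
        var movie = new Movie { Id = id, Title = title, Ranking = ranking };

        // Update() attaches the entity and marks all properties as modified,
        // so every column must carry the value we want to keep.
        context.Movies.Update(movie);

        try
        {
            await context.SaveChangesAsync();
            return true;
        }
        catch (DbUpdateConcurrencyException)
        {
            // Raised when the UPDATE affected 0 rows instead of 1.
            return false;
        }
    }
}
```

Because all columns are overwritten, any property left at its default value on the in-memory entity will overwrite the stored value too, which is the main thing to watch for with this pattern.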

Finally, if we utilize all these features, we could change the way we write our code. We could load the main entity (the aggregate root), make changes to it (add new items, update existing ones, and remove items that are no longer needed in its navigation properties), and call SaveChanges. All of this will be saved to the database in a transaction, so there is no need to explicitly start one or handle rollback logic.

Base for DDD

If we now stop and take a look at what EF Core provides us (especially the change tracker), we can see a couple of ways to decouple our business logic. We could use in-process message dispatching to delegate side effects of our business logic to handlers.

If we dispatch messages right before we call SaveChanges, we could load additional entities in those handlers and make changes on them. We do not save those changes to the database, but simply leave them like that. Change tracker will keep references to those entities, so the garbage collector cannot remove them from memory.

Once all handlers are done, we can call SaveChanges, and all data will be saved. Both our business logic, and side effect logic from handlers. If any of those fails, everything will be rolled back. Simple, extendible, and decoupled. And with minimal lines of code.

Now, if we want to go a step further, we can encapsulate our navigation properties, create domain events inside the entity itself when something changes, add those events to a private list of domain events, and dispatch them inside our repository just before we call SaveChanges. This is a deferred approach that is a little bit different from the original DDD idea of domain events. But it is clean, testable, and maintainable.
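One possible shape of this deferred dispatching is sketched here. It is a variant that dispatches from an overridden SaveChangesAsync instead of a repository, and every type in it (IDomainEvent, Entity, IDomainEventDispatcher, AppDbContext) is hypothetical:

```csharp
using System.Collections.Generic;
using System.ComponentModel.DataAnnotations.Schema;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;

public interface IDomainEvent { }

// Hypothetical in-process dispatcher; handlers may load and modify
// other entities through the same DbContext instance.
public interface IDomainEventDispatcher
{
    Task DispatchAsync(IDomainEvent domainEvent, CancellationToken cancellationToken);
}

public abstract class Entity
{
    private readonly List<IDomainEvent> _domainEvents = new();

    [NotMapped]
    public IReadOnlyCollection<IDomainEvent> DomainEvents => _domainEvents;

    protected void Raise(IDomainEvent domainEvent) => _domainEvents.Add(domainEvent);

    public void ClearDomainEvents() => _domainEvents.Clear();
}

public class AppDbContext : DbContext
{
    private readonly IDomainEventDispatcher _dispatcher;

    public AppDbContext(DbContextOptions<AppDbContext> options,
                        IDomainEventDispatcher dispatcher)
        : base(options)
    {
        _dispatcher = dispatcher;
    }

    public override async Task<int> SaveChangesAsync(
        CancellationToken cancellationToken = default)
    {
        // Handlers can raise further events, so repeat until none are pending.
        while (true)
        {
            var entitiesWithEvents = ChangeTracker.Entries<Entity>()
                .Select(entry => entry.Entity)
                .Where(entity => entity.DomainEvents.Count > 0)
                .ToList();

            if (entitiesWithEvents.Count == 0)
            {
                break;
            }

            foreach (var entity in entitiesWithEvents)
            {
                var events = entity.DomainEvents.ToList();
                entity.ClearDomainEvents();

                foreach (var domainEvent in events)
                {
                    await _dispatcher.DispatchAsync(domainEvent, cancellationToken);
                }
            }
        }

        // Business changes and handler changes are saved together.
        return await base.SaveChangesAsync(cancellationToken);
    }
}
```

Because handlers run before the single save, an exception thrown by any of them means nothing has been written yet, which matches the all-or-nothing behavior described for SaveChanges.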

Conclusion

This article is overwhelming on purpose, to demonstrate how something that we take for granted could be more complex than we expected. It is not important how we misuse EF Core. We could do the same thing with AutoMapper, RabbitMQ, AWS SQS or whatever tool or service we are using. We could do the same thing with patterns or architecture. Or even language features.

The fact is that we should dedicate some time to understanding what we are using and how it works. We should not assume that we have a car just because it has wheels. Otherwise we will end up confused about why we keep scratching everything and causing damage and traffic jams, while ignoring the fact that we are driving a plane on the highway.
