DEV Community

angelamchunu

Beginner's guide to optimization in C# (Part One)

When you’re working on a legacy system, issues are bound to crop up over time. It’s an occupational hazard: different developers rotate on and off the project, each with different approaches and different priorities. Depending on the task, the efficiency of a solution is usually an afterthought, and over time this can lead to slow-running queries.

Tackling slow-running queries can be overwhelming. Where do you even start? Is it your database, is it your code, is it the network, is it a ghost in the machine? You’ve looked at your profiler and it tells you which method is the time waster, but what now? In this two-part series I aim to show how changing little things can make quite a difference.

Part One of my optimization series focuses on the code side of things: how to get the best out of LINQ by choosing the “right” data structure, understanding deferred execution and picking a loading strategy.

Data Structures

Data structures are a way of storing data in an organised manner. Each structure aims to be efficient at a particular operation, mainly sorting, searching or insertions/deletions, so choosing the right one can save a substantial amount of time. Data structures matter when using LINQ; most beginners default to IEnumerable or List, which gets the job done but isn’t necessarily optimal. Here are some examples of data structures and how to use each one for the best performance.

IEnumerable

IEnumerable is an interface implemented by most C# collections. It is best used when composing multiple or nested LINQ queries, because nothing is evaluated until the sequence is enumerated.
IEnumerables tend not to fare well with aggregate functions such as Count, Max or Average, because the number of items in the sequence isn’t known up front. A common mistake is calling the Count() method on a lazy sequence, which iterates through the entire collection to count each item; needless to say, this is where optimization goes to die. (LINQ’s Count() does take a shortcut when the underlying collection exposes its size, such as a List, but a lazily produced sequence gets no such help.)
IEnumerables also aren’t great for finding a particular value in the collection: the sequence has to be iterated item by item until the value is found, which sucks if the value is the last one.
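To illustrate the Count() trap, here is a minimal sketch. The GetStudentIds() iterator below is a made-up stand-in for any lazily produced sequence; the point is that Any() stops after the first element while Count() > 0 walks the whole thing:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    // Hypothetical lazy sequence: items are produced one at a time,
    // so LINQ has no way to know the total without enumerating.
    static IEnumerable<int> GetStudentIds()
    {
        for (int i = 0; i < 1_000_000; i++)
        {
            yield return i;
        }
    }

    static void Main()
    {
        // Bad: Count() enumerates all 1,000,000 items just to compare with 0.
        bool anySlow = GetStudentIds().Count() > 0;

        // Better: Any() returns as soon as the first item is produced.
        bool anyFast = GetStudentIds().Any();

        Console.WriteLine(anySlow == anyFast); // same answer, very different cost
    }
}
```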

Lists

Lists are one of the most common data structures: an ordered collection of items. They implement the IEnumerable interface, know the size of the collection and support indexes. An index is a key for a specific location; Lists use indexes to find the value at a particular position in the collection. For instance, if you need the 100th item in the list, it will go to that position directly and retrieve it, rather than searching through 99 items first. Providing the index of an item also makes insertions and deletions much easier.
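A quick sketch of indexed access (the numbers here are purely illustrative):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    static void Main()
    {
        // A list of 1,000 items; List<T> stores them in a contiguous array.
        var items = Enumerable.Range(1, 1000).ToList();

        // Indexed access jumps straight to position 99, no scanning: O(1).
        Console.WriteLine(items[99]); // 100 (the 100th item; indexes start at 0)

        // Removal by index is also direct to locate, though the items after it
        // still shift left one slot internally.
        items.RemoveAt(99);
        Console.WriteLine(items[99]); // 101
    }
}
```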

Lists also have their downfalls: they are unable to find a particular value without its index. Say you have a grocery list with Milk, Eggs, Wine and Broccoli, and you want to check whether you remembered to add Wine. Using a list would not be ideal, because it would iterate over the items, comparing “Wine” with “Milk” and then “Wine” with “Eggs” before finally reaching “Wine”. This isn’t the worst thing if you only have 4 items, but what happens when you own a chain of restaurants? The list will obviously hold far more than 4 items; it’s most likely going to be extremely long. Will you iterate the entire list to find one item? In C#, a common beginner’s mistake is using the Contains() method, which does exactly that; with large amounts of data this gets slow.

Dictionary

Dictionaries are optimized for finding a particular item in a collection using key/value pairs. You choose a key, which acts as your index; the key is run through a hash function to produce a hash value, and the data is stored at the location given by that hash value. They are very similar to real-life dictionaries: if you are searching for the definition of a word, you check the index for the word and it tells you where the definition is. Without the index, you would have to look through all the words until you found it. On lookup, the key is used to find the particular value, giving an O(1) average-case lookup no matter how large the data gets, compared to a List’s O(n).

For example, suppose we have a list of subjects and we want to check whether “Theology” exists in it. Instead of using the Contains() method on the List, we can convert the List to a Dictionary and then search for it, as illustrated below.

Dictionary<string, int> subjectsDictionary =
    subjectsList.ToDictionary(subject => subject.SubjectName, subject => subject.SubjectID);
bool hasTheology = subjectsDictionary.ContainsKey("Theology");

This should also be used with caution: there is a cost to converting a List to a Dictionary, since every item has to be hashed and inserted. The conversion pays off when you perform multiple lookups against the same data.
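As a rough sketch of that trade-off (the Subject record and sample data below are made up for illustration, and records require C# 9 or later), the one-off conversion cost is amortized over many lookups:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Made-up subject type matching the article's example.
record Subject(string SubjectName, int SubjectID);

class Program
{
    static void Main()
    {
        var subjectsList = new List<Subject>
        {
            new("Psychology", 1), new("Theology", 2), new("Philosophy", 3)
        };

        // Paid once: O(n) to hash and insert every item.
        var subjectsDictionary =
            subjectsList.ToDictionary(s => s.SubjectName, s => s.SubjectID);

        // Paid many times: each lookup is O(1) on average, whereas
        // subjectsList.Any(s => s.SubjectName == name) would be O(n) every time.
        string[] toCheck = { "Theology", "Alchemy", "Psychology" };
        foreach (var name in toCheck)
        {
            Console.WriteLine($"{name}: {subjectsDictionary.ContainsKey(name)}");
        }
    }
}
```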

HashSet

HashSet is an implementation of a set that uses a hash function for fast lookups. Sets match the mathematical definition: a collection of distinct objects, meaning duplicates are not allowed. HashSets work very similarly to dictionaries because they hash on lookup, but differ in that they don’t use key/value pairs; they only store the actual value. They are likewise optimal for finding a particular value.
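Revisiting the earlier grocery example as a set (purely illustrative data):

```csharp
using System;
using System.Collections.Generic;

class Program
{
    static void Main()
    {
        // The grocery list from earlier, as a set: duplicates are ignored
        // and membership checks hash straight to the value.
        var groceries = new HashSet<string> { "Milk", "Eggs", "Wine", "Broccoli" };

        groceries.Add("Milk"); // no effect: sets keep items distinct

        Console.WriteLine(groceries.Count);            // still 4
        Console.WriteLine(groceries.Contains("Wine")); // O(1) on average: True
    }
}
```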

Deferred Execution

The idea of deferred execution is that a LINQ query only runs when it is enumerated, so you should only enumerate once the query is filtered down to exactly what you need. Materializing an IEnumerable to a List (calling ToList) too soon can add extra time to your query: ToList brings the entire result set into memory, and if it hasn’t been filtered properly, unnecessary data is enumerated. Basically, wait until the last possible moment before you enumerate your results.

Say, for example, we have two classes, Student and Subject. We want to use LINQ to find the names of students over 25 that are studying Psychology. One possible way of doing it is the example below:

var psycStudents = students.Where(s => s.SubjectID == 1).ToList();
var psycStudentsOverTwentyFive = psycStudents.Where(s => s.Age > 25).Select(s => s.StudentName);

The above code would work, but the problem is that it enumerates before it has finished filtering. It brings all students studying Psychology (SubjectID == 1) into memory, which is unnecessary because we only want a subset of them. A better way of writing this query would be:

var psycStudentsOverTwentyFive = students.Where(s => s.SubjectID == 1 && s.Age > 25).Select(s => s.StudentName);

This way LINQ only returns the data we need, which saves time, especially on a large dataset. It also helps when you later iterate the results, since it ensures that only the items you need are iterated over.
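Deferred execution itself can be seen directly. In this sketch (the Student record and sample data are invented, and the Console.WriteLine inside the predicate exists only to make the timing visible), defining the query does no work; the Where/Select pipeline only runs when the sequence is enumerated:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Made-up student type for illustration.
record Student(string StudentName, int Age, int SubjectID);

class Program
{
    static void Main()
    {
        var students = new List<Student>
        {
            new("Ann", 28, 1), new("Ben", 22, 1), new("Cat", 30, 2)
        };

        // Defining the query executes nothing yet.
        var query = students
            .Where(s =>
            {
                Console.WriteLine($"checking {s.StudentName}");
                return s.SubjectID == 1 && s.Age > 25;
            })
            .Select(s => s.StudentName);

        Console.WriteLine("query defined, nothing checked yet");

        // Enumeration is what triggers the pipeline, one item at a time.
        foreach (var name in query)
        {
            Console.WriteLine($"match: {name}");
        }
    }
}
```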

Loading Strategy

Entity Framework (EF) has three ways of handling the loading of related data: lazy, eager and explicit loading. In this section I will briefly discuss what each strategy means and how to use each one to help optimise your code. The loading strategy is configured at the DbContext (the intermediary class between your database and your domain classes) level; for the examples, let’s call the context VarsityContextEntities.

Lazy Loading

This is the most common of the three, and the name gives some context into how it loads data: a lazy person does the bare minimum, and the same applies to data retrieval here. When making a query, only the requested entity is returned and nothing more; EF retrieves related data as it is needed in the code, so there is no need to load all entities upfront. Lazy loading becomes helpful in one-to-many relationships, where one entity is linked to a list of another type of entity. In the Student and Subject example, each student can have many subjects. If we query for all the students with lazy loading enabled, we only get the Student entities and not their Subjects, which can save some time. An example in code would be as follows:

var students = context.Students.ToList();

The issue with lazy loading is that it can lead to the N+1 query problem: when a query is made for the parent entity (Student) and the child collection (Subjects) is then accessed for each parent, one extra query is made per parent. So instead of one call retrieving everything, N+1 queries are made (1 for the initial Student query and N for retrieving the Subjects of each Student). All those round trips to the database aren’t great for performance.

Eager Loading

With eager loading, a query to the database returns the requested entity along with its related entities. For example, suppose the Student entity is linked to Subjects and a Faculty: with eager loading enabled, a query for students retrieves the Subjects and Faculty entities as well. This is useful when the related entities are needed anyway, because it reduces the number of calls made to the database. Eager loading is implemented with the Include() method; here’s an example of a Student being loaded with their respective Subjects and Faculty:

using (var context = new VarsityContextEntities())
{
    var students = context.Students.Include("Subject.Faculty")
                    .FirstOrDefault<Student>();
}

Eager loading is also a solution to the N+1 problem, because all the entities you need are loaded in a single query.

Explicit Loading

Explicit loading is similar to lazy loading in that the initial query only returns the requested data, but you control exactly when each related entity is loaded: you tell EF when to load which entity. For example, say we want the students, and only later in the scope do we need their Subjects; instead of loading everything upfront or triggering implicit lazy loads, we load the Subjects deliberately at that point, in one query for the collection. Explicit loading is achieved with the Load() method; using the above example, this is how it looks in code:

using (var context = new VarsityContextEntities())
{
    var student = context.Students.FirstOrDefault<Student>();
    context.Entry(student).Collection(s => s.Subject).Load();
}

C# provides beginners with quite a few optimisation traps. You just press dot (.) and are instantly bombarded with a list of options to choose from. It is very easy to pick the one that sounds closest to what you want without being cognisant of the ramifications. Hopefully this post makes you a bit more aware and intentional in your coding.

Discussion (6)

Davyd McColl

Good article; some thoughts, if I may:

  1. In addition to delaying actualising the collection as late as possible, also try to do all filtering (.Where) before all transforms (.Select) where possible. Flipping them will create more expensive queries which retrieve more data into memory on the server before filtering the set down.

  2. When you really need to actualise, prefer .ToArray over .ToList. Arrays are lighter because they have less functionality, being of fixed size. Your code then also communicates to other parts of the code that the collection is not modifiable, unlike lists, which invite consumers to add or remove items from the list. This pattern is a side-effect of one of the tenets of functional programming: preferring immutability wherever possible.

  3. Get to grips with .Aggregate. Correctly-structured .Aggregate calls can be run in SQL at the server-side, saving the amount of data to be brought across the wire (much like how good filtering - .Where - also helps). It's a balancing act though -- if you're getting information for a desktop app, it's often better to rather retrieve more data from a local datastore (eg SQL server on the network) than get the SQL server to do all the work. The best advice I can give here is to profile different approaches and find the one with the optimal outcome.

  4. Beware refactoring within your LINQ queries -- whilst you may be able to pull out methods which work in pure LINQ, you'll find that they won't work when translated to SQL because the translation will literally contain the names of your refactored methods.

  5. Please do test your LINQ-to-SQL queries against the actual data source you're going to use. Don't mock them (eg NSubstitute). Don't run them against a different engine (eg sqlite) for testing. There are differences in what LINQ code can be converted to SQL for different targets. For example: SQLCE does not support the LOWER function, so having LINQ-to-SQL which uses .ToLower won't work against SQLCE, but will work against SQL server. If you want to test against a local variant of SQL server, LocalDb is exactly the same engine, so use that.

Shameless plug: I have a bunch of nuget packages for providing temporary databases, specifically for testing. They implement the IDisposable interface, so you can, for example:

using (var db = new TempDBLocalDb())
{
  // in this block, you may attach your EF context to `db` and
  // perform queries. You will have to run any migrations necessary
  // to set up the required schema, if you're using a migration
  // strategy like EF migrations or FluentMigrator. If you're not,
  // you can always run in a schema script (or pass one or more
  // schema scripts to the constructor above)
}
angelamchunu Author

Thank you so much for your insight. I'll definitely take these into consideration going forward.

James Eastham

I've only recently discovered the pitfalls of all the different types of enumerators. I used to just use List wherever I went and was happy with that.

So many options... so much to remember :(

Fantastic post though!

angelamchunu Author

Thank you :)

I'm glad I wasn't the only one .

SotS

Extremely helpful article! This helped me speed up an older application of ours by a LOT!

angelamchunu Author

I'm so happy to hear that :)