Celso Jr

Posted on Apr 25, 2024 • Edited on Apr 28, 2024 • Originally published at celsojr.com

Closures: Performance implications

#csharp #lowlevel #internals #closures

Performance implications with closures capture

Originally posted on https://www.celsojr.com/post/closures-performance-implications

The default closure capture mechanism used by the compiler is often the most efficient way to handle closure capture in many scenarios. If your struct is large or you're unsure about the implications, it's generally recommended to avoid custom closure captures and instead use an imperative code style or traditional functions. Compositions with closures are not primarily intended to optimize performance, so they may not be the best choice in mission-critical, resource-saving operations.
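To make "default closure capture" concrete, here is a minimal sketch of a lambda that captures a local, next to a hand-written approximation of what the compiler generates for it. The names DefaultCaptureSketch and CounterBox are made up for illustration; the real generated class has an unspeakable name such as <>c__DisplayClass0_0, and its exact shape may differ between compiler versions.

```csharp
using System;

static class DefaultCaptureSketch
{
    // Hypothetical stand-in for the compiler-generated display class:
    // the captured local becomes a field on a heap-allocated object.
    private sealed class CounterBox
    {
        public int Count;
        public void Add(int n) => Count += n;
    }

    // Default capture: the lambda closes over the local `count`.
    public static int WithLambda()
    {
        int count = 0;
        Action<int> add = n => count += n;
        add(1);
        add(2);
        return count;
    }

    // Roughly the same thing written by hand: one object for the state,
    // one delegate pointing at an instance method of that object.
    public static int WithHandWrittenBox()
    {
        var box = new CounterBox();
        Action<int> add = box.Add;
        add(1);
        add(2);
        return box.Count;
    }

    static void Main()
    {
        Console.WriteLine(WithLambda());         // 3
        Console.WriteLine(WithHandWrittenBox()); // 3
    }
}
```

Both versions allocate an object for the captured state and a delegate, which is the cost the rest of this post tries to avoid.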

That said, you can always write your own helper class or struct to do the capturing for you. Believe me, though: depending on the size of your application and the APIs you work with, taking care of every closure by hand gets very hard, and it is also very difficult to beat the efficiency of the compiler's default capture mechanism. But let's give it a try with the following code example:

using System;
using System.Runtime.CompilerServices;

static class ClosureCompare
{
    private static int n = 0;
    private delegate void AddDelegate(int n);

    private readonly static DisplayStruct adder = new DisplayStruct(ref n);
    private readonly static AddDelegate invoke = adder.Add;

    static void Main()
    {
       invoke(1);
       Console.WriteLine(adder.GetValue()); // Output the result: 1
    }
}

readonly unsafe struct DisplayStruct
{
    private readonly int* num;

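    public DisplayStruct(ref int initialValue)
    {
        // Caution: the pointer outlives the fixed block, so this is only
        // valid for as long as the target stays at the same address.
        fixed (int* ptr = &initialValue)
        {
            num = ptr;
        }
    }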

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public void Add(int n)
    {
        *num += n;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public int GetValue()
    {
        return *num;
    }
}

Note that I'm working with a raw pointer so that I can still change the value behind a readonly field, which is why this code has to be compiled and run in an unsafe context. I know this code doesn't look very convenient but, believe me, anything less would not be worth the effort of doing your own closure capture, because the compiler will do it better than you. If you check the lowered (compiler-generated) C# code, you will see that the usual <>c__DisplayClass0_0 class is now gone.

But, on the other hand, if you do not want to use closures at all, you can still have a specialized struct to do the dirty work for you, like so:

using System;
using System.Runtime.CompilerServices;

static class ClosureCompare
{
    static void Main()
    {
       int num = 0;

       DisplayStruct adder = new DisplayStruct(ref num);

       adder.Add(1); // Perform addition operation

       Console.WriteLine(adder.GetResult()[0]); // Output the result: 1
    }
}

public readonly ref struct DisplayStruct
{
    private readonly Span<int> num;

    public DisplayStruct(ref int initialValue)
    {
        num = new Span<int>(ref initialValue);
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public void Add(int n)
    {
        if (num.Length > 0)
            num[0] += n;
        else
            throw new InvalidOperationException("Span is empty.");
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public Span<int> GetResult()
    {
        return num;
    }
}

Note the use of the ref modifier on this new struct. It both allows you to use Span<T> as a struct field and prevents people from using this struct with the most common closures because, according to the official documentation, you cannot use a ref local inside an anonymous method, lambda expression, or query expression.
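A small sketch of that restriction, assuming .NET 7 or later as in the example above. SpanCounter and RefStructCaptureSketch are hypothetical names with the same shape as DisplayStruct; the commented-out line is the one the compiler should reject.

```csharp
using System;

// Hypothetical ref struct, same idea as DisplayStruct above.
public readonly ref struct SpanCounter
{
    private readonly Span<int> slot;

    public SpanCounter(ref int target)
    {
        slot = new Span<int>(ref target);
    }

    public void Add(int n) => slot[0] += n;

    public int Value => slot[0];
}

static class RefStructCaptureSketch
{
    public static int Run()
    {
        int num = 0;
        var counter = new SpanCounter(ref num);
        counter.Add(1);

        // Uncommenting the next line should be a compile-time error,
        // because a ref struct local cannot be captured by a lambda:
        // Action bump = () => counter.Add(1);

        // Copying the value out first is fine: only the int is captured.
        int snapshot = counter.Value;
        Func<int> read = () => snapshot;
        return read();
    }

    static void Main()
    {
        Console.WriteLine(Run()); // 1
    }
}
```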

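You can try this code yourself. On my machine, at least, this version was the fastest one on the latest LTS runtime (.NET 8 in the results below), although a mean that close to zero, less than a single CPU cycle, suggests that the JIT optimized most of the work away in that case. And of course, it's always recommended to run your own benchmarks when necessary, because these results can vary from one machine to another: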

// * Summary *

BenchmarkDotNet v0.13.7, Windows 11 (10.0.22631.3447)
AMD Ryzen 5 1600, 1 CPU, 12 logical and 6 physical cores
.NET SDK 8.0.204
  [Host]   : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX2
  .NET 7.0 : .NET 7.0.18 (7.0.1824.16914), X64 RyuJIT AVX2
  .NET 8.0 : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX2


|                Method |  Runtime |        Mean |     Error |    StdDev |      Median | Ratio | Rank |   Gen0 | Allocated |
|---------------------- |--------- |------------:|----------:|----------:|------------:|------:|-----:|-------:|----------:|
|  CustomClosureCapture | .NET 7.0 |   3.2424 ns | 0.0994 ns | 0.1104 ns |   3.2463 ns |  0.16 |    1 |      - |         - |
|             NoClosure | .NET 7.0 |  12.3134 ns | 0.2658 ns | 0.2356 ns |  12.2293 ns |  0.60 |    2 |      - |         - |
| DefaultClosureCapture | .NET 7.0 |  20.4729 ns | 0.4359 ns | 0.9840 ns |  20.2206 ns |  1.00 |    3 | 0.0017 |      88 B |
|                       |          |             |           |           |             |       |      |        |           |
|             NoClosure | .NET 8.0 |   0.0454 ns | 0.0299 ns | 0.0456 ns |   0.0337 ns | 0.002 |    1 |      - |         - |
|  CustomClosureCapture | .NET 8.0 |   3.1839 ns | 0.0956 ns | 0.1243 ns |   3.1440 ns | 0.149 |    2 |      - |         - |
| DefaultClosureCapture | .NET 8.0 |  21.5585 ns | 0.4636 ns | 0.4961 ns |  21.5625 ns | 1.000 |    3 | 0.0014 |      88 B |
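If you only want a quick look at the Allocated column without setting up a full benchmark, a rough sketch like the one below can help. It is not the benchmark used for the table above: the class and method names are made up, it relies on GC.GetAllocatedBytesForCurrentThread, and the numbers it prints depend on the runtime version and JIT optimizations, so treat them as an indication only.

```csharp
using System;

static class AllocationSketch
{
    // Returns the bytes allocated on the current thread while `body` runs.
    public static long MeasureAllocatedBytes(Action body)
    {
        long before = GC.GetAllocatedBytesForCurrentThread();
        body();
        return GC.GetAllocatedBytesForCurrentThread() - before;
    }

    // The lambda captures `total`, so the compiler needs a display class
    // and a delegate for it.
    public static int AddWithCapture(int start)
    {
        int total = start;
        Action<int> add = n => total += n;
        add(1);
        return total;
    }

    // Same arithmetic with no lambda and therefore nothing to capture.
    public static int AddWithoutCapture(int start)
    {
        int total = start;
        total += 1;
        return total;
    }

    static void Main()
    {
        // Warm up so that one-time JIT work is less likely to be counted.
        AddWithCapture(0);
        AddWithoutCapture(0);
        MeasureAllocatedBytes(static () => { });

        long captured = MeasureAllocatedBytes(static () => AddWithCapture(0));
        long plain = MeasureAllocatedBytes(static () => AddWithoutCapture(0));

        Console.WriteLine($"capture: {captured} B, no capture: {plain} B");
    }
}
```

For anything you intend to publish or base decisions on, a proper BenchmarkDotNet run remains the better tool.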

How to avoid surprises with closures

It is better to understand how closure capture works in C# in order to avoid surprises. We already know that a closure is a function or an expression that runs in a different scope or environment from the one where its variables were declared. And "a different scope or environment" can also mean a different thread. That is when we should start to be more aware of how things work. Let's take a look at this code:

using System;
using System.Threading;

static class ClosureCompare
{
    static void Main()
    {
       int[] arr = [1, 2, 3, 4, 5];

       foreach(int n in arr)
       {
           ThreadPool.QueueUserWorkItem(_ => Console.Write(n));
       }

       // Wait a bit for the Thread Pool threads to do their work
       // as we are not joining the threads together again
       Thread.Sleep(2_000);
    }
}

This code should run smoothly and, on most machines, two seconds should be enough for all the work items to be scheduled and to finish on time. It should output something like 12345 to the console, but not always in the same order, because the scheduling is not managed by our code and depends on the availability of threads.
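If the fixed two-second sleep bothers you, one possible alternative is to wait on a CountdownEvent that every work item signals. This is only a sketch: the names ForeachWaitSketch, seen and done are made up, and the results are collected into a ConcurrentBag and sorted so that the output no longer depends on scheduling.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;

static class ForeachWaitSketch
{
    public static int[] Run()
    {
        int[] arr = [1, 2, 3, 4, 5];
        var seen = new ConcurrentBag<int>();

        // One signal is expected per queued work item.
        using var done = new CountdownEvent(arr.Length);

        foreach (int n in arr)
        {
            // Each iteration gets its own `n`, so every callback sees its own value.
            ThreadPool.QueueUserWorkItem(_ =>
            {
                seen.Add(n);
                done.Signal();
            });
        }

        // Blocks until all five callbacks have signaled.
        done.Wait();

        return seen.OrderBy(x => x).ToArray();
    }

    static void Main()
    {
        Console.WriteLine(string.Join("", Run())); // 12345 (after sorting)
    }
}
```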

So far so good, huh? But what about the next code snippet, which now uses a for loop?

using System;
using System.Threading;

static class ClosureCompare
{
    static void Main()
    {
       int[] arr = [1, 2, 3, 4, 5];

       for (int i = 0; i < arr.Length; i++)
       {
           // Without a manual copy of i here, i is almost always 5 by the time the
           // callback runs, which throws an IndexOutOfRangeException on a pool thread
           ThreadPool.QueueUserWorkItem(_ => Console.Write(arr[i]));
       }

       // Wait a bit for the Thread Pool threads to do their work
       // as we are not joining the threads together again
       Thread.Sleep(2_000);
    }
}

But why do we get this error if we have correctly limited the for loop to the size of the array?

i < arr.Length

Well, the way these two loops work is a little different. Microsoft changed the way the foreach loop works in version 5.0 of the language. According to the C# language specification [1], "The placement of v inside the while loop is important for how it is captured by any anonymous function occurring in the embedded_statement."

If you take a look at the lowered C# code, you will see that the foreach loop is still capturing the variable by reference. The difference is that the compiler now creates a new instance of the helper class, holding a copy of the array item, on each iteration. Can you spot the difference in the code snippet below?

private static void Main()
{
    <>c__DisplayClass0_0 <>c__DisplayClass0_ = new <>c__DisplayClass0_0();
    int[] array = new int[5];
    RuntimeHelpers.InitializeArray(array, (RuntimeFieldHandle));
    <>c__DisplayClass0_.arr = array;
    int[] arr = <>c__DisplayClass0_.arr;
    int num = 0;
    // Foreach loop translated
    while (num < arr.Length)
    {
        <>c__DisplayClass0_1 <>c__DisplayClass0_2 = new <>c__DisplayClass0_1();
        <>c__DisplayClass0_2.n = arr[num]; // Making a copy of the array item here
        ThreadPool.QueueUserWorkItem(new WaitCallback(<>c__DisplayClass0_2.<Main>b__0));
        num++;
    }
    <>c__DisplayClass0_2 <>c__DisplayClass0_3 = new <>c__DisplayClass0_2();
    <>c__DisplayClass0_3.CS$<>8__locals1 = <>c__DisplayClass0_;
    <>c__DisplayClass0_3.i = 0;
    // For loop translated
    while (<>c__DisplayClass0_3.i < <>c__DisplayClass0_3.CS$<>8__locals1.arr.Length)
    {
        ThreadPool.QueueUserWorkItem(new WaitCallback(<>c__DisplayClass0_3.<Main>b__1));
        <>c__DisplayClass0_3.i++;
    }
    Thread.Sleep(2_000);
}

It is interesting how both loops are translated into the same kind of while loop and still behave differently. The key point is to know how they work, because this is not as much of a problem as it may look. And the way the for loop works can possibly be more performant, if not degraded by JIT [2] compilation.

So, what's happening with the way this for loop is written is that, when the loop is translated into a while loop, every callback gets a reference to the same variable i on the same helper class instance. When that variable is incremented for the last time, from 4 to 5, just before the final loop check, the threads holding a reference to it will try to access index 5 of an array of length 5, whose last valid index is 4.

And the value of the variable will almost always be 5, because by that point the loop has already incremented it five times. The loop itself runs in a matter of nanoseconds or even less, much faster than the work needed to queue the callbacks on the thread pool, including, but not limited to, scheduling their execution.
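The same effect can be reproduced without any threads, which makes it deterministic. In the sketch below (SharedVariableSketch and its method names are made up for illustration), the delegates are only invoked after each loop has finished.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class SharedVariableSketch
{
    // All delegates share the single loop variable `i`.
    public static int[] CaptureInFor()
    {
        var readers = new List<Func<int>>();
        for (int i = 0; i < 5; i++)
        {
            readers.Add(() => i);
        }

        // The loop is over, so `i` is already 5 for every delegate.
        return readers.Select(r => r()).ToArray();
    }

    // Each iteration has its own fresh `n` (the behavior since C# 5).
    public static int[] CaptureInForeach()
    {
        var readers = new List<Func<int>>();
        foreach (int n in new[] { 0, 1, 2, 3, 4 })
        {
            readers.Add(() => n);
        }

        return readers.Select(r => r()).ToArray();
    }

    static void Main()
    {
        Console.WriteLine(string.Join(",", CaptureInFor()));     // 5,5,5,5,5
        Console.WriteLine(string.Join(",", CaptureInForeach())); // 0,1,2,3,4
    }
}
```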

To make it work as expected, we just need to copy the current value of i into a local declared inside the loop body, so that each closure captures its own copy instead of the shared variable, as shown in the code snippet below:

using System;
using System.Threading;

static class ClosureCompare
{
    static void Main()
    {
       int[] arr = [1, 2, 3, 4, 5];

       for (int i = 0; i < arr.Length; i++)
       {
           // Making a copy of the current increment
           // and capturing the closure by value, instead of reference
           int closureCapture = i;
           ThreadPool.QueueUserWorkItem(_ => Console.Write(arr[closureCapture]));
       }

       // Wait a bit for the Thread Pool threads to do their work
       // as we are not joining the threads together again
       Thread.Sleep(2_000);
    }
}
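Another option, sketched below under a few assumptions, is to skip the capture entirely and hand the values to the callback as state, using the generic ThreadPool.QueueUserWorkItem<TState> overload available in recent .NET versions together with a static lambda (C# 9 or later). The tuple element names (Items, Index, Seen, Done) and the class name are made up for illustration, and the CountdownEvent replaces the sleep so that the result is deterministic.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;

static class StateArgumentSketch
{
    public static int[] Run()
    {
        int[] arr = [1, 2, 3, 4, 5];
        var seen = new ConcurrentBag<int>();
        using var done = new CountdownEvent(arr.Length);

        for (int i = 0; i < arr.Length; i++)
        {
            // The tuple is built here, so the current value of i is copied
            // into it. The static lambda cannot capture anything by accident.
            ThreadPool.QueueUserWorkItem(
                static s =>
                {
                    s.Seen.Add(s.Items[s.Index]);
                    s.Done.Signal();
                },
                (Items: arr, Index: i, Seen: seen, Done: done),
                preferLocal: false);
        }

        // Blocks until every work item has signaled.
        done.Wait();

        return seen.OrderBy(x => x).ToArray();
    }

    static void Main()
    {
        Console.WriteLine(string.Join("", Run())); // 12345 (after sorting)
    }
}
```

The trade-off is readability: the state tuple makes the data flow explicit, but the manual copy into a local shown above is shorter and usually good enough.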

But this is not a "problem" reserved for loops only. It can also happen with timers and, worse than that, it can happen the other way around. Timers have low-level APIs and are not very commonly used directly, because there are higher-level abstractions, such as BackgroundWorker [3], which offer more flexible APIs and a better experience. But let's check this simplified example below with a low-level Timer:

using System;
using System.Timers;
using System.Threading.Tasks;

using Timer = System.Timers.Timer;

class Program
{
    static int count = 0;
    static int[] items = [1, 2, 3];
    static TaskCompletionSource tcs = new TaskCompletionSource();

    static async Task Main()
    {
        var timer = new Timer() { Interval = 100 };

        timer.Elapsed += (sender, e) => CronJob(sender, e,
            count); // Capturing count by value

        timer.Enabled = true;
        await tcs.Task;

        timer.Stop();
        timer.Dispose();

        Console.WriteLine("Timer stopped.");
    }

    private static void CronJob(object? source, ElapsedEventArgs e, int count)
    {
        Console.WriteLine("Item: {0}", items[count]);

        count++;

        if (count == items.Length)
        {
            tcs.SetResult(); // The non-generic TaskCompletionSource takes no value
        }
    }
}

In this example, the variable count is passed to the CronJob function by value. The function only ever increments its own copy, so the static field stays at 0, the copy never goes beyond 1, and the condition that completes the task is never met, leading to an infinite run.

By default, arguments in C# are passed to functions by value. That means a copy of the variable is passed to the method. [4]
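A minimal sketch of that difference, with made-up method names:

```csharp
using System;

static class ByValueVsRefSketch
{
    // Receives a copy: the caller's variable is not affected.
    public static void IncrementCopy(int value) => value++;

    // Receives a reference: the caller's variable is incremented.
    public static void IncrementInPlace(ref int value) => value++;

    public static (int AfterCopy, int AfterRef) Run()
    {
        int count = 0;

        IncrementCopy(count);
        int afterCopy = count; // still 0

        IncrementInPlace(ref count);
        int afterRef = count; // now 1

        return (afterCopy, afterRef);
    }

    static void Main()
    {
        var (afterCopy, afterRef) = Run();
        Console.WriteLine($"{afterCopy} {afterRef}"); // 0 1
    }
}
```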

To make this code work as expected, we just need the small ref keyword, which works the same way the & sign works in PHP, as we saw in the first blog post of this closures series. Please check the updated code below:

using System;
using System.Timers;
using System.Threading.Tasks;

using Timer = System.Timers.Timer;

class Program
{
    static int count = 0;
    static int[] items = [1, 2, 3];
    static TaskCompletionSource tcs = new TaskCompletionSource();

    static async Task Main()
    {
        var timer = new Timer() { Interval = 100 };

        timer.Elapsed += (sender, e) => CronJob(sender, e,
            ref count); // Capturing count now by reference

        timer.Enabled = true;
        await tcs.Task;

        timer.Stop();
        timer.Dispose();

        Console.WriteLine("Timer stopped.");
    }

    private static void CronJob(object? source, ElapsedEventArgs e,
        ref int count) // Capturing count now by reference
    {
        Console.WriteLine("Item: {0}", items[count]);

        count++;

        if (count == items.Length)
        {
            tcs.SetResult(); // The non-generic TaskCompletionSource takes no value
        }
    }
}

Of course, it is not recommended to use low-level APIs in the development of enterprise applications unless it is really necessary. Low-level code is more error-prone and, among other things, can also have a negative impact on readability.

The code examples shown above were just simulating the problems that can arise when we don't really know how things work. With this, I hope to have helped more people understand a little more about scope, closures and compositions, and also how to take advantage of them. Happy coding!

Disclaimer

It's worth noting that I'm not a Microsoft employee. All opinions in this blog post are my own. The information displayed here is not endorsed by Microsoft, the .NET Foundation or any of their partners. This is not a sponsored post. All rights reserved.