Experimentation with Optimized Closures
At Sharp Cells, we love functional programming, but we also love writing high-performance code! One feature we love, partial application, is a potential source of performance frustration.
Partial Application?
For those unaware, partial application is a technique where you can pass some of the arguments to a function and receive a new function that contains the provided arguments and expects only the remaining arguments.
let add a b = a + b

// Call normally
add 1 2 // = 3

// Partial application creates a new function that adds 1
let addOne = add 1

// Use the partially applied function
addOne 2 // = 3
This technique is made easy in F# by a language feature known as currying and is common in many functional programming languages. But partial application is not unique to functional languages; most programming languages can express the same thing with some additional ceremony:
- Using JavaScript:
function add(x, y) {
  return x + y;
}
// Partially apply the add function to create a new function that adds 1 to any number
const addOne = add.bind(null, 1);
// Call the new function with one argument
console.log(addOne(2)); // Output: 3
In JavaScript, we can use the bind method to create a new function with one of the arguments applied.
- Using C#:
Func<int, int, int> add = (x, y) => x + y;
// Partially apply the add function to create a new function that adds 1 to any number
Func<int, int> addOne = x => add(1, x);
// Call the new function with one argument
Console.WriteLine(addOne(2)); // Output: 3
In C#, we can use lambda expressions to create a new Func delegate that captures the first argument.
So What's the Problem?
The source of the potential performance issues is better highlighted when looking at the different languages' implementations. To partially apply a function we must create a new function that encapsulates the first argument. Normal functions can't hold any state so we must create a closure - a special type of object that represents the state and the function to be called.
If, each time we use partial application, we create a closure - a new object on the heap - then every closure we no longer need must eventually be collected, creating more work for the garbage collector and slowing down our program.
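To make the cost concrete, here is a sketch in JavaScript, mirroring the earlier bind example. (JavaScript engines have their own optimizations, so treat this as an illustration of the shape of the problem, not a .NET-accurate model; the function names are ours.)

```javascript
function add(x, y) {
  return x + y;
}

function sumWithClosures(n) {
  let sum = 0;
  for (let i = 0; i <= n; i++) {
    // bind allocates a brand-new function object capturing `i`
    // on every iteration - each one becomes garbage as soon as
    // the iteration ends, and the collector must reclaim it.
    const addI = add.bind(null, i);
    sum = addI(sum);
  }
  return sum;
}

console.log(sumWithClosures(10)); // 55
```

One short-lived allocation per loop iteration is exactly the pattern we will try to reproduce, and then avoid, in F# below.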
The F# Compiler Authors are Smart
Really smart. We made several attempts to demonstrate the effect of creating closures, but each time the performance was equivalent to the fully applied version of the function. That's great, but what's going on?
A Brief Detour into Inlining
When your F# gets compiled, it is transformed in many ways to make your code run as fast as possible. One such optimization is inlining: the compiler replaces a call to a function with the body of that function. We can see the effect of inlining and other optimizations by using SharpLab to compile F# and then decompile it to a low-level form of C#:
let add a b = a + b

let accumulate (sum: byref<_>) f =
    sum <- f sum

type StackClosures() =
    member this.Inlined() =
        let mutable sum = 0
        for i in 0 .. 10000 do
            sum <- add sum i
        sum

    member this.AlsoInlined() =
        let mutable sum = 0
        for i in 0 .. 10000 do
            accumulate &sum (add i)
        sum
Optimized then decompiled to C#:
public class StackClosures
{
    public int Inlined()
    {
        int num = 0;
        int num2 = 0;
        while (num2 < 10001)
        {
            num += num2;
            num2++;
        }
        return num;
    }

    public int AlsoInlined()
    {
        int num = 0;
        int num2 = 0;
        while (num2 < 10001)
        {
            num = num2 + num;
            num2++;
        }
        return num;
    }
}
As you can see, the add and accumulate functions are completely eliminated by the compiler, even with the tricky byref parameter. Similarly, the for loop is replaced with an equivalent while loop. Part of the reason F# can make these optimizations is immutability: if a value is immutable, it is always safe to replace a reference to it with its definition. This is great for avoiding unnecessary closure allocation.
Defeating the Optimizer
Generally, this should be avoided, but for the sake of demonstration, we need to find a way to defeat the optimizer.
type StackClosures() =
    member this.NormalClosure() =
        let mutable sum = 0
        let mutable x = 0
        for i in 0 .. 10000 do
            x <- i
            accumulate &sum (add x)
        sum
Adding an unnecessary mutable variable x is sufficient to prevent the optimizer from doing its thing. Now, when we look at the decompilation, we can see how the closure is made:
public class StackClosures
{
    public int NormalClosure()
    {
        int num = 0;
        int num2 = 0;
        int num3 = 0;
        while (num3 < 10001)
        {
            num2 = num3;
            int a = num2;
            FSharpFunc<int, int> fSharpFunc = new NormalClosure@13(a);
            num = fSharpFunc.Invoke(num);
            num3++;
        }
        return num;
    }
}

internal sealed class NormalClosure@13 : FSharpFunc<int, int>
{
    public int a;

    internal NormalClosure@13(int a)
    {
        this.a = a;
    }

    public override int Invoke(int b)
    {
        return a + b;
    }
}
There's the class we were expecting. Now each time we go through the loop, a new instance of NormalClosure@13 is created, invoked, and discarded. Yuck!
A Better Way?
Objects aren't the only way to represent state in the world of .NET. We can also use structs: value types that hold state like an object but can be allocated on the stack instead of the heap, allowing them to be created and destroyed without involving the garbage collector.
We can take inspiration from the implementation of NormalClosure@13 to create a struct version of a closure:
[<Struct>]
type StackFunc<'A, 'B, 'C> =
    {
        mutable A: 'A
        F: 'A -> 'B -> 'C
    }
    member inline this.Invoke(b) = this.F this.A b
Neat! But how do we know if it's faster?
BenchmarkDotNet
BenchmarkDotNet is the gold standard for benchmarking all flavors of .NET code. With this tool, we test our functions and compare their performance in a rigorous and reproducible way.
Turning our code into benchmark code is quite simple: we just add some attributes to our class and away we go:
module TestFuncs =
    let add a b = a + b

    let accumulate (sum: byref<_>) f =
        sum <- f sum

    let accumulateFunc (sum: byref<_>) (f: StackFunc<_,_,_>) =
        sum <- f.Invoke(sum)

    let expected =
        function
        | 1 -> 1
        | 10 -> 55
        | 100 -> 5050
        | 10000 -> 50005000
        | x -> failwith $"Undefined for {x}"

open TestFuncs

[<MemoryDiagnoser>]
type StackClosures() =
    [<Params(1, 10, 100, 10000)>]
    member val N = 1 with get, set

    [<Benchmark>]
    member this.Inlined() =
        let mutable sum = 0
        for i in 0 .. this.N do
            sum <- add sum i
        if (sum <> expected this.N) then failwith "Inlined summed incorrectly"

    [<Benchmark>]
    member this.AlsoInlined() =
        let mutable sum = 0
        for i in 0 .. this.N do
            accumulate &sum (add i)
        if (sum <> expected this.N) then failwith "AlsoInlined summed incorrectly"

    [<Benchmark>]
    member this.NormalClosure() =
        let mutable sum = 0
        let mutable x = 0
        for i in 0 .. this.N do
            x <- i
            accumulate &sum (add x)
        if (sum <> expected this.N) then failwith "NormalClosure summed incorrectly"

    [<Benchmark>]
    member this.StackFunc() =
        let mutable sum = 0
        let mutable addOnStack =
            {
                StackFunc.A = 0
                F = add
            }
        for i in 0 .. this.N do
            addOnStack.A <- i
            accumulateFunc &sum addOnStack
        if (sum <> expected this.N) then failwith "StackFunc summed incorrectly"

BenchmarkRunner.Run<StackClosures>()
|> ignore
Results
Method | N | Mean | Error | StdDev | Gen0 | Allocated |
---|---|---|---|---|---|---|
Inlined | 1 | 1.568 ns | 0.0536 ns | 0.0550 ns | - | - |
AlsoInlined | 1 | 1.935 ns | 0.0611 ns | 0.0541 ns | - | - |
NormalClosure | 1 | 5.956 ns | 0.1424 ns | 0.2417 ns | 0.0029 | 48 B |
StackFunc | 1 | 9.369 ns | 0.1177 ns | 0.1101 ns | - | - |
Inlined | 10 | 4.108 ns | 0.0995 ns | 0.0930 ns | - | - |
AlsoInlined | 10 | 11.470 ns | 0.2381 ns | 0.2228 ns | - | - |
NormalClosure | 10 | 26.572 ns | 0.5540 ns | 1.2730 ns | 0.0158 | 264 B |
StackFunc | 10 | 35.152 ns | 0.7291 ns | 0.7161 ns | - | - |
Inlined | 100 | 28.490 ns | 0.3842 ns | 0.3594 ns | - | - |
AlsoInlined | 100 | 29.988 ns | 0.4265 ns | 0.3989 ns | - | - |
NormalClosure | 100 | 233.629 ns | 3.7461 ns | 3.5041 ns | 0.1447 | 2424 B |
StackFunc | 100 | 315.870 ns | 4.5391 ns | 4.2459 ns | - | - |
Inlined | 10000 | 2,224.525 ns | 30.0005 ns | 28.0625 ns | - | - |
AlsoInlined | 10000 | 2,216.351 ns | 26.0386 ns | 24.3565 ns | - | - |
NormalClosure | 10000 | 21,989.143 ns | 273.0534 ns | 242.0547 ns | 14.3433 | 240024 B |
StackFunc | 10000 | 30,848.998 ns | 589.8022 ns | 551.7014 ns | - | - |
Looking at the results, we can see that we have successfully avoided allocating the closure, but the code is... slower. This is unexpected and extremely unfortunate. What's going on?
Let's take another look at the decompilation, specifically the Invoke method.
public C Invoke(B b)
{
    return FSharpFunc<A, B>.InvokeFast(F@, A@, b);
}
That's not a normal function invocation, and it's bound to cause some additional overhead. But all is not lost! There's more than one way to represent a function in our struct.
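For context, a plain 'A -> 'B -> 'C value is curried, so applying it to one argument can allocate an intermediate closure. InvokeFast is FSharp.Core's way of avoiding that where possible. As a rough sketch (simplified pseudocode based on FSharp.Core's behaviour, not its exact source), it performs a runtime type test on every call:

```
// Simplified pseudocode - not the actual FSharp.Core source
InvokeFast(f, a, b) =
    match f with
    // If f carries an optimized arity-2 entry point, call it
    // directly with both arguments...
    | :? OptimizedClosures.FSharpFunc<'A, 'B, 'C> as opt ->
        opt.Invoke(a, b)
    // ...otherwise fall back to two curried calls, where (f a)
    // may allocate an intermediate closure.
    | _ -> (f.Invoke a).Invoke b
```

That per-call type test (and the potential intermediate closure on the slow path) is a plausible source of the extra time we measured, even though the heap allocation is gone.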
Alternative Struct Representations
The two other possible ways we can represent a function are the Func delegate and OptimizedClosures.FSharpFunc. Let's create two more structs and rerun the benchmarks.
[<Struct>]
type StackOptFunc<'A, 'B, 'C> =
    {
        mutable A: 'A
        F: OptimizedClosures.FSharpFunc<'A, 'B, 'C>
    }
    member inline this.Invoke(b) = this.F.Invoke(this.A, b)

[<Struct>]
type StackDelegate<'A, 'B, 'C> =
    {
        mutable A: 'A
        F: Func<'A, 'B, 'C>
    }
    member inline this.Invoke(b) = this.F.Invoke(this.A, b)

module TestFuncs =
    let accumulateOptFunc (sum: byref<_>) (f: StackOptFunc<_,_,_>) =
        sum <- f.Invoke(sum)

    let accumulateDelegate (sum: byref<_>) (f: StackDelegate<_,_,_>) =
        sum <- f.Invoke(sum)
[<MemoryDiagnoser>]
type StackClosures() =
    [<Benchmark>]
    member this.StackOptFunc() =
        let mutable sum = 0
        let mutable addOnStack =
            {
                StackOptFunc.A = 0
                F = OptimizedClosures.FSharpFunc<_,_,_>.Adapt(add)
            }
        for i in 0 .. this.N do
            addOnStack.A <- i
            accumulateOptFunc &sum addOnStack
        if (sum <> expected this.N) then failwith "StackOptFunc summed incorrectly"

    [<Benchmark>]
    member this.StackDelegate() =
        let mutable sum = 0
        let mutable addOnStack =
            {
                StackDelegate.A = 0
                F = add
            }
        for i in 0 .. this.N do
            addOnStack.A <- i
            accumulateDelegate &sum addOnStack
        if (sum <> expected this.N) then failwith "StackDelegate summed incorrectly"
And the results look great! The OptimizedClosures.FSharpFunc version gets ahead of the Func version, and both are substantially faster (almost 50%) than the NormalClosure case for large N. The Func delegate does allocate some memory, but as it wraps the original function, it only needs to be allocated once. Both are, however, around 5x slower than the fully inlined version. NormalClosure and StackOptFunc are roughly the same speed for N = 1.
Method | N | Mean | Error | StdDev | Gen0 | Allocated |
---|---|---|---|---|---|---|
Inlined | 1 | 1.568 ns | 0.0536 ns | 0.0550 ns | - | - |
NormalClosure | 1 | 5.956 ns | 0.1424 ns | 0.2417 ns | 0.0029 | 48 B |
StackOptFunc | 1 | 5.771 ns | 0.1231 ns | 0.1152 ns | - | - |
StackDelegate | 1 | 11.009 ns | 0.2426 ns | 0.4120 ns | 0.0038 | 64 B |
Inlined | 10 | 4.108 ns | 0.0995 ns | 0.0930 ns | - | - |
NormalClosure | 10 | 26.572 ns | 0.5540 ns | 1.2730 ns | 0.0158 | 264 B |
StackOptFunc | 10 | 16.039 ns | 0.3389 ns | 0.2830 ns | - | - |
StackDelegate | 10 | 25.177 ns | 0.5126 ns | 0.5484 ns | 0.0038 | 64 B |
Inlined | 100 | 28.490 ns | 0.3842 ns | 0.3594 ns | - | - |
NormalClosure | 100 | 233.629 ns | 3.7461 ns | 3.5041 ns | 0.1447 | 2424 B |
StackOptFunc | 100 | 115.131 ns | 1.3663 ns | 1.2781 ns | - | - |
StackDelegate | 100 | 184.305 ns | 3.0298 ns | 2.8340 ns | 0.0038 | 64 B |
Inlined | 10000 | 2,224.525 ns | 30.0005 ns | 28.0625 ns | - | - |
NormalClosure | 10000 | 21,989.143 ns | 273.0534 ns | 242.0547 ns | 14.3433 | 240024 B |
StackOptFunc | 10000 | 13,130.112 ns | 206.8013 ns | 193.4421 ns | - | - |
StackDelegate | 10000 | 15,543.613 ns | 169.1367 ns | 158.2106 ns | - | 64 B |
Conclusions
We have successfully demonstrated that it is possible to significantly improve the performance of closures for partial application by replacing the standard representation with a struct version. So are we going to replace all closures in our Sharp Cells add-in with StackOptFunc? Not at this stage.
Without a detailed analysis of the compilation output, it's not clear whether our existing closures are being inlined (and therefore faster than the struct version) or allocated (and therefore slower). We hope that an optimization like this will find its way into the compiler so that everyone can enjoy faster closures when they cannot be fully inlined.