Kenta Takeuchi

Posted on • Updated on • Originally published at bmf-tech.com

Improving the performance of your code starting with Go

This article is translated from bmf-tech.com - Goで始めるコードのパフォーマンス改善 (Improving the performance of your code starting with Go).

Day 9 of Makuake Advent Calendar 2022!


When I set out to improve the performance of goblin, a homemade HTTP router, I worked through a round of performance tuning in Go, so this article describes my approach and the efforts I put into practice.

Prerequisites

I'm sure more knowledge is needed for in-depth tuning, but here is the bare minimum required.

  • Garbage Collection
    • A mechanism that automatically releases memory areas that a program allocated at runtime and no longer needs.
  • Memory areas
    • Text area
      • Area where the program, converted into machine language, is stored
    • Stack area
      • Memory area allocated at program execution
      • Holds data whose size is fixed at compile time
      • Automatically released (e.g., when a function finishes executing and its data is no longer needed)
      • Ex. arguments, return values, temporary variables, etc.
    • Heap area
      • Memory area allocated during program execution
      • Holds data whose size is determined dynamically at runtime
      • Subject to garbage collection
    • Static area
      • Memory area allocated when the program starts
      • Remains allocated until the program terminates
      • Ex. global variables, static variables, etc.
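To make the stack/heap distinction concrete, here is a small sketch (the function names are my own, not from the article): a value whose address outlives its function call must move to the heap, while purely local values can stay on the stack.

```go
package main

import "fmt"

// newOnHeap returns a pointer to a local variable; escape analysis
// moves x to the heap because it outlives the function call.
func newOnHeap() *int {
	x := 42
	return &x
}

// sumOnStack only uses values whose lifetime ends with the call,
// so everything can stay on the stack.
func sumOnStack(a, b int) int {
	return a + b
}

func main() {
	fmt.Println(*newOnHeap(), sumOnStack(1, 2)) // 42 3
}
```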

Approach to Performance Improvement

Before anything else, you should confirm there is actually a need to improve performance (is it worth sacrificing readability? can you even determine that the application is the bottleneck in the first place?); here we proceed under the assumption that there is.

Some ways to improve the performance of code include the following:

  • Algorithm optimization 
  • Optimization of data structures
  • Use of cache
  • Application of parallel processing
  • Compile optimization
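As one illustrative sketch of the "use of cache" item (a toy example of my own, not from the article), memoization trades memory for repeated computation:

```go
package main

import "fmt"

// fib is the naive implementation; it recomputes the same values
// exponentially many times.
func fib(n int) int {
	if n < 2 {
		return n
	}
	return fib(n-1) + fib(n-2)
}

// fibMemo caches intermediate results so each value is computed once.
func fibMemo(n int, memo map[int]int) int {
	if n < 2 {
		return n
	}
	if v, ok := memo[n]; ok {
		return v
	}
	memo[n] = fibMemo(n-1, memo) + fibMemo(n-2, memo)
	return memo[n]
}

func main() {
	fmt.Println(fib(20), fibMemo(20, map[int]int{})) // 6765 6765
}
```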

There are a number of things that come to mind, but before implementing any of them, measurement and analysis should be performed.
(Determining whether performance improvement is needed at all arguably comes before measurement, but that depends on each case and is not discussed here.)

We will introduce some of the packages and tools for measurement and analysis in Go.

Benchmarks

Go includes Benchmarks in the standard package testing to get a benchmark of your code.

You can run the command go test -bench=. -benchmem to obtain a benchmark.

package main

import (
    "math/rand"
    "testing"
)

func BenchmarkRandIn(b *testing.B) {
    for i := 0; i < b.N; i++ { // b.N is adjusted automatically until the benchmark runs long enough to be statistically reliable
        rand.Int() // Function to be measured
    }
}

The output result will look like this.

goos: darwin
goarch: amd64
pkg: bmf-san/go-perfomance-tips
cpu: VirtualApple @ 2.50GHz
BenchmarkRandIn-8       87550500                13.53 ns/op            0 B/op          0 allocs/op
PASS
ok      bmf-san/go-perfomance-tips      1.381s

The benchmark results that can be read from this are as follows:

  • 87550500
    • Number of function executions
    • The higher the number of executions, the better the performance is considered
  • 13.53 ns/op
    • Time required per function execution
    • Less time is considered better performance
  • 0 B/op
    • Size of memory allocated per function execution
    • The smaller the number, the better the performance is considered to be
  • 0 allocs/op
    • Number of memory allocations made per function execution
    • The fewer the number of allocations, the better the performance

Go allows for easy benchmarking in this way.
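As a side note, the allocation statistics can also be requested from the code itself rather than via the -benchmem flag, and a benchmark can even be run outside go test; a minimal sketch:

```go
package main

import (
	"fmt"
	"math/rand"
	"testing"
)

// BenchmarkRandInt asks for allocation stats in code via b.ReportAllocs,
// so they are reported even without the -benchmem flag.
func BenchmarkRandInt(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		rand.Int()
	}
}

func main() {
	// testing.Benchmark runs a benchmark function from a normal program.
	r := testing.Benchmark(BenchmarkRandInt)
	fmt.Println(r.String(), r.MemString())
}
```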

For other Go benchmarking features, see the Benchmarks section of the testing package documentation.

The package benchstat is a good tool for comparing benchmark results, as it shows the percentage improvement between two sets of results.

In bmf-san/goblin, the package I maintain, benchstat is incorporated into CI so that results can be compared before and after a commit.

// This is an example where nothing has improved...
go test -bench . -benchmem -count 1 > new.out
benchstat old.out new.out
name           old time/op    new time/op    delta
Static1-36        248ns ± 0%     246ns ± 0%   ~     (p=1.000 n=1+1)
Static5-36        502ns ± 0%     495ns ± 0%   ~     (p=1.000 n=1+1)
Static10-36       874ns ± 0%     881ns ± 0%   ~     (p=1.000 n=1+1)
WildCard1-36     1.60µs ± 0%    1.50µs ± 0%   ~     (p=1.000 n=1+1)
WildCard5-36     4.49µs ± 0%    4.92µs ± 0%   ~     (p=1.000 n=1+1)
WildCard10-36    7.68µs ± 0%    7.58µs ± 0%   ~     (p=1.000 n=1+1)
Regexp1-36       1.38µs ± 0%    1.48µs ± 0%   ~     (p=1.000 n=1+1)
Regexp5-36       4.30µs ± 0%    4.11µs ± 0%   ~     (p=1.000 n=1+1)
Regexp10-36      7.66µs ± 0%    7.18µs ± 0%   ~     (p=1.000 n=1+1)

If absolutely no performance degradation is allowed, it may be a good idea to use a mechanism that makes CI fail when benchstat reports a regression.

If you want to see where the memory allocations shown in such benchmark results actually come from, you can check by building with a compiler flag.

package main

import "fmt"

// Run build with go build -o example -gcflags '-m' gcflagsexample.go
func main() {
    a := "hello"
    b := "world"
    fmt.Println(a + b)
}

Running go build -o example -gcflags '-m' gcflagsexample.go produces the following output.

# command-line-arguments
./gcflagsexample.go:9:13: inlining call to fmt.Println
./gcflagsexample.go:9:13: ... argument does not escape
./gcflagsexample.go:9:16: a + b escapes to heap
./gcflagsexample.go:9:16: a + b escapes to heap

This is a simple and obvious example, but identifying which values escape to the heap in this way, and then reducing those heap allocations, is a useful method of analysis for improving memory usage.
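As a hedged illustration of one such improvement (my own example, not from the article): repeated string concatenation with + allocates a new string each time, while a preallocated strings.Builder grows a single buffer.

```go
package main

import (
	"fmt"
	"strings"
)

// concatPlus allocates a fresh string for the result of a + b.
func concatPlus(a, b string) string {
	return a + b
}

// concatBuilder writes into one buffer; Grow preallocates so at most
// one allocation is needed for the final string.
func concatBuilder(a, b string) string {
	var sb strings.Builder
	sb.Grow(len(a) + len(b))
	sb.WriteString(a)
	sb.WriteString(b)
	return sb.String()
}

func main() {
	fmt.Println(concatPlus("hello", "world"))    // helloworld
	fmt.Println(concatBuilder("hello", "world")) // helloworld
}
```

The difference matters most in loops that build up a string piece by piece; for a single concatenation the two are effectively equivalent.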

Profiling

Go has a tool called pprof to analyze where the bottlenecks are at the function level.

package main

import (
    "sort"
    "testing"
)

func sortAlphabetically() {
    s := []string{"abc", "aba", "cba", "acb"}
    sort.Strings(s)
}

func BenchmarkSortAlphabetically(b *testing.B) {
    for i := 0; i < b.N; i++ {
        sortAlphabetically()
    }
}

If you want to see the CPU profile, run the following.

go test -test.bench=BenchmarkSortAlphabetically -cpuprofile cpu.out && go tool pprof -http=":8888" cpu.out

cpu_profile

If you want to see the memory profile, run the following.

go test -test.bench=BenchmarkSortAlphabetically profilingexample_test.go -memprofile mem.out && go tool pprof -http=":8889" mem.out

memory_profile

Utilizing the UI of pprof makes it easier to identify where the bottleneck is in the process.

Practice

Let's look at an example of improving goblin, my homemade HTTP router.

The PR on the subject is here.
Reduce the memory allocation by refactoring explodePath method #68

goblin is an HTTP router built on a trie tree that works well with the net/http interface.

As for functionality, it has the minimum considered necessary for routing.
cf. goblin#features

Benchmarks

First, we run a benchmark test to measure performance.

go test -bench=. -cpu=1 -benchmem

The benchmark has three patterns of test cases: static routing (e.g. /foo/bar), dynamic routing (e.g. /foo/:bar), and routing using regular expressions (e.g. /foo/:bar[^\d+$]).

The routing process involves:

  1. putting data into the tree structure (roughly, defining routes)
  2. searching for data in the tree structure (returning data based on the requested path)

In this test case, only the latter is measured.
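As a rough, hypothetical sketch of those two steps (my own toy code, not goblin's actual implementation), a minimal trie keyed on path segments looks like this:

```go
package main

import "fmt"

// node is a minimal trie node keyed on path segments.
type node struct {
	children map[string]*node
	handler  string // stand-in for an http.Handler
}

func newNode() *node { return &node{children: map[string]*node{}} }

// insert defines a route (step 1): walk/create a node per segment.
func (n *node) insert(segments []string, h string) {
	cur := n
	for _, seg := range segments {
		next, ok := cur.children[seg]
		if !ok {
			next = newNode()
			cur.children[seg] = next
		}
		cur = next
	}
	cur.handler = h
}

// search returns the handler for a requested path (step 2).
func (n *node) search(segments []string) (string, bool) {
	cur := n
	for _, seg := range segments {
		next, ok := cur.children[seg]
		if !ok {
			return "", false
		}
		cur = next
	}
	return cur.handler, cur.handler != ""
}

func main() {
	root := newNode()
	root.insert([]string{"foo", "bar"}, "barHandler")
	fmt.Println(root.search([]string{"foo", "bar"})) // barHandler true
}
```

The real router additionally handles wildcard and regexp segments, which is where the more expensive matching happens.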

The output results are as follows:

goos: darwin
goarch: amd64
pkg: github.com/bmf-san/goblin
cpu: VirtualApple @ 2.50GHz
BenchmarkStatic1         5072353               240.1 ns/op           128 B/op          4 allocs/op
BenchmarkStatic5         2491546               490.0 ns/op           384 B/op          6 allocs/op
BenchmarkStatic10        1653658               729.6 ns/op           720 B/op          7 allocs/op
BenchmarkWildCard1       1602606               747.3 ns/op           456 B/op          9 allocs/op
BenchmarkWildCard5        435784              2716 ns/op            1016 B/op         23 allocs/op
BenchmarkWildCard10       246729              5033 ns/op            1680 B/op         35 allocs/op
BenchmarkRegexp1         1647258               733.2 ns/op           456 B/op          9 allocs/op
BenchmarkRegexp5          456652              2641 ns/op            1016 B/op         23 allocs/op
BenchmarkRegexp10         251998              4780 ns/op            1680 B/op         35 allocs/op
PASS
ok      github.com/bmf-san/goblin       14.304s

Several trends can be read from the number of executions, the time per execution, the memory allocated per execution, and the number of allocations per execution.

I am personally concerned that memory allocations are occurring even for static routing. (Other HTTP Router benchmarks show 0 allocs.)

Profiling

Next, use pprof to obtain a profile.

This time, we obtain only a memory profile.

go test -bench . -memprofile mem.out && go tool pprof -http=":8889" mem.out

Graph output results.
pprof_graph

We can see that the process with the largest box (using the most memory) is explodePath.

Looking at the Top view (functions listed in order of time spent), explodePath is also at the top of the list.

pprof_top

Flat is the time spent in the function itself, and Cum is the cumulative time including the functions it calls.

Furthermore, check the Source view to see which lines inside the function are actually heavy.

pprof_source

Since Search is the core process responsible for the router's matching, I assumed the bottleneck would be there, but it turns out that a specific step inside it, explodePath, is the bottleneck.

Tuning

The explodePath function splits the received string on / and returns the segments as a []string.

// explodePath splits a path on the delimiter and removes empty values.
func explodePath(path string) []string {
    s := strings.Split(path, pathDelimiter)
    var r []string
    for _, str := range s {
        if str != "" {
            r = append(r, str)
        }
    }
    return r
}

Test code is also included to make the specification easy to understand.

func TestExplodePath(t *testing.T) {
    cases := []struct {
        actual   []string
        expected []string
    }{
        {
            actual:   explodePath(""),
            expected: nil,
        },
        {
            actual:   explodePath("/"),
            expected: nil,
        },
        {
            actual:   explodePath("//"),
            expected: nil,
        },
        {
            actual:   explodePath("///"),
            expected: nil,
        },
        {
            actual:   explodePath("/foo"),
            expected: []string{"foo"},
        },
        {
            actual:   explodePath("/foo/bar"),
            expected: []string{"foo", "bar"},
        },
        {
            actual:   explodePath("/foo/bar/baz"),
            expected: []string{"foo", "bar", "baz"},
        },
        {
            actual:   explodePath("/foo/bar/baz/"),
            expected: []string{"foo", "bar", "baz"},
        },
    }

    for _, c := range cases {
        if !reflect.DeepEqual(c.actual, c.expected) {
            t.Errorf("actual:%v expected:%v", c.actual, c.expected)
        }
    }
}

Since the variable r of type []string is declared without any capacity, we can infer that repeated append calls will cause reallocations and poor memory efficiency.

The following is a benchmark, prepared for verification, that appends to slices declared in different ways, along with its results.

package main

import "testing"

func BenchmarkSliceLen0Cap0(b *testing.B) {
    var s []int

    b.StartTimer()
    for i := 0; i < b.N; i++ {
        s = append(s, i)
    }
    b.StopTimer()
}

func BenchmarkSliceLenN(b *testing.B) {
    var s = make([]int, b.N)

    b.StartTimer()
    for i := 0; i < b.N; i++ {
        s = append(s, i)
    }
    b.StopTimer()
}

func BenchmarkSliceLen0CapN(b *testing.B) {
    var s = make([]int, 0, b.N)

    b.StartTimer()
    for i := 0; i < b.N; i++ {
        s = append(s, i)
    }
    b.StopTimer()
}
goos: darwin
goarch: amd64
pkg: example.com
cpu: VirtualApple @ 2.50GHz
BenchmarkSliceLen0Cap0  100000000               11.88 ns/op           45 B/op          0 allocs/op
BenchmarkSliceLenN      78467056                23.60 ns/op           65 B/op          0 allocs/op
BenchmarkSliceLen0CapN  616491007                5.057 ns/op           8 B/op          0 allocs/op
PASS
ok      example.com     6.898s

This result suggests that the code can be made somewhat more efficient by specifying the capacity up front.
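Another quick way to confirm this, as a sketch of my own using the standard library's testing.AllocsPerRun, is to count allocations directly:

```go
package main

import (
	"fmt"
	"testing"
)

const n = 100

// appendNoCap appends to a nil slice, so it reallocates as it grows.
func appendNoCap() {
	var s []int
	for i := 0; i < n; i++ {
		s = append(s, i)
	}
}

// appendWithCap preallocates the capacity, so at most one allocation is needed.
func appendWithCap() {
	s := make([]int, 0, n)
	for i := 0; i < n; i++ {
		s = append(s, i)
	}
}

func main() {
	// AllocsPerRun reports the average number of heap allocations per call.
	fmt.Println(testing.AllocsPerRun(100, appendNoCap), testing.AllocsPerRun(100, appendWithCap))
}
```

The first number should be noticeably larger than the second, reflecting the repeated growth of the slice backing array.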

Therefore, modify explodePath as follows.

func explodePath(path string) []string {
    s := strings.Split(path, "/")
    // var r []string
    r := make([]string, 0, strings.Count(path, "/")) // Specify capacity
    for _, str := range s {
        if str != "" {
            r = append(r, str)
        }
    }
    return r
}

Going a step deeper, here is a further-improved implementation.

func explodePath(path string) []string {
    splitFn := func(c rune) bool {
        return c == '/'
    }
    return strings.FieldsFunc(path, splitFn)
}

We will compare benchmarks for three patterns: the original explodePath implementation, an implementation that reserves slice capacity, and an implementation using strings.FieldsFunc.

package main

import (
    "strings"
    "testing"
)

func explodePath(path string) []string {
    s := strings.Split(path, "/")
    var r []string
    for _, str := range s {
        if str != "" {
            r = append(r, str)
        }
    }
    return r
}

func explodePathCap(path string) []string {
    s := strings.Split(path, "/")
    r := make([]string, 0, strings.Count(path, "/"))
    for _, str := range s {
        if str != "" {
            r = append(r, str)
        }
    }
    return r
}

func explodePathFieldsFunc(path string) []string {
    splitFn := func(c rune) bool {
        return c == '/'
    }
    return strings.FieldsFunc(path, splitFn)
}

func BenchmarkExplodePath(b *testing.B) {
    paths := []string{"", "/", "///", "/foo", "/foo/bar", "/foo/bar/baz"}
    b.StartTimer()
    for i := 0; i < b.N; i++ {
        for _, v := range paths {
            explodePath(v)
        }
    }
    b.StopTimer()
}

func BenchmarkExplodePathCap(b *testing.B) {
    paths := []string{"", "/", "///", "/foo", "/foo/bar", "/foo/bar/baz"}
    b.StartTimer()
    for i := 0; i < b.N; i++ {
        for _, v := range paths {
            explodePathCap(v)
        }
    }
    b.StopTimer()
}

func BenchmarkExplodePathFieldsFunc(b *testing.B) {
    paths := []string{"", "/", "///", "/foo", "/foo/bar", "/foo/bar/baz"}
    b.StartTimer()
    for i := 0; i < b.N; i++ {
        for _, v := range paths {
            explodePathFieldsFunc(v)
        }
    }
    b.StopTimer()
}
goos: darwin
goarch: amd64
pkg: example.com
cpu: VirtualApple @ 2.50GHz
BenchmarkExplodePath             1690340               722.2 ns/op           432 B/op         12 allocs/op
BenchmarkExplodePathCap          1622161               729.5 ns/op           416 B/op         11 allocs/op
BenchmarkExplodePathFieldsFunc   4948364               239.5 ns/op            96 B/op          3 allocs/op
PASS
ok      example.com     5.685s

The implementation using strings.FieldsFunc shows the best performance, so it is adopted.
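One caveat when swapping implementations (my own observation, not from the article): for inputs with no segments, strings.FieldsFunc returns an empty non-nil slice whereas the original returned nil, so a reflect.DeepEqual-based test like TestExplodePath above would need adjusting. Callers that only iterate or take the length see identical behavior, which an element-wise check confirms:

```go
package main

import (
	"fmt"
	"strings"
)

// explodePath is the original implementation.
func explodePath(path string) []string {
	s := strings.Split(path, "/")
	var r []string
	for _, str := range s {
		if str != "" {
			r = append(r, str)
		}
	}
	return r
}

// explodePathFieldsFunc is the adopted implementation.
func explodePathFieldsFunc(path string) []string {
	return strings.FieldsFunc(path, func(c rune) bool { return c == '/' })
}

// sameElements compares element by element, so a nil slice and an
// empty non-nil slice count as equal.
func sameElements(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	for _, p := range []string{"", "/", "///", "/foo", "/foo/bar", "/foo/bar/baz/"} {
		fmt.Println(p, sameElements(explodePath(p), explodePathFieldsFunc(p)))
	}
}
```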

Measuring Effectiveness

Check the results after improving the implementation of explodePath.

Benchmarks

# Before
goos: darwin
goarch: amd64
pkg: github.com/bmf-san/goblin
cpu: VirtualApple @ 2.50GHz
BenchmarkStatic1         5072353               240.1 ns/op           128 B/op          4 allocs/op
BenchmarkStatic5         2491546               490.0 ns/op           384 B/op          6 allocs/op
BenchmarkStatic10        1653658               729.6 ns/op           720 B/op          7 allocs/op
BenchmarkWildCard1       1602606               747.3 ns/op           456 B/op          9 allocs/op
BenchmarkWildCard5        435784              2716 ns/op            1016 B/op         23 allocs/op
BenchmarkWildCard10       246729              5033 ns/op            1680 B/op         35 allocs/op
BenchmarkRegexp1         1647258               733.2 ns/op           456 B/op          9 allocs/op
BenchmarkRegexp5          456652              2641 ns/op            1016 B/op         23 allocs/op
BenchmarkRegexp10         251998              4780 ns/op            1680 B/op         35 allocs/op
PASS
ok      github.com/bmf-san/goblin       14.304s

# After
go test -bench=. -cpu=1 -benchmem -count=1
goos: darwin
goarch: amd64
pkg: github.com/bmf-san/goblin
cpu: VirtualApple @ 2.50GHz
BenchmarkStatic1        10310658               117.7 ns/op            32 B/op          1 allocs/op
BenchmarkStatic5         4774347               258.1 ns/op            96 B/op          1 allocs/op
BenchmarkStatic10        2816960               435.8 ns/op           176 B/op          1 allocs/op
BenchmarkWildCard1       1867770               653.4 ns/op           384 B/op          6 allocs/op
BenchmarkWildCard5        496778              2484 ns/op             928 B/op         13 allocs/op
BenchmarkWildCard10       258508              4538 ns/op            1560 B/op         19 allocs/op
BenchmarkRegexp1         1978704               608.4 ns/op           384 B/op          6 allocs/op
BenchmarkRegexp5          519240              2394 ns/op             928 B/op         13 allocs/op
BenchmarkRegexp10         280741              4309 ns/op            1560 B/op         19 allocs/op
PASS
ok      github.com/bmf-san/goblin       13.666s

Comparing the results before and after the change, every benchmark shows an improvement.

Profiling

pprof's graph.

pprof_graph_after

pprof's top.

pprof_top_after

You can see that the bottleneck has moved to strings.FieldsFunc, which is called inside explodePath.

Further Improvements

Here is the tag released after making other improvements to goblin.
6.0.0

Unfortunately, I have not made any fundamental improvements to the data structure and algorithms, so there are no remarkable gains beyond this.

I have a feeling that the data structure and algorithm goblin uses now still leave room for improvement. (I have seen other routers that use more advanced trees, so I suspect that is where the gap is...)

This is somewhat off topic, but I created a benchmarker to compare goblin with other routers and look for hints for improvement.

bmf-san/go-router-benchmark

It was interesting to compare them and see how much the results vary.

I would like to improve goblin by studying other router implementations and learning about more advanced tree structures, which I did not manage to do this time.

Summary

  • Go makes benchmarking and profiling easy.
  • Don't guess, measure!
  • It is hard to get big results with small improvements (that's true)
