Juanito Fatas

Posted on Feb 26, 2020 • Originally published at juanitofatas.com

Optimization Techniques by Benchmark Winners

#ruby #optimization #benchmark

This post is based on Jeremy Evans‘s Optimization Techniques Used by the Benchmark Winners presentation. Slides | YouTube at Ruby Kaigi 2019. It‘s better to watch his fabulous presentation, but you can come back here for written references. These techniques coming from maintaining Sequel and Roda. These two gems have always been 0 issues. It‘s probably the best maintained gems. Sequel and Roda are the leaders for Ruby in TechEmpower benchmark results. Let‘s see what lessons can we learn to optimize Ruby.

These techniques may not be the best to write at application code level but good when building libraries, optimize critical path of your application. The ideas and principles are universally applicable.

Avoid processing

2:24

No code is faster than no code
-- Merb Motto

Fastest code does the least. Execute less code. Run code once instead of many times, once is better than many. In big-o terms: O(1) > O(n).

Take initialize model instance as example, the Sequel is definitely faster than Active Record because it does much less things.

Active Record:

# https://github.com/rails/rails/blob/8ae8e19a33f4749e529b1649f4f0ace1dbb37285/activerecord/lib/active_record/core.rb#L357-L369
def init_with_attributes(attributes, new_record = false) # :nodoc:
  @new_record = new_record
  @attributes = attributes

  init_internals

  yield self if block_given?

  _run_find_callbacks
  _run_initialize_callbacks

  self
end

Sequel:

# https://github.com/jeremyevans/sequel/blob/4a146796f273ce14e121954ad4e7f3134209386d/lib/sequel/model/base.rb#L221
def call(values)
  o = allocate
  o.instance_variable_set(:@values, values)
  o
end

Delay Computations

For web app, it means to do at initialization time is better than doing at Runtime. See Hound applies this for putting environment variables into constants.

Example: Active Record & Sequel Callbacks

Active Record runs find and initialize callbacks during initialization. Even when there are no callbacks defined on the model, so it slows down:

# lib/active_record/core.rb
def init_with_attributes(attributes, new_record = false)
  ...

  _run_find_callbacks
  _run_initialize_callbacks

  ...
end

Sequel does this by having a Plugin System¹. In Sequel, if you need callbacks, add plugin :after_initialize to your model. Nothing gets run if you‘re not using callbacks.

Inheritance gives every subclass things that they may not need. Prefer Composition over Inheritance. Plugins give all subclass a chance to only take what they need. Sequel and Roda use plugins for most features so they are faster.

Reduce Object Allocations

10:28

The init_internals in Active Record set few instance variables to nil:

def init_internals
  @primary_key              = self.class.primary_key
  @readonly                 = false
  @destroyed                = false
  @marked_for_destruction   = false
  @destroyed_by_association = nil
  @_start_transaction_state = nil
  @transaction_state        = nil

  self.class.define_attribute_methods
end

Jeremy found avoid @instance_variable = nil resulted in 150% faster in Sequel and Roda. This is controversial because only in verbose mode, using instance variable without define first will emit warnings. Emit warnings makes it slower. He made a patch to remove such warnings but got rejected for backward compatibility.

You can find another example of delay allocation of instance variables in this commit from rails/rails. Delay setting instance variable to hash by setting it to nil first.

Another common place to reduce object allocations is string. Before Ruby 2.3, we need to call freeze on strings. Since 2.3, we can add a frozen string literal in a file and by default all strings in that file are frozen.

Reduce String Allocations

Before frozen string literal (Ruby < 2.3), introduce a constant for common places of strings SELECT = 'SELECT'.freeze was okay:

SELECT = 'SELECT'.freeze
SPACE = ' '.freeze
FROM = 'FROM'.freeze

def select_sql
  sql = String.new
  sql ‌<< SELECT ‌<< SPACE
  sql ‌<< literal(columns)
  sql ‌<< SPACE <<‌ FROM <<‌ SPACE
  sql ‌<< literal(table)
end

Since Ruby 2.3:

# frozen-string-literal: true

def select_sql
  sql = String.new
  sql << "SELECT "
  sql << literal(columns)
  sql << " FROM "
  sql << literal(table)
end

It‘s simpler and easier to understand. Avoids string allocations in the meantime with less operations (<<), hence faster.

Reduce Hash allocations

Avoid default hash in method definition:

class Sequel::Dataset
  def union(dataset, opts={})
    compound_clone(:union, dataset, opts)
  end
end

Introduce an empty frozen hash constant:

class Sequel::Dataset
  OPTS = {}.freeze

  # 100% faster than `opts={}`.
  def union(dataset, opts=OPTS)
    compound_clone(:union, dataset, opts)
  end

  # Keyword argument is slower than Optimal hash constant.
  def union(dataset, **opts)
    compound_clone(:union, dataset, opts)
  end
end

Right now double splat keyword arguments are slower (2.7) than above empty frozen hash constant technique. Keyword arguments also have maintenance barriers the method caller and callee all need to change when keyword changes.

Reduce Proc Allocation

15:03

For proc does not depend on something else or any runtime state:

def indifferent_params
  Hash.new { |h, k| h[k.to_s] if k.is_a?(Symbol) }
end

can extract into a constant:

IND = proc { |h, k| h[k.to_s] if k.is_a?(Symbol) }
def indifferent_params2
  Hash.new(&IND)
end

About 300%+ faster.

Benchmark.ips do |x|
  x.report("inline proc") { indifferent_params }
  x.report("constant proc") { indifferent_params2 }
  x.compare!
end

Comparison:
       constant proc:  5645873.6 i/s
         inline proc:  1469374.7 i/s - 3.84x  slower

Reduce Array Allocation

This is not from the presentation but I think it‘s great to also include an example here.

Use transform_values when modifying hashes. See this commit from rails/rails:

    def escape(params)
-     Hash[params.map { |k, v| [k, Rack::Utils.escape(v)] }]
+     params.transform_values { |v| Rack::Utils.escape(v) }
    end

Minimize Indirection

Say a common operation in our app is to convert a string into Integer. We could have a lambda, a kernel method, a dedicated object with call to do so, or an object that aliased call to Integer():

string = "42"

using_lambda = lambda { |str| Integer(str) }
using_kernel = Kernel.method(:Integer)
using_object = Object.new
def using_object.call(str); Integer(str); end
no_indirection_object = Object.new
class << no_indirection_object
  alias call Integer
  public :call
end

Benchmark.ips do |x|
  x.report("lambda") { using_lambda.call(string) }
  x.report("Kernel") { using_kernel.call(string) }
  x.report("Object") { using_object.call(string) }
  x.report("Object v2") { no_indirection_object.call(string) }
  x.compare!
end

Comparison:
  Object v2:  6258404.2 i/s
  Object:     5674514.7 i/s - same-ish: difference falls within error
  lambda:     5199998.6 i/s - 1.20x  slower
  Kernel:     5130274.6 i/s - 1.22x  slower

We see using Kernel.method is slowest here. A lambda is slightly faster than that. Define a method on an object is faster than calling a lambda. Avoid indirection (by alias call to Integer) further makes it ~10% faster. The benchmark example above aligns with Jeremy‘s observations on benchmarking Sequel from the presentation. A similar observation is also found from graphql-ruby land that they‘re now in favor of method call instead of using procs.

Defining Methods

19:02

def foo
  1
end

# 50% slower than `def foo; 1; end`
define_method(:foo) do
  1
end

So we always want to define method with def. Sometimes we want to define at runtime. Take Sequel as example, it needs to dynamically define model columns‘s setters and getters, can achieve with class_eval + def.

# def_column
columns.each do |column|
  class_eval("def #{column}; @values[:#{column}]; end")
  class_eval("def #{column}=(v); @values[:#{column}] = v; end")
end

This works for column name without spaces "name", with space "employee name" it does not work. We need to use define_method:

columns.each do |column|
  column = column.to_sym
  define_method(column) do
    @values[column]
  end

  define_method(:"#{column}=") do |v|
    @values[column] = v
  end
end

which is slower than using class_eval + def.

These columns with "bad" names are not the majority of the case but changing the implementation to this hurts performance. Here comes another general optimization principle.

Separate Common from Uncommon

21:18

We usually have a fast approach for the simple cases but it fails in more complex cases. Assume the simple cases are more common, we can speed up by separating the two cases. Sequel separates the common from the uncommon, to have the best of both worlds. Below are simplified from Sequel‘s column definitions:

def def_column_accessor(*columns)
  columns, bad_columns = columns.partition { |x| /\A[A-Za-z_][A-Za-z0-9_]*\z/.match(x.to_s) }

  bad_columns.each{|x| def_bad_column_accessor(x)} # define_method
  columns.each do |column|
    # faster with `def`
  end
end

In general def > define_method, but there is one case where define_method is preferred for performance. Let‘s see an example:

class Def
  def self.def_numbers(first, last)
    class_eval("def numbers; (#{first}..#{last}).to_a.freeze; end")
  end

  def self.def_numbers2(first, last)
    array = (first..last).to_a.freeze
    define_method(:numbers2) do
      array
    end
  end
end

Benchmark.ips do |x|
  x.report("def") { Def.def_numbers(1, 10) }
  x.report("define_method") { Def.def_numbers2(1, 10) }
  x.compare!
end

Comparison:
       define_method:   542551.0 i/s
                 def:    62085.7 i/s - 8.74x  slower

So prefer def over define_method unless you can access local variables from surrounding scope to avoid computations, only then define_method > def.

Related to this. If we need to accept blocks and store blocks, and use instance_exec to execute these blocks. Let‘s implement a before hook to demonstrate this idea.

def self.before(&block)
  before_hooks << block
end

def before
  # execute all blocks passed to self.before
  self.class.before_hooks.each { |block| instance_exec(&block) }
end

This works but instance_exec creates singelton class for the instance, it‘s faster to use methods. We can define before hooks methods by naming them based on their position in the array, then we pass to define_method:

def self.before(&block)
  meth = :"_before_hook_#{before_hooks.length}"
  define_method(meth, &block)
  before_hooks << meth
end

def before
  self.class.before_hooks.each { |m| send(m) }
end

Execute these before hooks could be a send which is faster. This is good but we can do better. Since we know which methods will be executed, we can define the before to execute these before hook methods by:

def self.before(&block)
  # ...

  class_eval("def before; #{before_hooks.join(';')}; end")
end

This is faster because it avoids the need to call each on before hooks array and each method invoked directly instead of one indirection send. But we can still do better.

def self.before(&block)
  # ...

  if before_hooks.length > 1
    class_eval("def before; #{before_hooks.join(';')}; end")
  else
    class_eval("alias before #{before_hooks.first}")
  end
end

If there is only a single before hook, we can alias to that method hence minimized indirection.

This approach to implementing before hooks compare to using define_method, is about 200% faster because it avoids a lot of internal indirections. One caveat of this approach is that it does not work with before { |x| }. But we can fix this by:

unless block.arity == 0 || block.arity == -1
  b = block
  block = lambda { instance_exec(&b) }
end

Optimize Inner Loops

26:16

Another best place to start optimizing is inside any inner loops. Let‘s see an example from Sequel‘s Anywhere adapter:

# https://github.com/jeremyevans/sequel/blob/ae50b5e98efb3ad902fb37f370cbf237644a12d3/lib/sequel/adapters/sqlanywhere.rb#L157-L187
while db.api.sqlany_fetch_next(rs) == 1
  max_cols = db.api.sqlany_num_cols(rs)
  h2 = {}
  max_cols.times do |cols|
    h2[col_map[db.api.sqlany_get_column_info(rs, cols)[2]]||db.api.sqlany_get_column_info(rs, cols)[2]] =
      cps[db.api.sqlany_get_column_info(rs, cols)[4]].nil? ?
          db.api.sqlany_get_column(rs, cols)[1] :
            db.api.sqlany_get_column_info(rs, cols)[4] != 500 ?
              cps[db.api.sqlany_get_column_info(rs, cols)[4]].call(db.api.sqlany_get_column(rs, cols)[1]) :
                convert ? cps[db.api.sqlany_get_column_info(rs, cols)[4]].call(db.api.sqlany_get_column(rs, cols)[1]) :
                  db.api.sqlany_get_column(rs, cols)[1]
  end
  yield h2
end unless rs.nil?

And there are some hidden ternary operators without parentheses...So this is complex, but not the reason why it‘s slow. It‘s slow because:

db.api.sqlany_get_column_info(rs, cols) repeated 5 times with same arguments in the inner loop. Another one is db.api.sqlany_get_column(rs, cols)[1] repeated 4 times. After a serious refactoring, now we arrive at:

def fetch_rows(sql)
  db = @db
  cps = db.conversion_procs
  api = db.api
  execute(sql) do |rs|
    convert = convert_smallint_to_bool
    col_infos = []
    api.sqlany_num_cols(rs).times do |i|
      _, _, name, _, type = api.sqlany_get_column_info(rs, i)
      cp = if type == 500
        cps[500] if convert
      else
        cps[type]
      end
      col_infos << [output_identifier(name), cp]
    end

    self.columns = col_infos.map(&:first)
    max = col_infos.length

    if rs
      while api.sqlany_fetch_next(rs) == 1
        i = -1
        h = {}
        while (i+=1) < max
          name, cp = col_infos[i]
          v = api.sqlany_get_column(rs, i)[1]
          h[name] = cp && v ? cp.call(v) : v
        end
        yield h
      end
    end
  end
  self
end

The inner loop now becomes much simpler:

while (i+=1) < max
  name, cp = col_infos[i]
  v = api.sqlany_get_column(rs, i)[1]
  h[name] = cp && v ? cp.call(v) : v
end

All operations inside are stored on local variables. This seems not a big deal but if you are retrieving 100,000 rows with each row of 10 columns, this makes a big difference. By define local variables, this saves millions of method calls. This is another general optimization principle.

Prefer Local Variables

29:02

Prefer local variables whenever possible especially in inner loop as demostrated above.

Local variables > Instance variables
Local variables > Constant
Local variables > Method Calls

local variables are faster because it minimize the number of indirections.

while v.s. each

Let‘s go back to the above refactoring example. The deliberate use of while in the inner loop:

while (i+=1) < max
  name, cp = col_infos[i]
  v = api.sqlany_get_column(rs, i)[1]
  h[name] = cp && v ? cp.call(v) : v
end

Inner loop like this is one of the few places that make sense to use while instead of each because using each for inner loops can hurt performance, because it requires a seperate a stackframe to push and pop for each iteration.

Choose Faster Algorithms

31:07

Sinatra routes (by pattern) v.s. Roda‘s routing tree

Given thousands of routes. Sinatra will be busy matching against all routes. O(N) Roda‘s routing tree would take route segments. Filter out the segment to the right tree. O(log(N))

Roda.route do |r|
  r.on "foo" do
    # /foo ...
  end
  r.on "bar" do
    # /bar ...
  end
end

But the worst case is still O(N) and Jeremy does not accept the worst case being O(n)... Here comes multi route:

Roda.plugin :multi_route

Roda.route("foo") { |r| # /foo... }
Roda.route("bar") { |r| # /bar... }

Roda.route { |r| r.multi_route }

Multi route builds a regexp to match initialial segment (/foo, /bar, ...) then dispatch:

Roda.route { |r| r.multi_route }

which brings the performance back to O(log(n)). But he wants O(1)...

Roda.plugin :static_routing

Roda.static_get("foo") { |r| # /foo... }
Roda.static_get("bar") { |r| # /bar... }

Roda.route { |r| }

But O(1) gives up many advantages. In the end, he created hash_routes plugin to have the best of both worlds:

Roda.plugin :hash_routes

Roda.hash_routes do
  on("foo") do |r|
    r.hash_routes
  end
  is("foo/bar") do |r|

  end

Roda.route do |r|
  r.hash_routes
end

Cache when Possible

39:40

Introduce cache is the highest ratio of the following formula:

  % Increase in Performance
-----------------------------
    Lines of Code Changed

Maximum performance increase over minimum lines of code changed.

Use case: Sequel Symbol Literalization

40:00

:column_name
# => '"column_name"'

def self.split_symbol(sym)
  v = case s = sym.to_s
  when /\A((?:(?!__).)+)__((?:(?!___).)+)___(.+)\z/
    [$1.freeze, $2.freeze, $3.freeze].freeze
  when /\A((?:(?!___).)+)___(.+)\z/
    [nil, $1.freeze, $2.freeze].freeze
  when /\A((?:(?!__).)+)__(.+)\z/
    [$1.freeze, $2.freeze, nil].freeze
  else
    [nil, s.freeze, nil].freeze
  end

  v
end

Introduce symbol cache:

SPLIT_SYMBOL_CACHE = {}

def self.split_symbol(sym)
  unless v = Sequel.synchronize { SPLIT_SYMBOL_CACHE[sym] }
    v = case s = sym.to_s
    when /\A((?:(?!__).)+)__((?:(?!___).)+)___(.+)\z/
      [$1.freeze, $2.freeze, $3.freeze].freeze
    when /\A((?:(?!___).)+)___(.+)\z/
      [nil, $1.freeze, $2.freeze].freeze
    when /\A((?:(?!__).)+)__(.+)\z/
      [$1.freeze, $2.freeze, nil].freeze
    else
      [nil, s.freeze, nil].freeze
    end
    Sequel.synchronize { SPLIT_SYMBOL_CACHE[sym] = v }
  end
  v
end

This is over 1000% faster after introduced this cache.

Globally Frozen, Locally Mutable

41:50

Freeze global state like classes, objects that is accessed by multiple threads. Local objects instantiated per request remain mutable for ease of use (cache). This approach is mainly to improve reliability (thread safe). This also improves performance because frozen object means it is easy to cache them.

Note that Frozen is not the same as Immutable. Frozen means Immutable State + Mutable Cache.

# https://github.com/jeremyevans/sequel/blob/cc5c1612fb4c42e7cbf6db6b0572001fd5bf6b58/lib/sequel/dataset/misc.rb#L25-L30
OPTS = {}
def initialize(db)
  @db = db
  @opts = OPTS # state in frozen hash
  cache = {} # <-- mutable cache
  freeze # <-- freeze this object
end

Make sure access to mutable cache is protected by Mutex for thread safety.

Optimization Through Metaprogramming

46:11

class Album < Sequel::Model
  dataset_module do
    def by_name
      order(:name)
    end

    def released
      where(released: true)
    end
  end
end

# Album.where(released: true).order(:name).first
# => Album.released.by_name.first

This is good to DRY up code but does not improve performance. Jeremy introduced metaprogramming for users to define like following and speed up by introduced dataset cache to these metaprogramming generated methods:

class Album < Sequel::Model
  dataset_module do
    order :by_name, :name # def by_name
                          #   cache { order(:name) }
                          # end

    where :released, released: true
    # def released
    #   cache { where(released: true) }
    # end
  end
end

Example: Roda String Matching

48:41

Roda was forked from Cuba. The matching of routing segment was:

def match_string(str)
  consume(Regexp.escape(str))
end

# https://github.com/soveran/cuba/blob/1dce54d0926c5d6e5daab033a8e9685c6faa4227/lib/cuba.rb#L214-L226
def consume(pattern)
  matchdata = env[Rack::PATH_INFO].match(/\A\/(#{pattern})(\/|\z)/)

  return false unless matchdata

  path, *vars = matchdata.captures

  env[Rack::SCRIPT_NAME] += "/#{path}"
  env[Rack::PATH_INFO] = "#{vars.pop}#{matchdata.post_match}"

  captures.push(*vars)
end
private :consume

This code looks perfectly fine. But let‘s see how Jeremy optimized it. First is to avoid allocations. The first change was to include the proceeding slash in the first capture to avoid extra array allocation for the captures:

- /\A\/(#{pattern})(\/|\z)/
+ /\A(\/(#{pattern}))(\/|\z)/

...

- path, *vars = matchdata.captures
+ vars = matchdata.captures

- env[Rack::SCRIPT_NAME] += "/#{path}"
+ env[Rack::SCRIPT_NAME] += vars.shift

Next is to use a regular expression feature, positive lookahead assertion instead of a capture:

- /\A(\/(#{pattern}))(\/|\z)/
+ /\A(\/(#{pattern}))(?=\/|\z)/

So we no longer need to pop the last element of capture variables (vars):

- env[Rack::PATH_INFO] = "#{vars.pop}#{matchdata.post_match}"
+ env[Rack::PATH_INFO] = matchdata.post_match

Also avoid the allocation of additional string by using post match directly.

He then profile and found generating a new regular expression every time consume was called takes a large portion of time. Then he introduced cached matcher:

   def match_string(str)
-    consume(Regexp.escape(str))
+    consume(self.class.cached_matcher(str) { Regexp.escape(str) })
   end

This change is possible because consume was a private method. Another optimizations technique is to keep most methods private.

Keep Most Methods Private

Only makes a method public if it needs to be public. When method is private, you are free to change its API to improve performance. Public methods have limit options to optimize.

Prefer String operations over Regexp operations

In Roda 3.0, the matching of string segment changed to use string operations instead of regular expression matching:

def match_string(str)
  rp = @remaining_path
  if rp.start_with?("/#{str}") # <----
    last = str.length + 1
    case = rp[last] # <----
    when "/"
      @remaining_path = rp[last, rp.length] # <----
    when nil
      @remaining_path = ""
    end
  end
end

But still, there are two String allocations: rp.start_with?("/#{str}") and rp[last].

Prefer Integer operations over String operations

Iterating by index and checking if first character of string is 47 (47 is ascii code of /):

def _match_string(str)
  rp = @remaining_path
  length = str.length

  match = case rp.rindex(str, length) # <------
  when nil
    return
  when 1
    rp.getbyte(0) == 47 # <--- start_with?("/")
  else
    length == 0 && rp.getbyte(0) == 47 # <--- start_with?("/")
  end

  if match
    length += 1
    case rp.getbyte(length)
    when 47
      @remaining_path = rp[length, 100000000]
    when nil
      @remaining_path = ""
    end
  end
end

The most common failure case when nil also moved up because it happens more often. Another trick is to use a large number that is larger than any reasonable path length: rp[length, 100000000].

Finally...

Remember that optimizations come last.
First make it work, then make it correct.
After you make it fun, then you can make it fast.
—Jeremy Evans

Profile to find where to optimize. Then Benchmark your change.

That‘s all from Jeremy‘s presentation.

There is no silver bullet. It depends on the context. We must try ourselves. Verify it‘s faster than the other instead of directly quoting anything here.

The faults are probably mine. I‘m happy to fix if you would let me know. You can reach out to him on twitter for more questions: @jeremyevans0 or GitHub to see more of his work.

Thanks for reading!

Originally published on juanitofatas.com on January 27th, 2020.

Learn more about the plugin system of Sequel and Roda from Janko Marohnić. ↩

DEV Community