Toxic Elephant

Don't bury it in your back yard!

Some thoughts on Ruby's speed

Posted by matijs 02/03/2013 at 16h42

Yesterday, I read Alex Gaynor’s slides on dynamic language speed. It’s an interesting argument, but I’m not totally convinced.

At a high level, the argument is as follows, it seems:

  • For a comparable algorithm, Ruby et al. do much more work behind the scenes than ‘fast’ languages such as C.
  • In particular, they do a lot of memory allocation.
  • Therefore, we should add tools to those languages that allow us to do memory allocation more efficiently.

Allocation

As an example, the creation of an array of squared integers is presented, where the runtime needs to dynamically allocate ever larger arrays to hold its result.

I can see the argument there, but I’m not convinced the solution is correct, for the example given. In particular, how much time is spent allocating and re-allocating the array of object references, and how much is spent elsewehere?

<typo:code lang=“ruby”> require ‘benchmark’

def doubles_unallocated n buf = [] i = 1; j = 0 while j < n do buf[i - 1] = i * i i += 1; j += 1 end buf end

def doubles_preallocated n buf = Array.new n i = 1; j = 0 while j < n do buf[j] = i * i i += 1; j += 1 end buf end

n = 10000

Benchmark.bmbm do |x| x.report(“unallocated”) { n.times { doubles_unallocated 1000 } } x.report(“preallocated”) { n.times { doubles_preallocated 1000 } } end </typo:code>

A typical result on Ruby 1.9.3 would be (I’m not showing the rehearsal section here):

                   user     system      total        real
unallocated    1.480000   0.020000   1.500000 (  1.493959)
preallocated   1.370000   0.030000   1.400000 (  1.404668)

Taking several runs into account, I conclude that pre-allocating the array shaves off 7 to 13 percent of the runtime.

Algorithms

The algorithms used above are obviously not idiomatic Ruby, but they have the advantage of allowing a comparison of allocated and unallocated arrays. Normally, I would probably use map with a range. Similarly, the Python code presented in the slides looks odd to me. I would have expected something using list comprehensions.

Here is the idiomatic Ruby implementation:

<typo:code lang=“ruby”> def doubles_map n (1..n).map {|i| i * i } end </typo:code>

And here is the Ruby equivalent of the Python example given (using each versus for gives no difference in speed):

<typo:code lang=“ruby”> def doubles_each n buf = [] (1..n).each do |i| buf << i * i end buf end </typo:code>

For Ruby, doubles_map is the idiomatic implementation. It also has the advantage that a dedicated implementation of Range#map would be able to allocate the result array in one go, transparantly avoiding the array growing overhead. Using list comprehensions in Python would probably have the same advantage.

Unfortunately, doubles_map is also a lot slower than doubles_each, except on MRI 1.8. Here is the benchmark I ran with a fixed-array version thrown in for good measure:

<typo:code lang=“ruby”> ARR = (1..1000).to_a def doubles_fixed_array ARR.map {|i| i * i } end

n = 10000

Benchmark.bmbm do |x| x.report(“using map”) { n.times { doubles_map 1000 } } x.report(“using each”) { n.times { doubles_each 1000 } } x.report(“using a fixed array”) { n.times { doubles_fixed_array } } end </typo:code>

Here is the result for MRI 1.9.3:

                          user     system      total        real
using map             1.220000   0.000000   1.220000 (  1.217607)
using each            1.070000   0.020000   1.090000 (  1.091302)
using a fixed array   0.940000   0.020000   0.960000 (  0.969685)

And for MRI 1.8.7:

                          user     system      total        real
using map             2.790000   0.000000   2.790000 (  2.795956)
using each            3.630000   0.010000   3.640000 (  3.644272)
using a fixed array   2.120000   0.010000   2.130000 (  2.133636)

A better Range#map

These benchmark results made me wonder how Range#map is implemented. Since it is written in Ruby, I checked the Rubinius implementation, and there at least the problem seems due to the lack of a dedicated Range#map implementation: The generic Enumerable#map is used, effectively turning it into doubles_each. I’m not entirely sure why it is actually slower, perhaps that’s due to the double yield involved and/or argument splatting.

Here are the results of the previous benchmark, run on Rubinius 2.0.0-rc1:

                          user     system      total        real
using map             1.312082   0.004000   1.316082 (  1.316717)
using each            0.996062   0.000000   0.996062 (  0.997418)
using a fixed array   0.720045   0.000000   0.720045 (  0.724615)

I made the following simple dedicated implementation for Range#map. It falls back to the default Enumerable#map for most cases, except for Fixnums:

<typo:code lang=“ruby”> class Range def map if block_given? && Fixnum === @begin first, last = @begin, @end last -= 1 if @excl size = last - first + 1

  out = Array.new size
  out_tuple = out.tuple

  i = first
  j = 0
  while i <= last
    out_tuple[j] = yield i
    i += 1
    j += 1
  end

  out
else
  super
end

end end </typo:code>

This makes doubles_map run slightly faster than the fixed array version:

                          user     system      total        real
using map             0.648040   0.000000   0.648040 (  0.650103)
using each            0.928058   0.000000   0.928058 (  0.930109)
using a fixed array   0.680042   0.004001   0.684043 (  0.686719)

Benchmarking Still Needed

While examining these options, I found that there are many ways to write this code, each subtly different, each with a different performance on different Rubies (the examples using while are the slowest on MRI 1.9.3, but the fastest on Rubinus). So, if you really want your code to be fast, you still have to benchmark, you still have to know which Ruby implementation your code will run on, and you still have to try all the options.

Tags , , , no comments no trackbacks

Comments

Comments are disabled