Use `alloca` to improve performance of thread creation. by ioquatix · Pull Request #2227 · ruby/ruby · GitHub

Use alloca to improve performance of thread creation. #2227


Closed
wants to merge 17 commits into from


ioquatix
Member
@ioquatix ioquatix commented Jun 5, 2019

This avoids the need for vm_stack allocation per thread, which improves performance and can lead to code simplification (removal of stack recycling code).
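As a rough illustration of the idea (not the code in this PR), the sketch below carves a per-thread "VM stack" out of the new thread's own machine stack with `alloca`, so there is no separate heap allocation to make, free, or recycle. All `demo_*` names and the stack size are hypothetical, and it assumes a POSIX system with pthreads and `<alloca.h>`.

```c
#include <alloca.h>   /* assumption: a platform that ships <alloca.h>, e.g. Linux */
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical size for illustration only; not Ruby's real VM stack size. */
enum { DEMO_VM_STACK_WORDS = 16 * 1024 };

typedef struct {
    size_t words;     /* requested VM stack size, in words */
    size_t checksum;  /* result written back by the thread */
} demo_thread_args;

/* Stand-in for "run the interpreter on this VM stack". */
static size_t demo_run(unsigned long *stack, size_t words)
{
    size_t sum = 0;
    for (size_t i = 0; i < words; i++) {
        stack[i] = i;
        sum += stack[i];
    }
    return sum;
}

static void *demo_thread_start(void *ptr)
{
    demo_thread_args *args = ptr;

    /* The VM stack lives in this thread's machine stack frame: no malloc,
     * no free. It is released automatically when this function returns. */
    unsigned long *vm_stack = alloca(args->words * sizeof(*vm_stack));

    args->checksum = demo_run(vm_stack, args->words);
    return NULL;
}

/* Spawn one thread, wait for it, and return what it computed. */
size_t demo_spawn(size_t words)
{
    demo_thread_args args = { words, 0 };
    pthread_t thread;

    int err = pthread_create(&thread, NULL, demo_thread_start, &args);
    assert(err == 0);
    (void)err;

    pthread_join(thread, NULL);
    return args.checksum;
}
```

The trade-off in a sketch like this is that the VM stack must fit inside the thread's machine stack, so the requested size has to stay well below the thread's stack limit.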

@ioquatix
Member Author
ioquatix commented Jun 5, 2019

@ko1 here is a rough idea of performance characteristics:

koyoko% make benchmark COMPARE_RUBY=../build-master/local/bin/ruby ITEM=vm_thread
compiling ../thread.c
../revision.h unchanged
linking miniruby
/home/samuel/.rvm/rubies/ruby-2.6.1/bin/ruby --disable=gems -rrubygems -I../benchmark/lib ../benchmark/benchmark-driver/exe/benchmark-driver \
            --executables="compare-ruby::../build-master/local/bin/ruby -I.ext/common --disable-gem" \
            --executables="built-ruby::./miniruby -I../lib -I. -I.ext/common  ../tool/runruby.rb --extout=.ext  -- --disable-gems --disable-gem" \
            $(find ../benchmark -maxdepth 1 -name '*vm_thread*.yml' -o -name '*vm_thread*.rb' | sort) 
Calculating -------------------------------------
                       compare-ruby  built-ruby 
 vm_thread_alive_check     116.019k    111.584k i/s -     50.000k times in 0.430962s 0.448094s
       vm_thread_close        1.124       1.174 i/s -       1.000 times in 0.889566s 0.851551s
    vm_thread_condvar1        1.599       1.625 i/s -       1.000 times in 0.625273s 0.615311s
    vm_thread_condvar2        0.948       0.903 i/s -       1.000 times in 1.055386s 1.106969s
 vm_thread_create_join        1.130       1.021 i/s -       1.000 times in 0.884733s 0.979028s
      vm_thread_mutex1        2.847       2.872 i/s -       1.000 times in 0.351276s 0.348214s
      vm_thread_mutex2        2.752       2.848 i/s -       1.000 times in 0.363363s 0.351141s
      vm_thread_mutex3        0.997       1.057 i/s -       1.000 times in 1.002590s 0.945894s
  vm_thread_pass_flood        0.261       2.682 i/s -       1.000 times in 3.836219s 0.372829s
        vm_thread_pass        4.958       7.024 i/s -       1.000 times in 0.201675s 0.142368s
        vm_thread_pipe        4.941       4.472 i/s -       1.000 times in 0.202375s 0.223617s
       vm_thread_queue       12.179      12.256 i/s -       1.000 times in 0.082106s 0.081593s
vm_thread_sized_queue2        1.053       1.069 i/s -       1.000 times in 0.949686s 0.935150s
vm_thread_sized_queue3        1.041       1.005 i/s -       1.000 times in 0.961023s 0.994651s
vm_thread_sized_queue4        1.579       1.922 i/s -       1.000 times in 0.633163s 0.520230s
 vm_thread_sized_queue        5.556       5.250 i/s -       1.000 times in 0.179976s 0.190479s
       vm_thread_sleep      22.197k     36.656k i/s -      1.000k times in 0.045051s 0.027281s

Comparison:
              vm_thread_alive_check
          compare-ruby:    116019.4 i/s 
            built-ruby:    111583.7 i/s - 1.04x  slower

                    vm_thread_close
            built-ruby:         1.2 i/s 
          compare-ruby:         1.1 i/s - 1.04x  slower

                 vm_thread_condvar1
            built-ruby:         1.6 i/s 
          compare-ruby:         1.6 i/s - 1.02x  slower

                 vm_thread_condvar2
          compare-ruby:         0.9 i/s 
            built-ruby:         0.9 i/s - 1.05x  slower

              vm_thread_create_join
          compare-ruby:         1.1 i/s 
            built-ruby:         1.0 i/s - 1.11x  slower

                   vm_thread_mutex1
            built-ruby:         2.9 i/s 
          compare-ruby:         2.8 i/s - 1.01x  slower

                   vm_thread_mutex2
            built-ruby:         2.8 i/s 
          compare-ruby:         2.8 i/s - 1.03x  slower

                   vm_thread_mutex3
            built-ruby:         1.1 i/s 
          compare-ruby:         1.0 i/s - 1.06x  slower

               vm_thread_pass_flood
            built-ruby:         2.7 i/s 
          compare-ruby:         0.3 i/s - 10.29x  slower

                     vm_thread_pass
            built-ruby:         7.0 i/s 
          compare-ruby:         5.0 i/s - 1.42x  slower

                     vm_thread_pipe
          compare-ruby:         4.9 i/s 
            built-ruby:         4.5 i/s - 1.10x  slower

                    vm_thread_queue
            built-ruby:        12.3 i/s 
          compare-ruby:        12.2 i/s - 1.01x  slower

             vm_thread_sized_queue2
            built-ruby:         1.1 i/s 
          compare-ruby:         1.1 i/s - 1.02x  slower

             vm_thread_sized_queue3
          compare-ruby:         1.0 i/s 
            built-ruby:         1.0 i/s - 1.03x  slower

             vm_thread_sized_queue4
            built-ruby:         1.9 i/s 
          compare-ruby:         1.6 i/s - 1.22x  slower

              vm_thread_sized_queue
          compare-ruby:         5.6 i/s 
            built-ruby:         5.2 i/s - 1.06x  slower

                    vm_thread_sleep
            built-ruby:     36656.2 i/s 
          compare-ruby:     22197.1 i/s - 1.65x  slower

@ioquatix
Member Author
ioquatix commented Jun 5, 2019

I want to investigate the cases where it's slower in more detail.

@ioquatix
Member Author
ioquatix commented Jun 5, 2019

I just realised that since we are using C99, we can use a variable length array (VLA) instead of alloca. I guess it compiles to the same machine code, but code using a VLA might be easier to understand.
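For comparison, here is a minimal sketch of the two spellings side by side. The function names are hypothetical and the snippet assumes a compiler that supports both C99 VLAs and `<alloca.h>` (for example GCC or Clang on Linux). One difference worth noting: a VLA is released at the end of its enclosing block, while `alloca` memory lives until the calling function returns.

```c
#include <alloca.h>   /* assumption: platform provides <alloca.h> */
#include <assert.h>
#include <stddef.h>

/* Fill a scratch buffer with 1..n and return the total. */
static long fill_and_sum(long *buf, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        buf[i] = (long)(i + 1);
        total += buf[i];
    }
    return total;
}

long sum_with_vla(size_t n)
{
    assert(n > 0);      /* a zero-length VLA is undefined behaviour */
    long scratch[n];    /* C99 VLA: released when this block ends */
    return fill_and_sum(scratch, n);
}

long sum_with_alloca(size_t n)
{
    assert(n > 0);
    /* alloca: released when this function returns */
    long *scratch = alloca(n * sizeof(*scratch));
    return fill_and_sum(scratch, n);
}
```

Both functions take their scratch space from the current stack frame, so neither needs a matching free.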

@duerst
Member
duerst commented Jun 5, 2019

Actually, I think that while we have switched to C99, there are still a few off-limits features, and variable length arrays were one of them. I seem to remember the reason was Visual Studio, but I'm not completely sure.

@ioquatix
Member Author
ioquatix commented Jun 5, 2019

I've had to add a variable to track whether or not to free the stack, but in combination with the other PR for the fiber pool, this can be removed.
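The kind of bookkeeping described here might look like the following sketch. The struct, field, and function names are hypothetical and are not taken from the Ruby source; the point is only that a stack borrowed from the machine stack must never be passed to `free`, so the owner has to remember where it came from.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical execution context, for illustration only. */
typedef struct {
    unsigned long *stack;
    size_t words;
    bool owns_stack;  /* true only when stack came from malloc */
} demo_context;

/* Heap-backed stack: the context owns it and must free it later. */
bool demo_context_init_heap(demo_context *ctx, size_t words)
{
    ctx->stack = malloc(words * sizeof(*ctx->stack));
    ctx->words = ctx->stack ? words : 0;
    ctx->owns_stack = ctx->stack != NULL;
    return ctx->stack != NULL;
}

/* Borrowed stack (e.g. from alloca in the caller's frame): never freed here. */
void demo_context_init_borrowed(demo_context *ctx, unsigned long *stack, size_t words)
{
    assert(stack != NULL);
    ctx->stack = stack;
    ctx->words = words;
    ctx->owns_stack = false;
}

/* Returns true if this call actually released heap memory. */
bool demo_context_release(demo_context *ctx)
{
    bool freed = ctx->owns_stack;
    if (freed) {
        free(ctx->stack);
    }
    ctx->stack = NULL;
    ctx->words = 0;
    ctx->owns_stack = false;
    return freed;
}
```

If every stack came from a single source, the flag and the branch in the release function would no longer be needed.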

@ioquatix
Member Author
ioquatix commented Jun 5, 2019
1)
Thread#backtrace returns an array (which may be empty) immediately after the thread is created ERROR
NoMemoryError: too large allocation size
/home/travis/build/ruby/ruby/spec/ruby/core/thread/backtrace_spec.rb:30:in `backtrace'
/home/travis/build/ruby/ruby/spec/ruby/core/thread/backtrace_spec.rb:30:in `block (2 levels) in <top (required)>'
/home/travis/build/ruby/ruby/spec/ruby/core/thread/backtrace_spec.rb:3:in `<top (required)>'

Why did this happen? It seems like the VM stack is corrupted or not set up correctly.

@k0kubun
Member
k0kubun commented Jun 5, 2019

@duerst's comment is 100% correct. We support Visual Studio 2013 and VLA does not work on it.

#2064

Known missing features

  • Visual Studio 2013
    • variable length arrays

@ioquatix
Member Author
ioquatix commented Jun 5, 2019

Okay, back to alloca it is then!

@ioquatix
Member Author
ioquatix commented Jun 5, 2019

I didn't realise this, but alloca on Linux is a function call unless you #include <alloca.h>. After I did that, the benchmarks improved further. Here is the latest summary:

% make benchmark COMPARE_RUBY=../build-master/local/bin/ruby ITEM=vm_thread
compiling ../thread.c
../revision.h unchanged
linking miniruby
/home/samuel/.rvm/rubies/ruby-2.6.1/bin/ruby --disable=gems -rrubygems -I../benchmark/lib ../benchmark/benchmark-driver/exe/benchmark-driver \
            --executables="compare-ruby::../build-master/local/bin/ruby -I.ext/common --disable-gem" \
            --executables="built-ruby::./miniruby -I../lib -I. -I.ext/common  ../tool/runruby.rb --extout=.ext  -- --disable-gems --disable-gem" \
            $(find ../benchmark -maxdepth 1 -name '*vm_thread*.yml' -o -name '*vm_thread*.rb' | sort) 
Calculating -------------------------------------
                       compare-ruby  built-ruby 
 vm_thread_alive_check     119.475k    115.516k i/s -     50.000k times in 0.418499s 0.432840s
       vm_thread_close        1.303       1.174 i/s -       1.000 times in 0.767327s 0.852125s
    vm_thread_condvar1        1.633       1.623 i/s -       1.000 times in 0.612355s 0.616023s
    vm_thread_condvar2        0.881       0.934 i/s -       1.000 times in 1.134998s 1.070686s
 vm_thread_create_join        0.997       1.123 i/s -       1.000 times in 1.003258s 0.890117s
      vm_thread_mutex1        2.703       2.671 i/s -       1.000 times in 0.370027s 0.374386s
      vm_thread_mutex2        2.663       2.732 i/s -       1.000 times in 0.375542s 0.366081s
      vm_thread_mutex3        1.134       1.215 i/s -       1.000 times in 0.881789s 0.823044s
  vm_thread_pass_flood        0.246       3.451 i/s -       1.000 times in 4.059607s 0.289731s
        vm_thread_pass        0.277       0.271 i/s -       1.000 times in 3.606466s 3.692247s
        vm_thread_pipe        4.334       4.025 i/s -       1.000 times in 0.230725s 0.248435s
       vm_thread_queue        1.264       1.253 i/s -       1.000 times in 0.790944s 0.798180s
vm_thread_sized_queue2        1.149       1.053 i/s -       1.000 times in 0.870001s 0.950114s
vm_thread_sized_queue3        1.028       1.052 i/s -       1.000 times in 0.972292s 0.950296s
vm_thread_sized_queue4        1.714       1.875 i/s -       1.000 times in 0.583491s 0.533276s
 vm_thread_sized_queue        5.131       5.190 i/s -       1.000 times in 0.194879s 0.192666s
       vm_thread_sleep       1.229k     23.697k i/s -     10.000k times in 8.136779s 0.421992s

Comparison:
              vm_thread_alive_check
          compare-ruby:    119474.6 i/s 
            built-ruby:    115516.2 i/s - 1.03x  slower

                    vm_thread_close
          compare-ruby:         1.3 i/s 
            built-ruby:         1.2 i/s - 1.11x  slower

                 vm_thread_condvar1
          compare-ruby:         1.6 i/s 
            built-ruby:         1.6 i/s - 1.01x  slower

                 vm_thread_condvar2
            built-ruby:         0.9 i/s 
          compare-ruby:         0.9 i/s - 1.06x  slower

              vm_thread_create_join
            built-ruby:         1.1 i/s 
          compare-ruby:         1.0 i/s - 1.13x  slower

                   vm_thread_mutex1
          compare-ruby:         2.7 i/s 
            built-ruby:         2.7 i/s - 1.01x  slower

                   vm_thread_mutex2
            built-ruby:         2.7 i/s 
          compare-ruby:         2.7 i/s - 1.03x  slower

                   vm_thread_mutex3
            built-ruby:         1.2 i/s 
          compare-ruby:         1.1 i/s - 1.07x  slower

               vm_thread_pass_flood
            built-ruby:         3.5 i/s 
          compare-ruby:         0.2 i/s - 14.01x  slower

                     vm_thread_pass
          compare-ruby:         0.3 i/s 
            built-ruby:         0.3 i/s - 1.02x  slower

                     vm_thread_pipe
          compare-ruby:         4.3 i/s 
            built-ruby:         4.0 i/s - 1.08x  slower

                    vm_thread_queue
          compare-ruby:         1.3 i/s 
            built-ruby:         1.3 i/s - 1.01x  slower

             vm_thread_sized_queue2
          compare-ruby:         1.1 i/s 
            built-ruby:         1.1 i/s - 1.09x  slower

             vm_thread_sized_queue3
            built-ruby:         1.1 i/s 
          compare-ruby:         1.0 i/s - 1.02x  slower

             vm_thread_sized_queue4
            built-ruby:         1.9 i/s 
          compare-ruby:         1.7 i/s - 1.09x  slower

              vm_thread_sized_queue
            built-ruby:         5.2 i/s 
          compare-ruby:         5.1 i/s - 1.01x  slower

                    vm_thread_sleep
            built-ruby:     23697.1 i/s 
          compare-ruby:      1229.0 i/s - 19.28x  slower

@ko1 what do you think? Creating threads (e.g. vm_thread_sleep) is now about 20x faster (19.28x in this run) in the most optimistic scenario.
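To make the arithmetic behind that figure explicit, here is a small sketch that recomputes the rates from the vm_thread_sleep row above (10.000k iterations in 8.136779s for compare-ruby and 0.421992s for built-ruby). The helper names are hypothetical.

```c
#include <assert.h>

/* Iterations per second, as reported in the "i/s" columns. */
double iterations_per_second(double iterations, double seconds)
{
    assert(seconds > 0.0);
    return iterations / seconds;
}

/* How many times faster the candidate is than the baseline
 * when both ran the same number of iterations. */
double speedup(double baseline_seconds, double candidate_seconds)
{
    assert(candidate_seconds > 0.0);
    return baseline_seconds / candidate_seconds;
}
```

With these inputs, `iterations_per_second(10000.0, 0.421992)` is about 23697 i/s, `iterations_per_second(10000.0, 8.136779)` is about 1229 i/s, and `speedup(8.136779, 0.421992)` is about 19.28. Each benchmark ran only once, so these are single-sample measurements.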

@ioquatix
Member Author
ioquatix commented Jun 5, 2019

@duerst I tried VLA and found performance was better, so I checked the man page for alloca. If you don't use it carefully, it falls back to a function call rather than inline assembly. So I updated my usage and performance improved a bit. It was an unexpected, happy discovery.
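A minimal sketch of the "careful" usage, assuming glibc with GCC or Clang: the alloca(3) man page notes that when strict standards conformance is requested (for example `-std=c99`), the compiler may not translate `alloca` into its built-in unless `<alloca.h>` is included. The `demo_*` helpers below are hypothetical.

```c
#include <alloca.h>   /* with GCC on glibc, defines alloca(size) as __builtin_alloca(size) */
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Reports whether alloca is a preprocessor macro in this translation unit,
 * which is how <alloca.h> typically routes it to the compiler built-in. */
int demo_alloca_is_macro(void)
{
#ifdef alloca
    return 1;
#else
    return 0;
#endif
}

/* Copy a string into stack scratch space and measure it there. */
size_t demo_scratch_strlen(const char *text)
{
    assert(text != NULL);
    size_t size = strlen(text) + 1;
    char *scratch = alloca(size);   /* released when this function returns */
    memcpy(scratch, text, size);
    return strlen(scratch);
}
```

Whether the call is inlined can be checked by inspecting the generated assembly (for example with `gcc -S`) for a call to an `alloca` symbol.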

@ioquatix ioquatix changed the title Thread alloca Use alloca to improve performance of thread creation. Jun 5, 2019
@ioquatix
Member Author

@tenderlove if you have time I'd love your feedback on this, even if just briefly. If you are busy, don't worry about it :)

From your most recent work, I think we have some interests in common.

@ioquatix
Member Author

Okay, it's merged.

@ioquatix ioquatix closed this Jun 19, 2019
@ioquatix ioquatix deleted the thread-alloca branch June 19, 2019 08:40
@hsbt hsbt added the Backport label Sep 5, 2024