#1 √ resolved
Delano Mandelbaum

IO#select threading bug in Ruby 1.8

Reported by Delano Mandelbaum | August 27th, 2009 @ 11:36 PM | in 2.0.15

This was originally documented here:
https://capistrano.lighthouseapp.com/projects/8716/tickets/79-capis...

"Basically, what happens is, if you are running Ruby 1.8.7p160 through 1.8.7p174 (and possibly some later versions), and you have multiple threads calling IO#select on different and disjoint sets of IO objects, the calls may fail to return results even though there are results to return." -- Daniel Azuma

The fix is here:
http://www.daniel-azuma.com/blog/view/z2ysbx0e4c3it9/ruby_1_8_7_io_...
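
The scenario in the quote can be sketched roughly as below: several threads, each calling IO.select on its own pipe, so the sets of IO objects are disjoint. This is only an illustration, not Daniel's original test case. The pipe count and the 2-second timeout are arbitrary assumptions. On an unaffected interpreter every call should report its pipe as readable.

```ruby
require 'thread'

# Each thread gets its own pipe, so the sets of IO objects are disjoint.
pipes = Array.new(4) { IO.pipe }

# Make every read end readable before the threads start selecting.
pipes.each { |_reader, writer| writer.write("x") }

threads = pipes.map do |reader, _writer|
  Thread.new do
    # The timeout (an arbitrary choice) keeps the sketch from hanging
    # indefinitely if a select call fails to report the pending data.
    IO.select([reader], nil, nil, 2)
  end
end

# Thread#value joins the thread and returns the block's result.
results = threads.map(&:value)
missed = results.count(&:nil?)
puts "#{missed} of #{results.size} select calls returned nothing"
```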

Comments and changes to this ticket

  • Delano Mandelbaum

    Delano Mandelbaum August 26th, 2009 @ 06:53 PM

    • → State changed from “new” to “open”
    • → Milestone changed from “” to “2.0.14”

    The 2.0 branch in the Github repo has been updated with the fix.

    http://github.com/net-ssh/net-ssh/tree/2.0

  • Daniel Azuma

    Daniel Azuma August 26th, 2009 @ 08:52 PM

    I tried the modifications against my tests. Looks good.

  • adlongwell

    adlongwell August 27th, 2009 @ 02:52 PM

    I haven't done extensive digging yet, but this issue appears to still be happening for me, even when running the 2.0 branch from Github. I'm going to do some manual verification that the patch is present in my executing code, but just wanted to chime in to indicate that switching to the Github 2.0 branch doesn't appear to fix the issue all on its own.

  • Delano Mandelbaum

    Delano Mandelbaum August 27th, 2009 @ 03:56 PM

    Aaron, can you send the output for the following commands:

    $ ruby -v
    
    $ uname -a
    
    $ ruby -e 'puts $:'
    

    And from inside your git clone for net-ssh:

    $ git branch
    

    Thanks!

  • Daniel Azuma

    Daniel Azuma August 27th, 2009 @ 04:06 PM

    Well, I guess the patch is only going to catch one case. It seems to be sufficient for my use cases, but it may not be sufficient for others, especially for people who are doing any blocking reads/writes.

    The underlying Ruby bug happens whenever ANY concurrent IO.select calls are running. So if we want to be comprehensive, we perhaps should be wrapping ALL net-ssh's IO.select calls (not just that one) within the same mutex. There are, on a quick search, three more instances: one in ssh/buffered_io.rb, one in ssh/connection/session.rb, and one more in ssh/transport/packet_stream.rb.

    Delano, I suppose in the end it's up to you, but looking at it again after seeing @adlongwell's feedback, I think I'd recommend including the other calls to see if we can catch more cases. It should be a safe change as long as all the synchronize blocks share the same mutex.

  • Daniel Azuma

    Daniel Azuma August 27th, 2009 @ 06:52 PM

    Here's a possible modified patch (also includes the Ruby 1.9/JRuby check). I'm a bit of a github newbie, so forgive me if I screwed up on the branch management...

    http://github.com/dazuma/net-ssh/commit/b1c5aa662d5a25b4c5987a9adff...
    http://github.com/dazuma/net-ssh/tree/2.0

    I rechecked my test case and it still looks okay.

  • Lee Hambley

    Lee Hambley August 27th, 2009 @ 08:04 PM

    Daniel, excellent work on that patch man, that's some slick Ruby code!

  • Delano Mandelbaum

    Delano Mandelbaum August 27th, 2009 @ 11:36 PM

    Nice work Daniel, thanks for the expanded patch!

    I've imported it to my local 2.0 branch and run the test suite. It causes exactly one test to fail (so close!):

    $ /usr/bin/ruby -Ilib -Itest -rrubygems test/test_all.rb
    Skipping packet stream test for arcfour256
    Skipping packet stream test for arcfour512
    Skipping packet stream test for arcfour128
    Loaded suite test/test_all
    Started
    ...............................................................................
    ...............................................................................
    ......................................................................F........
    ...............................................................................
    ...............................................................................
    ...............................................................................
    ...............................................................................
    .................................................
    
    Finished in 4.271134 seconds.
    
      1) Failure:
    test_wait_for_pending_sends_should_write_multiple_times_if_first_write_was_partial(TestBufferedIo)
        [./test_buffered_io.rb:52:in `test_wait_for_pending_sends_should_write_multiple_times_if_first_write_was_partial'
         /Library/Ruby/Gems/1.8/gems/mocha-0.9.7/lib/mocha/integration/test_unit/gem_version_201_and_above.rb:20:in `run']:
    Exception raised:
    Class: <Mocha::ExpectationError>
    Message: <"unexpected invocation: IO.select(nil, [#<Mock:io>], nil, nil)\nunsatisfied expectations:\n- expected exactly once, not yet invoked: #<Mock:io>.send('ata', 0)\n- expected exactly once, not yet invoked: #<Mock:io>.send('me data', 0)\n- expected exactly twice, not yet invoked: IO.select(nil, [#<Mock:io>])\nsatisfied expectations:\n- expected exactly once, already invoked once: #<Mock:io>.send('here is some data', 0)\n">
    ---Backtrace---
    ./../lib/net/ssh/ruby_compat.rb:28:in `io_select'
    ./../lib/net/ssh/ruby_compat.rb:27:in `synchronize'
    ./../lib/net/ssh/ruby_compat.rb:27:in `io_select'
    ./../lib/net/ssh/buffered_io.rb:113:in `wait_for_pending_sends'
    ./test_buffered_io.rb:52:in `test_wait_for_pending_sends_should_write_multiple_times_if_first_write_was_partial'
    ./test_buffered_io.rb:52:in `test_wait_for_pending_sends_should_write_multiple_times_if_first_write_was_partial'
    ---------------
    
    602 tests, 2305 assertions, 1 failures, 0 errors, 0 pendings, 0 omissions, 0 notifications
    

    And here is the test in question (the error occurs in both Ruby 1.8 and 1.9):

      def test_wait_for_pending_sends_should_write_multiple_times_if_first_write_was_partial
        io.enqueue("here is some data")
    
        io.expects(:send).with("here is some data", 0).returns(10)
        io.expects(:send).with("me data", 0).returns(4)
        io.expects(:send).with("ata", 0).returns(3)
    
        IO.expects(:select).times(2).with(nil, [io]).returns([[], [io]])
        
        assert_nothing_raised { io.wait_for_pending_sends }
        assert !io.pending_write?
      end
    

    If I revert this one change, the test passes:

    @@ -109,7 +110,7 @@ module Net; module SSH
         def wait_for_pending_sends
           send_pending
           while output.length > 0
    -        result = IO.select(nil, [self]) or next
    +        result = Net::SSH::Compat.io_select(nil, [self]) or next
             next unless result[1].any?
             send_pending
           end
    @@ -146,4 +147,4 @@ module Net; module SSH
    

    The Compat.io_select method calls IO#select with 4 arguments. It seems that sending nil for the 3rd and 4th arguments is causing the problem, since the test expects a call with only two arguments.

    I created a temporary fix by adding a second io_select method which takes only two arguments; however, this is less than elegant. Any suggestions for how to improve it?

    http://github.com/net-ssh/net-ssh/commit/0e5486eb89fedd5cf0c6ede30e...

  • Delano Mandelbaum

    Delano Mandelbaum August 27th, 2009 @ 11:40 PM

    Aaron, thanks for testing out the change so quickly. If you have a chance please try again with the latest version in the 2.0 branch.

  • Daniel Azuma

    Daniel Azuma August 28th, 2009 @ 12:32 AM

    Something we might try is rewriting Compat.io_select like this:

        if RUBY_VERSION >= '1.9' || RUBY_PLATFORM == 'java'
          def self.io_select(*params)
            IO.select(*params)
          end
        else
          SELECT_MUTEX = Mutex.new
          def self.io_select(*params)
            SELECT_MUTEX.synchronize do
              IO.select(*params)
            end
          end
        end
    

    I'm not sure if this will solve it (because I'm actually not sure how to run the test suite...) Do you think you can try it out?

  • Daniel Azuma

    Daniel Azuma August 28th, 2009 @ 12:59 AM

    Never mind my ignorance on running the tests... I just read your last message more carefully. :-) The above modification seems to work. I ran the test suite against 1.8 and 1.9.

    The change has been pushed to my fork:

    http://github.com/dazuma/net-ssh/tree/2.0

    http://github.com/dazuma/net-ssh/commit/46af52a240aaf3899d6baf5587f...
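
A minimal sketch of why the splat version can satisfy the two-argument mock expectation. It assumes the earlier wrapper declared four explicit parameters, as the Mocha message `IO.select(nil, [#<Mock:io>], nil, nil)` suggests. `fake_select`, `fixed_arity_select` and `splat_select` are hypothetical names used only for illustration, not net-ssh methods.

```ruby
# Stand-in for IO.select that simply returns the argument list it received.
def fake_select(*args)
  args
end

# Wrapper with four explicit parameters: missing arguments are padded
# with nil, so the underlying call always receives four arguments.
def fixed_arity_select(read, write = nil, error = nil, timeout = nil)
  fake_select(read, write, error, timeout)
end

# Splat wrapper: forwards exactly the arguments the caller passed.
def splat_select(*params)
  fake_select(*params)
end

io = Object.new
fixed_args = fixed_arity_select(nil, [io])
splat_args = splat_select(nil, [io])

puts fixed_args.length # prints 4
puts splat_args.length # prints 2
```

An expectation written as `with(nil, [io])` matches only the second form, because the argument lists differ in length.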

  • Delano Mandelbaum

    Delano Mandelbaum August 28th, 2009 @ 02:14 PM

    Right on. Thanks for the quick turnarounds, Daniel. I pulled the changes into the 2.0 branch.

    http://github.com/net-ssh/net-ssh/tree/2.0

    http://github.com/net-ssh/net-ssh/commit/46af52a240aaf3899d6baf5587...

    All tests pass in Ruby 1.8 and 1.9. I'll wait a couple hours in case anyone else gets a chance to try the changes. I'll cut the 2.0.14 release at 12:00 (EST).

  • Delano Mandelbaum

    Delano Mandelbaum August 28th, 2009 @ 04:53 PM

    • → State changed from “open” to “resolved”

    Net::SSH 2.0.14 is now available on Rubyforge and Github. It should be available via the mirrors soon.

    http://rubyforge.org/frs/?group_id=274

    http://github.com/net-ssh/net-ssh/tree/v2.0.14

    Thank you everyone!

  • Will Bryant

    Will Bryant September 2nd, 2009 @ 01:00 PM

    That patch breaks programs that work with two connections. For example, when I use Snow Leopard's Ruby (ruby 1.8.7 (2008-08-11 patchlevel 72) [universal-darwin10.0]) to deploy using Capistrano over an SSH gateway, it locks up here:

    /Library/Ruby/Gems/1.8/gems/net-ssh-2.0.14/lib/net/ssh/ruby_compat.rb:30:in `select': Interrupt
        from /Library/Ruby/Gems/1.8/gems/net-ssh-2.0.14/lib/net/ssh/ruby_compat.rb:30:in `io_select'
        from /Library/Ruby/Gems/1.8/gems/net-ssh-2.0.14/lib/net/ssh/ruby_compat.rb:28:in `synchronize'
        from /Library/Ruby/Gems/1.8/gems/net-ssh-2.0.14/lib/net/ssh/ruby_compat.rb:28:in `io_select'
        from /Library/Ruby/Gems/1.8/gems/net-ssh-2.0.14/lib/net/ssh/transport/packet_stream.rb:91:in `next_packet'
        from /Library/Ruby/Gems/1.8/gems/net-ssh-2.0.14/lib/net/ssh/transport/packet_stream.rb:90:in `loop'
        from /Library/Ruby/Gems/1.8/gems/net-ssh-2.0.14/lib/net/ssh/transport/packet_stream.rb:90:in `next_packet'
        from /Library/Ruby/Gems/1.8/gems/net-ssh-2.0.14/lib/net/ssh/transport/packet_stream.rb:86:in `loop'
        from /Library/Ruby/Gems/1.8/gems/net-ssh-2.0.14/lib/net/ssh/transport/packet_stream.rb:86:in `next_packet'
        from /Library/Ruby/Gems/1.8/gems/net-ssh-2.0.14/lib/net/ssh/transport/session.rb:169:in `poll_message'
        from /Library/Ruby/Gems/1.8/gems/net-ssh-2.0.14/lib/net/ssh/transport/session.rb:164:in `loop'
        from /Library/Ruby/Gems/1.8/gems/net-ssh-2.0.14/lib/net/ssh/transport/session.rb:164:in `poll_message'
        from /Library/Ruby/Gems/1.8/gems/net-ssh-2.0.14/lib/net/ssh/transport/session.rb:149:in `next_message'
    

    Adding some debugging output to the code:

          SELECT_MUTEX = Mutex.new
          def self.io_select(*params)
            puts "#{Thread.current.inspect} acquiring SELECT_MUTEX"
            SELECT_MUTEX.synchronize do
              puts "#{Thread.current.inspect} acquired mutex, calling select #{params.inspect}"
              result = IO.select(*params)
              puts "#{Thread.current.inspect} called select #{params.inspect}: #{result.inspect}; releasing SELECT_MUTEX"
              result
            end
          end
    

    and running till the point it freezes, the last output is:

    #<Thread:0x1015d5488 run> called select [[#<TCPSocket:0x1016403a0>], nil, nil, 0]: nil; releasing SELECT_MUTEX
    #<Thread:0x10113e730 run> acquired mutex, calling select [[#<TCPSocket:0x101107c58>, #<TCPSocket:0x1011eba98>, #<TCPServer:0x101137a48>], [#<TCPSocket:0x101107c58>], nil, 0.1]#<Thread:0x1015d5488 run> acquiring SELECT_MUTEX
    
    #<Thread:0x10113e730 run> called select [[#<TCPSocket:0x101107c58>, #<TCPSocket:0x1011eba98>, #<TCPServer:0x101137a48>], [#<TCPSocket:0x101107c58>], nil, 0.1]: [[], [#<TCPSocket:0x101107c58>], []]; releasing SELECT_MUTEX
    #<Thread:0x100170358 run> acquired mutex, calling select [[#<TCPSocket:0x10110bda8>], nil, nil, 0]#<Thread:0x10113e730 run> acquiring SELECT_MUTEX
    
    #<Thread:0x100170358 run> called select [[#<TCPSocket:0x10110bda8>], nil, nil, 0]: [[#<TCPSocket:0x10110bda8>], [], []]; releasing SELECT_MUTEX
    #<Thread:0x1015d5488 run> acquired mutex, calling select [[#<TCPServer:0x1015d43d0>, #<TCPSocket:0x1016403a0>], [], nil, 0.1]
    #<Thread:0x100170358 run> acquiring SELECT_MUTEX
    #<Thread:0x1015d5488 run> called select [[#<TCPServer:0x1015d43d0>, #<TCPSocket:0x1016403a0>], [], nil, 0.1]: nil; releasing SELECT_MUTEX
    #<Thread:0x10113e730 run> acquired mutex, calling select [[#<TCPSocket:0x1011eba98>], nil, nil, 0]#<Thread:0x1015d5488 run> acquiring SELECT_MUTEX
    
    #<Thread:0x10113e730 run> called select [[#<TCPSocket:0x1011eba98>], nil, nil, 0]: nil; releasing SELECT_MUTEX
    #<Thread:0x100170358 run> acquired mutex, calling select [[#<TCPSocket:0x10110bda8>]]#<Thread:0x10113e730 run> acquiring SELECT_MUTEX
    

    So we can see that the second thread is trying to acquire the mutex so it can select on its socket, but can't because the first thread is waiting on a select on its own socket while holding the mutex.

    Could you clarify how the patch is supposed to work, given that the point of IO.select is to block until there's an event on the socket?
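
The stall described above can be reproduced in miniature, under the assumption that every select call goes through one shared mutex. `DEMO_MUTEX` and `locked_select` are hypothetical names, and the 0.5-second timeout stands in for a select that would otherwise block indefinitely on a quiet socket.

```ruby
require 'thread'

DEMO_MUTEX = Mutex.new

# Every select call is serialized through the same mutex.
def locked_select(*params)
  DEMO_MUTEX.synchronize { IO.select(*params) }
end

# One pipe that never receives data, and one that is already readable.
silent_reader, _silent_writer = IO.pipe
ready_reader, ready_writer = IO.pipe
ready_writer.write("x")

# The blocker holds the mutex for the whole duration of its select call.
blocker = Thread.new { locked_select([silent_reader], nil, nil, 0.5) }
sleep 0.1 # give the blocker time to acquire the mutex

started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
result = locked_select([ready_reader], nil, nil, 0)
waited = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
blocker.join

puts "ready pipe waited %.2f seconds for the mutex" % waited
```

Even though the second pipe has data pending and uses a zero timeout, its select call cannot start until the blocking call releases the mutex.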

  • Daniel Azuma

    Daniel Azuma September 2nd, 2009 @ 05:02 PM

    Ah. Oh. This is a good point. I think you're right that we need to think this through more carefully.

    The purpose of the patch is precisely to attempt to ensure that multiple threads are NOT attempting to select simultaneously, by blocking one until the other completes. This is to work around a bug that affects certain versions of MRI 1.8.6 and 1.8.7 in which there is a livelock in the threading code related to reporting results of select() calls issued from multiple threads.

    Most Capistrano users were running into this during the connect phase when two or more threads were trying to check (using IO#select) for read activity during the protocol negotiation. The code there used a zero timeout to perform quick checks for whether any bytes are available. Under these conditions, the mutex is safe from deadlocks because it never blocks.

    However, I expanded it to include ALL IO#select calls, not just that one, on the theory that we should cover all cases for completeness' sake. I forgot that the case I was trying to avoid, multiple threads waiting (and blocked) on IO#select simultaneously, sometimes IS the desired behavior. So by avoiding the livelock, we open the door for a deadlock.

    Unfortunately, I then can't think of a good solution for all cases: eventually we may hit either the livelock or the deadlock, and I can't think of a way to avoid both of them.

    Perhaps the best approach is the cautious approach, at least until Ruby itself is fixed (which is the real solution). Scale back the patch so it only mutexes the first, safe case (the one with the zero timeout), which is the main case that I and (I think) most other people were running into anyway. Leave other IO#select calls with nonzero or indefinite timeouts unprotected. There's still the chance some people may run into the livelock, but those use cases, I suspect, are far less common.

    This may also be what Dirkjan is observing back in the Capistrano ticket.
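
The scaled-back approach could look something like the sketch below. This is not the code that shipped in net-ssh. `ZERO_TIMEOUT_MUTEX` and `guarded_select` are hypothetical names, and the sketch assumes the timeout is always passed as the fourth argument.

```ruby
require 'thread'

ZERO_TIMEOUT_MUTEX = Mutex.new

# Serialize only the non-blocking (zero-timeout) polls. Calls with a
# nonzero or missing timeout may block, so they bypass the mutex.
def guarded_select(*params)
  timeout = params[3]
  if timeout == 0
    ZERO_TIMEOUT_MUTEX.synchronize { IO.select(*params) }
  else
    IO.select(*params)
  end
end

reader, writer = IO.pipe
before = guarded_select([reader], nil, nil, 0) # nothing written yet
writer.write("x")
after = guarded_select([reader], nil, nil, 0)  # data is now pending

puts before.inspect
puts after.nil? ? "not readable" : "readable"
```

Because a zero-timeout call returns immediately, the mutex is held only briefly and cannot be kept locked by a thread waiting on a quiet socket.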

  • Delano Mandelbaum

    Delano Mandelbaum September 2nd, 2009 @ 06:23 PM

    • → State changed from “resolved” to “open”

    Will, thanks for the detailed description. Daniel, thanks for the quick reply.

    Two questions:

    • Can we be sure there are no (or sufficiently few) cases where multiple threads could be used with a zero timeout?

    • Do we know which patch versions of Ruby 1.8.6 and 1.8.7 contain or will contain the proper fix?

  • Daniel Azuma

    Daniel Azuma September 2nd, 2009 @ 08:23 PM

    (1) I assume you mean to ask whether there are sufficiently few cases with a NON-zero timeout, since my last proposal keeps the zero-timeout case in a mutex, but un-protects the non-zero-timeout case.

    In my observations, the zero-timeout case is the common case that exposes the Ruby bug. (It usually takes place in Capistrano when connecting to multiple servers in parallel.) My original patch only handled this one case, and it seemed sufficient to prevent the hangs for me. Furthermore, it should not (cross fingers) cause hangs of the kind that Will is reporting because it's a zero-timeout call, which means the thread won't block within the mutex. So it should be no worse than 2.0.13.

    I'm pretty sure that there will be cases when multiple IO#select calls with non-zero timeouts are taking place in parallel. (And Will, in fact, points these out.) The question is, whether or not these events trigger the Ruby bug, and I actually don't know the answer to that. The sequence of events in the Ruby thread scheduler for this case is too complicated for me to trace through and determine whether the same livelock is likely. I do not believe I have observed a hang based on a non-zero-timeout case, but I'm just one user and I cannot be sure. Aaron's report suggests he might have run into such a case, but we never proved it, and it may have been something totally unrelated. (Delano, did you ever hear back from him, btw?)

    So the point is, I now think I was overreacting to Aaron's report when I extended the patch to include the non-zero-timeout cases. I hadn't thought it through: of course such an extension is dangerous because it's arbitrarily wrapping a mutex around a blocking call. It does result in deadlocks as Will has observed, but we have no proof that it improves the livelock situation at all. So my latest proposed change is effectively a rollback of that extension, and a reversion to the original patch.

    (2) In my understanding, the bug was introduced in revision 21165 of the Ruby svn. This means that 1.8.6p287 was not affected, but all subsequent versions so far (p368, p369, and p383) are affected. It means that 1.8.7p72 was not affected, but all subsequent versions so far (p160, p173, and p174) are affected. (Which matches my and others' observations that it started manifesting in the upgrade from 1.8.7p72 to 1.8.7p160.)

    The bug was fixed on the 1.8 branch through revisions 24413, 24416, and 24442. As of right now, the fix has not yet been backported into the 1.8.6 or 1.8.7 release branches. The bug report on redmine is now assigned to the 1.8.7 maintainer with a notation on which revisions to backport, so I expect that the next 1.8.7 patch release (whenever that happens) will contain the fix. However, I am not aware that the 1.8.6 branch maintainer has been pinged yet. That hasn't been a priority for me since I'm using 1.8.7, but I guess I could do it.

    Note that if, as Will suggests, the default Snow Leopard install includes 1.8.7p72, that version of Ruby probably will not exhibit the Ruby bug, and thus does not technically require any of these Net::SSH patches at all.

  • Will Bryant

    Will Bryant September 2nd, 2009 @ 11:07 PM

    Yeah, my personal inclination is that this is a very serious bug in MRI, and frankly, I think people should probably just downgrade or upgrade to a version not affected. Do we know of any distros - Linux, OS X, whatever - that distribute the broken version, or is it just affecting people who built it themselves from source/macports/etc?

    It's a pity we don't have any way to test if the running Ruby has the bug without leaving a couple of dead threads deadlocked. Meanwhile, is there a way we could be more conservative with the Ruby version test? I notice that there's an exemption for JRuby, but there's umpteen thousand variants of Ruby out there now - Rubinius, IronRuby...

    I'm not aware of any way to detect MRI other than the RUBY_COPYRIGHT, which would be nasty. Perhaps we should just tightly bound the version numbers?

    The suggestion to keep the mutex in the zero-timeout case seems sensible - that should help mitigate the bug until the Ruby maintainers get it fixed in the 1.8.6 and 1.8.7 trees and the affected people have the chance to upgrade, without causing any harm to the rest of us.

  • Will Bryant

    Will Bryant September 2nd, 2009 @ 11:23 PM

    Hans de Graaff's comment at the end of http://redmine.ruby-lang.org/issues/show/1471 is interesting - bug possibly not completely fixed in the Ruby mainline yet?

  • Daniel Azuma

    Daniel Azuma September 3rd, 2009 @ 06:18 AM

    @Will: No, I'm pretty sure Redmine 1471 is a separate and probably unrelated issue. As John Carter originally reported it, it has to do with deadlock detection, not with hangs in IO#select. That said, though, the MRI 1.8 thread scheduler (the C code behind these issues) was and is a Frankensteinian mess. I spent many hours picking it apart while investigating this issue, and though I still understand very little of it, I found several more potential problems along the way. There are, I suspect, a number of intertwined bugs still extant in the code. One more reason to upgrade to 1.9 soon...

    It may be possible to bound the version numbers more tightly. We can detect 1.8.6 and 1.8.7 specifically, and there's the RUBY_PATCHLEVEL constant that, if present, would give the MRI patchlevel. But these are riskier checks that we'd have to test on other Ruby implementations, not to mention non-release subversion builds. I'm kind of hesitant to go that route and add more complexity, now that I've been burned once already... :-)
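
A tighter check along those lines might be sketched as follows. It uses only the patchlevels mentioned earlier in this thread (1.8.6p287 and 1.8.7p72 as the last unaffected releases) and has no upper bound, because no fixed patch release is named here. `LAST_UNAFFECTED_PATCHLEVEL` and `possibly_affected?` are hypothetical names, and the sketch has not been validated on other Ruby implementations or on non-release builds.

```ruby
# Last patchlevels reported in this thread as NOT affected, per release line.
LAST_UNAFFECTED_PATCHLEVEL = { '1.8.6' => 287, '1.8.7' => 72 }.freeze

# Returns true when the version/patchlevel pair falls after the last
# release believed to be unaffected. Unknown versions and missing
# patchlevels are treated as unaffected (an assumption).
def possibly_affected?(version, patchlevel)
  last_good = LAST_UNAFFECTED_PATCHLEVEL[version]
  return false if last_good.nil? || patchlevel.nil?
  patchlevel > last_good
end

# RUBY_PATCHLEVEL may not be defined on every implementation.
current_patchlevel = defined?(RUBY_PATCHLEVEL) ? RUBY_PATCHLEVEL : nil
puts possibly_affected?(RUBY_VERSION, current_patchlevel)
```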

  • Will Bryant

    Will Bryant September 3rd, 2009 @ 10:54 AM

    Right, "Here be dragons" :). Thanks for those details.

    Delano, how do you feel about Daniel's fork? Keen to get a new net-ssh official release to unbreak deployments from Snow Leopard :).

  • Delano Mandelbaum

    Delano Mandelbaum September 3rd, 2009 @ 03:26 PM

    I'm going to catch up on the convo and prepare a release this afternoon.

  • Delano Mandelbaum

    Delano Mandelbaum September 3rd, 2009 @ 07:20 PM

    Daniel, I've included your changes in the 2.0 branch. I am going to create the 2.0.15 release right now based on these changes b/c it's important we correct the bug introduced in the previous release. I agree with Will that we should be more conservative in the Ruby test but I also agree with Daniel that it could introduce riskier changes at this time.

    Aside from the immediate release, I'd like to pursue a few tasks:

    • Daniel, could you ping the 1.8.6 maintainer? (I haven't heard from Aaron yet by the way, but I am interested to hear if these changes work for him).

    • create a test scenario so that everyone involved has a reliable way to reproduce the issue. I can extend the Rudy configuration (i.e. Rudyfile) which I already use for testing gem builds to automate the process of setting up the machines with the specific versions of Ruby 1.8.6 and 1.8.7. This will also help us verify which specific versions of Ruby have included the fix. Could one of you help me with the test scenario?

    • I'm going to look into how to safely determine the running engine (MRI, JRuby, Rubinius, etc...) and specific version. If anyone has any suggestions or experience in this area, please let me know.

    Anything else?
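
One possible starting point for the engine question is the `RUBY_ENGINE` constant. As far as I know it is defined by Ruby 1.9 and by JRuby and Rubinius, but not by the MRI 1.8 releases discussed here, so the fallback to `'ruby'` below is an assumption. `ruby_engine` and `ruby_description` are hypothetical helper names.

```ruby
# Name of the running implementation. Falls back to 'ruby' (MRI) when
# RUBY_ENGINE is not defined, which is an assumption about older MRI.
def ruby_engine
  defined?(RUBY_ENGINE) ? RUBY_ENGINE : 'ruby'
end

# Human-readable summary such as "<engine> <version>p<patchlevel>".
def ruby_description
  patch = defined?(RUBY_PATCHLEVEL) ? "p#{RUBY_PATCHLEVEL}" : ''
  "#{ruby_engine} #{RUBY_VERSION}#{patch}"
end

puts ruby_description
```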

  • Delano Mandelbaum

    Delano Mandelbaum September 3rd, 2009 @ 07:31 PM

    • → Milestone changed from “2.0.14” to “2.0.15”

    The 2.0.15 release is now available on Rubyforge and Github. It should be available on the Rubyforge mirrors in the next 30 minutes.

  • Daniel Azuma

    Daniel Azuma September 3rd, 2009 @ 08:08 PM

    I created a Redmine ticket in the 1.8.6 project and assigned it to the 1.8.6 maintainer. We now have the following tickets for the two versions:

    1.8.6: http://redmine.ruby-lang.org/issues/show/2039

    1.8.7: http://redmine.ruby-lang.org/issues/show/1993

  • Shawno

    Shawno October 28th, 2009 @ 09:40 PM

    • → Tag changed from “bug ruby18” to “bug ruby18 socket”

    Hi guys - great work with this library, and thanks for picking up the trail after Jamis's departure.

    Caught up with this thread after noticing what I "think" are similar issues to the one described here, that I'm experiencing with the 2.0.15 release. I'm no socket expert by any means, but I thought the symptoms I observed might be applicable.

    Looks like Capistrano is the big user of the library, but we're looking to use Net::SSH in a number of our security applications. Where I noticed an issue is when I used a forwarded port created with Net::SSH to connect a local port to a remote host's HTTP proxy server. A browser configured locally to use that local port will obviously open a lot of simultaneous connections as it makes requests through the forward. We're noticing that if you're "nice" and allow pages to load (connections are created, used, then shut down gracefully) you're fine. But if you interrupt the connection by browsing to other links mid-download, or hit your stop button, we see "connection forcibly reset" exceptions from buffered_io.rb's fill method, and then the forwarded connection is basically locked up until you re-establish it. I wasn't sure if this issue might be related, but it's definitely a good way to reproduce the issue.

    Based on the link above with the open issues directed to ruby 1.8.6 and 1.8.7, is it likely this is resolved in the HEAD of 1.8?

    Thanks again for the great work - great library to work with.

  • Shawno

    Shawno October 28th, 2009 @ 09:44 PM

    Incidentally, our testing was done on: ruby -v

    ruby 1.8.6 (2008-08-11 patchlevel 287) [i386-mswin32]

  • Shawno

    Shawno October 28th, 2009 @ 11:02 PM

    Just caught Daniel's blog and it looks like the workaround to the issue reported here was patched in 2.0.14. So the locked connection I described above may not be the same issue. Here's some sample code that will kick out the issue:

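    require 'net/ssh'

    def launch_ssh
      Thread.new do
        begin
          # Close any previous session before opening a new one.
          if $session
            $session.close
            $session = nil
          end
          $session = Net::SSH.start('your host', 'user', :password => 'password')
          # Forward local port 8080 to the HTTP proxy. The host string and
          # PROXY_SERVER_PORT are placeholders for your proxy's host and port.
          $session.forward.local(8080, 'some proxy server host', PROXY_SERVER_PORT)
          $session.loop { true }
        rescue => e
          puts e.message
          puts e.backtrace
        end
      end
    end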

    If you overload or interrupt a connection using the forward, you see this:

    An existing connection was forcibly closed by the remote host. - recvfrom(2)
    C:/Ruby/lib/ruby/gems/1.8/gems/net-ssh-2.0.15/lib/net/ssh/buffered_io.rb:65:in `recv'
    C:/Ruby/lib/ruby/gems/1.8/gems/net-ssh-2.0.15/lib/net/ssh/buffered_io.rb:65:in `fill'
    C:/Ruby/lib/ruby/gems/1.8/gems/net-ssh-2.0.15/lib/net/ssh/connection/session.rb:228:in `postprocess'
    C:/Ruby/lib/ruby/gems/1.8/gems/net-ssh-2.0.15/lib/net/ssh/connection/session.rb:224:in `each'
    C:/Ruby/lib/ruby/gems/1.8/gems/net-ssh-2.0.15/lib/net/ssh/connection/session.rb:224:in `postprocess'
    C:/Ruby/lib/ruby/gems/1.8/gems/net-ssh-2.0.15/lib/net/ssh/connection/session.rb:203:in `process'
    C:/Ruby/lib/ruby/gems/1.8/gems/net-ssh-2.0.15/lib/net/ssh/connection/session.rb:161:in `loop'
    C:/Ruby/lib/ruby/gems/1.8/gems/net-ssh-2.0.15/lib/net/ssh/connection/session.rb:161:in `loop_forever'
    C:/Ruby/lib/ruby/gems/1.8/gems/net-ssh-2.0.15/lib/net/ssh/connection/session.rb:161:in `loop'

    It may indeed be the remote host closing the connection, but the forwarded connection provided by Net::SSH hangs at this point and doesn't recover.

    Again - sorry if this isn't related to this issue -

  • Lee Hambley

    Lee Hambley October 29th, 2009 @ 12:01 AM

    Shawno,

    I suspect Delano will move this to a separate bug/ticket, as it's not related. Your mileage may vary, but try handling this error and forcing a reconnect. We had a similar problem in Capistrano, for which the solution is to expect the network to fail and handle it by reconnecting as required. After all, network problems happen quite frequently :)

    -- Over to Delano!
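
The "expect the network to fail and reconnect" idea can be sketched generically, without any Net::SSH calls. `with_reconnect` is a hypothetical helper, not part of Net::SSH or Capistrano, and the choice of exceptions to retry on is an assumption.

```ruby
# Runs the block, retrying when the connection is reset or the stream
# is closed. The block receives the current attempt number. After
# max_attempts failures the last exception is re-raised.
def with_reconnect(max_attempts = 3, pause = 0)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue Errno::ECONNRESET, IOError
    raise if attempts >= max_attempts
    sleep pause
    retry
  end
end

# Simulated flaky connection: fails twice, then succeeds.
result = with_reconnect do |attempt|
  raise Errno::ECONNRESET if attempt < 3
  "connected on attempt #{attempt}"
end

puts result # prints "connected on attempt 3"
```

In real use the block would open the session and set up the forward again, so that each retry starts from a fresh connection.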

  • Shawno

    Shawno October 29th, 2009 @ 12:15 AM

    Thanks for chiming in, Lee. I did do some tinkering in buffered_io.rb to catch the exception and force a retry, or even just return 0, which seemed to slow the problem but didn't resolve it. But I'll admit, my ruby-kung-fu is pretty limited.

    In any case, swapping out the ruby implementation of the forward, for another ssh client (like putty, etc) produces no issue - which at least seemingly eliminates external causes. Whether this is something located in Net::SSH or deeper in the bowels of ruby's networking, I'll leave to the experts. Could just be some connection robustness needed.

    Kudos again folks - looking forward to following the project.

  • Delano Mandelbaum

    Delano Mandelbaum October 29th, 2009 @ 03:43 PM

    Shawno, thanks for identifying the issue. I created a new ticket:
    http://net-ssh.lighthouseapp.com/projects/36253-net-ssh/tickets/7-a...

    This is a very busy week for me but I'll get to it as soon as I can (probably early next week).

  • Manish Shah

    Manish Shah December 8th, 2009 @ 05:47 PM

    Any more progress on this? We're still having issues with this and capistrano. Any ideas?

    * establishing connection to gateway `xx.xx.xx.xx:9022'
    * Creating gateway using xx.xx.xx.xx:9022
    * establishing connection to `qa' via gateway
    /opt/local/lib/ruby/gems/1.8/gems/net-ssh-2.0.16/lib/net/ssh/ruby_compat.rb:36:in `select': closed stream (IOError)
      from /opt/local/lib/ruby/gems/1.8/gems/net-ssh-2.0.16/lib/net/ssh/ruby_compat.rb:36:in `io_select'
      from /opt/local/lib/ruby/gems/1.8/gems/net-ssh-2.0.16/lib/net/ssh/connection/session.rb:201:in `process'
      from /opt/local/lib/ruby/gems/1.8/gems/net-ssh-gateway-1.0.1/lib/net/ssh/gateway.rb:193:in `initiate_event_loop!'
      from /opt/local/lib/ruby/gems/1.8/gems/net-ssh-gateway-1.0.1/lib/net/ssh/gateway.rb:192:in `synchronize'
      from /opt/local/lib/ruby/gems/1.8/gems/net-ssh-gateway-1.0.1/lib/net/ssh/gateway.rb:192:in `initiate_event_loop!'
      from /opt/local/lib/ruby/gems/1.8/gems/net-ssh-gateway-1.0.1/lib/net/ssh/gateway.rb:190:in `initialize'
      from /opt/local/lib/ruby/gems/1.8/gems/net-ssh-gateway-1.0.1/lib/net/ssh/gateway.rb:190:in `new'
      from /opt/local/lib/ruby/gems/1.8/gems/net-ssh-gateway-1.0.1/lib/net/ssh/gateway.rb:190:in `initiate_event_loop!'
       ... 51 levels...
      from /opt/local/lib/ruby/gems/1.8/gems/capistrano-2.5.10/lib/capistrano/cli/execute.rb:14:in `execute'
      from /opt/local/lib/ruby/gems/1.8/gems/capistrano-2.5.10/bin/cap:4
      from /opt/local/bin/cap:19:in `load'
      from /opt/local/bin/cap:19
      
      *** [deploy:update_code] rolling back
  • Daniel Azuma

    Daniel Azuma December 13th, 2009 @ 11:42 PM

    Manish, that sounds like a different issue, if the symptom is the connection dropping or stream being closed, rather than the select calls returning the wrong results. Can you open a separate bug? I think this current one ought to be closed.

  • Delano Mandelbaum

    Delano Mandelbaum December 14th, 2009 @ 06:13 PM

    • → State changed from “open” to “resolved”

    Hi Manish, please open a separate ticket for your issue.

    I left this ticket open to catch any potential issues with the fix but it's been quiet so I'll assume all is good and close it (Daniel, thanks for the reminder).
