DRb Mysterious Stops

darrint · August 24, 2009, 8:40pm

I’m running a fairly complicated build and test system with DRb over
Ruby 1.8.6. It involves 12 Linux machines running several different
distro versions and one Windows machine.

Lately I’ve been having problems where once in a while the machines
involved in this system just stop communicating, and I can’t figure out
why. I’ve found on occasion I can work around the problem by changing
the order of the operations or the frequency of them. It’s more or less
random when it occurs.

The only thing I can think of is that this all started when I added suse
9.3 and 9.4 machines to this system.

The other possibility is that now I have 12 Linux machines and a Windows
machine all more or less arbitrarily talking with each other, so there
might be a slowly increasing probability of a deadlock that I’m suddenly
noticing because it’s more likely with more machines.

I’m sitting here thinking of exotic ways TCP could be misconfigured out
of the box on suse 9. But deep in my soul I’m sure it’s some stupid code
I wrote.

Anyway, the idea here is that a Windows machine sends messages to
several Linux machines and the Linux machines send back log messages and
occasionally a series of messages that represent the contents of a file.

If anyone has insight, I’d appreciate it. I’m running out of good ideas
here.

–
Darrin

darrint · August 24, 2009, 8:53pm

Darrin T. wrote:

The only thing I can think of is that this all started when I added suse

Anyway, the idea here is that a Windows machine sends messages to
several Linux machines and the Linux machines send back log messages and
occasionally a series of messages that represent the contents of a file.

If anyone has insight, I’d appreciate it. I’m running out of good ideas
here.

–
Darrin

It might help to add

Thread.abort_on_exception = true

in case a drb thread is dying silently. (DRb might be smarter than that,
though.)
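
To illustrate what that setting guards against, here is a minimal sketch in plain Ruby with no DRb involved. The method name failure_from_thread is made up for the example. An exception raised inside a thread stays inside that thread until something joins it, which is how a worker can appear to die silently:

```ruby
# Hypothetical helper: starts a worker thread that fails, then joins it.
# Without Thread.abort_on_exception = true, the exception only surfaces
# in the caller when the thread is joined (or its value is read).
def failure_from_thread
  worker = Thread.new do
    # Silence the per-thread report that newer Rubies print to stderr.
    if Thread.current.respond_to?(:report_on_exception=)
      Thread.current.report_on_exception = false
    end
    raise IOError, 'simulated failure in a background thread'
  end
  worker.join # re-raises the worker's exception in the calling thread
  nil
rescue IOError => e
  e.message
end
```

If nothing ever joins the thread, the failure goes unnoticed unless abort_on_exception is enabled.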

darrint · August 25, 2009, 7:46pm

Joel VanderWerf wrote:

It might help to add

Thread.abort_on_exception = true

in case a drb thread is dying silently. (DRb might be smarter than that,
though.)

Tried that. No processes died or left any traces in my logs.

What I did get was more consistent [bad] behavior, at least for today.

It seems that any time my Windows machine calls a method on my hosted
suse9 64-bit machine, and I return a large return value from that
method, the conversation somehow gets “stuck”. Large might be an array
of 8000 lines of text from a file.

I watched the conversation with Wireshark and I saw that on one of my
failures, right before everything hung, there were 15 TCP dup ACKs.

That’s all I’ve got so far. Any more help?

–
Darrin

darrint · August 27, 2009, 8:07am

On Thu, 2009-08-27 at 00:39 +0900, Darrin T. wrote:

That’s all I’ve got so far. Any more help?

Hi,
I had the same problem a while ago: my program simply stopped
communicating with the remote and I couldn’t figure out why either. At
first I restarted the program periodically via cron, and later I rewrote
it to send just UDP messages. I only needed to signal another process on
another host, so this was a good option for me. The program was running
(and stopping) on FreeBSD.
If you have a lot of data to send, I’d consider using xml-rpc or soap
instead of drb.

Martin

darrint · August 27, 2009, 9:00am

2009/8/27 Martin B. removed_email_address@domain.invalid:

instead of drb.
Alternatively implement file transfer on top of DRb, which could be
simply remote iterating through the file in chunks. That would avoid
issues with arbitrary large DRb method arguments or return values.
Although I have to say that my expectation would be that arbitrary
large Strings should not cause issues with DRb - that would sound like
a bug to me.
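
A rough sketch of that chunked approach, under stated assumptions: the class FileChunks and the methods size_of, chunk, and fetch_in_chunks are hypothetical names, not part of DRb. The helper works with any object that answers those two methods, so the same code can be tried locally first and then pointed at a DRbObject:

```ruby
# Hypothetical server-side object. It would be handed to DRb.start_service
# as the front object; nothing here is DRb-specific.
class FileChunks
  def size_of(path)
    File.size(path)
  end

  # Returns up to +length+ bytes starting at +offset+ ("" past end of file).
  def chunk(path, offset, length)
    File.open(path, 'rb') do |f|
      f.seek(offset)
      f.read(length) || ''
    end
  end
end

# Client-side loop: pulls the file in small pieces so that no single
# remote call has to return a very large value.
def fetch_in_chunks(remote, path, chunk_size = 16 * 1024)
  total = remote.size_of(path)
  data = String.new # empty binary string
  offset = 0
  while offset < total
    piece = remote.chunk(path, offset, chunk_size)
    break if piece.empty? # file shrank underneath us; stop rather than loop
    data << piece
    offset += piece.bytesize
  end
  data
end

# Over DRb the call would look something like this (URI is a placeholder):
#   remote = DRb::DRbObject.new_with_uri('druby://somehost:9000')
#   contents = fetch_in_chunks(remote, '/path/to/file')
```

The chunk size of 16 KB is an arbitrary choice for the sketch; it only needs to stay below whatever size starts causing trouble.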

Kind regards

robert

PS: I don’t believe in IP misconfiguration either.

darrint · August 27, 2009, 3:00pm

Robert K. wrote:

Although I have to say that my expectation would be that arbitrary
large Strings should not cause issues with DRb - that would sound like
a bug to me.

So trolling through the drb code I came across this:

def load(soc)  # :nodoc:
  begin
    sz = soc.read(4)        # sizeof (N)
  rescue
    raise(DRbConnError, $!.message, $!.backtrace)
  end
  raise(DRbConnError, 'connection closed') if sz.nil?
  raise(DRbConnError, 'premature header') if sz.size < 4
  sz = sz.unpack('N')[0]
  raise(DRbConnError, "too large packet #{sz}") if @load_limit < sz
  begin
    str = soc.read(sz)
  rescue
    raise(DRbConnError, $!.message, $!.backtrace)
  end
  raise(DRbConnError, 'connection closed') if str.nil?
  raise(DRbConnError, 'premature marshal format(can\'t read)') if str.size < sz
  Thread.exclusive do
    begin
      save = Thread.current[:drb_untaint]
      Thread.current[:drb_untaint] = []
      Marshal::load(str)
    rescue NameError, ArgumentError
      DRbUnknown.new($!, str)
    ensure
      Thread.current[:drb_untaint].each do |x|
        x.untaint
      end
      Thread.current[:drb_untaint] = save
    end
  end
end

Is it possible that the Thread.exclusive bit could deadlock on a Windows
machine?
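
For reference, the framing that load expects can be sketched on its own: a 4-byte big-endian length followed by that many bytes of Marshal data. The names frame and unframe are made up for this example, and StringIO stands in for the socket. On a real socket, read(sz) blocks until all sz bytes arrive, so a transfer that stalls partway would leave the reader waiting at that call, before Thread.exclusive is ever reached:

```ruby
require 'stringio'

# Hypothetical helper: builds a length-prefixed message in the same shape
# that the load method above reads (4-byte big-endian size, then payload).
def frame(obj)
  payload = Marshal.dump(obj)
  [payload.bytesize].pack('N') + payload
end

# Hypothetical helper: reads one framed message from any IO-like object.
def unframe(io)
  header = io.read(4)
  raise IOError, 'connection closed' if header.nil?
  raise IOError, 'premature header' if header.bytesize < 4
  size = header.unpack('N')[0]
  body = io.read(size) # on a socket this waits for all +size+ bytes
  raise IOError, 'truncated body' if body.nil? || body.bytesize < size
  Marshal.load(body)
end
```

With StringIO a short body is reported immediately as an error; a live TCP connection that stops delivering data gives no such signal.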

–
Darrin

darrint · August 27, 2009, 3:45pm

2009/8/26 Darrin T. removed_email_address@domain.invalid:

What I did get was more consistent [bad] behavior, at least for today.

It seems that any time my windows machine calls a method on my hosted
suse9 64 bit machine, and I return a large return value from that
method, the conversation somehow gets “stuck”. Large might be an array
of 8000 lines of text from a file.

I watched the conversation with wireshark and I saw that on one of my
failures, right before everything hung, there were 15 tcp dup acks.

That many dup acks are somewhat suspicious but by themselves they
should not cause a lockup. If there is a network problem the connection
should break eventually. However, some dumps of the part of the
conversation that causes excessive packet duplication might be useful.
Can you replicate the packet duplication with something simple like
scp file transfer or the like?

Are some of the earlier machines also 64bit?

I am not sure how 32bit vs 64bit integers work with marshalling. It
should work but perhaps some testing to ensure it really works well
would be a good idea.
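
A quick way to start on that kind of test is to round-trip integers around the 32-bit and 64-bit boundaries through Marshal. This is only a sketch: the constant and method names are invented here, and running it on one machine checks only that machine. A real cross-machine check would compare the hex dumps produced on each side, or load on one host what was dumped on the other:

```ruby
# Values just below and above the sizes where integer representation
# typically changes on 32-bit and 64-bit builds (chosen for illustration).
BOUNDARY_VALUES = [
  2**30 - 1, 2**30, 2**31 - 1, 2**31,
  2**62 - 1, 2**62, -(2**31), -(2**62)
].freeze

# Dumps and reloads each value; the result should equal the input.
def marshal_round_trip(values)
  values.map { |v| Marshal.load(Marshal.dump(v)) }
end

# Hex form of the dump, convenient for comparing output between machines.
def marshal_hex(value)
  Marshal.dump(value).unpack('H*')[0]
end
```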

Thanks

Michal

darrint · August 28, 2009, 6:52pm

Michal S. wrote:

That many dup acks are somewhat suspicious but by themselves they
should not cause a lockup. If there is a network problem the connection
should break eventually. However, some dumps of the part of the
conversation that causes excessive packet duplication might be useful.
Can you replicate the packet duplication with something simple like
scp file transfer or the like?

I can replicate it with a tiny tiny drb pair of programs.

On my SLES9 machine:

cat test.rb

require 'drb'

class Echo
  def ping(length)
    return 'a' * length
  end
end

echo = Echo.new

DRb.start_service("druby://0.0.0.0:9000", echo)
DRb.thread.join

On my Windows machine:

require 'drb'

echo = DRb::DRbObject.new_with_uri("druby://172.31.192.159:9000")
response = echo.ping(ARGV[0].to_i)
puts response.length

When the program succeeds it prints the number given. When it fails,
it hangs until I kill it. I’m finding that short values always succeed,
like 1024. I get some successful and some failed when I provide 44230 as
the arg.
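
While hunting for the size at which things break, it can help to make the client give up instead of hanging. The following is a sketch only: call_with_deadline is a hypothetical helper built on Ruby's standard Timeout module, and the 10-second figure in the usage comment is arbitrary:

```ruby
require 'timeout'

# Hypothetical helper: runs the block, returning its value, or :timed_out
# if it has not finished within +seconds+.
def call_with_deadline(seconds)
  Timeout.timeout(seconds) { yield }
rescue Timeout::Error
  :timed_out
end

# In the client above, the remote call could then be wrapped like this:
#   response = call_with_deadline(10) { echo.ping(ARGV[0].to_i) }
#   puts(response == :timed_out ? 'hung' : response.length)
```

That makes it practical to loop over a range of lengths and record which ones stall.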

I have saved packet traces of a successful run at 1024 and a failed run
at 1 MB. The traces are a few K and 70+K respectively. I can provide
them privately.

Are some of the earlier machines also 64bit?

Yes.

I am not sure how 32bit vs 64bit integers work with marshalling. It
should work but perhaps some testing to ensure it really works well
would be a good idea.

A lot of other 32/64 bit conversations with other machines are working
fine, so I’m reluctant to go there.

–
Darrin

darrint · August 28, 2009, 10:01pm

SEKI Masatoshi wrote:

DRb.start_service("druby://0.0.0.0:9000", echo, {:load_limit => 2**31}) ?

No effect.

–
Darrin

darrint · August 28, 2009, 9:03pm

On 2009/08/29, at 1:55, Darrin T. wrote:

    return 'a' * length
  end
end

echo = Echo.new

DRb.start_service("druby://0.0.0.0:9000", echo)
DRb.thread.join

DRb.start_service("druby://0.0.0.0:9000", echo, {:load_limit => 2**31}) ?

darrint · August 28, 2009, 10:19pm

SEKI Masatoshi wrote:

DRb.start_service("druby://0.0.0.0:9000", echo, {:load_limit => 2**31}) ?

I’m breaking this with message sizes of ~50k.

–
Darrin

darrint · August 31, 2009, 4:48pm

2009/8/31 Darrin T. removed_email_address@domain.invalid:

Other machines it can run as long as I let it.

Sorry for the noise.

No problem. Btw, netcat is another tool for your network debugging
toolbox which might help for these network transmission tests:

Kind regards

robert

darrint · August 31, 2009, 2:41pm

Darrin T. wrote:

I can replicate it with a tiny tiny drb pair of programs.

And I think I’ve just ruled out ruby and DRb as the culprits here.

I ran a test like this:

ssh root@ipofbadmachine cat /dev/urandom | hexdump -C

I ran that from Windows XP in Cygwin and it hangs after a few seconds.
Against other machines it runs for as long as I let it.
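
The same kind of check can also be done from Ruby with a bare TCP socket and no DRb at all. This is a sketch under assumptions: bulk_transfer is a made-up name, and it runs both ends on the loopback interface purely so that it is self-contained. To test a suspect machine, the sending half would run there and the reading half on the client:

```ruby
require 'socket'

# Sends +total+ bytes over a plain TCP connection and returns how many
# bytes the reading side actually received.
def bulk_transfer(total, host = '127.0.0.1')
  server = TCPServer.new(host, 0) # port 0: let the OS pick a free port
  port = server.addr[1]
  sender = Thread.new do
    client = server.accept
    client.write('a' * total)
    client.close
  end
  received = 0
  TCPSocket.open(host, port) do |sock|
    # read(n) returns nil at end of stream, which ends the loop.
    while (chunk = sock.read(16 * 1024))
      received += chunk.bytesize
    end
  end
  sender.join
  received
ensure
  server.close if server
end
```

If a transfer like this stalls between two particular hosts, the problem lies below Ruby.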

Sorry for the noise.

–
Darrin