I’m running a fairly complicated build and test system with DRb over
Ruby 1.8.6. It involves 12 Linux machines running several different
distro versions and one Windows machine.
Lately I’ve been having problems where once in a while the machines
involved in this system just stop communicating, and I can’t figure out
why. On occasion I can work around the problem by changing the order or
the frequency of the operations. It occurs more or less at random.
The only thing I can think of is that this all started when I added suse
9.3 and 9.4 machines to this system.
The other possibility is that now I have 12 Linux machines and a Windows
machine all more or less arbitrarily talking with each other, so there
might be a slowly increasing probability of a deadlock that I’m suddenly
noticing because it’s more likely with more machines.
I’m sitting here thinking of exotic ways TCP could be misconfigured out
of the box on suse 9. But deep in my soul I’m sure it’s some stupid code
I wrote.
Anyway, the idea here is that a Windows machine sends messages to
several Linux machines and the Linux machines send back log messages and
occasionally a series of messages that represent the contents of a file.
If anyone has insight, I’d appreciate it. I’m running out of good ideas
here.
–
Darrin
It might help to add
Thread.abort_on_exception = true
in case a drb thread is dying silently. (DRb might be smarter than that,
though.)
Tried that. No processes died or left any traces in my logs.
What I did get was more consistent [bad] behavior, at least for today.
It seems that any time my Windows machine calls a method on my hosted
64-bit suse9 machine, and I return a large value from that method, the
conversation somehow gets “stuck”. Large might be an array of 8000 lines
of text from a file.
I watched the conversation with Wireshark and saw that on one of my
failures, right before everything hung, there were 15 TCP dup ACKs.
On Thu, 2009-08-27 at 00:39 +0900, Darrin T. wrote:
That’s all I’ve got so far. Any more help?
Hi,
I had the same problem a while ago: my program simply stopped
communicating with the remote and I couldn’t figure out why either. At
first I restarted the program periodically via cron, and later I rewrote
it to send just UDP messages. I only needed to signal another process on
another host, so this was a good option for me. The program was running
(and stopping) on FreeBSD.
If you have a lot of data to send I’d consider using XML-RPC or SOAP
instead of DRb.
Alternatively, implement file transfer on top of DRb, which could be as
simple as iterating through the file remotely in chunks. That would avoid
issues with arbitrarily large DRb method arguments or return values.
Although I have to say that my expectation would be that arbitrarily
large Strings should not cause issues with DRb. That would sound like
a bug to me.
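To make the chunking idea concrete, here is a rough sketch of what it could look like. FileServer, read_chunk, fetch_file, the 16 KB chunk size, and the host and port in the comments are all made up for illustration; none of them come from DRb itself or from the setup described in this thread. Each remote call returns a bounded slice of the file, so no single DRb reply has to carry the whole thing.

```ruby
# Hypothetical server-side object: hands out a file in bounded slices.
class FileServer
  CHUNK_SIZE = 16 * 1024 # illustrative value, not a DRb limit

  # Returns up to +length+ bytes starting at +offset+, or nil at end of file.
  # +length+ must be positive, otherwise the client loop below never ends.
  def read_chunk(path, offset, length = CHUNK_SIZE)
    File.open(path, 'rb') do |f|
      f.seek(offset)
      f.read(length)
    end
  end
end

# Client-side helper: pulls the file piece by piece from +remote+,
# which can be a local FileServer or a DRbObject proxy for one.
def fetch_file(remote, path, chunk_size = FileServer::CHUNK_SIZE)
  data = String.new
  offset = 0
  while (chunk = remote.read_chunk(path, offset, chunk_size))
    data << chunk
    offset += chunk.size # chunk is a binary string, so size counts bytes
  end
  data
end

# Possible wiring (assumed URI, adjust to taste):
#   server:  require 'drb'
#            DRb.start_service('druby://0.0.0.0:9000', FileServer.new)
#            DRb.thread.join
#   client:  require 'drb'
#            DRb.start_service
#            remote = DRbObject.new_with_uri('druby://server-host:9000')
#            contents = fetch_file(remote, '/path/on/server')
```

A smaller chunk size means more round trips but smaller replies, which also makes it easier to see at which size (if any) the conversation starts to stall.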
Kind regards
robert
PS: I don’t believe in IP misconfiguration either.
So trolling through the drb code I came across this:
def load(soc)  # :nodoc:
  begin
    sz = soc.read(4)  # sizeof (N)
  rescue
    raise(DRbConnError, $!.message, $!.backtrace)
  end
  raise(DRbConnError, 'connection closed') if sz.nil?
  raise(DRbConnError, 'premature header') if sz.size < 4
  sz = sz.unpack('N')[0]
  raise(DRbConnError, "too large packet #{sz}") if @load_limit < sz
  begin
    str = soc.read(sz)
  rescue
    raise(DRbConnError, $!.message, $!.backtrace)
  end
  raise(DRbConnError, 'connection closed') if str.nil?
  raise(DRbConnError, 'premature marshal format(can\'t read)') if str.size < sz
  Thread.exclusive do
    begin
      save = Thread.current[:drb_untaint]
      Thread.current[:drb_untaint] = []
      Marshal::load(str)
    rescue NameError, ArgumentError
      DRbUnknown.new($!, str)
    ensure
      Thread.current[:drb_untaint].each do |x|
        x.untaint
      end
      Thread.current[:drb_untaint] = save
    end
  end
end
Is it possible that the Thread.exclusive bit could deadlock on a Windows
machine?
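For anyone following along, the framing that this method reads is a 4-byte big-endian length followed by that many bytes of Marshal data. Below is a minimal sketch of the same idea with a StringIO standing in for the socket. dump_message and load_message are made-up names for illustration, not DRb's API, and the error handling is simplified to EOFError.

```ruby
require 'stringio'

# Frame an object the way the excerpt expects to read it:
# 4-byte big-endian length ('N'), then the marshalled body.
def dump_message(obj)
  body = Marshal.dump(obj)
  [body.size].pack('N') + body
end

# Read one framed object back. On a real socket, read(n) blocks until
# n bytes have arrived or the peer closes the connection.
def load_message(io)
  header = io.read(4)
  raise EOFError, 'connection closed' if header.nil?
  raise EOFError, 'premature header' if header.size < 4
  size = header.unpack('N')[0]
  body = io.read(size)
  raise EOFError, 'truncated body' if body.nil? || body.size < size
  Marshal.load(body)
end

io = StringIO.new(dump_message(%w[line1 line2]))
p load_message(io)
```

Note that both reads happen before Thread.exclusive is entered. If the second half of a large reply never arrives, the reader would sit in the body read rather than in the exclusive section. That is only a guess at where a hang could show up, not a diagnosis.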
That many dup ACKs are somewhat suspicious, but by themselves they
should not cause a lockup. If there is a network problem the connection
should break eventually. However, some dumps of the part of the
conversation that causes excessive packet duplication might be useful.
Can you replicate the packet duplication with something simple like
an scp file transfer?
Are some of the earlier machines also 64bit?
I am not sure how 32bit vs 64bit integers work with marshalling. It
should work but perhaps some testing to ensure it really works well
would be a good idea.
I can replicate it with a tiny pair of DRb programs.
On my SLES9 machine:
cat test.rb
require 'drb'
class Echo
  def ping(length)
    return 'a' * length
  end
end
When the program succeeds it prints the number given. When it fails,
it hangs until I kill it. Short values like 1024 always succeed. With
44230 as the arg I get some successes and some failures.
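While narrowing down which lengths fail, it can help to turn a hang into a reported failure instead of killing the client by hand. A small sketch, assuming the client holds some object (a local Echo or a DRbObject proxy for one) that responds to ping. timed_ping and the 10 second limit are made up for illustration and are not part of the test programs above.

```ruby
require 'timeout'

# Calls echo.ping(length) and returns the size of the reply,
# or nil if no reply arrives within +limit+ seconds.
def timed_ping(echo, length, limit = 10)
  Timeout.timeout(limit) { echo.ping(length).size }
rescue Timeout::Error
  nil
end

# Possible use, with +echo+ obtained however the client already does it:
#   [1024, 44230, 1_048_576].each do |n|
#     puts "#{n}: #{timed_ping(echo, n).inspect}"
#   end
```

Running a range of sizes this way would show whether there is a threshold below which calls always succeed, which might line up with something visible in the packet traces.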
I have saved packet traces of a successful run at 1024 and a failed run
at 1 Mb. The traces are a few K and 70+K respectively. I can provide
them privately.
Are some of the earlier machines also 64bit?
Yes.
A lot of other 32/64 bit conversations with other machines are working
fine, so I’m reluctant to go there.