Random CPU spikes, EBADF errors

Already covered in this thread:
http://rubyforge.org/pipermail/mongrel-users/2007-October/004132.html

Delaying the accept() would be more helpful for load balancers which,
after a timeout for connect, cycle to another load balancer in the
pool. Failing that, a 503 would be reasonable, and it offers a hint to
users as to what’s really happening. The open/close does not.

According to the HTTP/1.1 spec, the 503 and the refuse-to-accept are both
correct
(http://www.w3.org/Protocols/HTTP/1.1/rfc2616bis/draft-lafon-rfc2616bis-03.txt):

10.5.4. 503 Service Unavailable

The server is currently unable to handle the request due to a
temporary overloading or maintenance of the server. The implication
is that this is a temporary condition which will be alleviated after
some delay. If known, the length of the delay MAY be indicated in a
Retry-After header. If no Retry-After is given, the client SHOULD
handle the response as it would for a 500 response.

  Note: The existence of the 503 status code does not imply that a
  server must use it when becoming overloaded.  Some servers may
  wish to simply refuse the connection.
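
For concreteness, a canned 503 that also carries the optional Retry-After
hint might look something like this (a hypothetical constant in the spirit
of Mongrel’s existing frozen response strings, not something in the code
today; the 2-second delay and Content-Length are illustrative values):

  # Hypothetical: a frozen 503 response that tells the client when to retry.
  OVERLOADED_503_RESPONSE = ("HTTP/1.1 503 Service Unavailable\r\n" \
    "Retry-After: 2\r\n" \
    "Content-Length: 4\r\n" \
    "Connection: close\r\n\r\nBUSY").freeze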

Anyhow, I suggested one means for doing that in a previous thread
(entitled num_threads or accept/close or something like that).

On Mon, 29 Oct 2007 16:09:17 -0400
“Zachary P.” [email protected] wrote:

anything, then try to close it again with result EBADF, because the file
descriptor has been closed already.

Take a look in the mongrel.log with debugging on as there might be a
complaint about LSWS’s interpretation of the HTTP protocol which is
causing mongrel to close the connection due to a malformed request.
Normally, the only time that Mongrel will abort a connection is when the
client (LSWS in this case) sends a malformed request according to the
HTTP grammar. When Mongrel reports what caused the close it tells you
the full request that was bad. The error is usually BAD CLIENT.

Also go use ethereal to get a packet trace of the traffic between the
two servers to see what is being sent. You’ll probably find that LSWS
is doing something that no other web server does, or at least get some
clue as to what is making Mongrel barf. One thing to watch for is that some
web servers acting as a proxy don’t honor the Connection: close header on
responses and try to keep the socket forced open, which also violates
the RFC.

Finally, a stack trace of where the EBADF shows up would let the Mongrel
team just not close it if it’s already closed (again). Ultimately Ruby
shouldn’t be throwing these errnos as separate exceptions since it means
having to compensate for every platform’s interpretation of the sockets
API and what should be thrown when.


Zed A. Shaw

On Mon, 29 Oct 2007 16:27:49 -0400
Robert M. [email protected] wrote:

When mongrel was working, it should send the reply back to LSWS
before closing the socket.

There’s a string prepared for the purpose in mongrel.rb:

ERROR_503_RESPONSE="HTTP/1.1 503 Service Unavailable\r\n\r\nBUSY".freeze

It’s a one-liner to send that to the socket before calling close.
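
Roughly along these lines (just a sketch; client stands for the
already-accepted socket, and the rescue guards are added here only because
an overloaded or vanished peer can make the write or the close fail):

  # Sketch: write the canned 503, then close the socket.
  begin
    client.write(ERROR_503_RESPONSE)
  rescue IOError, SystemCallError
    # peer already gone; nothing useful left to do
  ensure
    client.close rescue nil
  end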

No, that’s not the best way to do this. Think for a minute. Mongrel is
overloaded. It’s having a hard time sending data. Now you want it to
waste more time sending data?

The general practice that works best is when a server is overloaded it
aborts connections it can’t handle in order to get some free time to
service more requests. This way existing pending requests get some
service and in a load balancing situation the server can move on to the
next available backend. The alternative of trying to handle all
requests, even with small responses, means that nobody gets service.

In reality, I bet that LSWS doesn’t try to move on to the next backend
when the connection is aborted. If you think about this also, it means
that when LSWS is behaving as a proxy, and one of your backends goes
down, then LSWS won’t adapt and will instead complain to the user.

A properly functioning proxy server that is behaving as a load balancer
should try all servers possible several times until it either gets a
response or has to give up because everything is down and/or it is
overloaded as well.


Zed A. Shaw

On Mon, 29 Oct 2007 17:43:59 -0400
“Evan W.” [email protected] wrote:

Does Litespeed support x-sendfile? Maybe the DirHandler should be
updated to take advantage of that.

Uh, wait, Mongrel is serving files?

Nobody sees the problem with that? This isn’t a best practice at all,
so first quit doing that and then see if the problem persists. There’s
nothing Mongrel can do if you overload the Ruby interpreter with simple
file requests that LSWS could handle.


Zed A. Shaw

On Mon, 29 Oct 2007 21:27:52 -0400
“Evan W.” [email protected] wrote:

I think currently it accepts the connection and then immediately
closes it, which is not consistent with the spec.

It can’t close a connection it hasn’t accepted yet, and in practice you
find that your LB gets overloaded if you don’t close it right away.
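
For reference, the behavior being argued about boils down to something like
this (a minimal sketch, not Mongrel’s actual accept loop; server,
workers_available? and handle are made-up names):

  # Minimal sketch of accept-then-drop when overloaded.
  client = server.accept          # take it off the kernel's listen queue
  if workers_available?
    handle(client)                # normal request processing
  else
    client.close                  # LB sees an aborted connection and can retry elsewhere
  end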


Zed A. Shaw

On Tue, 30 Oct 2007 10:53:29 +1100
Clifford H. [email protected] wrote:

Surely it’s preferable to just delay the accept() until there’s a thread
to assign it to? That way the client sees a slow connection-establishment
and can draw their own conclusions, including deciding how long to wait
or whether to retry.
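
In code, that idea would amount to roughly the following (a sketch;
worker_pool and its blocking wait are hypothetical, and the pending
connection simply sits in the kernel’s listen backlog in the meantime):

  # Sketch of "delay the accept() until a worker is free".
  worker_pool.wait_for_idle_worker   # block until we can actually serve
  client = server.accept             # only now take the connection
  worker_pool.dispatch(client)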

No, then the load balancer gets bound waiting for a response from a
backend that can’t respond. This causes the load balancer to get a ton
of dead sockets.

The LB should take the closed connection to mean “backend screwed, try
again” and move to the next one.


Zed A. Shaw

Hi Evan,

You are doing a really great job supporting so many people on this list -
thank you! I’m learning a lot just listening.

I’ve been considering that what you’re asking below (and what Robert
Mela has been pushing everyone on) essentially identifies that there
might need to be at least two modes of approaching Mongrel queuing:

  1. Mongrel queues all requests (current model)
  2. Load balancer / webserver (or even IP stack) queues most requests

I think Mongrel right now is designed solely for the case where Mongrel
is supposed to queue all requests? Robert M. seems to want an
environment where Mongrel queues only some or no requests because he
seems to have a way to get Apache + mod_proxy_magic_bullet to queue and
re-try failed requests from mongrels.

I wonder if it makes sense to create a mode for Mongrel where it queues
only a few (or no) requests and the load balancer or webserver (or even
ip stack) is designed to queue the majority of backlogged requests?
Could this be a user-configurable setting (:queue_length => 2.requests
or whatever)?

In the event that the queue length grows bigger than this limit,
Mongrel responds with a 503. If the calling agent understands 503, it
would be able to try other mongrels in the cluster until it finds one
that is free. If they are all busy it would just keep knocking on doors
until one frees up.
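
As a rough sketch of what that mode might look like (QUEUE_LIMIT,
busy_workers and dispatch are illustrative names, not existing Mongrel
settings):

  # Hypothetical "shallow queue" mode: answer 503 once the number of
  # in-flight requests exceeds a small, configurable limit.
  QUEUE_LIMIT = 2   # e.g. the :queue_length => 2 idea above

  def serve(client, busy_workers)
    if busy_workers >= QUEUE_LIMIT
      client.write("HTTP/1.1 503 Service Unavailable\r\n\r\nBUSY")
      client.close
    else
      dispatch(client)   # hand off to a worker thread as usual
    end
  end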

This approach could make things worse in extreme load environments because
now you have backed-up mongrels and a pile of re-requests hammering on
the door to all the mongrels as well. But that’s a worst case scenario
(e.g. slashdotting) that is going to break SOMETHING SOMEWHERE anyway.
So why not have it melt-down at the interface between the webservers
and the mongrel cluster instead of inside the mongrels (what’s the
difference)?

The benefit of this alternate mode of operation would be that free
mongrels get called more often and overloaded mongrels get skipped more
often, which creates a much smoother user experience on the front-end,
generally speaking (this approach improves performance of moderately
loaded websites at the expense of punishing heavily loaded ones - who
should probably add more mongrels/hardware anyway).

The only changes to Mongrel code would be to allow a configurable queue
length on a per-mongrel basis (maybe already in there?) and a setting
to cause Mongrels to accept and return 503 instead of accepting and
closing the connection? Defaults would remain the same as they are now…

Would such a dual mode of operation for mongrels make sense for some
users or am I just completely barking up the wrong tree here? Apologies
if this is a distraction from the real issue you are discussing.

Best,

Steve

At 06:27 PM 10/29/2007, you wrote:

whatever we want.

I think the issue might be this: if you can only handle 500 requests per
second and you are getting 600, then if Mongrel closes the connection, at
least those 500 will get served; but if Mongrel returns 503, the web server
will say “hey, error” and try the next mongrel, which won’t help
clear the request queue. The requests will still queue, just at a
higher level, and no one will end up getting a request served in a sane
amount of time.

Evan

Wow, triggered a whole discussion there (most of which was over my head, at
least at this hour). I’ve bumped it up to four mongrels to see if that
solves the problem (temporarily) and I’ll turn the mongrel.log debug on and
see what I can find.
Thanks all,

Zach

On Tue, 30 Oct 2007 09:19:24 -0400
“Zachary P.” [email protected] wrote:

Wow, triggered a whole discussion there (most of which was over my head, at
least at this hour). I’ve bumped it up to four mongrels to see if that
solves the problem (temporarily) and I’ll turn the mongrel.log debug on and
see what I can find.

It is a common issue though with the HTTP RFC and what load balancers
should be doing. Effectively, the RFC describes a web server, proxy,
and client, but not really an LB of any kind. When people follow the
RFC they get some dumb behaviors from their web server that shouldn’t
apply to an LB. For example, many web servers will take the 503
responses from the backends and then show them to the end user, which if
you read the RFC is kind of right but really wrong (it should try
again). Others will take the RFC literally and make a connection to a
backend then hang out, which is wrong in a practical sense since that
means a mis-configured backend can cripple the LB.

Imagine if the LB had to wait for the “official” TCP timeout of anywhere
from 60 seconds to 200,000 days depending on the operating system. (Yes
pedants, that’s exaggerated.)

There are also practical considerations when dealing with heavily loaded
network servers in general. I believe that the HTTP people got this one
all wrong in that they require a response, but logically if your server
is overloaded, you can’t give a response.

So yes, you started a useful conversation since people are going to keep
hitting this over and over. The solution of course is the following:

** The HTTP RFC doesn’t cover load balancers (or even proxy servers)
in any sufficient detail to be useful. **

That’s the gist of it really.

Let us know what comes of your changes.


Zed A. Shaw