Random CPU spikes, EBADF errors

In May I had a problem with mongrels suddenly consuming huge CPU resources for
a minute or two and then returning to normal (load average spikes up to 3.8
and then back down to a regular 0.2 over the course of 5 minutes, then again
half an hour later, or 4 hours later; no predictable rhythm).

I posted to the Litespeed forums because I thought the problem was there, but
didn’t get far. A week later we migrated hosting companies and the problem was
gone. Now it’s returned. We make a lot of changes, but I’ve gone over the repo
for the last few weeks and can’t see anything structural that should affect
it. It only happens with our main front-end app (the other two are fine), but
it happens at all times of day (and night), so it doesn’t seem to be triggered
by heavy load. Basically a mongrel gets stuck on one or two cached files for a
few minutes (but it still functions fine for other requests; I can ping
specific Rails pages on all mongrels during this period).

strace -e read,write,close produces this repeatedly the whole time (a short
excerpt of thousands of lines):

close(5) = -1 EBADF (Bad file descriptor)
read(5, "GET /flower_delivery/florists_in_covehithe_suffolk_england_uk HTTP/1.1\nAccept: image/gif, image/x-xbitmap, image/jpeg, image"..., 16384) = 473
close(5) = 0
close(5) = -1 EBADF (Bad file descriptor)
read(5, "GET /flower_delivery/florists_in_rowde_wiltshire_england_uk HTTP/1.1\nAccept: image/gif, image/x-xbitmap, image/jpeg, image/p"..., 16384) = 471
close(5) = 0
close(5) = -1 EBADF (Bad file descriptor)
read(5, "GET /flower_delivery/florists_in_cove_south_suffolk_england_uk HTTP/1.1\nAccept: image/gif, image/x-xbitmap, image/jpeg, imag"..., 16384) = 474
close(5) = 0
close(5) = -1 EBADF (Bad file descriptor)
read(5, "GET /flower_delivery/florists_in_covehithe_suffolk_england_uk HTTP/1.1\nAccept: image/gif, image/x-xbitmap, image/jpeg, image"..., 16384) = 473
close(5) = 0
close(5) = -1 EBADF (Bad file descriptor)
read(5, "GET /flower_delivery/florists_in_rowde_wiltshire_england_uk HTTP/1.1\nAccept: image/gif, image/x-xbitmap, image/jpeg, image/p"..., 16384) = 471
close(5) = 0
close(5) = -1 EBADF (Bad file descriptor)
read(5, "GET /flower_delivery/florists_in_cove_south_suffolk_england_uk HTTP/1.1\nAccept: image/gif, image/x-xbitmap, image/jpeg, imag"..., 16384) = 474
close(5) = 0
close(5) = -1 EBADF (Bad file descriptor)
read(5, "GET /flower_delivery/florists_in_covehithe_suffolk_england_uk HTTP/1.1\nAccept: image/gif, image/x-xbitmap, image/jpeg, image"..., 16384) = 473
close(5) = 0
close(5) = -1 EBADF (Bad file descriptor)
read(5, "GET /flower_delivery/florists_in_rowde_wiltshire_england_uk HTTP/1.1\nAccept: image/gif, image/x-xbitmap, image/jpeg, image/p"..., 16384) = 471
close(5) = 0
close(5) = -1 EBADF (Bad file descriptor)
read(5, "GET /flower_delivery/florists_in_cove_south_suffolk_england_uk HTTP/1.1\nAccept: image/gif, image/x-xbitmap, image/jpeg, imag"..., 16384) = 474
close(5) = 0
close(5) = -1 EBADF (Bad file descriptor)
read(5, "GET /flower_delivery/florists_in_covehithe_suffolk_england_uk HTTP/1.1\nAccept: image/gif, image/x-xbitmap, image/jpeg, image"..., 16384) = 473
close(5) = 0
close(5) = -1 EBADF (Bad file descriptor)
read(5, "GET /flower_delivery/florists_in_rowde_wiltshire_england_uk HTTP/1.1\nAccept: image/gif, image/x-xbitmap, image/jpeg, image/p"..., 16384) = 471
close(5) = 0
close(5) = -1 EBADF (Bad file descriptor)
read(5, "GET /flower_delivery/florists_in_cove_south_suffolk_england_uk HTTP/1.1\nAccept: image/gif, image/x-xbitmap, image/jpeg, imag"..., 16384) = 474
close(5) = 0
close(5) = -1 EBADF (Bad file descriptor)
read(5, "GET /flower_delivery/florists_in_covehithe_suffolk_england_uk HTTP/1.1\nAccept: image/gif, image/x-xbitmap, image/jpeg, image"..., 16384) = 473
close(5) = 0
close(5) = -1 EBADF (Bad file descriptor)

The file it’s trying to get is page-cached, and it exists/is fine (I can even
go to the URL while this is going on).

Could it still be a problem with Litespeed (actually requesting this file many
times)? Litespeed’s CPU usage does go up during this period, but stracing it
doesn’t give anything useful.

Thanks for any tips/directions.

Zach

What system architecture are you using?

Evan

On 10/23/07, Zachary P. [email protected] wrote:

> close(5) = -1 EBADF (Bad file descriptor)
> read(5, "GET /flower_delivery/florists_in_covehithe_suffolk_england_uk HTTP/1.1\nAccept: image/gif, image/x-xbitmap, image/jpeg, image"..., 16384) = 473

It appears that the file is either being prematurely closed, doesn’t exist, or
some other file-related error is occurring.

Please give us the details of your setup and architecture.

The first thing I would check is your open file limit. If this process is
opening more files than the user it is running as is allowed to open, then the
OS will begin denying the process files it tries to open. This can be caused
either by bad code not closing files, or perhaps your app simply has to open
that many files due to application design and load. Either way, check the
limit and compare it to how many files the process has open at the time you
see this occurring.

I hope that this helps,

~Wayne
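
(A rough way to make that check concrete — an illustrative sketch only, not from the thread; it assumes a Linux /proc filesystem, and the script name is made up. Pass the mongrel PID as the argument:)

#!/usr/bin/env ruby
# fd_count.rb -- hypothetical helper, not from this thread.
# Counts how many descriptors a process has open and prints this shell's
# soft open-files limit for comparison (the mongrel may have been started
# under a different limit). Assumes a Linux /proc filesystem.
pid = ARGV[0] or abort "usage: fd_count.rb PID"

open_fds   = Dir.entries("/proc/#{pid}/fd").reject { |name| name =~ /\A\./ }.length
soft_limit = `sh -c 'ulimit -n'`.strip

puts "PID #{pid}: #{open_fds} fds open (soft limit here: #{soft_limit})"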

Are you using a web server in front of your mongrels? It should be picking up
the page-cached file before even considering handing the request to a mongrel.

Cheers

Dave

Hi Guys,

I’m using the latest Litespeed 3.2.4 (just upgraded), Mongrel 1.0.1, and Ruby
1.8.6, running Red Hat Enterprise Linux ES 4. We have:

Web/MySQL/Mail server:

RAID Configuration: RAID 1 (73GBx2)
Memory: 2 GB HP RAM
HP DL385 G2 Processor: Dual Socket Dual Core Opteron 2214 2.2 GHz

(load runs between 0.10 - 0.20, mostly MySQL, but spikes up when the issue
occurs, with Litespeed taking ~30% CPU)

and App server:

RAID Configuration: RAID 1 (146GBx2)
Memory: 4 GB HP RAM
HP DL385 G2 Processor: Dual Socket Dual Core Opteron 2214 2.2 GHz

(usually 0.20 - 0.60, with legitimate spikes up to 1.8 for backend processes.
Spikes up to 2-4 when the problem happens, depending on how many mongrels get
it (sometimes 2))

And these are the mongrels running:

MONGREL CPU MEM VIR RES DATE TIME PID
8010:hq 3.2 3.4 145 138 Oct23 43:13 20409
8011:hq 0.6 3.0 132 125 Oct23 8:15 20412
8012:hq 0.1 1.8 81 74 Oct23 1:28 20415
8015:dhq 0.0 1.0 50 44 02:41 0:08 4775
8016:dhq 0.0 0.7 34 30 02:41 0:01 4778
8017:dhq 0.0 0.7 36 30 02:41 0:01 4781
8020:af 9.0 3.3 143 137 Oct23 114:41 26600
8021:af 5.6 2.0 90 84 Oct23 71:56 26607
8022:af 2.4 1.8 80 74 Oct23 30:37 26578
8025:daf 0.0 1.0 49 42 02:41 0:04 4842
8026:daf 0.0 0.7 34 30 02:41 0:02 4845
8027:daf 0.0 0.7 36 30 02:41 0:02 4848
8030:pr 0.1 1.5 67 61 Oct23 1:50 16528
8031:pr 0.0 0.9 47 40 Oct23 0:17 16532
8032:pr 0.0 0.9 44 38 Oct23 0:13 16536
8035:dpr 0.2 0.7 36 30 12:30 0:02 22335
8036:dpr 0.2 0.7 35 30 12:30 0:02 22338
8037:dpr 0.2 0.7 35 30 12:30 0:02 22341

(The ones starting with “d” are in dev mode; I’ll try turning them off
tonight. I hadn’t considered this a spill-over issue, but it happened just now
and turning them off didn’t ease it. We had a lot fewer when the problem was
occurring before, but also a one-box setup.)

It’s the 8020-8022 ones that have trouble. It is indeed picking up the page
cache, and while it’s happening I can go to one of the pages in question, or
cat it over SSH, with no problems. I’ve monitored the Rails log while it was
happening and haven’t seen any EBADF spilling over. Though it’s conceivable
that a spike of hits from a Google crawl could cause a problem; I could try
siege/ab tonight.

I’m not familiar with checking file limits, but this is what I get from
googling a command:

cat /proc/sys/fs/file-nr:
2920 0 372880

ulimit -a:
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
file size (blocks, -f) unlimited
pending signals (-i) 1024
max locked memory (kbytes, -l) 32
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 77823
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

I’ll post the ls -l /proc/{mongrel.pid}/fd/ next time it happens and see how
close it’s getting to 1024. (Though again, it seems strange that it could be
hitting a max-open-files problem when the other non-cached pages on that pid,
which are definitely opening files, work fine.)

Thanks again,

Zach

Assuming you have access to a root shell on your hosting service: when you are
watching a slow system, you can start up a root shell and renice that shell to
a value of -1. This will give the shell a higher priority, so your commands
will run much quicker. For example:

bash #: su
Password:
root #: ps
PID TTY TIME CMD
1443 pts/2 00:00:00 bash
1447 pts/2 00:00:00 ps
root #: renice -1 1443
root #: suspend
bash #:

Use suspend to leave the root job around, then foreground it when you have the
problem and use it.

If you don’t have a root shell but have sudo, then use

sudo nice -1 ps of -p 27787

etc.

- Jim Hogue

Hi, short follow-up. I did

lsof -p 27787 | grep mongrel -c
=> 64 files open

while it was spiking, so it doesn’t look like a max-file issue. Looking
through the list, the only thing of note was

mongrel_r 27787 rails 5w unknown /proc/27787/fd/5 (readlink: No such file or directory)

5 was likely the fd of the file getting EBADF (usually 4-6). I didn’t catch
which file it was before the problem ended (running lsof was very slow, it
took 20 seconds while the problem was occurring), but previously when I’ve
checked it has always been there and accessible from the web. Also, the
directory in question is full of files, so it hasn’t been swept recently (so
it couldn’t have been cache expiry while reading or anything weird like that).

Zach

Whoops, that nice command needs a --1 as in

sudo nice --1 ps of -p 27787

- Jim

When mongrel is working, it should send the reply back to LSWS before closing
the socket.

There’s a string prepared for the purpose in mongrel.rb:

ERROR_503_RESPONSE="HTTP/1.1 503 Service Unavailable\r\n\r\nBUSY".freeze

It’s a one-liner to send that to the socket before calling close.
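
(For illustration, a hypothetical sketch of that one-liner plus the surrounding error handling — not the actual Mongrel code path, just the shape of the idea:)

# Hypothetical sketch, not the real Mongrel source: reply with 503 before
# dropping a connection that can't be served, instead of closing it silently.
ERROR_503_RESPONSE = "HTTP/1.1 503 Service Unavailable\r\n\r\nBUSY".freeze

def reject_overloaded_client(client)
  client.write(ERROR_503_RESPONSE)   # tell the upstream proxy we're busy
rescue IOError, Errno::EPIPE
  # client already gone; nothing more to say
ensure
  client.close rescue nil            # close exactly once, ignore late errors
end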

Perhaps you are exhausting the number of worker threads in the queue available
to Mongrel (the default is 900-ish, I think). If your cached files are very
big, maybe they aren’t being served quickly enough by the DirHandler and your
queue becomes clogged.

Should Mongrel definitely send a 503 after this state, or not? I think there
was some debate recently about the same issue, and the resolution was
inconclusive.

Does Litespeed support X-Sendfile? Maybe the DirHandler should be updated to
take advantage of that.

Evan
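
(As a rough illustration of that DirHandler/X-Sendfile idea — hypothetical code, not an existing Mongrel handler; it assumes the front-end server is configured to honor the X-Sendfile header:)

require 'mongrel'

# Illustrative sketch: instead of streaming a cached file itself, the handler
# hands the absolute path back to the front-end server via X-Sendfile.
# The class name and docroot argument are made up for the example.
class SendfileHandler < Mongrel::HttpHandler
  def initialize(docroot)
    @docroot = File.expand_path(docroot)
  end

  def process(request, response)
    path = File.expand_path(File.join(@docroot, request.params[Mongrel::Const::PATH_INFO]))

    # refuse anything that escapes the docroot or doesn't exist
    unless path.index(@docroot) == 0 && File.file?(path)
      response.start(404) { |head, out| out.write("Not found") }
      return
    end

    response.start(200) do |head, out|
      head["X-Sendfile"] = path   # the front-end server streams the file
    end
  end
end

The permission caveat raised later in the thread still applies: the front-end server then needs read access to anything the handler might point it at.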

Hi All,

Follow-up to the CPU/EBADF issue I was having with LSWS. Here is the message
that has just been posted:

The problem is on the mongrel side. As shown in the strace output, file handle
5 is the reverse-proxy connection from LSWS to mongrel. Mongrel read the
request, then closed the connection immediately without sending anything back,
then tried to close it again, with the result EBADF, because the file
descriptor had already been closed.

When mongrel is working, it should send the reply back to LSWS before closing
the socket.

The root cause of the problem is on the Mongrel side; however, LSWS should
fail the request after a few retries. We will implement that in our 3.3
release.


Zach
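
(For reference, the double-close pattern in that strace is easy to reproduce when two Ruby IO objects end up wrapping the same descriptor — a contrived illustration, not Mongrel’s actual code:)

require 'socket'

# Contrived demo: two IO objects sharing one fd. The second close calls
# close(2) on a descriptor the kernel has already released, which is the
# "close(5) = -1 EBADF" pattern from the strace output.
server   = TCPServer.new("127.0.0.1", 0)
client   = TCPSocket.new("127.0.0.1", server.addr[1])
conn     = server.accept
alias_io = IO.for_fd(conn.fileno)   # second Ruby object, same fd

conn.close                          # close(fd) = 0
begin
  alias_io.close                    # close(fd) = -1 EBADF
rescue SystemCallError, IOError => e
  puts "second close failed: #{e.class}"
end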

The Camping handler already has in-Mongrel support for X-SENDFILE. It would
make sense to add a configurable option to the DirHandler and Rails handler,
as well as the Camping handler, to actually pass the header along to
webservers that support it, so it does what it’s supposed to instead of faking
it. Also, X-SENDFILE was actually invented by lighttpd, so it’s not a
super-proprietary thing if you’re concerned about that.

On the other hand, if the webserver doesn’t have a superset of the permissions
the mongrel does, it could blow up, and if it does, it could possibly be a
vector for security breaches.

However, the X-SENDFILE code in the Camping handler should probably be moved
into core somewhere instead of living in the handler; it’s kind of a weird
place to have it.

Just saying. I definitely understand your concerns.

Really the issue is whether we should return any response when closing
a connection due to resource overloading, I think.

Evan

Evan, I hear you! I know you have the best interests of Mongrel in mind.

X-SendFile is just a header, right? If so, yeah, it could be moved to core.

If we’re talking about the Ruby sendfile, then I think that should NOT be in
core. I recall many people having issues (i.e. it doesn’t work) with that.

Regarding the closing of the socket without notice: is that something that
Ruby does, or is it that a resource limit was reached and this handle was
chosen by the OS to be closed? If the former, an HTTP 503 response is
appropriate. If the latter, it seems to me that another Mongrel should be
employed in a cluster configuration, or the app code examined to see if it
might be the source of the problem.

==
Will G.

I disagree. As soon as you start putting code specific to other web servers in
the Mongrel core, you’ll either have to start adding it for other servers as
well (thus bloating the code!), or take a stance on which other servers to
support. I think this goes completely against Zed’s original vision.

Zed has written several times about how easy it is to create your own handler
(I believe as a Gem plugin) and to configure Mongrel to use it. Plus, if you
bundle it as a plugin, you can distribute it separately from the Mongrel core,
so those of us who don’t need it won’t have to load code for a program we
choose not to run.

==
Will G.
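
(To make that concrete, a minimal sketch of wiring a custom handler in through a mongrel_rails -S config script — the file name is made up, and SendfileHandler is the hypothetical handler sketched earlier in the thread:)

# config/mongrel_handlers.rb -- hypothetical file, loaded with:
#   mongrel_rails start -S config/mongrel_handlers.rb
# The -S script is evaluated inside Mongrel::Configurator, so `uri` is
# available directly and no changes to Mongrel core are needed.
require 'sendfile_handler'   # hypothetical lib/gem providing SendfileHandler

uri "/flower_delivery", :handler => SendfileHandler.new("public/flower_delivery")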

Surely it’s preferable to just delay the accept() until there’s a thread to
assign it to? That way the client sees slow connection establishment and can
draw their own conclusions, including deciding how long to wait or whether to
retry.

Clifford H., Data Constellation.
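
(A toy sketch of that idea — not Mongrel’s actual accept loop; the port, concurrency cap, and handler are made up for illustration:)

require 'socket'
require 'thread'

MAX_IN_FLIGHT = 50                      # hypothetical cap on concurrent requests

# trivial stand-in for real request handling
def handle(sock)
  sock.readpartial(16_384) rescue nil
  sock.write("HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
end

slots  = SizedQueue.new(MAX_IN_FLIGHT)  # push blocks once the queue is full
server = TCPServer.new(3000)

loop do
  slots.push(:busy)                     # blocks here when all slots are taken,
  client = server.accept                # so overload shows up as a slow accept
  Thread.new(client) do |sock|
    begin
      handle(sock)
    ensure
      sock.close rescue nil
      slots.pop                         # free the slot
    end
  end
end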

It’s a Mongrel-configured limit to avoid queuing an impossibly long
number of requests in an overloaded situation. So we can return
whatever we want.

I think the issue might be: if you can only handle 500 requests per second and
you are getting 600, then if Mongrel closes the connection, at least those 500
will get served; but if Mongrel returns 503, the web server will say “hey,
error” and try the next mongrel, which won’t help clear the request queue. The
requests will still queue, just at a higher level, and no one will end up
getting a request served in a sane amount of time.

Evan

On 10/29/07, Will G. [email protected] wrote:

> X-SendFile is just a header, right? If so, yeah, it could be moved to core.

Yeah, he was talking about X-SendFile.

> If we’re talking the Ruby Sendfile, then I think that should NOT be in core. I recall many people having issues (i.e. it doesn’t work) with that.

Also, send_file is broken on Windows too; besides, it eats all your memory and
hangs your process.

Nice, don’t you think?


Luis L.
Multimedia systems

Leaders are made, they are not born. They are made by hard effort,
which is the price which all of us must pay to achieve any goal that
is worthwhile.
Vince Lombardi

Note that there is no guarantee that this is actually the issue
discussed with the above configuration. It’s just an issue that has
been raised before that might be related.

Evan

See, if we used real X-Sendfile, the webserver maintainers would have to worry
about Windows problems, not Mongrel. ;)

For now we’ll just leave the X-Sendfile behavior alone.

Evan

I think currently it accepts the connection and then immediately
closes it, which is not consistent with the spec.

Evan