Mongrel, monit, and the many, many messages

S-gw · January 9, 2008, 1:58am

Monit 4.9, Mongrel 1.0.1, Rails 1.2.6, Mac OS X 10.4.11 (PPC)

I don’t know whether this is a mongrel issue or a monit issue.

I’m trying to poke my way around a system set up by someone else. I have
no more experience w/ mongrel than local Rails dev at this point, and a
conceptual understanding of how monit is working. I have the Deploying
Rails beta book, and I’m muddling my way thru mongrel and monit docs,
but I think some hints as to direction would be useful.

I am suspicious that all cannot be well on this setup as monit will send
dozens of messages a day, and occasionally hundreds of messages. The
worst day was 1400 alerts. Yes, 1400.

The bulk comes from there being 3 clusters (staging, beta, production),
and 10 mongrels per cluster, and two servers. So, we can reduce the
total quantity by these factors, I get that part, but still, there’s an
awful lot of “this stopped” and “that does not exist” even factoring
the redundancy out.
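
To put the worst-day figure in perspective, here is the arithmetic as a quick Python sketch. It assumes each of the 3 clusters runs its 10 mongrels on both servers, which the description above implies but does not state outright.

```python
# Rough per-process alert rate on the worst day.
# Assumption: each of the 3 clusters runs 10 mongrels on each of the 2 servers.
clusters = 3
mongrels_per_cluster = 10
servers = 2
worst_day_alerts = 1400

monitored_processes = clusters * mongrels_per_cluster * servers
alerts_per_process = worst_day_alerts / monitored_processes

print(monitored_processes)           # 60
print(round(alerts_per_process, 1))  # 23.3
```

Even with the redundancy divided out, that is roughly 23 alerts per mongrel on that day, so the volume is not explained by the process count alone.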

I don’t understand the implications of what each of these means. Is Mongrel
crashing? Rails crashing? Monit crashing?

Thanks for any clues you can offer.

Sample messages I get are:

– (A)----------------------------------
Monit instance changed Service [domain snipped]

Date: Tue, 08 Jan 2008 14:41:50 -0800
Action: alert
Host: [domain snipped]
Description: Monit stopped

– (B)----------------------------------
Does not exist Service mongrel-production-8300

Date: Tue, 08 Jan 2008 15:30:04 -0800
Action: restart
Host: [domain snipped]
Description: ‘mongrel-production-8300’ process is not running

– (C)----------------------------------
Execution failed Service mongrel-production-8301

Date: Tue, 08 Jan 2008 15:30:34 -0800
Action: alert
Host: [domain snipped]
Description: ‘mongrel-production-8301’ failed to start
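
One way to see which services and event types dominate the alert volume is to tally the messages. The following is a minimal Python sketch that assumes every alert body starts with a line of the form "<event> Service <name>", as in the three samples above. The function name and the sample text are illustrative only, and the third sample entry is a made-up repeat, not a real alert.

```python
import re
from collections import Counter

# Assumed layout, based only on the three samples above:
#   "<event> Service <service name>" followed by "Key: value" lines.
HEADER = re.compile(r"^(?P<event>.+?) Service (?P<service>.+)$")

def tally_alerts(text):
    """Count alerts by (event, service) in a dump of alert e-mail bodies."""
    counts = Counter()
    for line in text.splitlines():
        match = HEADER.match(line.strip())
        if match:
            counts[(match.group("event"), match.group("service"))] += 1
    return counts

# Illustrative input; the last entry is a hypothetical repeat.
sample = """\
Does not exist Service mongrel-production-8300
Date: Tue, 08 Jan 2008 15:30:04 -0800
Action: restart

Execution failed Service mongrel-production-8301
Date: Tue, 08 Jan 2008 15:30:34 -0800
Action: alert

Does not exist Service mongrel-production-8300
Date: Tue, 08 Jan 2008 15:45:04 -0800
Action: restart
"""

for (event, service), count in tally_alerts(sample).most_common():
    print(count, event, service)
```

Sorting by count makes it easier to tell whether a few ports account for most of the noise or whether every mongrel is affected equally.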

S-gw · January 9, 2008, 2:02am

Sounds like you have a number of issues. Starting with mongrel, what
do the mongrel logs for the pids that have stopped running say? Also
check /var/log/system.log for monit messages.

It may be worth upgrading to monit 4.10.1, which includes a number of
fixes for running monit under OSX.

Cheers

Dave

S-gw · January 9, 2008, 2:47am

At Wed, 9 Jan 2008 01:58:58 +0100,
Greg W. wrote:

[…]

I have seen a similar situation here. What happened was (more or less,
this is from memory) a mongrel instance would be locked up on an HTTP
response that would take a long time to complete. Because requests
would just queue up behind this one, monit would fail to get a
response in a reasonable time, would assume that the process was
non-responsive and try to restart it gracefully (using mongrel_rails
stop). Mongrel would take a long time to shut down because it was
still processing that long running response, so we would get a message
that monit couldn’t shut it down and it would fail to start (or
something like that). Finally the long running rails process would
complete, mongrel would restart, and monit would let us know that the
process was back up.

The solution was to make sure that responses come back in a reasonable
amount of time.
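
The failure mode described above can be sketched as a toy model in Python. It treats one mongrel as a strictly serial worker, so a health probe has to wait behind whatever request is already in flight. The function names and all numbers are hypothetical; this is not how monit or mongrel are implemented, only an illustration of the timing.

```python
def probe_wait(busy_until, probe_time):
    """Seconds a health probe waits behind the request already in progress.

    Models one mongrel as a strictly serial worker: a probe arriving at
    probe_time is not served until the in-flight request ends at busy_until.
    """
    return max(0.0, busy_until - probe_time)

def probe_fails(busy_until, probe_time, probe_timeout):
    """True if the monitor would give up before the mongrel answers."""
    return probe_wait(busy_until, probe_time) > probe_timeout

# Hypothetical numbers: a request starts at t=0 and takes 90 s;
# the monitor probes at t=30 and waits at most 10 s.
print(probe_fails(busy_until=90, probe_time=30, probe_timeout=10))  # True
# The same probe against a 5 s request is answered immediately.
print(probe_fails(busy_until=5, probe_time=30, probe_timeout=10))   # False
```

In this model the process is healthy in both cases; only the slow request makes it look dead to the monitor.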

best,
Erik Hetzner
;; Erik Hetzner, California Digital Library
;; gnupg key id: 1024D/01DB07E3

S-gw · January 9, 2008, 4:29am

Make sure your Monit check interval (not sure about the default) is
greater than your Mongrel request timeout interval (default 60
seconds).

Evan

S-gw · January 9, 2008, 7:55pm

Thanks for the ideas so far. I’ll look into the latest monit. Message
(A) is starting to look like a monit crash to me. It is always followed
by a bunch of similar messages that monit may be stopping/starting all
the mongrels.

Looks like the logs have little or no date/time stamps, so they’re
semi-useless in trying to correlate to the email alerts.
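
If the logs themselves cannot be changed, one workaround is to stamp lines as they are read, for example while tailing a log. Below is a minimal Python sketch; `stamp_lines` is a hypothetical helper, and it records the time a line was read, not the time it was originally written, so it only helps for output captured from now on.

```python
from datetime import datetime

def stamp_lines(lines, clock=datetime.now):
    """Yield each line prefixed with the time it was read (local time)."""
    for line in lines:
        yield f"{clock():%Y-%m-%d %H:%M:%S} {line.rstrip()}"

# Demo with a fixed clock so the output is reproducible.
fixed = lambda: datetime(2008, 1, 8, 15, 30, 4)
for out in stamp_lines(["example log line\n"], clock=fixed):
    print(out)  # 2008-01-08 15:30:04 example log line
```

Passing `sys.stdin` as `lines` would let the same function sit at the end of a pipe.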

I do have some requests that can take a while to process (depends on
response time from external services), so that’s a valid lead.

Evan W. wrote:

Make sure your Monit check interval (not sure about the default) is
greater than your Mongrel request timeout interval (default 60
seconds).

I have looked everywhere I can think of, and I don’t see any mention of
this timeout value anywhere in Mongrel docs. This page
(http://mongrel.rubyforge.org/docs/howto.html) mentions a -t (timeout),
but the description doesn’t match what you’re referring to. It looks
like a delay between the end of responding to request A and starting to
handle request B, not when to give up on A.

I guess I’ll assume the 60 secs, and play with monit accordingly.

– gw

S-gw · January 9, 2008, 8:37pm

That page is out of date. The RDoc is probably better. And there’s
always the source…

Soon we’ll do some work on the state of the documentation.

Evan

S-gw · January 9, 2008, 9:10pm

At Wed, 9 Jan 2008 19:55:27 +0100,
Greg W. wrote:

Thanks for the ideas so far. I’ll look into the latest monit. Message
(A) is starting to look like a monit crash to me. It is always followed
by a bunch of similar messages that monit may be stopping/starting all
the mongrels.

[…]

I doubt a monit crash. This is the message I get when I start monit
with the ‘-I quit’ option. It sounds like something (a cron job?) is
restarting monit, & monit is not noticing that the mongrels are
running when it restarts, so it tries to bring the mongrels up. Fool
around with the monitrc: perhaps monit is failing to notice the pid
files that exist for mongrel?
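
A quick way to test the pid-file theory is to check whether each pid file points at a live process. The Python sketch below uses signal 0, which delivers nothing and only reports whether the pid exists. The function name and the example path are hypothetical; substitute whatever pid file paths the monitrc actually references.

```python
import os

def pidfile_status(path):
    """Classify a pid file as 'missing', 'invalid', 'stale', or 'running'."""
    try:
        with open(path) as handle:
            pid = int(handle.read().strip())
    except FileNotFoundError:
        return "missing"
    except ValueError:
        return "invalid"
    if pid <= 0:
        return "invalid"
    try:
        os.kill(pid, 0)  # signal 0 sends nothing; it only checks the pid
    except ProcessLookupError:
        return "stale"
    except PermissionError:
        return "running"  # process exists but belongs to another user
    return "running"

# Hypothetical path, for illustration only.
print(pidfile_status("/hypothetical/log/mongrel.8300.pid"))
```

A "stale" or "missing" result for a mongrel that is actually serving requests would suggest monit and mongrel disagree about where the pid file lives.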

best,
Erik Hetzner

S-gw · January 9, 2008, 10:41pm

On Jan 9, 2008, at 3:09 PM, Greg W. wrote:

Yeah we have launchd monitoring monit, so that could explain that.

Y’know, you can just have launchd monitor mongrel. That probably
makes more sense than launchd watching monit watching mongrel
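
As a rough illustration of that idea, the Python sketch below builds a launchd job description for a single mongrel and prints it as a property list. The label, port, and application path are hypothetical. The `-e`, `-p`, and `-c` flags are the usual `mongrel_rails start` options, and `-d` is left out on the assumption that launchd needs the job to stay in the foreground. `KeepAlive` is, as far as I know, the key for 10.5 and later; on 10.4 `OnDemand` set to false is believed to serve the same purpose, so check the launchd.plist man page for the OS version in use.

```python
import plistlib

def mongrel_job(port, app_dir, environment="production"):
    """Build a launchd job dict for one foreground mongrel on the given port."""
    return {
        "Label": f"example.mongrel.{environment}.{port}",  # hypothetical label
        "ProgramArguments": [
            "mongrel_rails", "start",  # a full path is safer in practice
            "-e", environment,
            "-p", str(port),
            "-c", app_dir,
            # no -d: the job is assumed to stay in the foreground under launchd
        ],
        "RunAtLoad": True,
        "KeepAlive": True,  # assumption: on 10.4 use "OnDemand": False instead
    }

# Hypothetical port and path, for illustration only.
print(plistlib.dumps(mongrel_job(8300, "/path/to/app")).decode())
```

One job per port keeps the restart logic in a single supervisor instead of stacking launchd on top of monit.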

-n

S-gw · January 9, 2008, 10:09pm

Erik Hetzner wrote:

At Wed, 9 Jan 2008 19:55:27 +0100,
Greg W. wrote:

Thanks for the ideas so far. I’ll look into the latest monit. Message
(A) is starting to look like a monit crash to me. It is always followed
by a bunch of similar messages that monit may be stopping/starting all
the mongrels.

[…]

I doubt a monit crash. This is the message I get when I start monit
with the ‘-I quit’ option. It sounds like something (a cron job?) is
restarting monit…

Yeah we have launchd monitoring monit, so that could explain that.

When it was all set up it was explained to me that “mongrel/rails
crashes/has leaks, so we use monit to keep an eye on that, but monit
crashes/has leaks, so we’ll use launchd to monitor monit”

Sounded like a house of cards to me, but wasn’t in a position to argue
it at the time. IIRC the monit thing may have been a leak specific to OS
X at the time. So hopefully the recent versions are the solution to
that. I should get a chance to look into that tonight.

Thanks.

– gw

S-gw · January 9, 2008, 10:56pm

Nathan V. wrote:

On Jan 9, 2008, at 3:09 PM, Greg W. wrote:

Yeah we have launchd monitoring monit, so that could explain that.

Y’know, you can just have launchd monitor mongrel. That probably
makes more sense than launchd watching monit watching mongrel

Yep. Now that I’ve been poking around and getting more familiar with
this setup and see that launchd can monitor those details, that seemed
like a logical thing to me, so now I have a “second.” The original guy
was just learning OS X at the time and was more familiar with monit as
part of his overall Rails deployment package.

– gw