Slow spam checking makes Baby Jesus cry

RBL spam checks on the machine my blog is running on are slower than a
very slow thing on a slow day. Which does bad things, because a
fastcgi process that’s tied up doing dns lookups and the like is a
fastcgi process that can’t let the user know that their comment has
been accepted and is being checked for spam. Which means they hit
submit again. And again. And again. And who can blame them?

It’s also a fastcgi process that can’t serve any requests. And fastcgi
processes that can’t serve requests are just the sort of thing that
hosting services’ resource limiters don’t like to see.

So, we need some way of getting back to the user and accepting the
next request ASAP.

My current thinking is to tweak the behaviour of
ContentState::Undefined so that instead of doing the spam check before
saving the content and then changing the feedback state, it will have
an after_save that looks something like:

def after_save(content)
  t = Thread.new(content) do |feedback|
    classify(feedback)
    feedback.save
  end
  t.join if defined? $TESTING
end

The catch is, IPSocket.getaddress, which is what we use for DNS
lookups, appears to be a blocking call, which, given the nature of
Ruby threads, means it’ll still hold up processing during the
lookup.

Thoughts?

On Aug 20, 2006, at 1:02 PM, Piers C. wrote:

The catch is, IPSocket.getaddress, which is what we use for DNS
lookups, appears to be a blocking call, which, given the nature of
Ruby threads, means it’ll still hold up processing during the
lookup.

Thoughts?

What about using BackgrounDRb to do the spam processing
asynchronously? Toss the comments to be processed into a queue in the
DB, fire up a BDRb worker, and walk away.


Josh S.
http://blog.hasmanythrough.com
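
For illustration, the queue-and-worker idea can be sketched with nothing but the standard library. This is not BackgrounDRb's API: the in-process Queue stands in for the DB table, the thread stands in for the BDRb worker, and the class name SpamCheckQueue and its methods are made up for the example.

```ruby
require 'thread'

# Hypothetical stand-in for a DB-backed queue of comments awaiting a spam check.
class SpamCheckQueue
  def initialize(&classifier)
    @jobs = Queue.new
    @classifier = classifier
    @worker = Thread.new { drain }
  end

  # Called from the request path: hands the comment over and returns at once.
  def enqueue(comment)
    @jobs << comment
  end

  # Stop the worker once everything queued so far has been classified.
  def shutdown
    @jobs << :stop
    @worker.join
  end

  private

  # Worker loop: block until a job arrives, classify it, repeat until :stop.
  def drain
    while (job = @jobs.pop) != :stop
      @classifier.call(job)
    end
  end
end
```

The request handler only ever calls enqueue, so it can answer the user before any classification has happened.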

Piers C. writes:

The catch is, IPSocket.getaddress, which is what we use for DNS
lookups, appears to be a blocking call, which, given the nature of
Ruby threads, means it’ll still hold up processing during the
lookup.

Thoughts?

Hmm… resolv.rb may be our friend. And (bonus points!) it’s in the
standard library.
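
A minimal sketch of what an RBL lookup through resolv.rb might look like. Resolv is written in Ruby rather than wrapping the C resolver, so a lookup need not stall every other Ruby thread. The helper names, the two-second timeout, and the zone rbl.example.org are placeholders for the example, not anything from the existing code.

```ruby
require 'resolv'

# Build the DNSBL query name: the IPv4 octets reversed, then the list's zone.
def rbl_query_name(ip, zone)
  "#{ip.split('.').reverse.join('.')}.#{zone}"
end

# True if the list returns an address for this IP, false if it does not
# (or if the lookup fails or times out). `timeout` is in seconds.
def rbl_listed?(ip, zone, timeout = 2)
  Resolv::DNS.open do |dns|
    dns.timeouts = timeout
    dns.getaddress(rbl_query_name(ip, zone))
    true
  end
rescue Resolv::ResolvError, Resolv::ResolvTimeout
  false
end
```

Treating a timeout as "not listed" is a design choice for the sketch; a slow list then costs at most the timeout instead of the best part of a minute.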

Josh S. writes:

DB, fire up a BDRb worker, and walk away.

A nice idea to be sure, but my hosting service is quite picky about
long running processes that aren’t fast cgi processes launched by
their copy of Apache. I’m happy enough to catch a SIGTERM (SIGKILL
can’t be trapped) and do a Thread.list.each {|t| t.join}, from within
dispatch.fcgi, but that’s about as far as it goes.
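
A rough sketch of that join-on-the-way-out step. The method name and the five-second limit are invented for the example, and the at_exit wiring is only an assumption about where it might be hooked in. Note that Thread.list includes the calling thread, which cannot join itself, so the sketch skips it.

```ruby
# Wait for every thread except the caller to finish, giving each one at most
# `limit` seconds. Thread#join re-raises an exception that killed the thread.
def finish_background_threads(limit = 5)
  (Thread.list - [Thread.current]).each { |t| t.join(limit) }
end

# Assumed wiring: run it as the process shuts down.
# at_exit { finish_background_threads }
```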

Piers C. writes:

standard library.

And, following further work on using threads, I’d just like to say
that threading is weird and is currently doing my head in. Background
classification sort of works, but only very approximately.

Piers C. wrote:

Hmm… resolv.rb may be our friend. And (bonus points!) it’s in the
standard library.

And, following further work on using threads, I’d just like to say
that threading is weird and is currently doing my head in. Background
classification sort of works, but only very approximately.

Would it not be possible to take the same approach as mod_perl and
schedule some code to run immediately after the current request, but
before the next one (a cleanup handler)? It would still mean that
particular fastcgi process was unavailable, but it would mean that you
could return to the user quicker.

Unfortunately, a quick look in railties at the dispatcher shows that
rails doesn’t offer anything like this. Damn.

-Dom
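
To make the cleanup-handler idea concrete, here is a small sketch of the bookkeeping involved. It is not a Rails or mod_perl API: the CleanupHandlers class is hypothetical, and something (a patched dispatcher, say) would have to call run once the response has gone out.

```ruby
# Hypothetical per-process registry of work to run once the response is out.
class CleanupHandlers
  def initialize
    @handlers = []
  end

  # Register a block while the request is being handled.
  def defer(&block)
    @handlers << block
  end

  # Run and clear the registered blocks; meant to be called after the
  # response has been sent and before the next request is accepted.
  def run
    handlers, @handlers = @handlers, []
    handlers.each(&:call)
  end
end
```

The fastcgi process is still busy while run executes; the gain is only that the user gets an answer first.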

On 8/20/06, Piers C. wrote:

lookup.
schedule some code to run immediately after the current request, but

Monkey patching the dispatcher to allow it isn’t exactly hard. But see above.

This brings up a reasonably obvious question that I haven’t seen asked
yet–why not turn off RBL and let Akismet take care of it? Akismet
seems pretty fast, and I’ve been happy with its error rate so far.

I’ve never been a big RBL fan, for the usual reasons–all RBL lists
seem to eventually descend into filtering for spam politics more than
spam itself–ISP X doesn’t have a good abuse policy, so we’re going to
block all of their addresses to force a change. I’d much rather block
based on content than politics :-).

Scott

Dominic M. writes:

Thoughts?
particular fastcgi process was unavailable, but it would mean that you
could return to the user quicker.

Sadly not. Some of the spam checks I’ve seen in my logs have taken the
best part of a minute. Tie up a couple of dispatchers like that and
your performance is screwed.

Unfortunately, a quick look in railties at the dispatcher shows that
rails doesn’t offer anything like this. Damn.

Monkey patching the dispatcher to allow it isn’t exactly hard. But see
above.

“Scott L.” writes:

On 8/20/06, Piers C. wrote:

Monkey patching the dispatcher to allow it isn’t exactly hard. But see above.

This brings up a reasonably obvious question that I haven’t seen asked
yet–why not turn off RBL and let Akismet take care of it? Akismet
seems pretty fast, and I’ve been happy with its error rate so far.

Which is what I’ve done on www.bofh.org.uk by emptying the RBL server
list.

I’ve never been a big RBL fan, for the usual reasons–all RBL lists
seem to eventually descend into filtering for spam politics more than
spam itself–ISP X doesn’t have a good abuse policy, so we’re going to
block all of their addresses to force a change. I’d much rather block
based on content than politics :-).

Nor me, for pretty much the same reasons.

But then, I’ve been thinking about spam checking in general and what
we’re currently doing is almost an implementation of the Chain of
Responsibility, but it’s statically configured. If my hunch is right,
spam checking is yet another thing we can modularize and shove into a
vendor/plugins directory.

(Thus allowing those who want them to add captcha based ‘protection’
to their blog).

The fun begins when you start to think about how this chain of
responsibility could be used to handle spam reporting. For instance,
it would be nice to have a spam check engine that extracts possible
blacklist/whitelist patterns from feedback and offers 'em up to the
user for consideration before passing on to the next item in the
chain.
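
As a rough illustration of a dynamically configured chain, here is one way it could look. Everything in it is hypothetical: the SpamCheckChain class, the :spam / :ham / nil convention, the two example checkers, and the assumption that feedback is a Hash with :author and :body.

```ruby
# Each checker responds to call(feedback) and returns :spam, :ham, or nil,
# where nil means "no opinion, ask the next one in the chain".
class SpamCheckChain
  def initialize(*checkers)
    @checkers = checkers
  end

  # The first non-nil verdict wins; :unknown if nobody has an opinion.
  def classify(feedback)
    @checkers.each do |checker|
      verdict = checker.call(feedback)
      return verdict if verdict
    end
    :unknown
  end
end

# Hypothetical checkers, cheapest first.
whitelist  = ->(f) { :ham if f[:author] == 'trusted@example.org' }
link_count = ->(f) { :spam if f[:body].scan('http://').size > 3 }

chain = SpamCheckChain.new(whitelist, link_count)
```

Because the chain is just a list of objects, a plugin could add a checker (a captcha, or a slow RBL lookup placed last) without touching the others.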