Non-Cache Handling for Bots

mperham · June 11, 2009, 12:54am

Hello,

when a bot visits our page, we want to create a response that is
different
than the response for a human. In particular we want to limit the
hierarchies of menus so that the bot doesnt think there are too may
tags/keywords for the site (and thus they dont get indexed) as well we
do
not display ads for bots.

We cache most of our pages using standard rails caching.

The problem is that when a bot visits the site, it will get the standard
cached page with the incorrect menus. We want to work around this so
that
bots do not get cached pages. (Yes, it is good if the bots get a
bot-specific cached page, but lets keep it simple so far).

We did try numerous the following:

using apache rewrite to add /robot to the URL. then mongrel will
never
find the cached page on disk. The problem here is that all the links in
the
page have /robot in front of them. So, having apache add /robot to the
URI
results in the PATH_INFO as-seen by rails to have /robot in it. Another
problem was that we effectivly duplicated the set of rules in routes.rb
for
the cases with /robot as a prefix.

We have 2 ideas:

when apache detects a bot, send the request to a non-caching
webserver.
Does anyonme know one of these?
I edited the mongrel source code in
mongrel-1.1.4/lib/mongrel/rails.rb
and added this kind of thing to the process method:

do_not_cache = KNOWN_ROBOT_AGENTS.detect{ |b|
user_agent.downcase.include?(b) } if user_agent

And then used that variable in the tests for @files.can_serve(…)

This works but we still want mongrel to serve static files as cached (so
the
rules above can take care of this too, it just gets more complicated to
check for /stylesheets, /.images etc.).

Question: is there a way to plug-in our own logic into the mogrel
process of
handling a request? And/or can we set up a specific mongrel to never
cache
(are there options for this)?

Any ideas are appreciated,

Mike

mperham · June 11, 2009, 3:24am

What kind of caches are you talking about?
Are these full page caches? The kind that get stored into /public?

My question is, can you instead of adding /robot to the URL when apache
finds the robot, can you instead change Apache’s DocumentRoot?
It seems to be that this would prevent apache from finding the cached
page. Also, if you actually point to another copy of your /public, you
could get the normal static pages…

I think perhaps since you are talking about changing mongrel’s caching
behaviour that you aren’t talking about the page caches that get stored
into /public. (Well, I’m rusty on terminology here)

–
Michael R. removed_email_address@domain.invalid
Director – Consumer Desktop Development, Simtone Corporation, Ottawa,
Canada
Personal: Michael Richardson's Directory

SIMtone Corporation fundamentally transforms computing into simple,
secure, and very low-cost network-provisioned services pervasively
accessible by everyone. Learn more at www.simtone.net and
www.SIMtoneVDU.com

mperham · June 11, 2009, 3:49am

I am talking about Rails standard page-caching mechanism. Rails by
default puts full pages into public/… and if mongrel sees them
there, it serves them (without running rails dispatch et al). This is
fine for normal but not good for the bot user agents.

Here is a new solution:

set rails to cache in public/cache
Use Apache rewrite to serve these files directly (if found)
If not found, pass to mongrel which will not find the cached files
either since MONGREL ONLY LOOKS IN public for cached files. Mongrel
does not honor the config.action_controller.page_cache_directory rails
setting
Rails processes the file and puts it into public/cache/…

…on the next request, apache serves from cache.

I am working on the reqwrite rules etc. for this.

Mike

mperham · June 11, 2009, 4:22am

“Mike” == Mike P. removed_email_address@domain.invalid writes:
Mike> I am talking about Rails standard page-caching mechanism.
Rails by default
Mike> puts full pages into public/… and if mongrel sees them
there, it serves
Mike> them (without running rails dispatch et al). This is fine for
normal but not
Mike> good for the bot user agents.

right, so that’s what I thought you were talking about.
Only, it’s not mongrel that serves up the pages, but Apache, usually.
In your apache config, you have something like:

    # Rewrite all non-static requests to cluster
    RewriteCond %{DOCUMENT_ROOT}/%{REQUEST_FILENAME} !-f
    RewriteRule ^/(.*)$ balancer://spartan_cluster%{REQUEST_URI}

[P,QSA,L]

which basically serves up any files found in /public, otherwise, punts
to the mongrel. I thought that rails put the files directly there for
apache to use/see. (there are caveats if your mongrel and apache do not
share the same file system, such as because they are on different
machines)

If you are telling me that actually mongrel does this, it’s news to me.

Mike> Here is a new solution:
Mike> 1) set rails to cache in public/cache
Mike> 2) Use Apache rewrite to serve these files directly (if found)
Mike> 3) If not found, pass to mongrel which will not find the

cached files either
Mike> since MONGREL ONLY LOOKS IN public for cached files. Mongrel
does not honor
Mike> the config.action_controller.page_cache_directory rails
setting
Mike> 4) Rails processes the file and puts it into public/cache/…

Mike> ...on the next request, apache serves from cache.

Mike> I am working on the reqwrite rules etc. for this.

So, basically have apache pick a different cache location when it sees a
robot.

–
Michael R. removed_email_address@domain.invalid
Director – Consumer Desktop Development, Simtone Corporation, Ottawa,
Canada
Personal: Michael Richardson's Directory

SIMtone Corporation fundamentally transforms computing into simple,
secure, and very low-cost network-provisioned services pervasively
accessible by everyone. Learn more at www.simtone.net and
www.SIMtoneVDU.com

mperham · June 11, 2009, 4:45am

“Serving up different results based on user agent may cause your site to
be
perceived as deceptive and removed from the Google index.”