Questions about blog scraping

dubstep · April 18, 2012, 7:10pm

Hello,

I need some advice on where to start digging (and how), because
I’m a bit confused.

I’d like to be able to grab all the content from a website. Using nokogiri
I can use XPath to get blog post content, among other things, from a single
web page. But I don’t have a clue where to start looking in order to crawl
an entire website, following every link that belongs to that site.

Is nokogiri the right tool or should I use something like mechanize? Can
you provide any hint on how to perform scraping on an entire website?
I’m interested in blogs mostly, wordpress and blogger platforms for the
time being.

Best Regards,

Panagiotis A.

atma · April 18, 2012, 7:59pm

Hello,

On Wed, Apr 18, 2012 at 1:10 PM, Panagiotis A.
[email protected] wrote:

Is nokogiri the right tool or should I use something like mechanize? Can
you provide any hint on how to perform scraping on an entire website? I’m
interested in blogs mostly, wordpress and blogger platforms for the time
being.
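
To make "scraping an entire website" a bit more concrete, here is a rough sketch of the usual approach: keep a queue of URLs, fetch each one, pull out the links that stay on the same host, and remember what you have already visited. Everything named here (`crawl`, `same_host_links`, `http_fetch`, the `max_pages` limit) is made up for illustration. The link extraction uses a plain regex only so the sketch needs nothing outside the standard library; in real use you would probably let Nokogiri find the `a` elements, or let Mechanize handle both the fetching and the link following.

```ruby
require 'set'
require 'uri'
require 'net/http'

# Hypothetical helper: absolute URLs of links in +html+ that stay on the
# same host as +base_url+. The regex is a dependency-free stand-in for a
# real HTML parser such as Nokogiri.
def same_host_links(html, base_url)
  host = URI.parse(base_url).host
  hrefs = html.scan(/<a\s[^>]*href\s*=\s*["']([^"'#]+)/i).flatten
  links = hrefs.map do |href|
    begin
      uri = URI.join(base_url, href.strip) # resolve relative links
      uri.host == host ? uri.to_s : nil    # drop external and mailto: links
    rescue URI::Error
      nil                                  # skip hrefs that are not valid URIs
    end
  end
  links.compact.uniq
end

# Hypothetical breadth-first crawl. +fetch+ is any callable that takes a URL
# and returns the page body as a String (or nil on failure).
def crawl(start_url, fetch:, max_pages: 50)
  visited = Set.new
  queue   = [start_url]
  pages   = {}
  until queue.empty? || pages.size >= max_pages
    url = queue.shift
    next if visited.include?(url)
    visited << url
    html = fetch.call(url)
    next if html.nil?
    pages[url] = html
    same_host_links(html, url).each do |link|
      queue << link unless visited.include?(link)
    end
  end
  pages
end

# One possible fetcher built on Net::HTTP. It does not follow redirects.
http_fetch = lambda do |url|
  begin
    Net::HTTP.get(URI(url))
  rescue StandardError
    nil
  end
end

# Usage (not run here, the URL is a placeholder):
# pages = crawl('http://example.com/', fetch: http_fetch, max_pages: 20)
```

The returned hash maps each visited URL to its HTML, so you can run your existing XPath queries over every page afterwards. Passing the fetcher in as an argument also makes it easy to add a delay between requests, or to try the crawl against canned pages first.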

I recommend starting by watching Ryan B.'s excellent screencasts: