The top link on del.icio.us is a site with all the Calvin + Hobbes
strips. I thought I’d download them before they get taken down. Here’s
the code if you want, it’s very short. Newer people can also see how
easy it is to use open-uri for simple web scraping.
Before running this consider buying the comics – what is your
motivation to avoid paying for them? If it’s bad, don’t do it. (I own
them all in paper already and want an electronic version.) Also create
the c+h_archives folder or change the output path. FYI the images total
about 112 megs. There’s 3691 of them.
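require 'open-uri'

# Assumption: base_url was not defined in the post; derived from the index URL.
base_url = "http://www.marcellosendos.ch/comics/ch/"

# URI.open needs Ruby 2.5+; older versions used plain open.
URI.open("http://www.marcellosendos.ch/comics/ch/index.html") do |index|
  index.read.scan(/A href="(1.+?)">/).each do |archive_page_link|
    archive_page_link = base_url + archive_page_link[0]
    # strip the trailing file name to get the directory the images live in
    base_image_url = archive_page_link.gsub(/\/\w+\.\w+$/, "/")
    URI.open(archive_page_link) do |archive_page|
      archive_page.read.scan(/src="(.+?\.gif)">/).each do |img|
        img_url = base_image_url + img[0]
        begin
          URI.open(img_url) do |image_file|
            # "wb": the images are binary data
            File.open("c+h_archives/#{img[0]}", "wb") do |local_file|
              local_file.write(image_file.read)
            end
          end
        rescue StandardError
          # there's five broken image links
          puts "failed to get #{img_url}"
        end
      end
    end
  end
end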
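For anyone wondering why the script indexes into img[0]: String#scan with a capture group returns an array of arrays, one inner array per match. Here is a small sketch on made-up markup (the file names and the SAMPLE_HTML constant are hypothetical, not taken from the real site), so it can be tried without hitting anyone's server:

```ruby
# Hypothetical sample markup for illustration; the real pages may differ.
SAMPLE_HTML = <<~HTML
  <IMG src="ch851118.gif">
  <IMG src="ch851119.gif">
  <A href="index.html">back</A>
HTML

# Returns the captured .gif names as a flat array of strings.
def gif_names(html)
  html.scan(/src="(.+?\.gif)">/).map { |captures| captures[0] }
end

p SAMPLE_HTML.scan(/src="(.+?\.gif)">/)  # => [["ch851118.gif"], ["ch851119.gif"]]
p gif_names(SAMPLE_HTML)                 # => ["ch851118.gif", "ch851119.gif"]
```

Testing the regex on a saved copy of one page like this also avoids re-fetching the site while debugging.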
> The top link on del.icio.us is a site with all the Calvin + Hobbes
> strips. I thought I’d download them before they get taken down. Here’s
> the code if you want, it’s very short. Newer people can also see how
> easy it is to use open-uri for simple web scraping.
> Before running this consider buying the comics
No, first consider the people hosting the content you’re snarfing.
They’re footing the bill for bandwidth and hosting.
> … FYI the images total
> about 112 megs. There’s 3691 of them.
And not a single “sleep” in the script. Nice.
I see this sort of shit on ruby-doc.org, spiders ruthlessly fetching
every page on a site, one right after another.
> > And not a single “sleep” in the script. Nice.
> Hi James,
> It’s a good thing I posted. I will remember to put a sleep next time.
> Thank you.
> Oh. How much sleep is best?
60*60*24 might work.
> One second per image would add an hour to
> the script run time.
Gosh! Imagine having to wait a whole hour to glom someone else’s
content!
> I don’t have a sense of how much is needed. 5
> seconds? .5 seconds? Is requests per time or volume of data per time
> more important to limit?
You’re encouraging people to download 112 MB via 3691 requests from
someone else’s Web site.
Right now, the only thing I see being limited is courtesy.
If you abuse a Web site you may have your IP address banned.
Sadly, most people running sites do not have the technical chops to
catch such behavior and cut people off before too much damage is done.
More likely, the target site will either go off-line for excessive
bandwidth, or the owner will get a surprise bill for overages.
There are often very good reasons to spider a site and grab content.
When needed, it must be done in a responsible way. Your example fails
that, both in motivation and technique.
–
James B.
“Simplicity of the language is not what matters, but
simplicity of use.”
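For what it's worth, here is a rough sketch of what a more considerate fetch loop might look like: one request at a time, a pause after each, and no re-downloading of files already on disk. The method name polite_download, its parameters, and the 1-second default are all made up for illustration, not a recommendation from anyone in this thread; it assumes Ruby 2.5+ for URI.open.

```ruby
require 'open-uri'

# Hypothetical helper: fetches each URL in turn, pausing `delay` seconds
# after every request and skipping files that already exist in `dir`.
# `fetch` and `pause` are injectable so the loop can be tried offline.
def polite_download(urls, dir:, delay: 1.0,
                    fetch: ->(url) { URI.open(url, &:read) },
                    pause: ->(seconds) { sleep(seconds) })
  saved = []
  urls.each do |url|
    path = File.join(dir, File.basename(url))
    next if File.exist?(path)      # never re-request what we already have
    begin
      File.binwrite(path, fetch.call(url))
      saved << path
    rescue StandardError => e
      warn "failed to get #{url}: #{e.message}"
    end
    pause.call(delay)              # wait after every request, success or not
  end
  saved
end
```

Skipping existing files matters as much as the sleep: if the script dies halfway, a rerun only asks the server for what is still missing.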
On Sun, Aug 19, 2007 at 09:29:03AM +0900, James B. wrote:
Elliot T. wrote:
> I don’t have a sense of how much is needed. 5 seconds? .5
> seconds? Is requests per time or volume of data per time
> more important to limit?
James B. wrote:
> You’re encouraging people to download 112 MB via 3691 requests from
> someone else’s Web site.
> Right now, the only thing I see being limited is courtesy.
I’m no expert, but it seems to me that Mr. Britt makes a
reasonable point. I’d be interested to know whether
Mr. Temple’s comment about 5 seconds/.5 seconds was meant
simply as a genuinely “open” question, or whether it was
intended as a comment of some kind.
> No, first consider the people hosting the content you’re snarfing.
> They’re footing the bill for bandwidth and hosting.
> > … FYI the images total
> > about 112 megs. There’s 3691 of them.
> And not a single “sleep” in the script. Nice.
Hi James,
It’s a good thing I posted. I will remember to put a sleep next time.
Thank you.
Oh. How much sleep is best? One second per image would add an hour to
the script run time. I don’t have a sense of how much is needed. 5
seconds? .5 seconds? Is requests per time or volume of data per time
more important to limit?
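To put numbers on the run-time question, here is a quick back-of-the-envelope sketch. It only uses the 3691 image count from the original post; the helper name added_minutes and the candidate delays are just for illustration, and it says nothing about which delay the site owner would actually consider acceptable.

```ruby
IMAGE_COUNT = 3691  # figure quoted in the original post

# Extra minutes spent sleeping if we pause `delay` seconds after each image.
def added_minutes(delay, count = IMAGE_COUNT)
  (count * delay / 60.0).round(1)
end

[0.5, 1, 5].each do |delay|
  puts "#{delay}s per image adds about #{added_minutes(delay)} minutes"
end
# prints:
# 0.5s per image adds about 30.8 minutes
# 1s per image adds about 61.5 minutes
# 5s per image adds about 307.6 minutes
```

So one second per image is indeed about an hour, and five seconds is a bit over five hours, which is still fine for a job left running unattended.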
> B. Buy the books. They’re cheap in used bookstores! It’s a heck of a lot
> less work than writing a script. That said, how many times can you or
> will you possibly read them? How much is your time worth to you?
The ultimate punchline: Calvin has just destroyed Susie D’s snowman, and
he’s sprawled face-down in the snow. Susie, holding the snowman’s head
over him, says, “Calvin, look up!”
A. They’re probably hosting Calvin & Hobbes strips illegally, so they
get what they get. But in general, if you publish or make public
something, even if held open house in your home, you deal with the
traffic or quit.
B. Buy the books. They’re cheap in used bookstores! It’s a heck of a
lot less work than writing a script. That said, how many times can
you or will you possibly read them? How much is your time worth to you?
> The top link on del.icio.us is a site with all the Calvin + Hobbes
> strips. I thought I’d download them before they get taken down. Here’s
> the code if you want, it’s very short. Newer people can also see how
> easy it is to use open-uri for simple web scraping.
Hi Elliot, thanks for the script. Setting aside the ethics of using
it, it is certainly an interesting script. I thought about doing a simpler
version using Hpricot or scRUBYt, but right now your script always
says “failed to get…”.
Is it just me, or have they taken measures to prevent direct downloading?
> A. They’re probably hosting Calvin & Hobbes strips illegally, so they
> get what they get. But in general, if you publish or make public
> something, even if held open house in your home, you deal with the
> traffic or quit.
They certainly are, and I know from past experience (supporting a site
that syndicated C&H) that the copyright holders are very protective of
their content. Best to just not mess with it.
> B. Buy the books. They’re cheap in used bookstores! It’s a heck of a
> lot less work than writing a script. That said, how many times can
> you or will you possibly read them? How much is your time worth to you?
Hear hear.
Ben