Scrape html gives partial result

Thufir · November 13, 2009, 9:30am

Why does scrape.rb store only a few lines of html, rather than the
entire document? The html can be printed in its entirety, but how do I
persist it? It looks like I have a misplaced } on line 21, but moving
it to line 18 didn’t give better results.

thufir@ARRAKIS:~/projects/rss2mysql$
thufir@ARRAKIS:~/projects/rss2mysql$ ruby scrape.rb
"http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/923db4577e5ffbfb/440cc76e1d3dc4f0?show_docid=440cc76e1d3dc4f0"
"http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/a3bd76032df6507d/37d98a0d3efeaae4?show_docid=37d98a0d3efeaae4"
"http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/0c6d521382cda99d/244a1c70d6ea0878?show_docid=244a1c70d6ea0878"
"http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/1f6885d8416db1a6/260089ad5b9e133b?show_docid=260089ad5b9e133b"
"http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/f5031e66d7819c94/ee4025141f0e926c?show_docid=ee4025141f0e926c"
"http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/725d2b507b595cd2/cf5df2ad24b92ed8?show_docid=cf5df2ad24b92ed8"
"http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/a5a8597adf18bc65/11496209df7695d1?show_docid=11496209df7695d1"
"http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/b357b950b39c0c06/d7612a26a60056b0?show_docid=d7612a26a60056b0"
"http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/b357b950b39c0c06/db37b5cf1c92ffa0?show_docid=db37b5cf1c92ffa0"
"http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/4146e867f9efc1c1/64ee4b38fc3f3e78?show_docid=64ee4b38fc3f3e78"
thufir@ARRAKIS:~/projects/rss2mysql$
thufir@ARRAKIS:~/projects/rss2mysql$ nl scrape.rb
 1  require 'rubygems'
 2  require 'activerecord'
 3  require 'yaml'
 4  require 'item'
 5  require 'open-uri'
 6  require 'pp'

 7  db = YAML::load(File.open('database.yml'))

 8  ActiveRecord::Base.establish_connection(
 9  :adapter  => db["development"]["adapter"],
10  :host   => db["development"]["host"],
11  :username => db["development"]["username"],
12  :password => db["development"]["password"],
13  :database => db["development"]["database"])


14  items = Item.find(:all)

15  items.each do |item|
16    open(item.url,
17    "User-Agent" => "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.15) Gecko/2009102815 Ubuntu/9.04 (jaunty) Firefox/3.0.15"){|f|
18      item.html = f.readlines.join
19      item.save
20      pp item.url
21    }
22  end
thufir@ARRAKIS:~/projects/rss2mysql$
thufir@ARRAKIS:~/projects/rss2mysql$ mysql -u ruby -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 81
Server version: 5.0.75-0ubuntu10.2 (Ubuntu)

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> select url, html from rss2mysql.items;
+----------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| url | html |
+----------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/923db4577e5ffbfb/440cc76e1d3dc4f0?show_docid=440cc76e1d3dc4f0 |

Thufir · November 14, 2009, 4:11am

On Fri, Nov 13, 2009 at 3:27 AM, Thufir [email protected] wrote:

Why does scrape.rb just result in a few lines of html, rather than an
entire document? The html can be printed in its entirety, but how to
persist it?

mysql> describe rss2mysql.items;

http://dizzy.co.uk/ruby_on_rails/cheatsheets/rails-migrations#database_mapping

Thufir · November 14, 2009, 5:28am

On Nov 13, 7:10 pm, [email protected] wrote:

On Fri, Nov 13, 2009 at 3:27 AM, Thufir [email protected] wrote:

Why does scrape.rb just result in a few lines of html, rather than an
entire document? The html can be printed in its entirety, but how to
persist it?

mysql> describe rss2mysql.items;

http://dizzy.co.uk/ruby_on_rails/cheatsheets/rails-migrations#databas…

Ok, thanks, that problem is solved by using text instead of string.
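
For anyone hitting the same symptom: on MySQL, a Rails `string` column is a VARCHAR(255) by default, and a non-strict MySQL server quietly cuts longer values down to fit. The sketch below only simulates that behaviour in plain Ruby so the effect is easy to see; `store_in_string_column` and `likely_truncated?` are hypothetical helpers made up for illustration, not part of ActiveRecord or of scrape.rb.

```ruby
# Assumption: the html column was created with Rails' :string type, which
# the MySQL adapter maps to VARCHAR(255) by default.
STRING_COLUMN_LIMIT = 255

# Simulates what a non-strict MySQL server does with an oversized value:
# it keeps the first `limit` characters and drops the rest.
def store_in_string_column(value, limit = STRING_COLUMN_LIMIT)
  value[0, limit]
end

# Hypothetical check: a stored value that is exactly `limit` characters
# long is a strong hint (not proof) that it was cut off on the way in.
def likely_truncated?(stored, limit = STRING_COLUMN_LIMIT)
  !stored.nil? && stored.length == limit
end

page   = "<html>" + ("x" * 1000) + "</html>"
stored = store_in_string_column(page)

puts stored.length              # length of what would survive in the column
puts likely_truncated?(stored)  # whether it looks cut off at the limit

# The fix reported above, as it might look inside a migration. The table
# and column names are taken from the query earlier in the thread:
#   change_column :items, :html, :text
```

Rows that were saved before the column type changed would still hold the shortened html, so they would need to be fetched again.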

In terms of design, I’m considering the pros/cons of breaking the
html off into another table. Perhaps the scraped data should be with the
html in its own table? I doubt I’d see a performance differential for
the amount of data I’ll be working with, but does it matter whether a
relatively large text field and perhaps about five string fields are
added onto the existing table, or whether there’s a 1:1 relation to
another table? Also, perhaps there would be a 1:many relation between
the raw html and scraped data, but I’m not sure about that.

Also, I don’t want to accidentally re-fetch the html and end up with a
bunch of 404 error pages, so would it make sense to add a boolean
indicating whether html had been fetched? Or, just restrict fetching
(?) the html to when feeds are grabbed?

thanks,

Thufir

Thufir · November 14, 2009, 5:33am

Thufir wrote:
[…]

In terms of design, I’m considering the pros/cons of breaking the
html off into another table. Perhaps the scraped data should be with the
html in its own table? I doubt I’d see a performance differential for
the amount of data I’ll be working with, but does it matter whether a
relatively large text field and perhaps about five string fields are
added onto the existing table, or whether there’s a 1:1 relation to
another table?

It depends on the conceptual structure of the application. I doubt that
performance would be enough of an issue to worry about.

Also, perhaps there would be a 1:many relation between
the raw html and scraped data, but I’m not sure about that.

Depends on the data!

Also, I don’t want to accidentally re-fetch the html and end up with a
bunch of 404 error pages, so would it make sense to add a boolean
indicating whether html had been fetched? Or, just restrict fetching
(?) the html to when feeds are grabbed?

Do you need a boolean? Just test whether the HTML field is null.
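
A minimal sketch of that idea, assuming the fetch loop is changed to skip any item whose html is already set. `ItemStub` is a hypothetical stand-in for the ActiveRecord `Item` model so the example runs without a database, and `fetch_missing_html` and its `fetcher` argument are names invented here. Note that open-uri raises `OpenURI::HTTPError` on a 404 rather than handing back the error page, so rescuing it leaves html as nil and the item gets retried on the next run.

```ruby
require 'open-uri'

# Hypothetical stand-in for the Item model; the real one would be an
# ActiveRecord class with url and html columns.
ItemStub = Struct.new(:url, :html) do
  def save
    true # the real model would write to the database here
  end
end

# Fetches html only for items that have none stored yet.
# `fetcher` is any callable that takes a URL and returns the page body.
def fetch_missing_html(items, fetcher)
  items.select { |item| item.html.nil? }.each do |item|
    begin
      item.html = fetcher.call(item.url)
      item.save
    rescue OpenURI::HTTPError => e
      # 404s and other HTTP errors end up here; html stays nil.
      warn "skipping #{item.url}: #{e.message}"
    end
  end
end

# A real fetcher might look like this on current Ruby (in 2009 it would
# have been Kernel#open, as in scrape.rb):
#   fetcher = lambda { |url| URI.open(url, "User-Agent" => "Mozilla/5.0").read }
```

With ActiveRecord the same filter could probably be pushed into the query so that only rows with a NULL html column are loaded at all.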

thanks,

Thufir

Best,

Marnen Laibow-Koser
http://www.marnen.org

Thufir · November 15, 2009, 9:46am

On Nov 13, 8:33 pm, Marnen Laibow-Koser [email protected] wrote:
[…]

In terms of design, I’m considering the pros/cons of breaking the
html off into another table.
[…]
It depends on the conceptual structure of the application. I doubt that
performance would be enough of an issue to worry about.

Yeah, figured as much. Partly as an exercise, I’ll try for a base
model of “items” with a related model of “pages.”

[…]

Also, I don’t want to accidentally re-fetch the html and end up with a
bunch of 404 error pages
[…]
Do you need a boolean? Just test whether the HTML field is null.

Ah, right.

-Thufir