Why does scrape.rb store only a few lines of html, rather than the
entire document? The html can be printed in its entirety, but how do I
persist it? It looks like I have a misplaced } on line 21, but moving
it to line 18 didn’t give better results.
rv:1.9.0.15) Gecko/2009102815 Ubuntu/9.04 (jaunty) Firefox/3.0.15"){|f|
18 item.html = f.readlines.join
19 item.save
20 pp item.url
21 }
22 end
thufir@ARRAKIS:~/projects/rss2mysql$
thufir@ARRAKIS:~/projects/rss2mysql$ mysql -u ruby -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 81
Server version: 5.0.75-0ubuntu10.2 (Ubuntu)
Type 'help;' or '\h' for help. Type '\c' to clear the buffer.
mysql> select url, html from rss2mysql.items;
+-----------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------+
| url                                                                                                                               | html                                          |
+-----------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------+
| http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/923db4577e5ffbfb/440cc76e1d3dc4f0?show_docid=440cc76e1d3dc4f0 | <link REL="SHORTCUT ICON" HREF="/groups/img/3 |
+-----------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------+
10 rows in set (0.00 sec)
mysql> Aborted
thufir@ARRAKIS:~/projects/rss2mysql$
thanks,
Thufir
Ok, thanks, that problem is solved by using text instead of string.
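For anyone who lands here with the same symptom, here is a rough sketch of what was probably going on. It assumes the html column was declared with a Rails-style `string` type (stored as VARCHAR(255) in MySQL) and that the server was not in strict mode, so anything past the column limit was silently dropped on insert. The code only simulates that cut in plain Ruby; `VARCHAR_LIMIT`, `store_in_string_column` and `store_in_text_column` are made-up names and not part of scrape.rb.

```ruby
# Hypothetical simulation, not taken from scrape.rb.
VARCHAR_LIMIT = 255 # assumed width of a `string` column in MySQL

# Mimics a non-strict insert into VARCHAR(255): characters past the limit are dropped.
def store_in_string_column(value)
  value[0, VARCHAR_LIMIT]
end

# A `text` column is assumed to be big enough to keep this page unchanged.
def store_in_text_column(value)
  value.dup
end

# Stand-in for a fetched page: a header line, 100 body lines, a footer line.
page = "<html>\n" + ("<p>line</p>\n" * 100) + "</html>\n"

puts page.length                          # size of the full document
puts store_in_string_column(page).length  # size after the simulated cut
puts store_in_text_column(page) == page   # whether the text column kept it whole
```

If the schema is managed with Rails migrations, the matching change would be something along the lines of `change_column :items, :html, :text`, where the table and column names are guessed from the query above.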
In terms of design, I’m considering the pros and cons of breaking the
html out into another table. Perhaps the scraped data should be with the
html in its own table? I doubt I’d see a performance differential for
the amount of data I’ll be working with, but does it matter whether a
relatively large text field, and perhaps about five string fields, are
added onto the existing table, or whether there’s a 1:1 relation to
another table? Also, perhaps there would be a 1:many relation between
the raw html and scraped data, but I’m not sure about that.
Also, I don’t want to accidentally re-fetch the html and end up with a
bunch of 404 error pages, so would it make sense to add a boolean
indicating whether html had been fetched? Or, just restrict fetching
(?) the html to when feeds are grabbed?
> In terms of design, I’m considering the pros and cons of breaking the
> html out into another table. Perhaps the scraped data should be with the
> html in its own table? I doubt I’d see a performance differential for
> the amount of data I’ll be working with, but does it matter whether a
> relatively large text field, and perhaps about five string fields, are
> added onto the existing table, or whether there’s a 1:1 relation to
> another table?
It depends on the conceptual structure of the application. I doubt that
performance would be enough of an issue to worry about.
> Also, perhaps there would be a 1:many relation between
> the raw html and scraped data, but I’m not sure about that.
Depends on the data!
> Also, I don’t want to accidentally re-fetch the html and end up with a
> bunch of 404 error pages, so would it make sense to add a boolean
> indicating whether html had been fetched? Or, just restrict fetching
> (?) the html to when feeds are grabbed?
Do you need a boolean? Just test whether the HTML field is null.
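As a minimal sketch of that null test, assuming each item exposes `url` and `html` as in the query output earlier in the thread: the `Item` struct and the `needs_fetch` helper below are hypothetical stand-ins for whatever model scrape.rb actually uses, and the example.com URLs are placeholders.

```ruby
# Hypothetical stand-in for the items model; only url and html come from the thread.
Item = Struct.new(:url, :html)

# Items whose html was never stored are the only ones that still need fetching.
def needs_fetch(items)
  items.select { |item| item.html.nil? }
end

items = [
  Item.new("http://example.com/a", "<html>...</html>"), # already fetched
  Item.new("http://example.com/b", nil)                 # not fetched yet
]

# Print the urls that would be fetched on the next run.
needs_fetch(items).each { |item| puts item.url }
```

Against the database itself, the same check is a plain `SELECT url FROM items WHERE html IS NULL;`.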
On Nov 13, 8:33 pm, Marnen Laibow-Koser wrote:
[…]
>> In terms of design, I’m considering the pros and cons of breaking the
>> html out into another table.
[…]
> It depends on the conceptual structure of the application. I doubt that
> performance would be enough of an issue to worry about.
Yeah, figured as much. Partly as an exercise, I’ll try for a base
model of “items” with a related model of “pages.”
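A rough plain-Ruby sketch of that split, with no ORM involved: `FeedItem` and `Page` are hypothetical names standing in for the "items" and "pages" models, and the example.com URL is a placeholder. With an ORM such as ActiveRecord, the 1:1 link would normally be declared on the models rather than held in a struct field.

```ruby
# Hypothetical plain-Ruby stand-ins; a real version would be two ORM models.
Page = Struct.new(:html)

FeedItem = Struct.new(:url, :page) do
  # The html lives on the related page, so a missing page means "not fetched yet".
  def html
    page && page.html
  end
end

item = FeedItem.new("http://example.com/post")
puts item.html.inspect # no page attached yet

item.page = Page.new("<html>...</html>")
puts item.html         # html now comes from the related page
```

One side effect of this layout is that the absence of a page row can serve as the "not yet fetched" marker, so no separate boolean is needed.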
[…]
>> Also, I don’t want to accidentally re-fetch the html and end up with a
>> bunch of 404 error pages
[…]
> Do you need a boolean? Just test whether the HTML field is null.
Ah, right.
-Thufir