Advice for HTML project

jnb · July 14, 2010, 1:38am

I'm helping my boss with some scripting for a web analysis research
project. He handles vocabulary and analysis, while I am using Ruby to
parse WARC files and the actual HTML.

Anyway, I’m still fairly new to Ruby. I did the WARC parsing, but I was
wondering what I should use for the HTML parser. (Didn’t want to
re-invent that wheel.) Some considerations:

- Mainly we just need to pull the content text out of the HTML.
- A few tags might have special weight or significance (h1, etc.).
- Unfortunately, nearly all the HTML is broken, because all our test
  data was provided by this software that truncates the data after a
  certain length.

jnb · July 14, 2010, 2:22am

Excerpts from Jonathan B.'s message of Wed Jul 14 01:38:31 +0200 2010:

> I'm helping my boss with some scripting for a web analysis research
> project. He handles vocabulary and analysis, while I am using Ruby to
> parse WARC files and the actual HTML.
>
> Anyway, I’m still fairly new to Ruby. I did the WARC parsing, but I was
> wondering what I should use for the HTML parser. (Didn’t want to
> re-invent that wheel.) Some considerations:

Google for Nokogiri. That’s one solution.

Marc W.