I'm helping my boss with some scripting for a web analysis research
project. He handles vocabulary and analysis, while I am using Ruby to parse
the WARC files and the actual HTML.
Anyway, I’m still fairly new to Ruby. I did the WARC parsing, but I was
wondering what I should use for the HTML parser. (Didn’t want to
re-invent that wheel.) Some considerations:
- Mainly we just need to pull the content text out of the HTML
- A few tags might have special weight or significance (h1, etc.)
- Unfortunately, nearly all the HTML is broken, because all our test
data was produced by software that truncates each document after a
certain length.
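
To make those requirements concrete, here is a rough, stdlib-only sketch of the behavior I'm after. It is not a real HTML parser, just a tolerant scanner built on `StringScanner`. The names `extract_text` and `WEIGHTED_TAGS`, and the choice of h1 through h3 as the "special" tags, are my own placeholders. It treats a tag, comment, or script/style block that is cut off by the end of the input as if it had been closed there, and it does not decode entities like `&amp;`.

```ruby
require "strscan"

# Placeholder list of tags whose text gets collected separately (assumption).
WEIGHTED_TAGS = %w[h1 h2 h3].freeze

# Hypothetical helper: pulls visible text out of possibly truncated HTML.
# Returns { text: "...", weighted: { "h1" => ["..."] } }.
def extract_text(html, weighted_tags: WEIGHTED_TAGS)
  scanner  = StringScanner.new(html)
  text     = +""
  weighted = Hash.new { |hash, tag| hash[tag] = [] }
  current  = nil   # weighted tag we are currently inside, if any
  buffer   = +""   # text collected since that tag was opened

  squeeze = ->(s) { s.gsub(/\s+/, " ").strip }

  until scanner.eos?
    if scanner.scan(/<!--.*?(?:-->|\z)/m) ||
       scanner.scan(/<(script|style)\b.*?(?:<\/\1\s*>|\z)/mi)
      # Comments and script/style bodies are dropped, even if cut off.
      piece = " "
    elsif (tag = scanner.scan(/<\/?([a-zA-Z][a-zA-Z0-9]*)[^>]*(?:>|\z)/))
      # Any other tag, complete or truncated by the end of the input.
      name = scanner[1].downcase
      if tag.start_with?("</")
        if current == name
          weighted[name] << squeeze.call(buffer)
          current = nil
        end
      elsif weighted_tags.include?(name)
        current = name
        buffer  = +""
      end
      piece = " " # a tag becomes a space so neighboring words do not merge
    else
      # Plain text, or a stray "<" that does not start a tag.
      piece = scanner.scan(/[^<]+/) || scanner.getch
    end
    text << piece
    buffer << piece if current
  end

  # Input ended inside a weighted tag: keep what we have.
  weighted[current] << squeeze.call(buffer) if current
  { text: squeeze.call(text), weighted: weighted }
end
```

For example, `extract_text('<h1>Main Heading</h1><p>Some <b>bold</b> text and a cut-off <a href="http://exam')` should give the text `"Main Heading Some bold text and a cut-off"` with `"Main Heading"` listed under `"h1"`. One known trade-off: because every tag becomes a space, a word split by inline markup (like `wor<b>ld</b>`) comes out as two words.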