On 17/12/2006, at 11:15 PM, Paul L. wrote:
it would be nice if hpricot behaved like a browser.
Paul,
before I address your response directly I will say that I am aware of
your crusade against html parsing libraries and while I believe you
are entitled to your opinion, I disagree with it. I have done enough
of this sort of thing to know that, for me, the level of abstraction
that these libraries give is beneficial in both development time and
maintenance. I am neither an html nuby, nor a ruby nuby. I am also
aware that my needs may not match those of someone else, so I’m not
going to ram my opinions down their throat every time they ask for a
little help.
You have created a new thread, and you have not attached any prior
text.
This requires us to start over.
As this is the first time I have posted on this subject, that much is
obvious. Unless I am missing something.
Tell us what you hoped would happen, what happened instead, and how
they differ.
Run the script and that too will be obvious.
If your goal is to filter particular content from HTML pages, just
say so, and be specific about what you want and don’t want. Given this
information, I will show you how to extract the desired content with a
few lines of Ruby, no fuss, no undue complexity, no Hpricot.
My goal is to highlight an issue I found with a particular library
and provide some sample code that shows the problem with the minimum
amount of code. I posted it here so that there may be some discussion
with interested people as to the desired behaviour.
IIRC, you had asked for help using Hpricot to extract text between
and tag pairs, but with the added requirement that there be an IMG tag
within the ... tag pair to validate the case. Is this still the goal?
If so, how did my previously posted, simple solution work out for you?
What IMG tag? There isn’t one in the sample code. What previous
solution? You do not recall correctly.
This is a scene in a much larger play, one in which someone says,
“Wow, I had no idea there was such a powerful library, so carefully
designed, so complete. But, notwithstanding its extraordinary
features, notwithstanding the hundreds of man-hours expended creating
it … I can’t get it to do what I want.”
The incident that prompted my post went thus…
I had a page that seemed to render fine in a browser but when parsing
it my code failed. I inspected the html and found a malformed comment
to be the problem. Probably put there to stop screen scraping. I
wrote a bit of code, using regexps no less, that removed the
offending comment and hpricot then went on its merry way. Job done.
I thought others may be interested so I posted some sample code. I am
now regretting that decision.
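A minimal sketch of that kind of regexp pre-filter follows. It is not the
code from the original script: the helper name strip_comments, the sample
page, and the pattern are all assumptions. In particular it assumes the
malformed comment opens with "<!--" and ends at the next ">", which also
means a well-formed comment containing ">" in its body would be cut short.

```ruby
# Hypothetical helper: name and pattern are illustrative assumptions.
# Removes anything from "<!--" up to and including the next ">".
# This covers well-formed comments ("<!-- x -->") as well as comments
# closed incorrectly (for example "<!-- x >"), provided the comment
# body itself contains no ">" character.
def strip_comments(html)
  html.gsub(/<!--[^>]*>/, '')
end

# Made-up sample input with one malformed and one well-formed comment.
page = '<p>before</p><!-- broken comment ><p>after</p><!-- fine -->'
puts strip_comments(page)
```

The cleaned string can then be handed to the parser in place of the raw
page.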
This is a very common refrain. I think I can solve your problem with a
few lines of Ruby code, code that you can easily understand and adapt
to specific and evolving requirements. And if I cannot do this, I will
say so.
I could too, but I don’t care.
–
Paul L.
Thanks for hijacking my thread. Thanks for nothing.