Hpricot problem

nock12 · December 17, 2006, 10:56am

Sorry, try again…

Not sure where to send this, sorry if it’s not the right place…

The html in the attached file renders ‘correctly’ in the 3 browsers I
have tried but it tricks hpricot because of the second malformed
comment. When I say correctly I mean I get to see ‘Some text’. I
guess it could be argued that this is incorrect. For my application
it would be nice if hpricot behaved like a browser.

Henry

nock12 · December 17, 2006, 11:16am

Henry M. wrote:

Sorry, try again…

Not sure where to send this, sorry if it’s not the right place…

The html in the attached file renders ‘correctly’ in the 3 browsers I
have tried but it tricks hpricot because of the second malformed
comment. When I say correctly I mean I get to see ‘Some text’. I
guess it could be argued that this is incorrect. For my application
it would be nice if hpricot behaved like a browser.

You have created a new thread, and you have not attached any prior text.
This requires us to start over.

Tell us what you hoped would happen, what happened instead, and how they
differ.

If your goal is to filter particular content from HTML pages, just say so,
and be specific about what you want and don’t want. Given this information,
I will show you how to extract the desired content with a few lines of
Ruby, no fuss, no undue complexity, no Hpricot.

IIRC, you had asked for help using Hpricot to extract text between <p> and </p> tag pairs, but with the added requirement that there be an IMG tag within the <p>...</p> tag pair to validate the case. Is this still the goal? If so, how did my previously posted, simple solution work out for you?
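
As a rough illustration of what a "few lines of Ruby" approach to that task might look like (this is not the previously posted solution, which does not appear in this thread), here is a minimal sketch that returns the text of each P element containing an IMG tag. The method name `paragraphs_with_images` and the sample markup are hypothetical, and the regexps assume well-formed, non-nested paragraphs.

```ruby
# Naive sketch: text of <p>...</p> blocks that contain an <img> tag.
# Assumes well-formed, non-nested <p> elements; it is not robust
# against real-world HTML.
def paragraphs_with_images(html)
  html.scan(%r{<p\b[^>]*>(.*?)</p>}mi)               # capture each paragraph body
      .flatten                                       # one string per paragraph
      .select { |body| body =~ /<img\b/i }           # keep bodies containing an IMG tag
      .map { |body| body.gsub(/<[^>]+>/, "").strip } # drop tags, keep the text
end

html = '<p>No image</p><p><img src="a.png">Some text</p>'
p paragraphs_with_images(html)
```

A snippet like this is easy to adapt, but any deviation from clean markup (an unclosed P, a nested element with a `>` inside an attribute) can break it.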

This is a scene in a much larger play, one in which someone says, “Wow, I
had no idea there was such a powerful library, so carefully designed, so
complete. But, notwithstanding its extraordinary features, notwithstanding
the hundreds of man-hours expended creating it … I can’t get it to do
what I want.”

This is a very common refrain. I think I can solve your problem with a few
lines of Ruby code, code that you can easily understand and adapt to
specific and evolving requirements. And if I cannot do this, I will say so.

nock12 · December 17, 2006, 11:52am

Hello,

Given this information,
I will show you how to extract the desired content with a few lines of
Ruby, no fuss, no undue complexity, no Hpricot.

Why should it be complicated? What fuss? Who needs a few lines? With the
current version of hpricot this is exactly one line:

doc//p[img]//text()

This is a scene in a much larger play, one in which someone says,
“Wow, I had no idea there was such a powerful library, so carefully
designed, so complete. But, notwithstanding its extraordinary features,
notwithstanding the hundreds of man-hours expended creating it … I
can’t get it to do what I want.”

You know, software evolves. Three or four days ago the above was not
available in Hpricot, and since it was such a common query and people
requested it, voilà: now it is there.

Of course there will always be some missing features - no framework or
library can solve all the problems of all mankind - but after some time,
useful feedback (i.e. not ‘forget about every framework since you can do
it in a few lines of Ruby’ but rather feature requests, bug reports etc)
a framework can reach a maturity level where it solves most of the
problems of its users.

Btw. ever heard of ‘reinventing the wheel’?

Also your (otherwise great) code snippets always assume that the
underlying HTML is well formed, and x and y and z - which is in real
life almost never the case. Of course the posters here are not pasting
200K of HTML against which they run their production code, but a few
lines of example which is usually an oversimplification of the problem.

This is another point where such libraries are great: they handle 844747
special cases (if your case is not among them, see the current-2nd
paragraph, or add it there on your own) which is always a problematic
thing in case of hand written stuff.

I could state here 100 other points which would prove that in
production, libraries are almost always a better choice than code
hand-written on the fly - of course learning Ruby, playing with some features
etc is another thing. I am not arguing that in this case one should not
code everything on his own. However, there are some cases when people
need a stable, working solution for something and don’t want to play
around with hand coded regexps against crappy HTML. In this case, IMHO,
using a framework is absolutely OK.

Cheers,
Peter

__
http://www.rubyrailways.com

nock12 · December 17, 2006, 2:46pm

The html in the attached file renders ‘correctly’ in the 3 browsers I
have tried but it tricks hpricot because of the second malformed
comment. When I say correctly I mean I get to see ‘Some text’. I guess
it could be argued that this is incorrect.

What are you trying to do? Matching that comment? Or matching the text
‘Some text’? Which version of Hpricot do you use (svn head or 0.4)? What
exactly is the problem?

For my application it would be nice if hpricot behaved like a browser.

Well, if this is the goal, then use a browser :-). Hpricot is not a
browser and it does not try to be one.

I am working on a project with Java where we are using Mozilla/FireFox
XULRunner to parse the HTML (and to communicate with FF) and it’s
really, really robust and fast and reliable and and and. However, AFAIK
this is not doable in Ruby ATM (I would be really happy if it were,
but from what I have seen it’s not - there was an initial attempt to
implement rbXPCOM, but it was abandoned in 2001). Maybe some other
browser (Safari, Opera?) could be used.

Btw. which feature of ‘browser-like’-ness would you like to use? What
are your exact requirements?

Peter

__
http://www.rubyrailways.com

nock12 · December 17, 2006, 8:37pm

Peter S. wrote:

/ …

Btw. ever heard of ‘reinventing the wheel’?

I don’t generally reinvent the wheel until the existing wheel breaks. This
is one of those cases.

Also your (otherwise great) code snippets always assume that the
underlying HTML is well formed, and x and y and z - which is in real
life almost never the case.

Yes, true, my code is typically quite fragile and can only handle
essentially perfect HTML, and I generally offer that exact warning.
Ironically, though, in this case, my naive solution parsed the HTML that
caused Hpricot to fail.

Of course the posters here are not pasting
200K of HTML against which they run their production code, but a few
lines of example which is usually an oversimplification of the problem.

Almost always. But in this case Hpricot failed on the provided short
example, with a single deviant tag syntax.

This is another point where such libraries are great: they handle 844747
special cases (if your case is not among them, see the current-2nd
paragraph, or add it there on your own) which is always a problematic
thing in case of hand written stuff.

Absolutely. I don’t generally post my offer of a few lines of code unless
and until a library has failed. In this case, it failed.

I could state here 100 other points which would prove that in
production, libraries are almost always a better choice than code
hand-written on the fly -

Yes, unfortunately none of them would successfully answer this OP’s call
from the real world. Libraries are the obvious solution to this kind of
task. They have everything going for them, up to, but not including, the
moment when they fail to meet the user’s requirements.

I have to say that I see a lot of posts that follow this pattern. The
library seems to be able to solve any number of difficult problems except
the specific problem the user happens to be facing.

And my typical offered, simple solution is not meant to, and cannot stand in
for, the 2^32 special cases that have been laboriously programmed into the
library. It can only provide an overlooked special need that the library
cannot provide. It’s surprising to me how often this happens.

nock12 · December 18, 2006, 7:18am

On Sun, Dec 17, 2006 at 06:55:48PM +0900, Henry M. wrote:

The html in the attached file renders ‘correctly’ in the 3 browsers I
have tried but it tricks hpricot because of the second malformed
comment.

Great stuff! Thank you. This is going to be a fun one to work on, so
I’ll get back to you when I’ve got the medicine.

_why

nock12 · December 18, 2006, 7:18am

On 17/12/2006, at 11:15 PM, Paul L. wrote:

it would be nice if hpricot behaved like a browser.

Paul,

Before I address your response directly I will say that I am aware of
your crusade against html parsing libraries and while I believe you
are entitled to your opinion, I disagree with it. I have done enough
of this sort of thing to know that, for me, the level of abstraction
that these libraries give is beneficial in both development time and
maintenance. I am neither an html nuby, nor a ruby nuby. I am also
aware that my needs may not match those of someone else so I’m not
going to ram my opinions down their throat every time they ask for a
little help.

You have created a new thread, and you have not attached any prior text.
This requires us to start over.

As this is the first time I have posted on this subject, that much is
obvious. Unless I am missing something.

Tell us what you hoped would happen, what happened instead, and how they
differ.

Run the script and that too will be obvious.

If your goal is to filter particular content from HTML pages, just say so,
and be specific about what you want and don’t want. Given this information,
I will show you how to extract the desired content with a few lines of
Ruby, no fuss, no undue complexity, no Hpricot.

My goal is to highlight an issue I found with a particular library
and provide some sample code that shows the problem with the minimum
amount of code. I posted it here so that there may be some discussion
with interested people as to the desired behaviour.

IIRC, you had asked for help using Hpricot to extract text between <p> and </p> tag pairs, but with the added requirement that there be an IMG tag within the <p>...</p> tag pair to validate the case. Is this still the goal? If so, how did my previously posted, simple solution work out for you?

What IMG tag? There isn’t one in the sample code. What previous
solution? You do not recall correctly.

This is a scene in a much larger play, one in which someone says,
“Wow, I had no idea there was such a powerful library, so carefully
designed, so complete. But, notwithstanding its extraordinary features,
notwithstanding the hundreds of man-hours expended creating it … I
can’t get it to do what I want.”

The incident that prompted my post went thus…
I had a page that seemed to render fine in a browser but when parsing
it my code failed. I inspected the html and found a malformed comment
to be the problem. Probably put there to stop screen scraping. I
wrote a bit of code, using regexps no less, that removed the
offending comment and hpricot then went on its merry way. Job done.
I thought others may be interested so I posted some sample code. I am
now regretting that decision.
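
A minimal sketch of that kind of regexp workaround, for anyone curious. The attached file is not reproduced in this thread, so the malformed closer handled below (`-- >` or `--!>` instead of `-->`) is only an assumption for illustration, as are the names `LENIENT_COMMENT` and `strip_comments`.

```ruby
# Hypothetical pre-processing step: remove HTML comments before parsing.
# The closer is matched leniently ("-->", "-- >", "--!>"), which is an
# assumption about what the malformed comment might look like.
LENIENT_COMMENT = /<!--.*?--\s*!?\s*>/m

def strip_comments(html)
  html.gsub(LENIENT_COMMENT, "")
end

sample = "<p><!-- ok --><!-- odd -- >Some text</p>"
puts strip_comments(sample)
```

The cleaned string can then be handed to the parser as usual. A different kind of malformation would need a different pattern.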

This is a very common refrain. I think I can solve your problem with a few
lines of Ruby code, code that you can easily understand and adapt to
specific and evolving requirements. And if I cannot do this, I will say so.

I could too, but I don’t care.

–
Paul L.

Thanks for hijacking my thread. Thanks for nothing.

nock12 · December 18, 2006, 7:20am

On 18/12/2006, at 8:35 AM, Paul L. wrote:

except
the specific problem the user happens to be facing.

Every solution works up until the point that it doesn’t. If life
wasn’t like that we wouldn’t have much to do.

And my typical offered, simple solution is not meant to, and cannot stand in
for, the 2^32 special cases that have been laboriously programmed into the
library. It can only provide an overlooked special need that the library
cannot provide. It’s surprising to me how often this happens.

Which is why I posted my test case. To knock one more special case
off the list.

nock12 · December 18, 2006, 7:20am

On 17/12/2006, at 11:51 PM, Peter S. wrote:

Given this information,
I will show you how to extract the desired content with a few lines of
Ruby, no fuss, no undue complexity, no Hpricot.

Why should it be complicated? What fuss? Who needs a few lines? With the
current version of hpricot this is exactly one line:

doc//p[img]//text()

Maybe I’m going mad but there is no img tag in the sample code. I am
not interested in extracting anything. I know how to do that. I am
trying to highlight a problem I discovered in hpricot.

nock12 · December 18, 2006, 10:04am

On 18/12/2006, at 7:16 PM, _why wrote:

On Sun, Dec 17, 2006 at 06:55:48PM +0900, Henry M. wrote:

The html in the attached file renders ‘correctly’ in the 3 browsers I
have tried but it tricks hpricot because of the second malformed
comment.

Great stuff! Thank you. This is going to be a fun one to work on, so
I’ll get back to you when I’ve got the medicine.

It’s not a big deal. Like I said, it’s easy to work around. Just
thought you’d like to know.

nock12 · December 18, 2006, 2:03pm

Henry, there was someone just a few days ago who had a problem using
Hpricot with IMG elements in P tags. Paul must have gotten you two
confused.