I have some HTML in a string. I want to retrieve the content (which can
be any HTML with any number of tags) inside the div, from just after
the heading to the end of the div.
Example,
Tagline:
Yippee Ki Yay Mo - John 6:27
Plot Outline:
John McClane takes on an Internet-based terrorist organization who is
systematically shutting down the United States.
more
In the above example, if “Plot Outline:” is the header that I am looking
for, then the regex should give me -
John McClane takes on an Internet-based terrorist organization who is
systematically shutting down the United States. more
And if “Tagline:” is what I am looking for, then the regex should give me -
Yippee Ki Yay Mo - John 6:27
I hope the problem statement is clear.
On Thu, Feb 28, 2008 at 12:40 AM, Chirantan
[email protected] wrote:
Scraping html is not the easiest thing in the world. I would
recommend the hpricot library.
Todd
On Feb 28, 12:36 am, Chirantan [email protected] wrote:
Note that this will give spurious results if an html comment happens
to contain what you are looking for.
def find_header header, html
  # Put all of the DIVs in an array.
  divs = html.scan( %r{<div.*?>(.*?)</div>}im ).flatten
  divs.each{|s|
    # Accept any heading level; \1 makes the closing tag match the opening one.
    if s =~ %r{<h(\d)>#{header}</h\1>(.*)}im
      return $2.strip
    end
  }
  return nil
end
html = DATA.read
puts find_header( "Plot Outline:", html )
__END__
Tagline:
Yippee Ki Yay Mo - John 6:27
Plot Outline:
John McClane takes on an Internet-based terrorist organization who is
systematically shutting down the United States.
more
On Feb 28, 9:50 am, William J. [email protected] wrote:
More concise:
def find_header header, html
  html.scan( %r{<div.*?>(.*?)</div>}im ).flatten.each{|s|
    return $1.strip if s =~ %r{<h5>#{header}</h5>(.*)}im }
  return nil
end
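
The markup from the original example did not survive in this thread, so the following is only a sketch for trying the function out. The `<div class="info">` and `<h5>` structure is an assumption based on the selectors used in other replies, the sample text is shortened, and `Regexp.escape` is an addition (not in the posted code) so that the header is matched literally.

```ruby
# Sketch only: the sample markup below is assumed, not the original page.
def find_header(header, html)
  # Collect the inner HTML of every (non-nested) div.
  html.scan(%r{<div.*?>(.*?)</div>}im).flatten.each do |s|
    # Regexp.escape is an addition: it keeps characters in the header literal.
    return $1.strip if s =~ %r{<h5>#{Regexp.escape(header)}</h5>(.*)}im
  end
  nil
end

html = <<~HTML
  <div class="info">
  <h5>Tagline:</h5>
  Yippee Ki Yay Mo - John 6:27
  </div>
  <div class="info">
  <h5>Plot Outline:</h5>
  John McClane takes on an Internet-based terrorist organization.
  <a href="#">more</a>
  </div>
HTML

puts find_header("Tagline:", html)
puts find_header("Plot Outline:", html)
```

A header that is not present makes the method return `nil`, so callers should check for that before using the result.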
A regex will break too easily when parsing HTML. A real parser will do
a much better job, and often be more concise and readable, too.
This does what you want:
#-------
require 'rubygems'
require 'hpricot'
@doc = Hpricot(html) # or Hpricot(open("filename"))
def find(term)
  @doc.search("//div[@class='info']").each do |info|
    header = info.search("h5").remove
    if header.inner_text == term
      puts info.inner_html
    end
  end
end
#-------
find("Plot Outline:")
John McClane takes on an Internet-based terrorist organization who is
systematically shutting down the United States. more
Mark
On Feb 29, 1:14 am, William J. [email protected] wrote:
Thank you William and Mark,
The code worked. Thanks a lot.
On Feb 29, 2008, at 2:54 PM, Mark T. wrote:
What’s quite interesting is that I am not able to find a nice article
on why this doesn’t work. So, in short:
Regexps can only parse languages that are regular (hence the name) or,
in other words, Type-3 languages in the Chomsky hierarchy [1]. This is a
rule of thumb, because many regexp libraries nowadays implement
features that enable you to do more than formal regular expressions.
But for typical use, it is true.
Regular languages do not have any possibility to “look behind”: a
recognizer only looks forward and keeps no memory of what it has already
read beyond its current state. This is the reason why you cannot define a
regular language (and thus no regular expression) that describes or parses
an arbitrarily deeply nested structure: you have no way to determine
which closing tag matches a given opening tag.
A more abstract example:
There is no (formal) regular expression that matches a word that
consists of n times “a” and then n times “b”:
ab
aabb
aaabbb
aaaabbbb
etc.
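
As an illustration of the point that modern regexp libraries go beyond formal regular expressions: Ruby's engine (1.9 and later) supports recursive subexpression calls with `\g<name>`, which is enough to match this particular language. This is a sketch of that extension, not a regular expression in the formal sense; the constant name `AN_BN` and the group name `ab` are made up for the example.

```ruby
# A recursive pattern: an "a", optionally the whole group again, then a "b".
# The \g<ab> call is a non-regular extension of the regexp engine.
AN_BN = /\A(?<ab>a\g<ab>?b)\z/

%w[ab aabb aaabbb aab abb ba].each do |word|
  puts "#{word}: #{AN_BN.match?(word) ? 'match' : 'no match'}"
end
```

Only the balanced words match; the engine is effectively using its own call stack to count, which is exactly what a formal regular expression cannot do.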
What you can do is extract a tag, push it on a stack, extract the
next one, and so on, and pop them when encountering the matching closing
tags. Tags by themselves can be described with regexps (afaik, this is how
TextMate does its markup).
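
A rough sketch of that idea, assuming only `<div>` tags matter: a regexp finds the individual opening and closing tags, and a stack pairs each closing tag with its opener. The method name `div_contents` and the sample markup are invented for the illustration, and it ignores comments, CDATA and malformed markup, so it is not a substitute for a real parser.

```ruby
# Returns the inner HTML of every <div>, in the order the divs are closed.
def div_contents(html)
  stack = []    # offsets where the content of each currently open div starts
  results = []
  html.scan(%r{<(/?)div\b[^>]*>}i) do
    m = Regexp.last_match
    if m[1].empty?
      # Opening tag: remember where its content begins.
      stack.push(m.end(0))
    elsif (start = stack.pop)
      # Closing tag: it belongs to the most recently opened div.
      results << html[start...m.begin(0)]
    end
  end
  results
end

sample = '<div id="outer">a<div id="inner">b</div>c</div>'
p div_contents(sample)
```

The inner div is reported first because it is closed first; the outer div's content includes the whole inner div.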
Greetings
Skade
[1] Chomsky hierarchy - Wikipedia
On Fri, Feb 29, 2008 at 10:52 AM, Florian G. [email protected]
wrote:
Thank you for that great explanation! I was waiting for someone to
bring up formal grammar, but I was afraid to, because I wasn’t sure it
applied (not that familiar with how regexps actually work).
Todd
On Feb 29, 7:50 am, Mark T. [email protected] wrote:
> All the regex solutions provided will break with the following
> perfectly valid HTML:
> Tagline:
> Yippee Ki Yay Mo - John 6:27
Easily fixed.
def find_header header, html
  html.scan( %r{<div.*?>(.*?)</div\s*>}im ).flatten.
  each{|s|
    return $1.strip if s =~ %r{<h5\s*>#{header}</h5\s*>(.*)}im }
  return nil
end
> This is one of many reasons it is a BAD idea to use regexes to parse
> HTML. Regular expressions are simply not the right tool for the job.
Who told you that they are not? And why did you take his word for it?
Does hpricot use regular expressions?
On Feb 29, 10:52 am, Florian G. [email protected] wrote:
> A more abstract example:
> There is no (formal) regular expression that matches a word that
> consists of n times “a” and then n times “b”:
And that doesn’t matter much. One can use as many regular expressions
as he wishes.
"ab
xx
aabb
aaabbb
aaabb
aaaabbbb".split.each{|s|
  # The first match captures the run of a's; the second requires exactly as many b's.
  if s.match(/^(a+)/) and s.match(/^a+b{#{$1.size}}$/)
    puts s
  else
    puts '-'
  end
}
Or one can use regular expression + code:
"ab
xx
aabb
aaabbb
aaabb
aaaabbbb".split.each{|s|
  # The regexp only checks the shape; plain code compares the two lengths.
  if s.match(/^(a+)(b+)$/) and $1.size == $2.size
    puts s
  else
    puts '-'
  end
}
What makes anyone think that a single regular expression
has to do all the work?
On Fri, Feb 29, 2008 at 1:19 PM, Jari W.
[email protected] wrote:
What if you have a div inside a div? Although, the OP said “any”
legitimate html inside a div, there’s part of me that begs the
question: which div?
Todd
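
To make the nested case concrete, here is a small sketch using the `scan` pattern posted earlier in the thread. The sample markup (including the `quote` class) is invented for the illustration.

```ruby
# Sample markup is made up: an outer div that contains another div.
html = '<div class="info"><h5>Tagline:</h5><div class="quote">Yippee</div> trailing text</div>'

# The non-greedy capture stops at the first closing tag it finds,
# which here belongs to the inner div, not the outer one.
captured = html.scan(%r{<div.*?>(.*?)</div\s*>}im).flatten
p captured
```

The single capture ends at the inner closing tag, so the ` trailing text` that belongs to the outer div is never returned, and the inner div's opening tag is left dangling inside the result.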
Mark T. wrote:
> All the regex solutions provided will break with the following
> perfectly valid HTML:
> Tagline:
> Yippee Ki Yay Mo - John 6:27
> This is one of many reasons it is a BAD idea to use regexes to parse
> HTML. Regular expressions are simply not the right tool for the job.
Sorry if I’m missing the point:
the_text = %q{
Tagline:
Yippee Ki Yay Mo - John 6:27
}
the_text.each_line do |line|
  puts "Within DIV tags: #{line}" if (line=~/<div/)..(line=~/<\/div/)
  puts "Within H5 tags: #{line}" if (line=~/<h5/)..(line=~/<\/h5/)
end
Result:
Within DIV tags:
Within DIV tags:
Tagline:
Within H5 tags: Tagline:
Within DIV tags: Yippee Ki Yay Mo - John 6:27
Within DIV tags:
Best regards,
Jari W.
Todd B. wrote:
> What if you have a div inside a div? Although, the OP said “any”
> legitimate html inside a div, there’s part of me that begs the
> question: which div?
Sure, for real-life HTML with nested tags it’ll break. I just wanted to
point out that for simple parsing needs (as the example that I replied
to) regexps can find both beginning and end tags.
Best regards,
Jari W.
On Feb 29, 2008, at 8:19 PM, Jari W. wrote:
This may work on this short snippet. Consider this:
the_text = %q{
Tagline:
Yippee Ki Yay Mo - John 6:27
}
the_text.each_line do |line|
  puts "Within DIV tags: #{line}" if (line=~/<div/)..(line=~/<\/div/)
  puts "Within H5 tags: #{line}" if (line=~/<h5/)..(line=~/<\/h5/)
end
It doesn’t see the second </div>, as it considers both divs closed
(which isn’t even possible to determine, as we did not save any
state). Second question: which <div> am I in at a certain point? Or,
in other words: what’s the #innerText of .info, what’s the #innerText
of .nextinfo? You won’t get far without a stack and that can be proven
[1].
If this is of interest to you, consider reading a book about computer
theory. It may be hard stuff, but it pays off :).[2]
Greetings
Florian G.
[1] Up to the reader ;).
[2] Don’t feel bad if you didn’t and don’t consider this as an
offense. I know many good programmers that never read any theory. But
it certainly isn’t bad to know about it.
Florian G. wrote:
> It doesn’t see the second </div> as it considers both divs closed.
It considers the first div closed. It never sees the other one.
Best regards,
Jari W.
On Feb 29, 2:03 pm, William J. [email protected] wrote:
> Easily fixed.
> def find_header header, html
>   html.scan( %r{<div.*?>(.*?)</div\s*>}im ).flatten.
>   each{|s|
>     return $1.strip if s =~ %r{<h5\s*>#{header}</h5\s*>(.*)}im }
>   return nil
> end
Easily broken again.
Tagline:
Yippee Ki Yay Mo - John 6:27
The point is, regex-based parsing is fragile, and is provably
incomplete for parsing arbitrarily nested structures like HTML. A real
parser (such as a recursive descent parser) is needed. I use regular
expressions often, but when parsing HTML, XML, or other nested data, I
reach for other tools.
>> This is one of many reasons it is a BAD idea to use regexes to parse
>> HTML. Regular expressions are simply not the right tool for the job.
> Who told you that they are not? And why did you take his word for it?
Experience, for one. Until I really understood parsers, I tended to
use regular expressions for everything. I’ve been using regular
expressions for a LONG time, and I am very comfortable with them. But
parsing HTML was always troublesome.
This has been discussed for years e.g. in Perl circles (PerlMonks,
etc) where it is well known that regexes do not fit nested data.
People with questions asking how to parse HTML with a regex will get
chided, especially with so many good parsers available in Perl. There
are good parsers available in Ruby now too, so people should be
encouraged to use them.
> Does hpricot use regular expressions?
Of course not.
> What makes anyone think that a single regular expression
> has to do all the work?
I don’t know. But many think one fits. That’s why I wrote this
explanation, as it is something I see almost every day, and to give some
insight to those who are pondering why this is so.
So: your solution does not fit the problem, but thanks for showing
that another problem (parsing “a^n b^n” with a Turing-complete
language) can indeed be solved.
I also stated this in my last paragraph: you can solve the problem by
using regular expressions. But the language of regular expressions by
itself is not powerful enough to solve it alone.
Greetings
Florian