Need a regex searching html code

chirantan · February 28, 2008, 7:41am

I have some HTML in a string. I want to retrieve the content (which can
be any HTML with any number of tags) inside the div, after the heading,
up to the end of the div.

Example,

Tagline:

Yippee Ki Yay Mo - John 6:27

Plot Outline:

John McClane takes on an Internet-based terrorist organization who is systematically shutting down the United States. more

In the above example, if “Plot Outline:” is the header I am looking for,
then the regex should give me -

John McClane takes on an Internet-based terrorist organization who is
systematically shutting down the United States. more

And if “Tagline:” is what I am looking for, then the regex should give me -

Yippee Ki Yay Mo - John 6:27

I hope the problem statement is clear.

chirantan · February 28, 2008, 1:08pm

On Thu, Feb 28, 2008 at 12:40 AM, Chirantan
[email protected] wrote:

In the above example, Plot Outline is header that I am looking for

Yippee Ki Yay Mo - John 6:27

I hope the problem statement is clear.

Scraping HTML is not the easiest thing in the world. I would
recommend the Hpricot library.

Todd

chirantan · February 28, 2008, 5:15pm

On Feb 28, 12:36 am, Chirantan [email protected] wrote:

then, regex should give me -

I hope the problem statement is clear.

Note that this will give spurious results if an html comment happens
to contain what you are looking for.

def find_header header, html
  # Put all of the DIVs in an array.
  divs = html.scan( %r{<div.*?>(.*?)</div>}im ).flatten
  divs.each{|s|
    if s =~ %r{<h(\d)>#{header}</h\1>(.*)}im
      return $2.strip
    end
  }
  return nil
end

html = DATA.read

puts find_header( "Plot Outline:", html )

__END__

Tagline:

Yippee Ki Yay Mo - John 6:27

Plot Outline:

John McClane takes on an Internet-based terrorist organization who is systematically shutting down the United States. more

chirantan · February 28, 2008, 9:15pm

On Feb 28, 9:50 am, William J. [email protected] wrote:

Yippee Ki Yay Mo - John 6:27

divs.each{|s|

inline" href=“Live Free or Die Hard (2007) - Plot - IMDb”
onclick=“(new Image()).src=‘/rg/title-tease/plotsummary/images/b.gif?
link=/title/tt0337978/plotsummary’;”>more

More concise:

def find_header header, html
  html.scan( %r{<div.*?>(.*?)</div>}im ).flatten.each{|s|
    return $1.strip if s =~ %r{<h5>#{header}</h5>(.*)}im }
  return nil
end

chirantan · February 28, 2008, 7:55pm

A regex will break too easily when parsing HTML. A real parser will do
a much better job, and often be more concise and readable, too.

This does what you want:

#-------
require 'rubygems'
require 'hpricot'
@doc = Hpricot(html) # or Hpricot(open("filename"))

def find(term)
  @doc.search("//div[@class='info']").each do |info|
    header = info.search("h5").remove
    if header.inner_text == term
      puts info.inner_html
    end
  end
end
#-------

find("Plot Outline:")
John McClane takes on an Internet-based terrorist organization who is
systematically shutting down the United States. more

Mark

chirantan · February 29, 2008, 5:01am

On Feb 29, 1:14 am, William J. [email protected] wrote:

Example,
inline" href=“Live Free or Die Hard (2007) - Plot - IMDb”
onclick="(new Image()).src='/rg/title-tease/plotsummary/images/b.gif?

def find_header header, html
  html.scan( %r{<div.*?>(.*?)</div>}im ).flatten.each{|s|
    return $1.strip if s =~ %r{<h5>#{header}</h5>(.*)}im }
  return nil
end

Thank you William and Mark,

The code worked. Thanks a lot.

chirantan · February 29, 2008, 5:53pm

On Feb 29, 2008, at 2:54 PM, Mark T. wrote:

What’s quite interesting is that I have not been able to find a nice
article on why this doesn’t work. So, in short:

Regexps can only parse languages that are regular (hence the name) or,
in other words, Type 3 languages in the Chomsky hierarchy [1]. This is a
rule of thumb, because many regexp libraries nowadays implement
features that let you do more than formal regular expressions.
But for typical use, it is true.

Regular languages do not have any possibility to “look behind”. They
only look forward. This is the reason why you cannot define a regular
language (and thus no regular expression) to describe and parse an
arbitrarily deep nested structure: you have no way to determine which
closing tag matches a given opening tag.

A more abstract example:
There is no (formal) regular expression that matches a word that
consists of n times “a” and then n times “b”:

ab
aabb
aaabbb
aaaabbbb
etc.
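
Ruby's own regexp engine is one of the libraries that goes beyond formal regular expressions. As a minimal sketch (assuming Ruby 1.9 or later, where a named group can call itself with `\g<name>`), the pattern below accepts exactly the words listed above. The names `BALANCED_AB` and `balanced_ab?` are made up for this illustration and do not come from the thread.

```ruby
# A named group that calls itself recursively: one "a", optionally a
# nested a...b word, then one "b". This relies on subexpression calls,
# an extension that formal regular expressions do not have.
BALANCED_AB = /\A(?<ab>a\g<ab>?b)\z/

# Returns true only if word is n times "a" followed by n times "b" (n >= 1).
def balanced_ab?(word)
  !BALANCED_AB.match(word).nil?
end

%w[ab aabb aaabbb aaabb xx].each do |word|
  puts "#{word}: #{balanced_ab?(word)}"
end
```

This does not contradict the argument: the pattern only works because the engine keeps a call stack behind the scenes, which is exactly what a formal regular expression lacks.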

What you can do is extract a tag, push it on a stack, extract the
next one, etc., and pop them when encountering matching closing tags.
Tags by themselves can be described with regexps (afaik, this is how
TextMate does its markup).
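
The stack idea can be sketched roughly as follows. This is an illustration under simplifying assumptions, not a real HTML parser: it only looks at div tags and ignores comments, scripts and attribute values that contain ">". The names `DIV_TAG` and `div_contents` are hypothetical.

```ruby
# A regexp is enough to recognise a single opening or closing div tag.
# Group 1 is "/" for a closing tag and empty for an opening tag.
DIV_TAG = %r{<(/?)div\b[^>]*>}i

# Returns the inner HTML of every div, in the order the divs are closed,
# so a nested div comes before the div that contains it.
def div_contents(html)
  stack = []     # offsets at which the content of each open div starts
  contents = []
  html.scan(DIV_TAG) do
    m = Regexp.last_match
    if m[1].empty?
      stack.push(m.end(0))                   # opening tag: remember where content begins
    elsif (start = stack.pop)
      contents << html[start...m.begin(0)]   # closing tag: pair it with the latest open div
    end
  end
  contents
end

html = '<div class="info"><h5>Tagline:</h5><div>Yippee</div> more</div>'
div_contents(html).each { |inner| puts inner }
```

The regexp does the tokenizing and the stack does the matching, so the nested div no longer confuses the outer one.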

Greetings
Skade

[1] Chomsky hierarchy - Wikipedia

chirantan · February 29, 2008, 6:36pm

On Fri, Feb 29, 2008 at 10:52 AM, Florian G. [email protected]
wrote:

in other words - a Type 3-language in the Chomsky hierarchy [1]. This
expression):
aaabbb
Greetings
Skade

[1] Chomsky hierarchy - Wikipedia

Thank you for that great explanation! I was waiting for someone to
bring up formal grammar, but I was afraid to, because I wasn’t sure it
applied (not that familiar with how regexps actually work).

Todd

chirantan · February 29, 2008, 8:06pm

On Feb 29, 7:50 am, Mark T. [email protected] wrote:

All the regex solutions provided will break with the following
perfectly valid HTML:

Tagline:
Yippee Ki Yay Mo - John 6:27

Easily fixed.

def find_header header, html
  html.scan( %r{<div.*?>(.*?)</div\s*>}im ).flatten.
  each{|s|
    return $1.strip if s =~ %r{<h5\s*>#{header}</h5\s*>(.*)}im }
  return nil
end

This is one of many reasons it is a BAD idea to use regexes to parse
HTML. Regular expressions are simply not the right tool for the job.

Who told you that they are not? And why did you take his word for it?
Does hpricot use regular expressions?

chirantan · February 29, 2008, 2:56pm

All the regex solutions provided will break with the following
perfectly valid HTML:

Tagline:

Yippee Ki Yay Mo - John 6:27

This is one of many reasons it is a BAD idea to use regexes to parse
HTML. Regular expressions are simply not the right tool for the job.

chirantan · February 29, 2008, 8:16pm

On Feb 29, 10:52 am, Florian G. [email protected] wrote:

This is one of many reasons it is a BAD idea to use regexes to parse
features that enable you to do more than formal regular expressions.

A more abstract example:
There is no (formal) regular expression that matches a word that
consists
of n times “a” and then n times “b”:

And that doesn’t matter much. One can use as many regular expressions
as he wishes.

ab
aabb
aaabbb
aaaabbbb
etc.

"ab
xx
aabb
aaabbb
aaabb
aaaabbbb".split.each{|s|
  if s.match(/^(a+)/) and s.match(/^a+b{#{$1.size}}$/)
    puts s
  else
    puts '-'
  end
}

Or one can use regular expression + code:

"ab
xx
aabb
aaabbb
aaabb
aaaabbbb".split.each{|s|
  if s.match(/^(a+)(b+)$/) and $1.size == $2.size
    puts s
  else
    puts '-'
  end
}

What makes anyone think that a single regular expression
has to do all the work?

chirantan · February 29, 2008, 8:29pm

On Fri, Feb 29, 2008 at 1:19 PM, Jari W.
[email protected] wrote:

This is one of many reasons it is a BAD idea to use regexes to parse
}
Within H5 tags:

Tagline:

Within DIV tags: Yippee Ki Yay Mo - John 6:27
Within DIV tags:

What if you have a div inside a div? Although, the OP said “any”
legitimate html inside a div, there’s part of me that begs the
question: which div?

Todd

chirantan · February 29, 2008, 8:20pm

Mark T. wrote:

All the regex solutions provided will break with the following
perfectly valid HTML:

Tagline:
Yippee Ki Yay Mo - John 6:27

This is one of many reasons it is a BAD idea to use regexes to parse
HTML. Regular expressions are simply not the right tool for the job.

Sorry if I’m missing the point:

the_text = %q{

Tagline:

Yippee Ki Yay Mo - John 6:27

}

the_text.each_line do |line|
  puts "Within DIV tags: #{line}" if (line=~/<div/)...(line=~/<\/div/)
  puts "Within H5 tags: #{line}" if (line=~/<h5/)...(line=~/<\/h5/)
end

Result:
Within DIV tags:

Within DIV tags:

Tagline:

Within H5 tags:

Tagline:

Within DIV tags: Yippee Ki Yay Mo - John 6:27
Within DIV tags:

Best regards,

Jari W.

chirantan · February 29, 2008, 8:37pm

Todd B. wrote:

Within DIV tags:
Tagline:
Within H5 tags:
Tagline:
Within DIV tags: Yippee Ki Yay Mo - John 6:27 Within DIV tags:

What if you have a div inside a div? Although, the OP said “any”
legitimate html inside a div, there’s part of me that begs the
question: which div?

Sure, for real-life HTML with nested tags it’ll break. I just wanted to
point out that for simple parsing needs (as the example that I replied
to) regexps can find both beginning and end tags.

Best regards,

Jari W.

chirantan · February 29, 2008, 8:34pm

On Feb 29, 2008, at 8:19 PM, Jari W. wrote:

Sorry if I’m missing the point:
puts "Within H5 tags: #{line}" if (line=~/<h5/)...(line=~/<\/h5/)

Best regards,

Jari W.

This may work on this short snippet. Consider this:

the_text = %q{

Tagline:

Yippee Ki Yay Mo - John 6:27

}

the_text.each_line do |line|
  puts "Within DIV tags: #{line}" if (line=~/<div/)...(line=~/<\/div/)
  puts "Within H5 tags: #{line}" if (line=~/<h5/)...(line=~/<\/h5/)
end

It doesn’t see the second div, as it considers both divs closed
(which isn’t even possible to determine, as we did not save any
state). Second question: which div am I in at a certain point? Or,
in other words: what’s the #innerText of .info, what’s the #innerText
of .nextinfo? You won’t get far without a stack, and that can be proven
[1].
If this is of interest to you, consider reading a book about
theoretical computer science. It may be hard stuff, but it pays off :).[2]

Greetings
Florian G.

[1] Up to the reader ;).
[2] Don’t feel bad if you didn’t and don’t consider this as an
offense. I know many good programmers that never read any theory. But
it certainly isn’t bad to know about it.

chirantan · February 29, 2008, 8:42pm

Florian G. wrote:

This is one of many reasons it is a BAD idea to use regexes to parse

Within DIV tags: Yippee Ki Yay Mo - John 6:27

puts "Within DIV tags: #{line}" if (line=~/<div/)...(line=~/<\/div/)
puts "Within H5 tags: #{line}" if (line=~/<h5/)...(line=~/<\/h5/)
end

It doesn’t see the second as it considers both divs closed.

It considers the first div closed. It never sees the other one.

Best regards,

Jari W.

chirantan · February 29, 2008, 10:20pm

On Feb 29, 2:03 pm, William J. [email protected] wrote:

Easily fixed.
def find_header header, html
  html.scan( %r{<div.*?>(.*?)</div\s*>}im ).flatten.
  each{|s|
    return $1.strip if s =~ %r{<h5\s*>#{header}</h5\s*>(.*)}im }
  return nil
end

Easily broken again.

Tagline:

Yippee Ki Yay Mo - John 6:27

The point is, regex-based parsing is fragile, and is provably
incomplete for parsing arbitrarily nested structures like HTML. A real
parser (such as a recursive descent parser) is needed. I use regular
expressions often, but when parsing HTML, XML, or other nested data, I
reach for other tools.
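
To make the fragility concrete, here is a small sketch. The markup of the counterexample did not survive in this copy of the thread, so the `nested` string below is a hypothetical input (a div nested inside the info div), and the quantifiers in the patterns are reconstructed as `.*?`. `Regexp.escape` is added only so the header text is matched literally.

```ruby
# The "easily fixed" version from above, with the quantifiers restored
# (an assumption; they were lost in this copy of the thread).
def find_header(header, html)
  html.scan(%r{<div.*?>(.*?)</div\s*>}im).flatten.each do |s|
    return $1.strip if s =~ %r{<h5\s*>#{Regexp.escape(header)}</h5\s*>(.*)}im
  end
  nil
end

# Hypothetical input: the tagline text sits in a div nested inside the outer div.
nested = '<div class="info"><h5>Tagline:</h5><div>Yippee Ki Yay</div> more</div>'

# The lazy (.*?) stops at the FIRST closing div tag, which belongs to the
# inner div, so the result is cut short.
p find_header("Tagline:", nested)
```

The call returns "<div>Yippee Ki Yay": the inner div is left unclosed and the trailing " more" is lost, because nothing in the pattern keeps track of how many divs are open.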

This is one of many reasons it is a BAD idea to use regexes to parse
HTML. Regular expressions are simply not the right tool for the job.

Who told you that they are not? And why did you take his word for it?

Experience, for one. Until I really understood parsers, I tended to
use regular expressions for everything. I’ve been using regular
expressions for a LONG time, and I am very comfortable with them. But
parsing HTML was always troublesome.

This has been discussed for years e.g. in Perl circles (PerlMonks,
etc) where it is well known that regexes do not fit nested data.
People with questions asking how to parse HTML with a regex will get
chided, especially with so many good parsers available in Perl. There
are good parsers available in Ruby now too, so people should be
encouraged to use them.

Does hpricot use regular expressions?

Of course not.

chirantan · February 29, 2008, 8:46pm

aaabbb

end
}

What makes anyone think that a single regular expression
has to do all the work?

I don’t know. But many think one fits. That’s why I wrote this
explanation: it is something I see almost every day, and I wanted to
give some insight to those who are wondering why this is so.
So: your solution does not fit the problem, but thanks for showing
that another problem (parsing “a^n b^n” with a Turing-complete
language) can indeed be solved.

I also stated this in my last paragraph: you can solve the problem by
using regular expressions. But the language of regular expressions by
itself is not powerful enough to solve it alone.

Greetings
Florian
