Parsing through downloaded html

Hi all,

I’ve collected several thousand .html documents and I need to know how
to parse through all of them (they are in one folder) automatically.

So, I want to copy certain parts of all of these .html documents (for
example the header), but the websites are offline, on my hard disk,
instead of online.

What’s the way to go?

http://nokogiri.org/ is great for this. You need to parse HTML, so look
at the tutorial on their site:
Parsing an HTML/XML document - Nokogiri

2012/9/6 Sybren K. [email protected]

Thank you Ivan.

I am familiar with nokogiri (and open-uri), but have only used it to
download online websites. I am a complete newb, so I was curious how to
start:

  1. How to automatically open (parse) all the .htmls in a folder, one by
    one?
  2. When one of these files is opened, how to parse through it, copy
    certain parts to an excel like document, and close it again?

On Thu, Sep 6, 2012 at 11:30 AM, Sybren K. [email protected]
wrote:

  1. How to automatically open (parse) all the .htmls in a folder, one by
    one?

Dir['*.html'].each do |file|
  # work with each file here
end

  2. When one of these files is opened, how to parse through it, copy
    certain parts to an excel like document, and close it again?

You can parse through it with Nokogiri and extract the parts you want
using Nokogiri’s methods to search for content inside HTML. If you want
to write an Excel file you will need to use an Excel API (no idea), or
if you can generate a CSV you can use the stdlib CSV class.

Jesus.
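
For the CSV route, a minimal sketch using only the standard library
might look like the following. The helper name html_folder_to_csv and
the idea of passing the extraction step as a block are assumptions made
for illustration, not something from this thread; the block is where a
Nokogiri lookup would go.

```ruby
require 'csv'

# Hypothetical helper: writes one CSV row per .html file found in +folder+.
# The block receives the raw HTML of each file and must return an array of
# cell values; the file name is stored in the first column.
def html_folder_to_csv(folder, csv_path)
  CSV.open(csv_path, 'w') do |csv|
    # Sort so that the row order is the same on every run.
    Dir.glob(File.join(folder, '*.html')).sort.each do |path|
      html = File.read(path)
      csv << [File.basename(path), *yield(html)]
    end
  end
end
```

With Nokogiri loaded, the block could be something like
{ |html| [Nokogiri::HTML(html).at_xpath('//h1')&.text] }, which leaves
the cell empty when a document has no matching element.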

It’s OK if you are new to Ruby, but don’t be so passive.
You are not the first person looking for an Excel library (in Ruby terms
it’s called a gem). Look for that first.
Once you find some libraries, pick one and search Google for usage
examples, or go to its site, where you will find support or
documentation.
That’s how all of us work.

2012/9/6 Sybren K. [email protected]

Thanks Jesus.

So something like:

require 'nokogiri'
Dir['*.html'].each do |file|
  document = Nokogiri::HTML(File.read(file))
  variable = document.xpath("//div/h2")
  # [and something to PUT the variable in a specified Excel column/CSV file]
end

??

(again, I am a newb =))

I’ve done a similar job recently. I’ve copied part of the code, so look
at it; maybe it could help you:
require 'nokogiri'
require 'spreadsheet'

class XmlParsing
  def test
    row = 0
    column = 0
    book = Spreadsheet::Workbook.new
    sheet = book.create_worksheet

    Dir.chdir("xml")
    puts Dir.pwd
    Dir.glob("*.xml") do |file|
      f = File.open(file)
      doc = Nokogiri::XML(f)
      fullname_node = doc.at_xpath("//full_name")
      sheet[row, column] = fullname_node.content
      f.close
      row += 1
    end
    book.write 'spreadsheet.xls'
  end
end

xml = XmlParsing.new
xml.test

I’m not sure if it works, because I deleted some code from the original
script, but this is the general idea. The script parses an XML document
and puts the information into Excel.

You also need the spreadsheet gem. You can install it with:
sudo gem install spreadsheet

Or there is a better way to do it. Make a file named Gemfile and put
this content into it:
source 'https://rubygems.org'

gem 'nokogiri'
gem 'spreadsheet'

Save the file, go to the console, cd to the directory where the Gemfile
is located, and run:
bundle install

This is the preferred way to go, because there is no need to install
gems one by one manually. Instead, when you want to run your software on
a different machine, you just invoke bundle install.

Hope it helps.

2012/9/6 Sybren K. [email protected]

Hi Ivan, thanks.

I’m doing my best here, and I do search before I post. I was just
checking if I was working in the right direction. Excel library is step
two.

On 09/07/2012 03:29 AM, Michelle C. wrote:

Ruby is a nice language for beginners. I have been a user for 1 day now
and I have learned so much from everyone! Try to avoid saying that you
are new. This is due to a user named Ryan Daves<sp?>. He talks very
negatively to users who do that. Also, another Ruby forum you might want
to reference is at tek-tips.com. The users are helpful there as well.
Good luck. -Michelle

As someone commenting on other people’s lack of social skills, you could
do far worse than trying to improve your own, badmouthing Ryan like that
in a topic that is totally unrelated to the one where you took offence
to his manner of replying.

You need only look through a couple of days’ worth of older messages
here to see that you are drawing totally incorrect conclusions based on
a single post of his. For your information, Ryan D. is among the most
helpful and knowledgeable people around here.

If you have issues with his reply to you, I suggest you take it up with
him within that very thread, preferably in a more civilized manner.
Thank you.

In my approach for help on this forum, I tried to be as detailed as
possible to get help. I even explained that I am a beginner and the
reply that I received contained false accusations. If the user was
expecting a snippet, I would have sent it. Otherwise, why did he even
reply if he didn’t know how to reply? A logical reply would have been,
“Will you send the code you have so far?”

In my 1-day experience as a user on this ruby forum, I have discovered
that if a user says that they are a “newb” or a “beginner” then they may
be verbally attacked or told they are being “passive”.
The intent of a forum is for other users to help or advise a user in the
subject matter; it is not to decide if the user is being “passive” in
their approach.

Perhaps there is a need to focus on helping users with ruby programming
logic when the user requests help instead of replying to the user by
stating that they are being “passive” as discovered in this thread or
making false accusations.

-Michelle

Good stuff! This ran nicely, with the exception of the “content” call
raising a nil exception, though my XML file was not compliant with XML
rules.
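
The nil exception mentioned here is what happens when at_xpath finds no
match: it returns nil, and calling content on nil raises NoMethodError.
A small guard along these lines might avoid it. The helper name
text_or_blank and its fallback parameter are made up for this sketch.

```ruby
# Hypothetical helper: returns the node's text with surrounding whitespace
# removed, or +fallback+ when the lookup found nothing (node is nil).
def text_or_blank(node, fallback = '')
  node.nil? ? fallback : node.content.strip
end

# Intended use inside the loop from the script above (not run here):
#   sheet[row, column] = text_or_blank(doc.at_xpath("//full_name"))
```

A document without the element then produces an empty cell instead of
stopping the whole run.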


Posted via http://www.ruby-forum.com/.

Since I was directly called out for the word “passive”, I have to
answer :)
I think that sending a solution with source code to an inexperienced
Ruby programmer is not good for him. It’s better to teach him how to
search the Internet and find a solution by reading documentation, since
that is the most important skill of every programmer. I gave Sybren K.
a solution, but also advice on the best way to proceed next time.

On 07.09.2012 09:28, Lars H. wrote:

You need only look through a couple of days’ worth of older messages
here to see that you are drawing totally incorrect conclusions based on
a single post of his. For your information, Ryan D. is among the most
helpful and knowledgeable people around here.

I’ll second that wholeheartedly!

@Michelle: Please don’t take it too personally. Questions like
“How can I write a program that achieves world peace”
(I exaggerate a little…) without giving any information
on a specific problem you have with Ruby or on your prior knowledge
in programming are nearly impossible to answer adequately.

Still, you will find that questions of this kind are answered here more
often than not (at least people try…)

Ryan’s answer may have been a little too rude, but your reply
to him and your subsequent behaviour even more so.

Okay guys, as helpful as Ryan can be in a good mood, and as awesome a
programmer as he is, you can’t deny he tends to be a dick ;). (No
offense, Ryan.)

– Matma R.

2012/9/7 Ryan D. [email protected]:

No offense taken. I would argue that I don’t tend to be a dick, but
that I think we need to do more to manage the signal:noise on this list,
lest we lose the ability to help anyone at all. That is probably often
perceived as me being a dick. You barely interact with me on a day to
day basis so you’re only seeing one aspect of me. Also see my talk I
gave at Cascadia Ruby where I addressed this specifically:

Occupy Ruby: why we need to moderate the 1%:

http://zenspider.com/presentations/2012-cascadia.html

Interesting talk, haven’t seen it before. Maybe I judged you too
harshly; personally I react differently – I tend to just archive the
mail and move on, responding to people with interesting or non-trivial
problems, and rarely taking part in anything “community”.

– Matma R.

Well, this topic took an interesting turn =)

Ivan, thanks for the ‘spreadsheet’ tip + code. It got me a lot further,
but I’m still running into some walls. Mostly, at the moment I need to
know how to specify the column and row for variables, so that for every
next document I parse, the variables are put in the same columns but in
the next row.

So, with column A and column B:
first document: variable 1 = column A, row 1 | variable 2 = column B,
row 1
second document: variable 1 = column A, row 2 | variable 2 = column B,
row 2
et cetera.

the code so far:

First the basic code, including the opening of a new spreadsheet:

require 'nokogiri'
require 'spreadsheet'
Spreadsheet.client_encoding = 'UTF-8'
book = Spreadsheet::Workbook.new
sheet1 = book.create_worksheet

Now to parse through all downloaded .htmls:

Dir.chdir("anattempt")
Dir.glob('*.html').each do |document|
  f = File.open(document)
  searchablefile = Nokogiri::HTML(f)
  variabelebasedonaxpath = searchablefile.xpath("//h1[contains(text(), 'Harbers')]")

Now to save the variable(s) in the spreadsheet (…but how to?)

  row = ? (push ?)
  column = ? (column.push ?)
end

book.write 'htmltoexcel.xls'

On Sun, Sep 9, 2012 at 2:31 PM, Иван Бишевац [email protected]
wrote:

Dir.chdir("anattempt")
Dir.glob('*.html').each do |document|

Another possibility would be to use each_with_index, which yields a
zero-based index that you can use as the row:

Dir.glob('*.html').each_with_index do |document, row|

Jesus.
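
To make the row/column bookkeeping concrete, here is a rough sketch of
how each_with_index lines things up. A plain Hash stands in for the
worksheet so the snippet runs without any gems; the method name
fill_cells and the sample values are invented for illustration.

```ruby
# Hypothetical helper: +extracted+ holds one [var1, var2] pair per parsed
# document. Returns a Hash keyed by [row, column], mimicking the way a
# worksheet is addressed (with the spreadsheet gem: sheet1[row, 0] = var1).
def fill_cells(extracted)
  cells = {}
  extracted.each_with_index do |(var1, var2), row|
    cells[[row, 0]] = var1 # always column A
    cells[[row, 1]] = var2 # always column B
  end
  cells
end

# Made-up sample values, one pair per document.
sample = [
  ['title of document 1', 'author of document 1'],
  ['title of document 2', 'author of document 2']
]
fill_cells(sample).each do |(row, column), value|
  puts "row #{row}, column #{column}: #{value}"
end
```

The row index comes from the position of the document in the list, so no
separate counter needs to be incremented by hand.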

require 'nokogiri'
require 'spreadsheet'

Spreadsheet.client_encoding = 'UTF-8'
book = Spreadsheet::Workbook.new
sheet1 = book.create_worksheet

# Numbering is zero-based: the first row is 0 and the first column is 0.
row = 0

Dir.chdir("anattempt")
Dir.glob('*.html').each do |document|
  f = File.open(document)
  searchablefile = Nokogiri::HTML(f)
  f.close

  # Use at_xpath rather than xpath: at_xpath returns just the first matching
  # element, while xpath returns a NodeSet of all elements matching the criteria.
  var1 = searchablefile.at_xpath("your xpath here…")
  var2 = searchablefile.at_xpath("your xpath here…")

  # The first pass saves data to the first row, in columns A and B.
  # Every next pass increments the row by 1, but the columns stay A and B.
  sheet1[row, 0] = var1.content
  sheet1[row, 1] = var2.content

  # After saving the data, increment the row position by 1.
  row += 1
end

book.write 'htmltoexcel.xls'

I haven’t tested this, but if something goes wrong, ask here.
Also read http://nokogiri.org/tutorials to learn how to parse XML/HTML
documents; it’s a short but useful resource.

Post the full error here.

2012/9/10 Sybren K. [email protected]