I’ve collected several thousand .html documents and I need to
know how to parse all of these documents (which are in one folder)
automatically.
So, I want to copy certain parts of all of these .html documents (for
example the header), but the websites are offline, on my hard disk,
instead of online.
I am familiar with nokogiri (and open-uri), but have only used it to
download online websites. I am a complete newb, so I was curious how to
start:
How to automatically open (parse) all the .htmls in a folder, one by
one?
Dir['*.html'].each do |file|
…
end
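To make the loop above a bit more concrete, here is a minimal sketch that walks one folder and hands each file's contents to a block. The helper names html_files and each_html_file are made up for illustration, and reading the files as UTF-8 is an assumption about how the pages were saved.

```ruby
# Hypothetical helper: list every .html file in a folder, in a stable order.
def html_files(folder)
  Dir.glob(File.join(folder, '*.html')).sort
end

# Hypothetical helper: open each file in turn and yield its path and contents.
# The block form of File.open closes the file again automatically.
def each_html_file(folder)
  html_files(folder).each do |path|
    File.open(path, 'r:UTF-8') do |io|  # UTF-8 is an assumption
      yield path, io.read
    end
  end
end
```

It could be called as each_html_file('pages') { |path, html| puts path }, where 'pages' stands for whatever folder holds the saved documents.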
When one of these files is opened, how to parse through it, copy
certain parts to an excel like document, and close it again?
You can parse it with Nokogiri and extract the parts you want
using Nokogiri’s methods for searching inside HTML. If you want
to write an Excel file you will need an Excel library (I don’t know of one), or
if you can generate a CSV you can use the stdlib CSV class.
It’s OK if you are new to Ruby, but don’t be so passive.
You are not the first person looking for an Excel library (in Ruby terms it’s
called a gem). Look for that first.
After you find these libraries, pick one and search Google for usage examples, or
go to its site, where you will find support or documentation.
That’s how all of us work.
require 'nokogiri'
Dir['*.html'].each do |file|
  document = Nokogiri::HTML(File.read(file))
  variable = document.xpath("//div/h2")
  # [and something to PUT the variable in a specified Excel column/CSV file]
end
I’ve done a similar job recently. I’ve copied part of the code, so take a look;
maybe it will help you:
require 'nokogiri'
require 'spreadsheet'

class XmlParsing
  def test
    row = 0
    column = 0
    book = Spreadsheet::Workbook.new
    sheet = book.create_worksheet
    Dir.chdir("xml")
    puts Dir.pwd
    Dir.glob("*.xml") do |file|
      f = File.open(file)
      doc = Nokogiri::XML(f)
      # at_xpath returns nil when nothing matches, so .content can fail here
      fullname_node = doc.at_xpath("//full_name")
      sheet[row, column] = fullname_node.content
      f.close
      row += 1
    end
    book.write 'spreadsheet.xls'
  end
end

xml = XmlParsing.new
xml.test
I’m not sure if it works, because I deleted some code from the original
script, but this is the general idea. The script parses each XML document and puts
the information into an Excel file.
You also need the spreadsheet gem. You can install it with:
sudo gem install spreadsheet
Or there is a better way to do it. Create a file named Gemfile and put
this content into it:
source 'https://rubygems.org'
gem 'nokogiri'
gem 'spreadsheet'
Save the file, go to the console, cd to the directory where the Gemfile is located, and run:
bundle install
This is the preferred way, because there is no need to install gems one
by one manually. When you want to run your software on a different
machine, you just invoke bundle install.
Ruby is a nice language for beginners. I have been a user for 1 day now and I
have learned so much from everyone! Try to avoid saying that you are
new. This is due to a user named Ryan Daves<sp?>. He talks very negatively
to users who do that. Another Ruby forum you might want to
look at is tek-tips.com. The users are helpful there as well. Good
luck. -Michelle
As someone commenting on other people’s lack of social skills, you could
do far worse than trying to improve your own, badmouthing Ryan like that
in a topic that is totally unrelated to the one where you took offence
to his manner of replying.
You need only look through a couple of days’ worth of older messages
here to see that you are drawing totally incorrect conclusions based on
a single post of his. For your information, Ryan D. is among the most
helpful and knowledgeable people around here.
If you have issues with his reply to you, I suggest you take it up with
him within that very thread, preferably in a more civilized manner.
Thank you.
In my approach for help on this forum, I tried to be as detailed as
possible to get help. I even explained that I am a beginner and the
reply that I received contained false accusations. If the user was
expecting a snippet, I would have sent it. Otherwise, why did he even
reply if he didn’t know how to reply? A logical reply would have been,
“Will you send the code you have so far?”
In my 1-day experience as a user on this ruby forum, I have discovered
that if a user says that they are a “newb” or a “beginner” then they may
be verbally attacked or told they are being “passive”.
The intent of a forum is for other users to help or advise a user in the
subject matter; it is not to decide if the user is being “passive” in
their approach.
Perhaps there is a need to focus on helping users with ruby programming
logic when the user requests help instead of replying to the user by
stating that they are being “passive” as discovered in this thread or
making false accusations.
Good stuff! The spreadsheet script ran nicely, except that the call to
“content” raised a nil exception, but my XML file was not compliant with XML
rules.
Since I was called out directly for the word “passive”, I have to answer.
I think that sending a solution with source code to an inexperienced Ruby
programmer is not good for him. It’s better to teach him how to search
the Internet and find a solution by reading documentation, since that is the most
important skill of every programmer. I gave a solution to Sybren K., but also
advice on the best way to proceed next time.
You need only look through a couple of days’ worth of older messages
here to see that you are drawing totally incorrect conclusions based on
a single post of his. For your information, Ryan D. is among the most
helpful and knowledgeable people around here.
I’ll second that wholeheartedly!
@Michelle: Please don’t take it too personally. Questions like
“How can I write a program that achieves world peace”
(I exaggerate a little…) that give no information
about the specific problem you have with Ruby or about your prior knowledge
of programming are nearly impossible to answer adequately.
Still, you will find that questions of this kind are answered here
more often than not (at least people try…)
Ryan’s answer may have been a little too rude, but your reply
to him and your subsequent behaviour were even more so.
Okay guys, as helpful as Ryan can be in a good mood, and as awesome a
programmer as he is, you can’t deny he tends to be a dick ;). (No
offense, Ryan.)
I would argue that I don’t tend to be a dick, but that I think we need to do
more to manage the signal:noise on this list, lest we lose the ability to help
anyone at all. That is probably often perceived as me being a dick. You barely
interact with me on a day to day basis so you’re only seeing one aspect of me.
Also see my talk I gave at Cascadia Ruby where I addressed this specifically:
Interesting talk, I hadn’t seen it before. Maybe I judged you too
harshly. Personally I react differently: I tend to just archive the
mail and move on, responding to people with interesting or non-trivial
problems, and rarely taking part in anything “community”.
On Sep 7, 2012, at 10:24, Bartosz Dziewoński wrote:
Okay guys, as helpful as Ryan can be in a good mood, and as awesome a
programmer as he is, you can’t deny he tends to be a dick ;). (No
offense, Ryan.)
No offense taken. I would argue that I don’t tend to be a dick, but
that I think we need to do more to manage the signal:noise on this list,
lest we lose the ability to help anyone at all. That is probably often
perceived as me being a dick. You barely interact with me on a day to
day basis so you’re only seeing one aspect of me. Also see my talk I
gave at Cascadia Ruby where I addressed this specifically:
Ivan, thanks for the ‘spreadsheet’ tip + code. It got me a lot further,
but I’m still running into some walls. Mostly, at the moment I need to
know how to specify the column and row for variables, so that for
every next document I parse, the variables are put in the same
columns, but in the next row.
so: column A, column B
first document: variable 1 = column A, row 1 | variable 2 = column B, row 1
second document: variable 1 = column A, row 2 | variable 2 = column B, row 2
etcetera.
the code so far:
First the basic code, including the opening of a new spreadsheet:
#After saving data increment row position by 1
row += 1
end
book.write 'htmltoexcel.xls'
I haven’t tested this, but if something goes wrong, ask here.
Also read http://nokogiri.org/tutorials to learn how to parse
xml/html documents; it’s a short but useful resource.