I’m working on a script that examines a DITA XML file and tries to
determine where we use conrefs (where content is being pulled in from
another file). I have most of the code working, but I’m now trying to
determine what type of element a conref points to.
All XML tags have ID numbers, for example on the element that holds
the text “this is a paragraph”.
If I need to reference a list item in another document, for example,
the id number is used to pull that data into the other document. What
the script is trying to accomplish is to create a list of the conrefs
in each file and report on them.
It’s easy enough to determine whether a conref is in a file and then
open the referenced document to get its title. But what is killing me
is trying to determine what type of element is being referenced. For
example, all I know is that I’m looking for ‘a4563’, which is easy
enough to find via a .match, but what I really want to know is which
element that id number is part of, whether the id is ‘a4563’ or
‘a124’.
I suspect that I’ll need to do some regex groupings, but my regex-fu in
this area is very weak!
Anybody have some suggestions?
Thanks,
Wayne
I haven’t looked into this in detail, but I’ve cobbled together this
example of how you could get started:
Wow, that would do it, since I have the id. I gotta work on my regex more!
Thanks Joel!
Wayne
Agreed, Nokogiri is a much better solution.
I agree too. Thanks for the head slap, Peter!
Well this works
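require 'rubygems'
require 'nokogiri'

# Sample markup: the original tags were lost, so the <p> element is an assumption.
xml = '
<p id="b234567">this is a paragraph</p>
'
doc = Nokogiri::XML(xml)

# Find the first element, of any type, whose id attribute matches.
node = doc.search("//*[@id='b234567']").first
puts node       # the whole element
puts node.name  # just the element type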
Once you have the node then the name method will tell you the element
type.
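To tie this back to the original question (an id in hand, element type wanted), here is a minimal sketch of the lookup wrapped in a helper. It uses REXML from Ruby's standard distribution instead of Nokogiri purely so it runs without installing a gem; the same XPath works with either. The sample markup, the ids in it, and the helper name element_type_for are all made up for illustration.

```ruby
require 'rexml/document'

# Hypothetical target file content; element names and ids are illustrative only.
TARGET_XML = <<~XML
  <topic id="t1">
    <body>
      <p id="a4563">this is a paragraph</p>
      <ul id="u1">
        <li id="a124">a list item</li>
      </ul>
    </body>
  </topic>
XML

# Return the element name for the given id, or nil when no element carries it.
def element_type_for(doc, id)
  node = REXML::XPath.first(doc, "//*[@id='#{id}']")
  node && node.name
end

doc = REXML::Document.new(TARGET_XML)
puts element_type_for(doc, 'a4563')  # p
puts element_type_for(doc, 'a124')   # li
```

Returning nil for an unknown id lets the report flag a conref whose target element is missing instead of raising.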
NEVER USE A REGEX!!!
I didn’t go all exhaustive on it, just a general idea
Just for the record, the regex you gave will have difficulty with an
element such as fred whose id attribute comes after a ref=“other”
attribute. It will give the node name as ‘fred ref=“other”’ because
you are assuming that the id attribute is the first attribute after
the element name, which may not be the case. Of course you can make
the regex handle that too. But then the regex becomes even less
readable.
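The failure mode is easy to reproduce. The sketch below is not the regex posted earlier in the thread (that one is not shown here); both patterns are hypothetical stand-ins, one naive and one tightened, and the sample strings are made up.

```ruby
# Hypothetical naive pattern: capture everything between "<" and the id attribute.
def name_via_naive_regex(xml, id)
  match = xml.match(/<(.+?)\s+id="#{Regexp.escape(id)}"/)
  match && match[1]
end

# Tightened pattern: capture only name characters, then skip any other attributes.
# Still fragile (single-quoted attributes, comments, CDATA are not handled).
def name_via_tighter_regex(xml, id)
  match = xml.match(/<([\w:.-]+)\b[^>]*\sid="#{Regexp.escape(id)}"/)
  match && match[1]
end

simple = '<p id="a4563">this is a paragraph</p>'
tricky = '<fred ref="other" id="a4563">text</fred>'

puts name_via_naive_regex(simple, 'a4563')    # p
puts name_via_naive_regex(tricky, 'a4563')    # fred ref="other"
puts name_via_tighter_regex(tricky, 'a4563')  # fred
```

The tighter version gets this case right, but each new edge case needs another tweak to the pattern, which is the readability cost being described.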
On Wed, Mar 13, 2013 at 12:58 PM, Joel P. wrote:
> I didn’t go all exhaustive on it, just a general idea
Yes, and you have been shown why regexp is the wrong tool for parsing
markup of SGML heritage, especially when there is something as awesome
as Nokogiri around.
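require 'nokogiri'

# Sample markup: the original tags were lost; the <p> element and its id are assumptions.
dom = Nokogiri::XML <<XML
<p id="a4563">this is a paragraph</p>
XML

# List every element that carries an id, together with its element name.
dom.xpath('//*[@id]').each do |node|
  printf "%-10s %s\n", node[:id], node.name
end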
Kind regards
robert