NEWBIE: Ruby & XML

jeromeqc · August 10, 2010, 3:53pm

Hello.

I am working with some XML logs coming from a network simulator.
My aim is to strip out the transient information concerning any given
variable.

Here is some example data:

string = <<EOF

23
DTX

22
DTX

22
DTX

23
DTX

22
DTX

24
DTX

21
DTX

22
DTX

EOF

For example, I may want to extract all the “CQI” and timing values to
get:

23, 00:00:40.450
22, 00:00:40.460
22, 00:00:40.460
23, 00:00:40.460
22, 00:00:40.460
24, 00:00:40.460
21, 00:00:40.470
22, 00:00:40.470

Question: These files can be very large, so keeping the computer
resource overhead low is important. I’ve looked at other threads on this
forum to decide which method of extracting data against timestamps would
be the quickest, but the information has been conflicting.

I understand that stream parsing is faster than DOM. I also understand
that libxml is faster than REXML, but libxml streaming uses DOM. So is
it safe to assume that REXML streaming is faster than libxml streaming?

I also need to consider which approach would be easier to
implement.

jeromeqc · August 10, 2010, 4:52pm

On Tue, Aug 10, 2010 at 9:53 AM, Jerome David S. wrote:

Question: These files can be very large, so keeping the computer
resource overhead low is important. I’ve looked at other threads on this
forum to decide which method of extracting data against timestamps would
be the quickest, but the information has been conflicting.

I make no claim about what might be best, but Nokogiri seems to be
the leading Ruby XML library at the moment. I quickly adapted an old
REXML pull parser to work with your sample data:

require 'rexml/parsers/pullparser'

# Yields one hash per <primitive> element: its attributes plus the text
# of each named <parameter> child.
def parse(stream)
  raise "BlockRequired" unless block_given?

  parser = REXML::Parsers::PullParser.new(stream)

  row = {}
  col = nil

  while parser.has_next?
    event = parser.pull

    case event.event_type
    when :start_element
      case event[0]
      when 'primitive'
        # Start a new row from the element's attributes.
        row = event[1]; col = nil
      when 'parameter'
        col = event[1]["name"]
      end

      row[col] ||= String.new if col

    when :end_element
      col = nil

      case event[0]
      when 'primitive'
        yield(row)
      else
        # ignore
      end

    when :text
      # Only collect text while inside a named parameter.
      row[col] << event[0].chomp if col

    else
      # ignore
    end
  end
end

parse(string) { |row|
  #p row
  puts "#{row['CQI']}, #{row['time']}"
}

ruby x.rb
23, 00:00:40.450
22, 00:00:40.460
22, 00:00:40.460
23, 00:00:40.460
22, 00:00:40.460
24, 00:00:40.460
21, 00:00:40.470
22, 00:00:40.470

The original program I lifted that from was processing XML files of up to
several gigabytes. Particularly on the largest files, we saw much
better performance running under JRuby than under MRI (1.8.5 or so).
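
For comparison, here is a minimal sketch of the same extraction using REXML's callback-style stream listener instead of the pull parser. The sample markup is an assumption: the element names (primitive, parameter) and the name/time attributes are inferred from the pull parser above, and the log root element and the "HARQ" parameter name are made up for illustration. I have not benchmarked it against the pull parser.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical sample: structure inferred from the pull parser, not taken
# from the real simulator logs. "log" and "HARQ" are invented names.
SAMPLE = <<~XML
  <log>
    <primitive time="00:00:40.450">
      <parameter name="CQI">23</parameter>
      <parameter name="HARQ">DTX</parameter>
    </primitive>
    <primitive time="00:00:40.460">
      <parameter name="CQI">22</parameter>
      <parameter name="HARQ">DTX</parameter>
    </primitive>
  </log>
XML

# Collects [CQI, time] pairs as the parser streams through the input.
class CqiListener
  include REXML::StreamListener

  attr_reader :rows

  def initialize
    @rows = []
    @row = nil   # attributes and parameters of the current <primitive>
    @col = nil   # name of the <parameter> currently being read
  end

  def tag_start(name, attrs)
    case name
    when 'primitive'
      @row = attrs.dup
    when 'parameter'
      @col = attrs['name']
      @row[@col] = String.new if @row && @col
    end
  end

  def text(data)
    # Whitespace between elements arrives here too; skip it unless we are
    # inside a named parameter.
    @row[@col] << data.strip if @row && @col
  end

  def tag_end(name)
    @col = nil
    if name == 'primitive' && @row
      @rows << [@row['CQI'], @row['time']]
      @row = nil
    end
  end
end

listener = CqiListener.new
REXML::Document.parse_stream(SAMPLE, listener)
listener.rows.each { |cqi, time| puts "#{cqi}, #{time}" }
```

For very large files you would probably print each row in tag_end rather than accumulate them in an array, and pass an open File to parse_stream instead of a string.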