Hello.
I am working with some XML logs coming from a network simulator.
My aim is to extract the transient information for any given
variable.
Here is some example data:
string = <<EOF
23
DTX
22
DTX
22
DTX
23
DTX
22
DTX
24
DTX
21
DTX
22
DTX
EOF
For example, I may want to extract all the “CQI” and timing values to
get:
23, 00:00:40.450
22, 00:00:40.460
22, 00:00:40.460
23, 00:00:40.460
22, 00:00:40.460
24, 00:00:40.460
21, 00:00:40.470
22, 00:00:40.470
Question: These files can be very large, and keeping the computer
resource overhead low is important. I’ve looked at other threads on this
forum to decide which method of extracting data against timestamps would
be the quickest, but the information has been conflicting.
I understand that stream parsing is faster than DOM parsing. I also
understand that libxml is faster than REXML, but libxml streaming uses
DOM. So is it safe to assume that REXML streaming is faster than libxml
streaming?
I also need to consider which approach would be easier to implement.
On Tue, Aug 10, 2010 at 9:53 AM, Jerome David S.
[email protected] wrote:
> Question: These files can be very large and keeping the computer
> resource overhead low is important. I’ve looked at other threads on this
> forum to decide which method of extracting data against timestamps would
> be the quickest but the information has been conflicting.
I make no claim about what might be best, but Nokogiri seems to be
the leading Ruby XML library at the moment. I quickly adapted an old
REXML pull parser to work with your sample data:
require 'rexml/parsers/pullparser'

# Yields one hash per <primitive> element: its attributes plus the text of
# each nested <parameter>, keyed by the parameter's "name" attribute.
def parse(stream)
  raise "BlockRequired" unless block_given?
  parser = REXML::Parsers::PullParser.new(stream)
  row = {}
  col = nil
  while parser.has_next?
    event = parser.pull
    case event.event_type
    when :start_element
      case event[0]
      when 'primitive'
        # start a new row from the element's attributes
        row = event[1]; col = nil
      when 'parameter'
        col = event[1]["name"]
      end
      row[col] ||= "" if col
    when :end_element
      col = nil
      case event[0]
      when 'primitive'
        yield(row)
      else
        # ignore
      end
    when :text
      row[col] << event[0].chomp if col
    else
      # ignore
    end
  end
end

parse(string) { |row|
  #p row
  puts "#{row["CQI"]}, #{row["time"]}"
}
ruby x.rb
23, 00:00:40.450
22, 00:00:40.460
22, 00:00:40.460
23, 00:00:40.460
22, 00:00:40.460
24, 00:00:40.460
21, 00:00:40.470
22, 00:00:40.470
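For comparison, here is a rough sketch of the same extraction using REXML's callback-style stream listener instead of the pull parser. The XML tags did not survive in the sample above, so the input shape here is an assumption inferred from the parser code (a `primitive` element carrying a `time` attribute, with nested `parameter` elements identified by `name`). The wrapping `log` element, the parameter name `ack`, and the names `RowListener` and `extract_rows` are made up for illustration.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'
require 'stringio'

# ASSUMPTION: element/attribute names are inferred from the pull-parser code;
# <log> and name="ack" are hypothetical. The real simulator log may differ.
SAMPLE_XML = <<~XML
  <log>
    <primitive time="00:00:40.450">
      <parameter name="CQI">23</parameter>
      <parameter name="ack">DTX</parameter>
    </primitive>
    <primitive time="00:00:40.460">
      <parameter name="CQI">22</parameter>
      <parameter name="ack">DTX</parameter>
    </primitive>
  </log>
XML

# Collects one hash per <primitive> and hands it to the given block.
class RowListener
  include REXML::StreamListener

  def initialize(&on_row)
    @on_row = on_row
    @row = {}
    @col = nil
  end

  def tag_start(name, attrs)
    case name
    when 'primitive'
      @row = attrs.dup          # row starts with the element's attributes
    when 'parameter'
      @col = attrs['name']
      @row[@col] = +'' if @col  # unfrozen buffer for the parameter text
    end
  end

  def text(data)
    @row[@col] << data if @col  # only text inside a <parameter> is kept
  end

  def tag_end(name)
    @col = nil
    @on_row.call(@row) if name == 'primitive'
  end
end

# Returns "CQI, time" strings for every <primitive> read from io.
def extract_rows(io)
  rows = []
  listener = RowListener.new { |row| rows << "#{row['CQI']}, #{row['time']}" }
  REXML::Document.parse_stream(io, listener)
  rows
end

puts extract_rows(StringIO.new(SAMPLE_XML))
```

Neither version builds a DOM tree. Which is quicker on your files is something I would measure rather than assume.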
The original program I lifted the pull parser from processed XML files
of up to several gigabytes; particularly on the largest files, we saw
much better performance running under JRuby than under MRI (1.8.5 or so).
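For files of that size, you would probably want to hand the parser an open file rather than a string, so the whole log never has to sit in memory at once. A minimal sketch, assuming the same `primitive` element name as above; `count_primitives` and the temporary file are only there for illustration:

```ruby
require 'rexml/parsers/pullparser'
require 'tempfile'

# Counts <primitive> start tags read from any IO-like object,
# without building a DOM tree.
def count_primitives(io)
  parser = REXML::Parsers::PullParser.new(io)
  count = 0
  while parser.has_next?
    event = parser.pull
    count += 1 if event.start_element? && event[0] == 'primitive'
  end
  count
end

# Stand-in for a real log file (contents are hypothetical).
Tempfile.create(['sim_log', '.xml']) do |file|
  file.write('<log><primitive time="00:00:40.450"/><primitive time="00:00:40.460"/></log>')
  file.flush
  File.open(file.path) { |io| puts count_primitives(io) }
end
```

The same `File.open` block should work with the `parse` method above in place of `count_primitives`.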