Parsing big XML files - memory issue

Hello,

I need to parse two big XML files in a row (30+MB each). I have tried
both REXML and Hpricot. They do work. Thing is, with both libraries,
the parsing of each file takes a huge amount of memory: more than
700MB each!

So I was wondering:

  • is it normal that parsing a 30MB file takes 700MB of memory? Could
    it be that something is wrong with the file? Is there an alternative
    way to deal with such big files?
  • is there a way to force the release of the memory when I don’t need
    the file anymore? At the moment it is not released instantly after the
    first file, so I end up with 1.5GB memory use.

I have reduced the code to the minimum to isolate the memory issue:

require "rexml/document"   # or: require "hpricot"

xml = File.read("myfile.xml")
doc = REXML::Document.new(xml)   # or: doc = Hpricot.XML(xml)
doc = nil

and repeat with the second file.

Also, I tried libxml just in case. I get an error message that I can’t
explain:
LibXML::XML::Error (Fatal error: Input is not proper UTF-8, indicate
encoding !)
yet the file is UTF-8 as far as I can tell.

Thanks a lot for your help.
Pierre

On Jun 11, 9:41 am, PierreW wrote:

> Is there an alternative way to deal with such big files?
DOM parsers can use up a lot of memory with large files (10x the file
size or more). SAX parsers don’t, because they never hold the whole
document in memory: they just fire events as they read through the
input. REXML has a SAX-style parser, and libxml has one too.
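
For illustration, here is a minimal sketch of the event-driven style using REXML's SAX2Parser. The element name "title" and the helper name item_titles are made up for the example and do not come from the original files; the point is only that the listener blocks see one event at a time and no tree is ever built.

```ruby
require "rexml/parsers/sax2parser"
require "stringio"

# Collects the text of every <title> element (hypothetical element name)
# while streaming through the input; no document tree is kept in memory.
def item_titles(io)
  titles = []
  in_title = false
  buffer = +""

  parser = REXML::Parsers::SAX2Parser.new(io)

  # Fired for every opening tag.
  parser.listen(:start_element) do |_uri, localname, _qname, _attrs|
    if localname == "title"
      in_title = true
      buffer = +""
    end
  end

  # Fired for text content; a single text node may arrive in several pieces.
  parser.listen(:characters) do |text|
    buffer << text if in_title
  end

  # Fired for every closing tag.
  parser.listen(:end_element) do |_uri, localname, _qname|
    if localname == "title"
      titles << buffer.strip
      in_title = false
    end
  end

  parser.parse
  titles
end

sample = StringIO.new("<items><item><title>First</title></item>" \
                      "<item><title>Second</title></item></items>")
p item_titles(sample)
```

For a real file you would pass File.new("myfile.xml") instead of the StringIO, so the parser reads from disk as it goes rather than from one big string.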

Fred

Quoting PierreW:

> Could it be that something is wrong with the file? Is there an
> alternative way to deal with such big files?
>
>   • is there a way to force the release of the memory when I don’t need
>     the file anymore? At the moment it is not released instantly after the
>     first file, so I end up with 1.5GB memory use.

Generally XML libraries keep the whole content, including whitespace, in
an easily searchable tree-structured data structure, often plus the
original text, plus overhead.

> LibXML::XML::Error (Fatal error: Input is not proper UTF-8, indicate
> encoding !) yet the file is UTF-8 as far as I can tell.

LibXML is very picky about UTF-8 and I have not been able to figure out
how to get it to recover and continue parsing. Since Nokogiri is less
picky and is built on libxml2, I presume it is possible. (Hpricot is
also less picky, but it uses its own parser rather than LibXML.)

Whenever I have dug into the source file in question, there has been a
non-UTF-8 character. Look at the reader.line_number and
reader.column_number values to see where.
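
If the parser stops before you can read those values, another way to find the offending bytes is to scan the file directly. This is only a sketch using plain Ruby string methods, not anything from LibXML; the helper name invalid_utf8_positions is made up, and the file must be opened in binary mode so Ruby does not try to transcode it.

```ruby
require "stringio"

# Returns [line, column] pairs (both 1-based, column counted in characters)
# for every byte that is not part of a valid UTF-8 sequence.
def invalid_utf8_positions(io)
  positions = []
  io.each_line.with_index(1) do |line, lineno|
    # Reinterpret the raw bytes as UTF-8 without converting them.
    line = line.dup.force_encoding(Encoding::UTF_8)
    next if line.valid_encoding?

    line.each_char.with_index(1) do |ch, col|
      positions << [lineno, col] unless ch.valid_encoding?
    end
  end
  positions
end

# The second line contains a Latin-1 "é" (byte 0xE9), which is not valid UTF-8.
sample = StringIO.new("<a>ok</a>\n<b>caf\xE9</b>\n".b)
p invalid_utf8_positions(sample)
```

With a real file this would be called as File.open("myfile.xml", "rb") { |f| invalid_utf8_positions(f) }. A typical culprit is a file saved as Latin-1 or Windows-1252 while its XML declaration claims UTF-8.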

As another person has suggested, the SAX API does not keep the whole
file or DOM in memory and uses much less memory. Also look at the
XML::Reader interface: it is very fast and not at all memory hungry.
Your code will probably not be as pretty as with the DOM APIs, but
sometimes it is worth the trade-off. Switching from FeedTools to
XML::Reader for reading RSS feeds, grabbing just what my app needs,
resulted in a speedup of better than 10x, maybe 100x. This applies
especially if your code handles multiple XML schemas. There are at
least 6 different RSS schemas with varying interpretations, and for
some fields the whole DOM is searched 10 times. A single pass with
custom code is ugly, but worth it in my application.
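
To give an idea of what "once through with custom code" looks like, here is a rough sketch of the pull style. Note that it uses REXML's PullParser rather than libxml's XML::Reader, simply so that it runs without the libxml-ruby gem; the shape of the loop is similar but the method names differ. The element name "link" and the helper name links are assumptions for the example.

```ruby
require "rexml/parsers/pullparser"
require "stringio"

# Walks the input once, pulling one event at a time, and keeps only the
# text found inside <link> elements (hypothetical element name).
def links(io)
  parser = REXML::Parsers::PullParser.new(io)
  found = []
  inside = false

  while parser.has_next?
    event = parser.pull
    if event.start_element? && event[0] == "link"
      inside = true
      found << +""            # start a fresh buffer for this element
    elsif event.end_element? && event[0] == "link"
      inside = false
    elsif event.text? && inside
      found.last << event[0]  # raw text; entities are not expanded here
    end
  end

  found
end

sample = StringIO.new("<rss><item><link>http://example.com/a</link></item>" \
                      "<item><link>http://example.com/b</link></item></rss>")
p links(sample)
```

Unlike the SAX style, the caller drives the loop here, which tends to make it easier to bail out early once you have what you need.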

HTH,
Jeffrey

Hi all,

Thank you so much for pointing me in the right direction.

I used a REXML SAX2Parser: it solved my problem. It’s a bit more code
indeed, but it uses a fraction of the memory and it seems quite fast
to me.

Thanks a lot,
Pierre

On Jun 11, 5:22 pm, Maurício Linhares wrote:

Hi Pierre,

I had a 45-50MB file to parse using Ruby libraries, but to no avail:
the DOM-based libraries were painfully slow and the SAX-based one that
I tried (libxml-ruby) had some serious memory leaks. Now there’s this
SaxMachine from Paul Dix that looks usable -
http://www.pauldix.net/2009/01/sax-machine-sax-parsing-made-easy.html

As to my problem, I wrote a StAX-based parser in Java to get it to
run in reasonable time :(

Maurício Linhares
http://codeshooter.wordpress.com/ | http://twitter.com/mauriciojr