Indexing an XML/HTML File

S_D · April 12, 2008, 7:04am

I’m planning on indexing XML/HTML files. I only want to index the text
contained in the files, not any of the elements or tags. I just finished
reading Chapter 6 of “Ferret” (Balmain/O’Reilly), which presents a
solution for this issue. The essence of the solution is to parse the
XML/HTML and extract the text content using a parser such as Hpricot.

My concern is that this approach will not support highlighting of the
results [correct me if I’m wrong here], since the corresponding indexed
field will contain only the extracted text, without the elements and
tags needed to locate that text in the original document.

Question: wouldn’t a better approach be to implement a tokenizer that
ignores XML/HTML tags and preserves the positions of the indexed terms
in the original markup? If this is indeed a good approach, does such a
solution already exist? If not, how can I contribute one once I have
implemented it?
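
To make the idea concrete, here is a rough sketch in plain Ruby of the
kind of tokenizer I have in mind. It does not use Ferret's analyzer API
at all; `MarkupToken` and `tokenize_markup` are names I made up for
illustration, and the tag matching is deliberately naive (no entity
decoding, and comments or attribute values containing ">" are not
handled). The point is only that each term keeps its byte offsets in the
original markup, which is what a highlighter would need.

```ruby
require 'strscan'

# Hypothetical token type: a lowercased term plus its byte offsets
# (start inclusive, end exclusive) in the ORIGINAL markup.
MarkupToken = Struct.new(:text, :start_offset, :end_offset)

# Hypothetical helper: walks the markup once, skipping anything that
# looks like a tag and recording offsets for every run of text.
def tokenize_markup(markup)
  tokens = []
  scanner = StringScanner.new(markup)
  until scanner.eos?
    start = scanner.pos                      # byte offset before this match
    if scanner.skip(/<[^>]*>/)               # a tag: ignore it entirely
      next
    elsif scanner.skip(/\s+/)                # whitespace between terms
      next
    elsif (word = scanner.scan(/[^\s<]+/))   # a run of text outside tags
      tokens << MarkupToken.new(word.downcase, start, scanner.pos)
    else
      scanner.getch                          # stray "<" with no closing ">"
    end
  end
  tokens
end

markup = "<p>Hello <b>big</b> world</p>"
tokenize_markup(markup).each do |t|
  # byteslice recovers the original spelling from the stored offsets
  original = markup.byteslice(t.start_offset, t.end_offset - t.start_offset)
  puts "#{t.text} (#{t.start_offset}...#{t.end_offset}) -> #{original}"
end
```

With offsets like these, the stored field could hold the original
markup while the index holds only the text terms. Whether Ferret's
highlighter can work from such offsets is exactly what I am unsure
about.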

Regards,
John
aka sd.codewarrior