I have a script (below) which attempts to build an index of all the
man pages on my system. It takes a while, mostly because it runs man
over and over, but anyway: as time goes on the memory usage goes up
and up and never comes back down. Eventually it runs out of RAM and
starts thrashing swap, pretty much grinding to a halt.
The workaround would seem to be to index the documents in batches in
the background, shutting down the indexing process every so often to
recover its memory; a rough sketch of what I have in mind is below.
I'm about to try that, since I'm really hunting a different bug, but
the memory problem still concerns me.
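Roughly this (untested; the batch size, the glob, and the fork-per-batch
layout are just placeholders for the simplest thing that hands the
memory back to the OS after each batch):

require 'rubygems'
require 'ferret'

BATCH = 500                              # arbitrary batch size
files = Dir["/usr/share/man/*/*.gz"]
(0...files.size).step(BATCH){|off|
  pid = Process.fork{
    # each batch gets its own writer in its own child process, so
    # whatever the child allocated is released when it exits
    w = Ferret::Index::IndexWriter.new(:path => "temp_index",
                                       :create_if_missing => true)
    files[off, BATCH].each{|manfile|
      # the real per-page work (running man, stripping overstrikes,
      # filling in all the fields) goes here, as in the full script below
      w << {:name => File.basename(manfile, ".gz")}
    }
    w.close
  }
  Process.wait pid
}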
require 'rubygems'
require 'ferret'
require 'set'
dir = "temp_index"
# optional: -p PREFIX restricts the glob below to man pages whose
# filenames start with PREFIX
if ARGV.first == "-p"
  ARGV.shift
  prefix = ARGV.shift
end
# field definitions: :name is stored and indexed; the other fields are
# indexed only; all of them keep positional term vectors
fi = Ferret::Index::FieldInfos.new
fi.add_field :name,
             :index => :yes, :store => :yes, :term_vector => :with_positions
%w[data field1 field2 field3].each{|fieldname|
  fi.add_field fieldname.to_sym,
               :index => :yes, :store => :no, :term_vector => :with_positions
}
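# create a fresh index at dir using the field definitions above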
i = Ferret::Index::IndexWriter.new(:path => dir, :create => true,
                                   :field_infos => fi)
list = Dir["/usr/share/man/*/#{prefix}*.gz"]
numpages = (ARGV.last || list.size).to_i
list[0...numpages].each{|manfile|
  all, name, section =
    /\A(.*)\.([^.]+)\Z/.match(File.basename(manfile, ".gz")).to_a
  # run the page through man(1) and strip the backspace-overstrike
  # sequences it uses for bold/underline
  tttt = `man #{section} #{name}`.gsub(/.[\b]/m, '')
  i << {
    :data => tttt.to_s,
    :name => name,
    :field1 => name,
    :field2 => name,
    :field3 => name,
  }
}
i.close
i = Ferret::Index::IndexReader.new dir
# flag any document whose :data term vector carries an absurd number of
# term positions
i.max_doc.times{|n|
  i.term_vector(n, :data).terms.
    inject(0){|sum, tvt| sum + tvt.positions.size } > 1_000_000 and
    puts "heinous term count for #{i[n][:name]}"
}
seenterms = Set[]
# enumerate every distinct term in :data and pull up its posting list
begin
  i.terms(:data).each{|term, df|
    seenterms.include?(term) and next
    i.term_docs_for(:data, term)
    seenterms << term
  }
rescue Exception
  raise
end