Hello all,
Is it possible to use parallel indexing and still ensure unique documents in the merged index? Using the canned example, I'm ending up with non-unique entries. It's just adding them all together even though I've defined a unique :key. How can I tell the IndexWriter to keep my uniqueness constraints?
For example, imagine that I have two indexes of a phone book: "index_one" contains a unique set of names A through P (let's say the key is their phone number), and "index_two" contains a unique set of names K through Z. When I merge them, I would hope to get a unique index of A through Z, but I'm getting double entries where they overlap, K through P.
Here's some code to demonstrate. My :id field is a long-ish unique alphanumeric string. In the example below, "one" and "two" are actually identical copies, each containing about 60,000 docs. I was hoping to get a combined index containing the same 60,000 docs, but ended up with 120,000.

Any help will be greatly appreciated. Thanks!
####################
require 'rubygems'
require 'ferret'
include Ferret::Analysis
include Ferret::Index

one    = "Documents/bucket/index_1"
two    = "Documents/bucket/index_2"
merged = "Documents/bucket/merged_index"

# Analyze everything with a LetterAnalyzer, except :id, which is
# whitespace-tokenized and indexed untokenized.
pfa = PerFieldAnalyzer.new(LetterAnalyzer.new)
pfa[:id] = WhiteSpaceAnalyzer.new

field_infos = FieldInfos.new(:term_vector => :no)
field_infos.add_field(:id, :index => :untokenized)

index_one = Ferret::I.new(
  :key               => :id,
  :max_buffer_memory => 0x8000000,
  :merge_factor      => 5,
  :path              => one,
  :analyzer          => pfa,
  :field_infos       => field_infos)

index_two = Ferret::I.new(
  :key               => :id,
  :max_buffer_memory => 0x8000000,
  :merge_factor      => 5,
  :path              => two,
  :analyzer          => pfa,
  :field_infos       => field_infos)

# Open a reader on each source index and merge them into a new index.
readers = []
readers << IndexReader.new(one)
readers << IndexReader.new(two)

puts "size of index_one = #{index_one.size}"
puts "size of index_two = #{index_two.size}"

index_writer = IndexWriter.new(:path => merged)
index_writer.add_readers(readers)
index_writer.close
readers.each { |reader| reader.close }

i = Ferret::I.new(:path => merged)
puts "size before optimize = #{i.size}"
i.optimize
puts "size after optimize = #{i.size}"