Parallel indexing with unique id?

Hello all,
Is it possible to use parallel indexing and still ensure unique documents in
the merged index? Using the canned example, I’m ending up with non-unique
entries. It’s just adding them all together even though I’ve defined a
unique :key.

How can I tell the IndexWriter to keep my uniqueness constraints?

For example, imagine that I have two indexes of a phone book:

“index_one” contains a unique set of names A-through-P (let’s say the key is
their phone number).

“index_two” contains a unique set of names K-through-Z.

When I merge them, I would hope to get a unique index of A-through-Z, but
I’m getting double entries where they overlap, K-through-P.

Here’s some code to demonstrate. My :id field is a long-ish unique
alphanumeric string. In the example below, “one” and “two” are actually
identical copies, each containing about 60,000 docs. I was hoping to get a
combined index containing the same 60,000 docs, but ended up with 120,000.

Any help will be greatly appreciated. Thanks!

####################

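require 'ferret'
include Ferret::Index     # FieldInfos, IndexReader, IndexWriter
include Ferret::Analysis  # PerFieldAnalyzer, LetterAnalyzer, WhiteSpaceAnalyzer

one = "Documents/bucket/index_1"
two = "Documents/bucket/index_2"
merged = "Documents/bucket/merged_index"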

pfa = PerFieldAnalyzer.new(LetterAnalyzer.new)
pfa[:id] = WhiteSpaceAnalyzer.new

field_infos = FieldInfos.new(:term_vector => :no)
field_infos.add_field(:id, :index => :untokenized)

index_one = Ferret::I.new(
  :key => :id,
  :max_buffer_memory => 0x8000000,
  :merge_factor => 5,
  :path => one,
  :analyzer => pfa,
  :field_infos => field_infos)

index_two = Ferret::I.new(
  :key => :id,
  :max_buffer_memory => 0x8000000,
  :merge_factor => 5,
  :path => two,
  :analyzer => pfa,
  :field_infos => field_infos)

readers = []
readers << IndexReader.new(one)
readers << IndexReader.new(two)

puts "size of index_one = "+index_one.size.to_s
puts "size of index_two = "+index_two.size.to_s

index_writer = IndexWriter.new(:path => merged)
index_writer.add_readers(readers)
index_writer.close()
readers.each{ |reader| reader.close() }

i = Ferret::I.new(:path => merged)

puts "size before optimize = "+i.size.to_s
i.optimize
puts "size after optimize = "+i.size.to_s

Hi!

On Mon, Mar 24, 2008 at 11:29:14PM -0600, R. Bryan Hughes wrote:

> Hello all,
> Is it possible to use parallel indexing and still ensure unique documents in
> the merged index? Using the canned example, I’m ending up with non-unique
> entries. It’s just adding them all together even though I’ve defined a
> unique :key.
>
> How can I tell the IndexWriter to keep my uniqueness constraints?

You can’t. The :key option is only interpreted by Ferret’s Index class,
which will delete any already existing records with the same key field
value before adding a new record.
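
To make the difference concrete, here is a minimal sketch in plain Ruby. It does
not use Ferret at all: raw_merge and KeyedIndex are hypothetical names that only
model the two behaviours described above, a merge that simply concatenates
documents versus a keyed add that deletes any existing record with the same key
value first. The letters stand in for the phone book entries from the question.

```ruby
# Plain-Ruby model, NOT Ferret's API. The names raw_merge and KeyedIndex
# are made up for illustration.

# Raw merge: concatenates the documents of all inputs and never looks at keys.
def raw_merge(*indexes)
  indexes.flatten(1)
end

# Keyed add: removes any existing document with the same key value
# before adding the new one, so the most recently added document wins.
class KeyedIndex
  def initialize(key)
    @key = key
    @docs = {}
  end

  def <<(doc)
    @docs.delete(doc[@key])   # drop the old record, if there is one
    @docs[doc[@key]] = doc
    self
  end

  def size
    @docs.size
  end
end

# Two "indexes" that overlap on K-through-P, as in the phone book example.
index_one = ('A'..'P').map { |c| { :id => c, :name => "name #{c}" } }
index_two = ('K'..'Z').map { |c| { :id => c, :name => "name #{c}" } }

merged_raw = raw_merge(index_one, index_two)

merged_keyed = KeyedIndex.new(:id)
(index_one + index_two).each { |doc| merged_keyed << doc }

puts "raw merge size   = #{merged_raw.size}"
puts "keyed merge size = #{merged_keyed.size}"
```

In this model the raw merge ends up with 32 entries (16 + 16) and the keyed
version with 26, one per letter. Applied to the real indexes, the same idea
would presumably mean adding each document to the merged index through the
Index class with :key set, rather than merging readers with the IndexWriter,
at the cost of re-indexing every document.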

Cheers,
Jens


Jens Krämer

Thanks! You saved me lots of time.