Query scoring - WTF?

havatar · July 11, 2007, 2:25pm

Hi!

I thought I understood Ferret’s query scoring and how to tweak
results using boost values. What I currently experience however,
leaves me completely baffled.

Perhaps someone can shed some light on the scoring algorithm, because
asking Ferret to “explain” the score for a particular document isn’t
as informative as I thought. Actually, it confuses me even more.

Here’s what I got:

I’m indexing locations (addresses) in Ferret using the following fields:

street, zipcode, district, city, county, state, country_code

Addresses are stored in different precisions, i.e. not all of the
fields contain values depending on the location’s accuracy. Here are
two examples:

1. Berlin, Germany:

   country_code: de
   city: Berlin

2. The district ‘Berlin’ in a town called ‘Seedorf’:

   country_code: de
   city: Seedorf
   district: Berlin

When querying for “berlin, de”, document #2 is ranked higher
(probably due to its natural position in the index). Since I want the
less accurate locations to rank higher, I added boost values. In the
example above, assume that city has a boost of 8 and district has a
boost of 7.

With this little adjustment the first document should rank higher
since the term ‘berlin’ appears in the city field. As you might
suspect, this is not what happens. And I consider this a bug.
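
To make the expectation above concrete, here is a toy Ruby model of field boosts. It is only a sketch and not Ferret's scoring: Ferret is a Lucene port, and Lucene-style scoring also factors in term frequency, inverse document frequency and field-length normalization. The names FIELD_BOOSTS, DOCS, toy_score and rank are made up for illustration. The model shows only the direction of the effect: with city boosted above district, the plain ‘Berlin’ document should come first.

```ruby
# Toy model only: NOT Ferret's real scoring formula.
# Boost values are taken from the example in the post.
FIELD_BOOSTS = { city: 8.0, district: 7.0, country_code: 1.0 }.freeze

DOCS = [
  { country_code: 'de', city: 'Berlin' },
  { country_code: 'de', city: 'Seedorf', district: 'Berlin' }
].freeze

# Sum the boost of every field whose value equals one of the query terms.
def toy_score(doc, terms, boosts = FIELD_BOOSTS)
  doc.sum do |field, value|
    terms.include?(value.downcase) ? boosts.fetch(field, 1.0) : 0.0
  end
end

# Order documents by descending toy score.
def rank(docs, terms, boosts = FIELD_BOOSTS)
  docs.sort_by { |doc| -toy_score(doc, terms, boosts) }
end

rank(DOCS, %w[berlin de]).each do |doc|
  puts "#{doc.inspect} score=#{toy_score(doc, %w[berlin de])}"
end
```

Passing a boosts hash with the city and district values swapped flips the order in this model.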

Then I went and set the document boost to be 8 for countries and 1
for streets. This doesn’t help either.

The ranking of other results changes slightly, but nothing seems to be
consistent with the boost settings. Perhaps the boost settings and
the results are related in some way, but it’s definitely not a
logical relation.

I’m thankful for any hint on how to achieve a proper ranking.

Thanks!
Andy

havatar · July 11, 2007, 2:37pm

Hi!

I tried to reproduce this, but changing the sort order by modifying
the boosts works perfectly for me:

require 'rubygems'
require 'ferret'

include Ferret

fi = Index::FieldInfos.new
fi.add_field :country_code
fi.add_field :city, :boost => 8
fi.add_field :district, :boost => 7
i = Ferret::I.new :field_infos => fi

i << { :country_code => 'de', :city => 'Berlin' }
i << { :country_code => 'de', :city => 'Seedorf', :district => 'Berlin' }

i.search_each 'berlin, de' do |hit, score|
  puts "#{i[hit][:country_code]} #{i[hit][:district]} #{i[hit][:city]} Score: #{score}"
end

This outputs:
de Berlin Score: 0.841327428817749
de Berlin Seedorf Score: 0.740611553192139

Swapping the boost values (city:7, district:8) also changes the result
sorting.

Any more info on other circumstances that might cause your problems?

Jens

–
Jens Krämer
webit! Gesellschaft für neue Medien mbH
Schnorrstraße 76 | 01069 Dresden
Telefon +49 351 46766-0 | Telefax +49 351 46766-66
[email protected] | www.webit.de

Amtsgericht Dresden | HRB 15422
GF Sven Haubold, Hagen Malessa

havatar · July 11, 2007, 3:41pm

Hi Jens,

Thanks a lot for reminding me that distilling a simple test case can
help clear things up quickly. I was buried so deep in my own code
that I couldn’t see the obvious.

Turns out there is a problem with a custom analyzer of mine. It works
OK and passed all tests but it seems that Ferret isn’t using the same
analyzer for searching and indexing although I’ve arranged for it. Or
so I thought.
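
As an illustration of why such a mismatch can go unnoticed, here is a minimal pure-Ruby sketch of an inverted index with separate index-time and query-time analyzers. It is not Ferret code: ToyIndex, LOWERCASING and CASE_PRESERVING are hypothetical names, and the analyzers are plain lambdas. Each analyzer works fine on its own, yet the search misses when the two disagree.

```ruby
# Toy inverted index, not Ferret: terms are stored exactly as the
# index-time analyzer emits them, so the query-time analyzer must agree.
LOWERCASING     = ->(text) { text.downcase.scan(/[[:alnum:]]+/) }
CASE_PRESERVING = ->(text) { text.scan(/[[:alnum:]]+/) }

class ToyIndex
  def initialize(index_analyzer, query_analyzer)
    @index_analyzer = index_analyzer
    @query_analyzer = query_analyzer
    @postings = Hash.new { |hash, term| hash[term] = [] }
    @doc_count = 0
  end

  # Analyze the text and record which document each term came from.
  def <<(text)
    doc_id = @doc_count
    @doc_count += 1
    @index_analyzer.call(text).each { |term| @postings[term] << doc_id }
    self
  end

  # Return the ids of documents containing any analyzed query term.
  def search(query)
    terms = @query_analyzer.call(query)
    terms.flat_map { |term| @postings.fetch(term, []) }.uniq.sort
  end
end

matched    = ToyIndex.new(LOWERCASING, LOWERCASING) << 'Berlin'
mismatched = ToyIndex.new(LOWERCASING, CASE_PRESERVING) << 'Berlin'

p matched.search('Berlin')     # both sides produce the term "berlin"
p mismatched.search('Berlin')  # the query term "Berlin" was never indexed
```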

I still haven’t found the culprit but you put me on the right track
anyway.

Thanks,
Andy

havatar · July 11, 2007, 4:12pm

On 11.07.2007, at 15:40, Andreas K. wrote:

> Turns out there is a problem with a custom analyzer of mine. It works
> OK and passed all tests but it seems that Ferret isn’t using the same
> analyzer for searching and indexing although I’ve arranged for it. Or
> so I thought.

Here are three more questions related to the problem. The problem is
definitely an analyzer mismatch, but I can’t really put my finger on it.

1. Is it required to pass the field_infos every time the index is
   opened, or is it sufficient if the index is created once via
   FieldInfos#create_index? In other words: are the field infos stored
   in the index?
2. The analyzer to be used for both reading and writing is passed to
   Index.new() via the :analyzer parameter. Correct? This is what I do
   and I even set the analyzer explicitly using Index#add_document(doc,
   analyzer).
3. For a given Index, how can I determine which analyzer is currently
   used for any given field, both for reading and writing?

Cheers,
Andy

havatar · July 11, 2007, 4:34pm

On Wed, Jul 11, 2007 at 04:12:35PM +0200, Andreas K. wrote:

> Is it required to pass the field_infos every time the index is
> opened, or is it sufficient if the index is once created via
> FieldInfos#create_index? In other words: are the field infos stored
> in the index?

Yes.

> The analyzer to be used for both reading and writing is passed to
> Index.new() via the :analyzer parameter. Correct? This is what I do
> and I even set the analyzer explicitly using Index#add_document(doc,
> analyzer).

Correct.

> For a given Index, how can I determine which analyzer is currently
> used for any given field, both for reading and writing?

I don’t know of any way to get this information. You can use
process_query to see what the query parser generates from your query
string (which involves analyzing it).

To see what gets indexed, you could use the ferret_browser Dave
introduced with the latest release to inspect your index.
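
The same comparison can be sketched without Ferret: run identical sample strings through whatever tokenizes at index time and at query time, and report where the results differ. This is only an illustration under assumptions. analyzer_mismatches is a hypothetical helper, and the two lambdas are stand-ins for the real analyzers, not Ferret's analyzer API.

```ruby
# Hypothetical helper: returns the sample strings for which the two
# analyzers (any callables returning an array of terms) disagree.
def analyzer_mismatches(samples, index_analyzer, query_analyzer)
  samples.each_with_object({}) do |text, report|
    indexed = index_analyzer.call(text)
    queried = query_analyzer.call(text)
    report[text] = { indexed: indexed, queried: queried } unless indexed == queried
  end
end

# Stand-in analyzers (assumptions): in a real setup these would wrap
# whatever produces tokens at index time and at query time.
index_side = ->(text) { text.downcase.split(/[^[:alnum:]]+/).reject(&:empty?) }
query_side = ->(text) { text.split(/[^[:alnum:]]+/).reject(&:empty?) }

samples = ['berlin, de', 'Berlin, DE']
analyzer_mismatches(samples, index_side, query_side).each do |text, tokens|
  puts "#{text.inspect}: indexed=#{tokens[:indexed]} queried=#{tokens[:queried]}"
end
```

An empty result means the two sides agree on every sample, so the sample list should cover the casing and punctuation that real queries contain.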

Jens
