On 9/6/06, Neville B. [email protected] wrote:
So while it's possible to have multiple readers, 1 writer, the 1 writer
requirement forces use of synchronized, which means that the readers
must be serialised and not concurrent - is this correct?
Close. When you open an IndexReader on the index it is opened up on
that particular version (or state) of the index. So any operations on
the IndexReader (like searches) will only show what was in the index
at the time you opened it. Any modifications to the index (usually
through an IndexWriter) that occur after you open the IndexReader
will not appear in your searches. So to keep searches up to date you
need to close and reopen your IndexReader every time you commit
changes to the index.
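Roughly, in code (an untested sketch; the path is just a placeholder
and I'm going from memory on the IndexWriter/IndexReader options):

    require 'rubygems'
    require 'ferret'

    path = '/path/to/index'   # placeholder location

    writer = Ferret::Index::IndexWriter.new(:path => path, :create => true)
    writer << {:title => "First doc", :content => "some content"}
    writer.commit

    # This reader sees the index exactly as it is right now.
    reader = Ferret::Index::IndexReader.new(path)

    writer << {:title => "Second doc", :content => "more content"}
    writer.commit

    # The reader above still shows only the first document. To see the
    # second one you have to close it and open a fresh reader.
    reader.close
    reader = Ferret::Index::IndexReader.new(path)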
So the writer doesn't force the use of synchronized. Rather, it forces
you to decide whether searches need to return the most up-to-date
results available or whether there can be a short delay between changes
being written to the index and changes appearing in the search
results. The Index class makes it as simple as possible to always
search the latest index but there is a performance hit. Most of the
time performance should be fine. The Ferret C core has been highly
optimized and will still beat most other solutions hands down, even
when used in this way.
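For comparison, the convenient (but slightly slower) path looks
something like this; again just a sketch with a placeholder path:

    require 'rubygems'
    require 'ferret'

    index = Ferret::Index::Index.new(:path => '/path/to/index',
                                     :auto_flush => true)

    index << {:title => "A new page", :content => "freshly added content"}

    # With :auto_flush the write above is committed straight away, and
    # Index reopens its reader when it notices the index has changed, so
    # this search already sees the new document. That bookkeeping on
    # every search is where the performance hit comes from.
    index.search_each('content:"freshly added"') do |doc_id, score|
      puts "#{index[doc_id][:title]} scored #{score}"
    end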
Now, if I were writing an application where search performance is a
big issue (as it seems to be in your case) then I would start by using
the base classes like IndexReader and IndexWriter (as we’ve already
discussed). Like I just mentioned you might allow a delay between the
time the index is modified and the time those modifications appear in
search results. This would allow you to update the IndexReader every
minute/hour/day/week without regard to what the IndexWriter is doing.
This solution works well when scraping webpages. Google's
results, for example, aren’t always completely up to date with the
pages they index. If one of their results is a dead link it isn’t the
end of the world.
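One way to do that delayed-update version is a little wrapper along
these lines (my own sketch, not anything built into Ferret, and I'm
assuming Searcher can be opened straight from a path like the other
classes):

    require 'rubygems'
    require 'ferret'

    class StaleTolerantSearcher
      def initialize(path, interval = 60)
        @path     = path
        @interval = interval
        @searcher = nil
        reopen
      end

      # query should already be a Ferret::Search::Query (built with
      # Ferret::QueryParser or one of the Query classes).
      def search_each(query, &block)
        reopen if Time.now - @opened_at > @interval
        @searcher.search_each(query, &block)
      end

      private

      # Throw away the old searcher and open a new one on the current
      # version of the index.
      def reopen
        @searcher.close if @searcher
        @searcher  = Ferret::Search::Searcher.new(@path)
        @opened_at = Time.now
      end
    end

The IndexWriter just keeps committing whenever it likes and never has
to coordinate with this searcher; results are at most interval seconds
stale.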
If, however, you are indexing data in a database it often isn’t this
simple. If you use the previous solution with a database that allows
deletes then you need some way to handle results that reference
objects that have been deleted from the database. Otherwise you will
need some way to synchronize on the index (probably on the
Ferret::Store::Directory like Ferret::Index::Index does) so that no
searches are done while the deletion is committed to the index and the
IndexReaders are updated.
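To give a feel for the first option, here is a rough sketch. The Page
model and the :db_id field are just assumptions about how you might tie
index documents back to database rows, and index is a
Ferret::Index::Index:

    results = []
    index.search_each('content:"ruby"') do |doc_id, score|
      # Each document stores the id of the row it was built from.
      record = Page.find_by_id(index[doc_id][:db_id])
      # The row may have been deleted since the reader was last
      # reopened, so silently drop hits with no backing record.
      results << record if record
    end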
Another solution which I’m going to experiment with is using the index
as your database. You may still keep your original database but store
any data in the index that will be shown back to the user as the
result of a search. That way you don’t need to worry about
synchronization with the database.
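Something like this is what I have in mind (again just a sketch; page
is a stand-in for whatever object you are indexing, and since Ferret
stores fields by default you can read them straight back out of the
results):

    index = Ferret::Index::Index.new(:path => '/path/to/index')

    # Put everything the results page needs to display into the index
    # itself, keyed back to the database row only by :db_id.
    index << {
      :db_id   => page.id,
      :title   => page.title,
      :summary => page.summary,
      :content => page.content
    }

    # Rendering a hit never touches the database.
    index.search_each('content:"ruby"') do |doc_id, score|
      doc = index[doc_id]
      puts "#{doc[:title]} -- #{doc[:summary]}"
    end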
I don't think I've explained this very clearly here so feel free to
ask for clarification. I will be endeavoring to write this all down in
a clearer, more comprehensible manner so that everyone can work out the
solution that best fits their needs.
Cheers,
Dave
PS: The ideal solution for me would be an object database with
Ferret-like full-text search built in. I’ve been thinking about this a
lot lately. It would certainly fit the style of development used in
many Rails apps. That is to say, all access to the database must go
through the model as that is where all the validation is. If you are
developing this way, why bother with a relational database and ORM
solution? A good object database would serve the same purpose and
would be a LOT more performant. Obviously this solution wouldn't be
for everybody, though, so enterprise developers can feel free to ignore it.