Any support for wordsegment search?

jin · April 2, 2007, 7:58am

Does anybody know whether Ferret or acts_as_ferret supports wordsegment
search, like what Lucene can do?
I’d like to know; if not, I will use Lucene instead.
I can’t find any relevant documentation on this for Ruby.

jin · April 2, 2007, 1:33pm

On Mon, Apr 02, 2007 at 07:58:35AM +0200, Jin wrote:

Does anybody know whether Ferret or acts_as_ferret supports wordsegment
search?

I don’t know what you mean by this. Could you give an example? Strangely
enough, I only seem to find Chinese documents when googling. It looks like
that’s a feature useful when analyzing Chinese text…

like what Lucene can do.

If Lucene can do it, Ferret will most probably be able to do it, too.

Maybe it’s just a matter of implementing a custom analyzer. I think I
found something like that here: http://kingcat1234.spaces.live.com/
(search for wordSegement).

Jens

–
Jens Krämer
webit! Gesellschaft für neue Medien mbH
Schnorrstraße 76 | 01069 Dresden
Telefon +49 351 46766-0 | Telefax +49 351 46766-66
[email protected] | www.webit.de

Amtsgericht Dresden | HRB 15422
GF Sven Haubold, Hagen Malessa

jin · April 3, 2007, 3:55am

Jens K. wrote:

On Mon, Apr 02, 2007 at 07:58:35AM +0200, Jin wrote:

Does anybody know whether Ferret or acts_as_ferret supports wordsegment
search?

I don’t know what you mean by this. Could you give an example? Strangely
enough, I only seem to find Chinese documents when googling. It looks like
that’s a feature useful when analyzing Chinese text…

Yep. The search system is now built, but more features would be better.
The customer’s intention is that if he enters ‘designpattern’, with no
space between ‘design’ and ‘pattern’, the search should split it into the
two words, just as Google does, and return results for both
‘designpattern’ and ‘design pattern’. That’s it.

If Lucene can do it, Ferret will most probably be able to do it, too.

Maybe it’s just a matter of implementing a custom analyzer. I think I
found something like that here: http://kingcat1234.spaces.live.com/
(search for wordSegement).

I have checked the website you gave, thank you. But what they have done
is divide Chinese sentences into words, and it is not very accurate; e.g.
‘iamyourfriend’ may be divided into ‘iam’, ‘amyou’, ‘yourfriend’,
‘friend’, or something like this.
This kind of thing needs vocabulary support.
If a similar implementation existed, I would prefer to use it rather
than create it.
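
To make the "vocabulary support" point concrete, here is a minimal sketch of dictionary-backed splitting in plain Ruby. It is not Ferret's or Lucene's analyzer API, and the `VOCABULARY` word list and the `segment` and `query_variants` helpers are hypothetical names made up for illustration. It assumes the term must be covered exactly by known words and prefers the split with the fewest words.

```ruby
require "set"

# Hypothetical vocabulary; a real one would be loaded from a word list.
VOCABULARY = Set.new(%w[design pattern i am your friend you our])

# Returns an array of vocabulary words that exactly cover +text+,
# preferring the segmentation with the fewest words, or nil if none exists.
def segment(text, vocabulary = VOCABULARY)
  max_len = vocabulary.map(&:length).max || 0
  # best[i] holds the shortest known segmentation of the first i characters.
  best = Array.new(text.length + 1)
  best[0] = []
  (1..text.length).each do |stop|
    ([0, stop - max_len].max...stop).each do |start|
      next if best[start].nil?
      word = text[start...stop]
      next unless vocabulary.include?(word)
      candidate = best[start] + [word]
      best[stop] = candidate if best[stop].nil? || candidate.length < best[stop].length
    end
  end
  best[text.length]
end

# Builds the query variants: the term as typed plus, when it can be split
# into two or more known words, the spaced form.
def query_variants(term, vocabulary = VOCABULARY)
  words = segment(term.downcase, vocabulary)
  return [term] if words.nil? || words.length < 2
  [term, words.join(" ")]
end
```

With this word list, `query_variants("designpattern")` yields both the original term and the spaced form, which could then be combined into one search. How accurate the splits are depends entirely on the vocabulary supplied.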

jin · April 6, 2007, 7:58am

On 4/3/07, Jens K. [email protected] wrote:

Yep. The search system is now built, but more features would be better.
The customer’s intention is that if he enters ‘designpattern’, with no
space between ‘design’ and ‘pattern’, the search should split it into the
two words, just as Google does, and return results for both
‘designpattern’ and ‘design pattern’. That’s it.

Interesting, that could be useful for analyzing German text, too. We
have lots of compound words like this.

I’ve already mentioned this to Jin in private, but I think the better
solution for something like this is to post-process the query if you
get very few (or zero) matches. For example, you could run the query
through a spell checker that would suggest rewriting
‘designpattern’ as ‘design pattern’. I’m not sure whether the spell
checker approach would work in German or Chinese but some sort of
post-processing should do the trick. If anyone has implemented
something like this or has any good ideas I’d love to hear them.
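
As a rough sketch of that post-processing idea, the Ruby below retries a query only when the first attempt returns too few hits. It swaps a simple two-way dictionary split in for a real spell checker, and everything named here (`FAKE_INDEX`, `KNOWN_WORDS`, `split_suggestions`, `search_with_fallback`) is a hypothetical stand-in rather than part of Ferret or acts_as_ferret. The `search` argument is assumed to be any callable that takes a query string and returns an array of hits.

```ruby
# Hypothetical stand-in for a real index: maps a query string to document ids.
FAKE_INDEX = {
  "design pattern" => [1, 2, 3],
  "ferret" => [4]
}.freeze

# Hypothetical word list used to propose splits.
KNOWN_WORDS = %w[design pattern search engine].freeze

# Proposes "left right" rewrites for a term by trying every split point
# and keeping the splits where both halves are known words.
def split_suggestions(term, known_words = KNOWN_WORDS)
  (1...term.length).map do |i|
    left = term[0...i]
    right = term[i..-1]
    "#{left} #{right}" if known_words.include?(left) && known_words.include?(right)
  end.compact
end

# Runs the query as typed. Only if it returns fewer than +min_hits+ results
# does it retry with each suggested rewrite, returning the first that matches.
def search_with_fallback(query, search, min_hits: 1)
  hits = search.call(query)
  return { query: query, hits: hits } if hits.length >= min_hits

  split_suggestions(query).each do |rewritten|
    retry_hits = search.call(rewritten)
    return { query: rewritten, hits: retry_hits } unless retry_hits.empty?
  end
  { query: query, hits: hits }
end

search = ->(q) { FAKE_INDEX.fetch(q, []) }
result = search_with_fallback("designpattern", search)
```

Keeping the rewrite outside the analyzer means the index itself stays unchanged, and the rewritten query can be shown to the user as a suggestion instead of being applied silently.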

cheers,
Dave

jin · April 3, 2007, 10:24am

On Tue, Apr 03, 2007 at 03:55:28AM +0200, Jin wrote:

The customer’s intention is that if he enters ‘designpattern’, with no
space between ‘design’ and ‘pattern’, the search should split it into the
two words, just as Google does, and return results for both
‘designpattern’ and ‘design pattern’. That’s it.

Interesting, that could be useful for analyzing German text, too. We
have lots of compound words like this.

If Lucene can do it, Ferret will most probably be able to do it, too.

Maybe it’s just a matter of implementing a custom analyzer. I think I
found something like that here: http://kingcat1234.spaces.live.com/
(search for wordSegement).

I have checked the website you gave, thank you. But what they have done
is divide Chinese sentences into words, and it is not very accurate; e.g.
‘iamyourfriend’ may be divided into ‘iam’, ‘amyou’, ‘yourfriend’,
‘friend’, or something like this.

ah, ok.

This kind of thing needs vocabulary support.

yeah.

If a similar implementation existed, I would prefer to use it rather
than create it.

At least I couldn’t find one. Are you sure Lucene has an analyzer that
can split compound words? If so, porting it to Ruby should be relatively
easy.
Jens


jin · April 6, 2007, 8:13am

On 4/6/07, David B. [email protected] wrote:

have lots of compound words like this

I’ve already mentioned this to Jin in private, but I think the better
solution for something like this is to post-process the query if you
get very few (or zero) matches. For example, you could run the query
through a spell checker that would suggest rewriting
‘designpattern’ as ‘design pattern’. I’m not sure whether the spell
checker approach would work in German or Chinese but some sort of
post-processing should do the trick. If anyone has implemented
something like this or has any good ideas I’d love to hear them.

I forgot to mention; I would guess that this is probably how Google
does the same thing.