KinoSearch::Docs::Tutorial::Analysis - How to choose and use Analyzers.
Try swapping out the PolyAnalyzer in USConSchema for a Tokenizer:
    package USConSchema;
    use base qw( KinoSearch::Schema );
    use KinoSearch::Analysis::Tokenizer;

    sub analyzer { return KinoSearch::Analysis::Tokenizer->new }
Try searching for "senate", "Senate", and "Senator" before and after making
the change and re-indexing.
Under PolyAnalyzer, the results are identical for all three searches, but
under Tokenizer, searches are case-sensitive, and the result sets for
"Senate" and "Senator" are distinct.
What's happening is that PolyAnalyzer processes text more aggressively than
Tokenizer. In addition to tokenizing, it converts all text to lower case so
that searches are case-insensitive, and it applies a "stemming" algorithm to
reduce related words to a common stem ("senat", in this case).
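To see why the three queries collapse into one under PolyAnalyzer, here is a minimal pure-Perl sketch of the same three steps. It does not use KinoSearch at all: toy_stem is a hypothetical and deliberately crude stand-in for a real stemming algorithm, and the regex in tokenize is an assumption, not Tokenizer's actual pattern.

```perl
use strict;
use warnings;

# Split text into word tokens (an assumed pattern, not Tokenizer's own).
sub tokenize { return $_[0] =~ /(\w+)/g }

# Hypothetical toy stemmer: strips a trailing "or" or "e" and nothing else.
# A real stemmer handles far more suffixes.
sub toy_stem {
    my $term = shift;
    $term =~ s/(?:or|e)$//;
    return $term;
}

# Tokenizer-only analysis: terms keep their case and suffixes.
sub analyze_plain { return tokenize( $_[0] ) }

# PolyAnalyzer-style analysis: lowercase, tokenize, then stem.
sub analyze_poly { return map { toy_stem($_) } tokenize( lc $_[0] ) }

for my $query (qw( senate Senate Senator )) {
    my ($plain) = analyze_plain($query);
    my ($poly)  = analyze_poly($query);
    print "$query: plain=$plain poly=$poly\n";
}
```

With the tokenizer alone, the three queries produce three different terms, so they can only match documents containing that exact form. After lowercasing and stemming, all three produce the same term, which is why the result sets become identical.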
PolyAnalyzer is actually multiple Analyzers wrapped up in a single package.
In this case, it's three-in-one, since specifying a PolyAnalyzer with
language => 'en' is equivalent to this snippet:
    my $lc_normalizer = KinoSearch::Analysis::LCNormalizer->new;
    my $tokenizer     = KinoSearch::Analysis::Tokenizer->new;
    my $stemmer       = KinoSearch::Analysis::Stemmer->new( language => 'en' );
    my $polyanalyzer  = KinoSearch::Analysis::PolyAnalyzer->new(
        analyzers => [ $lc_normalizer, $tokenizer, $stemmer ],
    );
You can add or remove Analyzers from there if you like. Try adding a fourth
Analyzer, a Stopalizer, which suppresses "stopwords" such as "the", "if",
"maybe", and so on.
    my $stopalizer = KinoSearch::Analysis::Stopalizer->new(
        language => 'en',
    );
    my $polyanalyzer = KinoSearch::Analysis::PolyAnalyzer->new(
        analyzers => [ $lc_normalizer, $tokenizer, $stopalizer, $stemmer ],
    );
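The position of each Analyzer in the list matters, since each stage works on the output of the stage before it. The following pure-Perl sketch illustrates the idea with a chain of plain subroutines. It is not KinoSearch code: the %stoplist contents are a hypothetical miniature list, the tokenizing regex is an assumption, and the stemming stage is left out to keep the example short.

```perl
use strict;
use warnings;

# Hypothetical miniature stoplist; a real English stoplist is much longer.
my %stoplist = map { $_ => 1 } qw( the if maybe of a );

# Each stage takes a list of strings and returns a list of strings.
my @chain = (
    sub { map { lc $_ } @_ },              # lowercase
    sub { map { /(\w+)/g } @_ },           # tokenize (assumed pattern)
    sub { grep { !$stoplist{$_} } @_ },    # drop stopwords
);

# Run the input through every stage in order.
sub analyze {
    my @tokens = @_;
    @tokens = $_->(@tokens) for @chain;
    return @tokens;
}

print join( ' ', analyze('The Senate of the United States') ), "\n";
```

Because lowercasing runs first, the capitalized "The" is already "the" by the time it reaches the stoplist, so a stoplist of lowercase words is enough to catch it.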
Also, try removing the Stemmer.
    my $polyanalyzer = KinoSearch::Analysis::PolyAnalyzer->new(
        analyzers => [ $lc_normalizer, $tokenizer ],
    );
The original choice probably still yields the best results for this document collection, but you get the idea: sometimes you want a different Analyzer.
Sometimes you don't want an Analyzer at all. For instance, "category" fields are often set up to match exactly or not at all, as are fields like "last_name" (because you probably don't want to conflate results for "Humphrey" and "Humphries").
To specify that no analysis should be performed at all, use a custom FieldSpec:
    package MySchema::NotAnalyzed;
    use base qw( KinoSearch::FieldSpec::TextField );
    sub analyzed { 0 }
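As a rough illustration of what "match exactly or not at all" means for such a field, the pure-Perl sketch below treats each field value as one indivisible term. It is a toy model under that assumption, not KinoSearch's index format; %index, @last_names, and exact_match are hypothetical names introduced for the example.

```perl
use strict;
use warnings;

# Toy inverted index for one unanalyzed field: the whole value is one term.
my %index;
my @last_names = ( 'Humphrey', 'Humphries', 'Humphrey' );
for my $doc_id ( 0 .. $#last_names ) {
    push @{ $index{ $last_names[$doc_id] } }, $doc_id;
}

# A lookup succeeds only when the query equals the stored value exactly.
sub exact_match {
    my $query = shift;
    return @{ $index{$query} || [] };
}

print "Humphrey:  @{[ exact_match('Humphrey') ]}\n";
print "Humphries: @{[ exact_match('Humphries') ]}\n";
print "humphrey:  @{[ exact_match('humphrey') ]}\n";
```

With no lowercasing or stemming applied, "Humphrey" and "Humphries" stay separate terms, and a query that differs only in case finds nothing.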
Copyright 2008 Marvin Humphrey
See KinoSearch version 0.20.