NAME

KinoSearch::Docs::Tutorial::Analysis - How to choose and use Analyzers.

DESCRIPTION

Try swapping out the PolyAnalyzer in USConSchema for a Tokenizer:

    package USConSchema;
    use base qw( KinoSearch::Schema );

    use KinoSearch::Analysis::Tokenizer;

    sub analyzer { return KinoSearch::Analysis::Tokenizer->new }

Try searching for "senate", "Senate", and "Senator" before and after making the change and re-indexing.

Under PolyAnalyzer, the results are identical for all three searches. Under Tokenizer, searches are case-sensitive, and the result sets for "Senate" and "Senator" are distinct.

PolyAnalyzer processes text more aggressively than Tokenizer. In addition to tokenizing, it converts all text to lower case so that searches are case-insensitive, and it uses a "stemming" algorithm to reduce related words to a common stem ("senat", in this case).
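To see why the three searches collapse into one, here is a minimal sketch of the three stages in plain Perl. It does not use KinoSearch at all: lc_normalize(), tokenize(), and toy_stem() are hypothetical stand-ins, and toy_stem() is a crude suffix stripper invented for this example, not the stemming algorithm that KinoSearch uses.

```perl
use strict;
use warnings;

# Toy stand-ins for the three analysis stages. None of this is KinoSearch
# code; toy_stem() only strips a few hard-coded endings.
sub lc_normalize { return lc $_[0] }
sub tokenize     { return $_[0] =~ /\w+(?:'\w+)*/g }

sub toy_stem {
    my $token = shift;
    $token =~ s/(?:ors|or|es|e|s)$// if length($token) > 4;
    return $token;
}

# Tokenizing alone: split into words, leave case and endings untouched.
sub analyze_tokenizer_only { return tokenize( $_[0] ) }

# The full chain: lowercase, split into words, then stem each word.
sub analyze_poly {
    return map { toy_stem($_) } tokenize( lc_normalize( $_[0] ) );
}

for my $query (qw( senate Senate Senator )) {
    printf "%-8s tokenizer only: %-8s full chain: %s\n", $query,
        join( ' ', analyze_tokenizer_only($query) ),
        join( ' ', analyze_poly($query) );
}
```

With the full chain, all three queries come out as the single term "senat". With tokenizing alone, they stay three different terms, which is why the result sets differ.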

PolyAnalyzer is actually multiple Analyzers wrapped up in a single package. In this case, it's three-in-one, since specifying a PolyAnalyzer with language => 'en' is equivalent to this snippet:

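    use KinoSearch::Analysis::LCNormalizer;
    use KinoSearch::Analysis::Tokenizer;
    use KinoSearch::Analysis::Stemmer;
    use KinoSearch::Analysis::PolyAnalyzer;

    my $lc_normalizer = KinoSearch::Analysis::LCNormalizer->new;
    my $tokenizer     = KinoSearch::Analysis::Tokenizer->new;
    my $stemmer       = KinoSearch::Analysis::Stemmer->new( language => 'en' );
    my $polyanalyzer  = KinoSearch::Analysis::PolyAnalyzer->new(
        analyzers => [ $lc_normalizer, $tokenizer, $stemmer ],
    );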
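The idea of several Analyzers wrapped up in one can be sketched in plain Perl as a list of code refs applied in order. This is only an illustration of the pattern, not how PolyAnalyzer is implemented; make_chain() and the two stages are hypothetical names invented for the example.

```perl
use strict;
use warnings;

# Build one analysis routine out of an ordered list of stages. Each stage is
# a code ref that takes a list of strings and returns a list of strings.
sub make_chain {
    my @stages = @_;
    return sub {
        my @items = @_;
        @items = $_->(@items) for @stages;    # feed each stage's output to the next
        return @items;
    };
}

# Two toy stages: lowercase every string, then split every string into words.
my $lowercase = sub { map { lc $_ } @_ };
my $split     = sub { map { /\w+/g } @_ };

my $chain = make_chain( $lowercase, $split );
print join( '|', $chain->('We the People') ), "\n";
```

Because the chain is just an ordered list, adding or dropping a stage means changing the arguments to make_chain(), much as you change the analyzers array above.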

You can add or subtract Analyzers from there if you like. Try adding a fourth Analyzer, a Stopalizer, which suppresses "stopwords" like "the", "if", and "maybe".

    use KinoSearch::Analysis::Stopalizer;

    my $stopalizer = KinoSearch::Analysis::Stopalizer->new(
        language => 'en',
    );
    my $polyanalyzer = KinoSearch::Analysis::PolyAnalyzer->new(
        analyzers => [ $lc_normalizer, $tokenizer, $stopalizer, $stemmer ],
    );
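If you want a feel for what stopword suppression does to a token stream, here is a rough sketch in plain Perl. The stoplist is a hypothetical handful of words chosen for the example; it is not the list that Stopalizer uses for language => 'en', and remove_stopwords() is not a KinoSearch function.

```perl
use strict;
use warnings;

# Hypothetical stoplist, for illustration only.
my %is_stopword = map { $_ => 1 } qw( the if maybe a an and of to );

# Drop stopwords from a list of already lowercased tokens.
sub remove_stopwords { return grep { !$is_stopword{$_} } @_ }

# Lowercase first, then split into words, then filter.
my $text   = lc 'The Senate of the United States';
my @tokens = $text =~ /\w+/g;

print join( ' ', remove_stopwords(@tokens) ), "\n";
```

Only "senate united states" survives, so a search for "the" would find nothing in this text. Note that the filter sits after lowercasing, so that "The" and "the" are treated alike.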

Also, try removing the Stemmer.

    my $polyanalyzer = KinoSearch::Analysis::PolyAnalyzer->new(
        analyzers => [ $lc_normalizer, $tokenizer ], 
    );

The original choice probably still yields the best results for this document collection, but you get the idea: sometimes you want a different Analyzer.

When the best Analyzer is no Analyzer

Sometimes you don't want an Analyzer at all. For instance, "category" fields are often set up to match exactly or not at all, as are fields like "last_name" (because you probably don't want to conflate results for "Humphrey" and "Humphries").

To specify that there should be no analysis performed at all, use a custom FieldSpec:

    package MySchema::NotAnalyzed;
    use base qw( KinoSearch::FieldSpec::TextField );
    sub analyzed { 0 }
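The difference between an analyzed and an unanalyzed field can be sketched with a toy model in plain Perl. This is not KinoSearch code and makes simplifying assumptions: an analyzed field matches if any lowercased query word equals any lowercased field word, an unanalyzed field matches only if the query equals the whole value, and the field value "Amendment XIV" is a hypothetical example.

```perl
use strict;
use warnings;

# Toy analysis: lowercase and split into words.
sub terms { return map { lc $_ } $_[0] =~ /\w+/g }

# Analyzed field: a hit if any query term equals any field term.
sub matches_analyzed {
    my ( $value, $query ) = @_;
    my %field_term = map { $_ => 1 } terms($value);
    return ( grep { $field_term{$_} } terms($query) ) ? 1 : 0;
}

# Unanalyzed field: a hit only if the query equals the whole value.
sub matches_unanalyzed {
    my ( $value, $query ) = @_;
    return $value eq $query ? 1 : 0;
}

my $category = 'Amendment XIV';    # hypothetical field value
for my $query ( 'amendment', 'amendment xiv', 'Amendment XIV' ) {
    printf "%-15s analyzed: %d  unanalyzed: %d\n", $query,
        matches_analyzed( $category, $query ),
        matches_unanalyzed( $category, $query );
}
```

All three queries hit the analyzed field, but only the last one, which reproduces the value exactly, hits the unanalyzed field.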

COPYRIGHT

Copyright 2008 Marvin Humphrey

LICENSE, DISCLAIMER, BUGS, etc.

See KinoSearch version 0.20.
