2009-07-28 16 views

Answers

20

To do this, you need to write your own analyzer class. This is relatively straightforward. Here is the one I use. It combines stop word filtering, Porter stemming, and (this may be more than you need) stripping accents from characters.

using System.Collections;
using Lucene.Net.Analysis;

/// <summary>
/// An analyzer that chains several filters: stop word filtering,
/// diacritic stripping, and Porter stemming.
/// </summary>
public class CustomAnalyzer : Analyzer 
{ 
    /// <summary> 
    /// A rather short list of stop words that is fine for basic search use. 
    /// </summary> 
    private static readonly string[] stopWords = new[] 
    { 
     "0", "1", "2", "3", "4", "5", "6", "7", "8", 
     "9", "000", "$", "£", 
     "about", "after", "all", "also", "an", "and", 
     "another", "any", "are", "as", "at", "be", 
     "because", "been", "before", "being", "between", 
     "both", "but", "by", "came", "can", "come", 
     "could", "did", "do", "does", "each", "else", 
     "for", "from", "get", "got", "has", "had", 
     "he", "have", "her", "here", "him", "himself", 
     "his", "how","if", "in", "into", "is", "it", 
     "its", "just", "like", "make", "many", "me", 
     "might", "more", "most", "much", "must", "my", 
     "never", "now", "of", "on", "only", "or", 
     "other", "our", "out", "over", "re", "said", 
     "same", "see", "should", "since", "so", "some", 
     "still", "such", "take", "than", "that", "the", 
     "their", "them", "then", "there", "these", 
     "they", "this", "those", "through", "to", "too", 
     "under", "up", "use", "very", "want", "was", 
     "way", "we", "well", "were", "what", "when", 
     "where", "which", "while", "who", "will", 
     "with", "would", "you", "your", 
     "a", "b", "c", "d", "e", "f", "g", "h", "i", 
     "j", "k", "l", "m", "n", "o", "p", "q", "r", 
     "s", "t", "u", "v", "w", "x", "y", "z" 
    }; 

    private Hashtable stopTable; 

    /// <summary> 
    /// Creates an analyzer with the default stop word list. 
    /// </summary> 
    public CustomAnalyzer() : this(stopWords) {} 

    /// <summary> 
    /// Creates an analyzer with the passed in stop words list. 
    /// </summary> 
    public CustomAnalyzer(string[] stopWords) 
    { 
     stopTable = StopFilter.MakeStopSet(stopWords);  
    } 

    public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader) 
    { 
     // Use the instance's stop set so that a list passed to the constructor is honored.
     return new PorterStemFilter(new ISOLatin1AccentFilter(new StopFilter(new LowerCaseTokenizer(reader), stopTable)));
    } 
} 
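
To check what the analyzer actually emits, it can help to dump the terms it produces for a sample string. The following is a minimal sketch, assuming Lucene.Net 2.x, where TokenStream.Next() returns a Token (later releases moved to an attribute-based API). The class name AnalyzerDemo and the field name "content" are made up for illustration, and CustomAnalyzer refers to the class above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis;

public static class AnalyzerDemo
{
    // Runs the text through the analyzer and collects the emitted terms.
    // Assumes the Token-returning Next() API of Lucene.Net 2.x.
    public static List<string> Analyze(Analyzer analyzer, string text)
    {
        var terms = new List<string>();
        TokenStream stream = analyzer.TokenStream("content", new StringReader(text));
        Token token;
        while ((token = stream.Next()) != null)
        {
            terms.Add(token.TermText());
        }
        stream.Close();
        return terms;
    }

    public static void Main()
    {
        // Prints one term per line: stop words removed, accents stripped, words stemmed.
        foreach (string term in Analyze(new CustomAnalyzer(), "The cafés were running late"))
        {
            Console.WriteLine(term);
        }
    }
}
```

Running the same helper over a query string shows whether the indexed terms and the query terms end up in the same form.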
+1

Thanks, I'll try this. – devson

+1

+1 thanks Jack, this is exactly what I was looking for. If I could, I'd mark this as the answer! – andy

+0

I used your example, but I get no results for queries on the number '4656' (the standard analyzer works). I replaced the stop words with the built-in 'StopAnalyzer.ENGLISH_STOP_WORDS', which contains no numbers. Any ideas what's going on there? – Myster
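
A likely cause of the missing '4656' results is the tokenizer rather than the stop list: LowerCaseTokenizer splits the text at every non-letter character, so digits never become tokens in the first place. Below is a minimal sketch of a variant that keeps numbers, assuming Lucene.Net 2.x. It swaps in StandardTokenizer and LowerCaseFilter and keeps the rest of the chain. The class name NumericFriendlyAnalyzer is made up for illustration.

```csharp
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;

public class NumericFriendlyAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        // StandardTokenizer emits numeric tokens, unlike the letter-only LowerCaseTokenizer.
        TokenStream stream = new StandardTokenizer(reader);
        stream = new StandardFilter(stream);
        // Lowercasing is now a separate step because the tokenizer no longer does it.
        stream = new LowerCaseFilter(stream);
        stream = new StopFilter(stream, StopAnalyzer.ENGLISH_STOP_WORDS);
        stream = new ISOLatin1AccentFilter(stream);
        return new PorterStemFilter(stream);
    }
}
```

An index built with the letter-only analyzer would have to be rebuilt with this one before numeric queries can match.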

7

You can use Snowball or PorterStemFilter. See the Java Analyzer documentation for how to combine different filters/tokenizers/analyzers. Note that you must use the same analyzer for indexing and retrieval, so stemming has to be applied starting at index time.
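
The point about using one analyzer on both sides can be sketched as follows, assuming the Lucene.Net 2.x API (IndexWriter with a create flag, Field.Index.TOKENIZED, Hits). The names PorterAnalyzer and SameAnalyzerDemo and the field name "content" are made up for illustration.

```csharp
using System;
using Lucene.Net.Analysis;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;

// Hypothetical minimal stemming analyzer: lowercase letter tokens, then Porter stemming.
public class PorterAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
    {
        return new PorterStemFilter(new LowerCaseTokenizer(reader));
    }
}

public static class SameAnalyzerDemo
{
    // Indexes one document and runs one query, both through the same analyzer.
    public static int CountMatches(Analyzer analyzer, string text, string queryText)
    {
        Directory directory = new RAMDirectory();

        IndexWriter writer = new IndexWriter(directory, analyzer, true);
        Document doc = new Document();
        doc.Add(new Field("content", text, Field.Store.YES, Field.Index.TOKENIZED));
        writer.AddDocument(doc);
        writer.Close();

        // The query text is stemmed by the same analyzer that built the index.
        IndexSearcher searcher = new IndexSearcher(directory);
        Query query = new QueryParser("content", analyzer).Parse(queryText);
        Hits hits = searcher.Search(query);
        int count = hits.Length();
        searcher.Close();
        return count;
    }

    public static void Main()
    {
        Console.WriteLine(CountMatches(new PorterAnalyzer(), "She was running late", "run"));
    }
}
```

If the index were built with a non-stemming analyzer instead, the stored term would stay "running" and the query for "run" would find nothing.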

+0

Thanks, I'll try this. – devson
