nltk => Stemming

Introduction

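Stemming is a normalization technique. Many variations of a word carry the same meaning, except where tense matters. We stem to shorten lookups and to normalize sentences. In essence, stemming strips suffixes such as verb and tense endings to reduce a word to its root; the resulting stem is not always a dictionary word. One of the most popular stemming algorithms is the Porter stemmer, which has been around since 1979.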

Porter stemmer

Import PorterStemmer and initialize it

 from nltk.stem import PorterStemmer
 from nltk.tokenize import word_tokenize
 ps = PorterStemmer()

Stem a list of words

 example_words = ["python","pythoner","pythoning","pythoned","pythonly"]

 for w in example_words:
     print(ps.stem(w))

Result:

 python
 python
 python
 python
 pythonli
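
To make the "shorten lookups" idea concrete, here is a minimal sketch that groups the same word list by stem. The helper name `group_by_stem` is hypothetical and not part of NLTK; only `PorterStemmer.stem` comes from the library.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer

ps = PorterStemmer()


def group_by_stem(words):
    """Map each stem to the input words that reduce to it (hypothetical helper)."""
    groups = defaultdict(list)
    for w in words:
        groups[ps.stem(w)].append(w)
    return dict(groups)


example_words = ["python", "pythoner", "pythoning", "pythoned", "pythonly"]
groups = group_by_stem(example_words)

# Print each stem followed by the words that collapsed onto it.
for stem, members in groups.items():
    print(stem, "<-", members)
```

Four of the five variants share the stem `python`, so a lookup keyed on stems needs one entry instead of four. `pythonly` stays separate because it stems to `pythonli`.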

Stem a sentence after tokenizing it.

 new_text = "It is important to by very pythonly while you are pythoning with python. All pythoners have pythoned poorly at least once."

 word_tokens = word_tokenize(new_text)
 for w in word_tokens:
     print(ps.stem(w))   # Passing word tokens into stem method of Porter Stemmer

Result:

 it
 is
 import
 to
 by
 veri
 pythonli
 while
 you
 are
 python
 with
 python
 .
 all
 python
 have
 python
 poorli
 at
 least
 onc
 .
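
As a rough illustration of why normalized sentences help with search, the sketch below checks whether every stemmed query term appears among a sentence's stems. The helpers `stem_tokens` and `matches` are hypothetical names, and a simple regular expression stands in for `word_tokenize` as an assumption so that the example runs without downloading tokenizer data.

```python
import re

from nltk.stem import PorterStemmer

ps = PorterStemmer()


def stem_tokens(text):
    """Return the Porter stems of the alphabetic tokens in text."""
    # Assumption: a plain regex split replaces word_tokenize for this sketch,
    # so punctuation is dropped rather than emitted as separate tokens.
    return [ps.stem(tok) for tok in re.findall(r"[A-Za-z]+", text)]


def matches(query, document):
    """True if every stemmed query term occurs among the document's stems."""
    doc_stems = set(stem_tokens(document))
    return all(s in doc_stems for s in stem_tokens(query))


doc = "All pythoners have pythoned poorly at least once."
print(matches("pythoning", doc))  # query and document share the stem "python"
print(matches("java", doc))       # no token in the document stems to "java"
```

Because both the query and the sentence are reduced to stems first, `pythoning` matches a sentence that only contains `pythoners` and `pythoned`.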

Modified text is an extract of the original Stack Overflow Documentation

Licensed under CC BY-SA 3.0

Not affiliated with Stack Overflow
