nltk => Stop Words

Introduction

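Stop words are words that serve mostly as filler and carry little useful meaning. They should not take up space in a database or consume valuable processing time. It is easy to build a list of words to treat as stop words and filter them out of the data you want to process.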
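As a minimal sketch of that idea, the snippet below builds a small hand-picked stop-word set and filters a whitespace-split sentence against it. The word list, the sample sentence, and the helper name remove_stop_words are illustrative assumptions and are not part of NLTK.

```python
# Hypothetical hand-made stop-word list; choose words that suit your own data.
CUSTOM_STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to"}


def remove_stop_words(text, stop_words=CUSTOM_STOP_WORDS):
    """Return the words of `text` that are not in `stop_words`.

    Words are split on whitespace and compared in lower case.
    """
    return [word for word in text.split() if word.lower() not in stop_words]


filtered = remove_stop_words("The cat is a friend of the dog")
print(filtered)
```

A set is used rather than a list so that each membership check is a constant-time lookup, which matters once the text grows large.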

Filtering out stop words

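NLTK comes with a built-in list of words that it treats as stop words by default. Once the corpus has been downloaded (for example with nltk.download('stopwords')), it is available through the NLTK corpus module: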

from nltk.corpus import stopwords

To check the list of stop words stored for English:

stop_words = set(stopwords.words("english"))
print(stop_words)

An example that uses stop_words to remove stop words from a given text:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = "This is a sample sentence, showing off the stop words filtration."

# A set gives fast membership checks while filtering.
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example_sent)

# Keep only the tokens that are not stop words.
filtered_sentence = [w for w in word_tokens if w not in stop_words]

# The same filter written as an explicit loop; it rebuilds the identical list.
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print(word_tokens)
print(filtered_sentence)
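
The NLTK English list is all lower case, so a capitalized token such as "This" passes through the filter above, as do punctuation tokens such as "," and ".". One possible refinement, sketched here under assumptions, is to compare tokens in lower case and to drop tokens made up only of punctuation. The helper name filter_tokens and the stand-in token and stop-word data are hypothetical; they are used so the sketch runs without NLTK installed.

```python
import string


def filter_tokens(tokens, stop_words):
    """Drop stop words (compared in lower case) and punctuation-only tokens."""
    kept = []
    for token in tokens:
        if token.lower() in stop_words:
            continue  # stop word, regardless of capitalization
        if all(ch in string.punctuation for ch in token):
            continue  # token consists only of punctuation characters
        kept.append(token)
    return kept


# Stand-in data (assumption); in practice pass word_tokenize(text)
# and set(stopwords.words("english")) instead.
sample_tokens = ["This", "is", "a", "sample", "sentence", ",", "showing",
                 "off", "the", "stop", "words", "filtration", "."]
sample_stop_words = {"this", "is", "a", "off", "the"}

result = filter_tokens(sample_tokens, sample_stop_words)
print(result)
```

Whether to lower-case and strip punctuation depends on the task; for some applications capitalization or punctuation carries information worth keeping.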

Modified text is an extract of the original Stack Overflow Documentation

Licensed under CC BY-SA 3.0

Not affiliated with Stack Overflow
