nltk => Stop Words

Introduction

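Stop words are words that mostly serve as filler and carry little useful meaning. We should avoid letting them take up space in a database or consume valuable processing time. It is easy to write a list of words to treat as stop words and then filter those words out of the data you want to process.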
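As a minimal sketch of the idea of writing your own list, the snippet below filters a sentence against a hand-written stop-word set. The set `CUSTOM_STOP_WORDS` and the helper `filter_stop_words` are hypothetical names made up for illustration, and `str.split()` stands in for a real tokenizer.

```python
# Hypothetical hand-written stop-word list; pick words that suit your own data.
CUSTOM_STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to"}


def filter_stop_words(text, stop_words=CUSTOM_STOP_WORDS):
    """Return the words of `text` that are not in `stop_words`."""
    # str.split() is a naive whitespace tokenizer, used to keep the sketch
    # free of dependencies. Words are lowercased only for the comparison.
    return [word for word in text.split() if word.lower() not in stop_words]


print(filter_stop_words("The size of the database is a concern"))
# ['size', 'database', 'concern']
```

Storing the stop words in a set rather than a list keeps each membership check cheap, which matters when filtering large amounts of text.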

Filtering out stop words

By default, NLTK ships with a set of words that it treats as stop words. You can access them through the NLTK corpus module:

from nltk.corpus import stopwords

To check the list of stop words stored for English:

stop_words = set(stopwords.words("english"))
print(stop_words)
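The lookup above assumes the stopwords corpus is already installed locally. If it is not, NLTK raises a LookupError on first access. One possible way to handle that, sketched here with a hypothetical helper named `load_english_stop_words`, is to download the corpus on demand:

```python
def load_english_stop_words():
    """Return NLTK's English stop words as a set, downloading the corpus if needed."""
    # Imported inside the function so that merely defining the helper
    # does not require NLTK to be installed.
    import nltk
    from nltk.corpus import stopwords

    try:
        return set(stopwords.words("english"))
    except LookupError:
        # The corpus is missing locally; fetch it once and retry.
        nltk.download("stopwords", quiet=True)
        return set(stopwords.words("english"))
```

Calling `load_english_stop_words()` then returns the same set as `set(stopwords.words("english"))`. To see which other languages are available, `stopwords.fileids()` lists the names that can be passed to `stopwords.words()`.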

An example that uses the stop_words set to remove stop words from a given text:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = "This is a sample sentence, showing off the stop words filtration."
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example_sent)
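# Keep only the tokens that are not stop words.
# The comparison is case-sensitive, so "This" is kept while "the" is removed.
filtered_sentence = [w for w in word_tokens if w not in stop_words]

# Equivalent explicit loop, shown for comparison; it builds the same list.
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)
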
print(word_tokens)
print(filtered_sentence)
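The filter above compares tokens exactly as they appear, so capitalized words such as "This" and punctuation tokens such as "," pass through. A possible variant, sketched below, lowercases each token for the comparison and drops tokens made up only of punctuation. The helper `remove_stop_words` is a hypothetical name, `sample_stop_words` is a small stand-in for `set(stopwords.words('english'))`, and the token list is written out by hand to resemble what `word_tokenize` would be expected to return for the example sentence.

```python
import string

PUNCTUATION = set(string.punctuation)


def remove_stop_words(tokens, stop_words):
    """Drop stop words (compared case-insensitively) and punctuation-only tokens."""
    return [
        tok for tok in tokens
        if tok.lower() not in stop_words
        and not all(ch in PUNCTUATION for ch in tok)
    ]


# Stand-in subset of the English stop words (assumption, for illustration only).
sample_stop_words = {"this", "is", "a", "off", "the"}

# Hand-written tokens for the example sentence (assumption, not tokenizer output).
tokens = ["This", "is", "a", "sample", "sentence", ",", "showing",
          "off", "the", "stop", "words", "filtration", "."]

print(remove_stop_words(tokens, sample_stop_words))
# ['sample', 'sentence', 'showing', 'stop', 'words', 'filtration']
```

Whether to lowercase or strip punctuation depends on the task: both are reasonable defaults for keyword extraction, but they discard information that other kinds of processing may need.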

Modified text is an extract of the original Stack Overflow Documentation

Licensed under CC BY-SA 3.0

Not affiliated with Stack Overflow
