Filtering Stop Words

Created: November-22, 2018

NLTK ships with a built-in list of words that it treats as stop words. The list is available through the NLTK corpus module:

from nltk.corpus import stopwords

To inspect the list of stop words stored for English:

stop_words = set(stopwords.words("english"))
print(stop_words)

An example that uses the stop_words set to remove stop words from a given text:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Requires the NLTK stop word corpus and tokenizer data to be downloaded first (see nltk.download).
example_sent = "This is a sample sentence, showing off the stop words filtration."
stop_words = set(stopwords.words("english"))
word_tokens = word_tokenize(example_sent)

# Keep only the tokens that are not in the stop word set.
# The match is case-sensitive, so the capitalized "This" is kept.
filtered_sentence = [w for w in word_tokens if w not in stop_words]

print(word_tokens)
print(filtered_sentence)
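
The NLTK English stop words are all lowercase, so the filter above leaves capitalized words such as "This" and punctuation tokens in the result. A minimal sketch of a case-insensitive variant follows. The function filter_stop_words and the small SAMPLE_STOP_WORDS set are hypothetical names introduced here for illustration, and the token list is written out by hand as an assumption about what word_tokenize would produce for the example sentence.

```python
import string

# Hypothetical stand-in for set(stopwords.words("english")),
# kept tiny so the sketch runs without any NLTK data installed.
SAMPLE_STOP_WORDS = {"this", "is", "a", "off", "the"}


def filter_stop_words(tokens, stop_words, drop_punctuation=True):
    """Return the tokens that are not stop words, comparing in lowercase."""
    kept = []
    for token in tokens:
        # Lowercase before the lookup so "This" matches "this".
        if token.lower() in stop_words:
            continue
        # Optionally skip tokens made up only of punctuation characters.
        if drop_punctuation and all(ch in string.punctuation for ch in token):
            continue
        kept.append(token)
    return kept


# Hand-written tokens, assumed to mirror word_tokenize(example_sent).
tokens = ["This", "is", "a", "sample", "sentence", ",", "showing",
          "off", "the", "stop", "words", "filtration", "."]

print(filter_stop_words(tokens, SAMPLE_STOP_WORDS))
# ['sample', 'sentence', 'showing', 'stop', 'words', 'filtration']
```

In practice the same function could be called with word_tokenize(example_sent) and set(stopwords.words("english")) in place of the hand-written inputs. Keeping the stop words in a set makes each membership test a constant-time lookup on average.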