Filtering stop words

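NLTK ships with a default list of words that it treats as stop words. The list is available through the NLTK corpus module (if the corpus has not been downloaded yet, run nltk.download("stopwords") once first):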

from nltk.corpus import stopwords

To inspect the list of stop words stored for English:

stop_words = set(stopwords.words("english"))
print(stop_words)

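An example that uses the stop_words set to remove stop words from a given text (word_tokenize additionally needs NLTK's Punkt tokenizer data to be downloaded):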

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = "This is a sample sentence, showing off the stop words filtration."
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example_sent)
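# Keep only the tokens that are not stop words.
# The comparison is case-sensitive, so "This" is kept; compare w.lower() to drop it too.
filtered_sentence = [w for w in word_tokens if w not in stop_words]

# Equivalent explicit loop, which rebuilds the same list:
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)
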
print(word_tokens)
print(filtered_sentence)
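
The filter above leaves capitalized stop words and punctuation tokens in the result, because the NLTK English list is all lowercase and contains no punctuation. A minimal sketch of a case-insensitive variant follows. The helper name remove_stop_words and the small demo_stop_words set are assumptions made for illustration, not part of NLTK; in practice you would pass set(stopwords.words("english")) and the output of word_tokenize instead of the hand-written stand-ins.

```python
def remove_stop_words(tokens, stop_words):
    """Drop stop words (compared case-insensitively) and punctuation-only tokens."""
    return [
        t for t in tokens
        if t.lower() not in stop_words          # lowercase before the lookup
        and any(ch.isalnum() for ch in t)       # skip tokens such as "," and "."
    ]

# Hypothetical stand-in for set(stopwords.words("english")),
# so the sketch runs without downloading the corpus.
demo_stop_words = {"this", "is", "a", "the", "off"}

# Hand-written tokens standing in for word_tokenize(example_sent).
demo_tokens = ["This", "is", "a", "sample", "sentence", ",", "showing",
               "off", "the", "stop", "words", "filtration", "."]

filtered = remove_stop_words(demo_tokens, demo_stop_words)
print(filtered)
# ['sample', 'sentence', 'showing', 'stop', 'words', 'filtration']
```

Lowercasing only for the lookup keeps the original casing of the tokens that survive, which is usually what you want when the filtered text is shown to a reader.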