Using python-ucto

Ucto is a rule-based, multilingual tokeniser that also performs sentence boundary detection. Although it is written in C++, the python-ucto binding lets you use it from Python.

import ucto

#Set a file to use as tokeniser rules; this one is for English, but other languages are available too:
settingsfile = "/usr/local/etc/ucto/tokconfig-en"

#Initialise the tokeniser; options are passed as keyword arguments, with these defaults:
#   lowercase=False, uppercase=False, sentenceperlineinput=False,
#   sentenceperlineoutput=False,
#   sentencedetection=True, paragraphdetection=True, quotedetection=False,
#   debug=False
tokenizer = ucto.Tokenizer(settingsfile)

#Pass the text to be tokenised:
tokenizer.process("This is a sentence. This is another sentence. More sentences are better!")

#Loop over the detected sentences; each one is returned as a string:
for sentence in tokenizer.sentences():
    print(sentence)
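
Besides whole sentences, the tokeniser can also be iterated token by token. The sketch below shows one way to rebuild sentence strings from individual tokens. It assumes that each token supports str(), isendofsentence() and nospace(), as the python-ucto README describes for ucto.Token. The join_tokens helper and the FakeToken class are hypothetical names introduced here so that the example runs without ucto installed; they are not part of the library.

```python
def join_tokens(tokens):
    """Rebuild sentence strings from an iterable of ucto-style tokens.

    Assumption: every token supports str(), isendofsentence() and nospace().
    """
    sentences = []
    current = ""
    for token in tokens:
        current += str(token)
        if token.isendofsentence():
            # A sentence boundary closes the sentence collected so far.
            sentences.append(current)
            current = ""
        elif not token.nospace():
            # The token was followed by whitespace in the original text.
            current += " "
    if current.strip():
        # Keep trailing tokens that were never closed by a sentence boundary.
        sentences.append(current.strip())
    return sentences


class FakeToken:
    """Hypothetical stand-in for ucto.Token, used only for this demo."""

    def __init__(self, text, nospace=False, eos=False):
        self.text = text
        self._nospace = nospace
        self._eos = eos

    def __str__(self):
        return self.text

    def nospace(self):
        return self._nospace

    def isendofsentence(self):
        return self._eos


demo = [
    FakeToken("Hi", nospace=True),
    FakeToken("!", eos=True),
    FakeToken("Bye", nospace=True),
    FakeToken(".", eos=True),
]
print(join_tokens(demo))
```

With a real tokeniser, the same helper would presumably be called as join_tokens(tokenizer) after tokenizer.process(...), since iterating over the tokeniser yields its tokens.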