Using python-ucto

Ucto is a rule-based, multilingual tokeniser that also performs sentence boundary detection. Although it is written in C++, the python-ucto binding lets you use it from Python.

import ucto 

# Path to the tokeniser rules file. This one is for English; configurations for other languages are available too.
settingsfile = "/usr/local/etc/ucto/tokconfig-en"

# Initialise the tokeniser. Options are passed as keyword arguments; the defaults are:
#   lowercase=False, uppercase=False, sentenceperlineinput=False,
#   sentenceperlineoutput=False,
#   sentencedetection=True, paragraphdetection=True, quotedetection=False,
#   debug=False
tokenizer = ucto.Tokenizer(settingsfile)

# Pass the text to be tokenised.
tokenizer.process("This is a sentence. This is another sentence. More sentences are better!")

# Iterate over the detected sentences and print each one.
for sentence in tokenizer.sentences():
    print(sentence)
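
When several texts need to be tokenised, the process-then-iterate pattern above can be wrapped in a small helper. The following is a minimal sketch, not part of python-ucto: the name collect_sentences is hypothetical, and it assumes only the two calls already shown, tokenizer.process() and tokenizer.sentences().

```python
def collect_sentences(tokenizer, texts):
    """Feed each text to the tokeniser and return all sentences as strings.

    Hypothetical helper: it relies only on tokenizer.process(text) and
    tokenizer.sentences(), the two calls used in the example above.
    """
    collected = []
    for text in texts:
        tokenizer.process(text)
        # Drain the sentences after every process() call instead of
        # assuming how the tokeniser buffers text between calls.
        collected.extend(str(sentence) for sentence in tokenizer.sentences())
    return collected
```

With the tokenizer object created earlier, collect_sentences(tokenizer, ["First text. Second sentence.", "Another text."]) should return one string per detected sentence, in input order. Because the helper depends only on those two methods, it can be exercised with a stand-in object when Ucto is not installed.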