Finding matches in big data sets

Created: November-22, 2018

On big data sets, a plain call to grepl("fox", test_sentences) can be slow. Big data sets here are, for example, crawled websites or millions of tweets.

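The first speed-up is to pass perl = TRUE, which switches to Perl-compatible regular expressions. Faster still is fixed = TRUE, which treats the pattern as a literal string rather than a regular expression, so it only applies when no regex features are needed. A complete example: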

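# example data
test_sentences <- c("The quick brown fox", "jumps over the lazy dog")

# Perl-compatible regular expressions
grepl("fox", test_sentences, perl = TRUE)
#[1]  TRUE FALSE

# literal (non-regex) matching
grepl("fox", test_sentences, fixed = TRUE)
#[1]  TRUE FALSE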
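
To see how the options compare on a given machine, a rough timing sketch along the following lines may help. The data set is synthetic (the two example sentences repeated many times), the name big_sentences and the repeat count are assumptions made for illustration, and the timings will vary with hardware, R version, and the actual text.

```r
# Synthetic data: the two example sentences repeated 100,000 times each.
# This is an assumption for timing purposes only; real text behaves differently.
big_sentences <- rep(c("The quick brown fox", "jumps over the lazy dog"),
                     times = 100000)

# Time each variant and keep its result for comparison.
t_default <- system.time(r_default <- grepl("fox", big_sentences))
t_perl    <- system.time(r_perl    <- grepl("fox", big_sentences, perl = TRUE))
t_fixed   <- system.time(r_fixed   <- grepl("fox", big_sentences, fixed = TRUE))

# All three variants should flag exactly the same elements.
stopifnot(identical(r_default, r_perl), identical(r_default, r_fixed))

# Elapsed seconds per variant (numbers depend on the machine).
elapsed <- c(default = t_default[["elapsed"]],
             perl    = t_perl[["elapsed"]],
             fixed   = t_fixed[["elapsed"]])
print(elapsed)
```

Note that fixed = TRUE cannot be combined with ignore.case = TRUE; R ignores ignore.case in that case and issues a warning.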

In text mining, the text is usually held in a corpus. A corpus cannot be passed to grepl directly. Therefore, consider the following function:

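library(tm)  # tm_index() comes from the tm package

# Return a logical vector marking the documents in `corpus` that match `pattern`
# (case-insensitive, Perl-compatible regular expression).
searchCorpus <- function(corpus, pattern) {
  return(tm_index(corpus, FUN = function(x) {
    grepl(pattern, x, ignore.case = TRUE, perl = TRUE)
  }))
}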
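
A minimal usage sketch, assuming the tm package is installed. The two-document corpus is hypothetical and simply reuses the example sentences; the names corpus, hits, and matching_docs are illustrative and not part of the original. The helper is repeated so the snippet runs on its own.

```r
library(tm)  # assumption: tm is installed; provides VCorpus, VectorSource, tm_index

# Same helper as above, repeated so this snippet is self-contained.
searchCorpus <- function(corpus, pattern) {
  tm_index(corpus, FUN = function(x) {
    grepl(pattern, x, ignore.case = TRUE, perl = TRUE)
  })
}

# Hypothetical corpus: one single-line document per example sentence.
corpus <- VCorpus(VectorSource(c("The quick brown fox",
                                 "jumps over the lazy dog")))

# For these single-line documents, one logical value per document.
hits <- searchCorpus(corpus, "FOX")

# Keep only the documents that matched.
matching_docs <- corpus[hits]
```

Because the helper sets ignore.case = TRUE, the upper-case pattern "FOX" should still match the first document.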