Rule of thumb for the number of partitions

Created: November-22, 2018

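As a rule of thumb, an RDD should have as many partitions as the number of executors multiplied by the number of cores used per executor, multiplied by 3 (or possibly 4). This is only a heuristic, of course; the right number depends on your application, dataset, and cluster configuration.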
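
To make the arithmetic concrete, here is a minimal sketch of the heuristic as a plain Python function. The name `suggested_partitions` and its `factor` parameter are illustrative and not part of Spark; the cluster sizes in the usage lines are made up.

```python
def suggested_partitions(num_executors, cores_per_executor, factor=3):
    """Return executors * cores per executor * factor (the rule of thumb)."""
    if num_executors < 1 or cores_per_executor < 1 or factor < 1:
        raise ValueError("all arguments must be positive integers")
    return num_executors * cores_per_executor * factor


# Hypothetical cluster: 10 executors with 4 cores each.
print(suggested_partitions(10, 4))            # 120
print(suggested_partitions(10, 4, factor=4))  # 160
```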

Example:

In [1]: data = sc.textFile(file)

In [2]: total_cores = int(sc._conf.get('spark.executor.instances')) * int(sc._conf.get('spark.executor.cores'))

In [3]: data = data.coalesce(total_cores * 3)
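
One caveat: `coalesce()` with its default `shuffle=False` can only reduce the number of partitions. If the RDD starts with fewer partitions than the target, `repartition()` (which shuffles) is needed to increase the count. The sketch below chooses between the two based on the current partition count. `resize_to` is a hypothetical helper name; it relies only on the RDD methods `getNumPartitions()`, `coalesce()`, and `repartition()`.

```python
def resize_to(rdd, target):
    """Return rdd with `target` partitions, avoiding a shuffle when shrinking."""
    current = rdd.getNumPartitions()
    if target < current:
        # Shrinking: coalesce merges partitions without a full shuffle.
        return rdd.coalesce(target)
    if target > current:
        # Growing: coalesce alone cannot add partitions, so shuffle.
        return rdd.repartition(target)
    return rdd  # already at the target
```

With the session above, this would be called as `data = resize_to(data, total_cores * 3)`.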