Low-Variance Feature Removal

Created: November-22, 2018

This is a very basic feature selection technique.

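The underlying idea is that if a feature is constant (i.e. it has zero variance), it cannot be used to find any interesting patterns and can be removed from the dataset.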

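Consequently, a heuristic approach to feature elimination is to first remove all features whose variance is below some (low) threshold.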
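The heuristic can be sketched by hand with NumPy. This is only an illustration: `drop_low_variance` is a hypothetical helper and the `toy` array is made-up data, not part of any library.

```python
import numpy as np

def drop_low_variance(X, threshold=0.0):
    """Hypothetical helper: keep only columns whose variance exceeds threshold."""
    X = np.asarray(X, dtype=float)
    variances = X.var(axis=0)      # population variance of each column
    keep = variances > threshold   # boolean mask of columns to retain
    return X[:, keep], keep

# Made-up data: the second column is constant, so its variance is 0
toy = [[1.0, 5.0, 0.2],
       [2.0, 5.0, 0.1],
       [3.0, 5.0, 0.3]]

reduced, kept = drop_low_variance(toy)
print(kept)           # mask marking the constant column as dropped
print(reduced.shape)  # two of the three columns remain
```

With the default threshold of 0, only strictly constant columns are dropped; raising the threshold removes nearly constant columns as well.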

Building on the example in the scikit-learn documentation, suppose we start with

X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]

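There are 3 boolean features here, each with 6 instances. Suppose we wish to remove those that are constant in at least 80% of the instances. A boolean feature is a Bernoulli random variable with variance p(1 - p), so such features have a variance no larger than 0.8 * (1 - 0.8) = 0.16. Consequently, we can use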

from sklearn.feature_selection import VarianceThreshold

# Remove features whose variance falls below 0.8 * (1 - 0.8) = 0.16
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
sel.fit_transform(X)
# Output:
# array([[0, 1],
#        [1, 0],
#        [0, 0],
#        [1, 1],
#        [1, 0],
#        [1, 1]])

Note that the first feature has been removed.
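To see why only the first column goes, the variance of each 0/1 column can be checked against the threshold by hand. The following is a minimal sketch using NumPy alone; the variable names are illustrative.

```python
import numpy as np

X = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0],
              [0, 1, 1], [0, 1, 0], [0, 1, 1]])

p = X.mean(axis=0)            # fraction of ones in each column: 1/6, 4/6, 3/6
bernoulli_var = p * (1 - p)   # variance of a 0/1 feature is p(1 - p)
threshold = 0.8 * (1 - 0.8)   # 0.16

# The closed form agrees with the variance computed directly from the data
print(np.allclose(bernoulli_var, X.var(axis=0)))

# Approximately 0.139, 0.222 and 0.25: only the first is below 0.16
print(bernoulli_var)
print(bernoulli_var > threshold)
```

The first column is 0 in five of six instances (about 83%), which is why it is the only one that falls under the 80% rule.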

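This method should be used with caution, because low variance does not necessarily mean that a feature is uninteresting. Consider the following example, where we construct a dataset with 3 features: the first two contain normally distributed variables and the third contains uniformly distributed variables.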

from sklearn.feature_selection import VarianceThreshold
import numpy as np

# generate dataset
np.random.seed(0)

feat1 = np.random.normal(loc=0, scale=.1, size=100) # normal dist. with mean=0 and std=.1
feat2 = np.random.normal(loc=0, scale=10, size=100) # normal dist. with mean=0 and std=10
feat3 = np.random.uniform(low=0, high=10, size=100) # uniform dist. in the interval [0,10)
data = np.column_stack((feat1,feat2,feat3))

data[:5]
# Output:
# array([[  0.17640523,  18.83150697,   9.61936379],
#        [  0.04001572, -13.47759061,   2.92147527],
#        [  0.0978738 , -12.70484998,   2.4082878 ],
#        [  0.22408932,   9.69396708,   1.00293942],
#        [  0.1867558 , -11.73123405,   0.1642963 ]]) 

np.var(data, axis=0)
# Output: array([  1.01582662e-02,   1.07053580e+02,   9.07187722e+00])

sel = VarianceThreshold(threshold=0.1)
sel.fit_transform(data)[:5]
# Output:
# array([[ 18.83150697,   9.61936379],
#        [-13.47759061,   2.92147527],
#        [-12.70484998,   2.4082878 ],
#        [  9.69396708,   1.00293942],
#        [-11.73123405,   0.1642963 ]])

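The first feature has now been removed because of its low variance, while the third feature (the least interesting one) has been kept. In this case it would be more appropriate to consider the coefficient of variation (the standard deviation divided by the mean), because it is independent of scaling. Note that it is only informative for features whose mean is well away from zero.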
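A minimal sketch of the scale independence follows. It assumes a feature whose mean is clearly nonzero; `coefficient_of_variation` is a hypothetical helper, not a scikit-learn function, and the data is synthetic.

```python
import numpy as np

def coefficient_of_variation(X):
    """Hypothetical helper: per-column standard deviation divided by |mean|."""
    X = np.asarray(X, dtype=float)
    return X.std(axis=0) / np.abs(X.mean(axis=0))

rng = np.random.RandomState(0)
feat = rng.uniform(low=5, high=15, size=100)   # synthetic feature, mean near 10

# The same feature expressed in two different units (e.g. km and m)
both = np.column_stack((feat, 1000 * feat))

variances = np.var(both, axis=0)       # differ by a factor of 1e6
cv = coefficient_of_variation(both)    # identical for both columns

print(variances)
print(cv)
```

A variance threshold would treat the two columns very differently even though they carry the same information, whereas a threshold on the coefficient of variation would treat them alike.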