Removing highly correlated features

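Highly correlated features can increase a model's variance, and dropping one feature from each correlated pair may help reduce it. There are many ways to detect correlation. Here is one: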

library(purrr) # in order to use keep()

# keep only the numeric columns, since cor() requires numeric input
toCorrelate <- mtcars %>% keep(is.numeric)

# calculate correlation matrix
correlationMatrix <- cor(toCorrelate)

# the matrix is symmetric, so zero the upper triangle to count each pair once
correlationMatrix[upper.tri(correlationMatrix)] <- 0

# zero the diagonal so each feature's perfect correlation with itself is ignored
diag(correlationMatrix) <- 0

# flag each feature whose absolute correlation with a later column is at least 0.85
apply(correlationMatrix, 2, function(x) any(abs(x) >= 0.85))

  mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb 
 TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE 

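I would look at what mpg is correlated with and decide what to keep and what to toss. The same goes for cyl and disp. Alternatively, I might need to combine some of the strongly correlated features.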
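To see which features each flagged column is actually paired with, one option is to list the pairs themselves rather than just a TRUE/FALSE per column. The following is a minimal sketch using only base R. The names cm, idx, and highPairs are made up for this illustration, and the 0.85 cutoff is simply carried over from above.

```r
# correlation matrix of mtcars (every column in mtcars is numeric)
cm <- cor(mtcars)

# positions in the lower triangle where |r| >= 0.85;
# lower.tri() excludes the diagonal, so each pair appears exactly once
idx <- which(abs(cm) >= 0.85 & lower.tri(cm), arr.ind = TRUE)

# one row per highly correlated pair, with its correlation coefficient
highPairs <- data.frame(
  feature1 = rownames(cm)[idx[, "row"]],
  feature2 = colnames(cm)[idx[, "col"]],
  r        = cm[idx]
)

# strongest correlations first
highPairs <- highPairs[order(-abs(highPairs$r)), ]
print(highPairs)
```

Each row names a pair, so it is easier to decide which member to drop, or whether the two should be combined into a single feature.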