How to select features for machine learning in R?

Let’s consider three different approaches and how to implement them in the caret package.

  1. By detecting and removing highly correlated features from the dataset.

We need to create a correlation matrix of all the features and then identify the highly correlated ones, usually those with an absolute correlation coefficient greater than 0.75:

library(caret)

corr_matrix <- cor(features)
highly_correlated <- findCorrelation(corr_matrix, cutoff=0.75)
print(highly_correlated)
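findCorrelation returns the column indices that should be dropped. A minimal sketch of the whole step, using the built-in mtcars data as a stand-in for `features` (an assumption for illustration):

```r
library(caret)

# All-numeric built-in data set, used here only as an example
features <- mtcars
corr_matrix <- cor(features)

# Indices of columns flagged as highly correlated (|r| > 0.75)
highly_correlated <- findCorrelation(corr_matrix, cutoff = 0.75)

# Drop the flagged columns to obtain a reduced feature set
features_reduced <- features[, -highly_correlated]
ncol(features_reduced)  # fewer columns than in the original data
</imports>
```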
  2. By ranking the data frame features by their importance.

We need to create a training scheme to control the parameters of train(), use it to build a selected model (here a learning vector quantization model), and then estimate the variable importance for that model:

control <- trainControl(method="repeatedcv", number=10, repeats=5)
model <- train(response_variable~., data=df, method="lvq", preProcess="scale", trControl=control)
importance <- varImp(model)
print(importance)
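A minimal sketch of the same step on the built-in iris data, which is used here purely for illustration; `Species` plays the role of response_variable:

```r
library(caret)
set.seed(7)  # for reproducible resampling

# 10-fold cross-validation repeated 5 times
control <- trainControl(method = "repeatedcv", number = 10, repeats = 5)

# Learning vector quantization model with scaled predictors
model <- train(Species ~ ., data = iris, method = "lvq",
               preProcess = "scale", trControl = control)

# Rank the predictors by their estimated importance
importance <- varImp(model)
print(importance)
plot(importance)  # optional visual comparison of the rankings
```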
  3. By automatically selecting the optimal features.

One of the most popular methods provided by caret for automatically selecting the optimal features is a backward selection algorithm called Recursive Feature Elimination (RFE).

We need to build the control object from a selected resampling method and a predefined list of functions, apply the RFE algorithm, passing it the features, the target variable, the candidate subset sizes to evaluate, and the control, and then extract the selected predictors:

control <- rfeControl(functions=caretFuncs, method="cv", number=10)
results <- rfe(features, target_variable, sizes=c(1:8), rfeControl=control)
print(predictors(results))
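A minimal runnable sketch of this step on the built-in iris data (an illustrative assumption; any feature matrix and target vector work). With caretFuncs, the inner model is chosen by passing a `method` through to train(); "lvq" is used here to mirror the earlier example:

```r
library(caret)
set.seed(7)  # for reproducible resampling

# RFE driven by caret models, evaluated with 10-fold cross-validation
control <- rfeControl(functions = caretFuncs, method = "cv", number = 10)

# Evaluate subsets of 1 to 3 of the four iris predictors
results <- rfe(iris[, 1:4], iris[, 5], sizes = 1:3,
               rfeControl = control, method = "lvq")

print(results)              # accuracy for each subset size
print(predictors(results))  # names of the selected features
```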
