Below are several approaches and how to implement them with the caret package in R.
- Data splitting: the entire dataset is split into a training set and a test set. The first is used to fit the model, the second to test its performance on unseen data. This approach works particularly well on large datasets. To implement data splitting in R, we need to use the createDataPartition() function and set its p parameter to the proportion of data that goes to training (see the first sketch at the end of this section).
- Bootstrap resampling: random samples are drawn from the dataset with replacement and the model is estimated on them; this resampling is repeated many times. To implement bootstrap resampling in R, we need to set the method parameter of the trainControl() function to "boot" when defining the training control of the model (see the second sketch at the end of this section).
- Cross-validation methods:
  - k-fold cross-validation: the dataset is split into k subsets. The model is trained on k-1 subsets and tested on the remaining one. The same process is repeated until each subset has served as the test set once, and the final model accuracy is estimated as the average across the k folds.
  - Repeated k-fold cross-validation: the principle is the same as for k-fold cross-validation, except that the split into k subsets is repeated more than once. The model accuracy is estimated for each repetition, and the final model accuracy is calculated as the average of the accuracy values across all repetitions.
  - Leave-one-out cross-validation (LOOCV): one observation is set aside and the model is trained on all the other observations. The same process is repeated until every observation has been left out once.
To implement these cross-validation methods in R, we need to set the method parameter of the trainControl() function to "cv", "repeatedcv", or "LOOCV", respectively, when defining the training control of the model (see the last sketch at the end of this section).
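
For concreteness, here is a minimal sketch of data splitting with createDataPartition(). The built-in iris dataset, the 80/20 split, and the rpart decision tree are illustrative assumptions, not choices made in the text above.

```r
library(caret)

set.seed(123)                                     # for a reproducible split
train_index <- createDataPartition(iris$Species,  # stratify on the outcome
                                   p = 0.8,       # 80% of rows go to training
                                   list = FALSE)
train_data <- iris[train_index, ]
test_data  <- iris[-train_index, ]

# Fit on the training set, then check performance on the held-out test set
model <- train(Species ~ ., data = train_data, method = "rpart")
predictions <- predict(model, newdata = test_data)
confusionMatrix(predictions, test_data$Species)
```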
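
A similar sketch for bootstrap resampling; the number of resamples (25, caret's default for "boot") and the iris/rpart pairing are again illustrative assumptions.

```r
library(caret)

set.seed(123)
boot_ctrl <- trainControl(method = "boot", number = 25)  # 25 bootstrap resamples

# The accuracy reported by print(model) is averaged over the bootstrap resamples
model <- train(Species ~ ., data = iris, method = "rpart", trControl = boot_ctrl)
print(model)
```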
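
Finally, a sketch of the three cross-validation settings. The fold count (10), the number of repeats (3), and the iris/rpart pairing are illustrative assumptions; only the method values come from the text above.

```r
library(caret)

set.seed(123)

# 10-fold cross-validation
cv_ctrl       <- trainControl(method = "cv", number = 10)

# 10-fold cross-validation repeated 3 times
repeated_ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

# Leave-one-out cross-validation (one fold per observation)
loocv_ctrl    <- trainControl(method = "LOOCV")

# Swap in repeated_ctrl or loocv_ctrl to compare the resampling schemes
model_cv <- train(Species ~ ., data = iris, method = "rpart", trControl = cv_ctrl)
print(model_cv)
```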