The goal of this tutorial is to provide solved examples of
model-fitting with tidymodels to help you solidify your
learning from the R tutorials.
The R code chunk below will load the
tidymodels and discrim packages as well as the
mobile_carrier_df data set.
library(tidymodels)
library(discrim)
mobile_carrier_df <- 
  readRDS(url('https://gmubusinessanalytics.netlify.app/data/mobile_carrier_data.rds'))
 
The mobile_carrier_df data frame contains information on
U.S. customers for a national mobile service carrier.
Each row represents a customer who did or did not cancel their
service. The response variable in this data set is named
canceled_plan and has levels of ‘yes’ or ‘no’. The
predictor variables in this data frame contain information about the
customers’ residence region and mobile call activity.
Our goal is to predict canceled_plan with various
machine learning algorithms including logistic regression, LDA, and
KNN.
mobile_carrier_df

| canceled_plan <fct> | us_state_region <fct> | international_plan <fct> | voice_mail_plan <fct> |
|---|---|---|---|
| yes | East North Central | no | no |
| no | South Atlantic | no | yes |
| yes | New England | no | no |
| no | South Atlantic | no | no |
| no | Mountain | no | yes |
| yes | East North Central | no | no |
| no | East South Central | no | yes |
| yes | Mid Atlantic | yes | no |
| no | Mountain | yes | yes |
| no | South Atlantic | no | no |
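The table above shows only the first few rows and columns. To view every column in mobile_carrier_df along with its data type, you can print the tibble or use glimpse(), which is loaded with tidymodels:

# View all columns of mobile_carrier_df and their types
glimpse(mobile_carrier_df)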
 
# Split the data into training (75%) and test (25%) sets,
# stratified by the response variable
set.seed(271)
mobile_split <- initial_split(mobile_carrier_df, prop = 0.75,
                              strata = canceled_plan)

mobile_training <- mobile_split %>% 
                   training()

mobile_test <- mobile_split %>%
               testing()

# Create cross validation folds for hyperparameter tuning
set.seed(271)
mobile_folds <- vfold_cv(mobile_training, v = 5)
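As a quick sanity check, you can confirm the size of each partition and inspect the resamples before moving on (output not shown):

# Confirm the number of rows in the training and test sets
nrow(mobile_training)
nrow(mobile_test)

# View the 5 cross validation folds
mobile_folds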
 
We create a feature engineering pipeline, mobile_recipe,
with the following transformations:

- a Yeo-Johnson transformation of all numeric predictors
- normalization (centering and scaling) of all numeric predictors
- dummy variable encoding of all nominal predictors
mobile_recipe <- recipe(canceled_plan ~ ., data = mobile_training) %>% 
                 step_YeoJohnson(all_numeric(), -all_outcomes()) %>% 
                 step_normalize(all_numeric(), -all_outcomes()) %>% 
                 step_dummy(all_nominal(), -all_outcomes())
 
It’s always good practice to check your transformations by applying them to the training data.
mobile_recipe %>% 
  prep(training = mobile_training) %>% 
  bake(new_data = NULL)

| number_vmail_messages <dbl> | total_day_minutes <dbl> | total_day_calls <dbl> | total_eve_minutes <dbl> |
|---|---|---|---|
| -0.5737528 | -0.7961420163 | 0.31432134 | 1.0665306242 |
| 1.7541177 | 0.7390979014 | -0.75175966 | 0.9165627874 |
| 1.7440365 | 0.6230319393 | 0.81394217 | -0.9326572335 |
| -0.5737528 | -0.0413027110 | 0.01887266 | 0.1924920726 |
| -0.5737528 | -0.2494271001 | -0.65687950 | 0.2773575238 |
| -0.5737528 | 0.5910140026 | 0.56303561 | 0.4565155630 |
| -0.5737528 | -0.3306291677 | -1.44883535 | 1.5162521376 |
| -0.5737528 | -0.9141890845 | 0.11698262 | 1.5065922239 |
| 1.7681880 | -0.3687107675 | -1.49433811 | 1.3537800818 |
| -0.5737528 | -1.2574122978 | 1.11781889 | 1.5143202655 |

(First 10 rows and a selection of the transformed numeric columns shown.)
 
Next, we specify a logistic regression model using the appropriate
parsnip function.
logistic_model <- logistic_reg() %>% 
                  set_engine('glm') %>% 
                  set_mode('classification')
 
Next, we combine our model and recipe into a single workflow,
logistic_wf.
logistic_wf <- workflow() %>% 
               add_model(logistic_model) %>% 
               add_recipe(mobile_recipe)
 
Next, we fit our workflow using the last_fit() function.
This will train our model on the training data and calculate predictions
on the test data.
logistic_fit <- logistic_wf %>% 
                last_fit(split = mobile_split)
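Note that last_fit() also computes default performance metrics on the test data (typically accuracy and the area under the ROC curve for classification). If those are all you need, collect_metrics() retrieves them directly:

# View the default test set metrics from last_fit()
logistic_fit %>% 
  collect_metrics()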
 
The collect_predictions() function creates a data frame of test
results.
logistic_results <-  logistic_fit %>% 
                     collect_predictions()
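The results contain the estimated class probabilities, .pred_yes and .pred_no, the predicted class, .pred_class, and the true value of canceled_plan. A quick way to preview the key columns (the row count of 5 is arbitrary):

# Preview the prediction columns for a few test rows
logistic_results %>% 
  select(.pred_yes, .pred_no, .pred_class, canceled_plan) %>% 
  slice_head(n = 5)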
 
Next we calculate the ROC Curve, area under the ROC curve, and the confusion matrix on the test data.
 
## ROC Curve
roc_curve(logistic_results, 
          truth = canceled_plan, 
          .pred_yes) %>% 
  autoplot()
# ROC AUC
roc_auc(logistic_results, 
        truth = canceled_plan,
        .pred_yes)

| .metric <chr> | .estimator <chr> | .estimate <dbl> |
|---|---|---|
| roc_auc | binary | 0.8050488 |
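If you prefer to compute several metrics in one call, yardstick's metric_set() bundles them into a single function. A sketch with an illustrative mix of probability and class metrics (mobile_metrics is just an example name):

# Combine multiple metrics into a single function
mobile_metrics <- metric_set(roc_auc, accuracy, sens, spec)

# Probability columns are passed through ...,
# class predictions through the estimate argument
mobile_metrics(logistic_results, 
               truth = canceled_plan, 
               estimate = .pred_class, 
               .pred_yes)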
# Confusion Matrix
conf_mat(logistic_results, 
         truth = canceled_plan, 
         estimate = .pred_class)

          Truth
Prediction yes  no
       yes  90  34
       no   87 359
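The conf_mat object also has a summary() method that derives accuracy, sensitivity, specificity, and several other metrics from the cell counts:

# Derive summary metrics from the confusion matrix
conf_mat(logistic_results, 
         truth = canceled_plan, 
         estimate = .pred_class) %>% 
  summary()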
 
 
 
In this section we will modify the steps from above to fit an LDA
model to the mobile_carrier_df data. We have already
created our training and test sets, our cross validation folds, and our
feature engineering recipe.
To fit an LDA model, we must specify an LDA object with
discrim_regularized(), create an LDA workflow, and fit our
model with last_fit().
lda_model <- discrim_regularized(frac_common_cov = 1) %>% 
             set_engine('klaR') %>% 
             set_mode('classification')

lda_wf <- workflow() %>% 
          add_model(lda_model) %>% 
          add_recipe(mobile_recipe)

lda_fit <- lda_wf %>% 
           last_fit(split = mobile_split)
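As an aside, discrim_regularized() spans a family of discriminant models: frac_common_cov = 1 pools one covariance matrix across classes (LDA), while frac_common_cov = 0 estimates a separate covariance matrix per class (QDA). A sketch of the QDA variant, should you wish to compare (qda_model is an illustrative name):

# QDA: per-class covariance matrices
# frac_identity = 0 keeps the full covariance structure
qda_model <- discrim_regularized(frac_common_cov = 0, frac_identity = 0) %>% 
             set_engine('klaR') %>% 
             set_mode('classification')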
 
Use the collect_predictions() function to create a data
frame of test results.
lda_results <-  lda_fit %>% 
                collect_predictions()

Calculate the ROC Curve, area under the ROC curve, and the confusion matrix on the test data. You should get the results below.
 
## ROC Curve
roc_curve(lda_results, 
          truth = canceled_plan, 
          .pred_yes) %>% 
  autoplot()
# ROC AUC
roc_auc(lda_results, 
        truth = canceled_plan, 
        .pred_yes)

| .metric <chr> | .estimator <chr> | .estimate <dbl> |
|---|---|---|
| roc_auc | binary | 0.810181 |
# Confusion Matrix
conf_mat(lda_results, 
         truth = canceled_plan, 
         estimate = .pred_class)

          Truth
Prediction yes  no
       yes  87  29
       no   90 364
 
In this section we will modify the steps from above to fit a KNN
model to the mobile_carrier_df data.
To fit a KNN model, we must specify a KNN model object with
nearest_neighbor(), create a KNN workflow, tune our
hyperparameter, neighbors, and fit our model with
last_fit().
knn_model <- nearest_neighbor(neighbors = tune()) %>% 
             set_engine('kknn') %>% 
             set_mode('classification')

knn_wf <- workflow() %>% 
          add_model(knn_model) %>% 
          add_recipe(mobile_recipe)

Next, we create a grid of the following values of
neighbors: 10, 15, 25, 45, 60, 80, 100, 120, 140, and
180.
## Create a grid of hyperparameter values to test
k_grid <- tibble(neighbors = c(10, 15, 25, 45, 60, 80, 100, 120, 140, 180))
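An alternative to listing values by hand is to let the dials package generate the grid. For example, grid_regular() with the neighbors() parameter object produces evenly spaced values over a range (a sketch; these values will not exactly match the hand-picked grid above, and k_grid_auto is an illustrative name):

## Generate 10 evenly spaced values of neighbors between 10 and 180
k_grid_auto <- grid_regular(neighbors(range = c(10, 180)), levels = 10)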
 
Next, we use tune_grid() to perform hyperparameter
tuning using k_grid and mobile_folds.
## Tune  workflow
set.seed(314)
knn_tuning <- knn_wf %>% 
              tune_grid(resamples = mobile_folds,
                        grid = k_grid)

The select_best() function selects the best model from
our tuning results based on the area under the ROC curve.
## Select best model based on roc_auc
best_k <- knn_tuning %>% 
          select_best(metric = 'roc_auc')
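To see how the candidate values compare before committing to one, show_best() displays the top-performing values of neighbors along with their resampled metric estimates:

## View the top 5 models ranked by area under the ROC curve
knn_tuning %>% 
  show_best(metric = 'roc_auc', n = 5)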
 
The last step is to use finalize_workflow() to add our
optimal model to our workflow object.
## Finalize workflow by adding the best performing model
final_knn_wf <- knn_wf %>% 
                finalize_workflow(best_k)
 
knn_fit <- final_knn_wf %>% 
           last_fit(split = mobile_split)
 
Use the collect_predictions() function to create a data
frame of test results.
knn_results <-  knn_fit %>% 
                collect_predictions()

Calculate the ROC Curve, area under the ROC curve, and the confusion matrix on the test data. You should get the results below.
## ROC Curve
roc_curve(knn_results, 
          truth = canceled_plan, 
          .pred_yes) %>% 
  autoplot()
# ROC AUC
roc_auc(knn_results, 
        truth = canceled_plan, 
        .pred_yes)

| .metric <chr> | .estimator <chr> | .estimate <dbl> |
|---|---|---|
| roc_auc | binary | 0.8272164 |
# Confusion Matrix
conf_mat(knn_results, 
         truth = canceled_plan, 
         estimate = .pred_class)

          Truth
Prediction yes  no
       yes   9   0
       no  168 393
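To wrap up, it can be helpful to compare the three models on a single plot. A minimal sketch that stacks the ROC curve data frames with a model label and plots them with ggplot2 (loaded with tidymodels); all_roc and the model labels are illustrative:

## Compare ROC curves across the three models
all_roc <- bind_rows('Logistic' = roc_curve(logistic_results, 
                                            truth = canceled_plan, .pred_yes),
                     'LDA' = roc_curve(lda_results, 
                                       truth = canceled_plan, .pred_yes),
                     'KNN' = roc_curve(knn_results, 
                                       truth = canceled_plan, .pred_yes),
                     .id = 'model')

ggplot(all_roc, aes(x = 1 - specificity, y = sensitivity, color = model)) +
  geom_path() +
  geom_abline(linetype = 'dashed') +
  coord_equal()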