In this tutorial, we will learn about classification with discriminant analysis and the K-nearest neighbor (KNN) algorithm. KNN can be used for both regression and classification and will serve as our first example of hyperparameter tuning. We will use two data sets to demonstrate the algorithms in this lesson, churn_df and home_sales.
The code below will load the required packages and data sets for this tutorial. We will need a new package for this lesson, discrim. This package is part of tidymodels and serves as a general interface to discriminant analysis algorithms in R. When installing discrim, you will also need to install the klaR package.
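If either package is missing from your machine, a one-time install along these lines should work (package names as given above):
# One-time installation, if needed
install.packages(c('discrim', 'klaR'))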
library(tidymodels)
library(discrim) # for discriminant analysis
# Telecommunications customer churn data
churn_df <- readRDS(url('https://gmubusinessanalytics.netlify.app/data/churn_data.rds'))
# Seattle home sales
home_sales <- readRDS(url('https://gmubusinessanalytics.netlify.app/data/home_sales.rds')) %>%
select(-selling_date)
We will be working with the churn_df and home_sales data frames in this lesson. Take a moment to explore these data sets below.
A row in this data frame represents a customer at a telecommunications company. Each customer has purchased phone and internet services from this company. The outcome variable in this data is churn, which indicates whether the customer terminated their services.
A row in this data frame represents a home that was sold in the Seattle area between 2014 and 2015. The outcome variable in this data is selling_price.
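If you want a quick structural overview of each data set, one option is dplyr's glimpse() function (dplyr is attached by tidymodels), which prints the column types and first few values:
# Quick look at the columns in each data frame
glimpse(churn_df)
glimpse(home_sales)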
Linear discriminant analysis (LDA) is a classification algorithm where the set of predictor variables is assumed to follow a multivariate normal distribution with a common covariance matrix. As we saw in our lecture, this algorithm produces a linear decision boundary.
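To build intuition for what the algorithm computes, below is a rough from-scratch sketch of the LDA decision rule for this data, using the two numeric predictors we plot later in this lesson (monthly_charges and late_payments). This is for illustration only and is not how the discrim package fits the model:
# Illustrative only: LDA score delta_k(x) = x' S^-1 mu_k - 0.5 mu_k' S^-1 mu_k + log(pi_k)
X <- as.matrix(churn_df[, c('monthly_charges', 'late_payments')])
y <- churn_df$churn
by_class <- split(as.data.frame(X), y)
mu <- lapply(by_class, colMeans)   # class mean vectors
pi_k <- prop.table(table(y))       # class prior proportions
# Pooled covariance matrix -- the 'common covariance' assumption of LDA
S <- Reduce(`+`, lapply(by_class, function(d) cov(d) * (nrow(d) - 1))) /
  (nrow(X) - length(by_class))
S_inv <- solve(S)
# Linear score for each class; each observation is assigned the class with the largest score
delta <- sapply(names(mu), function(k) {
  drop(X %*% S_inv %*% mu[[k]]) - 0.5 * drop(t(mu[[k]]) %*% S_inv %*% mu[[k]]) + log(pi_k[[k]])
})
head(names(mu)[max.col(delta)])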
Both linear and quadratic discriminant analysis can be specified with the discrim_regularized() function from the discrim package.
Let’s use LDA to predict whether customers will cancel their telecommunications service in the churn_df data frame. We will follow our standard machine learning workflow that was introduced in the logistic regression tutorial.
In this case, our event of interest is churn == 'yes'. This is what we would like to map to the positive class when calculating our performance metrics.
The code below shows that yes is mapped to the first level of the churn variable. Since tidymodels maps the first level to the positive class in all performance metric functions, we do not need to recode the levels of this variable.
levels(churn_df$churn)
[1] "yes" "no"
Now we can proceed to split our data with initial_split().
set.seed(314) # Remember to always set your seed. Any integer will work
churn_split <- initial_split(churn_df, prop = 0.75,
strata = churn)
churn_training <- churn_split %>% training()
churn_test <- churn_split %>% testing()
We have a mixture of numeric and factor predictor variables in our data. We will use step_dummy() to convert all factor variables to numeric indicator variables.
It is also standard practice to center and scale our numeric predictors. If needed, we can also adjust for skewness with step_YeoJohnson().
Let’s make histograms of our numeric predictors to see if this is needed. We have skewness present in late_payments.
ggplot(data = churn_df, mapping = aes(x = monthly_charges)) +
geom_histogram(fill = '#006EA1', color = 'white', bins = 20) +
labs(title = 'Distribution of Monthly Charges') +
theme_light()
ggplot(data = churn_df, mapping = aes(x = late_payments)) +
geom_histogram(fill = '#006EA1', color = 'white', bins = 20) +
labs(title = 'Distribution of Late Payments') +
theme_light()
Now we can create a feature engineering recipe for this data. We will train the following transformations on our training data.
churn_recipe <- recipe(churn ~ ., data = churn_training) %>%
step_YeoJohnson(all_numeric(), -all_outcomes()) %>%
step_normalize(all_numeric(), -all_outcomes()) %>%
step_dummy(all_nominal(), -all_outcomes())
Let’s check to see if the feature engineering steps have been carried out correctly.
churn_recipe %>%
prep(training = churn_training) %>%
bake(new_data = NULL)
Linear discriminant analysis is specified with the discrim_regularized() function. The optional frac_common_cov argument is used to specify an LDA or QDA model.
For LDA, we set frac_common_cov = 1. This instructs discrim_regularized() that we are assuming that each class in the response variable shares a common covariance matrix. This is the core assumption of the LDA model.
For QDA, we set frac_common_cov = 0, indicating that each class within our response variable has its own class-specific covariance matrix.
lda_model <- discrim_regularized(frac_common_cov = 1) %>%
set_engine('klaR') %>%
set_mode('classification')
Next we create a workflow that combines our feature engineering steps and LDA model.
lda_wf <- workflow() %>%
add_model(lda_model) %>%
add_recipe(churn_recipe)
last_fit()
Finally, we will train our model and estimate performance on our test data set using the last_fit() function.
last_fit_lda <- lda_wf %>%
last_fit(split = churn_split)
To obtain the metrics on the test set (accuracy and roc_auc by default), we use collect_metrics(). Based on the area under the ROC curve, our model has a “B”.
last_fit_lda %>% collect_metrics()
We can also obtain a data frame with test set results by using the collect_predictions() function. In the code below, we call this lda_predictions. It contains the estimated probabilities of customers canceling their service, .pred_yes, the predicted class of our response variable, .pred_class, and the truth, churn.
lda_predictions <- last_fit_lda %>%
collect_predictions()
lda_predictions
We can use this data frame to make an ROC plot by using roc_curve() and autoplot().
lda_predictions %>%
roc_curve(truth = churn, .pred_yes) %>%
autoplot()
We can also use the lda_predictions results to explore the confusion matrix and other performance metrics, such as the F1 score, on our test data.
We see that our model made 33 false negatives and 30 false positives. In this case, predicting that a customer will not cancel their service when in fact they do seems like the more costly error.
conf_mat(lda_predictions, truth = churn, estimate = .pred_class)
          Truth
Prediction yes  no
       yes 101  30
       no   33 137
f_meas(lda_predictions, truth = churn, estimate = .pred_class)
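As an optional extra, yardstick's metric_set() bundles several metrics into one call; a small sketch computing sensitivity, specificity, and the F1 score together:
# Compute several classification metrics in a single call
churn_metrics <- metric_set(sens, spec, f_meas)
churn_metrics(lda_predictions, truth = churn, estimate = .pred_class)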
To fit a quadratic discriminant analysis model, we will have to make some minor adjustments to our workflow from the previous section. We have already split our data and trained our feature engineering steps, so we only need to create a new QDA model specification with discrim_regularized() and a new workflow object.
For QDA, we set frac_common_cov = 0, indicating that each class within our response variable has its own class-specific covariance matrix.
qda_model <- discrim_regularized(frac_common_cov = 0) %>%
set_engine('klaR') %>%
set_mode('classification')
Next we create a QDA workflow object that combines our feature engineering steps from the previous section and our QDA model.
qda_wf <- workflow() %>%
add_model(qda_model) %>%
add_recipe(churn_recipe)
last_fit()
Finally, we will train our model and estimate performance on our test data set using the last_fit() function.
last_fit_qda <- qda_wf %>%
last_fit(split = churn_split)
Based on the area under the ROC curve, our QDA model had similar performance to our LDA model. Remember that QDA is a more complex model than LDA since we are estimating more parameters.
Since we did not gain any improvement in model performance, it is recommended to choose the simpler model when deciding which one to use in production.
last_fit_qda %>% collect_metrics()
In this section we will learn how to perform regression and classification using the k-nearest neighbor (KNN) algorithm and hyperparameter tuning with cross validation.
We will use KNN to predict whether customers will cancel their service in our churn_df data. To do this, we have to adjust our machine learning workflow by incorporating hyperparameter tuning.
Because we will perform hyperparameter tuning, we need the extra step of creating cross validation folds from our training data. This is done with the vfold_cv() function.
In the code below, we further split our churn_training data into folds with vfold_cv(). The v parameter specifies how many folds to create. In our example below, we are creating 5 folds.
### Create folds for cross validation on the training data set
## These will be used to tune model hyperparameters
set.seed(314)
churn_folds <- vfold_cv(churn_training, v = 5)
Since we have already trained our feature engineering steps, we can use the churn_recipe in our KNN modeling.
The nearest_neighbor() function from the parsnip package serves as a general interface to KNN modeling engines in R. It has the following important hyperparameter:
neighbors - A single integer for the number of neighbors to consider (often called K). For the “kknn” engine, a value of 5 is used if neighbors is not specified.
To determine the optimal value of neighbors, we need to perform hyperparameter tuning. Whenever we have a hyperparameter we would like to tune, we must set it equal to tune() in our model specification. In the code below, we specify our KNN classification model with the “kknn” engine.
knn_model <- nearest_neighbor(neighbors = tune()) %>%
set_engine('kknn') %>%
set_mode('classification')
As before, the next step is to create a workflow with our recipe and model.
knn_wf <- workflow() %>%
add_model(knn_model) %>%
add_recipe(churn_recipe)
Hyperparameter tuning is performed using a grid search algorithm. To do this, we must create a data frame with a column name that matches our hyperparameter, neighbors in this case, and the values we wish to test.
In the code below, we use the tibble() function to create a data frame with values of neighbors ranging from 10 to 150. Our goal here is to choose a range of values to test, from small to relatively large numbers of neighbors.
## Create a grid of hyperparameter values to test
k_grid <- tibble(neighbors = c(10, 20, 30, 50, 75, 100, 125, 150))
# View grid
k_grid
Now that we have a data frame with the values of neighbors to test, we can use the tune_grid() function to determine the optimal value of our hyperparameter.
The tune_grid() function takes a model or workflow object, cross validation folds, and a tuning grid as arguments. It is recommended to use set.seed() before hyperparameter tuning so that you can reproduce your results at a later time.
## Tune workflow
set.seed(314)
knn_tuning <- knn_wf %>%
tune_grid(resamples = churn_folds,
grid = k_grid)
To view the results of our hyperparameter tuning, we can use the show_best() function. We must pass the type of performance metric we would like to see into the show_best() function.
From the results below, we see that for each value of neighbors we specified, tune_grid() fit a KNN model with that parameter value 5 times (since we have 5 folds in our cross validation object). The mean column in the results below indicates the average value of the performance metric that was obtained.
## Show the top 5 best models based on roc_auc metric
knn_tuning %>% show_best(metric = 'roc_auc')
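If you would like to see every candidate value of neighbors rather than just the top five, collect_metrics() returns the full set of resampling results, and autoplot() charts them (an optional sketch):
## Optional: all tuning results, and a plot of performance across neighbor values
knn_tuning %>% collect_metrics()
knn_tuning %>% autoplot()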
We can use the select_best() function to select the model from our tuning results that had the best overall performance. In the code below, we specify to select the best performing model based on the roc_auc metric. We see that the model with 150 neighbors performed the best.
## Select best model based on roc_auc
best_k <- knn_tuning %>%
select_best(metric = 'roc_auc')
## View model
best_k
The last step is to use finalize_workflow() to add our optimal model to our workflow object.
## Finalize workflow by adding the best performing model
final_knn_wf <- knn_wf %>%
finalize_workflow(best_k)
last_fit()
After we have tuned our hyperparameter, neighbors, and finalized our workflow object with the optimal model, we perform the same last steps as before. We will train our model and estimate performance on our test data set using the last_fit() function.
last_fit_knn <- final_knn_wf %>%
last_fit(split = churn_split)
Based on area under the ROC curve, our model has a “B-” and performs slightly worse than LDA and QDA.
last_fit_knn %>% collect_metrics()
Let’s also have a look at our ROC curve and confusion matrix.
knn_predictions <- last_fit_knn %>%
collect_predictions()
knn_predictions
We can use this data frame to make an ROC plot by using roc_curve() and autoplot().
knn_predictions %>%
roc_curve(truth = churn, .pred_yes) %>%
autoplot()
We see that our model made 40 false negatives and 29 false positives.
conf_mat(knn_predictions, truth = churn, estimate = .pred_class)
          Truth
Prediction yes  no
       yes  94  29
       no   40 138
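For a direct comparison with the LDA results above, we can also compute the F1 score on the KNN test set predictions:
f_meas(knn_predictions, truth = churn, estimate = .pred_class)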
In this final section, we will demonstrate a complete machine learning workflow that incorporates hyperparameter tuning. We will use KNN to predict the selling price of homes using the home_sales data.
First we split our data into training and test sets. We also create 5 cross validation folds from our training data for hyperparameter tuning.
set.seed(271)
# Create a split object
homes_split <- initial_split(home_sales, prop = 0.75,
strata = selling_price)
# Build training data set
homes_training <- homes_split %>%
training()
# Build testing data set
homes_test <- homes_split %>%
testing()
## Cross Validation folds
homes_folds <- vfold_cv(homes_training, v = 5)
Next, we specify our feature engineering recipe. In this step, we do not use prep() or bake(). This recipe will be automatically applied in a later step using the workflow() and last_fit() functions.
For our model formula, we are specifying that selling_price is our response variable and all others are predictor variables.
homes_recipe <- recipe(selling_price ~ ., data = homes_training) %>%
step_YeoJohnson(all_numeric(), -all_outcomes()) %>%
step_normalize(all_numeric(), -all_outcomes()) %>%
step_dummy(all_nominal(), -all_outcomes())
As an intermediate step, let’s check our recipe by prepping it on the training data and applying it to the test data. We want to make sure that we get the correct transformations.
From the results below, things look correct.
homes_recipe %>%
prep(training = homes_training) %>%
bake(new_data = homes_test)
Next, we specify our KNN regression model with nearest_neighbor(). We set neighbors to tune() for hyperparameter tuning and make sure to set the mode to regression.
knn_reg <- nearest_neighbor(neighbors = tune()) %>%
set_engine('kknn') %>%
set_mode('regression')
Next, we combine our model and recipe into a workflow object.
knn_reg_wf <- workflow() %>%
add_model(knn_reg) %>%
add_recipe(homes_recipe)
Let’s test the same values of neighbors as before.
## Create a grid of hyperparameter values to test
k_grid_reg <- tibble(neighbors = c(10, 20, 30, 50, 75, 100, 125, 150))
# View grid
k_grid_reg
Now that we have a data frame with the values of neighbors to test, we can use the tune_grid() function to determine the optimal value of our hyperparameter.
The tune_grid() function takes a model or workflow object, cross validation folds, and a tuning grid as arguments. It is recommended to use set.seed() before hyperparameter tuning so that you can reproduce your results at a later time.
## Tune workflow
set.seed(314)
knn_reg_tuning <- knn_reg_wf %>%
tune_grid(resamples = homes_folds,
grid = k_grid_reg)
Since we are fitting a regression model, the performance metrics of interest include rsq and rmse. Let’s use show_best() to display the best performing models based on R².
## Show the top 5 best models based on rsq metric
knn_reg_tuning %>% show_best(metric = 'rsq')
We can use the select_best() function to select the model from our tuning results that had the best overall performance. In the code below, we specify to select the best performing model based on the rsq metric. We see that the model with 20 neighbors performed the best.
## Select best model based on rsq
best_k_reg <- knn_reg_tuning %>%
select_best(metric = 'rsq')
## View model
best_k_reg
The last step is to use finalize_workflow() to add our optimal model to our workflow object.
## Finalize workflow by adding the best performing model
final_knn_reg_wf <- knn_reg_wf %>%
finalize_workflow(best_k_reg)
last_fit()
Finally, we process our machine learning workflow with last_fit().
homes_knn_fit <- final_knn_reg_wf %>%
last_fit(split = homes_split)
To obtain the performance metrics and predictions on the test set, we use the collect_metrics() and collect_predictions() functions on our homes_knn_fit object.
# Obtain performance metrics on test data
homes_knn_fit %>% collect_metrics()
We can save the test set predictions by using the collect_predictions() function. This function returns a data frame with the response variable values from the test set and a column named .pred containing the model predictions.
# Obtain test set predictions data frame
homes_knn_results <- homes_knn_fit %>%
collect_predictions()
# View results
homes_knn_results
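As a quick sanity check, the same test set metrics can also be computed directly from this data frame with yardstick's rmse() and rsq():
# RMSE and R-squared computed from the predictions data frame
homes_knn_results %>% rmse(truth = selling_price, estimate = .pred)
homes_knn_results %>% rsq(truth = selling_price, estimate = .pred)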
Finally, let’s use the homes_knn_results data frame to make an R² plot to visualize our model performance on the test data set. The coord_obs_pred() function sets the x and y axis scales to be identical, and the scale_*_continuous() functions convert the axis labels to comma format. This will make the plot easier to interpret.
ggplot(data = homes_knn_results,
mapping = aes(x = .pred, y = selling_price)) +
geom_point(alpha = 0.25) +
geom_abline(intercept = 0, slope = 1, color = 'red', linetype = 2) +
coord_obs_pred() +
scale_y_continuous(labels = scales::comma) +
scale_x_continuous(labels = scales::comma) +
labs(title = 'KNN Regression Results - Home Sales Test Set',
x = 'Predicted Selling Price',
y = 'Actual Selling Price')