In this tutorial, we will learn about classification and logistic regression with tidymodels. We will be using the heart_disease dataset to demonstrate the concepts in this tutorial.
The heart_disease data contains demographics and outcomes of various medical tests for patients in a heart disease study. The variable of interest in this data is heart_disease, and it indicates whether a patient was diagnosed with heart disease (Yes or No).
The code below will load the required packages and data sets for this tutorial.
library(tidymodels)
library(vip)
# Heart disease data
heart_df <- readRDS(url('https://gmubusinessanalytics.netlify.app/data/heart_disease.rds')) %>%
select(heart_disease, age, chest_pain, max_heart_rate, resting_blood_pressure)
# View heart disease data
heart_df
In logistic regression, we are estimating the probability that our Bernoulli response variable is equal to the Positive class.
In classification, the event we are interested in predicting, such as heart_disease = 'Yes' in our heart disease example, is known as the Positive event, whereas the remaining event, heart_disease = 'No', is the Negative event.
In our course tutorials, we will follow the model fitting process that is expected to be followed on the course analytics project. When fitting a classification model, whether logistic regression or a different type of algorithm, we will take the following steps:
- Split the data into training and test sets
- Specify a feature engineering pipeline with the recipes package
- Specify a parsnip model object
- Package the recipe and model into a workflow
- Fit the workflow to the training data and evaluate model performance on the test data
Let’s demonstrate this process using logistic regression and the heart_df data.
The first step in modeling is to split our data into a training and test set. In the classification setting, we must also make sure that the response variable in our data set is a factor.
By default, tidymodels
maps the first level of a factor
to the Positive
class while calculating performance
metrics. Therefore, before we split our data and proceed to modeling, we
need to make sure that the event we are trying to predict is the first
level of our response variable.
For the heart_df data, the event we are interested in predicting is heart_disease = 'Yes'. We can use the levels() function to check the current ordering of the levels of the heart_disease variable.
levels(heart_df$heart_disease)
[1] "yes" "no"
Since ‘yes’ is the first level, we don’t need to take any further steps and can proceed to splitting our data.
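If ‘yes’ were not the first level, one simple option would be to reset the factor levels before splitting so that the positive class comes first. A minimal sketch (not actually needed for this data):
# Not needed for heart_df, since 'yes' is already the first level.
# If it were not, placing it first ensures tidymodels treats 'yes'
# as the positive class when calculating performance metrics.
heart_df <- heart_df %>%
  mutate(heart_disease = factor(heart_disease, levels = c('yes', 'no')))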
In the code below, we use the initial_split()
function
from rsample
to create our training and testing data using
the heart_df
data.
## Always remember to set your seed
## Add an integer to the argument of set.seed()
set.seed(345)
heart_split <- initial_split(heart_df, prop = 0.75,
strata = heart_disease)
heart_training <- heart_split %>% training()
heart_test <- heart_split %>% testing()
The next step in the modeling process is to define our feature
engineering steps. In the code below, we process our numeric predictors
by removing skewness and normalizing, and create dummy variables from
our chest_pain
predictor.
When creating a feature engineering pipeline, it’s important to exclude prep() and bake(), because these steps will be implemented automatically in our workflow, which is created at a later stage.
heart_recipe <- recipe(heart_disease ~ ., data = heart_training) %>%
step_YeoJohnson(all_numeric(), -all_outcomes()) %>%
step_normalize(all_numeric(), -all_outcomes()) %>%
step_dummy(all_nominal(), -all_outcomes())
However, it is always good practice to check that the feature
engineering recipe is doing what we expect. The code below processes our
training data with prep()
and bake()
so that
we can have a look at the results.
heart_recipe %>%
prep(training = heart_training) %>%
bake(new_data = NULL)
Next, we define our logistic regression model object. In this case,
we use the logistic_reg()
function from
parsnip. Our engine is glm and our mode is classification.
logistic_model <- logistic_reg() %>%
set_engine('glm') %>%
set_mode('classification')
Now we can combine our model object and recipe into a single workflow
object using the workflow()
function.
heart_wf <- workflow() %>%
add_model(logistic_model) %>%
add_recipe(heart_recipe)
Next we fit our workflow to the training data. This is done by passing our workflow object to the fit() function.
heart_logistic_fit <- heart_wf %>%
fit(data = heart_training)
Once we have trained our logistic regression model on our training
data, we can optionally study variable importance with the
vip()
function.
The first step is to extract the trained model from our workflow fit,
heart_logistic_fit
. This can be done by passing
heart_logistic_fit
to the
extract_fit_parsnip()
function.
heart_trained_model <- heart_logistic_fit %>%
extract_fit_parsnip()
Next we pass heart_trained_model
to the
vip()
function. This will return a ggplot
object with the variable importance scores from our model. The
importance scores are based on the z-statistics associated with each
predictor.
We see from the results below that asymptomatic chest pain, maximum heart rate, and resting blood pressure are the most important predictors of heart disease in our data set.
vip(heart_trained_model)
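If we want to see the z-statistics themselves, one option is to tidy the extracted model with the tidy() function from the broom package (loaded with tidymodels). A brief sketch:
# View the estimated coefficients, standard errors, and z-statistics
# (the 'statistic' column) that the importance scores are based on
heart_trained_model %>%
  tidy()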
The next step in the modeling process is to assess the accuracy of
our model on new data. This is done by obtaining predictions on our test
data set with our trained model object,
heart_logistic_fit
.
Before we can do this, we create a results data frame with the following data:
- the true response variable values from the test data set
- the predicted response categories from our model
- the estimated probabilities for each response category
All of this data can be put together using the predict() function.
To obtain the predicted category for each row in our test set, we
pass the heart_logistic_fit
object to the
predict
function and specify
new_data = heart_test
.
We will get a data frame with a column named .pred_class
which has the predicted category (yes/no) for each row of our test data
set.
predictions_categories <- predict(heart_logistic_fit,
new_data = heart_test)
predictions_categories
Next we need to obtain the estimated probabilities for each category of our response variable. This is done with the same code as above, but with the additional argument type = 'prob'.
In this case we get a data frame with the following columns, .pred_yes and .pred_no. The tidymodels package will always use the following convention for naming these columns: .pred_level_of_factor_in_response_variable
predictions_probabilities <- predict(heart_logistic_fit,
new_data = heart_test,
type = 'prob')
predictions_probabilities
The final step is to combine the results from above with the true response variable values in our test data set.
# Combine
test_results <- heart_test %>% select(heart_disease) %>%
bind_cols(predictions_categories) %>%
bind_cols(predictions_probabilities)
test_results
The yardstick
package from tidymodels
has a
number of functions for calculating performance metrics on the results
of a machine learning algorithm.
Important functions from this package include conf_mat(), sens(), spec(), roc_curve(), and roc_auc().
All of these functions take a data frame with the structure of our test_results as the first argument. The input data frame must contain the three pieces of information mentioned at the beginning of this section: the true response values, the predicted response categories, and the estimated probabilities of the positive class.
The first result to explore is usually the confusion matrix. The
conf_mat()
function will produce one for us. It takes the
following important arguments:
- data - the first argument is a data frame with model results (usually on the test set)
- truth - a factor column with the true response categories
- estimate - a factor column with the predicted response categories
The results of this function are a confusion matrix with the predicted categories in the rows and true categories in the columns.
By default, all yardstick
functions map the first level
of the response variable to the positive class. The
conf_mat()
function orders the output by displaying the
positive class first in both the rows and columns.
From the results below, we have 58 correct predictions on our test data set. Since we have 76 rows in our test data, this gives us an accuracy of 76%.
We have 7 false positives (we predicted the positive class (‘yes’) but the truth was ‘no’) and 11 false negatives.
conf_mat(test_results,
truth = heart_disease,
estimate = .pred_class)
Truth
Prediction yes no
yes 24 7
no 11 34
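As a quick check of the overall accuracy, we can also use the accuracy() function from yardstick, which takes the same arguments as conf_mat():
# Overall accuracy on the test data (should be roughly 0.76)
accuracy(test_results,
         truth = heart_disease,
         estimate = .pred_class)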
The sensitivity is a performance metric that calculates the proportion of actual positive cases that the classification model predicted correctly.
In our heart disease example, this is the proportion of patients who
developed heart disease (heart_disease
= Yes) that were
predicted to develop heart disease.
The sens()
function from yardstick
is used
to calculate this metric. It takes the same arguments as
conf_mat()
.
From the results below, we have a sensitivity of 0.69 on our test data results (24 of the 35 actual positive cases in the confusion matrix were predicted correctly).
sens(test_results,
truth = heart_disease,
estimate = .pred_class)
The specificity is a performance metric that calculates the proportion of actual negative cases that the classification model predicted correctly.
In our heart disease example, this is the proportion of patients who
did not develop heart disease (heart_disease
= No) that
were predicted to not develop heart disease.
The spec()
function from yardstick
is used
to calculate this metric. It takes the same arguments as
conf_mat()
.
From the results below, we have a specificity of 0.83 on our test data results (34 of the 41 actual negative cases were predicted correctly).
spec(test_results,
truth = heart_disease,
estimate = .pred_class)
The ROC curve is a way to visualize the performance of any classification model. The plot includes the sensitivity on the y-axis and (1 - specificity) on the x-axis for all possible probability cut-off values.
The default probability cut-off value used by classification models is 0.5. But changing this can guard against either false positives or false negatives. The ROC curve plots all of this information in one plot.
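To make the idea of a probability cut-off concrete, here is a minimal sketch that re-scores our test results at a cut-off of 0.3 instead of 0.5 (the column name .pred_class_30 is just an illustrative choice):
# Classify as 'yes' whenever the estimated probability of heart disease
# is at least 0.3, trading more false positives for fewer false negatives
test_results %>%
  mutate(.pred_class_30 = factor(ifelse(.pred_yes >= 0.3, 'yes', 'no'),
                                 levels = c('yes', 'no')))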
What to look for: the best ROC curve is as close as possible to the point (0, 1) that is at the top left corner of the plot. The closer the ROC curve is to that point throughout the entire range, the better the classification model.
The dashed line through the middle of the graph represents a model that is just guessing at random.
The first step in creating an ROC curve is to pass our results data frame to the roc_curve() function. This function takes a data frame with model results, the name of the column with the true outcome - truth, and the name of the column with the estimated probabilities of the positive class. It will produce a data frame with the specificity and sensitivity for all probability thresholds.
roc_curve(test_results,
truth = heart_disease,
.pred_yes)
To plot this data, we simply pass the results of
roc_curve()
to the autoplot()
function.
roc_curve(test_results,
truth = heart_disease,
.pred_yes) %>%
autoplot()
Another important performance metric is the area under the ROC curve. This metric can be loosely interpreted as a letter grade.
In terms of model performance, an area under the ROC value between 0.9 - 1 indicates an “A”, 0.8 - 0.9 a “B”, and so forth. Anything below a 0.6 is an “F” and indicates poor model performance.
To calculate the area under the ROC curve, we use the roc_auc() function.
This function takes the results data frame as the first argument, the
truth
column as the second argument, and the column of
estimated probabilities for the positive class as the third
argument.
From our results below, our model gets a “B”.
roc_auc(test_results,
truth = heart_disease,
.pred_yes)
The F1 score is a performance metric that equally balances our false positive and false negative mistakes. The range of an F1 score is from 0 (worst) to 1 (best).
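The F1 score is the harmonic mean of precision and sensitivity. A brief sketch of the calculation using the counts from our confusion matrix above:
# Worked calculation from the confusion matrix (TP = 24, FP = 7, FN = 11)
precision <- 24 / (24 + 7)   # proportion of predicted 'yes' cases that were correct
recall    <- 24 / (24 + 11)  # the sensitivity from above
2 * precision * recall / (precision + recall)  # approximately 0.73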
The f_meas()
function from yardstick
is
used to calculate this metric. It takes the same arguments as
conf_mat()
.
From the results below, we have an F1 score of approximately 0.73 on our test data results.
f_meas(test_results,
truth = heart_disease,
estimate = .pred_class)
It is also possible to create a custom metric set using the
metric_set()
function. This function takes
yardstick
function names as arguments and returns a new
function that we can use to calculate that set of metrics.
In the code below we create a new function, my_metrics(), that will calculate the accuracy, sensitivity, specificity, F1, and ROC AUC from the results data frame.
Note: Since accuracy()
,
sens()
, spec()
, and f_meas()
require the truth
and estimate
columns while
roc_auc()
requires the truth
column and a
column of estimated probabilities for the positive class, all three must
be provided to the custom function when it is called.
my_metrics <- metric_set(accuracy, sens, spec, f_meas, roc_auc)
my_metrics(test_results,
truth = heart_disease,
estimate = .pred_class,
.pred_yes)
Just like with linear regression, we can automate the process of fitting a logistic regression model by using the last_fit() function. This will automatically give us the predictions and metrics on our test data set.
In the example below, we will fit the same model as above, but with
last_fit()
instead of fit()
.
The last_fit()
function takes a workflow object as the
first argument and a data split object as the second. It will train the
model on the training data and provide predictions and calculate metrics
on the test set.
The last_fit()
function takes an optional
metrics
argument where a custom metric function can be
provided. In the example below, we use our my_metrics()
function. If this is left out of the call to last_fit()
, it
will calculate accuracy and ROC AUC by default.
last_fit_model <- heart_wf %>%
last_fit(split = heart_split,
metrics = my_metrics)
To obtain the metrics on the test set (accuracy and roc_auc by
default) we use collect_metrics()
.
last_fit_model %>%
collect_metrics()
We can also obtain a data frame with test set results by using the
collect_predictions()
function.
last_fit_results <- last_fit_model %>%
collect_predictions()
last_fit_results
We can use this data frame to make an ROC plot by using
roc_curve()
and autoplot()
.
last_fit_results %>%
roc_curve(truth = heart_disease, .pred_yes) %>%
autoplot()
Copyright © David Svancer 2023