In this tutorial, we will learn about R functions and data analysis with the tidyverse package.
You can clone the course R tutorials into DataCamp Workspace, a free computation environment that allows for execution of R and Python notebooks. Note: you will only have to do this once, since all tutorials are included in the DataCamp Workspace GBUS 738 project.
In this section, we will learn about common built-in functions that are useful for obtaining summary statistics, ranking data, and data analysis. We will also learn how to write our own custom functions in R.
The functions below are useful for studying the distribution of numeric values within a data set. All of these functions take a numeric vector as their input:

- min()
- max()
- range()
- median()
- fivenum()
- quantile()
Let's start by obtaining the range of values present in a numeric vector.
data_vector <- c(3, 9, 11.2, 14, 28.7, 30, 15, 21, 5.7, 9.1, 24.6)
# minimum value in data_vector
min(data_vector)
[1] 3
# maximum value
max(data_vector)
[1] 30
# range of data values
range(data_vector)
[1] 3 30
The median() and quantile() functions are used for obtaining specific percentiles from a distribution of numbers. A percentile of a set of numbers is a value at or below which a given percentage of the total values fall. For example, the 50th percentile (also called the median) represents the center of a set of numeric data. This means that 50% of all the values are less than or equal to the 50th percentile.

The quantile() function requires two inputs. The first is a numeric vector of data values and the second is a vector with values ranging from 0 to 1, representing the percentile(s) to calculate.
# median
median(data_vector)
[1] 14
# 30th percentile
quantile(data_vector, 0.3)
30%
9.1
# 30th, 60th, and 90th percentiles
quantile(data_vector, c(0.3, 0.6, 0.9))
30% 60% 90%
9.1 15.0 28.7
The fivenum() function calculates the five number summary (min, 25th, median, 75th, max) of a numeric vector.
fivenum(data_vector)
[1] 3.0 9.1 14.0 22.8 30.0
The mean() and sd() functions are used to calculate the mean and standard deviation of a set of data values.
# mean value
mean(data_vector)
[1] 16
# standard deviation
sd(data_vector)
[1] 9.2
The sum() and cumsum() functions are used for summing the numbers within a vector. The sum() function simply returns the sum of all numbers within a vector. The cumsum() function calculates a cumulative sum for every position within a vector. This function always returns a vector of the same length as the input.
# sum of all values
sum(data_vector)
[1] 171
# cumulative sum
cumsum(data_vector)
[1] 3 12 23 37 66 96 111 132 138 147 171
The abs() and rank() functions are useful for ranking data values. The abs() function returns the absolute values of a vector.
negative_data <- c(-2, 4.5, -6, 10, 12)
# returns the absolute value of all elements
abs(negative_data)
[1] 2.0 4.5 6.0 10.0 12.0
The rank() function returns the ranks of a set of data values from smallest to largest. The smallest value is given a rank of 1.
data_vector
[1] 3.0 9.0 11.2 14.0 28.7 30.0 15.0 21.0 5.7 9.1 24.6
rank(data_vector)
[1] 1 3 5 6 10 11 7 8 2 4 9
To obtain ranks from largest to smallest, where rank 1 represents the largest value, just take the rank of the negative of a numeric vector. In the example below, the value 30 is given a rank of 1.
data_vector
[1] 3.0 9.0 11.2 14.0 28.7 30.0 15.0 21.0 5.7 9.1 24.6
rank(-data_vector)
[1] 11 9 7 6 2 1 5 4 10 8 3
This section will cover the basics of data manipulation using the tidyverse package. Before we can use the package, we must load it into our environment with the following code: library(tidyverse). This will import all of the functions available in the tidyverse package into our environment. The tidyverse is a collection of 8 packages that are designed specifically for data science tasks.
In this course, I have installed all required packages into our RStudio Cloud environment. If you are ever working with RStudio on your desktop, you must install packages before they can be used. This is done with the following code: install.packages('tidyverse').
To get more details about the tidyverse package, see the tidyverse documentation.
We will also load the skimr package, which is used for exploring the structure of a data frame.
# This will load all 8 of the tidyverse packages
library(tidyverse)
library(skimr)
The code below will import a data set from our course website. The data consists of 1,470 employee records for a U.S.-based product company. The rows in this data frame represent the attributes of an employee at this company across the variables listed in the table below.
Variable | Definition
---|---
left_company | Did the employee leave the company? (Yes/No)
department | Department within the company
job_level | Job Level (Associate - Vice President)
salary | Employee yearly salary (US Dollars)
weekly_hours | Self-reported average weekly hours spent on the job (company survey)
business_travel | Level of required business travel
yrs_at_company | Tenure at the company (years)
yrs_since_promotion | Years since last promotion
previous_companies | Number of previous companies for which the employee has worked
job_satisfaction | Self-reported job satisfaction (company survey)
performance_rating | Most recent annual performance rating
marital_status | Marital status (Single, Married, or Divorced)
miles_from_home | Distance from employee address to office location
This data is a special type of data frame known as a tibble. All data frames in the tidyverse are usually stored in this format. It has special properties which include better printing features and labels for column data types.
employee_data <- readRDS(url('https://gmubusinessanalytics.netlify.app/data/employee_data.rds'))
# View data
employee_data
skimr
The first step in a data analysis project is to explore your data source. This includes summarizing the values within each column, checking for missing data, checking the data types of each column, and verifying the number of rows and columns.
The skim() function can be used to accomplish all of this. It takes your data frame as an argument. In the output below, we first get the number of rows and columns along with the data types present in our data.
The results are then grouped by the type of variables in our data.
First we get a summary of our factor variables, including the number of missing observations, whether our factor levels are ordered, the count of unique levels, and an abbreviated list of the most frequent factor levels.
Then we get a summary of our numeric variables which include the number of missing observations, the mean and standard deviation, a five number summary, and a plot of the distribution of values.
# View data frame properties and summary statistics
skim(employee_data)
Name | employee_data |
Number of rows | 1470 |
Number of columns | 13 |
Column type frequency: | |
factor | 7 |
numeric | 6 |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
left_company | 0 | 1 | FALSE | 2 | No: 1233, Yes: 237 |
department | 0 | 1 | FALSE | 6 | IT : 399, Res: 293, Sal: 252, Mar: 238 |
job_level | 0 | 1 | FALSE | 5 | Sen: 476, Man: 344, Dir: 331, Ass: 185 |
business_travel | 0 | 1 | FALSE | 3 | Rar: 1043, Fre: 277, Non: 150 |
job_satisfaction | 0 | 1 | FALSE | 4 | Ver: 459, Hig: 442, Low: 289, Med: 280 |
performance_rating | 0 | 1 | FALSE | 5 | Mee: 546, Exc: 472, Exc: 286, Min: 136 |
marital_status | 0 | 1 | FALSE | 3 | Mar: 673, Sin: 470, Div: 327 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
salary | 0 | 1 | 94076.2 | 37590.2 | 29849 | 70379 | 88556 | 117100 | 212135 | ▃▇▃▁▁ |
weekly_hours | 0 | 1 | 50.0 | 4.8 | 40 | 47 | 49 | 52 | 66 | ▂▇▃▂▁ |
yrs_at_company | 0 | 1 | 7.0 | 6.1 | 0 | 3 | 5 | 9 | 40 | ▇▂▁▁▁ |
yrs_since_promotion | 0 | 1 | 2.2 | 3.2 | 0 | 0 | 1 | 3 | 15 | ▇▁▁▁▁ |
previous_companies | 0 | 1 | 3.2 | 1.6 | 1 | 2 | 3 | 4 | 7 | ▇▇▂▂▃ |
miles_from_home | 0 | 1 | 9.2 | 8.1 | 1 | 2 | 7 | 14 | 29 | ▇▅▂▂▂ |
It is also possible to select a subset of variables to explore. Just pass a sequence of unquoted variable names into the skim() function.
The skimr package has many more features for exploring data. Once we cover the fundamentals of dplyr in the next sections, I encourage interested students to explore the skimr documentation.
# View data frame properties and summary statistics
skim(employee_data, left_company, department, salary, weekly_hours)
Name | employee_data |
Number of rows | 1470 |
Number of columns | 13 |
Column type frequency: | |
factor | 2 |
numeric | 2 |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
left_company | 0 | 1 | FALSE | 2 | No: 1233, Yes: 237 |
department | 0 | 1 | FALSE | 6 | IT : 399, Res: 293, Sal: 252, Mar: 238 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
salary | 0 | 1 | 94076 | 37590.2 | 29849 | 70379 | 88556 | 117100 | 212135 | ▃▇▃▁▁ |
weekly_hours | 0 | 1 | 50 | 4.8 | 40 | 47 | 49 | 52 | 66 | ▂▇▃▂▁ |
dplyr
In this section we will cover data manipulation with the dplyr package. This is one of the core tidyverse packages used for exploring data frames.

Chapter 5 of R for Data Science covers the basics of manipulating data frames in R. In this tutorial, I would like to provide additional examples of the main functions of dplyr, including filter(), select(), arrange(), summarise(), and mutate().
The first argument to all of these functions is a data frame, followed by additional arguments that perform various manipulations on the data. The output from all of these functions will also be a special type of data frame known as a tibble.
filter()
The filter() function is used for subsetting rows of a data frame. It is much more intuitive than subsetting with the base R functions [ ] and [[ ]].

The first argument to filter() is a data frame, followed by one or more logical conditions on the variables within the data frame. Logical conditions separated by a comma are treated as an AND (&) operation. The advantage of dplyr is that you can pass variable names of a data frame in raw, unquoted format to many functions. The filter() function returns a data frame that has been subsetted by the logical conditions within its arguments.
# employees that left the company
filter(employee_data, left_company == 'Yes')
# View employees that left from the Sales department
filter(employee_data, left_company == 'Yes', department == 'Sales')
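Since logical conditions separated by a comma are combined as AND, the second example above can also be written with an explicit & operator. This sketch should return the same rows:

# Equivalent to the comma syntax, using an explicit &
filter(employee_data, left_company == 'Yes' & department == 'Sales')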
To filter a data frame using an OR condition, we must use the | operator.
# employees from Sales or Marketing department
filter(employee_data, department == 'Sales' | department == 'Marketing')
Another way to execute OR statements is by using the %in% operator. This operator is used to check whether a column's values match at least one element within a vector. In many cases, it can save lots of typing. The code below will produce the same result as the previous command.
# employees from Sales or Marketing department
filter(employee_data, department %in% c('Sales', 'Marketing'))
What if we are interested in employees from Sales or Marketing that make over $80,000? We can just add another condition to the previous code. Remember that conditions separated by a comma represent an AND operation. So in the code below, we are passing the following condition: employees with salary > 80000 AND (department is Sales OR department is Marketing).
# employees from Sales or Marketing department
filter(employee_data, salary > 80000, department %in% c('Sales', 'Marketing'))
select()
The select() function allows you to select a subset of columns from a data frame. There are multiple ways to enter the selection condition and many helper functions, such as starts_with(), ends_with(), and contains(). See the documentation for more examples.

We can select columns by using unquoted column names.
# Select the first three columns
select(employee_data, left_company, department, job_level)
We can also select columns by using their numeric positions.
# Select the first three columns with a numeric vector
select(employee_data, c(1, 2, 3))
We can also pass a sequence of numeric positions separated by a comma.
# Select the first three columns with raw numbers
select(employee_data, 1, 2, 3)
Adding a - in front of numeric positions or variable names excludes those variables and returns all others.
# Select all columns except department and job_level
select(employee_data, -department, -job_level)
# Exclude the first 5 columns
select(employee_data, -1, -2, -3, -4, -5)
# Many times it's easier to use a sequence
select(employee_data, -(1:5))
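Columns can also be selected as a consecutive range of names using the : operator. As a small sketch, the code below should select the first three columns by naming the endpoints:

# Select all columns from left_company through job_level
select(employee_data, left_company:job_level)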
There are many helper functions available with the select() function. Below are two examples. The first selects columns that contain the word "job". The second selects columns that start with the letter "y".
# Select all variables that contain the word "job"
select(employee_data, contains('job'))
# Select all variables that start with "y" (or "Y")
select(employee_data, starts_with("y"))
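The ends_with() helper mentioned above works the same way. For example, the sketch below should return left_company and yrs_at_company, since both column names end with the word "company":

# Select all variables that end with "company"
select(employee_data, ends_with('company'))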
arrange()
The arrange() function allows you to sort a data frame by specific columns. The default sorting order is ascending.
# Sort by left_company and salary
arrange(employee_data, left_company, salary)
To sort a column in descending order, wrap it in the desc() function.
# Sort by left company (descending) and salary (ascending)
arrange(employee_data, desc(left_company), salary)
summarise()
The summarise() function allows you to create numerical summaries, such as averages, from the columns of a data frame. The summarise() function is an aggregation function, meaning that it takes input with multiple values and returns a single value. With a sample average, for example, we take a numeric vector with n numbers and summarize it with a single value, the average.

The summarise() function takes a data frame as the first argument followed by a sequence of functions to execute on the columns of the data frame.
Remember that dplyr functions will always return a data frame of results, even if the results are a single number or row.
# Let's get the average salary
summarise(employee_data, average_salary = mean(salary))
# We can also create multiple summaries at once
summarise(employee_data, salary_min = min(salary),
salary_25th = quantile(salary, 0.25),
salary_50th = median(salary),
salary_75th = quantile(salary, 0.75),
salary_max = max(salary))
mutate()
The mutate() function allows you to create new variables in your data. The mutate() function is a window function. Window functions take input with n values and always return a result with n values. Like summarise(), mutate() takes a data frame as the first argument followed by a sequence of functions to execute on the columns of the data frame.
Let's create a new variable salary_scaled that transforms the salary values by subtracting the mean and dividing by the standard deviation. In the code below, I save this new data frame as employee_data_scaled.
# Create salary_scaled variable
employee_data_scaled <- mutate(employee_data,
salary_scaled = (salary - mean(salary))/sd(salary))
# Let's see the result
select(employee_data_scaled, salary, salary_scaled)
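Note that mutate() can also create several new variables in a single call. A minimal sketch is below; the new column names are just illustrative:

# Create multiple new variables in one mutate() call
employee_data_extra <- mutate(employee_data,
                              salary_thousands = salary / 1000,
                              hours_above_40 = weekly_hours - 40)
# View the new columns
select(employee_data_extra, salary, salary_thousands, weekly_hours, hours_above_40)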
%>% Operator

One of the most useful tools in the tidyverse is the pipe operator, %>%. This operator allows users to combine multiple data manipulation tasks in one line of code.
The pipe operator tells R to take the object on the left and pass it to the right as the first argument to the next function. A keyboard shortcut for the pipe operator is Ctrl + Shift + m.
Let's go through a simple example to demonstrate the %>% operator. Suppose I have a vector of numeric values and I need to do the following:

- calculate the cumulative sum of the values with the cumsum() function
- find the median of the result with the median() function

The code below shows how this is done without the pipe operator. Most people new to programming find the code a bit difficult to read because the operations go from right to left. cumsum(numeric_data) is evaluated first, then passed to the median() function on the left.
If we had an operation that required 3 or more steps, writing code in this way would quickly get confusing to understand and explain.
# Data values
numeric_data <- c(2, 5, 1, 7, 9, 10)
# Cumulative sum
cumsum(numeric_data)
[1] 2 7 8 15 24 34
# Median of the cumulative sum
median(cumsum(numeric_data))
[1] 12
The %>% operator can be used to write cleaner code that moves from left to right in a do this - then that fashion. In the code below, we first create the cumulative sum vector with cumsum() and then pass it to the first argument of the median() function. Most people find this style of coding much easier to understand.
cumsum(numeric_data) %>% median()
[1] 12
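To see how this scales, here is a sketch of a three-step chain, which would read inside out as round(median(cumsum(numeric_data))) without the pipe:

# Cumulative sum, then median, then round the result
numeric_data %>% cumsum() %>% median() %>% round()
[1] 12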
%>% with dplyr

The pipe operator makes it easy to create data analysis workflows with dplyr functions. When splitting steps across multiple lines, make sure that each line ends with %>% to prevent errors.
Let's use %>% to create the following data analysis:

- start with the employee_data data frame
- filter to the employees who left the company
- calculate their average salary

The code below demonstrates how to achieve this with one expression using the %>% operator.
Notice that we do not have to pass employee_data into the first argument of filter() since we have piped it in with the first %>%. The filter() function then produces a new data frame that only contains rows that have left_company values of "Yes". This is then passed to the first argument of summarise() by the second %>%.
employee_data %>%
filter(left_company == 'Yes') %>%
summarise(average_salary = mean(salary))
For comparison, the code below executes the same workflow without the %>% operator. In this case, we must save our filtered data frame and pass it into summarise().
employees_left <- filter(employee_data, left_company == 'Yes')
summarise(employees_left, average_salary = mean(salary))
In this section, we will cover more advanced techniques for data manipulation with dplyr.

A common data analysis technique, known as split-apply-combine, involves creating statistical summaries by groups within a data frame.
Using the split-apply-combine technique with dplyr usually involves taking a data frame, forming subsets with the group_by() function, applying a summary function to the groups, and collecting the results into a single data frame.
The group_by() function allows you to perform tasks by groups in your data. A typical application of the group_by() function is when addressing questions such as: What's the average salary of employees by department?

The group_by() function will separate a data frame into distinct groups using the variable that is supplied as the argument. Any dplyr functions that follow afterward are applied to the groups within the grouped data frame.
Below is an example of how we would answer the question from above.
employee_data %>%
group_by(department) %>%
summarise(average_salary = mean(salary))
The group_by() function can take multiple grouping variables. Just separate multiple grouping columns by a comma. For example, if we wanted to know the average salary of employees by department and job_level, we would execute the code below.
employee_data %>%
group_by(department, job_level) %>%
summarise(average_salary = mean(salary))
n() and count()
The group_by() function is also used for obtaining counts of factor or character variable values. The n() function returns the number of rows in a group.
# The n() function in dplyr returns the number of rows per group
employee_data %>% group_by(left_company) %>%
summarise(number_employees = n())
This type of summary is so common that dplyr provides a function named count() to provide counts by levels of single or multiple variable combinations. This function creates a column named n with the counts. The count() function is best used with factor or character columns that do not have too many unique values.
# Count the occurrence of each unique value in department
employee_data %>% count(department)
# Counts can be sorted
employee_data %>% count(department, sort = TRUE)
# Count variable can be renamed
employee_data %>% count(department, sort = TRUE, name = 'number_of_employees')
The count() function also takes multiple variables as arguments. This will count all of the combinations of variable values that are present in the data.
# The count() function also takes multiple arguments
employee_data %>% count(left_company, department, name = 'number_of_employees')
Keep in mind that all dplyr functions return a data frame, so we can easily use other dplyr functions on the results of count().
# Arrange the output by department and left_company
employee_data %>% count(left_company, department, name = 'number_of_employees') %>%
arrange(department, left_company)
One more example: how many employees are there by job satisfaction, and what is their average distance from the office within each category?
employee_data %>% group_by(job_satisfaction) %>%
summarise(number_of_employees = n(),
avg_miles = mean(miles_from_home))
Can you figure out why the code below doesn’t work?
employee_data %>% count(job_satisfaction, name = 'number_of_employees') %>%
summarise(avg_miles = mean(miles_from_home))
Hint: highlight and execute the code below. What does this give you?
employee_data %>% count(job_satisfaction, name = 'number_of_employees')
This is what is being passed to summarise(). Why doesn't summarise() know how to calculate the mean of the miles_from_home variable?
ungroup()
When using group_by(), it's important to know that all summarise() and mutate() functions are applied by group. If you need to apply functions to all rows of a data frame and not by groups, then use the ungroup() function to un-group the input data frame.
Let's say that we want to create a data frame that has employees grouped by left_company and job_level. For each group, we want the number of employees in that group and the percentage these employees represent of all employees in the data set.
In the code below, I first calculate this the incorrect way. Notice that percent_of_total_employees sums to 200. This is because the mutate() function was applied to the grouped data frame that summarise() produced. So we get the percentage for each group within "Yes" and "No".
# Not what we wanted
employee_data %>% group_by(left_company, job_level) %>%
summarise(employees = n()) %>%
mutate(percent_of_total_employees = 100*(employees/sum(employees)))
To avoid these types of surprises, I recommend dropping groups after summary statistics have been calculated. This is done with the ungroup() function.
# This is correct
employee_data %>% group_by(left_company, job_level) %>%
summarise(employees = n()) %>%
ungroup() %>%
mutate(percent_of_total_employees = 100*(employees/sum(employees)))
n_distinct()
The n() function counts the number of rows in a data frame while the n_distinct() function counts the number of unique values observed in a variable. This is useful in many situations.
Like the n() function, n_distinct() is used within summarise() or mutate().
Let's say we wanted to know the number of distinct departments that are in the employee_data data frame. The code below will calculate this. We simply pass the department variable into n_distinct().
employee_data %>%
summarise(number_of_departments = n_distinct(department))
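The n_distinct() function also works within grouped summaries. For example, this sketch should count the number of distinct job levels present in each department:

# Count distinct job levels within each department
employee_data %>%
  group_by(department) %>%
  summarise(distinct_job_levels = n_distinct(job_level))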
slice() and top_n()
The top_n() function can be used to subset a data frame based on the ranked values of a particular variable. This function takes three inputs:

- a data frame
- the number of rows to keep
- the variable used to rank the rows
In the example below, I filter the employee_data dataset to include only the top 3 rows based on the salary values. By default, top_n() will return the rows with the 3 largest values of salary.
employee_data %>% top_n(3, salary)
If there are any ties, the top_n() function will return all tied rows. Notice that we get 4 rows in the output below. This is because the top 3 values of yrs_at_company are 40, 37, and 36, and there are two instances of 36 in the data.
The slice() function is used to keep a certain number of rows in a result. If you want only the first 3 rows of a data frame, you could use slice(1:3). Let's re-create the top 3 analysis we just performed, but this time we'll keep only the first 3 rows.
# Keep just the first 3 rows
employee_data %>% slice(1:3)
# Keep rows 1, 200, 1002
employee_data %>% slice(1, 200, 1002)
The slice() function can be used as an alternative to top_n() when paired with arrange(). This has the benefit of dropping tied values.
The code below returns the top 3 values of yrs_at_company just as above, but only keeps the first 3 rows. This is achieved by first ordering the values of yrs_at_company in descending order.
employee_data %>% arrange(desc(yrs_at_company)) %>%
slice(1:3)
The slice() and top_n() functions are also useful for creating subsets by groups in a data frame. Let's say that we wanted to filter the employee_data data frame to only include the top 2 employees with the largest values of salary for each department. The code below shows how we would accomplish this with top_n().
employee_data %>% group_by(department) %>%
top_n(2, salary)
To make things easier to interpret, we can arrange the results from above by department and salary.
employee_data %>% group_by(department) %>%
top_n(2, salary) %>%
arrange(department, desc(salary))
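As a side note, newer versions of dplyr (1.0 and later) provide slice_max(), which combines the ranking and slicing steps. The sketch below should produce the same grouped result, and adding with_ties = FALSE would drop tied rows like the arrange() + slice() approach:

# A more recent alternative to top_n()
employee_data %>% group_by(department) %>%
  slice_max(salary, n = 2) %>%
  arrange(department, desc(salary))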
Logical Vectors and summarise()
Imagine we have data from a survey we recently conducted where 7 people responded and provided their age. This data is stored in the age vector below.
age <- c(23, 31, 27, 41, 54, 34, 25)
age
[1] 23 31 27 41 54 34 25
What if we would like to know the number of people who are 30 or older and what percentage of the total respondents this group represents?
We can answer this question by first using the >= operator to find where values stored in the age vector are greater than or equal to the value 30. Anytime we use comparison operators (>, >=, <, <=, ==) on a vector, we will get a logical vector consisting of TRUE/FALSE values indicating where our condition was met.
For example, running the code below produces a sequence of TRUE/FALSE values that test where our respondents are 30 or older in the age vector.
age >= 30
[1] FALSE TRUE FALSE TRUE TRUE TRUE FALSE
To answer our question above, we can use the following properties of logical vectors in R:

- when used in arithmetic, TRUE is treated as the value 1 and FALSE as the value 0
- sum() on a logical vector returns the number of TRUE values
- mean() on a logical vector returns the proportion of TRUE values
We see from the output below that 4 people in our survey were 30 years or older and that this represents 57% of the total respondents.
sum(age >= 30)
[1] 4
mean(age >= 30)
[1] 0.57
summarise()
We can use these properties to extend our capabilities with summarise(). What if someone asked us how many employees there are by department, along with the count and proportion of those employees that make less than $60,000?
employee_data %>% group_by(department) %>%
summarise(employees = n(),
employees_less_60 = sum(salary < 60000),
employees_less_60_prop = mean(salary < 60000))
case_when()
The case_when() function from the dplyr package is particularly useful when you need to create a new variable that relies on a complex combination of existing variables in a data frame.
The case_when() function takes a sequence of two-sided formulas. The left-hand side determines which values match a case and the right-hand side provides the replacement value.
The general syntax is logical condition ~ replacement value, where the logical condition can involve multiple variables from a data frame. The sequence ends with TRUE ~ value for all other cases.
The sequence is evaluated in the order that it is put into case_when(). This syntax is best understood with an example.
The code below creates a simple data frame with results from a survey.
survey <- data.frame(age = c(24, 31, 28, 42, 18, 37, 51, 29),
occupation = c("Biotech", "Statistics", "Data Science",
"Marketing","Biotech", "Data Science",
"Statistics", "Biotech"),
job_level = c("Entry", "Mid", "Mid", "Senior", "Entry",
"Mid", "Senior", "Mid"))
survey
Next we use case_when() to create a new character vector within the survey data that labels people in Statistics who are 30 years or older as "Statistics, 30+", people in Data Science who are 30 years or older as "Data Science, 30+", and "Other" for all other combinations. Notice that for all other cases, we must provide TRUE as the logical condition in the case_when() function.
survey_updated <-
survey %>%
mutate(stat_ds_30 = case_when(age >= 30 & occupation == 'Statistics' ~ 'Statistics, 30+',
age >= 30 & occupation == "Data Science" ~ "Data Science, 30+",
TRUE ~ 'Other'))
survey_updated
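To check the results, we can count the values of the new variable. Based on the survey data above, we should see two people labeled "Statistics, 30+", one labeled "Data Science, 30+", and five labeled "Other":

# Count the values of the new stat_ds_30 variable
survey_updated %>% count(stat_ds_30)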
cut_*()
There are three very useful functions within the tidyverse package that perform automatic binning of numeric data for quick exploratory analysis. These functions take a numeric vector as input and return a factor with labeled levels.
- cut_interval() makes n groups with equal range
- cut_number() makes n groups with (approximately) equal numbers of observations
- cut_width() makes groups of a given width
Let's use the survey_updated data frame to show how these functions work. The survey_updated data frame contains a numeric age column which we will transform into factor columns with various categories depending on the function we choose from above.
cut_interval()
N groups with equal range
survey_updated <-
survey_updated %>%
mutate(age_interval = cut_interval(age, n = 3))
The new column, age_interval, has 3 categories with equal range.
survey_updated %>%
select(contains('age'))
The counts for the age_interval column are as follows:
survey_updated %>%
count(age_interval)
cut_number()
N groups with (approximately) equal numbers of observations
survey_updated <-
survey_updated %>%
mutate(age_number = cut_number(age, n = 3))
The new column, age_number, has 3 categories with approximately equal counts.
survey_updated %>%
select(contains('age'))
The counts for the age_number column are as follows:
survey_updated %>%
count(age_number)
cut_width()
The cut_width() function takes a width argument which specifies the width of each interval and a boundary argument for centering intervals.
By default, boundary is equal to 0.5, which may lead to intervals having non-integer endpoints. I usually set boundary to 0 so that the intervals always fall between integer values.
survey_updated <-
survey_updated %>%
mutate(age_width = cut_width(age, width = 10, boundary = 0))
The new column, age_width, has 5 categories with 10-year widths that span the range of the age values.
survey_updated %>%
select(contains('age'))
The counts for the age_width column are as follows:
survey_updated %>%
count(age_width)
employee_data
Let's use the cut_width() function to accomplish the following data analysis workflow with dplyr:

- start with employee_data
- create a miles_category variable that bins miles_from_home values in 5 mile increments
- count the number of employees by left_company and miles_category
employee_data %>%
mutate(miles_category = cut_width(miles_from_home, width = 5, boundary = 0)) %>%
group_by(left_company, miles_category) %>%
summarise(employees = n())
There are many cases when we will have to write our own functions to achieve tasks in an analytics project. R functions can be defined to take any number of inputs (usually called arguments) but only return one object.
The basic syntax of creating a function with arguments x and y is as follows:
my_function <- function(x, y) {
  # R code goes here
}
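As a quick illustration of the two-argument syntax above, here is a minimal sketch. The function name and calculation are hypothetical, not part of the tutorial's data:

# A simple two-argument function: percent change from x to y
percent_change <- function(x, y) {
  100 * (y - x) / x
}
percent_change(50, 60)
[1] 20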
Let’s write a simple function that takes a numeric vector as input and returns a vector of scaled values. For each value in our original vector, we will subtract the mean and divide by the standard deviation. I will show two equivalent ways of writing this function and discuss the difference.
Note that the input value is named x. This is completely arbitrary. The input value could also have been named input, as long as the same name is used within the code of the function.
z_score_1 <- function(x) {
return((x - mean(x))/sd(x))
}
# Let's test our function
age_vector <- c(18, 24, 21, 37, 51, 34, 41)
z_score_1(age_vector)
[1] -1.20 -0.70 -0.95 0.40 1.57 0.14 0.73
By default, an R function returns the result of the last operation that it performed. The code below is an equivalent way of writing the same function. In this case we do not need to use return() to give us the result.
# Equivalent
z_score_2 <- function(x) {
(x - mean(x))/sd(x)
}
# Check results
z_score_2(age_vector)
[1] -1.20 -0.70 -0.95 0.40 1.57 0.14 0.73
The return() call is useful when you need to return a list of results from a function. The function below creates three objects: mean_x, sd_x, and scaled_data. To obtain all of these results, we must use return() and build a list that contains all of the objects.
# return a list
z_score_3 <- function(x) {
mean_x <- mean(x)
sd_x <- sd(x)
scaled_data <- (x - mean_x)/sd_x
return(list(mean_value = mean_x,
sd_value = sd_x,
scaled_vector = scaled_data))
}
detailed_results <- z_score_3(age_vector)
# View the results
detailed_results
$mean_value
[1] 32
$sd_value
[1] 12
$scaled_vector
[1] -1.20 -0.70 -0.95 0.40 1.57 0.14 0.73
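Individual elements of the returned list can then be accessed with the $ operator:

# Access a single element from the list of results
detailed_results$mean_value
[1] 32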
Copyright © David Svancer 2023