In this tutorial, we will learn about data types and common data
structures used in R
.
Data types represent different types of information
that can be stored in R
. The most common R
data types are:
Data structures provide a way to organize and work with different types of data. The data structures we will learn about include:
Click the button below to clone the course R tutorials into your DataCamp Workspace. DataCamp Workspace is a free computation environment that allows for execution of R and Python notebooks. Note - you will only have to do this once since all tutorials are included in the DataCamp Workspace GBUS 738 project.
The most common R data types include numeric,
integer, logical, and character. The table below provides examples of
how each data type is represented in R
.
Data Type | Example Values |
---|---|
Numeric (Double) | 8.123, 10, 2.71812 |
Integer | 1L, 19L, 2000L |
Logical | TRUE, FALSE, T, F |
Character | “A character string of text”, “d”, “8.23” |
Numeric data types represent real numbers, such as 2.345, \(\pi\), and 4.001.
Integer data types represent whole counting numbers and are entered into R by adding an “L” after the number.
Logical data types represent the logical conditions TRUE and FALSE. They can be entered as the unquoted text, TRUE, or just T, for example.
Character data types represent text data and must be entered enclosed between quotes, either single ’ or double “.
The most common data structures in R
, can be categorized
by their dimension (one or two) and restrictions on their contents in
terms of data types.
They can be homogeneous, where all elements are of the same data type or heterogeneous, where contents may have multiple data types.
In this course, we will be using vectors, matrices, data frames, and lists. The key features of these data structures are summarized in the table below.
Data Structure | Data Type Restriction | Dimension |
---|---|---|
Vector | Homogeneous | 1 |
List | Heterogeneous | 1 |
Matrix | Homogeneous | 2 |
Data frame | Heterogeneous | 2 |
A vector is a one-dimensional sequence of data elements of the same type.
Vectors are constructed with the c() function. To
assign a vector to a variable, use the <-
operator (a
keyboard shortcut of this symbol is “Alt” + “-”).
The code below will create a numeric vector with 4 elements and print
the result to the R
console.
# A numeric vector
c(4, 23, 4.1, 3.5)
[1] 4.0 23.0 4.1 3.5
To assign the results to a variable in our R
environment, we use the <-
operator.
# Assign results to numeric_vec
numeric_vec <- c(4, 23, 4.1, 3.5)
When working with any data structure in R
, it is
important to be able to explore its contents and obtain information
about the type of data stored in the structure.
To get information about any data structure, we can use the
str()
function. This will display the data type of a vector
and print it contents.
# Check the type and contents of numeric_vec
# We see that it is numeric (num)
str(numeric_vec)
num [1:4] 4 23 4.1 3.5
Another important attribute of vectors is how many data elements it
contains. This is provided by passing a vector into the
length()
function.
length(numeric_vec)
[1] 4
The c() function can be used to create vectors using single input data elements separated by commas, pre-defined vectors, vectors created with the c() function, or a mixture of all formats. In the examples below, various ways of creating new vectors is demonstrated.
# Combine a pre-defined vector with additional data
numeric_vec_2 <- c(numeric_vec, 4.7, 5.1)
# View result
numeric_vec_2
[1] 4.0 23.0 4.1 3.5 4.7 5.1
# Adding another c() function within an outer c() function
numeric_vec_3 <- c(1.2,
numeric_vec,
c(1.1, 2.4, 4.1))
numeric_vec_3
[1] 1.2 4.0 23.0 4.1 3.5 1.1 2.4 4.1
There are two useful functions, seq()
and
:
, for creating numeric or integer vectors.
The seq()
function has three important
arguments:
These arguments can be provided by name, as shown below
seq_vec <- seq(from = 1, to = 6, by = 1)
str(seq_vec)
num [1:6] 1 2 3 4 5 6
Or by position
seq(1, 6, 1)
[1] 1 2 3 4 5 6
The seq()
function will create integer vectors
if we pass numbers followed by “L” into the function.
seq_int_vec <- seq(1L, 10L, 2L)
str(seq_int_vec)
int [1:5] 1 3 5 7 9
The :
function can be used to quickly generate a
numeric/integer vector that increments by one. The vector is created
using the following rule: start value:end value
# Numeric vector
1:5
[1] 1 2 3 4 5
# Integer vector
1L:10L
[1] 1 2 3 4 5 6 7 8 9 10
# Also works in reverse
5:1
[1] 5 4 3 2 1
To learn more about the :
or any other function in
R
, just execute ?:
in the console and the help
page will pop up in the lower right portion of RStudio.
R
All elements of a vector must be of the same type. When combining different data types into a single vector, it will be coerced in the following precedence order:
This means that if you mix character elements with numeric and integer element, then all elements get converted to character (since it has higher precedence).
These rules are important to understand, since many errors that show up in your code will be due to a mismatch of data types.
The vector below will get converted to a character vector.
vector_1 <- c(2.45, 5.1, 1L, 'character')
str(vector_1)
chr [1:4] "2.45" "5.1" "1" "character"
The vector below will get converted to a numeric vector. Notice that
R
converts logical values in the following way: TRUE
becomes 1 and FALSE becomes 0.
vector_2 <- c(4.234, 10L, TRUE, T, FALSE)
str(vector_2)
num [1:5] 4.23 10 1 1 0
This vector will be converted to an integer vector.
vector_3 <- c(10L, 5L, TRUE, FALSE)
str(vector_3)
int [1:4] 10 5 1 0
Factors are a special data structure for working with categorical data. Categorical data represents data that only differs by label (such ‘yes’/‘no’) or ranks (such as ‘1st’, ‘2nd’, etc.).
In R, factors are a special type of labeled integer vector. Factors are created with the factor() function. This function takes as arguments, a vector, the levels of the factor, and the labels of the factor.
Think of factors as a way to label your data. Factors should only be used when you have a pre-determined number of categories.
# Creating a factor vector
weekday_factor <- factor(c('M', 'T', 'W', 'Th', 'F', 'M', 'W'),
levels = c('M', 'T', 'W', 'Th', 'F'),
labels = c('Monday', 'Tuesday', 'Wednesday',
'Thursday', 'Friday'))
# View results
weekday_factor
[1] Monday Tuesday Wednesday Thursday Friday Monday Wednesday
Levels: Monday Tuesday Wednesday Thursday Friday
The str()
function will tell us that the vector is
factor, display some of the levels, and show the underlying mapping of
levels to integer values that happened behind the scenes.
The levels of a factor are always mapped to a sequence of numbers
starting at 1 and increasing by 1 for every level. This mapping is based
on the order in which the levels are entered in
factor()
str(weekday_factor)
Factor w/ 5 levels "Monday","Tuesday",..: 1 2 3 4 5 1 3
The summary()
function will automatically count the
occurrence of factor labels.
summary(weekday_factor)
Monday Tuesday Wednesday Thursday Friday
2 1 2 1 1
Factors can also be created with numeric vectors as input. Let’s say that we have a vector of 1s and 0s where 1 represents the occurrence of an event and 0 otherwise. The code below shows how to create a labeled factor from the data.
event_indicator <- c(1, 0, 0, 1, 0, 0)
event_fct <- factor(event_indicator,
levels = c(0, 1),
labels = c('No', 'Yes'))
summary(event_fct)
No Yes
4 2
Note that the order in which the levels are entered effects how they are stored in the factor.
event_fct_2 <- factor(event_indicator,
levels = c(1, 0),
labels = c('Yes', 'No'))
summary(event_fct_2)
Yes No
2 4
To access the levels of any factor and see their order, use the
levels()
function.
levels(event_fct)
[1] "No" "Yes"
levels(event_fct_2)
[1] "Yes" "No"
By default, if we do not provide input to the levels and labels
arguments in factor()
, levels are automatically assigned in
alphabetic order (for character vectors) or numeric order. The labels
are then set to the levels values.
fct_from_chr <- factor(c('Yes', 'No', 'No', 'Yes'))
str(fct_from_chr)
Factor w/ 2 levels "No","Yes": 2 1 1 2
fct_from_num <- factor(c(1, 1, 1, 4, 5))
str(fct_from_num)
Factor w/ 3 levels "1","4","5": 1 1 1 2 3
The survey
vector below represents survey responses
where people indicated their level of comfort with data analysis.
survey <- c(1, 3, 3, 2, 2, 1, 1, 1, 1)
The numeric values have the following meaning:
Use the factor()
function to label this vector. You
should get the results below if you pass your factor into the
summary()
function.
not comfortable moderately comfortable very comfortable
5 2 2
A matrix is an R data structure that stores a collection of data arranged in a 2 dimensional table with rows and columns. Like vectors, all data elements of a matrix must be of the same type.
If you build a matrix with vectors of different data types, the matrix will be coerced with the same precedence rules as above. A matrix can’t store a numeric column as well as a character column, this would get coerced into a character matrix.
You can create matrices with the matrix()
function.
Vectors are created with the c()
function. This generalizes
to cbind()
or rbind()
for matrices. The code
below demonstrates these functions.
Matrices also have the ability to capture row and column names. To
check these, we can use the rownames()
and
colnames()
To create a matrix, use the matrix()
function.
# Build a numeric 2 X 2 matrix
A <- matrix(data = c(1, 2, 3, 4), # data is entered as a vector
nrow = 2, # number of rows
ncol = 2, # number of columns
byrow = TRUE) # read data in by row (default is FALSE)
The str()
function will lets us know that we have a
numeric matrix with 2 rows and 2 columns.
# View properties with str()
str(A)
num [1:2, 1:2] 1 3 2 4
We can check the dimensions of a matrix using the dim()
function.
# Check the dimensions of A (rows,columns)
dim(A)
[1] 2 2
# View A
A
[,1] [,2]
[1,] 1 2
[2,] 3 4
Matrices can also be created with pre-defined vectors.
# We can build a matrix with a predefined vector
vector_1 <- c(1, 2, 3, 4)
B <- matrix(vector_1,
nrow = 2,
ncol = 2,
byrow = FALSE)
B
[,1] [,2]
[1,] 1 3
[2,] 2 4
We can combine multiple vectors to create matrices using the
cbind()
and rbind()
functions.
cbind()
will stack vectors horizontally (by column) and
rbind()
will stack vectors vertically (by row).
vector_2 <- c(5, 6, 7, 8)
C <- cbind(vector_1, vector_2)
C
vector_1 vector_2
[1,] 1 5
[2,] 2 6
[3,] 3 7
[4,] 4 8
Since we created our matrix using named vectors, our matrix C has
column names. To view the column names of any matrix, use the
colnames()
function.
# Obtain column names
colnames(C)
[1] "vector_1" "vector_2"
rbind()
creates matrices by row.
D <- rbind(vector_1, vector_2)
D
[,1] [,2] [,3] [,4]
vector_1 1 2 3 4
vector_2 5 6 7 8
We can access the row names by using the rownames()
function on our matrix D.
rownames(D)
[1] "vector_1" "vector_2"
We can also create or overwrite row/columns names for any matrix. In the code below, we assign column names to our matrix D and overwrite the original row names.
colnames(D) <- c('column_1', 'column_2', 'column_3', 'column_4')
rownames(D) <- c('row_1', 'row_2')
D
column_1 column_2 column_3 column_4
row_1 1 2 3 4
row_2 5 6 7 8
Like matrices, lists are R objects that can hold multiple vectors. But unlike matrices, they are one dimensional and can store mixed data types.
The advantage of lists is that they can hold varying data types with different lengths and dimensions. Think of lists as special vectors that can store different data structures in each position. Lists can be recursive, meaning that a list can contain list.
Lists are very important as most output from
statistical models in R
, such as linear regression or
clustering, are returned as lists.
Lists are created with the list()
function. To obtain
the named contents of a list, if any exist, use the names()
function.
We will discuss how to obtain the various objects within a list in the subsetting section of this tutorial.
my_list <- list(char_vector = c('A', 'B'),
numeric_vector = c(1.2, 3.4, 5, 12.01),
a_matrix = cbind(c(1, 2), c(3, 4)),
a_list = list(c(1L, 4L), c('A', 'D', 'E')))
# View contents
my_list
$char_vector
[1] "A" "B"
$numeric_vector
[1] 1.20 3.40 5.00 12.01
$a_matrix
[,1] [,2]
[1,] 1 3
[2,] 2 4
$a_list
$a_list[[1]]
[1] 1 4
$a_list[[2]]
[1] "A" "D" "E"
We can use the names()
function to see whether a list
has named elements. Remember that a list is one-dimensional. We can see
from the output below that my_list
contains 4 elements,
where the vector named char_vector is in the first
position of the list.
# View the names (if they exist) of my_list
names(my_list)
[1] "char_vector" "numeric_vector" "a_matrix" "a_list"
The most common R
data structure is the data frame. A
data frame is a specialized 2-dimensional list that must contain
equal-length vectors, possibly of varying type. Data frames are created
with the data.frame()
function.
A good comparison for a data frame would be a SQL table.
Data frames are rectangular tables of data, where columns represent variables and rows represent observations on these variables. Unlike general one-dimensional lists, data frames must have vectors of the same length. However, vectors may have different types, such as numeric, character, or factor.
To obtain the names of the variables in a data frame, use
names()
or colnames()
. To get the number of
rows in a data frame, use nrow()
or dim()
.
# Let's create a simple data frame with the data.frame function
my_data <- data.frame(student_id = c(100234, 132454, 453123),
test_1_grade = c(82, 93, 87),
hw_1_grade = c(92, 89, 98),
session = c("7 AM", "7 PM", "7 AM"))
# View the data
my_data
We can obtain the column names of our data frame by either using
names()
or colnames()
.
# Get the variable names
names(my_data)
[1] "student_id" "test_1_grade" "hw_1_grade" "session"
colnames(my_data)
[1] "student_id" "test_1_grade" "hw_1_grade" "session"
To get the number of rows or columns in a data frame, we can use
nrow()
, ncol()
, or dim()
.
ncol(my_data)
[1] 4
nrow(my_data)
[1] 3
# Number of rows and columns
dim(my_data)
[1] 3 4
In this section we will cover the most common operators in
R
that are used to manipulate data structures.
The <-
operator is used to create variables in
R
. This operator will assign what is to the right of it to
the variable name on the left side. We’ve been using this throughout the
tutorial.
A keyboard shortcut for <-
is ‘Alt’ + ‘-’ in
Windows.
my_vector <- c(10, 20)
my_vector
[1] 10 20
Standard mathematical operations, such as addition and
multiplication, are available in R
. In the examples below,
these operations are demonstrated with the appropriate operators.
R
computes operations in a vectorized
manner, meaning that if you add two vectors, for example, the addition
is performed element-wise within the corresponding vectors.
If you multiple a vector by a single number, all elements of the vector are multiplied by that number. This is commonly referred to as broadcasting.
# + operator adds two vectors, element-wise
v_1 <- c(2, 4, 7)
v_2 <- c(2, 5, 8)
v_1 + v_2
[1] 4 9 15
# - operator subtracts two vectors
v_1 - v_2
[1] 0 -1 -1
# * operator multiples two vectors
v_1 * v_2
[1] 4 20 56
# Multiplication by a constant
5*v_1
[1] 10 20 35
# Exponentiation
v_1 ^ 2
[1] 4 16 49
# / operator divides two vectors
v_1/v_2
[1] 1.000 0.800 0.875
Relational operators, such as <=
or
>
, are used to compare data values to each other. The
results from using relational operators will always return a logical
vector.
# Check where elements of v_1 are greater than v_2
# Produces a logical vector
v_1 > v_2
[1] FALSE FALSE FALSE
# >= greater than or equal to
v_1 >= v_2
[1] TRUE FALSE FALSE
# <
v_1 < v_2
[1] FALSE TRUE TRUE
# <=
v_1 <= v_2
[1] TRUE TRUE TRUE
# == operator checks for equality
v_1 == v_2
[1] TRUE FALSE FALSE
# != operator finds where the two vectors are not equal
v_1 != v_2
[1] FALSE TRUE TRUE
# Check where v_1 is greater than 2
v_1 > 2
[1] FALSE TRUE TRUE
Logical operators, such as AND (&
), OR
(|
), and NOT (!
), are used to perform
calculations with logical data types in R
. These are
important for filtering rows of data frames where variables meet certain
conditions.
a <- 5
b <- 10
The &
operator represents the logical AND operation.
It will return TRUE if both conditions on the left and right of it are
TRUE.
a == 5 & b == 10
[1] TRUE
For vectors, all pairwise elements are compared.
v_1 >= 3 & v_2 >= 2
[1] FALSE TRUE TRUE
The |
operator represents the logical OR operation. It
will return TRUE if either one of the conditions on the left and right
of it are TRUE.
a > 6 | b > 7
[1] TRUE
For vectors, all pairwise elements are compared.
v_1 > 5 | v_2 > 5
[1] FALSE FALSE TRUE
Finally, !
is used for negation. This means it will
convert all TRUE values to FALSE and FALSE values to TRUE.
L1 <- c(TRUE, FALSE, TRUE)
L1
[1] TRUE FALSE TRUE
# Give L1 opposite logical values
!L1
[1] FALSE TRUE FALSE
Be careful when using the !
operator. It is always best
to enclose any logical test within parentheses to make sure you are
getting the negation.
Without parentheses in the code below, !v_1 >= v_2
would be carried out as ‘negate v_1 and test whether it is greater than
v_2’. This is different from ‘test where v_1 is greater than v_2 and
negate the result’ that is executed by the code below.
# Find where v_1 is not greater than or equal to v_2
!(v_1 >= v_2)
[1] FALSE TRUE TRUE
When you want to extract elements of a vector that meet a logical
condition, or vectors stored in lists or data frames, you will have to
subset the R
objects with the [ ]
,
[[ ]]
or $
functions. This is best
demonstrated with some examples.
It’s possible to subset vectors in R
by using logical or
numeric vectors. I will demonstrate both methods below.
number_vector <- c(1, 3, 6, 10)
The code below produces a logical vector that indicates where
number_vector
is greater than 5
number_vector > 5
[1] FALSE FALSE TRUE TRUE
We can use the logical vector result from above to get only the
elements from number_vector
that are greater that 5. We
just place the logical condition within the [ ]
function to
the right of the original vector.
number_vector[number_vector > 5]
[1] 6 10
We can also use relational or logical operators.
number_vector > 3 | number_vector <= 10
[1] TRUE TRUE TRUE TRUE
number_vector[number_vector > 3 | number_vector <= 10]
[1] 1 3 6 10
We can also subset vectors with a numeric indexing vector. The code
below returns the 2nd and 4th elements from
number_vector
.
number_vector[c(2, 4)]
[1] 3 10
Unlike many other programming languages, elements within
R
data structures start at 1, not 0.
number_vector[1]
[1] 1
Remember that lists are collections of various data structures. To
access the data structures stored within lists, we can use the
[ ]
, [[ ]]
or $
functions.
If you have a list, call it my_list
, then
my_list[1]
returns a list with the first element of
my_list
. This is different from my_list[[1]]
,
which returns the contents of the first element of
my_list
.
If you imagine that my_list
is a large box that contains
10 small boxes, then my_list[1]
returns the first box
(which is still a box), while my_list[[1]]
returns the
contents of the first box (which may no longer be a box).
student_list <- list(student_id = c(12, 15),
section = c('001', '003'),
age = c(26, 20))
The code below will return a list that contains the first element of
student_list
.
student_list[1]
$student_id
[1] 12 15
# Check properties with str()
str(student_list[1])
List of 1
$ student_id: num [1:2] 12 15
If we want to extract the first element from the list, we need to use
[[ ]]
. Notice that str()
now tells use that we
got a numeric vector.
student_list[[1]]
[1] 12 15
# Check properties with str()
str(student_list[[1]])
num [1:2] 12 15
You can subset lists with the names of the elements stored in it. The code below is eqivalent to the code from above.
student_list[["student_id"]]
[1] 12 15
What if we want to extract the first number from the
student_id
vector that is in the first position of
student_list
? We will have to use [[ ]]
followed by [ ]
student_list[["student_id"]][1]
[1] 12
The $
operator is a shortcut for extracting elements
from a list and works like [[ ]]
# This will extract the student_id vector
student_list$student_id
[1] 12 15
This is an alternate way to get the first element of the
student_id
vector within student_list
.
student_list$student_id[1]
[1] 12
To subset rows and columns of a data frame we can use the following syntax:
my_data_frame[row condition, column condition]
The row/column conditions may be either numeric indexes, logical expressions, or vectors of column names (for column selection)
my_data_frame <- data.frame(make = c("Toyota","Honda","Ford", "Toyota",
"Ford", "Honda"),
mpg = c(34, 33, 22, 32, 29, 27),
cylinders = c(4, 4, 8, 6, 6, 8))
# View data
my_data_frame
In the example below, we select rows 1 - 3 and columns 1 - 2.
Remember that the :
functions creates a sequence of number
values. 1:3
will create the following vector [1, 2, 3].
# First three rows, first two columns
my_data_frame[1:3,1:2]
my_data_frame[c(1, 2, 3), c(1, 2)]
# row 2, columns 1 and 3
my_data_frame[2, c(1, 3)]
Leaving a row or column condition blank will return all values along that axis.
# Rows 1 and 2, all columns
my_data_frame[1:2, ]
We can also pass logical vectors into the row condition to obtain a
subset of our data. Let’s demonstrate this by selecting rows that have
mpg
values greater than or equal to 30.
# Create logical vector
logical_condition <- my_data_frame$mpg >= 30
# View results
logical_condition
[1] TRUE TRUE FALSE TRUE FALSE FALSE
Now we pass this logical vector into the row condition.
# Use to subset data
my_data_frame[logical_condition, ]
In practice, you would write this in one step.
my_data_frame[my_data_frame$mpg >= 30, ]
Both methods of indexing can be combined in a single expression. Below are some examples.
# All rows with at least 32 mpg, columns 2 and 3
my_data_frame[my_data_frame$mpg >= 32, c(2, 3)]
# The same conditions, but using column names
my_data_frame[my_data_frame$mpg >= 32, c("mpg", "cylinders")]
Just like with lists, a single [] will return a data frame and a double [[ ]] will extract a column vector from a data frame.
# This returns a data frame
my_data_frame[1]
str(my_data_frame[1])
'data.frame': 6 obs. of 1 variable:
$ make: chr "Toyota" "Honda" "Ford" "Toyota" ...
# This extracts the first column
my_data_frame[[1]]
[1] "Toyota" "Honda" "Ford" "Toyota" "Ford" "Honda"
str(my_data_frame[[1]])
chr [1:6] "Toyota" "Honda" "Ford" "Toyota" "Ford" "Honda"
The above can also be accomplished by using the name of the first column.
my_data_frame[['make']]
[1] "Toyota" "Honda" "Ford" "Toyota" "Ford" "Honda"
Just like with lists, we can use $
instead of
[[ ]]
. In this case, you must use the name of the
column.
my_data_frame$make
[1] "Toyota" "Honda" "Ford" "Toyota" "Ford" "Honda"
In this exercise, we will be working with the following list
my_list <- list(classes_offered = c("MIS 431", "MIS 310", "MIS 410", "MIS 412"),
student_data = data.frame(student_id = c(54, 100, 32, 423,
2, 19, 39),
age = c(18, 22, 27, 18, 29,
22, 20),
gpa = c(3.1, 2.8, 3.7, 3.4, 3.2,
3.4, 3.2),
stringsAsFactors = FALSE))
Write the R
code that calculates the median value (use
the median()
function) of the gpa
variable in
student_data
. All you need to do is pass the
gpa
vector into the median()
function.
To read about the median()
function, just execute the
following in your R
session: ?median
Subset my_data_frame
to only include rows that have a
cylinders
value of 4.
Copyright © David Svancer 2023 |