Data Manipulation Exercise
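Ensure the tidyverse package is installed and loaded with library(tidyverse); Exercises 6 and 7 use gather() and separate() from tidyr.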
Exercise 1
The iris dataset ships with base R, in the datasets package:
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Find out how many observations and variables there are in the dataset. Also, find out what type of variables the dataset contains.
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
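If only the counts and column types are needed, rather than the full str() listing, a minimal base R sketch could look like this. The name col_classes is introduced here purely for illustration.

```r
# Dimensions: rows (observations) first, then columns (variables)
dim(iris)
nrow(iris)
ncol(iris)

# Class of each column, returned as a named character vector
col_classes <- sapply(iris, class)
col_classes
```

This reports the same information as str(iris): four numeric measurement columns and one factor column.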
Exercise 2
What are the mean, median and variance of each of the numeric variables in the dataset?
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
var(iris$Sepal.Length)
## [1] 0.6856935
var(iris$Sepal.Width)
## [1] 0.1899794
var(iris$Petal.Length)
## [1] 3.116278
var(iris$Petal.Width)
## [1] 0.5810063
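The separate summary() and var() calls above can also be collapsed into one table. The following is a sketch in base R, not the only way to do it; numeric_cols and stats are names chosen here for illustration.

```r
# Keep only the numeric columns (drops the Species factor)
numeric_cols <- iris[sapply(iris, is.numeric)]

# One column per variable, one row per statistic
stats <- sapply(numeric_cols, function(x) {
  c(mean = mean(x), median = median(x), variance = var(x))
})
round(stats, 3)
```

Selecting columns with is.numeric avoids hard-coding the four column names, so the same lines work on other data frames.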
Exercise 3
What species of Iris are included in the dataset?
unique(iris$Species)
## [1] setosa versicolor virginica
## Levels: setosa versicolor virginica
Exercise 4
Calculate the mean, median and variance for each measurement type for each species of Iris.
setosa_index <- which(iris$Species == "setosa")
versicolor_index <- which(iris$Species == "versicolor")
virginica_index <- which(iris$Species == "virginica")
summary(iris[setosa_index,])
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.300 Min. :1.000 Min. :0.100
## 1st Qu.:4.800 1st Qu.:3.200 1st Qu.:1.400 1st Qu.:0.200
## Median :5.000 Median :3.400 Median :1.500 Median :0.200
## Mean :5.006 Mean :3.428 Mean :1.462 Mean :0.246
## 3rd Qu.:5.200 3rd Qu.:3.675 3rd Qu.:1.575 3rd Qu.:0.300
## Max. :5.800 Max. :4.400 Max. :1.900 Max. :0.600
## Species
## setosa :50
## versicolor: 0
## virginica : 0
##
##
##
summary(iris[versicolor_index,])
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.900 Min. :2.000 Min. :3.00 Min. :1.000
## 1st Qu.:5.600 1st Qu.:2.525 1st Qu.:4.00 1st Qu.:1.200
## Median :5.900 Median :2.800 Median :4.35 Median :1.300
## Mean :5.936 Mean :2.770 Mean :4.26 Mean :1.326
## 3rd Qu.:6.300 3rd Qu.:3.000 3rd Qu.:4.60 3rd Qu.:1.500
## Max. :7.000 Max. :3.400 Max. :5.10 Max. :1.800
## Species
## setosa : 0
## versicolor:50
## virginica : 0
##
##
##
summary(iris[virginica_index,])
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.900 Min. :2.200 Min. :4.500 Min. :1.400
## 1st Qu.:6.225 1st Qu.:2.800 1st Qu.:5.100 1st Qu.:1.800
## Median :6.500 Median :3.000 Median :5.550 Median :2.000
## Mean :6.588 Mean :2.974 Mean :5.552 Mean :2.026
## 3rd Qu.:6.900 3rd Qu.:3.175 3rd Qu.:5.875 3rd Qu.:2.300
## Max. :7.900 Max. :3.800 Max. :6.900 Max. :2.500
## Species
## setosa : 0
## versicolor: 0
## virginica :50
##
##
##
var(iris[setosa_index,"Sepal.Length"])
## [1] 0.124249
var(iris[setosa_index,"Sepal.Width"])
## [1] 0.1436898
#etc
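Rather than repeating var() for every species and column, the per-species statistics can be computed in one call per statistic. This is a sketch using base R's aggregate() with its formula interface; the names species_mean, species_median and species_var are illustrative.

```r
# "." stands for every column other than Species;
# each call returns one row per species and one column per measurement
species_mean   <- aggregate(. ~ Species, data = iris, FUN = mean)
species_median <- aggregate(. ~ Species, data = iris, FUN = median)
species_var    <- aggregate(. ~ Species, data = iris, FUN = var)

species_var
```

The setosa row of species_var should agree with the var() values shown above.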
Exercise 5
Rename the variables as shown below and store the updated dataset as iris.newnames.
newnames <- c("SL", "SW", "PL", "PW", "Species")
iris.newnames <- iris
names(iris.newnames) <- newnames
## SL SW PL PW Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Exercise 6
Restructure iris as shown below and save the result as iris2.
iris2 <- gather(iris, key = type, value = value, c(Sepal.Length,Sepal.Width,Petal.Length,Petal.Width))
## Species type value
## 1 setosa Sepal.Length 5.1
## 2 setosa Sepal.Length 4.9
## 3 setosa Sepal.Length 4.7
## 4 setosa Sepal.Length 4.6
## 5 setosa Sepal.Length 5.0
## 6 setosa Sepal.Length 5.4
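The tidyr documentation recommends pivot_longer() over gather() for new code. As a sketch, assuming the tidyr package is installed, the same reshaping could be written as follows. The name iris2_longer is illustrative. Note that pivot_longer() emits rows flower by flower, whereas gather() stacks one column after another, so the result is reordered here to match the layout shown above.

```r
library(tidyr)

# Every column except Species becomes a (type, value) pair
iris2_longer <- pivot_longer(iris, cols = -Species,
                             names_to = "type", values_to = "value")

# Reorder by original column position to match the column-by-column layout
column_order <- match(iris2_longer$type, names(iris)[1:4])
iris2_longer <- iris2_longer[order(column_order), ]

head(iris2_longer)
```

The result is a tibble rather than a plain data.frame; wrap it in as.data.frame() if that matters downstream.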
Exercise 7
Restructure iris2 as shown below and save the result as iris3.
iris3 <- separate(iris2, col = type, into = c("part", "measure"), sep = "\\.")
## Species part measure value
## 1 setosa Sepal Length 5.1
## 2 setosa Sepal Length 4.9
## 3 setosa Sepal Length 4.7
## 4 setosa Sepal Length 4.6
## 5 setosa Sepal Length 5.0
## 6 setosa Sepal Length 5.4
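The sep argument of separate() is interpreted as a regular expression, which is why the dot is escaped as "\\." in the call above. The effect of the escaping can be illustrated with base R's strsplit(), which treats its pattern the same way; the variable names below are illustrative.

```r
# Escaped dot: splits on the literal "." character
escaped <- strsplit("Sepal.Length", "\\.")[[1]]
escaped

# Unescaped dot: "." matches any character, so every character is
# consumed as a separator and only empty strings remain
unescaped <- strsplit("Sepal.Length", ".")[[1]]
unescaped

# fixed = TRUE turns off regex matching, so no escaping is needed
fixed <- strsplit("Sepal.Length", ".", fixed = TRUE)[[1]]
fixed
```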
Exercise 8
Restructure iris as shown below and save the result as iris4. You do not always have to use spread(), gather(), etc. How else could you achieve this?
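# Stack the sepal and petal measurements: 150 flowers x 2 parts = 300 rows
Length <- c(iris$Sepal.Length, iris$Petal.Length)
Width <- c(iris$Sepal.Width, iris$Petal.Width)
Part <- c(rep("Sepal", 150), rep("Petal", 150))
# Repeat Species once per part so it lines up with the stacked measurements;
# stringsAsFactors = TRUE keeps Part a factor on R >= 4.0 as well
iris4 <- data.frame(Species = rep(iris$Species, 2), Length, Width, Part,
                    stringsAsFactors = TRUE)
str(iris4)
## 'data.frame': 300 obs. of 4 variables: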
## $ Species: Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Length : num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Part : Factor w/ 2 levels "Petal","Sepal": 2 2 2 2 2 2 2 2 2 2 ...
## Species Length Width Part
## 1 setosa 5.1 3.5 Sepal
## 2 setosa 4.9 3.0 Sepal
## 3 setosa 4.7 3.2 Sepal
## 4 setosa 4.6 3.1 Sepal
## 5 setosa 5.0 3.6 Sepal
## 6 setosa 5.4 3.9 Sepal
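For comparison, the same shape can also be reached with tidyr in a single call. This is a sketch, assuming tidyr is installed; it relies on the special ".value" entry in names_to, which tells pivot_longer() to use the second half of each column name (Length or Width) as an output column. The name iris4_tidy is illustrative.

```r
library(tidyr)

# "Sepal.Length" is split at the dot into Part = "Sepal" and the
# output column "Length"; likewise for the other three measurements
iris4_tidy <- pivot_longer(iris, cols = -Species,
                           names_to = c("Part", ".value"),
                           names_sep = "\\.")

str(iris4_tidy)
```

This gives one row per flower and part (150 x 2 = 300 rows), with the rows for each flower kept together rather than all sepals first.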