As discussed in the slides, we will use functions from tidyr (automatically loaded with the tidyverse) to manipulate our data.
gather for wide to narrow
Let’s read in our data set.
First we need to load our libraries:
library(tidyverse)
Now we can read in our data using readr’s read_csv function:
gap_wide <- read_csv("data/gapminder_wide.csv")
## Parsed with column specification:
## cols(
## .default = col_integer()
## )
## See spec(...) for full column specifications.
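Go ahead and look at the data in the environment pane.
Now we use the gather function to collapse every column except year into two columns: the old column names go into country and the cell values go into population.
gap_narrow <- gap_wide %>%
  gather(key = "country", value = "population", -year)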
Now we can inspect our work:
head(gap_narrow)
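To see what gather does without the full data set, here is a minimal sketch on a made-up table. The toy_wide tibble and its country names are invented for illustration; only the gather call mirrors the one used on gap_wide.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one row per year, one column per country
toy_wide <- tibble(
  year = c(2000L, 2005L),
  Aland = c(10, 12),
  Borduria = c(20, 25)
)

# Collapse every column except year into country/population pairs
toy_narrow <- toy_wide %>%
  gather(key = "country", value = "population", -year)

toy_narrow
```

Each of the two country columns contributes one row per year, so the two-row wide table becomes a four-row narrow table.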
As a teaser, this can be used to make some very slick graphics:
gap_narrow %>%
  filter(country %in% c("India", "China", "Germany", "France")) %>%
  ggplot(aes(year, population)) +
  geom_smooth() +
  facet_wrap(~country)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

spread
Sometimes data come in a long format and need to be made a little wider to fit our tidy paradigm.
Let’s experiment with spreading data. Let’s read in the health_long.xlsx file from our data folder.
This requires the readxl package.
library(readxl)
Now let’s read it into memory.
health_long <- read_excel("data/health_long.xlsx")
We can look at the format of the data:
head(health_long)
It appears that each subject appears in multiple rows with repeating attributes in the “measurement” column.
Thus we want to spread the measurement column, whose values become the new column names, and fill those new columns with the corresponding entries from the value column.
health_long %>%
  spread(key = measurement, value = value)
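The same idea can be sketched on a few made-up rows. The toy_long tibble and its numbers are hypothetical; the column names subject, measurement and value follow the ones used for health_long.

```r
library(dplyr)
library(tidyr)

# Hypothetical long data: each subject appears once per measurement
toy_long <- tibble(
  subject = c(1, 1, 2, 2),
  measurement = c("hdl", "ldl", "hdl", "ldl"),
  value = c(55, 120, 48, 135)
)

# Each distinct value of measurement becomes its own column
toy_tidy <- toy_long %>%
  spread(key = measurement, value = value)

toy_tidy
```

After spreading, each subject occupies a single row with one column per measurement.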
Read in the heights.dta file. This file contains data on some physical attributes and earnings of subjects in a study.
Hint: you will need to use the haven package. Put the data into a tidy format. You will note that it is currently in a long format.
library(haven)
heights_raw <- read_dta("data/heights.dta")
Long -> Wide
heights_raw %>%
  spread(key = description, value = value, fill = NA)
Read in the gapminder.sav data set and collapse all the metrics for life expectancy, population and GDP per capita into two columns: one for parameter_name and the other for value.
gapminder<-read_sav("data/gapminder.sav")
Hint: you will need to use the -group_1, -group_2, etc. syntax to exclude the grouping variables you wish to keep from being collapsed.
gapminder %>%
  gather(key = parameter_name, value = value, -country, -continent, -year)
## Warning: attributes are not identical across measure variables;
## they will be dropped
Using select to subset
From the gapminder data set that you already read into memory, select the “year” and “pop” columns.
gapminder %>%
  select(year, pop)
Use the unite function to combine “year” and “country” into one column called country_year with values separated by a “-”. Save this into an object called “unite_demo.”
# In this case I will specify the two columns I want to join
# Equivalently I could have excluded the others with -lifeExp, -continent, -pop, -gdpPercap,
# which selects the same two columns (joined in the order they appear in the data)
unite_demo <- gapminder %>%
  unite(country_year, c(year, country), sep = "-")
unite_demo
Now convert this new data set back into the “year” and “country” columns using separate.
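# unite_demo stores values as "year-country", so `into` must list year first;
# extra = "merge" keeps hyphenated country names (e.g. "Guinea-Bissau") in one piece
unite_demo %>%
  separate(col = country_year, into = c("year", "country"), sep = "-", extra = "merge")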
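A small round trip on made-up rows shows why the order of the names in into matters and how a hyphen inside a country name can be handled. The toy tibble is hypothetical; extra = "merge" and convert = TRUE are optional arguments of separate that are chosen here as one reasonable way to recover the original columns.

```r
library(dplyr)
library(tidyr)

# Hypothetical data, including a country name that contains the separator
toy <- tibble(
  year = c(1952L, 1952L),
  country = c("France", "Guinea-Bissau"),
  pop = c(42, 0.6)
)

# Combine year and country into "year-country"
toy_united <- toy %>%
  unite(country_year, c(year, country), sep = "-")

# Split only at the first "-" and convert year back to an integer
toy_back <- toy_united %>%
  separate(col = country_year, into = c("year", "country"),
           sep = "-", extra = "merge", convert = TRUE)

toy_back
```

Without extra = "merge", the second row would be split into three pieces and the trailing piece would be dropped with a warning.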
Now let’s return to the gap_narrow dataset and use the filter function with the == operator to keep only the rows where “year” is 1977.
gap_narrow %>%
  filter(year == 1977)
Using the gap_narrow dataset let’s print the row with the maximum population.
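One way to do this, sketched on made-up data so that it runs on its own, is to filter for the rows whose population equals the column maximum. The toy_narrow tibble is hypothetical; on the real data the same filter call would be applied to gap_narrow. Using na.rm = TRUE assumes that missing populations should simply be ignored.

```r
library(dplyr)

# Hypothetical stand-in for gap_narrow
toy_narrow <- tibble(
  year = c(1977L, 1977L, 2007L, 2007L),
  country = c("Aland", "Borduria", "Aland", "Borduria"),
  population = c(10, 20, 12, NA)
)

# Keep only the row(s) where population equals the overall maximum
biggest <- toy_narrow %>%
  filter(population == max(population, na.rm = TRUE))

biggest
```

If several rows tie for the maximum, this approach returns all of them.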
Let’s combine a few operations together using the pipe %>%.
Take the “health_long” data set
spread to a tidy format
rename the “subject” column to “subject_id”
use mutate to create a value for total cholesterol (e.g. total_cholesterol = ldl + hdl)
health_long %>%
  spread(key = measurement, value = value, fill = NA) %>%
  rename(subject_id = subject) %>%
  mutate(total_cholesterol = ldl + hdl)
Of course if I want to write this data out I can do so:
my_new_health_data <- health_long %>%
  spread(key = measurement, value = value, fill = NA) %>%
  rename(subject_id = subject) %>%
  mutate(total_cholesterol = ldl + hdl)

write_csv(my_new_health_data, "outputs/my_new_health_data.csv")
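Writing to a folder such as outputs typically fails if that folder does not exist yet, so it can help to create it first. The sketch below assumes a temporary directory and a made-up data frame (toy_health, out_dir and out_file are hypothetical names) so that it can be run anywhere without touching a project.

```r
library(readr)

# Hypothetical output location inside the session's temporary directory
out_dir <- file.path(tempdir(), "outputs")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
out_file <- file.path(out_dir, "toy_health_data.csv")

# Hypothetical data standing in for my_new_health_data
toy_health <- data.frame(
  subject_id = c(1, 2),
  total_cholesterol = c(175, 183)
)

# Write the file, then read it back to confirm the round trip
write_csv(toy_health, out_file)
read_back <- read_csv(out_file)

read_back
```

In a real project, the same dir.create call could point at "outputs" before calling write_csv.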
I discussed group_by for group-wise operations. Taking our gapminder data we could summarise the world population by year.
# sum() adds up every country; max() would return only the largest single country
gapminder %>%
  group_by(year) %>%
  summarise(total_pop = sum(pop))
Or by year and continent:
gapminder %>%
  group_by(year, continent) %>%
  summarise(total_pop = sum(pop))
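The choice of summary function inside summarise determines what each group is reduced to. A minimal sketch on made-up numbers (the toy_pop tibble is hypothetical) contrasts sum, which gives a group total, with max, which gives the largest single value in the group.

```r
library(dplyr)

# Hypothetical populations for two countries in two years
toy_pop <- tibble(
  year = c(2000L, 2000L, 2005L, 2005L),
  country = c("Aland", "Borduria", "Aland", "Borduria"),
  pop = c(10, 20, 12, 25)
)

# One output row per year, with two different summaries of pop
toy_summary <- toy_pop %>%
  group_by(year) %>%
  summarise(
    total_pop = sum(pop),   # total across all countries in that year
    largest_pop = max(pop)  # largest single country in that year
  )

toy_summary
```

For a world or continent total, sum is the summary to use; max answers a different question.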
This then lets us do some neat graphing:
gapminder %>%
  group_by(year, continent) %>%
  summarise(total_pop = sum(pop)) %>%
  ggplot(aes(year, total_pop,
             group = continent, color = as_factor(continent))) +
  geom_line() +
  scale_y_log10() +
  labs(
    title = "Population Trends Over Time",
    subtitle = "From Hans Rosling's Gapminder Dataset",
    y = "Population (log10)",
    x = NULL,
    color = "Continent"
  ) +
  theme_minimal()
## Don't know how to automatically pick scale for object of type labelled. Defaulting to continuous.

Introduction to R
dewittme.wfu.edu
Office of Institutional Research
309 Reynolda Hall
Winston-Salem, NC 27106
Copyright © 2018 Michael DeWitt. All rights reserved.