As discussed in the slides, we will use functions from tidyr
(automatically loaded with the tidyverse
) to manipulate our data.
gather
for wide to narrowLet’s read in our data set.
First we need to call in our libraries:
library(tidyverse)
Now we can read in our data using readr
’s read_csv
function
gap_wide <- read_csv("data/gapminder_wide.csv")
## Parsed with column specification:
## cols(
## .default = col_integer()
## )
## See spec(...) for full column specifications.
Go ahead and look at the data in the environment pane
Now we need to use the gather
function:
gap_narrow <- gap_wide %>%
gather(key = "country", value = "population", -year)
Now we can inspect our work:
head(gap_narrow)
As a teaser, this can be used to make some very slick graphics
gap_narrow %>%
filter(country %in% c("India", "China", "Germany", "France")) %>%
ggplot(aes(year, population))+
geom_smooth()+
facet_wrap(~country)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
spread
Sometimes data come in a long format where it needs to be a little wider to fit our tidy paradigm.
Let’s experiment with spreading data. Let’s read in the health_long.xlsx
file from our data folder.
This requires the readxl
package.
library(readxl)
Now let’s read it into memory.
health_long <- read_excel("data/health_long.xlsx")
We can look at the format of the data
head(health_long)
It appears that each subject appears in multiple rows with repeating attributes in the “measurement” column.
Thus we want to spread measurement (which will make new column names) and then have the values in these new columns be those values in the value column.
health_long %>%
spread(key = measurement, value = value)
Read in the heights.dta
file. This file contains data on some physical attributes and earnings of subjects in a study.
Hint: you will need to use the haven
package. Put the data into a tidy format. You will note that it is currently in a long
format.
library(haven)
heights_raw <- read_dta("data/heights.dta")
Long -> Wide
heights_raw %>%
spread(key = description, value = value, fill = NA)
Read in the gapminder.sav
data set and collapse all the metrics for life expectancy, population and GDP per Population into two columns one for parameter_name
and the other for value
.
gapminder<-read_sav("data/gapminder.sav")
Hint: you will need to use the -group_1, -group_2, etc
syntax to not collapse the grouping variables that you wish to keep.
gapminder %>%
gather(key = parameter_name, value = value, -country, -continent, -year)
## Warning: attributes are not identical across measure variables;
## they will be dropped
Using select
to subset
From the gapminder
data set that you already read into memory, select the “year” and “pop” columns
gapminder %>%
select(year, pop)
Use the unite
function to combine “year” and “country” into one column called country_year with values separated by a “-”. Save this into an object called “unite_demo.”
# In this case I will specify the two columns I want to joinh
# Equivalently I could have used the -lifeExp, -Continent, -pop, -gpdPercap and gotten the same results
unite_demo <- gapminder %>%
unite(country_year, sep = "-", c(year, country))
unite_demo
Now convert this new data set back into the “year” and “country” columns using separate
.
unite_demo %>%
separate(col = country_year, into = c("country", "year"), sep = "-")
Now let’s return to the gap_narrow
dataset and filter the “year” field for values for “1977” only using the filter
function and the ==
operator
gap_narrow %>%
filter(year == 1977)
Using the gap_narrow
dataset let’s print the row with the maximum population.
Let’s combine a few operations together using the pipe %>%
.
Take the “health_long” data set
spread
to a tidy format
rename
the “subject” column to “subject_id”
use mutate
to create a value for total cholesterol (e.g. total_cholesterol = lhl + hdl)
health_long %>%
spread(key = measurement, value = value, fill = NA) %>%
rename(subject_id = subject) %>%
mutate(total_cholesterol = ldl + hdl)
Of course if I want to write this data out I can do so:
my_new_health_data <- health_long %>%
spread(key = measurement, value = value, fill = NA) %>%
rename(subject_id = subject) %>%
mutate(total_cholesterol = ldl + hdl)
write_csv(my_new_health_data, "outputs/my_new_health_data.csv")
I discussed group_by
for group-wise operations. Taking our gapminder data we could summarise the world population by year.
gapminder %>%
group_by(year) %>%
summarise(total_pop = max(pop))
Or by year and continent:
gapminder %>%
group_by(year, continent) %>%
summarise(total_pop = max(pop))
This then lets us do some neat graphing
gapminder %>%
group_by(year, continent) %>%
summarise(total_pop = max(pop)) %>%
ggplot(aes(year, total_pop,
group = continent, color = as_factor(continent)))+
geom_line()+
scale_y_log10()+
labs(
title = "Population Trends Over Time",
subtitle = "From The Hans Rosling's Gapminder Dataset",
y = "Population (log10)",
x = NULL,
color = "Continent"
)+
theme_minimal()
## Don't know how to automatically pick scale for object of type labelled. Defaulting to continuous.
Introduction to R
dewittme.wfu.edu
Office of Institutional Research
309 Reynolda Hall
Winston- Salem, NC, 27106
Copyright © 2018 Michael DeWitt. All rights reserved.