From long to wide
The original data
Let’s take the gapminder data again. To help visualize how data are reshaped, let’s retain only data available after year 2000.
library(tidyverse)
library(gapminder)
data <- gapminder %>% filter(year >2000)
data
## # A tibble: 284 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 2002 42.1 25268405 727.
## 2 Afghanistan Asia 2007 43.8 31889923 975.
## 3 Albania Europe 2002 75.7 3508512 4604.
## 4 Albania Europe 2007 76.4 3600523 5937.
## 5 Algeria Africa 2002 71.0 31287142 5288.
## 6 Algeria Africa 2007 72.3 33333216 6223.
## 7 Angola Africa 2002 41.0 10866106 2773.
## 8 Angola Africa 2007 42.7 12420476 4797.
## 9 Argentina Americas 2002 74.3 38331121 8798.
## 10 Argentina Americas 2007 75.3 40301927 12779.
## # ... with 274 more rows
You can see that for each country, there are several lines, each corresponding to one year. This is typical of panel data where you collected the same information about a respondent at different times.
What do you want to obtain?
If your data set is originally in a long format, there may be cases where you want to reshape it in long format. The diagram below illustrates what is the objective:
Reshape one variable only
Let say you want to have a new table where:
- each line corresponds to a unique country
- each line contains information about the population for the different years
Since our original data has a long format, we will use the command pivot_wider()
. This function is part of the tidyr
package, a member of the tidyverse ecosystem, so do not forget to load tidyr
or the tidyverse
package:
data %>%
select(country, continent, year, pop) %>% # we keep the interesting variables
pivot_wider(names_from = year, values_from = c(pop), names_prefix = "popul") %>%
arrange(continent, country) # we sort
## # A tibble: 142 x 4
## country continent popul2002 popul2007
## <fct> <fct> <int> <int>
## 1 Algeria Africa 31287142 33333216
## 2 Angola Africa 10866106 12420476
## 3 Benin Africa 7026113 8078314
## 4 Botswana Africa 1630347 1639131
## 5 Burkina Faso Africa 12251209 14326203
## 6 Burundi Africa 7021078 8390505
## 7 Cameroon Africa 15929988 17696293
## 8 Central African Republic Africa 4048013 4369038
## 9 Chad Africa 8835739 10238807
## 10 Comoros Africa 614382 710960
## # ... with 132 more rows
We created two new columns which names are a combination of the names_prefix (here popul
) and the year (which we declared using the argument names_from). Each cell in these columns correspond to the countries' population in the corresponding years.
Reshape several variables
Let’s now assume that you want to have another table where:
- each line corresponds to a unique country
- each line contains information about the population and life expectancy for the 2000’s
Since there are two variables to be reshaped, the names_prefix is now automatically taken from the names_from
argument.
data %>%
select(country, continent, year, pop, lifeExp) %>% # we keep the interesting variables
pivot_wider(names_from = year, values_from = c(pop, lifeExp)) %>% #select variables to reshape
arrange(continent, country) # we sort
## # A tibble: 142 x 6
## country continent pop_2002 pop_2007 lifeExp_2002 lifeExp_2007
## <fct> <fct> <int> <int> <dbl> <dbl>
## 1 Algeria Africa 31287142 33333216 71.0 72.3
## 2 Angola Africa 10866106 12420476 41.0 42.7
## 3 Benin Africa 7026113 8078314 54.4 56.7
## 4 Botswana Africa 1630347 1639131 46.6 50.7
## 5 Burkina Faso Africa 12251209 14326203 50.6 52.3
## 6 Burundi Africa 7021078 8390505 47.4 49.6
## 7 Cameroon Africa 15929988 17696293 49.9 50.4
## 8 Central African Republ~ Africa 4048013 4369038 43.3 44.7
## 9 Chad Africa 8835739 10238807 50.5 50.7
## 10 Comoros Africa 614382 710960 63.0 65.2
## # ... with 132 more rows
To go further
The cases presented are the most basic uses of pivot_wider()
.
More details about pivot_wider can be found at: pivot_wider
Exercise: Wage data
For this exercise, we will use the data set WageData
that is coming with the panelr
package. The data come from the years 1976-1982 in the Panel Study of Income Dynamics (PSID), with information about the demographics and earnings of 595 individuals.
library(panelr)
data("WageData")
head(WageData)
## exp wks occ ind south smsa ms fem union ed blk lwage t id
## 1 3 32 0 0 1 0 1 0 0 9 0 5.56068 1 1
## 2 4 43 0 0 1 0 1 0 0 9 0 5.72031 2 1
## 3 5 40 0 0 1 0 1 0 0 9 0 5.99645 3 1
## 4 6 39 0 0 1 0 1 0 0 9 0 5.99645 4 1
## 5 7 42 0 1 1 0 1 0 0 9 0 6.06146 5 1
## 6 8 35 0 1 1 0 1 0 0 9 0 6.17379 6 1
The key columns are id
and t
. They tell you which respondent and which time point the row refers to, respectively. We will also retain the variable ed
that records the education level.
(wages <- WageData %>% select(id, t, lwage, ed)%>% as_tibble() )
## # A tibble: 4,165 x 4
## id t lwage ed
## <dbl> <dbl> <dbl> <dbl>
## 1 1 1 5.56 9
## 2 1 2 5.72 9
## 3 1 3 6.00 9
## 4 1 4 6.00 9
## 5 1 5 6.06 9
## 6 1 6 6.17 9
## 7 1 7 6.24 9
## 8 2 1 6.16 11
## 9 2 2 6.21 11
## 10 2 3 6.26 11
## # ... with 4,155 more rows
Task: Reshape wages it into a wide format, so that:
- one line corresponds to one id
- on each line, you can find the variables id, ed, and seven variables corresponding the lwages for each t.
Click to see the solution
wages %>% pivot_wider(names_from = t, values_from=lwage, names_prefix = "lwage")
## # A tibble: 595 x 9
## id ed lwage1 lwage2 lwage3 lwage4 lwage5 lwage6 lwage7
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 9 5.56 5.72 6.00 6.00 6.06 6.17 6.24
## 2 2 11 6.16 6.21 6.26 6.54 6.70 6.79 6.82
## 3 3 12 5.65 6.44 6.55 6.60 6.70 6.78 6.86
## 4 4 10 6.16 6.24 6.30 6.36 6.47 6.56 6.62
## 5 5 16 6.44 6.62 6.63 6.98 7.05 7.31 7.30
## 6 6 12 6.91 6.91 6.91 7.00 7.07 7.52 7.34
## 7 7 12 6.13 6.17 6.21 6.31 6.38 6.45 6.52
## 8 8 10 6.33 6.40 6.54 6.56 6.59 6.82 6.89
## 9 9 16 6.55 6.55 6.80 6.91 7.09 7.17 7.21
## 10 10 16 6.40 6.44 6.44 6.44 6.52 6.61 6.74
## # ... with 585 more rows