From long to wide

The original data

Let’s take the gapminder data again. To help visualize how data are reshaped, let’s retain only data available after year 2000.

library(tidyverse)
library(gapminder)
data <- gapminder %>% filter(year >2000)
data

## # A tibble: 284 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       2002    42.1 25268405      727.
##  2 Afghanistan Asia       2007    43.8 31889923      975.
##  3 Albania     Europe     2002    75.7  3508512     4604.
##  4 Albania     Europe     2007    76.4  3600523     5937.
##  5 Algeria     Africa     2002    71.0 31287142     5288.
##  6 Algeria     Africa     2007    72.3 33333216     6223.
##  7 Angola      Africa     2002    41.0 10866106     2773.
##  8 Angola      Africa     2007    42.7 12420476     4797.
##  9 Argentina   Americas   2002    74.3 38331121     8798.
## 10 Argentina   Americas   2007    75.3 40301927    12779.
## # ... with 274 more rows

You can see that for each country, there are several lines, each corresponding to one year. This is typical of panel data where you collected the same information about a respondent at different times.

What do you want to obtain?

If your data set is originally in a long format, there may be cases where you want to reshape it in long format. The diagram below illustrates what is the objective:

Reshape one variable only

Let say you want to have a new table where:

each line corresponds to a unique country
each line contains information about the population for the different years

Since our original data has a long format, we will use the command pivot_wider(). This function is part of the tidyr package, a member of the tidyverse ecosystem, so do not forget to load tidyr or the tidyverse package:

data %>% 
  select(country, continent, year, pop) %>%  # we keep the interesting variables
  pivot_wider(names_from = year, values_from = c(pop), names_prefix = "popul") %>%
  arrange(continent, country) # we sort

## # A tibble: 142 x 4
##    country                  continent popul2002 popul2007
##    <fct>                    <fct>         <int>     <int>
##  1 Algeria                  Africa     31287142  33333216
##  2 Angola                   Africa     10866106  12420476
##  3 Benin                    Africa      7026113   8078314
##  4 Botswana                 Africa      1630347   1639131
##  5 Burkina Faso             Africa     12251209  14326203
##  6 Burundi                  Africa      7021078   8390505
##  7 Cameroon                 Africa     15929988  17696293
##  8 Central African Republic Africa      4048013   4369038
##  9 Chad                     Africa      8835739  10238807
## 10 Comoros                  Africa       614382    710960
## # ... with 132 more rows

We created two new columns which names are a combination of the names_prefix (here popul) and the year (which we declared using the argument names_from). Each cell in these columns correspond to the countries' population in the corresponding years.

Reshape several variables

Let’s now assume that you want to have another table where:

each line corresponds to a unique country
each line contains information about the population and life expectancy for the 2000’s

Since there are two variables to be reshaped, the names_prefix is now automatically taken from the names_from argument.

data %>% 
  select(country, continent, year, pop, lifeExp) %>%  # we keep the interesting variables
  pivot_wider(names_from = year, values_from = c(pop, lifeExp)) %>% #select variables to reshape
  arrange(continent, country) # we sort

## # A tibble: 142 x 6
##    country                 continent pop_2002 pop_2007 lifeExp_2002 lifeExp_2007
##    <fct>                   <fct>        <int>    <int>        <dbl>        <dbl>
##  1 Algeria                 Africa    31287142 33333216         71.0         72.3
##  2 Angola                  Africa    10866106 12420476         41.0         42.7
##  3 Benin                   Africa     7026113  8078314         54.4         56.7
##  4 Botswana                Africa     1630347  1639131         46.6         50.7
##  5 Burkina Faso            Africa    12251209 14326203         50.6         52.3
##  6 Burundi                 Africa     7021078  8390505         47.4         49.6
##  7 Cameroon                Africa    15929988 17696293         49.9         50.4
##  8 Central African Republ~ Africa     4048013  4369038         43.3         44.7
##  9 Chad                    Africa     8835739 10238807         50.5         50.7
## 10 Comoros                 Africa      614382   710960         63.0         65.2
## # ... with 132 more rows

To go further

The cases presented are the most basic uses of pivot_wider().

More details about pivot_wider can be found at: pivot_wider

Exercise: Wage data

For this exercise, we will use the data set WageData that is coming with the panelr package. The data come from the years 1976-1982 in the Panel Study of Income Dynamics (PSID), with information about the demographics and earnings of 595 individuals.

library(panelr)
data("WageData")
head(WageData)

##   exp wks occ ind south smsa ms fem union ed blk   lwage t id
## 1   3  32   0   0     1    0  1   0     0  9   0 5.56068 1  1
## 2   4  43   0   0     1    0  1   0     0  9   0 5.72031 2  1
## 3   5  40   0   0     1    0  1   0     0  9   0 5.99645 3  1
## 4   6  39   0   0     1    0  1   0     0  9   0 5.99645 4  1
## 5   7  42   0   1     1    0  1   0     0  9   0 6.06146 5  1
## 6   8  35   0   1     1    0  1   0     0  9   0 6.17379 6  1

The key columns are id and t. They tell you which respondent and which time point the row refers to, respectively. We will also retain the variable ed that records the education level.

(wages <- WageData %>% select(id, t, lwage, ed)%>% as_tibble() )

## # A tibble: 4,165 x 4
##       id     t lwage    ed
##    <dbl> <dbl> <dbl> <dbl>
##  1     1     1  5.56     9
##  2     1     2  5.72     9
##  3     1     3  6.00     9
##  4     1     4  6.00     9
##  5     1     5  6.06     9
##  6     1     6  6.17     9
##  7     1     7  6.24     9
##  8     2     1  6.16    11
##  9     2     2  6.21    11
## 10     2     3  6.26    11
## # ... with 4,155 more rows

Task: Reshape wages it into a wide format, so that:

one line corresponds to one id
on each line, you can find the variables id, ed, and seven variables corresponding the lwages for each t.

Click to see the solution

wages %>% pivot_wider(names_from = t, values_from=lwage, names_prefix = "lwage")

## # A tibble: 595 x 9
##       id    ed lwage1 lwage2 lwage3 lwage4 lwage5 lwage6 lwage7
##    <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1     1     9   5.56   5.72   6.00   6.00   6.06   6.17   6.24
##  2     2    11   6.16   6.21   6.26   6.54   6.70   6.79   6.82
##  3     3    12   5.65   6.44   6.55   6.60   6.70   6.78   6.86
##  4     4    10   6.16   6.24   6.30   6.36   6.47   6.56   6.62
##  5     5    16   6.44   6.62   6.63   6.98   7.05   7.31   7.30
##  6     6    12   6.91   6.91   6.91   7.00   7.07   7.52   7.34
##  7     7    12   6.13   6.17   6.21   6.31   6.38   6.45   6.52
##  8     8    10   6.33   6.40   6.54   6.56   6.59   6.82   6.89
##  9     9    16   6.55   6.55   6.80   6.91   7.09   7.17   7.21
## 10    10    16   6.40   6.44   6.44   6.44   6.52   6.61   6.74
## # ... with 585 more rows

Last updated on Mar 18, 2022