Work with rows and columns

Learning objectives

In this chapter, you will learn how to:

For this, you will learn the commands arrange() and select()of the dplyr package. You will also practice the use of the pipes (%>%) to chain successive commands. The chaining of commands will allow you to arrange your data in a very transparent and easy way.

Sort rows

You are now familiar with dplyr commands and the pipe syntax. A second type of manipulation is the ordering of your data. For this, we will use the command arrange().

The command arrange() changes the order of the rows. The argument is a set of column names to order by.

data %>% arrange(country) 
Click to see the output

## # A tibble: 1,704 x 7
##    country     continent  year lifeExp      pop gdpPercap row_no
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>  <int>
##  1 Afghanistan Asia       1952    28.8  8425333      779.      1
##  2 Afghanistan Asia       1957    30.3  9240934      821.      2
##  3 Afghanistan Asia       1962    32.0 10267083      853.      3
##  4 Afghanistan Asia       1967    34.0 11537966      836.      4
##  5 Afghanistan Asia       1972    36.1 13079460      740.      5
##  6 Afghanistan Asia       1977    38.4 14880372      786.      6
##  7 Afghanistan Asia       1982    39.9 12881816      978.      7
##  8 Afghanistan Asia       1987    40.8 13867957      852.      8
##  9 Afghanistan Asia       1992    41.7 16317921      649.      9
## 10 Afghanistan Asia       1997    41.8 22227415      635.     10
## # ... with 1,694 more rows

If you provide more than one column name, the column names are separated with commas. Each additional column will be used to break ties in the values of preceding columns (lexicographic ordering).

data %>% arrange(continent, country, year)
Click to see the output

## # A tibble: 1,704 x 7
##    country continent  year lifeExp      pop gdpPercap row_no
##    <fct>   <fct>     <int>   <dbl>    <int>     <dbl>  <int>
##  1 Algeria Africa     1952    43.1  9279525     2449.     25
##  2 Algeria Africa     1957    45.7 10270856     3014.     26
##  3 Algeria Africa     1962    48.3 11000948     2551.     27
##  4 Algeria Africa     1967    51.4 12760499     3247.     28
##  5 Algeria Africa     1972    54.5 14760787     4183.     29
##  6 Algeria Africa     1977    58.0 17152804     4910.     30
##  7 Algeria Africa     1982    61.4 20033753     5745.     31
##  8 Algeria Africa     1987    65.8 23254956     5681.     32
##  9 Algeria Africa     1992    67.7 26298373     5023.     33
## 10 Algeria Africa     1997    69.2 29072015     4797.     34
## # ... with 1,694 more rows

Use desc() to re-order by a column in descending order:

data %>% arrange(desc(continent), country)
Click to see the output

## # A tibble: 1,704 x 7
##    country   continent  year lifeExp      pop gdpPercap row_no
##    <fct>     <fct>     <int>   <dbl>    <int>     <dbl>  <int>
##  1 Australia Oceania    1952    69.1  8691212    10040.     61
##  2 Australia Oceania    1957    70.3  9712569    10950.     62
##  3 Australia Oceania    1962    70.9 10794968    12217.     63
##  4 Australia Oceania    1967    71.1 11872264    14526.     64
##  5 Australia Oceania    1972    71.9 13177000    16789.     65
##  6 Australia Oceania    1977    73.5 14074100    18334.     66
##  7 Australia Oceania    1982    74.7 15184200    19477.     67
##  8 Australia Oceania    1987    76.3 16257249    21889.     68
##  9 Australia Oceania    1992    77.6 17481977    23425.     69
## 10 Australia Oceania    1997    78.8 18565243    26998.     70
## # ... with 1,694 more rows

Chaining data manipulation: selecting and sorting rows

Let’s develop a more complex example.

You want to show the countries with the highest 2007 GDP per capita of the African continent, and limit your results to the first six countries.

For this, you will need to:

  • select countries on the African continent for which we have data in 2007
  • sort the results in descending order of GDP
  • retain only the first six results

This can be done with this chain of commands:

data %>% 
  filter(continent == "Africa", year ==2007) %>%
  arrange(-gdpPercap) %>%
  slice_head(n=6)
Click to see the output

## # A tibble: 6 x 7
##   country           continent  year lifeExp      pop gdpPercap row_no
##   <fct>             <fct>     <int>   <dbl>    <int>     <dbl>  <int>
## 1 Gabon             Africa     2007    56.7  1454867    13206.    552
## 2 Botswana          Africa     2007    50.7  1639131    12570.    168
## 3 Equatorial Guinea Africa     2007    51.6   551201    12154.    492
## 4 Libya             Africa     2007    74.0  6036914    12057.    912
## 5 Mauritius         Africa     2007    72.8  1250882    10957.    984
## 6 South Africa      Africa     2007    49.3 43997828     9270.   1416

Note that:

  • each step corresponds to one command (filter, arrange, etc.) and for increased readability we can present each command on a separate line;
  • the different steps are chained, i.e., the results of the first command are passed as the first argument to the next command;
  • you did not have to create intermediate variables.

Select columns/variables

When you have a large survey with many variables, you may want to work only with the variables needed for your analysis.

The function select() will allow you to create a subset of variables.

Column positions

A first approach is to list the column positions.

data %>% select(1, 3, 6) 
Click to see the output

## # A tibble: 1,704 x 3
##    country      year gdpPercap
##    <fct>       <int>     <dbl>
##  1 Afghanistan  1952      779.
##  2 Afghanistan  1957      821.
##  3 Afghanistan  1962      853.
##  4 Afghanistan  1967      836.
##  5 Afghanistan  1972      740.
##  6 Afghanistan  1977      786.
##  7 Afghanistan  1982      978.
##  8 Afghanistan  1987      852.
##  9 Afghanistan  1992      649.
## 10 Afghanistan  1997      635.
## # ... with 1,694 more rows

However, this is not really recommended as the order of your columns could change, and lead to unexpected results. So, it is advisable to use column names instead.

Column names

A second approach is to list all the names of the variables you want to select:

data %>% select(country, year, gdpPercap)
Click to see the output

## # A tibble: 1,704 x 3
##    country      year gdpPercap
##    <fct>       <int>     <dbl>
##  1 Afghanistan  1952      779.
##  2 Afghanistan  1957      821.
##  3 Afghanistan  1962      853.
##  4 Afghanistan  1967      836.
##  5 Afghanistan  1972      740.
##  6 Afghanistan  1977      786.
##  7 Afghanistan  1982      978.
##  8 Afghanistan  1987      852.
##  9 Afghanistan  1992      649.
## 10 Afghanistan  1997      635.
## # ... with 1,694 more rows

However, it can become rapidly cumbersome to have to write a long list of variable names. Hopefully, you can use a few tricks.

Select ranges of variables

You can provide a range of variables or a range of column positions

# This will select the variables 
# between the column continent and pop
data %>% select(continent:pop)
Click to see the output

## # A tibble: 1,704 x 4
##    continent  year lifeExp      pop
##    <fct>     <int>   <dbl>    <int>
##  1 Asia       1952    28.8  8425333
##  2 Asia       1957    30.3  9240934
##  3 Asia       1962    32.0 10267083
##  4 Asia       1967    34.0 11537966
##  5 Asia       1972    36.1 13079460
##  6 Asia       1977    38.4 14880372
##  7 Asia       1982    39.9 12881816
##  8 Asia       1987    40.8 13867957
##  9 Asia       1992    41.7 16317921
## 10 Asia       1997    41.8 22227415
## # ... with 1,694 more rows

Exclude instead of selecting

If you need to select most variables and exclude only a few variables you can use the sign - to signal the variables you want to exclude:

data %>% select(-gdpPercap)
Click to see the output

## # A tibble: 1,704 x 6
##    country     continent  year lifeExp      pop row_no
##    <fct>       <fct>     <int>   <dbl>    <int>  <int>
##  1 Afghanistan Asia       1952    28.8  8425333      1
##  2 Afghanistan Asia       1957    30.3  9240934      2
##  3 Afghanistan Asia       1962    32.0 10267083      3
##  4 Afghanistan Asia       1967    34.0 11537966      4
##  5 Afghanistan Asia       1972    36.1 13079460      5
##  6 Afghanistan Asia       1977    38.4 14880372      6
##  7 Afghanistan Asia       1982    39.9 12881816      7
##  8 Afghanistan Asia       1987    40.8 13867957      8
##  9 Afghanistan Asia       1992    41.7 16317921      9
## 10 Afghanistan Asia       1997    41.8 22227415     10
## # ... with 1,694 more rows

Select by type

You can select the variables by their type or any other characteristic using where().

where() applies a function to all variables and selects those for which the function returns TRUE.

# will select only the numeric variables
data %>% select(where(is.numeric))
Click to see the output

## # A tibble: 1,704 x 5
##     year lifeExp      pop gdpPercap row_no
##    <int>   <dbl>    <int>     <dbl>  <int>
##  1  1952    28.8  8425333      779.      1
##  2  1957    30.3  9240934      821.      2
##  3  1962    32.0 10267083      853.      3
##  4  1967    34.0 11537966      836.      4
##  5  1972    36.1 13079460      740.      5
##  6  1977    38.4 14880372      786.      6
##  7  1982    39.9 12881816      978.      7
##  8  1987    40.8 13867957      852.      8
##  9  1992    41.7 16317921      649.      9
## 10  1997    41.8 22227415      635.     10
## # ... with 1,694 more rows

Use additional helpers on variable names

To make your life easier, you can also use a number of helper function within select

# Select of the variables with names starting with C
data %>% select(starts_with("c"))
Click to see the output

## # A tibble: 1,704 x 2
##    country     continent
##    <fct>       <fct>    
##  1 Afghanistan Asia     
##  2 Afghanistan Asia     
##  3 Afghanistan Asia     
##  4 Afghanistan Asia     
##  5 Afghanistan Asia     
##  6 Afghanistan Asia     
##  7 Afghanistan Asia     
##  8 Afghanistan Asia     
##  9 Afghanistan Asia     
## 10 Afghanistan Asia     
## # ... with 1,694 more rows

# Select of the variables with names containing O
data %>% select(contains("O"))
Click to see the output

## # A tibble: 1,704 x 4
##    country     continent      pop row_no
##    <fct>       <fct>        <int>  <int>
##  1 Afghanistan Asia       8425333      1
##  2 Afghanistan Asia       9240934      2
##  3 Afghanistan Asia      10267083      3
##  4 Afghanistan Asia      11537966      4
##  5 Afghanistan Asia      13079460      5
##  6 Afghanistan Asia      14880372      6
##  7 Afghanistan Asia      12881816      7
##  8 Afghanistan Asia      13867957      8
##  9 Afghanistan Asia      16317921      9
## 10 Afghanistan Asia      22227415     10
## # ... with 1,694 more rows

It was probably not convincing with such an example, but imagine that you have a data set with information collected at different time, and you have variables such as x2000, x2001, x2002, x2003

# Presented for reference only, 
# We do not have variables x2000, x2001, x2002, x2003!
data %>% select(starts_with("x"))

Use a combination of these strategies

Note that you can mix the strategies to obtain a list of variables. For example, you want to select the row_no and year variables and the variables containing “O”.

data %>% select(row_no, year, contains("o"))
Click to see the output

## # A tibble: 1,704 x 5
##    row_no  year country     continent      pop
##     <int> <int> <fct>       <fct>        <int>
##  1      1  1952 Afghanistan Asia       8425333
##  2      2  1957 Afghanistan Asia       9240934
##  3      3  1962 Afghanistan Asia      10267083
##  4      4  1967 Afghanistan Asia      11537966
##  5      5  1972 Afghanistan Asia      13079460
##  6      6  1977 Afghanistan Asia      14880372
##  7      7  1982 Afghanistan Asia      12881816
##  8      8  1987 Afghanistan Asia      13867957
##  9      9  1992 Afghanistan Asia      16317921
## 10     10  1997 Afghanistan Asia      22227415
## # ... with 1,694 more rows

Reordering columns

Sometimes, you just want to re-order the variables. Indeed you can do this with select, but it can still be cumbersome if you need to move only a few among many columns.

In such case you can use the function relocate().

The default behavior corresponds to the case where you want to move some variables to the front:

# this will make the row_no variable as the first colum
data %>% relocate(row_no)
Click to see the output

## # A tibble: 1,704 x 7
##    row_no country     continent  year lifeExp      pop gdpPercap
##     <int> <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1      1 Afghanistan Asia       1952    28.8  8425333      779.
##  2      2 Afghanistan Asia       1957    30.3  9240934      821.
##  3      3 Afghanistan Asia       1962    32.0 10267083      853.
##  4      4 Afghanistan Asia       1967    34.0 11537966      836.
##  5      5 Afghanistan Asia       1972    36.1 13079460      740.
##  6      6 Afghanistan Asia       1977    38.4 14880372      786.
##  7      7 Afghanistan Asia       1982    39.9 12881816      978.
##  8      8 Afghanistan Asia       1987    40.8 13867957      852.
##  9      9 Afghanistan Asia       1992    41.7 16317921      649.
## 10     10 Afghanistan Asia       1997    41.8 22227415      635.
## # ... with 1,694 more rows

Use the function helper if you need more sophisticated changes.

Selecting rows and columns

You have all the tools to select rows and columns. Using tidyverse, you will chain each manipulation. For example, you want to select the column country and pop, and consider only the African countries:

data %>% filter(continent == "Africa") %>%
  select(country, pop)
Click to see the output

## # A tibble: 624 x 2
##    country      pop
##    <fct>      <int>
##  1 Algeria  9279525
##  2 Algeria 10270856
##  3 Algeria 11000948
##  4 Algeria 12760499
##  5 Algeria 14760787
##  6 Algeria 17152804
##  7 Algeria 20033753
##  8 Algeria 23254956
##  9 Algeria 26298373
## 10 Algeria 29072015
## # ... with 614 more rows

Note that the order of the commands can be important. In our example, we selected the row where continent was “Africa” and then selected two variables country and pop. Imagine what would happen if you interchange the two commands:

data %>%  select(country, pop) %>%
  filter(continent == "Africa") 

Exercises

Exercise 1:

Show the 2007 GDP per capita of the 10 most populated countries (in 2007) in descending order of population

Click to view the solution

data %>% filter(year==2007) %>%
  arrange(-pop) %>%
  select(country, pop, gdpPercap) %>%
  slice_head(n=10)
## # A tibble: 10 x 3
##    country              pop gdpPercap
##    <fct>              <int>     <dbl>
##  1 China         1318683096     4959.
##  2 India         1110396331     2452.
##  3 United States  301139947    42952.
##  4 Indonesia      223547000     3541.
##  5 Brazil         190010647     9066.
##  6 Pakistan       169270617     2606.
##  7 Bangladesh     150448339     1391.
##  8 Nigeria        135031164     2014.
##  9 Japan          127467972    31656.
## 10 Mexico         108700891    11978.

Exercise 2:

Show the 2007 life expectancy (in 2007). Order by descending GDP per capita. Show the countries with the 5 highest GDP per capita and the 5 lowest GDP per capita.

Click to view the solution

data %>% filter(year==2007) %>%
  arrange(-gdpPercap) %>%
  select(country, lifeExp, gdpPercap) %>%
  slice(1:5, (n()-4):n())  #be careful with the parenthesis
## # A tibble: 10 x 3
##    country          lifeExp gdpPercap
##    <fct>              <dbl>     <dbl>
##  1 Norway              80.2    49357.
##  2 Kuwait              77.6    47307.
##  3 Singapore           80.0    47143.
##  4 United States       78.2    42952.
##  5 Ireland             78.9    40676.
##  6 Guinea-Bissau       46.4      579.
##  7 Zimbabwe            43.5      470.
##  8 Burundi             49.6      430.
##  9 Liberia             45.7      415.
## 10 Congo, Dem. Rep.    46.5      278.

Previous
Next