Select rows

Learning objectives

If you want exclude some individuals that are outside the scope of your analysis, you need to be able to select subsamples of your data set. In this section, you will learn how to select specific rows of your data set. This is our first encounter with the package dplyr and the use of pipes (%>%) for chaining commands. This is an important section, as you will use this syntax for all types of data manipulation that follow.

At the end of this section, you should be able to select specific rows of a data set :

Data set up

First, let’s rename the dataframe gapminder with a shorter name, and let’s add a column row_no using the data.frame commands. That column is not really essential, but it will help visualizing the rows that have been selected.

library(tidyverse)
library(gapminder)

data <- gapminder
data$row_no = 1:nrow(data)

Selecting rows by their position

To do this we will start to use the dplyr package. Note that dplyr was loaded when you loaded the package tidyverse, so you do not need an additional command.

The command slice()

The command slice() lets you index rows by their positions. It allows you to select, remove, and duplicate rows.

The easiest way to use slice is to indicate the vector of row numbers you wish to keep as in:

slice(data, 2:5)  # to select the row from 2 to 5
slice(data, c(7, 5, 1)) # to select specific rows in a specific order
slice(data, 10) #note that including a numeric and not a vector also works
Click to view the results

## # A tibble: 4 x 7
##   country     continent  year lifeExp      pop gdpPercap row_no
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>  <int>
## 1 Afghanistan Asia       1957    30.3  9240934      821.      2
## 2 Afghanistan Asia       1962    32.0 10267083      853.      3
## 3 Afghanistan Asia       1967    34.0 11537966      836.      4
## 4 Afghanistan Asia       1972    36.1 13079460      740.      5
## # A tibble: 3 x 7
##   country     continent  year lifeExp      pop gdpPercap row_no
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>  <int>
## 1 Afghanistan Asia       1982    39.9 12881816      978.      7
## 2 Afghanistan Asia       1972    36.1 13079460      740.      5
## 3 Afghanistan Asia       1952    28.8  8425333      779.      1
## # A tibble: 1 x 7
##   country     continent  year lifeExp      pop gdpPercap row_no
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>  <int>
## 1 Afghanistan Asia       1997    41.8 22227415      635.     10

Note that all dplyr related commands work in the same way:

  • The first argument is a data frame.
  • The subsequent arguments describe what to do with the data frame, using the variable names without quotes or sometimes some vectors of numbers.
  • The result is a new data frame.

Using the %>% (pipe) operator

A very convenient feature of tidyverse is the %>% operator. To understand how it works, we can reproduce the precedent example using it.

data %>% slice(2:5)
Click to view the results

## # A tibble: 4 x 7
##   country     continent  year lifeExp      pop gdpPercap row_no
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>  <int>
## 1 Afghanistan Asia       1957    30.3  9240934      821.      2
## 2 Afghanistan Asia       1962    32.0 10267083      853.      3
## 3 Afghanistan Asia       1967    34.0 11537966      836.      4
## 4 Afghanistan Asia       1972    36.1 13079460      740.      5

How did that work:

You need to read this command from left to right, and think that the results of the previous command are passed as the first argument of the next command.

The first command is to take the data.frame data, the result data is passed on as the first argument of the command slice(…). So literally, you think about this command as: take “data”, then use slice to select the row 2 to 5…

In this very simple command, the advantage of using the pipe is not obvious as you would probably be faster in writing data[2:5, ], but you will soon see its main advantage when we will start “chaining” several commands.

n() is a very useful dplyr function that outputs the number of rows of the dataset. For example, you could use it with slice to discard the first 200 rows.

slice(data, 200:n()) %>% head(10)
## # A tibble: 10 x 7
##    country      continent  year lifeExp      pop gdpPercap row_no
##    <fct>        <fct>     <int>   <dbl>    <int>     <dbl>  <int>
##  1 Burkina Faso Africa     1987    49.6  7586551      912.    200
##  2 Burkina Faso Africa     1992    50.3  8878303      932.    201
##  3 Burkina Faso Africa     1997    50.3 10352843      946.    202
##  4 Burkina Faso Africa     2002    50.6 12251209     1038.    203
##  5 Burkina Faso Africa     2007    52.3 14326203     1217.    204
##  6 Burundi      Africa     1952    39.0  2445618      339.    205
##  7 Burundi      Africa     1957    40.5  2667518      380.    206
##  8 Burundi      Africa     1962    42.0  2961915      355.    207
##  9 Burundi      Africa     1967    43.5  3330989      413.    208
## 10 Burundi      Africa     1972    44.1  3529983      464.    209

(note: we have chained the command with head(10) using the pipe to display only the first ten rows of the new data set)

slice helpers

Note you can use several variants of slice() for specific uses - we will sometimes call them helpers. In particular, you may want to look at the commands slice_head() and slice_tail() to select the first or last $n$ rows of your data set.

Exercise 1: Select the last 10 rows

Using the command slice_tail(), select the last 10 rows of the dataset data.

Click to view the solution

# note that it outputs a new dataframe, however you can check that the initial data 
# was not modified, and that you did not create a new object in the workspace
data %>% slice_tail(n=10)  
## # A tibble: 10 x 7
##    country  continent  year lifeExp      pop gdpPercap row_no
##    <fct>    <fct>     <int>   <dbl>    <int>     <dbl>  <int>
##  1 Zimbabwe Africa     1962    52.4  4277736      527.   1695
##  2 Zimbabwe Africa     1967    54.0  4995432      570.   1696
##  3 Zimbabwe Africa     1972    55.6  5861135      799.   1697
##  4 Zimbabwe Africa     1977    57.7  6642107      686.   1698
##  5 Zimbabwe Africa     1982    60.4  7636524      789.   1699
##  6 Zimbabwe Africa     1987    62.4  9216418      706.   1700
##  7 Zimbabwe Africa     1992    60.4 10704340      693.   1701
##  8 Zimbabwe Africa     1997    46.8 11404948      792.   1702
##  9 Zimbabwe Africa     2002    40.0 11926563      672.   1703
## 10 Zimbabwe Africa     2007    43.5 12311143      470.   1704
# if you want to save it with a new name
data2 <- data %>% slice_tail(n=10)  

You may want to overwrite the old data

data <- data %>% slice_tail(n=10)  

However, by doing so you deleted the old dataset! Make sure this is what you wanted to do…

Exercise 2: Select the rows 3, 50 and 200

Using a command to select the 3, 50 and 200 of the dataset data.

Click to view the solution

data %>% slice(c(3,50, 200))
## # A tibble: 3 x 7
##   country      continent  year lifeExp      pop gdpPercap row_no
##   <fct>        <fct>     <int>   <dbl>    <int>     <dbl>  <int>
## 1 Afghanistan  Asia       1962    32.0 10267083      853.      3
## 2 Argentina    Americas   1957    64.4 19610538     6857.     50
## 3 Burkina Faso Africa     1987    49.6  7586551      912.    200

Randomly selecting rows

A more interesting application is the creation of a random subsample of cases (i.e., rows). Again, you can use a slice helperslice_sample() which allows you to random select with or without replacement.

set.seed(12345)
data %>% slice_sample(n = 5)
data %>% slice_sample(n = 5, replace = TRUE)
Click to view the results

## # A tibble: 5 x 7
##   country   continent  year lifeExp       pop gdpPercap row_no
##   <fct>     <fct>     <int>   <dbl>     <int>     <dbl>  <int>
## 1 Bolivia   Americas   1997    62.0   7693188     3326.    142
## 2 Argentina Americas   1962    65.1  21283783     7133.     51
## 3 Indonesia Asia       2007    70.6 223547000     3541.    720
## 4 Iran      Asia       1997    68.0  63327987     8264.    730
## 5 Portugal  Europe     1987    74.1   9915289    13039.   1244
## # A tibble: 5 x 7
##   country          continent  year lifeExp      pop gdpPercap row_no
##   <fct>            <fct>     <int>   <dbl>    <int>     <dbl>  <int>
## 1 Hong Kong, China Asia       1967    70    3722800     6198.    664
## 2 Kenya            Africa     1997    54.4 28263827     1360.    826
## 3 Guatemala        Americas   1972    53.7  5149581     4031.    605
## 4 Ghana            Africa     2002    58.5 20550751     1112.    587
## 5 Slovak Republic  Europe     1987    71.1  5199318    12037.   1376

Exercise 3

  1. Generate a subset of data with its first 10 rows and name it data10
  2. Generate a random subsample of data10 containing 20 rows. (tip: understand what the argument replace is doing)
Click to see the solution

data10 <- data %>% slice_head(n=10)

Random sample with replacement. Note that it you do not include replace=TRUE argument, R will throw an error (why?)

data10 %>% slice_sample(n = 20, replace = TRUE)
## # A tibble: 20 x 7
##    country     continent  year lifeExp      pop gdpPercap row_no
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>  <int>
##  1 Afghanistan Asia       1987    40.8 13867957      852.      8
##  2 Afghanistan Asia       1957    30.3  9240934      821.      2
##  3 Afghanistan Asia       1977    38.4 14880372      786.      6
##  4 Afghanistan Asia       1977    38.4 14880372      786.      6
##  5 Afghanistan Asia       1982    39.9 12881816      978.      7
##  6 Afghanistan Asia       1997    41.8 22227415      635.     10
##  7 Afghanistan Asia       1952    28.8  8425333      779.      1
##  8 Afghanistan Asia       1987    40.8 13867957      852.      8
##  9 Afghanistan Asia       1982    39.9 12881816      978.      7
## 10 Afghanistan Asia       1977    38.4 14880372      786.      6
## 11 Afghanistan Asia       1952    28.8  8425333      779.      1
## 12 Afghanistan Asia       1967    34.0 11537966      836.      4
## 13 Afghanistan Asia       1987    40.8 13867957      852.      8
## 14 Afghanistan Asia       1997    41.8 22227415      635.     10
## 15 Afghanistan Asia       1962    32.0 10267083      853.      3
## 16 Afghanistan Asia       1992    41.7 16317921      649.      9
## 17 Afghanistan Asia       1967    34.0 11537966      836.      4
## 18 Afghanistan Asia       1997    41.8 22227415      635.     10
## 19 Afghanistan Asia       1982    39.9 12881816      978.      7
## 20 Afghanistan Asia       1957    30.3  9240934      821.      2

Selecting rows based on logical criteria

Let say you want to look only at South Africa data.

We will now use another command from the dplyr package : filter()

SAdata <- filter(data, country=="South Africa")
SAdata
Click to view the results

## # A tibble: 12 x 7
##    country      continent  year lifeExp      pop gdpPercap row_no
##    <fct>        <fct>     <int>   <dbl>    <int>     <dbl>  <int>
##  1 South Africa Africa     1952    45.0 14264935     4725.   1405
##  2 South Africa Africa     1957    48.0 16151549     5487.   1406
##  3 South Africa Africa     1962    50.0 18356657     5769.   1407
##  4 South Africa Africa     1967    51.9 20997321     7114.   1408
##  5 South Africa Africa     1972    53.7 23935810     7766.   1409
##  6 South Africa Africa     1977    55.5 27129932     8029.   1410
##  7 South Africa Africa     1982    58.2 31140029     8568.   1411
##  8 South Africa Africa     1987    60.8 35933379     7826.   1412
##  9 South Africa Africa     1992    61.9 39964159     7225.   1413
## 10 South Africa Africa     1997    60.2 42835005     7479.   1414
## 11 South Africa Africa     2002    53.4 44433622     7711.   1415
## 12 South Africa Africa     2007    49.3 43997828     9270.   1416

Note again how the command worked :

  • The first argument is a data frame.
  • The subsequent arguments describe what to do with the data frame, using the variable names without quotes.
  • The result is a new data frame that we named SAdata

You can create more complicated criteria using functions and operators such as:

  • ==, >, >=, <, <= (when constructing comparison expressions)
  • %in% when checking presence in a list of possible value
  • & (and), | (or), ! (not) (when combining expression)

In the following example, we are selecting South Africa records for years after 2000:

SA2000 <- filter(data, country=="South Africa" & year > 2000)
Click to see the result

Exercise 4: Select recent African data

  • Select records from African country with a population greater than 30 million for years after 2003 (note there are three selection criteria)
Click to see the solution

filter(data, continent=="Africa" , year > 2003, pop > 30000000)
## # A tibble: 10 x 7
##    country          continent  year lifeExp       pop gdpPercap row_no
##    <fct>            <fct>     <int>   <dbl>     <int>     <dbl>  <int>
##  1 Algeria          Africa     2007    72.3  33333216     6223.     36
##  2 Congo, Dem. Rep. Africa     2007    46.5  64606759      278.    336
##  3 Egypt            Africa     2007    71.3  80264543     5581.    468
##  4 Ethiopia         Africa     2007    52.9  76511887      691.    516
##  5 Kenya            Africa     2007    54.1  35610177     1463.    828
##  6 Morocco          Africa     2007    71.2  33757175     3820.   1032
##  7 Nigeria          Africa     2007    46.9 135031164     2014.   1140
##  8 South Africa     Africa     2007    49.3  43997828     9270.   1416
##  9 Sudan            Africa     2007    58.6  42292929     2602.   1452
## 10 Tanzania         Africa     2007    52.5  38139640     1107.   1524

Exercise 5: Select the data that from South Africa except those from 1962, 1977 and 2003

Click to see the solution

filter(data, country=="South Africa" & !(year %in% c(1962, 1977, 2003)))
## # A tibble: 10 x 7
##    country      continent  year lifeExp      pop gdpPercap row_no
##    <fct>        <fct>     <int>   <dbl>    <int>     <dbl>  <int>
##  1 South Africa Africa     1952    45.0 14264935     4725.   1405
##  2 South Africa Africa     1957    48.0 16151549     5487.   1406
##  3 South Africa Africa     1967    51.9 20997321     7114.   1408
##  4 South Africa Africa     1972    53.7 23935810     7766.   1409
##  5 South Africa Africa     1982    58.2 31140029     8568.   1411
##  6 South Africa Africa     1987    60.8 35933379     7826.   1412
##  7 South Africa Africa     1992    61.9 39964159     7225.   1413
##  8 South Africa Africa     1997    60.2 42835005     7479.   1414
##  9 South Africa Africa     2002    53.4 44433622     7711.   1415
## 10 South Africa Africa     2007    49.3 43997828     9270.   1416

More advanced sampling

We will now use again with the gapminder data.

Let’s first select a sub-sample. We want to select only two years (2002 and 2007), and we want to select randomly ten countries.

A first idea would be to use the command slice_sample(), as in:

set.seed(123)
data <- gapminder %>% 
  filter(year > 2000 ) %>%
  slice_sample(n=10) %>% arrange(country)
data
Click to view the results

## # A tibble: 10 x 6
##    country         continent  year lifeExp        pop gdpPercap
##    <fct>           <fct>     <int>   <dbl>      <int>     <dbl>
##  1 Austria         Europe     2007    79.8    8199783    36126.
##  2 France          Europe     2007    80.7   61083916    30470.
##  3 Gabon           Africa     2002    56.8    1299304    12522.
##  4 India           Asia       2007    64.7 1110396331     2452.
##  5 Madagascar      Africa     2002    57.3   16473477      895.
##  6 Nepal           Asia       2002    61.3   25873917     1057.
##  7 Pakistan        Asia       2002    63.6  153403524     2093.
##  8 Slovak Republic Europe     2002    73.8    5410052    13639.
##  9 Swaziland       Africa     2007    39.6    1133066     4513.
## 10 Zimbabwe        Africa     2002    40.0   11926563      672.

The problem with this approach is that each country has two records. So with “blind” line sampling, we will encounter cases where we only have one record for a country. That is what happened here, and that is not what we wanted.

What we really want is a subsample of countries, and once a country is selected, we want to see all the records of that selected country. To do this, we need a different approach and use the base command sample():

# using tidyverse approach:
random_names <- gapminder %>% 
  select(country) %>%  
  unique() %>%      # take only one record per country
  as_vector() %>%   # as_vector transforms a data.frame into a vector
  sample(10, replace = FALSE)   # take a 10 random sample 

# sometimes tidyverse can be long
# it might be easier to use the following command:
set.seed(123)
random_names <- sample(unique(data$country), 10, replace=FALSE)
random_names
##  [1] Gabon           Zimbabwe        France          Slovak Republic
##  [5] Nepal           Swaziland       Austria         Pakistan       
##  [9] Madagascar      India          
## 142 Levels: Afghanistan Albania Algeria Angola Argentina Australia ... Zimbabwe

Now, we can use this vector of country names to select the corresponding countries

gapminder %>% filter(country %in% random_names & year > 2000) 
Click to view the results

## # A tibble: 20 x 6
##    country         continent  year lifeExp        pop gdpPercap
##    <fct>           <fct>     <int>   <dbl>      <int>     <dbl>
##  1 Austria         Europe     2002    79.0    8148312    32418.
##  2 Austria         Europe     2007    79.8    8199783    36126.
##  3 France          Europe     2002    79.6   59925035    28926.
##  4 France          Europe     2007    80.7   61083916    30470.
##  5 Gabon           Africa     2002    56.8    1299304    12522.
##  6 Gabon           Africa     2007    56.7    1454867    13206.
##  7 India           Asia       2002    62.9 1034172547     1747.
##  8 India           Asia       2007    64.7 1110396331     2452.
##  9 Madagascar      Africa     2002    57.3   16473477      895.
## 10 Madagascar      Africa     2007    59.4   19167654     1045.
## 11 Nepal           Asia       2002    61.3   25873917     1057.
## 12 Nepal           Asia       2007    63.8   28901790     1091.
## 13 Pakistan        Asia       2002    63.6  153403524     2093.
## 14 Pakistan        Asia       2007    65.5  169270617     2606.
## 15 Slovak Republic Europe     2002    73.8    5410052    13639.
## 16 Slovak Republic Europe     2007    74.7    5447502    18678.
## 17 Swaziland       Africa     2002    43.9    1130269     4128.
## 18 Swaziland       Africa     2007    39.6    1133066     4513.
## 19 Zimbabwe        Africa     2002    40.0   11926563      672.
## 20 Zimbabwe        Africa     2007    43.5   12311143      470.

Next