Factors

Learning objectives

Factors are used to store categorical variables, i.e. variables that take a limited number of different values. Categorical variables enter statistical models differently than continuous variables, which is why R developers have created a specific type of data to ensure that the modeling functions will handle this data correctly. Another reason why factors were created is that they store data in levels to reduce data redundancy and save a lot of space in the memory.

Here you will learn how to create and manipulate factors.

Factors

Factors are stored as a vector of strings - the levels.

Convert a vector

To create a factor, we use the function factor(). The only required argument to factor is a vector of values which will be returned as a vector of factor values. Both numeric and character variables can be made into factors, but a factor’s levels will always be character values. You can see the possible levels for a factor through the levels() command.

# Transform character vector into a factor
gender = c("male","female","female","male","female","female")
gender.fac = factor(gender)
gender.fac
## [1] male   female female male   female female
## Levels: female male

You can examine the structure of a factor function by using str(gender.fac)

# examine the structure of the factor gender.fac
str(gender.fac)
##  Factor w/ 2 levels "female","male": 2 1 1 2 1 1

You can see that gender.fac is a factor with 2 levels. The function factor converted the character vector into a vector of integer values. “Female” is the first level encoded as 1 whereas the “Male” is the second level, encoded as 2.

We can see that creating a factor is an efficient way to store a vector of character values, because each unique character value is stored only once, and the data itself is stored as a vector of integers.

Changing the order of the levels

The levels of a factor are used when displaying the factor. By default, when factoring a vector of string, the levels are generated using alphabetic order. In the previous example, this is why “female” were assigned to 1, and “male” were assigned to 2.

You can change these levels when you create a factor by passing a vector with the new values through the levels= argument.

To convert the vector gender into a factor where male is coded 1, and female is coded 2, we use the following command:

gender
## [1] "male"   "female" "female" "male"   "female" "female"
gender.fac2 = factor(gender, levels=c("male", "female"))
gender.fac2
## [1] male   female female male   female female
## Levels: male female
str(gender.fac2)
##  Factor w/ 2 levels "male","female": 1 2 2 1 2 2

The manipulation of factor levels becomes very interesting when you want to control the order of appearance of the different levels.

Consider a vector theMonths containing a list of months:

theMonths = c("March","April","March", "January","November","January",
 "September","October","September","November","August",
 "January","November","November","February","May","August",
 "July","December","August","August","September","November",
 "February","April")
 theMonths <- factor(theMonths)
 theMonths
##  [1] March     April     March     January   November  January   September
##  [8] October   September November  August    January   November  November 
## [15] February  May       August    July      December  August    August   
## [22] September November  February  April    
## 11 Levels: April August December February January July March May ... September

By default, the levels will be coded using alphabetic order. As a result, April is coded 1, August is coded 2, etc. In the case of Months, this will make the readings of results a bit difficult when summarizing information.

For example, the function table() will tell us how many times each month has appeared in our vector theMonths, but you can see the months appear in a un-natural way.

table(theMonths)
## theMonths
##     April    August  December  February   January      July     March       May 
##         2         4         1         2         3         1         2         1 
##  November   October September 
##         5         1         3

If you want to circonvent this problem, you can use the levels argument when creating your vector theMonths

theMonths_ord <- factor(theMonths,levels=c("January","February","March",
             "April","May","June","July","August","September",
             "October","November","December"))

table(theMonths_ord)
## theMonths_ord
##   January  February     March     April       May      June      July    August 
##         3         2         2         2         1         0         1         4 
## September   October  November  December 
##         3         1         5         1

Ordered vs. Unordered Factors

Although the months clearly have an ordering, this is not reflected in the output of the table function.

theMonths <- factor(theMonths,levels=c("January","February","March",
             "April","May","June","July","August","September",
             "October","November","December"))

table(theMonths)
## theMonths
##   January  February     March     April       May      June      July    August 
##         3         2         2         2         1         0         1         4 
## September   October  November  December 
##         3         1         5         1
theMonths[2] > theMonths[3]
## Warning in Ops.factor(theMonths[2], theMonths[3]): '>' not meaningful for
## factors
## [1] NA

As the results of the last operation shows, the comparison operators are not supported for unordered factors. Creating an ordered factor solves this problem:

theMonths <- factor(theMonths,levels=c("January","February","March",
             "April","May","June","July","August","September",
             "October","November","December"), ordered = TRUE)

theMonths[2] > theMonths[3]  # now we can compare February and March
## [1] TRUE

Creating a factor using the function cut()

Another common way to create factors is to split a continuous numeric variables into intervals using the cut() function. The breaks= argument to cut is used to describe how ranges of numbers will be converted to factor values. If a number is provided through the breaks= argument, the resulting factor will be created by dividing the range of the variable into that number of equal length intervals; if a vector of values is provided, the values in the vector are used to determine the breakpoint. Note that if a vector of values is provided, the number of levels of the resultant factor will be one less than the number of values in the vector.

For example, consider a vector women.heights where you recorded the height for a sample of women. If we wanted to create a factor corresponding to weight, with three equally-spaced levels, we could use the following:

women.heights <- c(158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 155)
women.heights.factor = cut(women.heights,3)  
women.heights.factor
##  [1] (155,161] (155,161] (155,161] (161,166] (161,166] (161,166] (161,166]
##  [8] (161,166] (161,166] (166,172] (166,172] (166,172] (166,172] (166,172]
## [15] (166,172] (155,161]
## Levels: (155,161] (161,166] (166,172]
table(women.heights.factor)
## women.heights.factor
## (155,161] (161,166] (166,172] 
##         4         6         6

If you want to change the way the factor levels are displayed, use the labels= argument. It allows you to specify the levels of the factors, instead of the intervals:

women.heights.factor2 = cut(women.heights,3,labels=c('Low','Medium','High'))
table(women.heights.factor2)
## women.heights.factor2
##    Low Medium   High 
##      4      6      6

Exercises

Exercise 1

R provides a predefined variable month.abb with the first three letters of months:

##  [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
  1. Using month.abb, create a random sample of 20 months selected from month.abb and transform it as a factor; when doing so, make sure the coding will not be made according to alphabetic order of the months.
  2. Count the number of times each month has been sampled using the function table()
Show the answer

For the first question, you can use the function sample with the argument replace = TRUE

themonths <- sample(month.abb, 30, replace = TRUE)
themonths.factor <- factor(themonths, levels=month.abb)
themonths.factor
##  [1] Oct Apr Dec Nov Mar Jul Sep Mar Jan Feb Jun Oct Feb Feb Oct Sep Feb Jul Oct
## [20] Dec Jun Dec Oct Oct Dec Oct Dec Oct Aug Nov
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

For the second question, you can use the function table

table(themonths.factor)
## themonths.factor
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec 
##   1   4   2   1   0   2   2   1   2   8   2   5

Previous
Next