Basic data-frame manipulations in R

Many people use spreadsheets for data entry. This is no surprise. But some go further, using them as a tool for – shudder! – data manipulation. It may be better to manage your spreadsheet via an R script instead. Why might you want to do this?

  • Accidental typing. Changing a cell’s value by mistake could, at best, have you scratching your head; at worst, your statistical analysis will be wrong. This is a very real problem for people with gigantic fingers/cats (the cats are not necessarily gigantic).
  • Sorting. If you forget to highlight a column when sorting data, the rows could become mismatched.
  • Ditching data. You might decide to delete data that are later unrecoverable, or you cut some values and forget to paste them (we’ve all been there).
  • Poor formulation. Errors in formulae are difficult to spot unless you highlight the offending cell and look in the formula bar. Mistakes could be copied and pasted across multiple cells, spreading the problem.
  • Data re-use. I know it looks cool to format your data like an E.E. Cummings poem, but future users of your datafile may be confused; formatting with R in mind can avoid this.
  • …And a bunch of other reasons. See here for the benefits of R over Microsoft Excel, by Isaac Peterson.

Do not let Tibbles sabotage your raw data (CC User Mattes/Wikimedia).

The bottom line is that data handling in a spreadsheet could result in irreversible and/or unnoticed changes to your data. Mistakes are made when scripting in R, of course, but at least you won’t have made any catastrophic alterations to the precious raw data. As Pink Floyd once never said: Hey student! Leave that raw data alone!

Below are some simple manipulations to aid organisation and modification of dataframes without running into the problems listed above. The example data are from SCImago Journal Rank. Regardless of your feelings on journal rankings/impact factors, they make for an interesting and useable real-world example.

This post will be updated from time to time with new techniques. If there’s something we’ve missed, or you have a better approach, or you just love spreadsheets, share your thoughts in the comments.

Jump to:
Example dataframe
Basic summaries
Accessing data
Altering a value
Renaming columns
Adding new columns
Subsetting data
Renaming factor levels
Reordering factor levels
Collapsing factor levels
Appending dataframes
Merging dataframes

Example dataframe

  • We are going to build a dataframe by specifying three vectors as objects using the gets notation (<-), and then arranging them into columns using the data.frame function.
  • This dataframe is used as the starting point for the manipulations covered in this post.
VAR1 <- c("Ecol.Lett", "Annu.Rev.Ecol.Evol.S", "Ecol.Monogr", "Front.Ecol.Environ","Glob.Ecol.Biogeogr", "Glob.Change.Biol", "Ecology", "J.Ecol") # Journal titles
VAR2 <- c("BP", "ARI", "ESA", "ESA", "BP", "BP", "ESA", "BP") # Abbreviated publisher names
VAR3 <- c(7.995, 7.019, 5.21, 3.644, 3.477, 3.425, 3.246, 3.145) # SCImago Journal Rank (SJR)
data <- data.frame(VAR1, VAR2, VAR3)
data
## VAR1 VAR2 VAR3
## 1 Ecol.Lett BP 7.995
## 2 Annu.Rev.Ecol.Evol.S ARI 7.019
## 3 Ecol.Monogr ESA 5.210
## 4 Front.Ecol.Environ ESA 3.644
## 5 Glob.Ecol.Biogeogr BP 3.477
## 6 Glob.Change.Biol BP 3.425
## 7 Ecology ESA 3.246
## 8 J.Ecol BP 3.145

Basic summaries

  • The following functions allow you to look at the dataframe, or parts of it, and some simple summary statistics.
data # Show full dataframe
## VAR1 VAR2 VAR3
## 1 Ecol.Lett BP 7.995
## 2 Annu.Rev.Ecol.Evol.S ARI 7.019
## 3 Ecol.Monogr ESA 5.210
## 4 Front.Ecol.Environ ESA 3.644
## 5 Glob.Ecol.Biogeogr BP 3.477
## 6 Glob.Change.Biol BP 3.425
## 7 Ecology ESA 3.246
## 8 J.Ecol BP 3.145
head(data, n = 4) # Show first four rows, default is six
## VAR1 VAR2 VAR3
## 1 Ecol.Lett BP 7.995
## 2 Annu.Rev.Ecol.Evol.S ARI 7.019
## 3 Ecol.Monogr ESA 5.210
## 4 Front.Ecol.Environ ESA 3.644
names(data) # List variable (column) names
## [1] "VAR1" "VAR2" "VAR3"
nrow(data) # Number of rows
## [1] 8
ncol(data) # Number of columns
## [1] 3
summary(data) # Factor levels frequencies, quartiles and integer ranges
## VAR1 VAR2 VAR3
## Annu.Rev.Ecol.Evol.S:1 ARI:1 Min. :3.15
## Ecol.Lett :1 BP :4 1st Qu.:3.38
## Ecol.Monogr :1 ESA:3 Median :3.56
## Ecology :1 Mean :4.64
## Front.Ecol.Environ :1 3rd Qu.:5.66
## Glob.Change.Biol :1 Max. :8.00
## (Other) :2
str(data) # STRucture of dataframe (data type, number of levels, first few values)
## 'data.frame': 8 obs. of 3 variables:
## $ VAR1: Factor w/ 8 levels "Annu.Rev.Ecol.Evol.S",..: 2 1 3 5 7 6 4 8
## $ VAR2: Factor w/ 3 levels "ARI","BP","ESA": 2 1 3 3 2 2 3 2
## $ VAR3: num 8 7.02 5.21 3.64 3.48 ...

Accessing data

  • You will often need to access individual data points or groups of data points.
  • This is done using dollar ($) or square bracket ([]) notation, or a hybrid of the two, to call upon specific rows and columns.
data$VAR2 # Dollar notation; returns values from column 'VAR1' in dataframe 'data'
## [1] BP ARI ESA ESA BP BP ESA BP
## Levels: ARI BP ESA
data[2] # Square-bracket notation (indexing); returns column 1 in dataframe 'data'
## VAR2
## 1 BP
## 2 ARI
## 3 ESA
## 4 ESA
## 5 BP
## 6 BP
## 7 ESA
## 8 BP
data[3, 2] # Value in 3rd row, 2nd column
## [1] ESA
## Levels: ARI BP ESA
data$VAR2[3] # Hybrid notation; returns value from the third row of column 'VAR1' in dataframe 'data'
## [1] ESA
## Levels: ARI BP ESA

Altering a value

  • Once you have located the value(s) you want changed, you can use the gets notation (<-) to specify that location as a new value.
data$VAR3[1] # Shows incorrect value of 7.995
## [1] 7.995
data$VAR3[1] <- 7.996 # You can use $, [], or a hybrid notation to locate data
data$VAR3[1] # Value corrected to 7.996
## [1] 7.996

Renaming columns

  • Informative column headers can be used to replace boring ones from your dataframe (e.g. VAR1 in our example).
  • The names function allows us to look at the column names, which we can change using the gets notation (<-).
  • You can change the column headers one by one, but it is easier to concatenate (c) a list of new column headers if you want to change several at once.
names(data) <- c("Title", "Publisher", "SJR") # Rename multiple columns
names(data)[3] <- "SJR" # Rename the third column only
data
## Title Publisher SJR
## 1 Ecol.Lett BP 7.996
## 2 Annu.Rev.Ecol.Evol.S ARI 7.019
## 3 Ecol.Monogr ESA 5.210
## 4 Front.Ecol.Environ ESA 3.644
## 5 Glob.Ecol.Biogeogr BP 3.477
## 6 Glob.Change.Biol BP 3.425
## 7 Ecology ESA 3.246
## 8 J.Ecol BP 3.145

Adding new columns

  • This is similar to the in the Example dataframe section above, where you specify a concatenated (c) list, or vector, of values to make up the new column.
  • Create a new empty column in the dataframe using dollar notation (e.g. dframe1$column.one creates a new column called column.one in a dataframe called dframe1) and fill it with your vector using the gets notation (<-)
data$H.index <- c(127, 125, 98, 57, 71, 126, 172, 102) # Add new column called 'H.index' with listed data
data
## Title Publisher SJR H.index
## 1 Ecol.Lett BP 7.996 127
## 2 Annu.Rev.Ecol.Evol.S ARI 7.019 125
## 3 Ecol.Monogr ESA 5.210 98
## 4 Front.Ecol.Environ ESA 3.644 57
## 5 Glob.Ecol.Biogeogr BP 3.477 71
## 6 Glob.Change.Biol BP 3.425 126
## 7 Ecology ESA 3.246 172
## 8 J.Ecol BP 3.145 102

Subsetting data

  • Subsetting allows you to focus on certain sections of your dataframe (e.g. you have a dataframe about Pokémon, but you want to narrow your analysis down to bug-types only).
  • You can select the data you want by matching it to a list, using using logical operators to simplify the process: == means equal to; != means not equal to; %in% means match to a list of options.

(1) Use the ‘subset’ function

data.BP <- subset(data, Publisher == "BP") # Dataframe containing publisher BP only
data.notBP <- subset(data, Publisher != "BP") # Dataframe containing all publishers except BP
data.ARIandESA <- subset(data, Publisher %in% c("ARI", "ESA")) # Dataframe containing publishers ARI and ESA only

(2) Use square-bracket notation (indexing)

data.BP <- data[data$Publisher == "BP", ] # Dataframe containing publisher BP only
data.notBP <- data[data$Publisher != "BP", ] # Dataframe containing all publishers except BP
data.ARIandESA <- data[data$Publisher %in% c("ARI", "ESA"), ] # Blank space after the comma in the square brackets indicates inclusion of all columns
data.ARIandESA <- data[data$Treatment %in% c("ARI", "ESA"), c(2, 3)] # If we only want some of the columns (e.g. columns 2 and 3)

Renaming factor levels

  • This is similar to renaming columns (above), but uses the levels function instead of the names function.
summary(data$Title) # List original level names
## Annu.Rev.Ecol.Evol.S Ecol.Lett Ecol.Monogr
## 1 1 1
## Ecology Front.Ecol.Environ Glob.Change.Biol
## 1 1 1
## Glob.Ecol.Biogeogr J.Ecol
## 1 1
levels(data$Title) <- c("AREES", "EL", "EM", "Ecol", "FEE", "GCB", "GEB", "JE") # Replacing level names as ordered in the summary (alphabetically)
summary(data$Title) # List new level names
## AREES EL EM Ecol FEE GCB GEB JE
## 1 1 1 1 1 1 1 1

Reordering factor levels

  • By default, factors are listed alphabetically, but you may want them to appear in a different order (this allows you to plot these data in the order you specify, too).
summary(data$Title) # Check the levels
## AREES EL EM Ecol FEE GCB GEB JE
## 1 1 1 1 1 1 1 1
data$Title <- factor(data$Title, levels = c("EM", "AREES", "FEE", "GEB", "EL",
"Ecol", "GCB", "JE")) # Order the levels as you please
summary(data$Title) # Check order
## EM AREES FEE GEB EL Ecol GCB JE
## 1 1 1 1 1 1 1 1

Collapsing factor levels

  • This can be done to simplify a model; perhaps there is not enough of a difference between Factors A and B, so they can be considered together.

(1) Create new levels to collapse current levels into

data$Collapsed.titles <- data$Title # Copy 'Title' column to new column 'Collapsed.title'
levels(data$Collapsed.titles) # Factor list is AREES, EL, EM, Ecol, FEE, GCB', GEB, JE
## [1] "EM" "AREES" "FEE" "GEB" "EL" "Ecol" "GCB" "JE"
levels(data$Collapsed.titles) <- c(levels(data$Collapsed.titles), "Top4", "Next4") # Introduce levels to collapse others into

(2) Collapse the current levels into the new levels

data$Collapsed.titles[data$Collapsed.titles %in% c("EL", "AREES", "EM", "FEE")] <- "Top4" # Four journals with highest SJR
data$Collapsed.titles[data$Collapsed.titles %in% c("GEB", "GCB", "Ecol", "JE")] <- "Next4" # Four journals with next highest SJR

(3) Remove the now-unwanted old levels

summary(data$Collapsed.titles) # Unwanted levels still present
## EM AREES FEE GEB EL Ecol GCB JE Top4 Next4
## 0 0 0 0 0 0 0 0 4 4
data <- droplevels(data) # Drop the unwanted levels
summary(data$Collapsed.titles) # Unwanted levels gone
## Top4 Next4
## 4 4

Appending dataframes

  • This allows you to attach two dataframes together vertically (‘one on top of the other’).
  • The column names in each dataframe must match for this to work.

(1) Set-up two example dataframes to append

Title <- c("EL", "AREES", "EM", "FEE", "GEB", "GCB", "Ecol", "JE")
Publisher <- c("BP", "ARI", "ESA", "ESA", "BP", "BP", "ESA", "BP")
SJR <- c(7.996, 7.019, 5.21, 3.644, 3.477, 3.425, 3.246, 3.145)
H.index <- c(127, 125, 98, 57, 71, 126, 172, 102)
data <- data.frame(Title, Publisher, SJR, H.index)
data
## Title Publisher SJR H.index
## 1 EL BP 7.996 127
## 2 AREES ARI 7.019 125
## 3 EM ESA 5.210 98
## 4 FEE ESA 3.644 57
## 5 GEB BP 3.477 71
## 6 GCB BP 3.425 126
## 7 Ecol ESA 3.246 172
## 8 JE BP 3.145 102
Title <- c("Ecog", "DD")
Publisher <- c("BP", "BP")
SJR <- c(2.87, 2.811)
H.index <- c(72, 60)
Published.2012 <- c(134, 116)
data2 <- data.frame(Title, Publisher, SJR, H.index, Published.2012)
data2
## Title Publisher SJR H.index Published.2012
## 1 Ecog BP 2.870 72 134
## 2 DD BP 2.811 60 116

(2) Check column names match

names(data) == names(data2) # Check that columns match ('data2' has an extra column)
## Warning: longer object length is not a multiple of shorter object length
## [1] TRUE TRUE TRUE TRUE FALSE
data2$Published.2012 <- NULL # Effectively deletes the column
names(data) == names(data2) # Check again that columns match
## [1] TRUE TRUE TRUE TRUE

(3) Append

data.2and3 <- rbind(data, data2) # Append the dataframes
data.2and3 # Check for success
## Title Publisher SJR H.index
## 1 EL BP 7.996 127
## 2 AREES ARI 7.019 125
## 3 EM ESA 5.210 98
## 4 FEE ESA 3.644 57
## 5 GEB BP 3.477 71
## 6 GCB BP 3.425 126
## 7 Ecol ESA 3.246 172
## 8 JE BP 3.145 102
## 9 Ecog BP 2.870 72
## 10 DD BP 2.811 60

Merging dataframes

  • You have a second dataframe with further information about levels in the first dataframe.
  • For instance, our example dataframe can be merged with a dataframe containing further information about the publishers.

(1) Set-up two example dataframes to merge

Title <- c("EL", "AREES", "EM", "FEE", "GEB", "GCB", "Ecol", "JE")
Publisher <- c("BP", "ARI", "ESA", "ESA", "BP", "BP", "ESA", "BP")
SJR <- c(7.996, 7.019, 5.21, 3.644, 3.477, 3.425, 3.246, 3.145)
H.index <- c(127, 125, 98, 57, 71, 126, 172, 102)
data <- data.frame(Title, Publisher, SJR, H.index)
data
## Title Publisher SJR H.index
## 1 EL BP 7.996 127
## 2 AREES ARI 7.019 125
## 3 EM ESA 5.210 98
## 4 FEE ESA 3.644 57
## 5 GEB BP 3.477 71
## 6 GCB BP 3.425 126
## 7 Ecol ESA 3.246 172
## 8 JE BP 3.145 102
Publisher <- c("BP", "ARI", "ESA")
Country <- c("UK", "USA", "USA")
data3 <- data.frame(Publisher, Country) # Note that this dataframe contains further information about factors in the dataframe 'data' (the country that each publisher operates from)
data3
## Publisher Country
## 1 BP UK
## 2 ARI USA
## 3 ESA USA

(2) Merge the two dataframes

data.merge <- merge(data, data3, by = "Publisher") # We've now added extra information
data.merge
## Publisher Title SJR H.index Country
## 1 ARI AREES 7.019 125 USA
## 2 BP EL 7.996 127 UK
## 3 BP GEB 3.477 71 UK
## 4 BP GCB 3.425 126 UK
## 5 BP JE 3.145 102 UK
## 6 ESA EM 5.210 98 USA
## 7 ESA FEE 3.644 57 USA
## 8 ESA Ecol 3.246 172 USA

This post was partially created using Markdown and Knitr in RStudio (combining Adriana’s twin passions of R and knitting).

If you desperately need to manipulate your data in spreadsheet form, you can always use the edit() function to call a pseudo-spreadsheet within R (witchcraft!). Altering these numbers does not change the original data file, and the console window gives you a summary of the changes.

Advertisements

2 responses to “Basic data-frame manipulations in R

  1. Hey There. I found your blog the usage of msn. This is an extremely well written article.
    I’ll be sure to bookmark it and return to read more of your helpful info.
    Thanks for the post. I’ll certainly return.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s