[Note that this blog has now moved to rostrum.blog where you’ll find a lot more R content.]
Apps, sensors and GPS have made auto-analytics (self-tracking? life-logging?) easy. I decided to investigate my daily commute, which is tracked through my use of an Oyster card. I starting receiving travel statements auto-generated by Transport for London (TfL) just over a year ago. I collated the data for weekday train journeys to and from the stations nearest my home and workplace (Zones 5 and 1). The plot is below.
My average outbound journey (just over 46 mins) was slightly faster than the average return (about 47.5 mins). This might because delays and track problems have longer to build up over the course of a day. These problems are usually signal or point failures on my route, though there was also a ‘person on the line’ a few times and even a ‘deer strike’.
The average outbound journey duration has got marginally quicker over time; the inbound journey slower. This is likely because I moved house in May and trains are far less frequent to and from my nearest train station. Missed connections can add a lot of time to the total duration.This may also be a reason for the higher mean inbound journey duration and is the likely cause of my most arduous journey: an 84 min homeward journey on Monday 6 July 2015. (This was beaten by a two and a half hour trip recently, but this involved a bus journey and is not included here.)
There were 241 outbound records and 218 inbound. There were fewer inbound because I haven’t included any non-regular routes home (e.g. I met with friends and headed home from a different station than normal). I took 416 weekday journeys to work by train in the year from 1 October 2014 to 30 September 2015, costing £5.26 per trip. Of course, I actually took more journeys in total including both bus and train, so the cost was more like £4.58 per journey. That’s not bad.
Oyster-data wrangling in R
There a few problems with the format of the Comma-Separated Values (CSV) files sent by TfL, including that:
- row 1 is blank
- the start and end destinations are combined in one column
- there is no column identifying train and bus journeys separately
- purchases and journeys are in the same columns
Here are a few snippets of code to prepare the data from these CSVs for analysis in R. Bear in mind that this will extract information for train journeys only.
First set the working directory to the folder containing the CSV files only.
setwd(dir = "~/Documents/OysterData") # example filepath
Create an object containing a list of the names of the CSV files.
oy.file.list <- list.files(pattern = "*.csv")
Read in the CSV files for each month and join them together
oy.df <- do.call( # dataframe of all records "rbind" # append function , lapply(oy.file.list , function(x) read.csv(x # read files , stringsAsFactors = FALSE # character strings , skip = 1 # ignore blank first row , header = TRUE # keep column names , na.strings = "" # blanks as NAs ) ) )
The start and end stations of each trip are in the form ‘Station A to Station B’ and must be split into two separate columns. Then the columns can be renamed.
library(stringr) # package required station.mat <- str_split_fixed( oy.df$Journey.Action # col to split , " to " # start/end stations before/after this string , 2 # num cols to split into ) oy.df2 <- cbind(oy.df, station.mat) # add station cols names(oy.df2)[c(9, 10)] <- c("station.start", "station.end") # rename cols
Identify outbound and inbound journeys (e.g. outbound starting at station A, B or C and ending at X, Y or Z; vice versa for inbound journeys). I created a new column called ‘direction’ and filled it with NAs then filled it with ‘outbound’ or ‘inbound’ if starting and ending at these stations.
oy.df2$direction <- NA # create col of NAS oy.df2$direction[oy.df2$station.start %in% c( "Station Name A" , "Station Name B" , "Station Name C" ) & oy.df2$station.end %in% c( "Station Name X" , "Station Name Y" , "Station Name Z" ) # if starting at A/B/C and ending at X/Y/Z... ] <- "outbound" # ...fill col as outbound oy.df2$direction[oy.df2$station.start %in% c( "Station Name X" , "Station Name Y" , "Station Name Z" ) & oy.df2$station.end %in% c( "Station Name A" , "Station Name B" , "Station Name C" ) ] <- "inbound"
The date (DD-MMM-YYYY) and time (HH:MM) character columns must be combined.
oy.df2$dmy.start <- paste( # combine... oy.df2$Date # the date col , oy.df2$Start.Time # the start time col , sep = " " # and put a space between them ) # results in DD-MMM-YYYY HH:MM oy.df2$dmy.end <- paste( oy.df2$Date , oy.df2$End.Time , sep = " " )
The data type should be changed (POSIXct) so the data are recognised as date-time.
library(lubridate) # package required oy.df2$dmy.start2 <- dmy_hm(oy.df2$dmy.start) # YYYY-MM-DD HH:MM:SS for start of journey oy.df2$dmy.end2 <- dmy_hm(oy.df2$dmy.end) # same for end of journey oy.df2$wday.start <- wday(oy.df2$dmy.start2, label = TRUE) # day of week for start of journey oy.df2$wday.end <- wday(oy.df2$dmy.end2, label = TRUE) # day of week for end of journey oy.df2$t.dif <- oy.df2$dmy.end2-oy.df2$dmy.start2 # time difference from start to end of journey
I concentrated on journeys on work-days, so I extracted only the weekday trips.
oy.wday.df <- droplevels(subset(oy.df2, wday.start != "Sat" & wday.start != "Sun"))