Want Some Data With Those Oysters?

[Note that this blog has now moved to rostrum.blog where you’ll find a lot more R content.]


Photo: Matt Buck CC BY-SA 2.0 (click for source)

Nicholas ‘Feltron’ Felton recently released his tenth and final annual report. The subject? His ‘quantified self’.

Apps, sensors and GPS have made auto-analytics (self-tracking? life-logging?) easy. I decided to investigate my daily commute, which is tracked through my use of an Oyster card. I starting receiving travel statements auto-generated by Transport for London (TfL) just over a year ago. I collated the data for weekday train journeys to and from the stations nearest my home and workplace (Zones 5 and 1). The plot is below.

Journey duration

Trip duration for weekday train journeys to (outbound) and from (inbound) my workplace between August 2014 and September 2015. The solid red line is the mean duration; dashed red is a linear fit. Click to enlarge.

My average outbound journey (just over 46 mins) was slightly faster than the average return (about 47.5 mins). This might because delays and track problems have longer to build up over the course of a day. These problems are usually signal or point failures on my route, though there was also a ‘person on the line’ a few times and even a ‘deer strike’.

The average outbound journey duration has got marginally quicker over time; the inbound journey slower. This is likely because I moved house in May and trains are far less frequent to and from my nearest train station. Missed connections can add a lot of time to the total duration.This may also be a reason for the higher mean inbound journey duration and is the likely cause of my most arduous journey: an 84 min homeward journey on Monday 6 July 2015. (This was beaten by a two and a half hour trip recently, but this involved a bus journey and is not included here.)

There were 241 outbound records and 218 inbound. There were fewer inbound because I haven’t included any non-regular routes home (e.g. I met with friends and headed home from a different station than normal). I took 416 weekday journeys to work by train in the year from 1 October 2014 to 30 September 2015, costing £5.26 per trip. Of course, I actually took more journeys in total including both bus and train, so the cost was more like £4.58 per journey. That’s not bad.


Photo: PaulKitchener CC BY-NC 2.0 (click for source)

Oyster-data wrangling in R

There a few problems with the format of the Comma-Separated Values (CSV) files sent by TfL, including that:

  • row 1 is blank
  • the start and end destinations are combined in one column
  • there is no column identifying train and bus journeys separately
  • purchases and journeys are in the same columns

Here are a few snippets of code to prepare the data from these CSVs for analysis in R. Bear in mind that this will extract information for train journeys only.

First set the working directory to the folder containing the CSV files only.

setwd(dir = "~/Documents/OysterData")  # example filepath

Create an object containing a list of the names of the CSV files.

oy.file.list <- list.files(pattern = "*.csv")

Read in the CSV files for each month and join them together

oy.df <- do.call(  # dataframe of all records
    "rbind"  # append function
  , lapply(oy.file.list
    , function(x) read.csv(x  # read files
      , stringsAsFactors = FALSE  # character strings
      , skip = 1  # ignore blank first row
      , header = TRUE  # keep column names
      , na.strings = ""  # blanks as NAs

The start and end stations of each trip are in the form ‘Station A to Station B’ and must be split into two separate columns. Then the columns can be renamed.

library(stringr) # package required

station.mat <- str_split_fixed(
    oy.df$Journey.Action # col to split
  , " to "  # start/end stations before/after this string
  , 2  # num cols to split into

oy.df2 <- cbind(oy.df, station.mat)  # add station cols

names(oy.df2)[c(9, 10)] <- c("station.start", "station.end")  # rename cols

Identify outbound and inbound journeys (e.g. outbound starting at station A, B or C and ending at X, Y or Z; vice versa for inbound journeys). I created a new column called ‘direction’ and filled it with NAs then filled it with ‘outbound’ or ‘inbound’ if starting and ending at these stations.

oy.df2$direction <- NA  # create col of NAS

oy.df2$direction[oy.df2$station.start %in% c(
    "Station Name A"
  , "Station Name B"
  , "Station Name C"
  ) & oy.df2$station.end %in% c(
    "Station Name X"
  , "Station Name Y"
  , "Station Name Z"
  )  # if starting at A/B/C and ending at X/Y/Z...
] <- "outbound"  # ...fill col as outbound

oy.df2$direction[oy.df2$station.start %in% c(
    "Station Name X"
  , "Station Name Y"
  , "Station Name Z"
  ) & oy.df2$station.end %in% c(
    "Station Name A"
  , "Station Name B"
  , "Station Name C"
] <- "inbound"

The date (DD-MMM-YYYY) and time (HH:MM) character columns must be combined.

oy.df2$dmy.start <- paste(  # combine...
    oy.df2$Date  # the date col
  , oy.df2$Start.Time # the start time col
  , sep = " "  # and put a space between them
)  # results in DD-MMM-YYYY HH:MM

oy.df2$dmy.end <- paste(
  , oy.df2$End.Time
  , sep = " "

The data type should be changed (POSIXct) so the data are recognised as date-time.

library(lubridate)  # package required

oy.df2$dmy.start2 <- dmy_hm(oy.df2$dmy.start)  # YYYY-MM-DD HH:MM:SS for start of journey
oy.df2$dmy.end2 <- dmy_hm(oy.df2$dmy.end)  # same for end of journey

oy.df2$wday.start <- wday(oy.df2$dmy.start2, label = TRUE)  # day of week for start of journey
oy.df2$wday.end <- wday(oy.df2$dmy.end2, label = TRUE)  # day of week for end of journey

oy.df2$t.dif <- oy.df2$dmy.end2-oy.df2$dmy.start2  # time difference from start to end of journey

I concentrated on journeys on work-days, so I extracted only the weekday trips.

oy.wday.df <- droplevels(subset(oy.df2, wday.start != "Sat" & wday.start != "Sun"))

Photo: Scott Cawley CC BY-NC 2.0 (click for source)


One response to “Want Some Data With Those Oysters?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s