1 Introduction

The aim of this project is to analyse and gain insights into the open source New York City Taxi and Limousine Service Trip Record Data. In this notebook, we will tackle a research problem and perform analysis on millions of taxi trip records — all in a reproducible manner with R, in that the results are accompanied by the data and code needed to produce them.[1]

The source code of this R Markdown document can be found in the GitHub repo if one wishes to reproduce the exact results of this study. Likewise, an initial exploratory analysis of the data can be viewed here.


2 Research Problem

Although predicting taxi fares may not be as rewarding as predicting airline fares[2], it is a good place to start practicing, given the extensive amount of open source data now available.

In this study, we are interested in understanding how taxis arrive at the fares to be paid by passengers. To do so, we will first assess and report the relation between the taxi fare (response) and other attributes (explanatory). Following that, we will build statistical models in order to predict taxi fares for rides in New York City given the input variables. The final model could allow passengers to make use of this knowledge before calling for a taxi service.

2.1 Hypothesis

To approach this research problem, we will first list a set of hypotheses, which in our case are potential factors that could affect the cost of a taxi trip.

  1. Trip distance: Taxi fare is higher for long rides.
  2. Trip duration: Trips caught in traffic jam may have higher fares.
  3. Pickup or dropoff location: Fares may be different depending on the location travelling to/from.
  4. Time of travel: Fare amount may differ during peak and non-peak hours.
  5. Day of travel: On weekends, taxi fares may be higher.
  6. Trip to/from airport: Rides to airport have a flat fare[3].
  7. Weather conditions: Bad weather conditions may affect taxi availability and thus result in higher fares.

For the purpose of this study, we will use the following R packages:

library(tidyverse)
library(data.table)
library(rgdal)
library(lubridate)
library(geosphere)
library(corrplot)
library(h2o)
  • tidyverse: for general dataframe manipulation and plotting
  • data.table: to enable efficient reading and pre-processing of datasets
  • rgdal: R’s interface to the popular C/C++ spatial data processing library GDAL
  • lubridate: makes it easier to manipulate datetime variables
  • geosphere: spherical trigonometry for geographic applications
  • corrplot: package for graphical display of a correlation matrix
  • h2o: R interface to distributed in-memory machine learning platform

3 Data Preparation

In any analytics project, 80% of the time and effort is spent on preparing, cleaning and organizing data for analysis[4]. This section briefly describes the data used and the steps taken to process and prepare them for the study.

3.1 Data Loading

We will use the following open source datasets in the analysis.

3.1.1 New York City Taxi & Limousine Commission (TLC) Service Trip Record Data

This is the primary dataset we will be using in this study. It covers trips taken in various types of licensed taxi and limousine services in the New York City area from 2009 to 2019. Briefly, each taxi trip record includes the pick-up and drop-off datetime and location, trip distance, itemized fare, payment type, and driver-reported passenger count.
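
As a quick sanity check, one could peek at the first few rows of a single raw monthly file before running the full pipeline described later. The snippet below is only a sketch; the filename is illustrative and assumes the monthly CSVs live under data/.

# peek at a handful of records from one raw monthly file (illustrative filename)
peek <- fread("data/green_tripdata_2014-01.csv", nrows=5)
str(peek)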

3.1.2 TLC Taxi Zone Data

Taxi zone location information from the TLC. It includes a lookup table that maps each TLC taxi zone location ID to its location name and corresponding borough, as well as a polygon shapefile containing the boundaries of the TLC taxi zones.

# read in taxi zones information
zones_lookup <- read_csv("data/taxi+_zone_lookup.csv", col_types="iccc")
taxi_zones <- readOGR(dsn = "data/taxi_zones/taxi_zones.shp", verbose=FALSE)
taxi_zones$id <- row.names(taxi_zones)

# merge zone information with taxi service zone in lookup table
taxi_zones@data <- taxi_zones@data %>%
  left_join(zones_lookup %>% 
              select(-LocationID, -Borough) %>%
              unique(), 
            by=c('zone' = 'Zone')) %>%
  mutate(zone = as.factor(zone),
         service_zone = as.factor(service_zone))

# reproject to commonly used CRS
taxi_zones <- spTransform(taxi_zones, CRS("+init=epsg:4326"))

3.1.3 Holidays

A dataset created in Python to support the taxi fare analysis. It contains the list of public holidays for New York at the state level for the years 2014 and 2015.

#############################################
## python code to generate ny_holidays.csv ##
# from datetime import date
# import pandas as pd
# import holidays
# 
# ny_holidays = holidays.US(state='NY', years=[2014, 2015])
# 
# df = pd.DataFrame(list(ny_holidays.items()), columns=["Date", "Holiday"])
# 
# df.to_csv("ny_holidays.csv", index=False)
############ end of python code #############

holidays <- read.csv("data/ny_holidays.csv") %>%
  mutate(Date = ymd(Date))

3.1.4 New York City’s Central Park Weather Data

In order to study how weather conditions affect taxi fares (hypothesis 7) in New York City, we will use the New York City Central Park weather data from the National Climatic Data Center, which contains daily weather observations at Central Park for the years 2014 and 2015.
One limitation of using this dataset is that we need to assume the whole of New York City shares the same weather conditions as Central Park. Moreover, aggregated daily data may be crude, as the weather can fluctuate from hour to hour.

# read in weather data
weather <- read.csv("data/ny_city_central_park_weather.csv") %>%
  mutate(DATE = ymd(DATE),
         TAVG = (TMIN + TMAX) / 2,
         AWND = ifelse(is.na(AWND), mean(AWND, na.rm=TRUE), AWND),
         AWND = round(AWND, digits=2)) %>%
  select(-NAME, -TMAX, -TMIN)

3.2 Data Selection

We will look at green taxi trips for this study. Green taxis (as opposed to yellow ones) are taxis that are not allowed to pick up passengers inside the densely populated areas of Manhattan. They have not attracted as much attention from the research community as the relatively larger yellow taxi dataset, but they may also contain interesting human behaviour-related information.

3.2.1 Period Selection

The entire green taxi dataset is large, covering the years 2013 to 2019. For the purpose of this study, we will look only at the data from 2014 to 2015, as it contains precise pickup and dropoff location information in latitude and longitude – which is helpful for deriving the neighbourhoods and identifying airport trips (as needed for hypotheses 3 and 6). Furthermore, spanning the analysis across whole years allows us to take into account the effect of seasonality.

3.2.2 Attributes Selection

With all the data as above, we will look at a number of attributes for the study. Specifically, the following attributes will be used to aid in our analysis:

[Total Fare, Pickup and Dropoff Location, Pickup and Dropoff Day and Time, Trip Distance and Duration, Weather Condition and Holiday]

The selection of these attributes is justified by hypotheses 1 to 7, in that we need them in order to infer the results of the hypotheses. For the purpose of this study, we will regard Total Fare as the amount required to be paid by the passenger, excluding tips.

# function to compute taxi fare
compute_total_fare <- function(taxi_data) {
  taxi_data %>%
    mutate(Total_fare = Total_amount - Tip_amount)
}

# function to compute trip duration
compute_trip_duration <- function(taxi_data) {
  taxi_data %>%
    rename(Pickup_datetime = lpep_pickup_datetime,
           Dropoff_datetime = Lpep_dropoff_datetime) %>%
    mutate(Pickup_datetime = ymd_hms(Pickup_datetime),
           Dropoff_datetime = ymd_hms(Dropoff_datetime),
           Trip_duration = difftime(Dropoff_datetime, Pickup_datetime, units="mins"),
           Trip_duration = as.numeric(Trip_duration) %>% round(digits=2))
}

3.3 Data Cleansing

It is important to clean the data, as removing incorrect information improves data quality and, in doing so, increases overall productivity. The trip data has been checked, and none of the rows with the chosen attributes contains missing values. Next, the data was checked for abnormal or incorrect records. This step is important as it ensures the data is correct, consistent and usable by identifying any errors or inaccurate information. Briefly, bad records or outliers have been filtered out according to the following criteria:

  • Number of passengers in the vehicle \(= 0\) or \(> 6\): There is a maximum passenger limit imposed on any cab[5].
  • Pickup/Dropoff latitude \(\notin (39, 42)\) or Pickup/Dropoff longitude \(\notin (-76, -72)\): Only trips within New York[6] are of interest.
  • Total fare \(\le 0\) or \(> 300\) dollars: Remove non-positive values and fares above the 0.999 quantile (outliers).
  • Trip duration \(< 1\) minute or \(> 12\) hours: Remove trips that are too short or too long (outliers).
clean_taxi_data <- function(taxi_data) {
  taxi_data %>%
    # remove rows with missing value(s)
    drop_na() %>%
    # data cleansing
    filter(Passenger_count > 0, Passenger_count <= 6,
           Pickup_latitude > 39, Pickup_latitude < 42,
           Pickup_longitude > -76, Pickup_longitude < -72,
           Dropoff_latitude > 39, Dropoff_latitude < 42,
           Dropoff_longitude > -76, Dropoff_longitude < -72,
           Total_fare > 0, Total_fare < 300,
           Trip_duration > 1, Trip_duration < (12 * 60))
}

3.4 Data Preprocessing

Data preprocessing has been carried out in order to prepare the data for our analysis. Very briefly, the steps involved in transforming the data into a format useful for our study are:

  1. All the taxi trips’ pickup and dropoff locations were overlaid on the spatial polygons defining each taxi zone to find the pickup and dropoff zones associated with each trip. This information is useful for downstream spatial analysis of taxi fares by taxi zone.
spatial_overlap_zones <- function(taxi_data, zones=taxi_zones) {
  # find intersect between trip locations and taxi zones
  pickup_zones <- SpatialPointsDataFrame(coords=taxi_data[, c('Pickup_longitude', 'Pickup_latitude')],
                                         data=taxi_data,
                                         proj4string=CRS(proj4string(zones))) %>%
    over(zones) %>%
    pull(zone)
  
  dropoff_zones <- SpatialPointsDataFrame(coords=taxi_data[, c('Dropoff_longitude', 'Dropoff_latitude')],
                                          data=taxi_data,
                                          proj4string=CRS(proj4string(zones))) %>%
    over(zones) %>%
    pull(zone)
  
  taxi_data %>%
    mutate(Pickup_zone = factor(pickup_zones, levels=levels(zones$zone)),
           Dropoff_zone = factor(dropoff_zones, levels=levels(zones$zone))) %>%
    # some trips may not fall in known taxi zones
    drop_na()
}
  2. Identify airport rides as trips that either
  • start or terminate at the 3 NYC area airports: John F. Kennedy International Airport (JFK), LaGuardia Airport (LGA), and the Newark Liberty International Airport (EWR) or
  • have RateCodeID of 2 (JFK) or 3 (EWR)
# find airport zones
airport_zones <- taxi_zones@data %>%
  filter(service_zone %in% c("Airports", "EWR")) %>%
  pull(zone) %>%
  as.vector()


identify_airport_trips <- function(taxi_data, zones=airport_zones) {
  taxi_data %>%
    mutate(Airport_trip = Pickup_zone %in% zones |
                          Dropoff_zone %in% zones |
                          RateCodeID %in% c(2, 3))
}
  3. Extract month, day of week and hour from each trip, and then determine the season[7], weekday/weekend and time of day for all trips.
# helper function to return US season
season <- function(datetime) {
  m <- month(datetime)
  terms <- c("Spring", "Summer", "Autumn", "Winter")
  s <- factor(character(length(m)), levels=terms)
  s[m %in% c(3, 4, 5)] <- terms[1]
  s[m %in% c(6, 7, 8)] <- terms[2]
  s[m %in% c(9, 10, 11)] <- terms[3]
  s[m %in% c(12, 1, 2)] <- terms[4]
  s
}


# function to extract day time-related features
extract_time_features <- function(taxi_data) {
  taxi_data %>%
    mutate(Pickup_month = month(Pickup_datetime, label=TRUE),
           Pickup_day = wday(Pickup_datetime, label=TRUE),
           Pickup_hour = hour(Pickup_datetime),
           Season = season(Pickup_datetime),
           Weekend = Pickup_day %in% c("Sat", "Sun"),
           Time = ifelse(Pickup_hour >= 6 & Pickup_hour < 18,
                         "Daytime",
                         "Nighttime"),
           Time = factor(Time, levels=c("Daytime", "Nighttime")))
}
  4. Compute the orthodromic distance – the shortest distance between two points on the surface of a sphere[8] (the underlying haversine formula is sketched after the preprocessing pipeline below).
# function to compute orthodromic distance
great_circle_distance <- function(taxi_data) {
  taxi_data %>%
    mutate(Great_circle_dist = distHaversine(
      cbind(Dropoff_longitude, Dropoff_latitude),
      cbind(Pickup_longitude, Pickup_latitude)) %>%
        round(digits=2))
}
  5. Combine with external data in order to gain information on weather conditions and whether the day is a holiday.
join_external_data <- function(taxi_data, holiday_data=holidays, weather_data=weather) {
  taxi_data %>%
    # determine if the day is holiday
    mutate(Pickup_date = as.Date(Pickup_datetime),
           Holiday = Pickup_date %in% holiday_data$Date) %>%
    # join with weather data
    left_join(weather_data, by=c("Pickup_date" = "DATE")) %>%
    select(-Pickup_date)
}

# function to preprocess data
preprocess_taxi_data <- function(taxi_data) {
  taxi_data %>%
    spatial_overlap_zones() %>%
    identify_airport_trips() %>%
    extract_time_features() %>%
    great_circle_distance() %>%
    join_external_data()
}
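
For reference, the great circle distance computed in step 4 by geosphere::distHaversine follows the haversine formula. For two points with latitudes \(\varphi_1, \varphi_2\), longitudes \(\lambda_1, \lambda_2\) and sphere radius \(r\) (geosphere defaults to an Earth radius in metres),

\[
d = 2r \arcsin\left(\sqrt{\sin^2\frac{\varphi_2 - \varphi_1}{2} + \cos\varphi_1 \cos\varphi_2 \sin^2\frac{\lambda_2 - \lambda_1}{2}}\right)
\]

which treats the Earth as a sphere; the approximation error is negligible for city-scale trips.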

3.5 Data Processing

With all the functions needed for cleaning and pre-processing the data in place, we can now read all green taxi trips from 2014 to 2015 from the local directory, pass them through the pipeline of data preparation steps, and obtain a final dataframe to be used in statistical modelling.

read_cols <- c("lpep_pickup_datetime", "Lpep_dropoff_datetime", "RateCodeID",
               "Pickup_longitude", "Pickup_latitude", 
               "Dropoff_longitude", "Dropoff_latitude",
               "Passenger_count", "Trip_distance", "Tip_amount", "Total_amount")

out_cols <- c("Season", "Pickup_month", "Pickup_day", "Weekend", "Pickup_hour", 
              "Time", "Pickup_zone", "Dropoff_zone", "Passenger_count",
              "Airport_trip", "Holiday", "AWND", "PRCP", "SNOW", "SNWD", "TAVG",
              "Trip_distance", "Trip_duration", "Great_circle_dist", "Total_fare")


read_one_taxi_data <- function(filename, read_columns=read_cols, out_columns=out_cols) {
  print(paste("Reading", filename))
  fread(filename, select=read_columns, fill=TRUE, showProgress=FALSE) %>%
    as_tibble() %>%
    compute_total_fare() %>%
    compute_trip_duration() %>%
    clean_taxi_data() %>%
    preprocess_taxi_data() %>%
    select(out_columns)
}

read_all_taxi_data <- function(path, pattern) {
  lapply(list.files(path, pattern, full.names=TRUE), 
         function(x) read_one_taxi_data(x, read_cols, out_cols)) %>%
    rbindlist()
}

# set to TRUE if want to rerun everything
dev <- FALSE

if (dev) {
  trips_data <- read_all_taxi_data("data", "green_tripdata")
  saveRDS(trips_data, file="green_trips_data_2014-15.rds")
}

The processed data has 34,095,643 rows and 20 columns and takes around 3.9~4 GB of RAM on the local machine. In the model fitting section, we will turn to the h2o package for its fast and scalable machine learning capabilities.
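
A minimal sketch of how these figures can be checked, assuming trips_data is still in memory (i.e. dev is TRUE):

if (dev) {
  # dimensions and approximate in-memory size of the processed data
  dim(trips_data)
  print(object.size(trips_data), units="GB")
}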


4 Analysis

Before jumping into fitting statistical models, we may want to first analyse and report taxi fares (the response) via descriptive statistics across groups of explanatory variables. In our case, we will do this on a random subsample of the prepared dataset containing 1 million records. A reasonable assumption here is that the records in this subset form a fair, representative sample of the entire dataset.

if (dev) {
  # randomly subset
  set.seed(100)
  n <- 1e6
  subsetted <- sample_n(trips_data, n)
  saveRDS(subsetted, file="1m_green_trips_data_2014-15.rds")
} else {
  # read previously saved dataframe
  subsetted <- readRDS("1m_green_trips_data_2014-15.rds")
}

4.1 Trip distance and duration

In an attempt to infer the result of hypothesis 1, we can see the effect of trip distance and duration on taxi fare by stratifying the observations into groups of different distance and duration. In Figure 1, we can see that trips with long durations are generally associated with higher fares, whereas trip distance seems to have a smaller effect on the fare. Note that distance and duration were discretised into bins such that there is an equal number of observations in each bin. A summary of the mean total fare for each group is also provided below, as shown in Table 1.

# discretise trip distance and duration
n_bins <- 3
subsetted <- subsetted %>%
  mutate(Distance = cut_number(Trip_distance, n_bins),
         Duration = cut_number(Trip_duration, n_bins)) 

# plot vs taxi fare
subsetted %>%
  ggplot(aes(x=Distance, y=Total_fare, fill=Duration)) +
  geom_boxplot() +
  ylim(c(0, 50)) +
  theme_minimal() +
  labs(title = "Trip distance and duration vs Total fare")

Figure 1: Boxplot of taxi fares, stratified by groups of discretised trip distance and duration. (Cutoff at y=50 for visualization purpose)

# table of summary
subsetted %>%
  group_by(Distance, Duration) %>%
  summarise(Mean_fare = mean(Total_fare)) %>%
  spread(Duration, Mean_fare) %>%
  rename("Distance / Duration" = Distance) %>%
  knitr::kable(digits=2, format="pandoc", 
               caption="Table 1: Mean total fare for each distance and duration group")
Table 1: Mean total fare for each distance and duration group
Distance / Duration    [1.02,7.5]   (7.5,14.2]   (14.2,718]
--------------------  -----------  -----------  -----------
[0,1.35]                     6.46         8.63        13.67
(1.35,3]                     8.51        10.58        14.09
(3,105]                     13.01        15.38        24.34

4.2 Day of week and time

To get a sense of how day and time affect the taxi fare (hypotheses 4 and 5), we may want to inspect the mean taxi fare across different days of the week for both day and night trips. Note that day trips were defined as trips that took place from 6:00 to 18:00, and night trips as trips from 18:00 to 6:00. From Figure 2, we can see that day trips and night trips have very different mean fares during the weekdays. This suggests that time of day may be a factor affecting the taxi fare, depending on whether it is a weekday or the weekend. We can infer that night trips generally have lower taxi fares compared to day trips during weekdays.

# plot day of week and time of day vs fare
subsetted %>%
  group_by(Pickup_day, Time) %>%
  summarise(Total_fare = mean(Total_fare)) %>%
  ggplot(aes(x=Pickup_day, y=Total_fare, color=Time, group=Time)) + 
  geom_point(aes(pch=Time), size=2) +
  geom_line(aes(linetype=Time), size=1) +
  theme_minimal() +
  labs(title = "Pickup day and time vs (Mean) total fare")

Figure 2: Taxi fares of day trips and night trips across different days of week.

4.3 Effect of trip to/from airport

To understand the effect of airport rides on taxi fares (hypothesis 6), a plot of the distribution of taxi fares for airport and non-airport trips is presented in Figure 3. As expected, taxi fares show very different distributions under the two scenarios, which could not otherwise be easily detected without grouping. For completeness, a summary of taxi fares is also provided in Table 2, which again shows that the distribution of fares differs between airport and non-airport trips.

# plot airport/non-airport vs fare
subsetted %>%
  ggplot(aes(x=Total_fare, fill=Airport_trip)) +
  geom_density(alpha=0.4) +
  xlim(c(0, 100)) +
  theme_minimal() +
  scale_fill_brewer(palette="Dark2") +
  labs(title = "Distribution of total fare for airport/non-airport trips")

Figure 3: Density plot showing distribution of airport and non-airport taxi fares. (Cutoff at x=100 for visualization purpose)

# summary table
subsetted %>%
  group_by(Airport_trip) %>%
  do(as.data.frame(as.list(summary(.$Total_fare)), check.names=FALSE)) %>%
  knitr::kable(digits=2, format="pandoc", 
               caption="Table 2: A summary table of distribution of fares for airport vs non-airport rides")
Table 2: A summary table of distribution of fares for airport vs non-airport rides
Airport_trip    Min.   1st Qu.   Median    Mean   3rd Qu.    Max.
-------------  -----  --------  -------  ------  --------  ------
FALSE           0.01       7.5     10.5   12.98      16.0   292.3
TRUE            0.31      18.8     31.5   33.35      43.3   224.8

4.4 Correlation

Understanding the correlation between the explanatory variables may also be useful prior to building statistical models relating taxi fare to the other attributes. In this regard, a correlation matrix plot is presented in Figure 4. From the plot, we can see that total fare is weakly correlated or uncorrelated with the different weather condition measures (hypothesis 7). This potentially arises from using a crude, coarse-grained weather dataset. On the other hand, trip distance, trip duration and the computed great circle distance seem to have strong positive correlations with the response. While great circle distance is strongly correlated with trip distance, we expect either one of them, together with trip duration, to be strong predictors of taxi fare.

# compute correlation
corr <- subsetted %>%
  select(-Pickup_hour) %>%
  select_if(is.numeric) %>%
  cor()

# save space
rm(subsetted)


corrplot(corr, type="lower", method="color", 
         addCoef.col="black", order="hclust",
         tl.srt = 45, tl.col="black", 
         number.cex=1, diag=FALSE, 
         title="Correlation of numeric variables",
         mar=c(0, 0, 1, 0))

Figure 4: Correlation plot of numeric explanatory variables. Notice that number of passengers seems irrelevant to any other attributes and could simply be removed in downstream analysis or ignored.


5 Fit model

We will now build statistical models, each with different underlying assumptions about the data, to better explain the relation between taxi fare and the other attributes. Due to time and the physical limitations of RAM and disk storage, we will only build and train models on 3 million records – randomly sampled from the entire processed dataset. To do this, as mentioned previously, we will use R’s interface to h2o, which handles large scale machine learning by leveraging the computing power of distributed systems and utilizing parallelized algorithms similar to in-memory mapreduce operations.

Training and validation sets, each containing 3 million records, are first created.

if (dev) {
  set.seed(100)
  n <- 3e6
  train_idx <- sample(nrow(trips_data), n)
  
  # split to training and validation set
  train <- trips_data[train_idx,] %>%
    mutate(Pickup_month = factor(Pickup_month, ordered=FALSE),
           Pickup_day = factor(Pickup_day, ordered=FALSE))
  
  test <- trips_data[-train_idx,] %>%
    sample_n(n) %>%
    mutate(Pickup_month = factor(Pickup_month, ordered=FALSE),
           Pickup_day = factor(Pickup_day, ordered=FALSE))
}

In the context of fitting statistical models, we will build 3 models with Total_fare as the response and the remaining attributes as predictors:

  • Multiple Regression (GLM): This is essentially linear regression, i.e. a Gaussian family model with the identity link function. It models the dependency between the response and predictors as a linear function. It has the advantage of being fast and easy to interpret compared with other machine learning models.
  • Distributed Random Forest (DRF): DRF generates a forest of regression trees, each of which is a weak learner built on a subset of observations and attributes. Averaging the predictions over all of the trees to make a final prediction usually results in a lower variance of the model.
  • Gradient Boosting Machine (GBM): A forward learning ensemble method which builds regression trees on all the features in the dataset in a fully distributed way – each tree is built in parallel. Briefly, weak regression algorithms are sequentially applied to the incrementally changed data to create a series of decision trees, producing an ensemble of weak prediction models. In general, boosting is a flexible nonlinear regression procedure that helps improve the accuracy of trees[9].

It should be noted that categorical encoding is carried out while fitting the GLM. The values of all numeric predictors are also standardized to have zero mean and unit variance for ease of comparison downstream.
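
For reference, standardization rescales each numeric predictor \(x_j\) to

\[
x_j' = \frac{x_j - \bar{x}_j}{s_{x_j}}
\]

where \(\bar{x}_j\) and \(s_{x_j}\) are its sample mean and standard deviation, so that the standardized GLM coefficients compared later in the variable importance section are on a common scale.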

capture.output(localH2O <- h2o.init(nthreads = -1), file="/dev/null")

if (dev) {
  # save space
  rm(trips_data)
  
  # initialise H2O
  train.h2o <- as.h2o(train)
  rm(train)
  test.h2o <- as.h2o(test)
  rm(test)
  
  response <- "Total_fare"
  predictors <- setdiff(names(train.h2o), response)
  
  
  # train GLM
  regression.model <- h2o.glm(y=response, x=predictors, training_frame=train.h2o, family="gaussian",
                               validation_frame=test.h2o, standardize=TRUE, 
                               lambda_search=FALSE, alpha=0, lambda=0)
  
  glm_path <- h2o.saveModel(regression.model, path="models/", force=TRUE)
  
  print(paste("GLM saved in", glm_path))
  
  
  
  # train DRF
  rforest.model <- h2o.randomForest(y=response, x=predictors, training_frame=train.h2o, 
                                    validation_frame=test.h2o,
                                    ntrees=1000, mtries=3, max_depth=4, seed=100)
  
  
  rf_path <- h2o.saveModel(rforest.model, path="models/", force=TRUE)
  
  print(paste("RForest saved in", rf_path))
  
  
  
  # train GBM
  gbm.model <- h2o.gbm(y=response, x=predictors, training_frame=train.h2o, 
                       validation_frame=test.h2o,
                       ntrees=1000, max_depth=4, learn_rate=0.01, seed=100)
  
  
  gbm_path <- h2o.saveModel(gbm.model, path="models/", force=TRUE)
  
  print(paste("GBM saved in", gbm_path))
  
} else {
  # load saved model
  regression.model <- h2o.loadModel("models/GLM_model_R_1567913957322_10")
  rforest.model <- h2o.loadModel("models/DRF_model_R_1567913957322_2")
  gbm.model <- h2o.loadModel("models/GBM_model_R_1567913957322_3")
}

6 Results

It is important to interpret the models and understand how they have performed without treating them as black boxes. In this section we will examine how each of them performed on training and validation sets.

6.1 Comparing models

To understand how the models performed, bar charts showing the performance metrics of each model on the training and validation sets are presented in Figure 5. From the figure, DRF surprisingly performed poorly compared to the other two models. This could arise from the fact that most predictors are not useful in predicting the taxi fare – thus the ensemble of weak learners actually has high bias.

Nevertheless, GLM and GBM performed well in terms of root mean squared error (RMSE) and \(R^2\). In particular, despite its longer training time, GBM achieved an impressive 2.21 for RMSE and 0.941 for \(R^2\) – which translates to, on average, a $2.21 difference between the actual and predicted fare amounts, and 94.1% of the variation in the actual taxi fare being explained by the model. Table 3 shows a summary of the performance metrics of each model in tabular form.
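
As a reminder of how these two metrics are defined, for \(n\) validation records with actual fares \(y_i\), predicted fares \(\hat{y}_i\) and mean fare \(\bar{y}\),

\[
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}, \qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}
\]

which is why RMSE can be read directly in dollars and \(R^2\) as the proportion of variance in the fare explained by the model.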

extract_metrics <- function(...) {
  lapply(..., function(model) {
    data.frame(
      list(rmse=h2o.rmse(model, train=TRUE, valid=TRUE),
           r_squared=h2o.r2(model, train=TRUE, valid=TRUE), 
           mean_residual_deviance=h2o.mean_residual_deviance(model, train=TRUE, valid=TRUE),
           training_time=c("train"=model@model$run_time / 1000, "valid"=NA))) %>%
      rownames_to_column("Dataset") %>%
      gather(Measure, Value, -Dataset) %>%
      mutate(Model = model@algorithm) %>%
      drop_na()
  }) %>%
    bind_rows()
}


# plot performance comparison of models
extract_metrics(c(regression.model, rforest.model, gbm.model)) %>%
  mutate(Model = factor(Model, levels=c("glm", "drf", "gbm"))) %>%
  ggplot(aes(x=Model, y=Value, fill=Dataset)) +
  geom_bar(stat="identity", position=position_dodge(), color="black") +
  facet_wrap(Measure ~ ., scales="free_y") +
  scale_fill_brewer(palette="Set2") +
  labs(title="Performance of models") +
  theme_bw()

Figure 5: Performance metrics of all 3 models on training and validation sets.

performance_table <- function(...) {
  extract_metrics(...) %>%
    filter(Measure == "rmse" | Measure == "r_squared" | Measure == "training_time") %>%
    unite("Dataset_Measure", c("Dataset", "Measure")) %>%
    spread(Dataset_Measure, Value) %>%
    select(Model, Train_RMSE=train_rmse, Validation_RMSE=valid_rmse,
           Train_R2=train_r_squared, Validation_R2=valid_r_squared,
           Training_time=train_training_time)
}

# table summary of performance metrics
performance_table(c(regression.model, rforest.model, gbm.model)) %>%
  mutate(Model = toupper(Model),
         Model = factor(Model, levels=c("GLM", "DRF", "GBM"))) %>%
  arrange(Model) %>%
  knitr::kable(digits=3, format="pandoc", 
               caption="Table 3: Performance metrics of all 3 models in tabular form.")
Table 3: Performance metrics of all 3 models in tabular form.
Model   Train_RMSE   Validation_RMSE   Train_R2   Validation_R2   Training_time
------  -----------  ----------------  ---------  --------------  --------------
GLM           2.887             2.852      0.899           0.901          73.137
DRF           4.054             4.030      0.801           0.803         508.872
GBM           2.086             2.209      0.947           0.941        1051.443

6.2 Variable importances

In an attempt to understand how the predictors contributed to predicting the taxi fare, we will use a different measure for each model to rank the importance of each predictor.

In the case of GLM, we will compare the standardized coefficients, which represent the mean change in the response given a one standard deviation change in each predictor. In simpler terms, this allows us to compare the relative effects of predictors measured on different scales. Generally, a predictor with a high standardized coefficient magnitude is important in predicting the response. Figure 6 shows the top 20 predictors with the highest standardized coefficient magnitudes. As expected, trip distance contributes a lot to predicting the taxi fare, followed by other location and airport attributes (hypotheses 1, 3 and 6).

## plot bar chart representing the standardized coefficient magnitude
list("std_coef"=h2o.coef_norm(regression.model)) %>%
  data.frame() %>%
  rownames_to_column("variable") %>%
  mutate(abs = abs(std_coef),
         sign = factor(ifelse(std_coef > 0, "Positive", "Negative"))) %>%
  arrange(-abs) %>%
  head(20) %>%
  ggplot(aes(x=reorder(variable, abs), y=abs, fill=sign)) +
  geom_bar(stat="identity", color="black") +
  coord_flip() +
  labs(title="Standardized coefficient magnitudes of \nGLM",
       x="Magnitude",
       y="Variable") +
  theme_minimal()

Figure 6: Barplot showing standardized coefficient magnitudes of top 20 predictors.

On the other hand, for the tree-based models, we can determine variable importance by calculating the relative influence of each variable: whether that variable was selected to split on during the tree building process, and how much the squared error over all trees improved as a result. Specifically, Figure 7 shows the top 10 predictors with the highest variable importance in estimating taxi fare for the GBM. Again, it suggests that trip distance is potentially the most important feature in determining the taxi fare, followed by trip duration. Interestingly, the other predictors seem to contribute little to predicting taxi fare.

# plot graph of the variable importances
h2o.varimp(gbm.model) %>%
  as_tibble() %>%
  select(variable, percentage) %>%
  arrange(-percentage) %>%
  head(10) %>%
  ggplot(aes(x=reorder(variable, percentage), y=percentage, fill=percentage)) +
  geom_bar(stat="identity", color="black") +
  coord_flip() +
  labs(title="Percentage of feature importances of GBM",
       x="Percentage",
       y="Variable") +
  theme_minimal()

Figure 7: Variable importance of each predictor in GBM.


7 Refine model

We will now improve GLM and GBM by performing the following steps:

  • Remove insignificant predictors such as weather condition and passenger count.
  • Fit GLM with elastic net regularization[10] and search for the best lambda (regularization parameter).
  • Fit GBM with more and deeper trees.

Briefly, the penalties from ridge and LASSO regression introduced into the model building process can help avoid overfitting, reduce the variance of the prediction error and handle correlated predictors. Elastic net combines both penalties, controlled by the alpha and lambda hyperparameters. Similarly, increasing the number of trees in GBM can lower the variance of the prediction error, while deeper trees help convey more precise information (reducing bias).
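
Concretely, the elastic net penalty added to the Gaussian objective takes the form

\[
\lambda \left( \alpha \lVert \beta \rVert_1 + \frac{1 - \alpha}{2} \lVert \beta \rVert_2^2 \right)
\]

where \(\alpha = 1\) corresponds to the LASSO and \(\alpha = 0\) to ridge regression; with lambda_search=TRUE, h2o fits models over a decreasing sequence of \(\lambda\) values and retains the best-performing one.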

if (dev) {
  # fit elastic net
  elastic_net.model <- h2o.glm(y=response, x=predictors, training_frame=train.h2o, family="gaussian",
                               validation_frame=test.h2o, standardize=TRUE, 
                               lambda_search=TRUE)
  
  elastic_net_path <- h2o.saveModel(elastic_net.model, path="models/", force=TRUE)
  
  print(paste("Elastic net regression saved in", elastic_net_path))

  
  # fit deeper GBM
  deepgbm.model <- h2o.gbm(y=response, x=predictors, training_frame=train.h2o, 
                           validation_frame=test.h2o,
                           ntrees=2000, max_depth=6, learn_rate=0.05, seed=100)
  
  deepgbm_path <- h2o.saveModel(deepgbm.model, path="models/", force=TRUE)
  
  print(paste("Deep GBM saved in", deepgbm_path))
} else {
  # load saved models
  elastic_net.model <- h2o.loadModel("models/GLM_model_R_1567913957322_9")
  deepgbm.model <- h2o.loadModel("models/GBM_model_R_1567913957322_11")
}

7.1 Final model

A summary of the performance of the improved models is provided in Table 4. In both cases, we observe decent improvements in both validation RMSE and \(R^2\). That is, the improved models now explain a greater proportion of the variance in taxi fare and yield a lower prediction error on the validation dataset.

We will use the refined GBM (Deep GBM) as our final model for predicting taxi fares in New York City. On average, according to the validation set randomly sampled from years 2014 and 2015, we expect the predicted fares to deviate by about ± $2.17 from the actual fares reported by the taximeter – which is quite accurate according to the leaderboard of a competition hosted by Google Cloud on Kaggle[11].

# summary table of refined models
performance_table(c(elastic_net.model, deepgbm.model)) %>%
  mutate(Model = ifelse(Model == "glm", "Elastic_net", "Deep_GBM"),
         Model = factor(Model, levels=c("Elastic_net", "Deep_GBM"))) %>%
  arrange(Model) %>%
  knitr::kable(digits=3, format="pandoc", 
               caption="Table 4: Performance summary of improved models.")
Table 4: Performance summary of improved models.
Model         Train_RMSE   Validation_RMSE   Train_R2   Validation_R2   Training_time
------------  -----------  ----------------  ---------  --------------  --------------
Elastic_net         2.812             2.780      0.904           0.906           1.701
Deep_GBM            1.791             2.172      0.961           0.943        1007.299
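
As an illustration of how the final model might be used by a passenger-facing application, the sketch below predicts the fare of one made-up trip. The column values (including the zone names) are purely illustrative and must match the columns and factor levels seen during training.

# hedged sketch: predict the fare of a single hypothetical trip with the final model
new_trip <- data.frame(
  Season="Winter", Pickup_month="Jan", Pickup_day="Mon", Weekend=FALSE,
  Pickup_hour=9, Time="Daytime",
  Pickup_zone="Astoria", Dropoff_zone="JFK Airport",
  Passenger_count=1, Airport_trip=TRUE, Holiday=FALSE,
  AWND=10.3, PRCP=0, SNOW=0, SNWD=0, TAVG=30,
  Trip_distance=12.5, Trip_duration=35, Great_circle_dist=15000
)

predicted_fare <- h2o.predict(deepgbm.model, as.h2o(new_trip))
as.data.frame(predicted_fare)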

7.2 Further considerations

In terms of future directions, we could evaluate the final model on yellow taxi data and see how it performs. We could also do the same with data from 2016 onwards to see if there is an annual effect from year to year. To improve the fare prediction, we could consider training on more data instead of just 3 million records, or use neural networks, as they can model highly complex and non-linear data with latent variables.
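
As a sketch of what such an evaluation could look like – assuming a frame of, say, yellow taxi or 2016 trips prepared with the exact same pipeline and loaded into h2o under the hypothetical name newdata.h2o:

# hedged sketch: evaluate the final model on an unseen dataset prepared
# with the same preprocessing pipeline (newdata.h2o is a hypothetical frame)
perf <- h2o.performance(deepgbm.model, newdata=newdata.h2o)
h2o.rmse(perf)
h2o.r2(perf)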

Finally, we could also deploy the final model as an interactive web application by turning the predictive model into a real-time API. However, it should be noted that the user does not have information on the actual trip distance and duration prior to using the service, and these happen to be the most important variables in predicting taxi fares according to Figure 7. This can also be shown by fitting a nested model, with the same configuration as the final model, on all attributes except trip distance and duration and evaluating its performance. From Table 5, we see a significant degradation in both RMSE and \(R^2\). It can also be shown, by a likelihood ratio test, that trip distance and duration are significant in the final model over the nested model. A potential workaround is to utilise OSRM[12], a high performance routing engine, to augment the data and compute the shortest trip distance to be fed into the model at prediction time. For now, I will leave these options open for future work.

if (dev) {
  # fit nested GBM without trip distance and duration
  nestedgbm.model <- h2o.gbm(y=response, x=setdiff(predictors, c("Trip_distance", "Trip_duration")),
                             training_frame=train.h2o, validation_frame=test.h2o,
                             ntrees=2000, max_depth=6, learn_rate=0.05, seed=100)
  
  nestedgbm_path <- h2o.saveModel(nestedgbm.model, path="models/", force=TRUE)
  
  print(paste("Nested GBM saved in", nestedgbm_path))
} else {
  # load nested model
  nestedgbm.model <- h2o.loadModel("models/GBM_model_R_1567913957322_12")
}

# summary table of nested GBM
performance_table(list(nestedgbm.model)) %>%
  mutate(Model = "Nested_GBM") %>%
  knitr::kable(digits=3, format="pandoc", 
               caption="Table 5: Performance summary of Deep GBM fitted on everything except trip distance and duration.")
Table 5: Performance summary of Deep GBM fitted on everything except trip distance and duration.
Model        Train_RMSE   Validation_RMSE   Train_R2   Validation_R2   Training_time
-----------  -----------  ----------------  ---------  --------------  --------------
Nested_GBM         3.618             3.774      0.841           0.827         849.044

8 Conclusion

We performed an analysis of taxi fares using the green taxi data spanning the years 2014 to 2015, entirely in R and in a reproducible manner. We reported the relation between taxi fare and other attributes in an attempt to shed light on, and infer the results of, the proposed hypotheses. We then fitted statistical models on a training set and evaluated them on a validation set, each containing 3 million records. The models were then compared and interpreted, and a final model was chosen after refinement. Finally, suggestions for further improving the performance were given.

# cleaning up
h2o.removeAll()
capture.output(h2o.shutdown(prompt=FALSE), file="/dev/null")
rm(list=ls())