The aim of this project is to make a qualitative analysis and gain insights into the open source New York City Taxi and Limousine Service Trip Record Data. In this notebook, we will tackle a research problem and perform analysis on millions of taxi trips data — all in a reproducible manner with R, in that the results are accompanied by the data and code needed to produce them.[1]
Source code of this Rmarkdown can be found in the GitHub repo if one wishes to reproduce the exact same results of this study. Likewise, an initial exploratory analysis of the data can be viewed here.
Despite predicting taxi fare is not as rewarding as predicting airline fares[2], we may want to start practicing from predicting the former given that now we have an extensive amount of open source data available.
In this study, we are interested in understanding how taxis arrive at fares to be paid by the passengers. To do so, we will first access and report the relation between the taxi fares (response) with other attributes (explanatory). Following that, we will build statistical models in order to predict taxi fares for rides in the New York City given the input variables. The final model could allow passenger to make use of this knowledge before calling for a taxi service.
To approach this research problem, we will first list down a set of hypotheses, which in our case are potential factors that could affect the cost of a taxi trip.
For the purpose of this study, we will use the following R packages:
library(tidyverse)
library(data.table)
library(rgdal)
library(lubridate)
library(geosphere)
library(corrplot)
library(h2o)
gdal
In any analytics project, 80% of the time and effort is spent on preparing, cleaning and organizing data for analysis4. This section describes briefly the data used and the steps taken to process and prepare them for the study.
We will use the following open souce dataset in the analysis.
The primary dataset we will be using in this study. The dataset covers trips taken in various different types of licensed taxi and limousine services in the New York City area from 2009 to 2019. Briefly, each taxi trip records pick-up and drop-off datetime and location, trip distance, itemized fare, payment type, and driver-reported passenger count.
New York City taxi zone locations information from TLC that includes a lookup table that contains TLC taxi zone location IDs, location names and corresponding boroughs for each ID. It also has a polygon shapefile containing the boundaries for the TLC taxi zones.
# read in taxi zones information
zones_lookup <- read_csv("data/taxi+_zone_lookup.csv", col_types="iccc")
taxi_zones <- readOGR(dsn = "data/taxi_zones/taxi_zones.shp", verbose=FALSE)
taxi_zones$id <- row.names(taxi_zones)
# merge zone information with taxi service zone in lookup table
taxi_zones@data <- taxi_zones@data %>%
left_join(zones_lookup %>%
select(-LocationID, -Borough) %>%
unique(),
by=c('zone' = 'Zone')) %>%
mutate(zone = as.factor(zone),
service_zone = as.factor(service_zone))
# reproject to commonly used CRS
taxi_zones <- spTransform(taxi_zones, CRS("+init=epsg:4326"))
Dataset primarily created in python to support the taxi fare analysis. It contains list of public holidays for New York at the state level for the year 2014 and 2015.
#############################################
## python code to generate ny_holidays.csv ##
# from datetime import date
# import pandas as pd
# import holidays
#
# ny_holidays = holidays.US(state='NY', years=[2014, 2015])
#
# df = pd.DataFrame(list(ny_holidays.items()), columns=["Date", "Holiday"])
#
# df.to_csv("ny_holidays.csv", index=False)
############ end of python code #############
holidays <- read.csv("data/ny_holidays.csv") %>%
mutate(Date = ymd(Date))
In order to study how weather conditions affect taxi fare (hypothesis 7) in New York City, we will use the New York City Central Park weather data that comes from National Climatic Data Center which contains daily weather observation at the central park for year 2014 and 2015.
One limitation of using this dataset is that we will need to assume that entire New York City will share the exact same weather condition with central park. Moreover, aggreagted daily data may be crude as weather could fluctuate from hour to hour.
# read in weather data
weather <- read.csv("data/ny_city_central_park_weather.csv") %>%
mutate(DATE = ymd(DATE),
TAVG = (TMIN + TMAX) / 2,
AWND = ifelse(is.na(AWND), mean(AWND, na.rm=TRUE), AWND),
AWND = round(AWND, digits=2)) %>%
select(-NAME, -TMAX, -TMIN)
We will look at green taxi trips for this study. Green taxis (as opposed to yellow ones) are taxis that are not allowed to pick up passengers inside of the densely populated areas of Manhattan. They do not seem to attract much attention of the research community in comparison with the relatively larger yellow taxi but may also contains interesting human behaviour-related information.
The entire green taxi dataset is large, covering from year 2013 to 2019. For the purpose of this study, we will look at only the data from year 2014 to 2015 as it contains precise pickup and dropoff location information in latitude and longitude - which may be helpful to in deriving the neighbourhoods and identifying airport trips (as needed for hypothesis 3 and 6). Furthermore, spanning the scope of the analysis across whole year allows us take in account of the effect of seasonality.
With all the data as above, we will look at a number of attributes for the study. Specifically, the following attributes will be used to aid in our analysis:
[Total Fare
, Pickup and Dropoff Location
, Pickup and Dropoff Day and Time
, Trip Distance and Duration
, Weather Condition
and Holiday
]
The selection of these attributes are well justified by hypothesis 1 to 7, in that we need them to in order to try to infer the result of hypotheses. For the purpose of this study, we will regard Total Fare
as the amount required to be paid by passenger excluding the tips.
# function to compute taxi fare
compute_total_fare <- function(taxi_data) {
taxi_data %>%
mutate(Total_fare = Total_amount - Tip_amount)
}
# function to compute trip duration
compute_trip_duration <- function(taxi_data) {
taxi_data %>%
rename(Pickup_datetime = lpep_pickup_datetime,
Dropoff_datetime = Lpep_dropoff_datetime) %>%
mutate(Pickup_datetime = ymd_hms(Pickup_datetime),
Dropoff_datetime = ymd_hms(Dropoff_datetime),
Trip_duration = difftime(Dropoff_datetime, Pickup_datetime, units="mins"),
Trip_duration = as.numeric(Trip_duration) %>% round(digits=2))
}
It is important to clean the data as removing incorrect information can improve the data quality and in doing so, increases overal productivity. The trips data has been checked and none of rows - with chosen attributes contains missing value. Next the data was checked to remove any abnormal or wrong records. This step is important as it ensures the data is correct, consistent and usable by identifying any errors or inaccurate information in the data. Briefly, bad records or outliers have been filtered out according to the follwing criteria:
clean_taxi_data <- function(taxi_data) {
taxi_data %>%
# remove rows with missing value(s)
drop_na() %>%
# data cleansing
filter(Passenger_count > 0, Passenger_count <= 6,
Pickup_latitude > 39, Pickup_latitude < 42,
Pickup_longitude > -76, Pickup_longitude < -72,
Dropoff_latitude > 39, Dropoff_latitude < 42,
Dropoff_longitude > -76, Dropoff_longitude < -72,
Total_fare > 0, Total_fare < 300,
Trip_duration > 1, Trip_duration < (12 * 60))
}
Data preprocessing has been carried out in order to prepare the data for our analysis. Very briefly, the steps involved in transforming the data into a format useful for our study are:
spatial_overlap_zones <- function(taxi_data, zones=taxi_zones) {
# find intersect between trip locations and taxi zones
pickup_zones <- SpatialPointsDataFrame(coords=taxi_data[, c('Pickup_longitude', 'Pickup_latitude')],
data=taxi_data,
proj4string=CRS(proj4string(zones))) %>%
over(zones) %>%
pull(zone)
dropoff_zones <- SpatialPointsDataFrame(coords=taxi_data[, c('Dropoff_longitude', 'Dropoff_latitude')],
data=taxi_data,
proj4string=CRS(proj4string(zones))) %>%
over(zones) %>%
pull(zone)
taxi_data %>%
mutate(Pickup_zone = factor(pickup_zones, levels=levels(zones$zone)),
Dropoff_zone = factor(dropoff_zones, levels=levels(zones$zone))) %>%
# some trips may not fall in known taxi zones
drop_na()
}
RateCodeID
of 2 (JFK) or 3 (EWR)# find airport zones
airport_zones <- taxi_zones@data %>%
filter(service_zone %in% c("Airports", "EWR")) %>%
pull(zone) %>%
as.vector()
identify_airport_trips <- function(taxi_data, zones=airport_zones) {
taxi_data %>%
mutate(Airport_trip = Pickup_zone %in% zones |
Pickup_zone %in% zones |
Dropoff_zone %in% zones |
Dropoff_zone %in% zones |
RateCodeID %in% c(2, 3))
}
# helper function to return US season
season <- function(datetime) {
m <- month(datetime)
terms <- c("Spring", "Summer", "Autumn", "Winter")
s <- factor(character(length(m)), levels=terms)
s[m %in% c(3, 4, 5)] <- terms[1]
s[m %in% c(6, 7, 8)] <- terms[2]
s[m %in% c(9, 10, 11)] <- terms[3]
s[m %in% c(12, 1, 2)] <- terms[4]
s
}
# function to extract day time-related features
extract_time_features <- function(taxi_data) {
taxi_data %>%
mutate(Pickup_month = month(Pickup_datetime, label=TRUE),
Pickup_day = wday(Pickup_datetime, label=TRUE),
Pickup_hour = hour(Pickup_datetime),
Season = season(Pickup_datetime),
Weekend = Pickup_day %in% c("Sat", "Sun"),
Time = ifelse(Pickup_hour >= 6 & Pickup_hour < 18,
"Daytime",
"Nighttime"),
Time = factor(Time, levels=c("Daytime", "Nighttime")))
}
# function to compute orthodromic distance
great_circle_distance <- function(taxi_data) {
taxi_data %>%
mutate(Great_circle_dist = distHaversine(
cbind(Dropoff_longitude, Dropoff_latitude),
cbind(Pickup_longitude, Pickup_latitude)) %>%
round(digits=2))
}
join_external_data <- function(taxi_data, holiday_data=holidays, weather_data=weather) {
taxi_data %>%
# determine if the day is holiday
mutate(Pickup_date = as.Date(Pickup_datetime),
Holiday = Pickup_date %in% holiday_data$Date) %>%
# join with weather data
left_join(weather_data, by=c("Pickup_date" = "DATE")) %>%
select(-Pickup_date)
}
# function to prepress data
preprocess_taxi_data <- function(taxi_data) {
taxi_data %>%
spatial_overlap_zones() %>%
identify_airport_trips() %>%
extract_time_features() %>%
great_circle_distance() %>%
join_external_data()
}
With all the functions needed for cleaning and pre-processing the data, we can now read all green taxi trips from 2014 to 2015 from local directory, pass through a pipeline of data preparation steps and result in a final dataframe to be used in statistical modelling.
read_cols <- c("lpep_pickup_datetime", "Lpep_dropoff_datetime", "RateCodeID",
"Pickup_longitude", "Pickup_latitude",
"Dropoff_longitude", "Dropoff_latitude",
"Passenger_count", "Trip_distance", "Tip_amount", "Total_amount")
out_cols <- c("Season", "Pickup_month", "Pickup_day", "Weekend", "Pickup_hour",
"Time", "Pickup_zone", "Dropoff_zone", "Passenger_count",
"Airport_trip", "Holiday", "AWND", "PRCP", "SNOW", "SNWD", "TAVG",
"Trip_distance", "Trip_duration", "Great_circle_dist", "Total_fare")
read_one_taxi_data <- function(filename, read_columns=read_cols, out_columns=out_cols) {
print(paste("Reading", filename))
fread(filename, select=read_columns, fill=TRUE, showProgress=FALSE) %>%
as_tibble() %>%
compute_total_fare() %>%
compute_trip_duration() %>%
clean_taxi_data() %>%
preprocess_taxi_data() %>%
select(out_columns)
}
read_all_taxi_data <- function(path, pattern) {
lapply(list.files(path, pattern, full.names=TRUE),
function(x) read_one_taxi_data(x, read_cols, out_cols)) %>%
rbindlist()
}
# set to TRUE if want to rerun everything
dev <- FALSE
if (dev) {
trips_data <- read_all_taxi_data("data", "green_tripdata")
saveRDS(trips_data, file="green_trips_data_2014-15.rds")
}
The processed data has 34,095,643 rows and 20 columns and cost around 3.9~4 in RAM in local machine. In the fitting model section, we will turn to h2o package for its fast and scalable machine learning application.
Before jumping into fitting statistical models, we may want to first analyse and report taxi fares (response) via descriptive statistics for a group of explanatory variables. In our case, we will do this on a random subsample of prepared dataset which contains 1 million records. A reasonable assumption here is that the records in this subset may very well be a fair representative sample of the entire dataset.
if (dev) {
# randomly subset
set.seed(100)
n <- 1e6
subsetted <- sample_n(trips_data, n)
saveRDS(subsetted, file="1m_green_trips_data_2014-15.rds")
} else {
# read previously saved dataframe
subsetted <- readRDS("1m_green_trips_data_2014-15.rds")
}
In an attempt to infer result from hypothesis 1, we can see the effect of trip distance and duration on taxi fare by stratifying the observations into groups of different distance and duration. In Figure 1, we can see trips with long duration are generally associated with higher fare, whereas trip distance seems to have a smaller effect on the fare. Note that distance and duration were discretised into bins such that there are equal number of observations in each bin. A summary of mean total fare for each group is also provided below, as shown in Table 1.
# discretise trip distance and duration
n_bins <- 3
subsetted <- subsetted %>%
mutate(Distance = cut_number(Trip_distance, n_bins),
Duration = cut_number(Trip_duration, n_bins))
# plot vs taxi fare
subsetted %>%
ggplot(aes(x=Distance, y=Total_fare, fill=Duration)) +
geom_boxplot() +
ylim(c(0, 50)) +
theme_minimal() +
labs(title = "Trip distance and duration vs Total fare")
# table of summary
subsetted %>%
group_by(Distance, Duration) %>%
summarise(Mean_fare = mean(Total_fare)) %>%
spread(Duration, Mean_fare) %>%
rename("Distance / Duration" = Distance) %>%
knitr::kable(digits=2, format="pandoc",
caption="Table 1: Mean total fare for each distance and duration group")
Distance / Duration | [1.02,7.5] | (7.5,14.2] | (14.2,718] |
---|---|---|---|
[0,1.35] | 6.46 | 8.63 | 13.67 |
(1.35,3] | 8.51 | 10.58 | 14.09 |
(3,105] | 13.01 | 15.38 | 24.34 |
To get a sense of how different day and time affect the taxi fare (hypothesis 4 and 5), we may want to inspect the mean taxi fare across different days of week for both day and night trips. Note that day trips were defined as trips took place from 6:00 to 18:00 while night trips as trips from 18:00 to 6:00. From Figure 2, we can see that day trips and night trips have very different mean fare during the weekdays. This suggest that time of day may be a factor affecting the taxi fare depending on whether it is weekday or weekend. We can infer that night trips generally have lower taxi fare comparing to day trips during weekdays.
# plot day of week and time of day vs fare
subsetted %>%
group_by(Pickup_day, Time) %>%
summarise(Total_fare = mean(Total_fare)) %>%
ggplot(aes(x=Pickup_day, y=Total_fare, color=Time, group=Time)) +
geom_point(aes(pch=Time), size=2) +
geom_line(aes(linetype=Time), size=1) +
theme_minimal() +
labs(title = "Pickup day and time vs (Mean) total fare")
To understand the effect of airport rides on taxi fares (hypothesis 6), a plot of distribution of taxi fares for airport and non-airport trips is presented in Figure 3. As expected, taxi fares show very different distribution under both scenarios which could not be otherwise easily detected without grouping. For completeness, a summary of taxi fares is also provided in Table 2 which again shows that the distribution of fares differ between airport and non-airport trips.
# plot airport/non-airport vs fare
subsetted %>%
ggplot(aes(x=Total_fare, fill=Airport_trip)) +
geom_density(alpha=0.4) +
xlim(c(0, 100)) +
theme_minimal() +
scale_fill_brewer(palette="Dark2") +
labs(title = "Distribution of total fare for airport/non-airport trips")
# summary table
subsetted %>%
group_by(Airport_trip) %>%
do(data_frame_(summary(.$Total_fare))) %>%
knitr::kable(digits=2, format="pandoc",
caption="Table 2: A summary table of distribution of fares for airport vs non-airport rides")
Airport_trip | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
---|---|---|---|---|---|---|
FALSE | 0.01 | 7.5 | 10.5 | 12.98 | 16.0 | 292.3 |
TRUE | 0.31 | 18.8 | 31.5 | 33.35 | 43.3 | 224.8 |
Understanding the correlation between the explanatory variable may also be useful prior to building statistical models explaining relation of taxi fare with attributes. In this regard, a correlation matrix plot is presented in Figure 4. From the plot, we can see that total fare is weakly or uncorrelated with the different weather condition measures (hypothesis 7). This potentially arises due to using a crude and coarse-grained weather dataset. On the other hand, trip distance, duration and great circle distance computed seem to have strong positive correlation with the response. While circle distance is strongly correlated with trip distance, we do expect that either one of them and trip duration to be strong predictors for taxi fare.
# compute correlatoin
corr <- subsetted %>%
select(-Pickup_hour) %>%
select_if(is.numeric) %>%
cor()
# save space
rm(subsetted)
corrplot(corr, type="lower", method="color",
addCoef.col="black", order="hclust",
tl.srt = 45, tl.col="black",
number.cex=1, diag=FALSE,
title="Correlation of numeric variables",
mar=c(0, 0, 1, 0))
We will now build statistical models, each with different underlying assumptions about the data to better explain the relation between taxi fare and other attributes. Due to time and physical limitations of RAM and disk storage, we will only build models and train on 3 millions of the data – randomly sampled from the entire processed dataset. To do this, as mentioned previously we will use R’s interface to h2o
which handles large scale machine learning algorithm by leveraging computing power of distributed systems and ultilizing parallelized algorithms similar to in-memory mapreduce operations.
Traning and validation sets each containing 3 millions of records are first created.
if (dev) {
set.seed(100)
n <- 3e6
train_idx <- sample(nrow(trips_data), n)
# split to training and validation set
train <- trips_data[train_idx,] %>%
mutate(Pickup_month = factor(Pickup_month, ordered=FALSE),
Pickup_day = factor(Pickup_hour, ordered=FALSE))
test <- trips_data[-train_idx,] %>%
sample_n(n) %>%
mutate(Pickup_month = factor(Pickup_month, ordered=FALSE),
Pickup_day = factor(Pickup_hour, ordered=FALSE))
}
In the context of fitting statistical models, we will build 3 models with Total_fare
as the response and the rest attributes as predictors:
It should be noted that categorical encoding is carried out while fitting the GLM. The values of all numeric predictors are also standarized to have zero mean and unit variance for ease of comparison downstream.
capture.output(localH2O <- h2o.init(nthreads = -1), file="/dev/null")
if (dev) {
# save space
rm(trips_data)
# initialise H2O
train.h2o <- as.h2o(train)
rm(train)
test.h2o <- as.h2o(test)
rm(test)
response <- "Total_fare"
predictors <- setdiff(names(train.h2o), response)
# train GLM
regression.model <- h2o.glm(y=response, x=predictors, training_frame=train.h2o, family="gaussian",
validation_frame=test.h2o, standardize=TRUE,
lambda_search=FALSE, alpha=0, lambda=0)
glm_path <- h2o.saveModel(regression.model, path="models/", force=TRUE)
print(paste("GLM saved in", glm_path))
# train DRF
rforest.model <- h2o.randomForest(y=response, x=predictors, training_frame=train.h2o,
validation_frame=test.h2o,
ntrees=1000, mtries=3, max_depth=4, seed=100)
rf_path <- h2o.saveModel(rforest.model, path="models/", force=TRUE)
print(paste("RForest saved in", rf_path))
# train GBM
gbm.model <- h2o.gbm(y=response, x=predictors, training_frame=train.h2o,
validation_frame=test.h2o,
ntrees=1000, max_depth=4, learn_rate=0.01, seed=100)
gbm_path <- h2o.saveModel(gbm.model, path="models/", force=TRUE)
print(paste("GBM saved in", gbm_path))
} else {
# load saved model
regression.model <- h2o.loadModel("models/GLM_model_R_1567913957322_10")
rforest.model <- h2o.loadModel("models/DRF_model_R_1567913957322_2")
gbm.model <- h2o.loadModel("models/GBM_model_R_1567913957322_3")
}
It is important to interpret the models and understand how they have performed without treating them as black boxes. In this section we will examine how each of them performed on training and validation sets.
To understand how the models performed, barcharts showing performance metrics of each model on training and validation sets are presented in Figure 5. From the figure, surprisingly DRF performed poorly comparing to other two models. This could arise from the fact that most predictors are not useful in predicting the taxi fare – thus ensemble of weak learners actually have high bias.
Nevertheless, we managed to see GLM and GBM performed well in terms of root mean squared error (RMSE) and \(R^2\). In particular, despite of longer training time, GBM achieved an astonishing 2.21 for RMSE and 0.941 for \(R^2\) – which in turn translates to on average $2.21 difference between actual fare and predicted fare amount and 94.1% of the actual taxi fare’s variation explained by the model. Table 3 shows summary of performance metrics of each models in tabular form.
extract_metrics <- function(...) {
lapply(..., function(model) {
data.frame(
list(rmse=h2o.rmse(model, train=TRUE, valid=TRUE),
r_squared=h2o.r2(model, train=TRUE, valid=TRUE),
mean_residual_deviance=h2o.mean_residual_deviance(model, train=TRUE, valid=TRUE),
training_time=c("train"=model@model$run_time / 1000, "valid"=NA))) %>%
rownames_to_column("Dataset") %>%
gather(Measure, Value, -Dataset) %>%
mutate(Model = model@algorithm) %>%
drop_na()
}) %>%
bind_rows()
}
# plot performance comparison of models
extract_metrics(c(regression.model, rforest.model, gbm.model)) %>%
mutate(Model = factor(Model, levels=c("glm", "drf", "gbm"))) %>%
ggplot(aes(x=Model, y=Value, fill=Dataset)) +
geom_bar(stat="identity", position=position_dodge(), color="black") +
facet_wrap(Measure ~ ., scales="free_y") +
scale_fill_brewer(palette="Set2") +
labs(title="Performance of models") +
theme_bw()
performance_table <- function(...) {
extract_metrics(...) %>%
filter(Measure == "rmse" | Measure == "r_squared" | Measure == "training_time") %>%
unite("Dataset_Measure", c("Dataset", "Measure")) %>%
spread(Dataset_Measure, Value) %>%
select(Model, Train_RMSE=train_rmse, Validation_RMSE=valid_rmse,
Train_R2=train_r_squared, Validation_R2=valid_r_squared,
Training_time=train_training_time)
}
# table summary of performance metrics
performance_table(c(regression.model, rforest.model, gbm.model)) %>%
mutate(Model = toupper(Model),
Model = factor(Model, levels=c("GLM", "DRF", "GBM"))) %>%
arrange(Model) %>%
knitr::kable(digits=3, format="pandoc",
caption="Table 3: Performance metrics of all 3 models in tabular form.")
Model | Train_RMSE | Validation_RMSE | Train_R2 | Validation_R2 | Training_time |
---|---|---|---|---|---|
GLM | 2.887 | 2.852 | 0.899 | 0.901 | 73.137 |
DRF | 4.054 | 4.030 | 0.801 | 0.803 | 508.872 |
GBM | 2.086 | 2.209 | 0.947 | 0.941 | 1051.443 |
In an attempt to understand how predictors contributed in the predicting the taxi fare, we will use different measures for each model in order to rank the importance of each predictors.
In the case of GLM, we will compare the standardized coefficients which represent the mean change in the response given a one standard deviation change in each predictor. In simpler terms, it allows us to compare the relative effects of predictors measured on different scales. Generally, a predictor with high magnitude of standardized coefficient indicates that it is important in predicting the response. From Figure 6, we see top 20 predictors with the highest standardized coefficients. As expected, trip distance contributes a lot in predicting the taxi fare, followed by other location and airport attributes (hypothesis 1, 3 and 6).
## plot bar chart representing the standardized coefficient magnitude
list("std_coef"=h2o.coef_norm(regression.model)) %>%
data.frame() %>%
rownames_to_column("variable") %>%
mutate(abs = abs(std_coef),
sign = factor(ifelse(std_coef > 0, "Positive", "Negative"))) %>%
arrange(-abs) %>%
head(20) %>%
ggplot(aes(x=reorder(variable, abs), y=abs, fill=sign)) +
geom_bar(stat="identity", color="black") +
coord_flip() +
labs(title="Standardized coefficient magnitudes of \nGLM",
x="Magnitude",
y="Variable") +
theme_minimal()
On the other hand, for the tree-based models, we can determine the variable importance by calculating the relative influence of each variable, that is, whether that variable was selected to split on during the tree building process, and how much the RMSE over all trees improved as a result. Specifically, Figure 7 shows the top 10 predictors with highest variable importance in estimating taxi fare for the GBM. Again it suggests that trip distance could potentially be the most important feature in determining the taxi fare, followed by trip duration. Interestingly, other predictors seemed to contribute little in predicting taxi fare.
# plot graph of the variable importances
h2o.varimp(gbm.model) %>%
as_tibble() %>%
select(variable, percentage) %>%
arrange(-percentage) %>%
head(10) %>%
ggplot(aes(x=reorder(variable, percentage), y=percentage, fill=percentage)) +
geom_bar(stat="identity", color="black") +
coord_flip() +
labs(title="Percentage of feature importances of GBM",
x="Percentage",
y="Variable") +
theme_minimal()
We will now improve GLM and GBM by performing the following steps:
Briefly, penalties from Ridge and LASSO regression introduced to the model building process can avoid overfitting, reduce variance of the prediction error (reduce variance), handle correlated predictors. Elastic net combines both penalties controlled by alpha
and lambda
hyperparameters. Similary, increasing the number of trees in GBM could potentially lower the variance of prediction erorr (reduce variance) while having deeper tree helps to convey more precise information (reduce bias).
if (dev) {
# fit elastic net
elastic_net.model <- h2o.glm(y=response, x=predictors, training_frame=train.h2o, family="gaussian",
validation_frame=test.h2o, standardize=TRUE,
lambda_search=TRUE)
elastic_net_path <- h2o.saveModel(elastic_net.model, path="models/", force=TRUE)
print(paste("Elastic net regression saved in", elastic_net_path))
# fit deeper GBM
deepgbm.model <- h2o.gbm(y=response, x=predictors, training_frame=train.h2o,
validation_frame=test.h2o,
ntrees=2000, max_depth=6, learn_rate=0.05, seed=100)
deepgbm_path <- h2o.saveModel(deepgbm.model, path="models/", force=TRUE)
print(paste("Deep GBM saved in", deepgbm_path))
} else {
# load saved models
elastic_net.model <- h2o.loadModel("models/GLM_model_R_1567913957322_9")
deepgbm.model <- h2o.loadModel("models/GBM_model_R_1567913957322_11")
}
A summary of performance of improved models is provided as shown in Table 4. In both cases, we observe decent improvements on both validation RMSE and \(R^2\). That is, the improved models now explain more proportion of variance in taxi fare and yield a lower prediction error on validation dataset.
We will use the refined GBM (Deep GBM) as our final model for predicting taxi fare in the New York City. On average, according to the validation set randomly sampled from year 2014 and 2015, we expect to see the predicted fares deviate mostly within ± $2.17 from the actual fares reported by taximeter – which is quite accurate according to the leaderboard of a competition hosted by Google Cloud at kaggle11.
# summary table of refined models
performance_table(c(elastic_net.model, deepgbm.model)) %>%
mutate(Model = ifelse(Model == "glm", "Elastic_net", "Deep_GBM"),
Model = factor(Model, levels=c("Elastic_net", "Deep_GBM"))) %>%
arrange(Model) %>%
knitr::kable(digits=3, format="pandoc",
caption="Table 4: Performance summary of improved models.")
Model | Train_RMSE | Validation_RMSE | Train_R2 | Validation_R2 | Training_time |
---|---|---|---|---|---|
Elastic_net | 2.812 | 2.780 | 0.904 | 0.906 | 1.701 |
Deep_GBM | 1.791 | 2.172 | 0.961 | 0.943 | 1007.299 |
In terms of future directions, we could evaluate the final model on yellow taxi data and see it performs. We could also do the same to data from year 2016 and onwards to see if there’s an annual effect from year to year. To improve the fare prediction, we could consider training on more data instead of just 3 millions records, or use neural networks as it can model highly complex and non-linear data with latent variables.
Finally, we could also deploy the final model as an interactive web application by turning the predictive model into a real-time API. However, it should be addressed that there is an issue of user not having information on the actual trip distance and duration prior to using the service, which happened to be most important variable in predicting taxi fares according to Figure 7. This also can be shown by fitting a nested model with same configurations as the final model on all attributes but trip distance and duration and evaluating it’s performance. From Table 5, we see significant degradation in both RMSE and \(R^2\). It can also be shown that, by likelihood ratio test, that the trip distance and duration are significant to the final model over nested model. A potential workaround for this is to ultilise OSRM12, a high performance routing engine to augment the data and compute shortest trip distance to be fed in the model during prediction. For now, I will leave the options open for future work.
if (dev) {
# fit nested GBM without trip distance and duration
nestedgbm.model <- h2o.gbm(y=response, x=setdiff(predictors, c("Trip_distance", "Trip_duration")),
training_frame=train.h2o, validation_frame=test.h2o,
ntrees=2000, max_depth=6, learn_rate=0.05, seed=100)
nestedgbm_path <- h2o.saveModel(a.model, path="models/", force=TRUE)
print(paste("A saved in", nestedgbm_path))
} else {
# load nested model
nestedgbm.model <- h2o.loadModel("models/GBM_model_R_1567913957322_12")
}
# summary table of nested GBM
performance_table(list(nestedgbm.model)) %>%
mutate(Model = "Nested_GBM") %>%
knitr::kable(digits=3, format="pandoc",
caption="Table 5: Performance summary of Deep GBM fitted on everything except trip distance and duration.")
Model | Train_RMSE | Validation_RMSE | Train_R2 | Validation_R2 | Training_time |
---|---|---|---|---|---|
Nested_GBM | 3.618 | 3.774 | 0.841 | 0.827 | 849.044 |
We performed an analysis on taxi fares using the green taxi data that spans from year 2014 to 2015 entirely in R - in a reproducible manner. We reported the taxi fare and its relation with other attributes as an attempt to shed light and infer the hypotheses proposed. We then fitted statistical models on training set and evaluated them on validation set, each of which contains 3 million records each. Models were then compared, interpreted and a final model was chosen after refining the models. Finally, suggestions for further improving the performance were given.
# cleaning up
h2o.removeAll()
capture.output(h2o.shutdown(prompt=FALSE), file="/dev/null")
rm(list=ls())