New York City Taxi Trips Analysis

January 30, 2022 3 minute read

If you are a big fan of R, then this is for you.

Overview

Using R Markdown and knitr, I created 2 open-source notebooks - one for visualization and one for modelling.

Both notebooks were rendered to HTML documents and they contain not only the code to do ETL, plotting and modelling, but the actual report and comments on findings as well.

The reports and findings are completely reproducible - in that the results are accompanied by the data and code needed to produce them.

You can find the code for notebooks at this repo

Key libraries used

ggplot2 - for data visualization
tidyverse and data.table- for manipulating dataframes
H2O - for modelling purposes

Report 1: Visualization

Read the visualization report here.

The aim of this study is to gain an initial insight into the open source taxi and weather datasets for the year 2015 in the New York city. In the notebook, I will be dealing with millions of taxi trips data, performing initial exploratory data analysis on taxi usage and visualizing the relationships with other attributes.

In summary, I performed an initial analysis on the green taxi data that contains millions of trip records entire in R - in a reproducible manner. With ggplot and tmap, I depicted the relationship between attributes and geospatial visualization onto map of New York City taxi zones, and then attempted to answer a few questions for the study, using the figures and plots produced.

Report 2: Modelling

Read the modelling report here

The aim of this project is to make a qualitative analysis and gain insights into the open source New York City Taxi and Limousine Service Trip Record Data. In the notebook, I will tackle a research problem and perform analysis on millions of taxi trips data.

In short, I performed an analysis on taxi fares using the green taxi data that spans from year 2014 to 2015 entirely in R - in a reproducible manner. I reported the taxi fare and its relation with other attributes as an attempt to shed light and infer the hypotheses proposed. Then, statistical models were fitted on training set and evaluated using validation set, each of which contains 3 million records each. Models were then compared, interpreted and a final model was chosen after refining the models. Finally, suggestions for further improving the performance were given.

Share on

Twitter Facebook LinkedIn

Xuanken Tay

New York City Taxi Trips Analysis

Overview

Key libraries used

Report 1: Visualization

Report 2: Modelling

Share on

Leave a comment

Read more

Hashicorp Certified: Vault Associate 002

Helm: Migrate Releases from Helm 2 to Helm 3

Kubernetes: Deployments

Anagram Substrings