Posts tagged 'data science'

Processing NYC Taxi Data Part 3: Labeling and Aggregating

In this multi-part series, I will process five years of taxi trips from the NYC TLC dataset. This part describes how to cluster and aggregate the dataset using PySpark.

Processing NYC Taxi Data Part 2: Geofiltering

In this multi-part series, I will process five years of taxi trips from the NYC TLC dataset. This part describes how to use Spark to select only trips that happen within Manhattan.

Processing NYC Taxi Data Part 1: Downloading

In this multi-part series, I will process five years of taxi trips from the NYC TLC dataset. This part describes how to move the data to S3.