Starting from:

$30

Project 3: Predicting Taxi Ride Duration


Project 3: Predicting Taxi Ride Duration

Collaboration Policy
Data science is a collaborative activity. While you may talk with others about the project, we ask that you write your
solutions individually. If you do discuss the assignments with others please include their names at the top of your
notebook.
Collaborators: list collaborators here
Score Breakdown
Question Points
1b 2
1c 3
1d 2
2a 1
2b 2
3a 2
3b 1
3c 2
3d 2
4a 2
4b 2
4c 2
4d 2
4e 2
4f 2
4g 4
5b 7
5c 3
Total 43
This Assignment
In this project, you will use what you've learned in class to create a regression model that predicts the travel time of a
taxi ride in New York. Some questions in this project are more substantial than those of past projects.
After this project, you should feel comfortable with the following:
# Initialize autograder
# If you see an error message, you'll need to do
# pip3 install otter-grader
import otter
grader = otter.Notebook()
3/4/2020 proj3 - Jupyter Notebook
localhost:8888/notebooks/proj3_otter/proj3.ipynb 2/30
The data science lifecycle: data selection and cleaning, EDA, feature engineering, and model selection.
Using sklearn to process data and fit linear regression models.
Embedding linear regression as a component in a more complex model.
First, let's import:
In [2]:
The Data
Attributes of all yellow taxi (https://en.wikipedia.org/wiki/Taxicabs_of_New_York_City) trips in January 2016 are
published by the NYC Taxi and Limosine Commission (https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page).
The full data set takes a long time to download directly, so we've placed a simple random sample of the data into
taxi.db , a SQLite database. You can view the code used to generate this sample in the taxi_sample.ipynb file
included with this project (not required).
Columns of the taxi table in taxi.db include:
pickup_datetime : date and time when the meter was engaged
dropoff_datetime : date and time when the meter was disengaged
pickup_lon : the longitude where the meter was engaged
pickup_lat : the latitude where the meter was engaged
dropoff_lon : the longitude where the meter was disengaged
dropoff_lat : the latitude where the meter was disengaged
passengers : the number of passengers in the vehicle (driver entered value)
distance : trip distance
duration : duration of the trip in seconds
Your goal will be to predict duration from the pick-up time, pick-up and drop-off locations, and distance.
Part 1: Data Selection and Cleaning
In this part, you will limit the data to trips that began and ended on Manhattan Island (map
(https://www.google.com/maps/place/Manhattan,+New+York,+NY/@40.7590402,-74.0394431,12z/data=!3m1!4b1!4m5!3m
73.9712488)).
The below cell uses a SQL query to load the taxi table from taxi.db into a Pandas DataFrame called all_taxi .
It only includes trips that have both pick-up and drop-off locations within the boundaries of New York City:
Longitude is between -74.03 and -73.75 (inclusive of both boundaries)
Latitude is between 40.6 and 40.88 (inclusive of both boundaries)
You don't have to change anything, just run this cell.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
3/4/2020 proj3 - Jupyter Notebook
localhost:8888/notebooks/proj3_otter/proj3.ipynb 3/30
In [3]:
A scatter plot of pickup locations shows that most of them are on the island of Manhattan. The empty white rectangle is
Central Park; cars are not allowed there.
Out[3]:
pickup_datetime dropoff_datetime pickup_lon pickup_lat dropoff_lon dropoff_lat passengers distance duration
0
2016-01-30
22:47:32
2016-01-30
23:03:53 -73.988251 40.743542 -74.015251 40.709808 1 3.99 981
1
2016-01-04
04:30:48
2016-01-04
04:36:08 -73.995888 40.760010 -73.975388 40.782200 1 2.03 320
2
2016-01-07
21:52:24
2016-01-07
21:57:23 -73.990440 40.730469 -73.985542 40.738510 1 0.70 299
3
2016-01-01
04:13:41
2016-01-01
04:19:24 -73.944725 40.714539 -73.955421 40.719173 1 0.80 343
4
2016-01-08
18:46:10
2016-01-08
18:54:00 -74.004494 40.706989 -74.010155 40.716751 5 0.97 470
import sqlite3
conn = sqlite3.connect('taxi.db')
lon_bounds = [-74.03, -73.75]
lat_bounds = [40.6, 40.88]
c = conn.cursor()
my_string = 'SELECT * FROM taxi WHERE'
for word in ['pickup_lat', 'AND dropoff_lat']:
 my_string += ' {} BETWEEN {} AND {}'.format(word, lat_bounds[0], lat_bounds[1])

for word in ['AND pickup_lon', 'AND dropoff_lon']:
 my_string += ' {} BETWEEN {} AND {}'.format(word, lon_bounds[0], lon_bounds[1])
c.execute(my_string)
results = c.fetchall()
row_res = conn.execute('select * from taxi')
names = list(map(lambda x: x[0], row_res.description))
all_taxi = pd.DataFrame(results)
all_taxi.columns = names
all_taxi.head()
3/4/2020 proj3 - Jupyter Notebook
localhost:8888/notebooks/proj3_otter/proj3.ipynb 4/30
In [4]:
The two small blobs outside of Manhattan with very high concentrations of taxi pick-ups are airports.
Question 1b
Create a DataFrame called clean_taxi that only includes trips with a positive passenger count, a positive distance, a
duration of at least 1 minute and at most 1 hour, and an average speed of at most 100 miles per hour. Inequalities
should not be strict (e.g., <= instead of < ) unless comparing to 0.
def pickup_scatter(t):
 plt.scatter(t['pickup_lon'], t['pickup_lat'], s=2, alpha=0.2)
 plt.xlabel('Longitude')
 plt.ylabel('Latitude')
 plt.title('Pickup locations')

plt.figure(figsize=(8, 8))
pickup_scatter(all_taxi)
3/4/2020 proj3 - Jupyter Notebook
localhost:8888/notebooks/proj3_otter/proj3.ipynb 5/30
The provided tests check that you have constructed clean_taxi correctly.
In [5]:
In [6]:
Question 1c (challenging)
Create a DataFrame called manhattan_taxi that only includes trips from clean_taxi that start and end within a
polygon that defines the boundaries of Manhattan Island
(https://www.google.com/maps/place/Manhattan,+New+York,+NY/@40.7590402,-74.0394431,12z/data=!3m1!4b1!4m5!3m
73.9712488).
The vertices of this polygon are defined in manhattan.csv as (latitude, longitude) pairs, which are published here
(https://gist.github.com/baygross/5430626).
An efficient way to test if a point is contained within a polygon is described on this page
(http://alienryderflex.com/polygon/). There are even implementations on that page (though not in Python). Even with an
efficient approach, the process of checking each point can take several minutes. It's best to test your work on a small
sample of clean_taxi before processing the whole thing. (To check if your code is working, draw a scatter diagram of
the (lon, lat) pairs of the result; the scatter diagram should have the shape of Manhattan.)
The provided tests check that you have constructed manhattan_taxi correctly. It's not required that you implement
the in_manhattan helper function, but that's recommended. If you cannot solve this problem, you can still continue
with the project; see the instructions below the answer cell.
Out[5]:
pickup_datetime dropoff_datetime pickup_lon pickup_lat dropoff_lon dropoff_lat passengers distance duratio
0
2016-01-30
22:47:32
2016-01-30
23:03:53 -73.988251 40.743542 -74.015251 40.709808 1 3.99 98
1
2016-01-04
04:30:48
2016-01-04
04:36:08 -73.995888 40.760010 -73.975388 40.782200 1 2.03 32
2
2016-01-07
21:52:24
2016-01-07
21:57:23 -73.990440 40.730469 -73.985542 40.738510 1 0.70 29
3
2016-01-01
04:13:41
2016-01-01
04:19:24 -73.944725 40.714539 -73.955421 40.719173 1 0.80 34
4
2016-01-08
18:46:10
2016-01-08
18:54:00 -74.004494 40.706989 -74.010155 40.716751 5 0.97 47
... ... ... ... ... ... ... ... ...
97687 2016-01-31
02:59:16
2016-01-31
03:09:23 -73.997391 40.721027 -73.978447 40.745277 1 2.17 60
97688 2016-01-14
22:48:10
2016-01-14
22:51:27 -73.988037 40.718761 -73.983337 40.726162 1 0.60 19
97689 2016-01-08
04:46:37
2016-01-08
04:50:12 -73.984390 40.754978 -73.985909 40.751820 4 0.79 21
97690 2016-01-31
12:55:54
2016-01-31
13:01:07 -74.008675 40.725979 -74.009598 40.716003 1 0.85 31
97691 2016-01-05
08:28:16
2016-01-05
08:54:04 -73.968086 40.799915 -73.972290 40.765533 5 3.30 154
96445 rows × 9 columns
Out[6]: All tests passed!
clean_taxi = all_taxi[(all_taxi["passengers"] > 0) & (all_taxi["distance"] > 0.)
 & (all_taxi["duration"] >= 60) & (all_taxi["duration"] <= 3600)
 & ((all_taxi["distance"] / (all_taxi["duration"] / 3600)) <= 100)]
clean_taxi
grader.check("q1b")
3/4/2020 proj3 - Jupyter Notebook
localhost:8888/notebooks/proj3_otter/proj3.ipynb 6/30
3/4/2020 proj3 - Jupyter Notebook
localhost:8888/notebooks/proj3_otter/proj3.ipynb 7/30
In [7]: polygon = pd.read_csv('manhattan.csv')
# Recommended: First develop and test a function that takes a position
# and returns whether it's in Manhattan.
def in_manhattan(x, y):
 """Whether a longitude-latitude (x, y) pair is in the Manhattan polygon."""
 manhattan_Y = [
 40.700292,
 40.707580,
 40.710443,
 40.721762,
 40.729568,
 40.733503,
 40.746834,
 40.775114,
 40.778884,
 40.781906,
 40.785351,
 40.789640,
 40.793149,
 40.795228,
 40.801141,
 40.804877,
 40.810496,
 40.834074,
 40.855371,
 40.870690,
 40.878348,
 40.851151,
 40.844074,
 40.828229,
 40.754019,
 40.719941,
 40.718575,
 40.718802,
 40.704977,
 40.700553
 ]

 manhattan_X = [
 -74.010773,
 -73.999271,
 -73.978758,
 -73.971977,
 -73.971291,
 -73.973994,
 -73.968072,
 -73.941936,
 -73.942580,
 -73.943589,
 -73.939362,
 -73.936272,
 -73.932238,
 -73.929491,
 -73.928976,
 -73.930907,
 -73.934298,
 -73.934383,
 -73.922281,
 -73.908892,
 -73.928289,
 -73.947258,
 -73.947086,
 -73.955498,
 -74.008713,
 -74.013863,
 -74.013605,
3/4/2020 proj3 - Jupyter Notebook
localhost:8888/notebooks/proj3_otter/proj3.ipynb 8/30
In [8]:
If you are unable to solve the problem above, have trouble with the tests, or want to work on the rest of the project
before solving it, run the following cell to load the cleaned Manhattan data directly. (Note that you may not solve the
previous problem just by loading this data file; you have to actually write the code.)
In [9]:
Out[7]:
pickup_datetime dropoff_datetime pickup_lon pickup_lat dropoff_lon dropoff_lat passengers distance duratio
0
2016-01-30
22:47:32
2016-01-30
23:03:53 -73.988251 40.743542 -74.015251 40.709808 1 3.99 98
1
2016-01-04
04:30:48
2016-01-04
04:36:08 -73.995888 40.760010 -73.975388 40.782200 1 2.03 32
2
2016-01-07
21:52:24
2016-01-07
21:57:23 -73.990440 40.730469 -73.985542 40.738510 1 0.70 29
4
2016-01-08
18:46:10
2016-01-08
18:54:00 -74.004494 40.706989 -74.010155 40.716751 5 0.97 47
5
2016-01-02
12:39:57
2016-01-02
12:53:29 -73.958214 40.760525 -73.983360 40.760406 1 1.70 81
... ... ... ... ... ... ... ... ...
97687 2016-01-31
02:59:16
2016-01-31
03:09:23 -73.997391 40.721027 -73.978447 40.745277 1 2.17 60
97688 2016-01-14
22:48:10
2016-01-14
22:51:27 -73.988037 40.718761 -73.983337 40.726162 1 0.60 19
97689 2016-01-08
04:46:37
2016-01-08
04:50:12 -73.984390 40.754978 -73.985909 40.751820 4 0.79 21
97690 2016-01-31
12:55:54
2016-01-31
13:01:07 -74.008675 40.725979 -74.009598 40.716003 1 0.85 31
97691 2016-01-05
08:28:16
2016-01-05
08:54:04 -73.968086 40.799915 -73.972290 40.765533 5 3.30 154
82800 rows × 9 columns
Out[8]: All tests passed!
 -74.017038,
 -74.020042,
 -74.016438
 ]

 j = len(manhattan_X) - 1
 ret = False
 for i in range(len(manhattan_X)):
 if((manhattan_Y[i] < y and manhattan_Y[j] >= y) or (manhattan_Y[j] < y and manhattan_Y[i]
 if((manhattan_X[i] + (y - manhattan_Y[i]) / (manhattan_Y[j] - manhattan_Y[i]) * (manha
 < x):
 ret = not ret
 j = i

 return ret

# Recommended: Then, apply this function to every trip to filter clean_taxi.
mask = clean_taxi.apply(lambda x: in_manhattan(x["pickup_lon"], x["pickup_lat"]) &
 in_manhattan(x["dropoff_lon"], x["dropoff_lat"]), axis=1)
manhattan_taxi = clean_taxi[mask]
manhattan_taxi
grader.check("q1c")
manhattan_taxi = pd.read_csv('manhattan_taxi.csv')
3/4/2020 proj3 - Jupyter Notebook
localhost:8888/notebooks/proj3_otter/proj3.ipynb 9/30
A scatter diagram of only Manhattan taxi rides has the familiar shape of Manhattan Island.
In [10]:
Question 1d
Print a summary of the data selection and cleaning you performed. Your Python code should not include any
number literals, but instead should refer to the shape of all_taxi , clean_taxi , and manhattan_taxi .
plt.figure(figsize=(8, 16))
pickup_scatter(manhattan_taxi)
3/4/2020 proj3 - Jupyter Notebook
localhost:8888/notebooks/proj3_otter/proj3.ipynb 10/30
E.g., you should print something like: "Of the original 1000 trips, 21 anomalous trips (2.1%) were removed through data
cleaning, and then the 600 trips within Manhattan were selected for further analysis."
(Note that the numbers in the example above are not accurate.)
One way to do this is with Python's f-strings. For instance,
name = "Joshua"
print(f"Hi {name}, how are you?")
prints out Hi Joshua, how are you?.
Please ensure that your Python code does not contain any very long lines, or we can't grade it.
Your response will be scored based on whether you generate an accurate description and do not include any number
literals in your Python expression, but instead refer to the dataframes you have created.
In [11]:
Part 2: Exploratory Data Analysis
In this part, you'll choose which days to include as training data in your regression model.
Your goal is to develop a general model that could potentially be used for future taxi rides. There is no guarantee that
future distributions will resemble observed distributions, but some effort to limit training data to typical examples can
help ensure that the training data are representative of future observations.
January 2016 had some atypical days. New Year's Day (January 1) fell on a Friday. MLK Day was on Monday, January
18. A historic blizzard (https://en.wikipedia.org/wiki/January_2016_United_States_blizzard) passed through New York
that month. Using this dataset to train a general regression model for taxi trip times must account for these unusual
phenomena, and one way to account for them is to remove atypical days from the training data.
Question 2a
Of the original 97692 trips, 1247 (1.276%) anomalous trips were removed through data cleaning, a
nd 96445 trips were selected for further analysis. We removed data where the number of passenger
s was less than 1, the distance traveled was a negative value, the trip duration was shorter tha
n 1 minute or longer than 1 hour, and the average speed of the whole trip was greater than 100 m
iles per hour.
Of the original 97692 trips, 13645 (13.967%) more anomalous trips were removed through data clea
ning, and 82800 trips were selected for further analysis. We removed data where either the picku
p location or dropoff location was not in Manhattan.
original_len = len(all_taxi)
cleaned_len = len(clean_taxi)
manhat_len = len(manhattan_taxi)
print("Of the original {} trips, {} ({}%) anomalous trips were removed through data cleaning, and
 "further analysis. We removed data where the number of passengers was less than 1, the dist
 "value, the trip duration was shorter than 1 minute or longer than 1 hour, and the average
 "greater than 100 miles per hour.\n".format(original_len,
 original_len - cleaned_len,
 round((original_len - cleaned_len) * 100 / original_len,
 cleaned_len))
print("Of the original {} trips, {} ({}%) more anomalous trips were removed through data cleaning
 "for further analysis. We removed data where either the pickup location or dropoff location
 .format(original_len,
 cleaned_len - manhat_len,
 round((cleaned_len - manhat_len) * 100 / original_len, 3),
 manhat_len))
3/4/2020 proj3 - Jupyter Notebook
localhost:8888/notebooks/proj3_otter/proj3.ipynb 11/30
Add a column labeled date to manhattan_taxi that contains the date (but not the time) of pickup, formatted as a
datetime.date value (docs (https://docs.python.org/3/library/datetime.html#date-objects)).
The provided tests check that you have extended manhattan_taxi correctly.
In [12]:
In [13]:
Question 2b
Create a data visualization that allows you to identify which dates were affected by the historic blizzard of January 2016.
Make sure that the visualization type is appropriate for the visualized data.
As a hint, consider how taxi usage might change on a day with a blizzard. How could you visualize/plot this?
Out[12]:
pickup_datetime dropoff_datetime pickup_lon pickup_lat dropoff_lon dropoff_lat passengers distance duration
0
2016-01-30
22:47:32
2016-01-30
23:03:53 -73.988251 40.743542 -74.015251 40.709808 2 3.99 981 2
0
1
2016-01-04
04:30:48
2016-01-04
04:36:08 -73.995888 40.760010 -73.975388 40.782200 1 2.03 320 2
0
2
2016-01-07
21:52:24
2016-01-07
21:57:23 -73.990440 40.730469 -73.985542 40.738510 1 0.70 299 2
0
3
2016-01-08
18:46:10
2016-01-08
18:54:00 -74.004494 40.706989 -74.010155 40.716751 5 0.97 470 2
0
4
2016-01-02
12:39:57
2016-01-02
12:53:29 -73.958214 40.760525 -73.983360 40.760406 1 1.70 812 2
0
Out[13]: All tests passed!
import datetime
years = manhattan_taxi["pickup_datetime"].str[0:4]
months = manhattan_taxi["pickup_datetime"].str[5:7]
days = manhattan_taxi["pickup_datetime"].str[8:10]
date = []
for i in range(len(years)):
 date.append(datetime.date(int(years[i]), int(months[i]), int(days[i])))
manhattan_taxi["date"] = np.array(date)
manhattan_taxi.head()
grader.check("q2a")
3/4/2020 proj3 - Jupyter Notebook
localhost:8888/notebooks/proj3_otter/proj3.ipynb 12/30
In [14]:
Finally, we have generated a list of dates that should have a fairly typical distribution of taxi rides, which excludes
holidays and blizzards. The cell below assigns final_taxi to the subset of manhattan_taxi that is on these days.
(No changes are needed; just run this cell.)
Out[14]: Text(0.5, 1.0, 'Number of Taxi Rides in Manhattan per Day in January 2016')
counts_per_day = pd.value_counts(manhattan_taxi["date"]).to_frame()
counts_per_day.reset_index(inplace=True)
counts_per_day.rename(columns={"index": "date", "date": "count"}, inplace=True)
dates = counts_per_day["date"]
days = []
for i in range(len(dates)):
 days.append(dates[i].day)
counts_per_day["day"] = days
counts_per_day.sort_values(by="day", inplace=True)
counts_per_day.plot.bar(x="date", y="count")
plt.xlabel("Date")
plt.ylabel("Number of Taxi Rides")
plt.title("Number of Taxi Rides in Manhattan per Day in January 2016")
3/4/2020 proj3 - Jupyter Notebook
localhost:8888/notebooks/proj3_otter/proj3.ipynb 13/30
In [15]:
You are welcome to perform more exploratory data analysis, but your work will not be scored. Here's a blank cell to use
if you wish. In practice, further exploration would be warranted at this point, but the project is already pretty long.
In [ ]:
Part 3: Feature Engineering
In this part, you'll create a design matrix (i.e., feature matrix) for your linear regression model. This is analagous to the
pipelines you've built already in class: you'll be adding features, removing labels, and scaling among other things.
You decide to predict trip duration from the following inputs: start location, end location, trip distance, time of day, and
day of the week (Monday, Tuesday, etc.).
You will ensure that the process of transforming observations into a design matrix is expressed as a Python function
called design_matrix , so that it's easy to make predictions for different samples in later parts of the project.
Because you are going to look at the data in detail in order to define features, it's best to split the data into training and
test sets now, then only inspect the training set.
In [16]:
Question 3a
Create a box plot that compares the distributions of taxi trip durations for each day using train only. Individual dates
shoud appear on the horizontal axis, and duration values should appear on the vertical axis. Your plot should look like
the one below.
You can generate this type of plot using sns.boxplot
Typical dates:
 January 2016
Mo Tu We Th Fr Sa Su

 4 5 6 7 8 9 10
11 12 13 14 15 16 17
 19 20 21 22
 27 28 29 30 31
Train: (53680, 10) Test: (13421, 10)
import calendar
import re
from datetime import date
atypical = [1, 2, 3, 18, 23, 24, 25, 26]
typical_dates = [date(2016, 1, n) for n in range(1, 32) if n not in atypical]
typical_dates
print('Typical dates:\n')
pat = ' [1-3]|18 | 23| 24|25 |26 '
print(re.sub(pat, ' ', calendar.month(2016, 1)))
final_taxi = manhattan_taxi[manhattan_taxi['date'].isin(typical_dates)]
# Optional: More EDA here
import sklearn.model_selection
train, test = sklearn.model_selection.train_test_split(
 final_taxi, train_size=0.8, test_size=0.2, random_state=42)
print('Train:', train.shape, 'Test:', test.shape)
3/4/2020 proj3 - Jupyter Notebook
localhost:8888/notebooks/proj3_otter/proj3.ipynb 14/30
g yp p g p
In [17]:
Question 3b
In one or two sentences, describe the assocation between the day of the week and the duration of a taxi trip. Your
answer should be supported by your boxplot above.
Note: The end of Part 2 showed a calendar for these dates and their corresponding days of the week.
Write your answer here, replacing this text.
Below, the provided augment function adds various columns to a taxi ride dataframe.
hour : The integer hour of the pickup time. E.g., a 3:45pm taxi ride would have 15 as the hour. A 12:20am ride
would have 0 .
Out[17]: Text(0.5, 1.0, 'Duration of Manhattan Taxi Trips by Date')
boxplot = sns.boxplot(x="date", y="duration", data=manhattan_taxi)
boxplot.set_xticklabels(boxplot.get_xticklabels(), rotation=90)
plt.title("Duration of Manhattan Taxi Trips by Date")
3/4/2020 proj3 - Jupyter Notebook
localhost:8888/notebooks/proj3_otter/proj3.ipynb 15/30
day : The day of the week with Monday=0, Sunday=6.
weekend : 1 if and only if the day is Saturday or Sunday.
period : 1 for early morning (12am-6am), 2 for daytime (6am-6pm), and 3 for night (6pm-12pm).
speed : Average speed in miles per hour.
No changes are required; just run this cell.
In [18]:
Question 3c
Use sns.distplot to create an overlaid histogram comparing the distribution of average speeds for taxi rides that
start in the early morning (12am-6am), day (6am-6pm; 12 hours), and night (6pm-12am; 6 hours). Your plot should look
like this:
Out[18]: pickup_datetime 2016-01-21 18:02:20
dropoff_datetime 2016-01-21 18:27:54
pickup_lon -73.9942
pickup_lat 40.751
dropoff_lon -73.9637
dropoff_lat 40.7711
passengers 1
distance 2.77
duration 1534
date 2016-01-21
hour 18
day 3
weekend 0
period 3
speed 6.50065
Name: 14043, dtype: object
def speed(t):
 """Return a column of speeds in miles per hour."""
 return t['distance'] / t['duration'] * 60 * 60
def augment(t):
 """Augment a dataframe t with additional columns."""
 u = t.copy()
 pickup_time = pd.to_datetime(t['pickup_datetime'])
 u.loc[:, 'hour'] = pickup_time.dt.hour
 u.loc[:, 'day'] = pickup_time.dt.weekday
 u.loc[:, 'weekend'] = (pickup_time.dt.weekday >= 5).astype(int)
 u.loc[:, 'period'] = np.digitize(pickup_time.dt.hour, [0, 6, 18])
 u.loc[:, 'speed'] = speed(t)
 return u

train = augment(train)
test = augment(test)
train.iloc[0,:] # An example row
3/4/2020 proj3 - Jupyter Notebook
localhost:8888/notebooks/proj3_otter/proj3.ipynb 16/30
In [19]:
It looks like the time of day is associated with the average speed of a taxi ride.
Question 3d
Manhattan can roughly be divided into Lower, Midtown, and Upper regions. Instead of studying a map, let's approximate
by finding the first principal component of the pick-up location (latitude and longitude).
Out[19]: <matplotlib.legend.Legend at 0x13251d70b48>
target_1 = train[train["period"] == 1]
target_2 = train[train["period"] == 2]
target_3 = train[train["period"] == 3]
sns.distplot(target_1[["speed"]], kde_kws={"shade": True}, label="Early Morning")
sns.distplot(target_2[["speed"]], kde_kws={"shade": True}, label="Day")
sns.distplot(target_3[["speed"]], kde_kws={"shade": True}, label="Night")
plt.xlabel("speed")
plt.legend()
3/4/2020 proj3 - Jupyter Notebook
localhost:8888/notebooks/proj3_otter/proj3.ipynb 17/30
Principal component analysis (https://en.wikipedia.org/wiki/Principal_component_analysis) (PCA) is a technique that
finds new axes as linear combinations of your current axes. These axes are found such that the first returned axis (the
first principal component) explains the most variation in values, the 2nd the second most, etc.
Add a region column to train that categorizes each pick-up location as 0, 1, or 2 based on the value of each
point's first principal component, such that an equal number of points fall into each region.
Read the documentation of pd.qcut (https://pandas.pydata.org/pandasdocs/version/0.23.4/generated/pandas.qcut.html), which categorizes points in a distribution into equal-frequency bins.
You don't need to add any lines to this solution. Just fill in the assignment statements to complete the implementation.
Before implementing PCA, it is important to scale and shift your values. The line with np.linalg.svd will return your
transformation matrix, among other things. You can then use this matrix to convert points in (lat, lon) space into (PC1,
PC2) space.
Hint: If you are failing the tests, try visualizing your processed data to understand what your code might be doing wrong.
The provided tests ensure that you have answered the question correctly.
In [20]:
In [21]:
Let's see how PCA divided the trips into three groups. These regions do roughly correspond to Lower Manhattan (below
14th street), Midtown Manhattan (between 14th and the park), and Upper Manhattan (bordering Central Park). No prior
knowledge of New York geography was required!
Out[21]: All tests passed!
# Find the first principle component
D = train[["pickup_lon", "pickup_lat"]]
pca_n = len(train)
pca_means = D.mean(axis=0)
X = (D - pca_means) / np.sqrt(pca_n)
u, s, vt = np.linalg.svd(X, full_matrices=False)
def add_region(t):
 """Add a region column to t based on vt above."""
 D = t[['pickup_lon', 'pickup_lat']]
 assert D.shape[0] == t.shape[0], 'You set D using the incorrect table'
 # Always use the same data transformation used to compute vt
 X = (D - pca_means) / np.sqrt(pca_n)
 first_pc = (X["pickup_lon"] * vt[0][0]) + (X["pickup_lat"] * vt[0][1])
 t.loc[:,'region'] = pd.qcut(first_pc, 3, labels=[0, 1, 2])

add_region(train)
add_region(test)
grader.check("q3d")
3/4/2020 proj3 - Jupyter Notebook
localhost:8888/notebooks/proj3_otter/proj3.ipynb 18/30
In [22]: plt.figure(figsize=(8, 16))
for i in [0, 1, 2]:
 pickup_scatter(train[train['region'] == i])
3/4/2020 proj3 - Jupyter Notebook
localhost:8888/notebooks/proj3_otter/proj3.ipynb 19/30
Question 3e (ungraded)
Use sns.distplot to create an overlaid histogram comparing the distribution of speeds for nighttime taxi rides (6pm12am) in the three different regions defined above. Does it appear that there is an association between region and
average speed during the night?
In [23]:
Finally, we create a design matrix that includes many of these features. Quantitative features are converted to standard
units, while categorical features are converted to dummy variables using one-hot encoding. The period is not
included because it is a linear combination of the hour . The weekend variable is not included because it is a linear
combination of the day . The speed is not included because it was computed from the duration ; it's impossible to
know the speed without knowing the duration, given that you know the distance.
Out[23]: <matplotlib.legend.Legend at 0x1324c3a9088>
target_0 = train[(train["region"] == 0) & (train["period"] == 3)]
target_1 = train[(train["region"] == 1) & (train["period"] == 3)]
target_2 = train[(train["region"] == 2) & (train["period"] == 3)]
sns.distplot(target_0[["speed"]], kde_kws={"shade": True}, label="Lower Manhattan")
sns.distplot(target_1[["speed"]], kde_kws={"shade": True}, label="Midtown Manhattan")
sns.distplot(target_2[["speed"]], kde_kws={"shade": True}, label="Upper Manhattan")
plt.xlabel("speed")
plt.legend()
3/4/2020 proj3 - Jupyter Notebook
localhost:8888/notebooks/proj3_otter/proj3.ipynb 20/30
In [24]:
Part 4: Model Selection
In this part, you will select a regression model to predict the duration of a taxi ride.
Important: Tests in this part do not confirm that you have answered correctly. Instead, they check that you're somewhat
close in order to detect major errors. It is up to you to calculate the results correctly based on the question descriptions.
Out[24]: pickup_lon -0.805821
pickup_lat -0.171761
dropoff_lon 0.954062
dropoff_lat 0.624203
distance 0.626326
hour_1 0.000000
hour_2 0.000000
hour_3 0.000000
hour_4 0.000000
hour_5 0.000000
hour_6 0.000000
hour_7 0.000000
hour_8 0.000000
hour_9 0.000000
hour_10 0.000000
hour_11 0.000000
hour_12 0.000000
hour_13 0.000000
hour_14 0.000000
hour_15 0.000000
hour_16 0.000000
hour_17 0.000000
hour_18 1.000000
hour_19 0.000000
hour_20 0.000000
hour_21 0.000000
hour_22 0.000000
hour_23 0.000000
day_1 0.000000
day_2 0.000000
day_3 1.000000
day_4 0.000000
day_5 0.000000
day_6 0.000000
region_1 1.000000
region_2 0.000000
Name: 14043, dtype: float64
from sklearn.preprocessing import StandardScaler
num_vars = ['pickup_lon', 'pickup_lat', 'dropoff_lon', 'dropoff_lat', 'distance']
cat_vars = ['hour', 'day', 'region']
scaler = StandardScaler()
scaler.fit(train[num_vars])
def design_matrix(t):
 """Create a design matrix from taxi ride dataframe t."""
 scaled = t[num_vars].copy()
 scaled.iloc[:,:] = scaler.transform(scaled) # Convert to standard units
 categoricals = [pd.get_dummies(t[s], prefix=s, drop_first=True) for s in cat_vars]
 return pd.concat([scaled] + categoricals, axis=1)
# This processes the full train set, then gives us the first item
# Use this function to get a processed copy of the dataframe passed in
# for training / evaluation
design_matrix(train).iloc[0,:]
3/4/2020 proj3 - Jupyter Notebook
localhost:8888/notebooks/proj3_otter/proj3.ipynb 21/30
Question 4a
Assign constant_rmse to the root mean squared error on the test set for a constant model that always predicts the
mean duration of all training set taxi rides.
In [25]:
In [26]:
Question 4b
Assign simple_rmse to the root mean squared error on the test set for a simple linear regression model that uses only
the distance of the taxi ride as a feature (and includes an intercept).
Terminology Note: Simple linear regression means that there is only one covariate. Multiple linear regression means
that there is more than one. In either case, you can use the LinearRegression model from sklearn to fit the
parameters to data.
In [27]:
In [28]:
Question 4c
Assign linear_rmse to the root mean squared error on the test set for a linear regression model fitted to the training
set without regularization, using the design matrix defined by the design_matrix function from Part 3.
The provided tests check that you have answered the question correctly and that your design_matrix function is
working as intended.
Out[25]: 406.6717335660125
Out[26]: All tests passed!
Out[27]: 276.7841105000342
Out[28]: All tests passed!
def rmse(errors):
 """Return the root mean squared error."""
 return np.sqrt(np.mean(errors ** 2))
constant_rmse = rmse(train["duration"] - train["duration"].mean())
constant_rmse
grader.check("q4a")
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(train["distance"].to_numpy().reshape(-1, 1), train["duration"].to_numpy().reshape(-1, 1
simple_rmse = rmse(test["duration"].to_numpy().reshape(-1, 1) - model.predict(test["distance"].to_
simple_rmse
grader.check("q4b")
3/4/2020 proj3 - Jupyter Notebook
localhost:8888/notebooks/proj3_otter/proj3.ipynb 22/30
In [29]:
In [30]:
Question 4d
For each possible value of period , fit an unregularized linear regression model to the subset of the training set in that
period . Assign period_rmse to the root mean squared error on the test set for a model that first chooses linear
regression parameters based on the observed period of the taxi ride, then predicts the duration using those parameters.
Again, fit to the training set and use the design_matrix function for features.
In [31]:
In [32]:
This approach is a simple form of decision tree regression, where a different regression function is estimated for each
possible choice among a collection of choices. In this case, the depth of the tree is only 1.
Question 4e
Out[29]: 255.19146631882776
Out[30]: All tests passed!
Out[31]: 246.62868831165173
Out[32]: All tests passed!
model = LinearRegression()
X_train = design_matrix(train)
y_train = train["duration"]
model.fit(X_train, y_train)
X_test = design_matrix(test)
y_test = test["duration"]
y_pred = model.predict(X_test)
linear_rmse = rmse(y_pred - y_test)
linear_rmse
grader.check("q4c")
model = LinearRegression()
errors = []
for v in np.unique(train['period']):
 train_period = train[train["period"] == v]
 X_train = design_matrix(train_period)
 y_train = train_period["duration"]

 model.fit(X_train, y_train)
 test_period = test[test["period"] == v]
 X_test = design_matrix(test_period)
 y_test = test_period["duration"]

 y_pred = model.predict(X_test)
 errors_li = y_pred - y_test
 for i in errors_li:
 errors.append(i)

period_rmse = rmse(np.array(errors))
period_rmse
grader.check("q4d")
3/4/2020 proj3 - Jupyter Notebook
localhost:8888/notebooks/proj3_otter/proj3.ipynb 23/30
In one or two sentences, explain how the period regression model above could possibly outperform linear regression
when the design matrix for linear regression already includes one feature for each possible hour, which can be
combined linearly to determine the period value.
Because the design matrix includes one feature for each possible hour, it has many more features, which means that
the complexity of the prediction function increases. This could possibly result in overfitting in the model, and thus the
accuracy on the test data may become lower.
Question 4f
Instead of predicting duration directly, an alternative is to predict the average speed of the taxi ride using linear
regression, then compute an estimate of the duration from the predicted speed and observed distance for each ride.
Assign speed_rmse to the root mean squared error in the duration predicted by a model that first predicts speed as a
linear combination of features from the design_matrix function, fitted on the training set, then predicts duration from
the predicted speed and observed distance.
Hint: Speed is in miles per hour, but duration is measured in seconds. You'll need the fact that there are 60 * 60 = 3,600
seconds in an hour.
In [33]:
In [34]:
Optional: Explain why predicting speed leads to a more accurate regression model than predicting duration directly. You
don't need to write this down.
Question 4g
Finally, complete the function tree_regression_errors (and helper function speed_error ) that combines the
ideas from the two previous models and generalizes to multiple categorical variables.
The tree_regression_errors should:
Find a different linear regression model for each possible combination of the variables in choices ;
Fit to the specified outcome (on train) and predict that outcome (on test) for each combination ( outcome will be
'duration' or 'speed' );
Use the specified error_fn (either duration_error or speed_error ) to compute the error in predicted
duration using the predicted outcome;
Aggregate those errors over the whole test set and return them.
Out[33]: 243.01798368514952
Out[34]: All tests passed!
model = LinearRegression()
X_train = design_matrix(train)
y_train = train["speed"]
model.fit(X_train, y_train)
X_test = design_matrix(test)
speed_pred = model.predict(X_test)
dur_pred = (test["distance"] / speed_pred) * 3600
y_test = test["duration"]
errors1 = np.array(dur_pred - y_test)
speed_rmse = rmse(errors1)
speed_rmse
grader.check("q4f")
3/4/2020 proj3 - Jupyter Notebook
localhost:8888/notebooks/proj3_otter/proj3.ipynb 24/30
You should find that including each of period , region , and weekend improves prediction accuracy, and that
predicting speed rather than duration leads to more accurate duration predictions.
If you're stuck, try putting print statements in the skeleton code to see what it's doing.
In [35]:
In [36]:
Here's a summary of your results:
Duration: 240.33952192703526
Speed: 226.90793945018308
Out[36]: All tests passed!
model = LinearRegression()
choices = ['period', 'region', 'weekend']
def duration_error(predictions, observations):
 """Error between duration predictions (array) and observations (data frame)"""
 return predictions - observations['duration']
def speed_error(predictions, observations):
 """Duration error between speed predictions and duration observations"""
 dur_preds = (observations["distance"] / predictions) * 3600
 return dur_preds - observations["duration"]

def tree_regression_errors(outcome='duration', error_fn=duration_error):
 """Return errors for all examples in test using a tree regression model."""
 errors = []
 for vs in train.groupby(choices).size().index:
 v_train, v_test = train, test
 for v, c in zip(vs, choices):
 v_train = v_train[v_train[c] == v]
 v_test = v_test[v_test[c] == v]

 y_train = v_train[outcome]
 model.fit(design_matrix(v_train), y_train)
 y_pred = model.predict(design_matrix(v_test))
 curr_err = error_fn(y_pred, v_test)
 for e in curr_err:
 errors.append(e)

 return errors
errors = tree_regression_errors()
errors_via_speed = tree_regression_errors('speed', speed_error)
tree_rmse = rmse(np.array(errors))
tree_speed_rmse = rmse(np.array(errors_via_speed))
print('Duration:', tree_rmse, '\nSpeed:', tree_speed_rmse)
grader.check("q4g")
3/4/2020 proj3 - Jupyter Notebook
localhost:8888/notebooks/proj3_otter/proj3.ipynb 25/30
In [37]:
Part 5: Building on your own
In this part you'll build a regression model of your own design, with the goal of achieving even higher performance than
you've seen already. You will be graded on your performance relative to others in the class, with higher performance
(lower RMSE) receiving more points.
Question 5a
In the below cell (feel free to add your own additional cells), train a regression model of your choice on the same train
dataset split used above. The model can incorporate anything you've learned from the class so far.
The model you train will be used for questions 5b and 5c
In [38]:
models = ['constant', 'simple', 'linear', 'period', 'speed', 'tree', 'tree_speed']
pd.DataFrame.from_dict({
 'Model': models,
 'Test RMSE': [eval(m + '_rmse') for m in models]
}).set_index('Model').plot(kind='barh');
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras import regularizers
from tensorflow.keras.layers import Dropout
3/4/2020 proj3 - Jupyter Notebook
localhost:8888/notebooks/proj3_otter/proj3.ipynb 26/30
In [43]:
In [44]:
Question 5b
Print a summary of your model's performance. You must include the RMSE on the train and test sets. Do not hardcode
any values or you won't receive credit.
Don't include any long lines or we won't be able to grade your response.
WARNING:tensorflow:Falling back from v2 loop because of error: Failed to find data adapter th
at can handle input: <class 'pandas.core.frame.DataFrame'>, <class 'NoneType'>
Train on 42944 samples, validate on 10736 samples
Epoch 1/20
42944/42944 [==============================] - ETA: 4:13 - loss: 138.4323 - mse: 138.432 - ET
A: 13s - loss: 32.9388 - mse: 32.9388 - ETA: 8s - loss: 24.7549 - mse: 24.7549 - ETA: 6s - l
oss: 21.2754 - mse: 21.275 - ETA: 5s - loss: 19.4522 - mse: 19.452 - ETA: 5s - loss: 18.2801
- mse: 18.280 - ETA: 4s - loss: 19.3591 - mse: 19.359 - ETA: 4s - loss: 18.8831 - mse: 18.883
- ETA: 4s - loss: 18.3205 - mse: 18.320 - ETA: 4s - loss: 17.6100 - mse: 17.610 - ETA: 3s - l
oss: 17.0648 - mse: 17.064 - ETA: 3s - loss: 16.4921 - mse: 16.492 - ETA: 3s - loss: 16.0860
- mse: 16.086 - ETA: 3s - loss: 15.8120 - mse: 15.812 - ETA: 3s - loss: 15.6052 - mse: 15.605
- ETA: 3s - loss: 15.4007 - mse: 15.400 - ETA: 3s - loss: 15.2774 - mse: 15.277 - ETA: 2s - l
oss: 15.1353 - mse: 15.135 - ETA: 2s - loss: 14.9044 - mse: 14.904 - ETA: 2s - loss: 14.7852
- mse: 14.785 - ETA: 2s - loss: 14.7116 - mse: 14.711 - ETA: 2s - loss: 14.5891 - mse: 14.589
- ETA: 2s - loss: 14.4513 - mse: 14.451 - ETA: 2s - loss: 14.4029 - mse: 14.402 - ETA: 2s - l
oss: 14.2633 - mse: 14.263 - ETA: 2s - loss: 14.1300 - mse: 14.130 - ETA: 2s - loss: 14.0714
- mse: 14.071 - ETA: 2s - loss: 13.9124 - mse: 13.912 - ETA: 2s - loss: 13.8624 - mse: 13.862
- ETA: 2s - loss: 13.7593 - mse: 13.759 - ETA: 2s - loss: 13.6969 - mse: 13.696 - ETA: 1s - l
oss: 13.6274 - mse: 13.627 - ETA: 1s - loss: 13.5721 - mse: 13.572 - ETA: 1s - loss: 13.4977
13 497 ETA 1 l 13 3627 13 362 ETA 1 l 13 3173 13 317
WARNING:tensorflow:Falling back from v2 loop because of error: Failed to find data adapter that
can handle input: <class 'pandas.core.frame.DataFrame'>, <class 'NoneType'>
Out[44]: 194.46067829204708
# try changing number of layers, nodes in each layer, regularization between layers
model = Sequential([
 Dense(64, activation='relu', input_shape=(36,)),
 Dense(64, activation='relu'),
 Dense(1),
])
model.compile(optimizer='sgd',
 loss='mse',
 metrics=['mse'])
# could perhaps try not using design matrix?
X_train = design_matrix(train)
y_train = train["speed"]
X_test = design_matrix(test)
y_test = test["speed"]
# fit the model
# try training with different number of epochs
hist = model.fit(X_train, y_train, batch_size=32, epochs=20, validation_split=0.2)
# predict duration from speeds
speed_pred = model.predict(X_test)
speed_pred = speed_pred.flatten()
dur_labels = test["duration"]
dur_pred = (test["distance"] / speed_pred) * 3600
error = rmse(dur_pred - dur_labels)
error
3/4/2020 proj3 - Jupyter Notebook
localhost:8888/notebooks/proj3_otter/proj3.ipynb 27/30
In [45]:
Question 5c
Describe why you selected the model you did and what you did to try and improve performance over the models in
section 4.
Responses should be at most a few sentences
I selected a Neural Network model, because I know that it is currently the most accurate machine learning algorithm. I
used the Keras library, which was part of the Tensorflow library. Although neural networks are currently the best models
to use, it is still important to know how to tune the hyperparameters.
WARNING:tensorflow:Falling back from v2 loop because of error: Failed to find data adapter that
can handle input: <class 'pandas.core.frame.DataFrame'>, <class 'NoneType'>
Train set RMSE: 187.5146677223602
Test set RMSE: 194.46067829204708
# Plotting the losses
plt.plot(hist.history["loss"], label="Train Set Loss")
plt.plot(hist.history["val_loss"], label="Validation Set Loss")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.title("Model Loss")
plt.legend()
train_speed_pred = model.predict(X_train)
distances = train["distance"]
dist1 = distances[: 26840]
dist2 = distances[26840: ]
spd1 = train_speed_pred[: 26840].flatten()
spd2 = train_speed_pred[26840: ].flatten()
pred1 = (dist1 / spd1) * 3600
pred2 = (dist2 / spd2) * 3600
train_dur_labels = train["duration"]
train_dur_pred = pred1.append(pred2)
train_err = rmse(train_dur_pred - train_dur_labels)
print("Train set RMSE: {}".format(train_err))
print("Test set RMSE: {}".format(error))
3/4/2020 proj3 - Jupyter Notebook
localhost:8888/notebooks/proj3_otter/proj3.ipynb 28/30
First, I tested the model's performance using different combinations of the number of nodes vs. the number of layers.
The number of layers I tried ranged from 2 to 7, and the number of nodes I tried ranged from 32 to 96 (in increments of
32). It turned out that a combination of 4 layers of 64 nodes each produced the best results.
Then I tested the model's performance using different methods of regularization. Regularization is a method that is used
to reduce overfitting, because it constrains the weights of the nodes to a small value. There are two methods of
regularization that I tried: L2 regularization and node dropout. I tested node dropout of rates of 0.2 - 0.4 in between each
layer. I also tested L2 regularization values of 0.1 - 0.0001 (by factors of 10). I found that both dropout and L2
regularization actually worsened my results.
Finally, I noticed that the model was becoming overfitted after many iterations or epochs. As a result, I decided use the
early stopping technique, where I stopped training after fewer epochs. This method yielded the best results.
Congratulations! You've carried out the entire data science lifecycle for a challenging regression problem.
In Part 1 on data selection, you solved a domain-specific programming problem relevant to the analysis when choosing
only those taxi rides that started and ended in Manhattan.
In Part 2 on EDA, you used the data to assess the impact of a historical event---the 2016 blizzard---and filtered the data
accordingly.
In Part 3 on feature engineering, you used PCA to divide up the map of Manhattan into regions that roughly
corresponded to the standard geographic description of the island.
In Part 4 on model selection, you found that using linear regression in practice can involve more than just choosing a
design matrix. Tree regression made better use of categorical variables than linear regression. The domain knowledge
that duration is a simple function of distance and speed allowed you to predict duration more accurately by first
predicting speed.
In Part 5, you made your own model using techniques you've learned throughout the course.
Hopefully, it is apparent that all of these steps are required to reach a reliable conclusion about what inputs and model
structure are helpful in predicting the duration of a taxi ride in Manhattan.
Future Work
Here are some questions to ponder:
The regression model would have been more accurate if we had used the date itself as a feature instead of just the
day of the week. Why didn't we do that?
Does collecting this information about every taxi ride introduce a privacy risk? The original data also included the
total fare; how could someone use this information combined with an individual's credit card records to determine
their location?
Why did we treat hour as a categorical variable instead of a quantitative variable? Would a similar treatment be
beneficial for latitude and longitude?
Why are Google Maps estimates of ride time much more accurate than our estimates?
Here are some possible extensions to the project:
An alternative to throwing out atypical days is to condition on a feature that makes them atypical, such as the
weather or holiday calendar. How would you do that?
Training a different linear regression model for every possible combination of categorical variables can overfit. How
would you select which variables to include in a decision tree instead of just using them all?
Your models use the observed distance as an input, but the distance is only observed after the ride is over. How
could you estimate the distance from the pick-up and drop-off locations?
How would you incorporate traffic data into the model?
3/4/2020 proj3 - Jupyter Notebook
localhost:8888/notebooks/proj3_otter/proj3.ipynb 29/30
In [47]:
Your file has been exported. Download it here (proj3.pdf)!
[W:pyppeteer.chromium_downloader] start chromium download.
Download may take a few minutes.
Task exception was never retrieved
future: <Task finished coro=<notebook_to_pdf() done, defined at C:\Users\shado\Anaconda3\lib\sit
e-packages\nbpdfexport\__init__.py:36> exception=MaxRetryError('HTTPSConnectionPool(host=\'stora
ge.googleapis.com\', port=443): Max retries exceeded with url: /chromium-browser-snapshots/Win_x
64/575458/chrome-win32.zip (Caused by SSLError(SSLError("bad handshake: Error([(\'SSL routines
\', \'tls_process_server_certificate\', \'certificate verify failed\')])")))')>
Traceback (most recent call last):
 File "C:\Users\shado\Anaconda3\lib\site-packages\urllib3\contrib\pyopenssl.py", line 485, in w
rap_socket
 cnx.do_handshake()
 File "C:\Users\shado\Anaconda3\lib\site-packages\OpenSSL\SSL.py", line 1915, in do_handshake
 self._raise_ssl_error(self._ssl, result)
 File "C:\Users\shado\Anaconda3\lib\site-packages\OpenSSL\SSL.py", line 1647, in _raise_ssl_err
or
 _raise_current_error()
 File "C:\Users\shado\Anaconda3\lib\site-packages\OpenSSL\_util.py", line 54, in exception_from
_error_queue
 raise exception_type(errors)
OpenSSL.SSL.Error: [('SSL routines', 'tls_process_server_certificate', 'certificate verify faile
d')]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
 File "C:\Users\shado\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 672, in urlo
pen
 chunked=chunked,
 File "C:\Users\shado\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 376, in _mak
e_request
 self._validate_conn(conn)
 File "C:\Users\shado\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 994, in _val
idate_conn
 conn.connect()
 File "C:\Users\shado\Anaconda3\lib\site-packages\urllib3\connection.py", line 394, in connect
 ssl_context=context,
 File "C:\Users\shado\Anaconda3\lib\site-packages\urllib3\util\ssl_.py", line 370, in ssl_wrap_
socket
 return context.wrap_socket(sock, server_hostname=server_hostname)
 File "C:\Users\shado\Anaconda3\lib\site-packages\urllib3\contrib\pyopenssl.py", line 491, in w
rap_socket
 raise ssl.SSLError("bad handshake: %r" % e)
ssl.SSLError: ("bad handshake: Error([('SSL routines', 'tls_process_server_certificate', 'certif
icate verify failed')])",)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
 File "C:\Users\shado\Anaconda3\lib\site-packages\nbpdfexport\__init__.py", line 46, in noteboo
k_to_pdf
 await html_to_pdf(f.name, pdf_path)
 File "C:\Users\shado\Anaconda3\lib\site-packages\nbpdfexport\__init__.py", line 20, in html_to
_pdf
 browser = await launch(args=['--no-sandbox'])
 File "C:\Users\shado\Anaconda3\lib\site-packages\pyppeteer\launcher.py", line 311, in launch
 return await Launcher(options, **kwargs).launch()
 File "C:\Users\shado\Anaconda3\lib\site-packages\pyppeteer\launcher.py", line 125, in __init__
 download_chromium()
# Save your notebook first, then run this cell to generate a PDF.
# Note, the download link will likely not work.
# Find the pdf in the same directory as your proj3.ipynb
grader.export("proj3.ipynb", filtering=False)
3/4/2020 proj3 - Jupyter Notebook
localhost:8888/notebooks/proj3_otter/proj3.ipynb 30/30
 File "C:\Users\shado\Anaconda3\lib\site-packages\pyppeteer\chromium_downloader.py", line 136,
in download_chromium
 extract_zip(download_zip(get_url()), DOWNLOADS_FOLDER / REVISION)
 File "C:\Users\shado\Anaconda3\lib\site-packages\pyppeteer\chromium_downloader.py", line 78, i
n download_zip
 data = http.request('GET', url, preload_content=False)
 File "C:\Users\shado\Anaconda3\lib\site-packages\urllib3\request.py", line 76, in request
 method, url, fields=fields, headers=headers, **urlopen_kw
 File "C:\Users\shado\Anaconda3\lib\site-packages\urllib3\request.py", line 97, in request_enco
de_url
 return self.urlopen(method, url, **extra_kw)
 File "C:\Users\shado\Anaconda3\lib\site-packages\urllib3\poolmanager.py", line 330, in urlopen
 response = conn.urlopen(method, u.request_uri, **kw)
 File "C:\Users\shado\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 760, in urlo
pen
 **response_kw
 File "C:\Users\shado\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 760, in urlo
pen
 **response_kw
 File "C:\Users\shado\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 760, in urlo
pen
 **response_kw
 File "C:\Users\shado\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 720, in urlo
pen
 method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
 File "C:\Users\shado\Anaconda3\lib\site-packages\urllib3\util\retry.py", line 436, in incremen
t
 raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='storage.googleapis.com', port=443):
Max retries exceeded with url: /chromium-browser-snapshots/Win_x64/575458/chrome-win32.zip (Caus
ed by SSLError(SSLError("bad handshake: Error([('SSL routines', 'tls_process_server_certificat
e', 'certificate verify failed')])")))