Wednesday, December 12, 2018

Hubway Capstone Project-- Demand Over Time of Day

Hubway is a bike-share program collectively owned by four metro Boston municipalities: Boston, Cambridge, Somerville, and Brookline. It is operated by Motivate, which manages similar initiatives in NYC, Portland, Chicago, Washington DC, and several other metro areas in Ohio, Tennessee, and New Jersey. Motivate is opening operations in San Francisco in June 2017. Hubway currently consists of 188 stations and 1,800 bikes.
  • For this project, I investigated shared data for the months of January, May, June, July, and October in 2015 and 2016.
  • The questions of interest were:
    • How do riders use the bike-share service?
    • Are the bikes used as a conveyance or for recreation?
    • What type of customer uses the service?
Below is a look into demand over the course of all the days in the dataset.
Import Libraries
In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import datetime
import warnings
warnings.filterwarnings("ignore")
Read the data CSV file and create a data frame
In [2]:
# read_csv already returns a DataFrame, so no extra conversion is needed
hubway_df = pd.read_csv('./hubway.csv')
hubway_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 946473 entries, 0 to 946472
Data columns (total 19 columns):
tripduration               946473 non-null int64
starttime                  946473 non-null object
stoptime                   946473 non-null object
start station id           946473 non-null int64
start station name         946473 non-null object
start station latitude     946473 non-null float64
start station longitude    946473 non-null float64
end station id             946473 non-null int64
end station name           946473 non-null object
end station latitude       946473 non-null float64
end station longitude      946473 non-null float64
bikeid                     946473 non-null int64
age decile                 946473 non-null int64
male                       946473 non-null int64
end station category       946473 non-null int64
start station category     946473 non-null int64
day_of_week                946473 non-null object
usertype_Customer          946473 non-null int64
usertype_Subscriber        946473 non-null int64
dtypes: float64(4), int64(10), object(5)
memory usage: 137.2+ MB
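The `info()` output shows `starttime` and `stoptime` loaded as generic `object` (string) columns. A minimal sketch, assuming the same column names, of letting `read_csv` parse them as datetimes while reading. The two-row inline CSV is illustrative sample data modeled on the first rows of the dataset and stands in for `hubway.csv`:

```python
import io
import pandas as pd

# Illustrative two-row sample standing in for hubway.csv (not the real file)
sample_csv = io.StringIO(
    "tripduration,starttime,stoptime\n"
    "543,2015-01-01 00:21:44,2015-01-01 00:30:47\n"
    "432,2015-01-01 00:53:46,2015-01-01 01:00:58\n"
)

# parse_dates converts the named columns to datetime64 while reading
sample_df = pd.read_csv(sample_csv, parse_dates=["starttime", "stoptime"])
print(sample_df.dtypes)
```

With the columns parsed at load time, a separate `pd.to_datetime` step later on would not be needed.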
Create a dataframe of just the start times, stop times, and end station category
In [3]:
# .copy() makes dayhub independent of hubway_df, so new columns can be added safely
dayhub = hubway_df[['starttime', 'stoptime', 'end station category']].copy()
dayhub.head()
Out[3]:
starttime stoptime end station category
0 2015-01-01 00:21:44 2015-01-01 00:30:47 4
1 2015-01-01 00:53:46 2015-01-01 01:00:58 4
2 2015-01-04 14:29:05 2015-01-04 14:38:45 4
3 2015-01-08 16:17:04 2015-01-08 16:29:39 4
4 2015-01-10 11:40:49 2015-01-10 11:51:57 4
Convert the string columns to actual datetimes, then drop the date information and the seconds
In [4]:
dayhub['srt_time'] = pd.to_datetime(dayhub['starttime'])
#dayhub['srt_time'] = dayhub.index.map(lambda x: x.replace(second=0))
dayhub['stp_time'] = pd.to_datetime(dayhub['stoptime'])
#dayhub['stp_time'] = dayhub['stp_time'].values.astype('<M8[m]')
dayhub.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 946473 entries, 0 to 946472
Data columns (total 5 columns):
starttime               946473 non-null object
stoptime                946473 non-null object
end station category    946473 non-null int64
srt_time                946473 non-null datetime64[ns]
stp_time                946473 non-null datetime64[ns]
dtypes: datetime64[ns](2), int64(1), object(2)
memory usage: 36.1+ MB
In [5]:
dayhub['Time'] = [d.time() for d in dayhub['stp_time']]
dayhub['Time'] = dayhub['Time'].map(lambda x: x.replace(second=0))
dayhub.tail(5)
Out[5]:
starttime stoptime end station category srt_time stp_time Time
946468 2016-10-26 18:07:59 2016-10-26 18:11:12 2 2016-10-26 18:07:59 2016-10-26 18:11:12 18:11:00
946469 2016-10-27 07:40:49 2016-10-27 07:43:55 2 2016-10-27 07:40:49 2016-10-27 07:43:55 07:43:00
946470 2016-10-30 15:21:23 2016-10-30 15:29:57 2 2016-10-30 15:21:23 2016-10-30 15:29:57 15:29:00
946471 2016-10-30 15:21:28 2016-10-30 15:29:57 2 2016-10-30 15:21:28 2016-10-30 15:29:57 15:29:00
946472 2016-10-14 13:05:51 2016-10-14 13:10:14 4 2016-10-14 13:05:51 2016-10-14 13:10:14 13:10:00
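The list comprehension and `lambda` above run a Python-level loop over roughly 946,000 rows. A sketch of a vectorized alternative, assuming the stop times are already `datetime64` values: `Series.dt.floor("min")` rounds each timestamp down to the minute and `.dt.time` keeps only the time of day. The three timestamps are taken from the output above, and the sketch checks that both approaches agree:

```python
import datetime
import pandas as pd

# Three stop times copied from the tail() output above
stp_time = pd.Series(pd.to_datetime([
    "2016-10-26 18:11:12",
    "2016-10-27 07:43:55",
    "2016-10-14 13:10:14",
]))

# Loop-based version, equivalent to the cell above
loop_times = pd.Series([d.time().replace(second=0) for d in stp_time])

# Vectorized version: round down to the minute, then keep only the time of day
vec_times = stp_time.dt.floor("min").dt.time

print(vec_times.tolist())
print(loop_times.tolist() == vec_times.tolist())
```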
Drop unneeded information leaving only the end station category and the time
In [6]:
# Pass axis by keyword; the positional form is not accepted by recent pandas versions
dayhub_vis = dayhub.drop(['starttime', 'stoptime', 'srt_time', 'stp_time'], axis=1)
dayhub_vis.tail()
Out[6]:
end station category Time
946468 2 18:11:00
946469 2 07:43:00
946470 2 15:29:00
946471 2 15:29:00
946472 4 13:10:00
In [7]:
dayhub_vis.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 946473 entries, 0 to 946472
Data columns (total 2 columns):
end station category    946473 non-null int64
Time                    946473 non-null object
dtypes: int64(1), object(1)
memory usage: 14.4+ MB
In [8]:
hubway_demand = dayhub_vis['Time'].value_counts()
dayhub_dmnd = pd.DataFrame(hubway_demand)
dayhub_dmnd.columns = ['Demand']
dayhub_dmnd.tail()
Out[8]:
Demand
03:32:00 8
04:24:00 8
03:46:00 8
04:08:00 8
04:06:00 6
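`value_counts()` returns rows ordered by count, which is why `tail()` shows the quietest minutes of the day rather than the latest ones. To read demand as a curve over the day, the index needs to be in chronological order. A small sketch of that idea, which also groups by hour instead of by minute (the hourly grouping is my assumption, not part of the original analysis); the last sample row is made up:

```python
import pandas as pd

stop_times = pd.Series(pd.to_datetime([
    "2016-10-26 18:11:12",
    "2016-10-27 07:43:55",
    "2016-10-30 15:29:57",
    "2016-10-30 15:29:57",
    "2016-10-14 13:10:14",
    "2016-10-26 18:45:00",  # made-up extra row for illustration
]))

# Count trips per hour of day, then order by hour rather than by count
hourly_demand = stop_times.dt.hour.value_counts().sort_index()
hourly_demand.index.name = "hour"
print(hourly_demand)
```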
In [10]:
# value_counts() orders rows by count, so sort by time of day before drawing the line
ax = dayhub_dmnd.sort_index().plot(kind='line', title="Total demand over the course of the day",
                                   figsize=(15,10), legend=True, fontsize=12)

plt.show()
The highest demand, measured here by trip end times, falls during rush hour, which points to a commuter-type customer.
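One way to put a number on the rush-hour observation is to compute the share of trips that end inside weekday commute windows. The sketch below is illustrative only: the window boundaries (7 to 10 and 16 to 19) are my assumptions, and the stop times are made up. On the real data the input would be `dayhub['stp_time']`:

```python
import pandas as pd

# Made-up stop times for illustration
stop_times = pd.Series(pd.to_datetime([
    "2016-10-26 08:05:00",  # Wednesday
    "2016-10-26 17:30:00",  # Wednesday
    "2016-10-27 12:15:00",  # Thursday
    "2016-10-29 08:20:00",  # Saturday
    "2016-10-30 15:29:00",  # Sunday
]))

# Assumed commute windows (hour of day, start inclusive, end exclusive)
AM_RUSH = (7, 10)
PM_RUSH = (16, 19)

hour = stop_times.dt.hour
is_weekday = stop_times.dt.dayofweek < 5  # Monday=0 ... Sunday=6
in_rush = (hour.between(AM_RUSH[0], AM_RUSH[1] - 1)
           | hour.between(PM_RUSH[0], PM_RUSH[1] - 1))

# Fraction of all trips that end on a weekday inside a rush window
rush_share = (is_weekday & in_rush).mean()
print(f"Share of trips ending in weekday rush windows: {rush_share:.0%}")
```

A high share on the real data would support the commuter reading; a share close to what uniform usage across the day would give would weaken it.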

2 comments:

  1. Replies
    1. Thanks so much-- I do have to take it down, and rework the features to get rid of the overfit in some of the models. The program has really taken off here in Boston, Mass. and I want to update it with new data as well.