Wednesday, December 12, 2018

Hubway Capstone Project-- Insights

Hubway Capstone Insights

Hubway Capstone Project: Insights

Hubway is a bike-share program collectively owned by the metro Boston cities; Boston, Cambridge, Somerville, and Brookline. It is operated by Motivate, who manages similar initiatives in NYC, Portland, Chicago, Washington DC, and several other metro areas in Ohio, Tennessee, and New Jersey. They are opening up operations in San Francisco during the month of June, 2017. Hubway currently exists as a system of 188 stations with 1,800 bikes.
  • For this project, I investigated shared data for the months of January, May, June, July, and October during the years of 2015 and 2016.
  • Of concern were the questions of;
    • How do riders use the bike-share service?
    • Are the bikes used as a conveyance or for recreation?
    • What type of customer uses the service?
Below are some initial empirical data analysis and insights.
Import Libraries
In [1]:
import numpy as np
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets
from datetime import datetime
Below is a dataframe of the Hubway data. Features added to fit the project were an age decile based on reported birth year, a column that shows reported gender in a binary fashion, station categories, and the day of week the trip occured.
More on the categories:
  • category 1 represents stations located on side-sreets and smaller public parks.
  • category 2 represents squares mixed with dense retail and residential population, such as Inman Square in Cambridge, and Union Square in Allston and Somerville.
  • category 3 represents recreational or tourist areas, such as the Esplanade or Faneuil Hall.
  • category 4 is a mix of business areas, educational or civic institutions, and transportation hubs such as North and South Stations. It is meant to represent the user who is a commuter.
  • category 5 represents shopping areas with large anchor stores, such as the Whole Foods at the Ink Block, Trader Joe's on Memorial Dr, and Shaw's at Porter Sq.
In [2]:
hubway_df = pd.read_csv('./hubway.csv')
hubway_df.head(1)
Out[2]:
tripduration starttime stoptime start station id start station name start station latitude start station longitude end station id end station name end station latitude end station longitude bikeid age decile male end station category start station category day_of_week usertype_Customer usertype_Subscriber
0 542 2015-01-01 00:21:44 2015-01-01 00:30:47 115 Porter Square Station 42.387995 -71.119084 96 Cambridge Main Library at Broadway / Trowbridg... 42.373379 -71.111075 277 30 1 4 5 Thurs 0 1
Create timestamps for start and stop times.
In [3]:
hubway_df['starttime'] = pd.to_datetime(hubway_df.starttime)
#hubway_df.dtypes
hubway_df['stoptime'] = pd.to_datetime(hubway_df.stoptime)
hubway_df.dtypes
Out[3]:
tripduration                        int64
starttime                  datetime64[ns]
stoptime                   datetime64[ns]
start station id                    int64
start station name                 object
start station latitude            float64
start station longitude           float64
end station id                      int64
end station name                   object
end station latitude              float64
end station longitude             float64
bikeid                              int64
age decile                          int64
male                                int64
end station category                int64
start station category              int64
day_of_week                        object
usertype_Customer                   int64
usertype_Subscriber                 int64
dtype: object
Create series of integers to represent the day of week, and set the range of times from 6am to 9pm. Create a new column of just the starting time without date.
In [4]:
hubway_df['day_of_week_int'] = hubway_df['stoptime'].dt.dayofweek
In [5]:
import datetime
rush_start = datetime.time(6)
rush_end = datetime.time(21)
In [6]:
hubway_df['time'] = [d.time() for d in hubway_df['starttime']]
In [ ]:
 
In [7]:
hubway_df.head(1)
Out[7]:
tripduration starttime stoptime start station id start station name start station latitude start station longitude end station id end station name end station latitude ... bikeid age decile male end station category start station category day_of_week usertype_Customer usertype_Subscriber day_of_week_int time
0 542 2015-01-01 00:21:44 2015-01-01 00:30:47 115 Porter Square Station 42.387995 -71.119084 96 Cambridge Main Library at Broadway / Trowbridg... 42.373379 ... 277 30 1 4 5 Thurs 0 1 3 00:21:44
1 rows × 21 columns
Create a database to develop data dictionary and sort through times.
In [8]:
import sqlite3
from pandas.io import sql
from sqlalchemy import create_engine
In [9]:
#using SQLAlchemy to create a db engine to contain datbase.
#engine = create_engine('sqlite:///C:\Users\Owner\Documents\Capstone\database\hubway.db', echo=True)
In [10]:
#save dataframe to a SQL database.
#hubway_df.to_sql(name = 'hubway', con = engine, if_exists = 'replace', index = False)
In [11]:
hubway = './database/hubway.db'
conn = sqlite3.connect(hubway) 
c = conn.cursor()

Data dictionary

In [12]:
sql_query = """
PRAGMA table_info(hubway)
"""

pd.read_sql(sql_query, con=conn)
Out[12]:
cid name type notnull dflt_value pk
0 0 tripduration BIGINT 0 None 0
1 1 starttime DATETIME 0 None 0
2 2 stoptime DATETIME 0 None 0
3 3 start station id BIGINT 0 None 0
4 4 start station name TEXT 0 None 0
5 5 start station latitude FLOAT 0 None 0
6 6 start station longitude FLOAT 0 None 0
7 7 end station id BIGINT 0 None 0
8 8 end station name TEXT 0 None 0
9 9 end station latitude FLOAT 0 None 0
10 10 end station longitude FLOAT 0 None 0
11 11 bikeid BIGINT 0 None 0
12 12 usertype TEXT 0 None 0
13 13 age decile BIGINT 0 None 0
14 14 male BIGINT 0 None 0
15 15 end station category BIGINT 0 None 0
16 16 start station category BIGINT 0 None 0
17 17 day_of_week TEXT 0 None 0
18 18 day_of_week_int BIGINT 0 None 0
19 19 time TIME 0 None 0
Sorting to the range in time from 6am to 9pm and checking the descriptive statistics of the target variable "end station category."
  • note the business and institution category [4] occupies the second and third quartiles, while residential and retail squares [2] occupy the first quartile. The mean of 3.25 reflects this influence of the quartiles.
In [13]:
hubday = pd.read_sql('''SELECT * FROM hubway WHERE "time" BETWEEN "06%" AND "21%"''', con = conn)
hubday['end station category'].describe()
Out[13]:
count    867199.000000
mean          3.251253
std           1.267759
min           1.000000
25%           2.000000
50%           4.000000
75%           4.000000
max           5.000000
Name: end station category, dtype: float64
Now let's look at the percentages of trips between the start station categories and the end station ones.
--Note that starting station category is the index and ending stations are the columns.
In [14]:
catsp_csv = pd.read_csv('./database/hubway-cats-prob.csv')
catsp_df = pd.DataFrame(catsp_csv)
catsp_df = catsp_df.drop('Unnamed: 0', 1)
catsp_df.head()
Out[14]:
Categories Percent end_side_streets end_mixed_squares end_recreation end_business_institution end_major_shopping
0 start_side_streets 0.114010 0.197736 0.135013 0.408610 0.144632
1 start_mixed_squares 0.138742 0.236436 0.100348 0.376485 0.147988
2 start_recreation 0.116273 0.121135 0.247220 0.373013 0.142358
3 start_business_institution 0.128534 0.173401 0.151474 0.402990 0.143601
4 start_major_shopping 0.130734 0.179296 0.153727 0.379662 0.156582
In [15]:
catsp_df = catsp_df.set_index('Categories Percent')

catsp_df.head()
Out[15]:
end_side_streets end_mixed_squares end_recreation end_business_institution end_major_shopping
Categories Percent
start_side_streets 0.114010 0.197736 0.135013 0.408610 0.144632
start_mixed_squares 0.138742 0.236436 0.100348 0.376485 0.147988
start_recreation 0.116273 0.121135 0.247220 0.373013 0.142358
start_business_institution 0.128534 0.173401 0.151474 0.402990 0.143601
start_major_shopping 0.130734 0.179296 0.153727 0.379662 0.156582
The business and institution category [4] seems to the most active. Establishing a baseline probability that the end station will be a category 4.
In [16]:
print 'Project baseline:', np.mean(catsp_df.end_business_institution)
Project baseline: 0.3881519011
This is the target variable: the ending station category, and the business and instituion category in particular.
In [17]:
ax = sns.heatmap(catsp_df.T, linewidths=.5)
plt.show()
Looking at the probabilities for the user type, either subscriber to to the service or customers who pay at the station per trip, and reported or unreported gender.
-- There was no data dictionary, but nearly all gender(0) were customers or subscribers who did not disclose their birth year either. For the purposes of this project, gender(1) is considered male.
In [18]:
gndsr_csv = pd.read_csv('./database/hubway-g-u-percent.csv')
gndsr_df = pd.DataFrame(gndsr_csv)
#gndsr_df = gndsr_df.drop('Unnamed: 11', 1)
gndsr_df.head()
Out[18]:
End Category Customer End Subscriber End Gender(0) End Gender(1) End Gender(2) End
0 end_side_streets 0.127103 0.126786 0.105590 0.129084 0.143775
1 end_mixed_squares 0.159643 0.188436 0.155309 0.183183 0.201574
2 end_recreation 0.198834 0.139543 0.233586 0.136850 0.123022
3 end_business_institution 0.358619 0.402499 0.332738 0.414207 0.387037
4 end_major_shopping 0.155801 0.142735 0.172777 0.136676 0.144592
In [19]:
gndsr_df = gndsr_df.set_index('End Category')
ax = sns.heatmap(gndsr_df[['Customer End', 
                          'Subscriber End']], linewidths=.5)
plt.show()
In [20]:
ax = gndsr_df[['Customer End', 
              'Subscriber End']].T.plot(kind='bar', title ="User by Category",figsize=(15,10), legend=True, fontsize=15)
ax.set_xlabel("User Type",fontsize=12)
ax.set_ylabel("Category",fontsize=12)
ax.legend(bbox_to_anchor=(1.1, 1.05))
plt.show()
The business and institution category as well as the mixed squares category are strong with subscribers, while recreation and shopping areas are more popular with customers.
In [21]:
ax = sns.heatmap(gndsr_df[['Gender(0) End', 
               'Gender(1) End', 'Gender(2) End']], linewidths=.5)
plt.show()
In [22]:
ax = gndsr_df[['Gender(0) End', 
               'Gender(1) End', 'Gender(2) End']].plot(kind='bar', 
                title ="User by Gender",figsize=(15,10), legend=True, fontsize=15)
ax.set_xlabel("Category",fontsize=12)
ax.set_ylabel("Gender Category",fontsize=12)
plt.show()
Again the business and institution category is the strongest among [assumed] men, then women, then customers or nondisclosed. Note recreation stations have no captured gender data, which may reflect customers as a user type. Mixed squares are most popular with presumed women, and nondisclosed gender leds in shopping areas.

Looking at the days of the week.

In [23]:
days_df = hubway_df[['end station category','day_of_week']]
days = days_df.groupby('end station category')['day_of_week'].value_counts()
dayscats_df = pd.DataFrame(days)
In [24]:
dayscats_df.columns = dayscats_df.columns.get_level_values(0)
#dayscats_df.set_index('day_of_week')
#dayscats_df.drop('day_of_week')
dayscats_df
Out[24]:
day_of_week
end station category day_of_week
1 Thurs 21749
Weds 21489
Tues 20429
Fri 20047
Mon 18927
Sat 12181
Sun 10980
2 Thurs 29974
Weds 29862
Tues 28609
Fri 28060
Mon 26079
Sat 18639
Sun 16691
3 Thurs 23244
Weds 22436
Fri 20839
Tues 20608
Mon 18761
Sat 10992
Sun 9265
4 Thurs 71884
Weds 69413
Fri 65333
Tues 65236
Mon 58914
Sat 29043
Sun 25474
5 Thurs 22714
Weds 21762
Fri 21232
Tues 20981
Mon 19039
Sat 13526
Sun 12061
Thursday clearly has the highest volume, particularly in the business and institution category, in fact there are about 40% less trips to that category on the weekend but the trips count is still higher to this category then all other categories. The order of descending trip counts per day follows the same pattern for each category except on Mondays and Tuesdays to side streets and squares.
In [26]:
ax = dayscats_df.plot(kind='bar', title ="Value Counts per Catagory per Day",figsize=(15,10), legend=False,
                    color=['black', 'red', 'green', 'blue', 'yellow', 'orange', 'violet'], fontsize=12)

plt.show()
End Station Catagories: 1=end_side_streets, 2=end_mixed_squares, 3=end_recreation, 4=end_business_institution, 5=end_major_shopping
Again; stations in catagory 4, businesses and institutions, clearly dominates as the end point for nearly every day. The weeekday numbers for commercial and residential squares barely surpass catagory 4 on the weekend.
Back to Executive Summary

No comments:

Post a Comment