Hubway Capstone Project: Insights¶

Hubway is a bike-share program collectively owned by the metro Boston cities; Boston, Cambridge, Somerville, and Brookline. It is operated by Motivate, who manages similar initiatives in NYC, Portland, Chicago, Washington DC, and several other metro areas in Ohio, Tennessee, and New Jersey. They are opening up operations in San Francisco during the month of June, 2017. Hubway currently exists as a system of 188 stations with 1,800 bikes.

For this project, I investigated shared data for the months of January, May, June, July, and October during the years of 2015 and 2016.
Of concern were the questions of;
- How do riders use the bike-share service?
- Are the bikes used as a conveyance or for recreation?
- What type of customer uses the service?

Below are some initial empirical data analysis and insights.

Import Libraries

import numpy as np
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets
from datetime import datetime

Below is a dataframe of the Hubway data. Features added to fit the project were an age decile based on reported birth year, a column that shows reported gender in a binary fashion, station categories, and the day of week the trip occured.
More on the categories:

category 1 represents stations located on side-sreets and smaller public parks.
category 2 represents squares mixed with dense retail and residential population, such as Inman Square in Cambridge, and Union Square in Allston and Somerville.
category 3 represents recreational or tourist areas, such as the Esplanade or Faneuil Hall.
category 4 is a mix of business areas, educational or civic institutions, and transportation hubs such as North and South Stations. It is meant to represent the user who is a commuter.
category 5 represents shopping areas with large anchor stores, such as the Whole Foods at the Ink Block, Trader Joe's on Memorial Dr, and Shaw's at Porter Sq.

hubway_df = pd.read_csv('./hubway.csv')
hubway_df.head(1)

Create timestamps for start and stop times.

hubway_df['starttime'] = pd.to_datetime(hubway_df.starttime)
#hubway_df.dtypes
hubway_df['stoptime'] = pd.to_datetime(hubway_df.stoptime)
hubway_df.dtypes

tripduration                        int64
starttime                  datetime64[ns]
stoptime                   datetime64[ns]
start station id                    int64
start station name                 object
start station latitude            float64
start station longitude           float64
end station id                      int64
end station name                   object
end station latitude              float64
end station longitude             float64
bikeid                              int64
age decile                          int64
male                                int64
end station category                int64
start station category              int64
day_of_week                        object
usertype_Customer                   int64
usertype_Subscriber                 int64
dtype: object

Create series of integers to represent the day of week, and set the range of times from 6am to 9pm. Create a new column of just the starting time without date.

hubway_df['day_of_week_int'] = hubway_df['stoptime'].dt.dayofweek

import datetime
rush_start = datetime.time(6)
rush_end = datetime.time(21)

hubway_df['time'] = [d.time() for d in hubway_df['starttime']]

hubway_df.head(1)

Create a database to develop data dictionary and sort through times.

import sqlite3
from pandas.io import sql
from sqlalchemy import create_engine

#using SQLAlchemy to create a db engine to contain datbase.
#engine = create_engine('sqlite:///C:\Users\Owner\Documents\Capstone\database\hubway.db', echo=True)

#save dataframe to a SQL database.
#hubway_df.to_sql(name = 'hubway', con = engine, if_exists = 'replace', index = False)

hubway = './database/hubway.db'
conn = sqlite3.connect(hubway) 
c = conn.cursor()

Data dictionary¶

sql_query = """
PRAGMA table_info(hubway)
"""

pd.read_sql(sql_query, con=conn)

Sorting to the range in time from 6am to 9pm and checking the descriptive statistics of the target variable "end station category."

note the business and institution category [4] occupies the second and third quartiles, while residential and retail squares [2] occupy the first quartile. The mean of 3.25 reflects this influence of the quartiles.

hubday = pd.read_sql('''SELECT * FROM hubway WHERE "time" BETWEEN "06%" AND "21%"''', con = conn)
hubday['end station category'].describe()

count    867199.000000
mean          3.251253
std           1.267759
min           1.000000
25%           2.000000
50%           4.000000
75%           4.000000
max           5.000000
Name: end station category, dtype: float64

Now let's look at the percentages of trips between the start station categories and the end station ones.
--Note that starting station category is the index and ending stations are the columns.

catsp_csv = pd.read_csv('./database/hubway-cats-prob.csv')
catsp_df = pd.DataFrame(catsp_csv)
catsp_df = catsp_df.drop('Unnamed: 0', 1)
catsp_df.head()

catsp_df = catsp_df.set_index('Categories Percent')

catsp_df.head()

The business and institution category [4] seems to the most active. Establishing a baseline probability that the end station will be a category 4.

print 'Project baseline:', np.mean(catsp_df.end_business_institution)

Project baseline: 0.3881519011

This is the target variable: the ending station category, and the business and instituion category in particular.

ax = sns.heatmap(catsp_df.T, linewidths=.5)
plt.show()

Looking at the probabilities for the user type, either subscriber to to the service or customers who pay at the station per trip, and reported or unreported gender.
-- There was no data dictionary, but nearly all gender(0) were customers or subscribers who did not disclose their birth year either. For the purposes of this project, gender(1) is considered male.

gndsr_csv = pd.read_csv('./database/hubway-g-u-percent.csv')
gndsr_df = pd.DataFrame(gndsr_csv)
#gndsr_df = gndsr_df.drop('Unnamed: 11', 1)
gndsr_df.head()

gndsr_df = gndsr_df.set_index('End Category')
ax = sns.heatmap(gndsr_df[['Customer End', 
                          'Subscriber End']], linewidths=.5)
plt.show()

ax = gndsr_df[['Customer End', 
              'Subscriber End']].T.plot(kind='bar', title ="User by Category",figsize=(15,10), legend=True, fontsize=15)
ax.set_xlabel("User Type",fontsize=12)
ax.set_ylabel("Category",fontsize=12)
ax.legend(bbox_to_anchor=(1.1, 1.05))
plt.show()

The business and institution category as well as the mixed squares category are strong with subscribers, while recreation and shopping areas are more popular with customers.

ax = sns.heatmap(gndsr_df[['Gender(0) End', 
               'Gender(1) End', 'Gender(2) End']], linewidths=.5)
plt.show()

ax = gndsr_df[['Gender(0) End', 
               'Gender(1) End', 'Gender(2) End']].plot(kind='bar', 
                title ="User by Gender",figsize=(15,10), legend=True, fontsize=15)
ax.set_xlabel("Category",fontsize=12)
ax.set_ylabel("Gender Category",fontsize=12)
plt.show()

Again the business and institution category is the strongest among [assumed] men, then women, then customers or nondisclosed. Note recreation stations have no captured gender data, which may reflect customers as a user type. Mixed squares are most popular with presumed women, and nondisclosed gender leds in shopping areas.

Looking at the days of the week.¶

days_df = hubway_df[['end station category','day_of_week']]
days = days_df.groupby('end station category')['day_of_week'].value_counts()
dayscats_df = pd.DataFrame(days)

dayscats_df.columns = dayscats_df.columns.get_level_values(0)
#dayscats_df.set_index('day_of_week')
#dayscats_df.drop('day_of_week')
dayscats_df

Thursday clearly has the highest volume, particularly in the business and institution category, in fact there are about 40% less trips to that category on the weekend but the trips count is still higher to this category then all other categories. The order of descending trip counts per day follows the same pattern for each category except on Mondays and Tuesdays to side streets and squares.

ax = dayscats_df.plot(kind='bar', title ="Value Counts per Catagory per Day",figsize=(15,10), legend=False,
                    color=['black', 'red', 'green', 'blue', 'yellow', 'orange', 'violet'], fontsize=12)

plt.show()

End Station Catagories: 1=end_side_streets, 2=end_mixed_squares, 3=end_recreation, 4=end_business_institution, 5=end_major_shopping

Again; stations in catagory 4, businesses and institutions, clearly dominates as the end point for nearly every day. The weeekday numbers for commercial and residential squares barely surpass catagory 4 on the weekend.

	cid	name	type	dflt_value
0	0	tripduration	BIGINT	None
1	1	starttime	DATETIME	None
2	2	stoptime	DATETIME	None
3	3	start station id	BIGINT	None
4	4	start station name	TEXT	None
5	5	start station latitude	FLOAT	None
6	6	start station longitude	FLOAT	None
7	7	end station id	BIGINT	None
8	8	end station name	TEXT	None
9	9	end station latitude	FLOAT	None
10	10	end station longitude	FLOAT	None
11	11	bikeid	BIGINT	None
12	12	usertype	TEXT	None
13	13	age decile	BIGINT	None
14	14	male	BIGINT	None
15	15	end station category	BIGINT	None
16	16	start station category	BIGINT	None
17	17	day_of_week	TEXT	None
18	18	day_of_week_int	BIGINT	None
19	19	time	TIME	None

	Categories Percent	end_side_streets	end_mixed_squares	end_recreation	end_business_institution	end_major_shopping
0	start_side_streets	0.114010	0.197736	0.135013	0.408610	0.144632
1	start_mixed_squares	0.138742	0.236436	0.100348	0.376485	0.147988
2	start_recreation	0.116273	0.121135	0.247220	0.373013	0.142358
3	start_business_institution	0.128534	0.173401	0.151474	0.402990	0.143601
4	start_major_shopping	0.130734	0.179296	0.153727	0.379662	0.156582

	end_side_streets	end_mixed_squares	end_recreation	end_business_institution	end_major_shopping
Categories Percent
start_side_streets	0.114010	0.197736	0.135013	0.408610	0.144632
start_mixed_squares	0.138742	0.236436	0.100348	0.376485	0.147988
start_recreation	0.116273	0.121135	0.247220	0.373013	0.142358
start_business_institution	0.128534	0.173401	0.151474	0.402990	0.143601
start_major_shopping	0.130734	0.179296	0.153727	0.379662	0.156582

	End Category	Customer End	Subscriber End	Gender(0) End	Gender(1) End	Gender(2) End
0	end_side_streets	0.127103	0.126786	0.105590	0.129084	0.143775
1	end_mixed_squares	0.159643	0.188436	0.155309	0.183183	0.201574
2	end_recreation	0.198834	0.139543	0.233586	0.136850	0.123022
3	end_business_institution	0.358619	0.402499	0.332738	0.414207	0.387037
4	end_major_shopping	0.155801	0.142735	0.172777	0.136676	0.144592

		day_of_week
end station category	day_of_week
1	Thurs	21749
	Weds	21489
	Tues	20429
	Fri	20047
	Mon	18927
	Sat	12181
	Sun	10980
2	Thurs	29974
	Weds	29862
	Tues	28609
	Fri	28060
	Mon	26079
	Sat	18639
	Sun	16691
3	Thurs	23244
	Weds	22436
	Fri	20839
	Tues	20608
	Mon	18761
	Sat	10992
	Sun	9265
4	Thurs	71884
	Weds	69413
	Fri	65333
	Tues	65236
	Mon	58914
	Sat	29043
	Sun	25474
5	Thurs	22714
	Weds	21762
	Fri	21232
	Tues	20981
	Mon	19039
	Sat	13526
	Sun	12061

Erik Ellis // Technical Communications || Data Analytics

Wednesday, December 12, 2018