Hubway Capstone Project: Insights¶
Hubway is a bike-share program collectively owned by the metro Boston cities of Boston, Cambridge, Somerville, and Brookline. It is operated by Motivate, which manages similar initiatives in NYC, Portland, Chicago, Washington, DC, and several other metro areas in Ohio, Tennessee, and New Jersey, and is opening operations in San Francisco in June 2017. Hubway currently runs a system of 188 stations with 1,800 bikes. For this project, I investigated shared data for the months of January, May, June, July, and October during the years 2015 and 2016.
Of concern were the following questions:
- How do riders use the bike-share service?
- Are the bikes used as a conveyance or for recreation?
- What type of customer uses the service?
Import Libraries
In [1]:
import numpy as np
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets
from datetime import datetime
Below is a dataframe of the Hubway data. Features added for the project were an age decile based on reported birth year, a column that encodes reported gender, station categories, and the day of week on which the trip occurred.
More on the categories:
- category 1 represents stations located on side streets and in smaller public parks.
- category 2 represents squares with dense retail and residential populations, such as Inman Square in Cambridge, and Union Square in Allston and Somerville.
- category 3 represents recreational or tourist areas, such as the Esplanade or Faneuil Hall.
- category 4 is a mix of business areas, educational or civic institutions, and transportation hubs such as North and South Stations. It is meant to represent the user who is a commuter.
- category 5 represents shopping areas with large anchor stores, such as the Whole Foods at the Ink Block, Trader Joe's on Memorial Dr, and Shaw's at Porter Sq.
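As a sketch of how these labels could be attached to the trip data, a lookup from station name to category can be applied with `Series.map`. The `station_categories` dict below is illustrative only, not the actual lookup used for the project:

```python
import pandas as pd

# Hypothetical station-to-category lookup (illustrative, not the project's real table)
station_categories = {
    'Inman Square at Vellucci Plaza / Hampshire St': 2,
    'Charles Circle - Charles St at Cambridge St': 3,
    'South Station - 700 Atlantic Ave': 4,
}

trips = pd.DataFrame({'end station name': list(station_categories)})
# Map each station name to its category label
trips['end station category'] = trips['end station name'].map(station_categories)
print(trips)
```

Stations absent from the lookup would come through as `NaN`, which makes unclassified stations easy to spot.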
In [2]:
hubway_df = pd.read_csv('./hubway.csv')
hubway_df.head(1)
Out[2]:
Create timestamps for start and stop times.
In [3]:
hubway_df['starttime'] = pd.to_datetime(hubway_df.starttime)
#hubway_df.dtypes
hubway_df['stoptime'] = pd.to_datetime(hubway_df.stoptime)
hubway_df.dtypes
Out[3]:
Create a series of integers to represent the day of week, set the range of times from 6am to 9pm, and create a new column containing just the starting time without the date.
In [4]:
hubway_df['day_of_week_int'] = hubway_df['stoptime'].dt.dayofweek
In [5]:
import datetime
rush_start = datetime.time(6)
rush_end = datetime.time(21)
In [6]:
hubway_df['time'] = [d.time() for d in hubway_df['starttime']]
In [7]:
hubway_df.head(1)
Out[7]:
Create a database to develop the data dictionary and sort through times.
In [8]:
import sqlite3
from pandas.io import sql
from sqlalchemy import create_engine
In [9]:
#using SQLAlchemy to create a db engine to contain the database.
#engine = create_engine('sqlite:///C:\Users\Owner\Documents\Capstone\database\hubway.db', echo=True)
In [10]:
#save dataframe to a SQL database.
#hubway_df.to_sql(name = 'hubway', con = engine, if_exists = 'replace', index = False)
In [11]:
hubway = './database/hubway.db'
conn = sqlite3.connect(hubway)
c = conn.cursor()
Data dictionary¶
In [12]:
sql_query = """
PRAGMA table_info(hubway)
"""
pd.read_sql(sql_query, con=conn)
Out[12]:
Restricting trips to the time range from 6am to 9pm and checking the descriptive statistics of the target variable "end station category."
- Note that the business and institution category [4] occupies the second and third quartiles, while residential and retail squares [2] occupy the first quartile. The mean of 3.25 reflects the influence of these quartiles.
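The same 6am-to-9pm window can also be taken directly in pandas on the timestamp column, without going through SQL; a minimal sketch using `DataFrame.between_time` on made-up timestamps:

```python
import pandas as pd

# Toy trips with invented start times
df = pd.DataFrame({'starttime': pd.to_datetime([
    '2016-06-02 05:30', '2016-06-02 08:15',
    '2016-06-02 14:00', '2016-06-02 22:45'])})

# between_time requires a DatetimeIndex; keep trips starting 06:00-21:00
daytime = df.set_index('starttime').between_time('06:00', '21:00')
print(len(daytime))  # 2 of the 4 toy trips fall in the window
```

This avoids relying on string comparison of time values, which is what the SQL `BETWEEN` on a text column does.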
In [13]:
hubday = pd.read_sql('''SELECT * FROM hubway WHERE "time" BETWEEN "06%" AND "21%"''', con = conn)
hubday['end station category'].describe()
Out[13]:
Now let's look at the percentages of trips between the start station categories and the end station ones.
--Note that starting station category is the index and ending stations are the columns.
In [14]:
catsp_csv = pd.read_csv('./database/hubway-cats-prob.csv')
catsp_df = pd.DataFrame(catsp_csv)
catsp_df = catsp_df.drop('Unnamed: 0', axis=1)
catsp_df.head()
Out[14]:
In [15]:
catsp_df = catsp_df.set_index('Categories Percent')
catsp_df.head()
Out[15]:
The business and institution category [4] seems to be the most active. Establish a baseline probability that the end station will be a category 4.
In [16]:
print 'Project baseline:', np.mean(catsp_df.end_business_institution)
This is the target variable: the ending station category, and the business and institution category in particular.
In [17]:
ax = sns.heatmap(catsp_df.T, linewidths=.5)
plt.show()
Looking at the probabilities for the user type, either subscribers to the service or customers who pay at the station per trip, and for reported or unreported gender.
-- There was no data dictionary, but nearly all gender(0) riders were customers, or subscribers who did not disclose their birth year either. For the purposes of this project, gender(1) is considered male.
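Percentage tables like the one loaded below can be derived with `pd.crosstab` using `normalize`; a sketch on a handful of toy rows (the user types and category values here are invented, not Hubway figures):

```python
import pandas as pd

# Toy trips; 'usertype' and 'end station category' values are invented
trips = pd.DataFrame({
    'usertype': ['Subscriber', 'Subscriber', 'Customer', 'Subscriber'],
    'end station category': [4, 4, 3, 2]})

# Share of each user type's trips ending in each category
pct = pd.crosstab(trips['end station category'], trips['usertype'],
                  normalize='columns')
print(pct)
```

With `normalize='columns'`, each user-type column sums to 1, so the cells read directly as within-group shares.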
In [18]:
gndsr_csv = pd.read_csv('./database/hubway-g-u-percent.csv')
gndsr_df = pd.DataFrame(gndsr_csv)
#gndsr_df = gndsr_df.drop('Unnamed: 11', 1)
gndsr_df.head()
Out[18]:
In [19]:
gndsr_df = gndsr_df.set_index('End Category')
ax = sns.heatmap(gndsr_df[['Customer End',
'Subscriber End']], linewidths=.5)
plt.show()
In [20]:
ax = gndsr_df[['Customer End',
'Subscriber End']].T.plot(kind='bar', title ="User by Category",figsize=(15,10), legend=True, fontsize=15)
ax.set_xlabel("User Type",fontsize=12)
ax.set_ylabel("Category",fontsize=12)
ax.legend(bbox_to_anchor=(1.1, 1.05))
plt.show()
The business and institution category as well as the mixed squares category are strong with subscribers, while recreation and shopping areas are more popular with customers.
In [21]:
ax = sns.heatmap(gndsr_df[['Gender(0) End',
'Gender(1) End', 'Gender(2) End']], linewidths=.5)
plt.show()
In [22]:
ax = gndsr_df[['Gender(0) End',
'Gender(1) End', 'Gender(2) End']].plot(kind='bar',
title ="User by Gender",figsize=(15,10), legend=True, fontsize=15)
ax.set_xlabel("Category",fontsize=12)
ax.set_ylabel("Gender Category",fontsize=12)
plt.show()
Again the business and institution category is the strongest among [assumed] men, then women, then customers or nondisclosed. Note that recreation stations have no captured gender data, which may reflect customers as a user type. Mixed squares are most popular with presumed women, and nondisclosed gender leads in shopping areas.
Looking at the days of the week.¶
In [23]:
days_df = hubway_df[['end station category','day_of_week']]
days = days_df.groupby('end station category')['day_of_week'].value_counts()
dayscats_df = pd.DataFrame(days)
In [24]:
dayscats_df.columns = dayscats_df.columns.get_level_values(0)
#dayscats_df.set_index('day_of_week')
#dayscats_df.drop('day_of_week')
dayscats_df
Out[24]:
Thursday clearly has the highest volume, particularly in the business and institution category. In fact, there are about 40% fewer trips to that category on the weekend, but the trip count is still higher to this category than to all other categories. The order of descending trip counts per day follows the same pattern for each category, except on Mondays and Tuesdays for side streets and squares.
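The weekday-versus-weekend drop can be checked with a quick mean comparison; a sketch on toy daily counts (the numbers below are illustrative, not the actual Hubway figures):

```python
import pandas as pd

# Toy daily trip counts to category 4 (illustrative values only)
counts = pd.Series(
    [900, 950, 940, 1000, 920, 600, 580],
    index=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])

weekday_mean = counts[['Mon', 'Tue', 'Wed', 'Thu', 'Fri']].mean()
weekend_mean = counts[['Sat', 'Sun']].mean()

# Fractional drop from the weekday average to the weekend average
print(round(1 - weekend_mean / weekday_mean, 2))  # 0.37 with these toy numbers
```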
In [26]:
ax = dayscats_df.plot(kind='bar', title ="Value Counts per Category per Day",figsize=(15,10), legend=False,
color=['black', 'red', 'green', 'blue', 'yellow', 'orange', 'violet'], fontsize=12)
plt.show()
End Station Categories: 1=end_side_streets, 2=end_mixed_squares, 3=end_recreation, 4=end_business_institution, 5=end_major_shopping
Again, stations in category 4 (businesses and institutions) clearly dominate as the end point on nearly every day. The weekday numbers for commercial and residential squares barely surpass category 4's weekend numbers.