Wednesday, December 12, 2018

Hubway Capstone Project-- User Type & Gender-- Category Breakdown

Hubway is a bike-share program collectively owned by the metro Boston cities of Boston, Cambridge, Somerville, and Brookline. It is operated by Motivate, which manages similar initiatives in NYC, Portland, Chicago, Washington DC, and several other metro areas in Ohio, Tennessee, and New Jersey. Motivate is opening operations in San Francisco in June 2017. Hubway currently operates as a system of 188 stations with 1,800 bikes.
  • For this project, I investigated shared data for the months of January, May, June, July, and October during the years of 2015 and 2016.
  • The questions of concern were:
    • How do riders use the bike-share service?
    • Are the bikes used as a conveyance or for recreation?
    • What type of customer uses the service?
Below is a continuation of the empirical data analysis, looking into user type and gender. The category matrix of the probabilities of where a trip begins and ends is also presented.
Import libraries
In [2]:
import numpy as np
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
#from sklearn import datasets
#from sklearn.linear_model import LogisticRegression
#from sklearn.multiclass import OneVsRestClassifier
#import statsmodels.api as sm
Here is the category matrix of probabilities of where trips begin and end. It gives a first look at the baseline values before any models are run. Each row is a start category and each column an end category, and every row sums to approximately 1.
In [3]:
# read_csv already returns a DataFrame, so no extra conversion is needed
catsp_df = pd.read_csv('./database/hubway-cats-prob.csv')
catsp_df.head()
Out[3]:
   Unnamed: 0  Categories Percent          end_side_streets  end_mixed_squares  end_recreation  end_business_institution  end_major_shopping
0           0  start_side_streets                  0.114010           0.197736        0.135013                  0.408610            0.144632
1           1  start_mixed_squares                 0.138742           0.236436        0.100348                  0.376485            0.147988
2           2  start_recreation                    0.116273           0.121135        0.247220                  0.373013            0.142358
3           3  start_business_institution          0.128534           0.173401        0.151474                  0.402990            0.143601
4           4  start_major_shopping                0.130734           0.179296        0.153727                  0.379662            0.156582
In [4]:
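# Drop the leftover CSV index column (a positional axis argument is no longer accepted by recent pandas)
catsp_df = catsp_df.drop(columns='Unnamed: 0')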
catsp_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
Categories Percent          5 non-null object
end_side_streets            5 non-null float64
end_mixed_squares           5 non-null float64
end_recreation              5 non-null float64
end_business_institution    5 non-null float64
end_major_shopping          5 non-null float64
dtypes: float64(5), object(1)
memory usage: 312.0+ bytes
In [6]:
catsp_df = catsp_df.set_index('Categories Percent')
catsp_df.head()
Out[6]:
                            end_side_streets  end_mixed_squares  end_recreation  end_business_institution  end_major_shopping
Categories Percent
start_side_streets                  0.114010           0.197736        0.135013                  0.408610            0.144632
start_mixed_squares                 0.138742           0.236436        0.100348                  0.376485            0.147988
start_recreation                    0.116273           0.121135        0.247220                  0.373013            0.142358
start_business_institution          0.128534           0.173401        0.151474                  0.402990            0.143601
start_major_shopping                0.130734           0.179296        0.153727                  0.379662            0.156582
Here the heatmap shows how strongly end_business_institution dominates as the probable end point: it is the most likely destination from every start category, at roughly 37% to 41%.
In [7]:
ax = sns.heatmap(catsp_df, linewidths=.5)
plt.show()
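The pattern in the heatmap can also be read off numerically. The sketch below rebuilds the matrix from the values printed in the Out[6] table (rather than reading the CSV, so it runs on its own), checks that each row sums to about 1, and picks the most likely end category for each start category. The variable names are my own and are not part of the original notebook.

```python
import pandas as pd

# Values copied by hand from the Out[6] table; rows are start categories,
# columns are end categories.
catsp = pd.DataFrame(
    [[0.114010, 0.197736, 0.135013, 0.408610, 0.144632],
     [0.138742, 0.236436, 0.100348, 0.376485, 0.147988],
     [0.116273, 0.121135, 0.247220, 0.373013, 0.142358],
     [0.128534, 0.173401, 0.151474, 0.402990, 0.143601],
     [0.130734, 0.179296, 0.153727, 0.379662, 0.156582]],
    index=["start_side_streets", "start_mixed_squares", "start_recreation",
           "start_business_institution", "start_major_shopping"],
    columns=["end_side_streets", "end_mixed_squares", "end_recreation",
             "end_business_institution", "end_major_shopping"],
)

# Each row should sum to ~1 if it is a distribution over end categories.
row_sums = catsp.sum(axis=1)

# Column label holding the largest probability in each row.
most_likely_end = catsp.idxmax(axis=1)

print(row_sums.round(3))
print(most_likely_end)
```

Reading the rows as conditional distributions (end category given start category) is what makes the values comparable across start categories.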
In [8]:
catsp_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, start_side_streets to start_major_shopping
Data columns (total 5 columns):
end_side_streets            5 non-null float64
end_mixed_squares           5 non-null float64
end_recreation              5 non-null float64
end_business_institution    5 non-null float64
end_major_shopping          5 non-null float64
dtypes: float64(5)
memory usage: 240.0+ bytes
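The CSV above holds an already aggregated matrix, and the aggregation step is not shown in this post. As a rough sketch only, one way a matrix of this shape could be derived from trip-level records is `pd.crosstab` with row normalization. The `trips` frame, its column names, and its six toy rows are made up for illustration and are not the Hubway data or necessarily the method used here.

```python
import pandas as pd

# Hypothetical trip-level records: one row per trip, with the category of
# the start station and of the end station. Toy values for illustration only.
trips = pd.DataFrame({
    "start_category": ["recreation", "recreation", "side_streets",
                       "side_streets", "side_streets", "major_shopping"],
    "end_category": ["recreation", "business_institution",
                     "business_institution", "mixed_squares",
                     "business_institution", "major_shopping"],
})

# normalize="index" divides each row by its row total, so every row becomes
# the distribution of end categories for one start category.
matrix = pd.crosstab(trips["start_category"], trips["end_category"],
                     normalize="index")

print(matrix.round(3))
```

With `normalize="index"` the rows sum to 1, which matches the layout of the matrix loaded from the CSV.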
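Next are the probabilities for user type and gender for each ending category. Each column sums to approximately 1, so a column is the distribution of end categories within that user type or gender group.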
In [9]:
# read_csv already returns a DataFrame, so no extra conversion is needed
gndsr_df = pd.read_csv('./database/hubway-g-u-percent.csv')
gndsr = gndsr_df.set_index('End Category')
gndsr.head()
Out[9]:
                          Customer End  Subscriber End  Gender(0) End  Gender(1) End  Gender(2) End
End Category
end_side_streets              0.127103        0.126786       0.105590       0.129084       0.143775
end_mixed_squares             0.159643        0.188436       0.155309       0.183183       0.201574
end_recreation                0.198834        0.139543       0.233586       0.136850       0.123022
end_business_institution      0.358619        0.402499       0.332738       0.414207       0.387037
end_major_shopping            0.155801        0.142735       0.172777       0.136676       0.144592
In [12]:
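# Keep only the user-type columns (a positional axis argument is no longer accepted by recent pandas)
cust_sub = gndsr_df.drop(columns=['Gender(0) End', 'Gender(1) End', 'Gender(2) End'])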

cust_sub.index = ['end_side_streets', 'end_mixed_squares', 'end_recreation', 
                  'end_business_institution', 'end_major_shopping']
cust_sub.info()
<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, end_side_streets to end_major_shopping
Data columns (total 3 columns):
End Category      5 non-null object
Customer End      5 non-null float64
Subscriber End    5 non-null float64
dtypes: float64(2), object(1)
memory usage: 160.0+ bytes
The service has two types of users: subscribers to the service, and customers who pay at the docking stations.
In this heatmap the strongest value is for subscribers ending at businesses and institutions (about 0.40).
In [14]:
ax = sns.heatmap(cust_sub[['Customer End', 'Subscriber End']] , linewidths=.5)
plt.show()
In [19]:
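# Label the rows by end category; 'End Category' stays as a column,
# as the info() output below shows.
gndsr_df.index = ['end_side_streets', 'end_mixed_squares', 'end_recreation',
                  'end_business_institution', 'end_major_shopping']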
gndsr_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, end_side_streets to end_major_shopping
Data columns (total 6 columns):
End Category      5 non-null object
Customer End      5 non-null float64
Subscriber End    5 non-null float64
Gender(0) End     5 non-null float64
Gender(1) End     5 non-null float64
Gender(2) End     5 non-null float64
dtypes: float64(5), object(1)
memory usage: 280.0+ bytes
In [20]:
ax = gndsr_df[['Customer End', 
              'Subscriber End']].plot(kind='bar', title ="User by Category",figsize=(15,10), legend=True, fontsize=12)
ax.set_xlabel("Category",fontsize=12)
ax.set_ylabel("Share of trips",fontsize=12)
plt.show()
For both user types, the largest share of trips ends at a business or institution. Note, however, that recreation is much higher for customers (about 20% versus 14% for subscribers), major shopping centers are slightly higher for customers, and residential side streets are even. The stronger showing of subscribers at mixed squares (about 19% versus 16%) is to be expected, since subscribers are likely to live or shop in the area's many neighborhood squares.
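The comparison between customers and subscribers can be made explicit by subtracting one column from the other. This is a small sketch using the values printed in the Out[9] table; the names `users` and `gap` are my own.

```python
import pandas as pd

# Values copied by hand from the Out[9] table.
users = pd.DataFrame(
    {"Customer End": [0.127103, 0.159643, 0.198834, 0.358619, 0.155801],
     "Subscriber End": [0.126786, 0.188436, 0.139543, 0.402499, 0.142735]},
    index=["end_side_streets", "end_mixed_squares", "end_recreation",
           "end_business_institution", "end_major_shopping"],
)

# Each column should sum to ~1: it is the distribution of end categories
# within one user type.
col_sums = users.sum()

# Positive gap: customers end there relatively more often than subscribers.
gap = (users["Customer End"] - users["Subscriber End"]).sort_values(ascending=False)

print(col_sums.round(3))
print(gap.round(3))
```

Recreation comes out at the top of the sorted gap and business or institution at the bottom, with side streets close to zero.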
The Hubway data did not come with a data dictionary, and it lists three gender codes (0, 1, 2). In this section it is assumed that (0) covers customers, whose gender would be unknown at the time of service, and subscribers who did not declare their gender. (1) is assumed to be male and (2) female.
In [25]:
ax = sns.heatmap(gndsr_df[['Gender(0) End', 
               'Gender(1) End', 'Gender(2) End']], linewidths=.5)
plt.show()
In [28]:
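# Keep only the gender columns (a positional axis argument is no longer accepted by recent pandas)
gndsr_df = gndsr_df.drop(columns=['Customer End', 'Subscriber End'])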
#[['Gender(0) Start', 'Gender(0) End', 'Gender(1) Start', 
               #'Gender(1) End', 'Gender(2) Start', 'Gender(2) End']]
ax = gndsr_df.plot(kind='bar', title ="User by Gender",figsize=(15,10), legend=True, fontsize=12)
ax.set_xlabel("Category",fontsize=12)
ax.set_ylabel("Share of trips",fontsize=12)
plt.show()
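This is interesting: males send the largest share of their trips to businesses and institutions (about 41%, versus 39% for females), while females edge out males at side streets, mixed squares, and major shopping. Gender (0) has by far the highest share of trips ending at recreation (about 23%, versus 14% for males and 12% for females), which coincides with the characteristics of a customer and lends supporting evidence that (0) is gender unknown.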
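One rough way to check the idea that (0) behaves like a customer is to measure how far the Gender(0) column sits from each user-type column. The sketch below uses the mean absolute difference across the five end categories, with values copied from the Out[9] table. With only five categories this is a consistency check under my own choice of distance measure, not a proof.

```python
import pandas as pd

# Values copied by hand from the Out[9] table.
shares = pd.DataFrame(
    {"Customer End": [0.127103, 0.159643, 0.198834, 0.358619, 0.155801],
     "Subscriber End": [0.126786, 0.188436, 0.139543, 0.402499, 0.142735],
     "Gender(0) End": [0.105590, 0.155309, 0.233586, 0.332738, 0.172777]},
    index=["end_side_streets", "end_mixed_squares", "end_recreation",
           "end_business_institution", "end_major_shopping"],
)

# Mean absolute difference between Gender(0) and each user-type column;
# a smaller value means the two distributions are more alike.
mad = (shares[["Customer End", "Subscriber End"]]
       .sub(shares["Gender(0) End"], axis=0)
       .abs()
       .mean())

print(mad.round(4))
```

The Gender(0) column is closer to the customer column than to the subscriber column, which is in line with the assumption above.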