Wednesday, December 12, 2018

Hubway Capstone Project: Random Forest

Hubway is a bike-share program collectively owned by the metro Boston municipalities of Boston, Cambridge, Somerville, and Brookline. It is operated by Motivate, which manages similar initiatives in New York City, Portland, Chicago, Washington, DC, and several other metro areas in Ohio, Tennessee, and New Jersey, and which launched operations in San Francisco in June 2017. Hubway currently operates a system of 188 stations and 1,800 bikes.
  • For this project, I investigated shared data for the months of January, May, June, July, and October during the years of 2015 and 2016.
  • The questions of concern were:
    • How do riders use the bike-share service?
    • Are the bikes used as a conveyance or for recreation?
    • What type of customer uses the service?
Below, a Random Forest model is fit in an attempt to better predict the user's profile.
Import libraries
In [17]:
import numpy as np
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
Load the data
In [18]:
# read_csv already returns a DataFrame, so no extra conversion is needed
hubway_df = pd.read_csv('./database/data/hubway.csv')
hubway_df.head(5)
Out[18]:
end station category usertype_Customer usertype_Subscriber day_of_week_Fri day_of_week_Mon day_of_week_Sat day_of_week_Sun day_of_week_Thurs day_of_week_Tues day_of_week_Weds ... end station id_217 end station id_218 age decile_10 age decile_20 age decile_30 age decile_40 age decile_50 age decile_60 age decile_70 age decile_80
0 4 0 1 0 0 0 0 1 0 0 ... 0 0 0 0 1 0 0 0 0 0
1 4 0 1 0 0 0 0 1 0 0 ... 0 0 0 1 0 0 0 0 0 0
2 4 0 1 0 0 0 1 0 0 0 ... 0 0 0 1 0 0 0 0 0 0
3 4 1 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 1 0 0 0
4 4 0 1 0 0 1 0 0 0 0 ... 0 0 0 0 1 0 0 0 0 0
5 rows × 403 columns
In [20]:
hubway_df.shape
Out[20]:
(946473, 403)
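Before modeling, it helps to look at how the target classes are distributed, since a lopsided target can make a high accuracy score misleading. A minimal sketch, assuming 'end station category' (the first column in the head() output above) is the label being predicted:
In [ ]:
# Share of rows in each end-station category, as fractions that sum to 1
hubway_df['end station category'].value_counts(normalize=True)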
Get dummies (not needed here; the file already contains dummy-encoded columns)
In [23]:
# The CSV already contains the dummy columns, so get_dummies is not needed here:
# hubdum = pd.get_dummies(hubway_df, columns=['day_of_week', 'start station category',
#                                             'start station id', 'male',
#                                             'end station id', 'age decile'])
hubdum = hubway_df
In [24]:
hubdum.head(4)
Out[24]:
end station category usertype_Customer usertype_Subscriber day_of_week_Fri day_of_week_Mon day_of_week_Sat day_of_week_Sun day_of_week_Thurs day_of_week_Tues day_of_week_Weds ... end station id_217 end station id_218 age decile_10 age decile_20 age decile_30 age decile_40 age decile_50 age decile_60 age decile_70 age decile_80
0 4 0 1 0 0 0 0 1 0 0 ... 0 0 0 0 1 0 0 0 0 0
1 4 0 1 0 0 0 0 1 0 0 ... 0 0 0 1 0 0 0 0 0 0
2 4 0 1 0 0 0 1 0 0 0 ... 0 0 0 1 0 0 0 0 0 0
3 4 1 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 1 0 0 0
4 rows × 403 columns
Check column names
In [25]:
hubdum.columns
Out[25]:
Index([u'end station category', u'usertype_Customer', u'usertype_Subscriber',
       u'day_of_week_Fri', u'day_of_week_Mon', u'day_of_week_Sat',
       u'day_of_week_Sun', u'day_of_week_Thurs', u'day_of_week_Tues',
       u'day_of_week_Weds',
       ...
       u'end station id_217', u'end station id_218', u'age decile_10',
       u'age decile_20', u'age decile_30', u'age decile_40', u'age decile_50',
       u'age decile_60', u'age decile_70', u'age decile_80'],
      dtype='object', length=403)
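With 403 columns it is hard to see the structure at a glance, so grouping the names by their prefix shows how many dummy columns each original feature expanded into. A small sketch, assuming the prefix_value naming convention visible above:
In [ ]:
from collections import Counter

# Count dummy columns per original feature, using the text before the last '_'
Counter(col.rsplit('_', 1)[0] for col in hubdum.columns)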
Subset the features and the target
In [27]:
# The first column is the target (end station category); the rest are features
y = hubdum.iloc[:, 0]
X = hubdum.iloc[:, 1:]
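Before fitting, it is also worth setting aside a held-out test set so the model can later be scored on rows it never saw during training. The notebook below fits on the full data, so this split is only a sketch (the split size and random seed here are arbitrary choices); it is picked up again after the overfitting discussion at the end.
In [ ]:
from sklearn.model_selection import train_test_split

# Reserve 25% of the rows for an out-of-sample score later on
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)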
In [36]:
from sklearn.ensemble import RandomForestClassifier

# Fit a random forest with default hyperparameters on the full data set
model = RandomForestClassifier()
model.fit(X, y)
In [37]:
print 'Random Forest model score:', model.score(X, y)
Random Forest model score: 0.997910135841
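A fitted random forest also exposes per-feature importances, which hint at which dummies are driving the prediction. A short sketch, assuming the model and X defined above:
In [ ]:
# Rank the ten most influential features by the forest's impurity-based importance
importances = pd.Series(model.feature_importances_, index=X.columns)
importances.sort_values(ascending=False).head(10)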
Again, as with the logistic regression, the concern is that overfitting occurred; a score computed on the same rows the model was trained on will always look optimistic. One-hot encoding every feature (403 columns in all) may also have contributed.
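One way to test that suspicion is to refit the forest on the training portion of the split sketched earlier and score it on the held-out rows; a large gap between the two scores would confirm overfitting. A minimal sketch, assuming X_train, X_test, y_train, and y_test from the earlier train_test_split:
In [ ]:
# Fit on the training split only, then score on rows the forest never saw
holdout_model = RandomForestClassifier()
holdout_model.fit(X_train, y_train)
print 'Train score:', holdout_model.score(X_train, y_train)
print 'Test score: ', holdout_model.score(X_test, y_test)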
Back to Executive Summary
