Wednesday, December 12, 2018

Hubway Capstone Project: Random Forest

Hubway is a bike-share program collectively owned by the metro Boston municipalities of Boston, Cambridge, Somerville, and Brookline. It is operated by Motivate, which manages similar initiatives in New York City, Portland, Chicago, Washington, DC, and several other metro areas in Ohio, Tennessee, and New Jersey, and which launched operations in San Francisco in June 2017. Hubway currently operates a system of 188 stations and 1,800 bikes.
  • For this project, I investigated shared data for the months of January, May, June, July, and October during the years of 2015 and 2016.
  • The questions of concern were:
    • How do riders use the bike-share service?
    • Are the bikes used as a conveyance or for recreation?
    • What type of customer uses the service?
Below, a Random Forest model is fit in an attempt to better predict the user's profile.
Import libraries
In [17]:
import numpy as np
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
Load the data
In [18]:
# read_csv already returns a DataFrame, so no extra conversion is needed
hubway_df = pd.read_csv('./database/data/hubway.csv')
hubway_df.head(5)
Out[18]:
end station category usertype_Customer usertype_Subscriber day_of_week_Fri day_of_week_Mon day_of_week_Sat day_of_week_Sun day_of_week_Thurs day_of_week_Tues day_of_week_Weds ... end station id_217 end station id_218 age decile_10 age decile_20 age decile_30 age decile_40 age decile_50 age decile_60 age decile_70 age decile_80
0 4 0 1 0 0 0 0 1 0 0 ... 0 0 0 0 1 0 0 0 0 0
1 4 0 1 0 0 0 0 1 0 0 ... 0 0 0 1 0 0 0 0 0 0
2 4 0 1 0 0 0 1 0 0 0 ... 0 0 0 1 0 0 0 0 0 0
3 4 1 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 1 0 0 0
4 4 0 1 0 0 1 0 0 0 0 ... 0 0 0 0 1 0 0 0 0 0
5 rows × 403 columns
In [20]:
hubway_df.shape
Out[20]:
(946473, 403)
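Before modeling, it helps to look at how the target classes are distributed, since a lopsided target can make a high accuracy score misleading. A minimal sketch, assuming 'end station category' (the first column in the head() output above) is the label being predicted:
In [ ]:
# Share of rows in each end-station category, as fractions that sum to 1
hubway_df['end station category'].value_counts(normalize=True)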
Get dummies (not needed here; the file already contains dummy-encoded columns)
In [23]:
# The CSV already contains the dummy columns, so get_dummies is not needed here:
# hubdum = pd.get_dummies(hubway_df, columns=['day_of_week', 'start station category',
#                                             'start station id', 'male',
#                                             'end station id', 'age decile'])
hubdum = hubway_df
In [24]:
hubdum.head(4)
Out[24]:
end station category usertype_Customer usertype_Subscriber day_of_week_Fri day_of_week_Mon day_of_week_Sat day_of_week_Sun day_of_week_Thurs day_of_week_Tues day_of_week_Weds ... end station id_217 end station id_218 age decile_10 age decile_20 age decile_30 age decile_40 age decile_50 age decile_60 age decile_70 age decile_80
0 4 0 1 0 0 0 0 1 0 0 ... 0 0 0 0 1 0 0 0 0 0
1 4 0 1 0 0 0 0 1 0 0 ... 0 0 0 1 0 0 0 0 0 0
2 4 0 1 0 0 0 1 0 0 0 ... 0 0 0 1 0 0 0 0 0 0
3 4 1 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 1 0 0 0
4 rows × 403 columns
Check column names
In [25]:
hubdum.columns
Out[25]:
Index([u'end station category', u'usertype_Customer', u'usertype_Subscriber',
       u'day_of_week_Fri', u'day_of_week_Mon', u'day_of_week_Sat',
       u'day_of_week_Sun', u'day_of_week_Thurs', u'day_of_week_Tues',
       u'day_of_week_Weds',
       ...
       u'end station id_217', u'end station id_218', u'age decile_10',
       u'age decile_20', u'age decile_30', u'age decile_40', u'age decile_50',
       u'age decile_60', u'age decile_70', u'age decile_80'],
      dtype='object', length=403)
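With 403 columns it is hard to see the structure at a glance, so grouping the names by their prefix shows how many dummy columns each original feature expanded into. A small sketch, assuming the prefix_value naming convention visible above:
In [ ]:
from collections import Counter

# Count dummy columns per original feature, using the text before the last '_'
Counter(col.rsplit('_', 1)[0] for col in hubdum.columns)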
Subset the features and the target
In [27]:
# The first column is the target (end station category); the rest are features
y = hubdum.iloc[:, 0]
X = hubdum.iloc[:, 1:]
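Before fitting, it is also worth setting aside a held-out test set so the model can later be scored on rows it never saw during training. The notebook below fits on the full data, so this split is only a sketch (the split size and random seed here are arbitrary choices); it is picked up again after the overfitting discussion at the end.
In [ ]:
from sklearn.model_selection import train_test_split

# Reserve 25% of the rows for an out-of-sample score later on
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)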
In [36]:
from sklearn.ensemble import RandomForestClassifier

# Fit a random forest with default hyperparameters on the full data set
model = RandomForestClassifier()
model.fit(X, y)
In [37]:
print 'Random Forest model score:', model.score(X, y)
Random Forest model score: 0.997910135841
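A fitted random forest also exposes per-feature importances, which hint at which dummies are driving the prediction. A short sketch, assuming the model and X defined above:
In [ ]:
# Rank the ten most influential features by the forest's impurity-based importance
importances = pd.Series(model.feature_importances_, index=X.columns)
importances.sort_values(ascending=False).head(10)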
Again, as with the logistic regression, the concern is that overfitting occurred; a score computed on the same rows the model was trained on will always look optimistic. One-hot encoding every feature (403 columns in all) may also have contributed.
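One way to test that suspicion is to refit the forest on the training portion of the split sketched earlier and score it on the held-out rows; a large gap between the two scores would confirm overfitting. A minimal sketch, assuming X_train, X_test, y_train, and y_test from the earlier train_test_split:
In [ ]:
# Fit on the training split only, then score on rows the forest never saw
holdout_model = RandomForestClassifier()
holdout_model.fit(X_train, y_train)
print 'Train score:', holdout_model.score(X_train, y_train)
print 'Test score: ', holdout_model.score(X_test, y_test)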
Back to Executive Summary
