Hubway Capstone Random Forest
Hubway is a bike-share program collectively owned by four metro Boston municipalities: Boston, Cambridge, Somerville, and Brookline. It is operated by Motivate, which manages similar initiatives in NYC, Portland, Chicago, Washington DC, and several other metro areas in Ohio, Tennessee, and New Jersey. Motivate is opening operations in San Francisco in June 2017. Hubway currently exists as a system of 188 stations with 1,800 bikes.
- For this project, I investigated shared data for the months of January, May, June, July, and October of 2015 and 2016.
- The questions of concern were:
- How do riders use the bike-share service?
- Are the bikes used as a conveyance or for recreation?
- What type of customer uses the service?
Import libraries
In [17]:
import numpy as np
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
#import sqlite3
Load the data
In [18]:
# read_csv already returns a DataFrame, so no extra conversion is needed
hubway_df = pd.read_csv('./database/data/hubway.csv')
hubway_df.head(5)
Out[18]:
In [20]:
hubway_df.shape
Out[20]:
Get dummies (skipped here, because this file is already dummy-encoded)
In [23]:
#hubdum = pd.get_dummies(hubway_df, columns = ['day_of_week', 'start station category',
#'start station id','male', 'end station id',
#'age decile'])
hubdum = hubway_df
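For reference, here is a minimal sketch of what the commented-out `get_dummies` call does, run on a toy frame rather than the Hubway file. The column names `day_of_week` and `male` come from the call above; the `duration` column and all values are made up for illustration.

```python
import pandas as pd

# Toy frame: 'duration' and every value here are hypothetical
toy = pd.DataFrame({
    'day_of_week': ['Mon', 'Tue', 'Mon'],
    'male': [1, 0, 1],
    'duration': [300, 720, 450],
})

# Each listed column is replaced by one indicator column per distinct value;
# columns not listed (here 'duration') pass through unchanged
toy_dum = pd.get_dummies(toy, columns=['day_of_week', 'male'])
print(toy_dum.columns.tolist())
```

Encoding ID-like columns such as the station IDs this way adds one column per distinct station, so the feature count can grow quickly.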
In [24]:
hubdum.head(4)
Out[24]:
Check column names
In [25]:
hubdum.columns
Out[25]:
Subset the features and the target
In [27]:
y = hubdum.iloc[:,0]
X = hubdum.iloc[:,1:]
In [36]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model = model.fit(X,y)
In [37]:
# Note: this scores the model on the same data it was trained on
print('Random Forest model score: %s' % model.score(X, y))
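As with the logistic regression, the concern is that overfitting occurred. The score is computed on the same data the model was fit on, so it says little about how well the model generalizes. Perhaps getting dummies for all of the features has contributed to this.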
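One common way to probe an overfitting concern like this is to compare accuracy on the training data with accuracy on held-out data. The sketch below is not part of the original analysis: it uses a synthetic data set from `make_classification` as a stand-in for `X` and `y`, and the split size, `n_estimators`, and `random_state` values are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for X and y (the Hubway data is not used here)
X_demo, y_demo = make_classification(n_samples=500, n_features=20,
                                     random_state=0)

# Hold out 30% of the rows so the model is scored on data it has not seen
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

train_score = rf.score(X_train, y_train)
test_score = rf.score(X_test, y_test)

# 5-fold cross-validation gives a second, less split-dependent estimate
cv_scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0),
    X_demo, y_demo, cv=5)

print('train accuracy: %.3f' % train_score)
print('test accuracy:  %.3f' % test_score)
print('mean CV accuracy: %.3f' % cv_scores.mean())
```

A large gap between the training and held-out scores would support the overfitting concern; similar scores would suggest the model generalizes reasonably well.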