Hubway Capstone: AdaBoost
Hubway is a bike-share program collectively owned by four metro Boston municipalities: Boston, Cambridge, Somerville, and Brookline. It is operated by Motivate, which manages similar initiatives in NYC, Portland, Chicago, Washington DC, and several other metro areas in Ohio, Tennessee, and New Jersey. Motivate is opening operations in San Francisco in June 2017. Hubway currently operates as a system of 188 stations with 1,800 bikes.
- For this project, I investigated shared data for January, May, June, July, and October of 2015 and 2016.
- The questions of concern were:
- How do riders use the bike-share service?
- Are the bikes used as a conveyance or for recreation?
- What type of customer uses the service?
Import libraries and load file
In [1]:
import pandas as pd
from sklearn import model_selection
from sklearn.ensemble import AdaBoostClassifier
import matplotlib.pyplot as plt
# pd.read_csv already returns a DataFrame, so no extra conversion is needed
hubway = pd.read_csv('./database/data/hubway.csv')
Check the shape and select the columns for the predictors 'X' and the target 'y'.
In [6]:
hubway.shape
Out[6]:
In [ ]:
"""
Predictors ('X') and target variable ('y') are as in other models
"""
#y = hubway['end station category']
#predictors = ['day_of_week', 'start station category', 'start station id','male', 'end station id', 'age decile']
#X = hubway[predictors]
In [7]:
y = hubway.iloc[:,0]
X = hubway.iloc[:,1:]
X.shape
Out[7]:
In [8]:
y.head()
Out[8]:
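The positional split above relies on the target being the first column of the file. As a minimal sketch of that assumption, the toy frame below (made-up values, with column names borrowed from the commented-out cell) shows that the positional split and a label-based split agree only when the target really is in column 0. This assumes Python 3 and a recent pandas.

```python
import pandas as pd

# Hypothetical toy frame: column names follow the commented-out cell,
# the values are invented for illustration only.
toy = pd.DataFrame({
    'end station category': [4, 1, 4],
    'day_of_week': [0, 3, 5],
    'male': [1, 0, 1],
})

# Positional split, as in the notebook: first column is the target.
y_pos = toy.iloc[:, 0]
X_pos = toy.iloc[:, 1:]

# Label-based split: does not depend on column order.
y_lab = toy['end station category']
X_lab = toy.drop(columns=['end station category'])

print(X_pos.shape, y_pos.shape)
print(X_pos.equals(X_lab) and y_pos.equals(y_lab))
```

If the column order in the CSV ever changes, the label-based form keeps selecting the intended target, while the positional form silently picks a different column.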
In [21]:
# Share of trips ending in category 4: the accuracy of always predicting that category
baseline = float(len(y[y == 4])) / len(y)
print('Baseline model for Business and Institutions area: {}'.format(baseline))
Again, the baseline for businesses and institutions is ~40.70%: the accuracy obtained by always predicting that end station category.
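A baseline of this kind is often taken as the share of the most frequent class. The sketch below shows one way to compute it with `value_counts(normalize=True)` on a made-up target series. It assumes the baseline category is also the most common one, which the notebook does not state explicitly.

```python
import pandas as pd

# Hypothetical toy target; the real values come from hubway.csv.
y_toy = pd.Series([4, 4, 1, 2, 4, 3, 0, 4, 1, 2])

# Share of each category, sorted from most to least frequent.
class_shares = y_toy.value_counts(normalize=True)

# Accuracy of a model that always predicts the most common category.
majority_class = class_shares.index[0]
majority_baseline = class_shares.iloc[0]

print('Most common category: {}'.format(majority_class))
print('Majority-class baseline: {:.2f}'.format(majority_baseline))
```

Printing the full `class_shares` series also shows how unevenly the categories are distributed, which helps when judging how hard the baseline is to beat.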
Set parameters for AdaBoost Classifier.
In [15]:
seed = 60
num_trees = 30
kfold = model_selection.KFold(n_splits=5, shuffle=True, random_state=seed)
model = AdaBoostClassifier(n_estimators=num_trees, random_state=seed)
results = model_selection.cross_val_score(model, X, y, cv=kfold)
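For readers without access to the Hubway file, the same cross-validation setup can be tried on synthetic data. This is only a sketch: `make_classification` stands in for the real predictors and target, so the scores it produces say nothing about the Hubway results. It does show that `cross_val_score` returns one accuracy value per fold, which is why the notebook reports `results.mean()`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (hypothetical); the real X and y come from hubway.csv.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=60
)

# Same settings as the notebook cell: 5 shuffled folds, 30 boosting rounds.
kfold = KFold(n_splits=5, shuffle=True, random_state=60)
model = AdaBoostClassifier(n_estimators=30, random_state=60)

# One accuracy score per fold (accuracy is the default metric for classifiers).
scores = cross_val_score(model, X_toy, y_toy, cv=kfold)

print('Per-fold accuracy: {}'.format(scores))
print('Mean accuracy: {:.3f} (std {:.3f})'.format(scores.mean(), scores.std()))
```

Looking at the per-fold spread alongside the mean gives a rough sense of how stable the estimate is across splits.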
AdaBoost Classifier and percent change scores:
In [25]:
print('Prediction results for businesses and institutions through AdaBoost: {}'.format(results.mean()))
In [29]:
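# Relative improvement of the mean cross-validated accuracy over the baseline, in percent
model_improve = (results.mean() - baseline) / baseline * 100
print('Percent change between baseline model and AdaBoosting: {}%'.format(model_improve))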
AdaBoost predicted the end station category of a user's trip with a 41.88% relative improvement in accuracy over the baseline model. Keep in mind that the baseline (40.7%), calculated from the empirical data, was itself roughly twice as accurate as random guessing, since there were five end station categories (20%). Compared with random guessing, AdaBoost was 188.8% more accurate. These results are more realistic than those of the Logistic Regression and the Random Forest, so AdaBoost is assumed to be a better model for this application than either of them.
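The percentages in this summary are relative changes, not differences in percentage points. As a rough check, the sketch below recovers the implied mean accuracy from the quoted figures (the 40.70% baseline and the 41.88% improvement) and compares it with random guessing. The helper `percent_change` is introduced here for illustration and is not part of the notebook.

```python
def percent_change(new, reference):
    """Relative change of `new` with respect to `reference`, in percent."""
    return (new - reference) / reference * 100.0


baseline = 0.4070        # baseline accuracy quoted in the text
random_guess = 1.0 / 5   # five end station categories
improvement = 41.88      # percent change over the baseline quoted in the text

# Invert the percent change to recover the implied mean accuracy.
implied_accuracy = baseline * (1 + improvement / 100.0)

print('Implied mean accuracy: {:.4f}'.format(implied_accuracy))
print('AdaBoost vs. random guessing: {:.1f}%'.format(
    percent_change(implied_accuracy, random_guess)))
print('Baseline vs. random guessing: {:.1f}%'.format(
    percent_change(baseline, random_guess)))
```

Under these rounded inputs the implied accuracy is a little under 58%, and the change relative to random guessing lands within rounding of the figure quoted above.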