Wednesday, December 12, 2018

Hubway Capstone Project -- AdaBoost

Hubway is a bike-share program collectively owned by the metro Boston municipalities of Boston, Cambridge, Somerville, and Brookline. It is operated by Motivate, which manages similar initiatives in New York City, Portland, Chicago, Washington, DC, and several other metro areas in Ohio, Tennessee, and New Jersey, and which opened operations in San Francisco in June 2017. Hubway currently runs a system of 188 stations with 1,800 bikes.
  • For this project, I investigated shared data for the months of January, May, June, July, and October during the years of 2015 and 2016.
  • Of concern were the following questions:
    • How do riders use the bike-share service?
    • Are the bikes used as a conveyance or for recreation?
    • What type of customer uses the service?
Below is an AdaBoost model, built in an attempt to better predict the user's profile.
Import libraries and load file
In [1]:
import pandas as pd
from sklearn import model_selection
from sklearn.ensemble import AdaBoostClassifier
import matplotlib.pyplot as plt


# read_csv already returns a DataFrame, so no separate pd.DataFrame() call is needed
hubway = pd.read_csv('./database/data/hubway.csv')
Check the shape, then index the columns into predictors 'X' and target 'y'.
In [6]:
hubway.shape
Out[6]:
(946473, 403)
In [ ]:
"""
Predictors ('X') and target variable ('y') are as in other models
"""
#y = hubway['end station category']

#predictors = ['day_of_week', 'start station category', 'start station id','male', 'end station id', 'age decile']
#X = hubway[predictors]
In [7]:
y = hubway.iloc[:,0]
X = hubway.iloc[:,1:]
X.shape
Out[7]:
(946473, 402)
In [8]:
y.head()
Out[8]:
0    4
1    4
2    4
3    4
4    4
Name: end station category, dtype: int64
In [21]:
baseline = float(len(y[y == 4])) / len(y)
print('Baseline model for Business and Institutions area:', baseline)
Baseline model for Business and Institutions area: 0.407087154097
Again, the baseline model for businesses and institutions is ~40.7%.
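The baseline is simply the share of the majority class. As a cross-check, the same number can be computed with pandas' value_counts; a minimal sketch on a synthetic target series (the category codes below are made up for illustration, not real Hubway data):

```python
import pandas as pd

# Synthetic end-station categories; the real 'y' has five classes
y_demo = pd.Series([4, 4, 4, 4, 1, 2, 3, 5, 4, 2])

# Share of the majority class == accuracy of always predicting that class
baseline_demo = y_demo.value_counts(normalize=True).max()
print(baseline_demo)  # 0.5 here; ~0.407 on the real Hubway data
```
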
Set parameters for AdaBoost Classifier.
In [15]:
seed = 60
num_trees = 30
kfold = model_selection.KFold(n_splits=5, shuffle=True, random_state=seed)
model = AdaBoostClassifier(n_estimators=num_trees,  random_state=seed)
results = model_selection.cross_val_score(model, X, y, cv=kfold)
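The same KFold/cross_val_score pattern can be run end to end on a toy dataset; here make_classification stands in for the Hubway data, and the seed and tree count mirror the cell above:

```python
from sklearn import model_selection
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Stand-in for the Hubway data: 500 rows, 10 features, 3 classes
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     n_informative=5, n_classes=3,
                                     random_state=60)

kfold = model_selection.KFold(n_splits=5, shuffle=True, random_state=60)
model = AdaBoostClassifier(n_estimators=30, random_state=60)

# One accuracy score per fold; the mean is the figure reported below
scores = model_selection.cross_val_score(model, X_demo, y_demo, cv=kfold)
print(scores.mean(), scores.std())
```
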
AdaBoost classifier results and percent-change scores:
In [25]:
print('Prediction results for businesses and institutions through AdaBoost:', results.mean())
Prediction results for businesses and institutions through AdaBoost: 0.577575903867
In [29]:
model_improve = (results.mean() - baseline) / baseline * 100
print('Percent change between baseline model and AdaBoosting:', model_improve, '%')
Percent change between baseline model and AdaBoosting: 41.8801595811 %
AdaBoost predicted the user's trip category with a 41.88% improvement over the baseline model. Keep in mind that the baseline (40.7%), calculated through empirical data analysis, was itself nearly twice random guessing, since there were five end-station categories (20%). Measured against random guessing, AdaBoost was 188.8% more accurate. These results are more plausible than those of the Logistic Regression and Random Forest models, which suggests that AdaBoost is a better model for this application.
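The percent-change arithmetic above can be captured in a small helper; the inputs below are the rounded figures from this notebook:

```python
def pct_change(new, old):
    """Percent change of `new` relative to `old`."""
    return (new - old) / old * 100.0

baseline = 0.4071   # majority-class baseline (rounded)
adaboost = 0.5776   # mean cross-validated AdaBoost accuracy (rounded)
chance = 0.20       # random guessing over 5 end-station categories

print(round(pct_change(adaboost, baseline), 2))  # ~41.88
print(round(pct_change(adaboost, chance), 2))    # ~188.8
```
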
Back to Executive Summary
