Wednesday, December 12, 2018

Hubway Capstone Project -- AdaBoost

Hubway is a bike-share program collectively owned by the metro Boston municipalities of Boston, Cambridge, Somerville, and Brookline. It is operated by Motivate, which manages similar initiatives in New York City, Portland, Chicago, Washington, DC, and several other metro areas in Ohio, Tennessee, and New Jersey, and which opened operations in San Francisco in June 2017. Hubway currently runs a system of 188 stations with 1,800 bikes.
  • For this project, I investigated shared data for the months of January, May, June, July, and October during the years of 2015 and 2016.
  • Of concern were the following questions:
    • How do riders use the bike-share service?
    • Are the bikes used as a conveyance or for recreation?
    • What type of customer uses the service?
Below is an AdaBoost model, built in an attempt to better predict the user's profile.
Import libraries and load file
In [1]:
import pandas as pd
from sklearn import model_selection
from sklearn.ensemble import AdaBoostClassifier
import matplotlib.pyplot as plt


# read_csv already returns a DataFrame, so no separate pd.DataFrame() call is needed
hubway = pd.read_csv('./database/data/hubway.csv')
Check the shape, then index the columns into predictors 'X' and target 'y'.
In [6]:
hubway.shape
Out[6]:
(946473, 403)
In [ ]:
"""
Predictors ('X') and target variable ('y') are as in other models
"""
#y = hubway['end station category']

#predictors = ['day_of_week', 'start station category', 'start station id','male', 'end station id', 'age decile']
#X = hubway[predictors]
In [7]:
y = hubway.iloc[:,0]
X = hubway.iloc[:,1:]
X.shape
Out[7]:
(946473, 402)
In [8]:
y.head()
Out[8]:
0    4
1    4
2    4
3    4
4    4
Name: end station category, dtype: int64
In [21]:
baseline = float(len(y[y == 4])) / len(y)
print('Baseline model for Business and Institutions area:', baseline)
Baseline model for Business and Institutions area: 0.407087154097
Again, the baseline model for businesses and institutions is ~40.7%.
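The baseline is simply the share of the majority class. As a cross-check, the same number can be computed with pandas' value_counts; a minimal sketch on a synthetic target series (the category codes below are made up for illustration, not real Hubway data):

```python
import pandas as pd

# Synthetic end-station categories; the real 'y' has five classes
y_demo = pd.Series([4, 4, 4, 4, 1, 2, 3, 5, 4, 2])

# Share of the majority class == accuracy of always predicting that class
baseline_demo = y_demo.value_counts(normalize=True).max()
print(baseline_demo)  # 0.5 here; ~0.407 on the real Hubway data
```
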
Set parameters for AdaBoost Classifier.
In [15]:
seed = 60
num_trees = 30
kfold = model_selection.KFold(n_splits=5, shuffle=True, random_state=seed)
model = AdaBoostClassifier(n_estimators=num_trees,  random_state=seed)
results = model_selection.cross_val_score(model, X, y, cv=kfold)
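The same KFold/cross_val_score pattern can be run end to end on a toy dataset; here make_classification stands in for the Hubway data, and the seed and tree count mirror the cell above:

```python
from sklearn import model_selection
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Stand-in for the Hubway data: 500 rows, 10 features, 3 classes
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     n_informative=5, n_classes=3,
                                     random_state=60)

kfold = model_selection.KFold(n_splits=5, shuffle=True, random_state=60)
model = AdaBoostClassifier(n_estimators=30, random_state=60)

# One accuracy score per fold; the mean is the figure reported below
scores = model_selection.cross_val_score(model, X_demo, y_demo, cv=kfold)
print(scores.mean(), scores.std())
```
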
AdaBoost classifier results and percent-change scores:
In [25]:
print('Prediction results for businesses and institutions through AdaBoost:', results.mean())
Prediction results for businesses and institutions through AdaBoost: 0.577575903867
In [29]:
model_improve = (results.mean() - baseline) / baseline * 100
print('Percent change between baseline model and AdaBoosting:', model_improve, '%')
Percent change between baseline model and AdaBoosting: 41.8801595811 %
AdaBoost predicted the user's trip category with a 41.88% improvement over the baseline model. Keep in mind that the baseline (40.7%), calculated through empirical data analysis, was itself nearly twice random guessing, since there were five end-station categories (20%). Measured against random guessing, AdaBoost was 188.8% more accurate. These results are more plausible than those of the Logistic Regression and Random Forest models, which suggests that AdaBoost is a better model for this application.
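The percent-change arithmetic above can be captured in a small helper; the inputs below are the rounded figures from this notebook:

```python
def pct_change(new, old):
    """Percent change of `new` relative to `old`."""
    return (new - old) / old * 100.0

baseline = 0.4071   # majority-class baseline (rounded)
adaboost = 0.5776   # mean cross-validated AdaBoost accuracy (rounded)
chance = 0.20       # random guessing over 5 end-station categories

print(round(pct_change(adaboost, baseline), 2))  # ~41.88
print(round(pct_change(adaboost, chance), 2))    # ~188.8
```
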
Back to Executive Summary
