Hubway Capstone: Logistic Regression & Cross Validation
- For this project, I investigated Hubway bike-share data for the months of January, May, June, July, and October in 2015 and 2016.
- The questions of concern were:
- How do riders use the bike-share service?
- Are the bikes used as a conveyance or for recreation?
- What type of customer uses the service?
Import needed libraries
In [3]:
import numpy as np
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
import statsmodels.api as sm
Load CSV file and check head
In [4]:
hubway_df = pd.read_csv('./data/hubway.csv')
In [5]:
hubway_df = hubway_df.drop(['Unnamed: 0', 'Unnamed: 1'], axis=1)
In [6]:
hubway_df.head()
Out[6]:
This is the DataFrame after the munging process below; this is what the regression was run on.
The munging code has been commented out because the CSV loaded above is the already-processed file the model was run on.
In [7]:
#Gender, End Station, Usertype, Age and/or Trip Duration.
#predictors = ['gender','end station id', 'usertype','birth year','tripduration']
#for entry in predictors:
# print hubway_df[entry].value_counts()
Drop the null values from the birth year
In [8]:
#mask = hubway_df['birth year'] == '\N'
#drop_index = hubway_df[mask].index
#hubway = hubway_df.drop(drop_index)
Convert them to numeric and drop values that are obviously erroneous or incorrect; the cutoff was 1927
In [9]:
#hubway['birth year'] = pd.to_numeric(hubway['birth year'])
#mask = hubway['birth year'] < 1927
#print len(hubway[mask])
#drop_index = hubway[mask].index
#hubway = hubway.drop(drop_index)
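Uncommented, the same cleaning can be written compactly (a sketch; it assumes the raw data, where missing birth years are stored as the string '\N'):
# treat the '\N' placeholder as missing, convert to numbers,
# and drop implausible birth years before the 1927 cutoff
hubway = hubway_df.copy()
hubway['birth year'] = pd.to_numeric(hubway['birth year'], errors='coerce')
hubway = hubway.dropna(subset=['birth year'])
hubway = hubway[hubway['birth year'] >= 1927]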
Turn birth year into age value
In [10]:
#hubway['birth year'] = hubway['birth year'].map(lambda x: 2016 - x)
#hubway['birth year'].head()
Create age decile
In [9]:
#mask = hubway['birth year'] < 23
#hubway['college_age'] = hubway[mask]
#mask = (hubway['birth year'] > 22) & (hubway['birth year'] < 31)
#hubway['young_professional'] = hubway[mask]
#mask = (hubway['birth year'] > 30) & (hubway['birth year'] < 55)
#hubway['working_professional'] = hubway[mask]
#mask = (hubway['birth year'] > 54)
#hubway['pension'] = hubway[mask]
#hubway['birth year'] = ((hubway['birth year'] // 10) * 10).astype(str)
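The four age-group masks above can be expressed in one step with pd.cut (a minimal sketch, assuming 'birth year' already holds ages as computed above; 'age_group' is a column name introduced here for illustration):
# bucket ages into the four rider groups defined by the masks above:
# under 23, 23-30, 31-54, and 55 and over
age_bins = [0, 22, 30, 54, 120]
age_labels = ['college_age', 'young_professional', 'working_professional', 'pension']
hubway['age_group'] = pd.cut(hubway['birth year'], bins=age_bins, labels=age_labels)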
Rename columns and drop those not needed
In [11]:
#hubway = hubway.rename(columns={'birth year':'age decile', 'gender':'male'})
#hubway = pd.get_dummies(hubway, columns = ['usertype'])
#hubway = hubway.drop(['tripduration', 'starttime', 'stoptime',
#                      'start station name', 'start station latitude', 'start station longitude',
#                      'end station name', 'end station latitude', 'end station longitude', 'bikeid',
#                      'start station category', 'day_of_week'], axis=1)
hubway_df.head(1)
Out[11]:
Assign '1' to men and '0' to women
In [12]:
#women made up 32% of riders in 2012 so men are 1 and women are either 0 or 2 - let's assume they're 0 and drop 2
#mask = hubway['male'] == 0
#drop_index = hubway[mask].index
#hubway = hubway.drop(drop_index)
In [13]:
#hubway['start station category'] = hubway['start station category'].astype(str)
In [14]:
#hubway = hubway.drop(['tripduration', 'starttime', 'stoptime',
#                      'start station name', 'start station latitude', 'start station longitude',
#                      'end station name', 'end station latitude', 'end station longitude', 'bikeid',
#                      'start station category', 'day_of_week'], axis=1)
cols = hubway_df.columns.tolist()
cols
Out[14]:
In [15]:
# get a list of columns
cols = list(hubway_df)
# move the column to head of list using index, pop and insert
cols.insert(0, cols.pop(cols.index('end station category')))
cols
# reorder the DataFrame so 'end station category' is the first column
#hubway_df = hubway_df[cols]
Out[15]:
Get dummy variables for each of the predictors: 'day_of_week', 'start station category', 'start station id','male', 'end station id', 'age decile'
In [17]:
hubway = pd.get_dummies(hubway_df, columns = ['day_of_week', 'start station category',
'start station id','male', 'end station id',
'age decile'] )
hubway.head(1)
Out[17]:
Save final file to CSV
In [18]:
#hubway.sample(frac=1)
#hubway.to_csv('./database/data/hubway.csv', index=False)
hubway.info()
Create 'y' & 'X' by indexing; check the shape of 'X' and the head of 'y' to ensure they are correct
In [19]:
#Gender, Start Station, Usertype, Age and/or Trip Duration.
#sample_hubway = hubway.sample(n=500000)
y = hubway.iloc[:,0]
X = hubway.iloc[:,1:]
#predictors = ['male', 'end station id', 'usertype_Subscriber','age decile']
#X = hubway[predictors]
X.shape
Out[19]:
In [20]:
y.head()
Out[20]:
Calculate the baseline percentage for the model based on the percentage of trips ending in the Business and Institutions area
In [21]:
#baseline model - 41.4% of values are type 4
print 'Baseline model for Business and Institutions area:', float(len(y[y == 4])) / len(y)
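The same baseline can be read off the full class distribution (a quick sketch):
# share of trips ending in each station category;
# category 4 (Business and Institutions) is the majority class, i.e. the baseline
print(y.value_counts(normalize=True).sort_index())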
Begin Logistic Regression
- Import libraries
- Select the solver (sag), since this is a multi-class regression, one versus rest (ovr)
- Set up train/test split at 33%
In [22]:
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
lr = LogisticRegression(solver='sag', multi_class='ovr')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=5)
lr.fit(X_train, y_train)
predicted = lr.predict(X_test)
print 'Multi-class Logistic Regression:', lr.score(X_test, y_test)
This score of 99.73% is far too high, and overfitting is suspected, so let's cross-validate.
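Before moving to a random forest, the logistic model itself can be cross-validated as a sanity check (a sketch; these scores were not part of the original run):
from sklearn.model_selection import cross_val_score
# 5-fold cross-validation of the same one-vs-rest logistic regression
lr_cv_scores = cross_val_score(LogisticRegression(solver='sag', multi_class='ovr'), X, y, cv=5)
print("Logistic Regression CV scores: %s" % lr_cv_scores)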
Note: the random forest parameters (n_estimators, max_depth) were tuned several times to reach the score below
In [24]:
rfc = RandomForestClassifier(n_estimators= 85, max_depth = 110)
scores = cross_val_score(rfc, X, y, cv=5)
print 'Cross Validation Score:', scores
Calculating the accuracy
In [25]:
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Classification report
In [26]:
from sklearn.metrics import classification_report
print(classification_report(y_test, predicted, target_names=['end_side_streets','end_mixed_squares','end_recreation',
'end_business_institution','end_major_shopping']))
Note the support: its total is 312337, which is 33% of the 946473-record sample and corresponds to our test split. The 127257 records in the end_business_institution class are roughly 40% of that test split, which matches the baseline arrived at before running the model.
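One way to probe these high scores further is a confusion matrix (a sketch; not part of the original analysis):
from sklearn.metrics import confusion_matrix
# rows are the true end-station categories, columns are the predictions
print(confusion_matrix(y_test, predicted))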
I posted a question on Quora about interpreting the high scores: How do I interpret the scores of a logistic regression and cross validation when they are both very high, ~99% or ~98%, after tweaking the parameters?