Wednesday, December 12, 2018

Hubway Capstone -- Logistic Regression & Cross Validation

Hubway is a bike-share program collectively owned by the metro Boston municipalities of Boston, Cambridge, Somerville, and Brookline. It is operated by Motivate, which manages similar initiatives in NYC, Portland, Chicago, Washington DC, and several other metro areas in Ohio, Tennessee, and New Jersey, and which opened operations in San Francisco in June 2017. Hubway currently runs a system of 188 stations with 1,800 bikes.
  • For this project, I investigated Hubway's shared trip data for the months of January, May, June, July, and October in 2015 and 2016.
  • The questions of concern were:
    • How do riders use the bike-share service?
    • Are the bikes used as a conveyance or for recreation?
    • What type of customer uses the service?
Below, logistic regression and random forest models are fit in an attempt to better predict the user's profile.
Import needed libraries
In [3]:
import numpy as np
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
import statsmodels.api as sm
Load CSV file and check head
In [4]:
hubway_df = pd.read_csv('./data/hubway.csv')
In [5]:
hubway_df = hubway_df.drop(['Unnamed: 0', 'Unnamed: 1'], axis=1)
In [6]:
hubway_df.head()
Out[6]:
start station id end station id age decile male end station category start station category day_of_week usertype_Customer usertype_Subscriber
0 115 96 30 1 4 5 Thurs 0 1
1 115 96 20 1 4 5 Thurs 0 1
2 115 96 20 1 4 5 Sun 0 1
3 115 96 50 2 4 5 Thurs 1 0
4 115 96 30 1 4 5 Sat 0 1
This is the DataFrame after the munging process shown below; this is what the regression was run on.
The munging code has been commented out because the CSV file loaded above is the already-munged output of those steps.
In [7]:
#Predictors considered: gender, end station, usertype, age and/or trip duration.
#predictors = ['gender', 'end station id', 'usertype', 'birth year', 'tripduration']
#for entry in predictors:
#    print(hubway_df[entry].value_counts())
Drop the null values from the birth year column
In [8]:
#mask = hubway_df['birth year'] == '\N'
#drop_index = hubway_df[mask].index
#hubway = hubway_df.drop(drop_index)
Convert birth year to numeric and drop values that are obviously incorrect; the cutoff was 1927
In [9]:
#hubway['birth year'] = pd.to_numeric(hubway['birth year'])
#mask = hubway['birth year'] < 1927
#print(len(hubway[mask]))
#drop_index = hubway[mask].index
#hubway = hubway.drop(drop_index)
Turn birth year into age value
In [10]:
#hubway['birth year'] = hubway['birth year'].map(lambda x: 2016 - x)
#hubway['birth year'].head()
Create age groups and the age decile
In [9]:
#note: at this point 'birth year' already holds ages; flag each age group as 0/1
#mask = hubway['birth year'] < 23
#hubway['college_age'] = mask.astype(int)
#mask = (hubway['birth year'] > 22) & (hubway['birth year'] < 31)
#hubway['young_professional'] = mask.astype(int)
#mask = (hubway['birth year'] > 30) & (hubway['birth year'] < 55)
#hubway['working_professional'] = mask.astype(int)
#mask = (hubway['birth year'] > 54)
#hubway['pension'] = mask.astype(int)
#hubway['birth year'] = ((hubway['birth year'] // 10) * 10).astype(str)
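
As a quick check of the decile logic, floor-dividing an age by 10 and multiplying back maps, for example, 34 to the '30' decile (a minimal sketch using made-up ages):

ages = pd.Series([19, 34, 58])
# floor-divide by 10, multiply back, and cast to string, as in the munging step above
print(((ages // 10) * 10).astype(str).tolist())  # ['10', '30', '50']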
Rename columns and drop those not needed
In [11]:
#hubway = hubway.rename(columns={'birth year':'age decile', 'gender':'male'})
#hubway = pd.get_dummies(hubway, columns=['usertype'])
#note: the drop must be assigned back; 'start station category' and 'day_of_week' are kept
#hubway = hubway.drop(['tripduration', 'starttime', 'stoptime',
#                      'start station name', 'start station latitude', 'start station longitude',
#                      'end station name', 'end station latitude', 'end station longitude',
#                      'bikeid'], axis=1)

hubway_df.head(1)
Out[11]:
start station id end station id age decile male end station category start station category day_of_week usertype_Customer usertype_Subscriber
0 115 96 30 1 4 5 Thurs 0 1
Drop riders with unreported gender (coded 0 in the raw data)
In [12]:
#women made up 32% of riders in 2012; gender is coded 0 = unknown, 1 = male, 2 = female
#drop the rows with unreported gender
#mask = hubway['male'] == 0
#drop_index = hubway[mask].index
#hubway = hubway.drop(drop_index)
In [13]:
#hubway['start station category'] = hubway['start station category'].astype(str)
In [14]:
#hubway = hubway.drop(['tripduration', 'starttime', 'stoptime',
#                      'start station name', 'start station latitude', 'start station longitude',
#                      'end station name', 'end station latitude', 'end station longitude',
#                      'bikeid'], axis=1)

cols = hubway_df.columns.tolist()
cols
Out[14]:
['start station id',
 'end station id',
 'age decile',
 'male',
 'end station category',
 'start station category',
 'day_of_week',
 'usertype_Customer',
 'usertype_Subscriber']
In [15]:
# get a list of columns
cols = list(hubway_df)
# move the column to head of list using index, pop and insert
cols.insert(0, cols.pop(cols.index('end station category')))
cols

# use loc to reorder (.ix is deprecated)
#df = df.loc[:, cols]
#df
Out[15]:
['end station category',
 'start station id',
 'end station id',
 'age decile',
 'male',
 'start station category',
 'day_of_week',
 'usertype_Customer',
 'usertype_Subscriber']
Get dummy variables for each of the predictors: 'day_of_week', 'start station category', 'start station id','male', 'end station id', 'age decile'
In [17]:
hubway = pd.get_dummies(hubway_df, columns=['day_of_week', 'start station category',
                                            'start station id', 'male', 'end station id',
                                            'age decile'])
hubway.head(1)
Out[17]:
end station category usertype_Customer usertype_Subscriber day_of_week_Fri day_of_week_Mon day_of_week_Sat day_of_week_Sun day_of_week_Thurs day_of_week_Tues day_of_week_Weds ... end station id_217 end station id_218 age decile_10 age decile_20 age decile_30 age decile_40 age decile_50 age decile_60 age decile_70 age decile_80
0 4 0 1 0 0 0 0 1 0 0 ... 0 0 0 0 1 0 0 0 0 0
1 rows × 403 columns
Save final file to CSV
In [18]:
#hubway = hubway.sample(frac=1)  # shuffle rows before saving
#hubway.to_csv('./database/data/hubway.csv', index=False)
hubway.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 946473 entries, 0 to 946472
Columns: 403 entries, end station category to age decile_80
dtypes: int64(3), uint8(400)
memory usage: 382.7 MB
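At 382.7 MB, the dummy matrix is heavy for in-memory work. Because the 400 uint8 columns are mostly zeros, one option is to let pandas store them sparsely; this is a sketch (not run here), and sklearn may require converting the result to a scipy sparse matrix before fitting:

hubway_sparse = pd.get_dummies(hubway_df, columns=['day_of_week', 'start station category',
                                                   'start station id', 'male', 'end station id',
                                                   'age decile'], sparse=True)
hubway_sparse.info()  # memory usage should drop substantially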
Create 'y' & 'X' by indexing; check the shape of 'X' and the head of 'y' to ensure they are correct
In [19]:
#Gender, Start Station, Usertype, Age and/or Trip Duration.
#sample_hubway = hubway.sample(n=500000)
y = hubway.iloc[:,0]
X = hubway.iloc[:,1:]

#predictors = ['male', 'end station id',  'usertype_Subscriber','age decile']
#X = hubway[predictors]
X.shape
Out[19]:
(946473, 402)
In [20]:
y.head()
Out[20]:
0    4
1    4
2    4
3    4
4    4
Name: end station category, dtype: int64
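Positional indexing works here only because 'end station category' happens to land first after get_dummies; selecting by name is more robust (an equivalent sketch):

y = hubway['end station category']
X = hubway.drop('end station category', axis=1)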
Calculate the baseline percentage for the model, based on the percentage of trips ending in the Business and Institutions area
In [21]:
#baseline model - about 40.7% of values are type 4 (Business and Institutions)
print('Baseline model for Business and Institutions area:', float(len(y[y == 4])) / len(y))
Baseline model for Business and Institutions area: 0.407087154097
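The same baseline can be read off for every class at once (a one-line alternative):

print(y.value_counts(normalize=True))  # class 4 should show ~0.407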
Begin Logistic Regression
  • Import libraries
  • Select the 'sag' solver, since this is a multi-class regression run one-versus-rest ('ovr')
  • Set up train/test split at 33%
In [22]:
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

lr = LogisticRegression(solver='sag', multi_class='ovr')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=5)
                                                    
lr.fit(X_train, y_train)
predicted = lr.predict(X_test)

print('Multi-class Logistic Regression:', lr.score(X_test, y_test))
Multi-class Logistic Regression: 0.997326605557
This score of 99.73% is suspiciously high, so overfitting is suspected. Let's cross-validate.
Note: the hyperparameters were tuned several times to reach the score below
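Before trusting the single train/test split, one sanity check (not run in the original notebook) is to cross-validate the logistic regression itself; if its fold scores also sit near 99.7%, the signal is in the features rather than an artifact of the split:

# reuses cross_val_score and LogisticRegression imported above
lr_scores = cross_val_score(LogisticRegression(solver='sag', multi_class='ovr'), X, y, cv=5)
print('Logistic Regression CV scores:', lr_scores)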
In [24]:
rfc = RandomForestClassifier(n_estimators=85, max_depth=110)
scores = cross_val_score(rfc, X, y, cv=5)

print('Cross Validation Score:', scores)
Cross Validation Score: [ 0.98661884  0.97646543  0.99239807  0.9880292   0.981769  ]
Calculate the accuracy
In [25]:
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Accuracy: 0.99 (+/- 0.01)
Classification report
In [26]:
from sklearn.metrics import classification_report
print(classification_report(y_test, predicted, target_names=['end_side_streets','end_mixed_squares','end_recreation',
                                                             'end_business_institution','end_major_shopping']))
                          precision    recall  f1-score   support

        end_side_streets       1.00      1.00      1.00     41350
       end_mixed_squares       1.00      0.99      0.99     58868
          end_recreation       1.00      1.00      1.00     41585
end_business_institution       0.99      1.00      1.00    127257
      end_major_shopping       1.00      1.00      1.00     43277

             avg / total       1.00      1.00      1.00    312337

Note the support: its total is 312,337, which is 33% of the 946,473-record sample, i.e., our test split. The 127,257 end_business_institution records are about 40.7% of that test split, matching the baseline computed before running the model.
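These ratios are easy to verify directly (a quick check):

print(312337 / 946473.0)  # ~0.33, the test split fraction
print(127257 / 312337.0)  # ~0.407, matching the baseline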
Back to Executive Summary
