Hubway Capstone: Logistic Regression & Cross Validation
- For this project, I investigated Hubway bike-share data for the months of January, May, June, July, and October in 2015 and 2016.
- The questions of concern were:
- How do riders use the bike-share service?
- Are the bikes used as a conveyance or for recreation?
- What type of customer uses the service?
Import needed libraries
In [3]:
import numpy as np
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
import statsmodels.api as sm
Load CSV file and check head
In [4]:
hubway_df = pd.read_csv('./data/hubway.csv')
In [5]:
hubway_df = hubway_df.drop(['Unnamed: 0', 'Unnamed: 1'], axis=1)
In [6]:
hubway_df.head()
Out[6]:
This is the DataFrame after the munging process below; this is what the regression was run on.
The munging code has been commented out because the CSV loaded above is the already-processed file the model was run on.
In [7]:
#Gender, End Station, Usertype, Age and/or Trip Duration.
#predictors = ['gender','end station id', 'usertype','birth year','tripduration']
#for entry in predictors:
# print hubway_df[entry].value_counts()
Drop the null values from the birth year
In [8]:
#mask = hubway_df['birth year'] == '\N'
#drop_index = hubway_df[mask].index
#hubway = hubway_df.drop(drop_index)
Convert them to numeric and drop values that are obviously erroneous or incorrect; the cutoff was 1927
In [9]:
#hubway['birth year'] = pd.to_numeric(hubway['birth year'])
#mask = hubway['birth year'] < 1927
#print len(hubway[mask])
#drop_index = hubway[mask].index
#hubway = hubway.drop(drop_index)
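Uncommented, the same cleaning can be written compactly (a sketch; it assumes the raw data, where missing birth years are stored as the string '\N'):
# treat the '\N' placeholder as missing, convert to numbers,
# and drop implausible birth years before the 1927 cutoff
hubway = hubway_df.copy()
hubway['birth year'] = pd.to_numeric(hubway['birth year'], errors='coerce')
hubway = hubway.dropna(subset=['birth year'])
hubway = hubway[hubway['birth year'] >= 1927]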
Turn birth year into age value
In [10]:
#hubway['birth year'] = hubway['birth year'].map(lambda x: 2016 - x)
#hubway['birth year'].head()
Create age decile
In [9]:
#mask = hubway['birth year'] < 23
#hubway['college_age'] = hubway[mask]
#mask = (hubway['birth year'] > 22) & (hubway['birth year'] < 31)
#hubway['young_professional'] = hubway[mask]
#mask = (hubway['birth year'] > 30) & (hubway['birth year'] < 55)
#hubway['working_professional'] = hubway[mask]
#mask = (hubway['birth year'] > 54)
#hubway['pension'] = hubway[mask]
#hubway['birth year'] = ((hubway['birth year'] // 10) * 10).astype(str)
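The four age-group masks above can be expressed in one step with pd.cut (a minimal sketch, assuming 'birth year' already holds ages as computed above; 'age_group' is a column name introduced here for illustration):
# bucket ages into the four rider groups defined by the masks above:
# under 23, 23-30, 31-54, and 55 and over
age_bins = [0, 22, 30, 54, 120]
age_labels = ['college_age', 'young_professional', 'working_professional', 'pension']
hubway['age_group'] = pd.cut(hubway['birth year'], bins=age_bins, labels=age_labels)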
Rename columns and drop those not needed
In [11]:
#hubway = hubway.rename(columns={'birth year':'age decile', 'gender':'male'})
#hubway = pd.get_dummies(hubway, columns = ['usertype'])
#hubway = hubway.drop(['tripduration', 'starttime', 'stoptime',
#                      'start station name', 'start station latitude', 'start station longitude',
#                      'end station name', 'end station latitude', 'end station longitude', 'bikeid',
#                      'start station category', 'day_of_week'], axis=1)
hubway_df.head(1)
Out[11]:
Assign '1' to men and '0' to women
In [12]:
#women made up 32% of riders in 2012 so men are 1 and women are either 0 or 2 - let's assume they're 0 and drop 2
#mask = hubway['male'] == 0
#drop_index = hubway[mask].index
#hubway = hubway.drop(drop_index)
In [13]:
#hubway['start station category'] = hubway['start station category'].astype(str)
In [14]:
#hubway = hubway.drop(['tripduration', 'starttime', 'stoptime',
#                      'start station name', 'start station latitude', 'start station longitude',
#                      'end station name', 'end station latitude', 'end station longitude', 'bikeid',
#                      'start station category', 'day_of_week'], axis=1)
cols = hubway_df.columns.tolist()
cols
Out[14]:
In [15]:
# get a list of columns
cols = list(hubway_df)
# move the column to head of list using index, pop and insert
cols.insert(0, cols.pop(cols.index('end station category')))
cols
# reorder the DataFrame so 'end station category' is the first column
#hubway_df = hubway_df[cols]
Out[15]:
Get dummy variables for each of the predictors: 'day_of_week', 'start station category', 'start station id','male', 'end station id', 'age decile'
In [17]:
hubway = pd.get_dummies(hubway_df, columns = ['day_of_week', 'start station category',
'start station id','male', 'end station id',
'age decile'] )
hubway.head(1)
Out[17]:
Save final file to CSV
In [18]:
#hubway.sample(frac=1)
#hubway.to_csv('./database/data/hubway.csv', index=False)
hubway.info()
Create 'y' & 'X' by indexing; check the shape of 'X' and the head of 'y' to ensure they are correct
In [19]:
#Gender, Start Station, Usertype, Age and/or Trip Duration.
#sample_hubway = hubway.sample(n=500000)
y = hubway.iloc[:,0]
X = hubway.iloc[:,1:]
#predictors = ['male', 'end station id', 'usertype_Subscriber','age decile']
#X = hubway[predictors]
X.shape
Out[19]:
In [20]:
y.head()
Out[20]:
Calculate the baseline percentage for the model based on the percentage of trips ending in the Business and Institutions area
In [21]:
#baseline model - 41.4% of values are type 4
print 'Baseline model for Business and Institutions area:', float(len(y[y == 4])) / len(y)
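The same baseline can be read off the full class distribution (a quick sketch):
# share of trips ending in each station category;
# category 4 (Business and Institutions) is the majority class, i.e. the baseline
print(y.value_counts(normalize=True).sort_index())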
Begin Logistic Regression
- Import libraries
- Select the solver (sag), since this is a multi-class regression, one versus rest (ovr)
- Set up train/test split at 33%
In [22]:
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
lr = LogisticRegression(solver='sag', multi_class='ovr')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=5)
lr.fit(X_train, y_train)
predicted = lr.predict(X_test)
print 'Multi-class Logistic Regression:', lr.score(X_test, y_test)
This score of 99.73% is far too high, and overfitting is suspected, so let's cross-validate.
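Before moving to a random forest, the logistic model itself can be cross-validated as a sanity check (a sketch; these scores were not part of the original run):
from sklearn.model_selection import cross_val_score
# 5-fold cross-validation of the same one-vs-rest logistic regression
lr_cv_scores = cross_val_score(LogisticRegression(solver='sag', multi_class='ovr'), X, y, cv=5)
print("Logistic Regression CV scores: %s" % lr_cv_scores)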
Note: the random forest parameters (n_estimators, max_depth) were tuned several times to reach the score below
In [24]:
rfc = RandomForestClassifier(n_estimators= 85, max_depth = 110)
scores = cross_val_score(rfc, X, y, cv=5)
print 'Cross Validation Score:', scores
Calculating the accuracy
In [25]:
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Classification report
In [26]:
from sklearn.metrics import classification_report
print(classification_report(y_test, predicted, target_names=['end_side_streets','end_mixed_squares','end_recreation',
'end_business_institution','end_major_shopping']))
Note the support: its total is 312337, which is 33% of the 946473-record sample and corresponds to our test split. The 127257 records in the end_business_institution class are roughly 40% of that test split, which matches the baseline arrived at before running the model.
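One way to probe these high scores further is a confusion matrix (a sketch; not part of the original analysis):
from sklearn.metrics import confusion_matrix
# rows are the true end-station categories, columns are the predictions
print(confusion_matrix(y_test, predicted))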
I posted a question on Quora about interpreting the high scores: How do I interpret the scores of a logistic regression and cross validation when they are both very high, ~99% or ~98%, after tweaking the parameters?