Hubway Capstone: Preliminary Exploration of Data
Hubway is a bike-share program collectively owned by the metro Boston municipalities of Boston, Cambridge, Somerville, and Brookline. It is operated by Motivate, which manages similar initiatives in NYC, Portland, Chicago, Washington DC, and several other metro areas in Ohio, Tennessee, and New Jersey, and is opening operations in San Francisco in June 2017. Hubway currently runs a system of 188 stations with 1,800 bikes. For this project, I investigated shared data for the months of January, May, June, July, and October of 2015 and 2016.
Of concern were the following questions:
- How do riders use the bike-share service?
- Are the bikes used as a conveyance or for recreation?
- What type of customer uses the service?
Once the database has been created and verified, first observations are made about the data's structure and characteristics, such as outliers and null values. This is also the time to consider which hypotheses can and cannot be tested.
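Before the detailed checks below, a minimal sketch of that first pass; the check_conn and check_df names are just for this sketch, and the database path and table name are assumed to be the ones used in the following cells.
# Sketch of the initial structure checks described above.
import sqlite3
import pandas as pd

check_conn = sqlite3.connect('./database/hubway_full.db')
check_df = pd.read_sql('SELECT * FROM hubway_db', con=check_conn)
check_conn.close()

print(check_df.shape)                       # rows and columns
print(check_df.isnull().sum())              # null counts per column
print(check_df['tripduration'].describe())  # quick look for extreme trip durations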
Import libraries
In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats   # scipy.stats is used for stats.mode below
Load SQL database
In [3]:
import sqlite3
hubway_db = './database/hubway_full.db'
conn = sqlite3.connect(hubway_db)
c = conn.cursor()
In [6]:
# hubway_db is just the file path string; descriptive statistics are computed on the dataframe later
pd.options.display.max_rows = 1000
Looking at outliers in tripduration that were detected on various dates
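For reference, the boxplots below flag any point beyond the whiskers as an outlier; a small sketch of that rule, assuming the whis=0.75 setting used in the plotting calls (the iqr_outliers name is just for this sketch).
# Sketch of the whisker rule the boxplots below rely on
# (whis=0.75 is carried over from the plotting calls; matplotlib's default is 1.5).
import numpy as np

def iqr_outliers(values, whis=0.75):
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - whis * iqr, q3 + whis * iqr
    return [v for v in values if v < lo or v > hi]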
October 12, 2016
In [13]:
oct_12 = pd.read_sql('SELECT * FROM hubway_db WHERE "stoptime" LIKE "2016-10-12%"', con=conn)
oct_12 = oct_12.set_index('stoptime')
oct_12 = oct_12.drop(['starttime', 'start station name', 'start station id',
'start station latitude',
'start station longitude', 'end station name',
'end station id',
'end station latitude',
'end station longitude',
'bikeid', 'usertype',
'birth year', 'gender'], axis=1)
oct_12_list = oct_12.tripduration.values.tolist()
oct_12_list
Out[13]:
Boxplot of the list (outliers hidden)
In [14]:
plt.figure()
plt.boxplot(oct_12_list, notch=False, sym='', vert=True, whis=0.75)
plt.show()
Boxplot showing the outlier points
In [15]:
plt.figure()
plt.boxplot(oct_12_list, notch=False, sym='rs', vert=True, whis=0.75)
plt.show()
January 14, 2016
In [16]:
jan_14 = pd.read_sql('SELECT * FROM hubway_db WHERE "stoptime" LIKE "2016-01-14%"', con=conn)
jan_14 = jan_14.set_index('stoptime')
jan_14 = jan_14.drop(['starttime', 'start station name', 'start station id',
'start station latitude',
'start station longitude', 'end station name',
'end station id',
'end station latitude',
'end station longitude',
'bikeid', 'usertype',
'birth year', 'gender'], axis=1)
jan_14_list = jan_14.tripduration.values.tolist()
In [18]:
plt.figure()
plt.boxplot(jan_14_list, notch=False, sym='', vert=True, whis=0.75)
plt.show()
In [27]:
plt.figure()
plt.boxplot(jan_14_list, notch=False, sym='rs', vert=True, whis=0.75)
Out[27]:
July 23, 2016
In [20]:
jul_23 = pd.read_sql('SELECT * FROM hubway_db WHERE "stoptime" LIKE "2016-07-23%"', con=conn)
jul_23 = jul_23.set_index('stoptime')
jul_23 = jul_23.drop(['starttime', 'start station name', 'start station id',
'start station latitude',
'start station longitude', 'end station name',
'end station id',
'end station latitude',
'end station longitude',
'bikeid', 'usertype',
'birth year', 'gender'], axis=1)
jul_23_list = jul_23.tripduration.values.tolist()
In [21]:
plt.figure()
plt.boxplot(jul_23_list, notch=False, sym='', vert=True, whis=0.75)
plt.show()
In [24]:
plt.figure()
plt.boxplot(jul_23_list, notch=False, sym='rs', vert=True, whis=0.75)
plt.show()
Look at descriptive statistics of the database
In [25]:
hubway_df = pd.read_sql('SELECT * FROM hubway_db', con=conn)
hubway_df.describe()
Out[25]:
Look at the bike ids
In [26]:
stats.mode(hubway_df.bikeid)
Out[26]:
The bike with id 351 is the most used, with 1,317 trips
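As a cross-check, the same answer can be read from a pandas value count; a sketch, assuming hubway_df is still in memory.
# Sketch: top bikes by trip count, confirming the mode above.
hubway_df['bikeid'].value_counts().head()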
In [28]:
most_traveled_bike_df = pd.read_sql('''SELECT * FROM hubway_db WHERE "bikeid" LIKE 351''', con = conn)
most_traveled_bike_df.head()
Out[28]:
Look at Birth Year
In [30]:
hubway_df["birth year"].unique()
Out[30]:
In [9]:
hubway_df["birth year"].value_counts()
Out[9]:
Obviously the data needs some cleaning, both for null values and for customers inputting incorrect information
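One possible cleanup, sketched without altering the working dataframe; the '\\N' placeholder comes from the unique values above, and birth_year_clean is just a name for this sketch.
# Sketch only: coerce birth year to numeric so placeholder strings become NaN.
birth_year_clean = pd.to_numeric(hubway_df['birth year'], errors='coerce')
birth_year_clean.isnull().sum()   # rows that would need imputation or dropping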
Looking at the start station id
In [31]:
hubway_df["start station id"].unique()
Out[31]:
In [32]:
stats.mode(hubway_df["start station id"])
Out[32]:
Station id 22 (South Station) is the most used for beginning trips, accounting for 32,415 trips
In [34]:
most_visited_start_df = pd.read_sql('''SELECT * FROM hubway_db WHERE "start station id" LIKE 22''', con = conn)
most_visited_start_df.head()
Out[34]:
Looking at the end station id
In [35]:
stats.mode(hubway_df["end station id"])
Out[35]:
The end station with id 22 is also the most used, ending 32,783 trips, just slightly more than the start station count.
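A sketch that lines up start and end trip counts per station in one frame, using the column names above; start_counts and end_counts are names introduced only for this sketch.
# Sketch: compare how often each station begins and ends trips.
start_counts = hubway_df['start station id'].value_counts()
end_counts = hubway_df['end station id'].value_counts()
pd.concat([start_counts, end_counts], axis=1, keys=['starts', 'ends']).sort_values('starts', ascending=False).head()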
In [36]:
most_visited_end_df = pd.read_sql('''SELECT * FROM hubway_db WHERE "end station id" LIKE 22''', con = conn)
most_visited_end_df
Out[36]:
Looking at recreational or tourist riders
In [38]:
day_trip_df = pd.read_sql('''SELECT * FROM hubway_db WHERE "end station id" LIKE "start station id"''', con = conn)
day_trip_df.describe()
Out[38]:
Looking for the recreational riders' most used start and end stations
In [39]:
stats.mode(day_trip_df["start station id"])
Out[39]:
In [40]:
stats.mode(day_trip_df["end station id"])
Out[40]:
The station most used for recreation is 58, The Esplanade, with 1,341 trips. This makes sense given the station's location.
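A quick sketch that ranks all loop-trip stations rather than taking only the mode, assuming day_trip_df from the cell above.
# Sketch: full ranking of stations where trips start and end at the same place.
day_trip_df['start station id'].value_counts().head(10)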
In [41]:
day_trip_station_df = pd.read_sql('''SELECT * FROM hubway_db WHERE "start station id" LIKE 58 AND "end station id" LIKE 58''', con = conn)
day_trip_station_df.describe()
Out[41]:
In [42]:
day_trip_station_df.head(21)
Out[42]:
In [47]:
#Cleaning of the birth year feature actually occurred during the logistic regression,
#when the years were converted to an age decile.
#hubway_df["birth year"] = hubway_df["birth year"].str.replace('''\\\\N''', '1970')
#hubway_df["birth year"] = hubway_df["birth year"].str.replace('\0','0')
#hubway_df["birth year"].unique()
In [48]:
#stats.mode(hubway_df['birth year'])
In [49]:
#np.mean(hubway_df['birth year'])
In [50]:
#if hubway_df['birth year'] <= '1974':
#hubway_df['birth year'].replace(1974)
#hubway_df["birth year"] = pd.to_numeric(hubway_df["birth year"])
Cambridge has the highest volume of traffic
In [51]:
camb_trip_station_df = pd.read_sql('''SELECT * FROM hubway_db WHERE "start station name" LIKE "%Cambridge%" OR "end station name" LIKE "%Cambridge%"''', con = conn)
camb_trip_station_df.tail()
Out[51]:
These results aren't accurate; the SQL string needs to be refined. For example, Cambridge Street in Boston appears in the results.
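One possible refinement, sketched below: keep the name match but exclude "Cambridge St" addresses. This may also drop legitimate stations on Cambridge Street in Cambridge, so it is only a starting point; refined_query and camb_refined_df are names used only in this sketch.
# Sketch: exclude street-address matches for "Cambridge St" while keeping city matches.
refined_query = '''
SELECT * FROM hubway_db
WHERE ("start station name" LIKE '%Cambridge%' OR "end station name" LIKE '%Cambridge%')
  AND "start station name" NOT LIKE '%Cambridge St%'
  AND "end station name" NOT LIKE '%Cambridge St%'
'''
camb_refined_df = pd.read_sql(refined_query, con=conn)
camb_refined_df.tail()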
Questions to investigate:
Peak usage can be tied to rush-hour commuter traffic
Involves breaking the date, which is stored as a string, down into a time feature (a sketch follows below).
Other peaks in usage involve traveling to 'squares' that are both commercial and residential
Involves setting up a classification feature for station locations (addressed in the "Creation of categories" section below)
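A minimal sketch of the time-feature step referenced above, assuming starttime parses cleanly with pandas.to_datetime; start_hour is a name used only for this sketch.
# Sketch: derive an hour-of-day feature from the starttime string to look for rush-hour peaks.
start_hour = pd.to_datetime(hubway_df['starttime']).dt.hour
start_hour.value_counts().sort_index()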
In [53]:
#conn.close()
Creation of categories
Station categories will be classified as follows (a sketch of the resulting mapping appears after the list):
- 1: Residential side streets
- 2: Squares with businesses and residential elements
- 3: Recreational and tourist areas
- 4: Businesses (large enterprise) and institutions (hospitals, colleges, transit)
- 5: Major shopping areas and plazas
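For illustration only, the mapping is expected to look like the small frame below; the two example rows and the category column name are assumptions, and the real labels live in the hand-built cat_csv.csv read in later.
# Illustrative sketch of the intended station-to-category mapping (rows are assumptions).
example_categories = pd.DataFrame({
    'start station name': ['South Station', 'The Esplanade'],
    'start station category': [4, 3],
})
example_categories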
Isolate and export list of station names
In [61]:
cat_df = pd.Series(hubway_df['start station name'].unique(), name='start station name')  # Series so it can be exported to CSV below
In [63]:
#cat_csv = cat_df.to_csv(path_or_buf='./database/cat_csv.csv')
In [64]:
cat_db = pd.read_csv('./database/cat_csv.csv')
#cat_csv = pd.DataFrame(cat_csv)
type(cat_db)
Out[64]:
Join the category list to main dataframe and create new database
In [66]:
cat_db.head()
#hubway_cats_df = hubway_df.join(cat_csv, on='start station name', how='right')
#hubway_cats_df.tail(10)
#from sqlalchemy import create_engine
#engine = create_engine(r'sqlite:///C:\Users\Owner\Documents\Capstone\database\cats.db', echo=True)  # raw string avoids backslash escapes
#cats_db = cat_csv.to_sql(name = 'cats_db', con = engine, if_exists = 'replace', index = False)
Out[66]:
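For reference, a cleaned-up sketch of the commented-out export above, assuming sqlalchemy is installed; the relative path and table name mirror the original attempt, and cat_db is the category frame read in earlier.
# Sketch: write the category table to a new SQLite database with sqlalchemy + to_sql.
from sqlalchemy import create_engine

engine = create_engine('sqlite:///./database/cats.db', echo=False)
cat_db.to_sql(name='cats_db', con=engine, if_exists='replace', index=False)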
In [35]:
#hubway_start_cats_db = './database/cats.db'
#connct = sqlite3.connect('cats_db')
Create new table and merge categories to appropriate end station names
In [36]:
#Note: DDL such as CREATE TABLE is better run with conn.execute than pd.read_sql.
#hubway_cats_db = pd.read_sql('''CREATE TABLE cats_db
#    ("start station name" VARCHAR, "start station category" INTEGER)''', con=conn)
#hubway_cats_db
In [37]:
hubway_cats_df = pd.merge(hubway_df, cat_db, on='end station name')
In [38]:
hubway_cats_df.head()
Out[38]:
Perform the same operation with the start station names
In [39]:
cat2_db = pd.read_csv('./database/cat2_csv.csv')
#cat_csv = pd.DataFrame(cat_csv)
type(cat2_db)
Out[39]:
In [40]:
cat2_db.head()
Out[40]:
In [41]:
hubway_cats2_df = pd.merge(hubway_cats_df, cat2_db, on='start station name')
In [42]:
#hubway_cats2_df.to_csv(path_or_buf='./database/hubway_cats.csv')
In [43]:
hubway_cats2_df.head()
Out[43]:
Drop unneeded columns and check the dataframe's structure
In [44]:
#hubway_cats2_df = hubway_cats2_df.drop(columns=['Unnamed: 0_x', 'Unnamed: 0_y'])  # drop returns a copy, so assign it back
hubway_cats2_df.info()
Final save and close SQL connection
In [47]:
hubway_cats2_df.to_csv(path_or_buf='./database/hubway_cats.csv')
In [46]:
conn.close()