Hubway Capstone Project-- Database Creation¶
Hubway is a bike-share program collectively owned by the metro Boston cities; Boston, Cambridge, Somerville, and Brookline. It is operated by Motivate, who manages similar initiatives in NYC, Portland, Chicago, Washington DC, and several other metro areas in Ohio, Tennessee, and New Jersey. They are opening up operations in San Francisco during the month of June, 2017. Hubway currently exists as a system of 188 stations with 18,000 bikes.- For this project, I investigated shared data for the months of January, May, June, July, and October during the years of 2015 and 2016.
- Of concern were the questions of;
- How do riders use the bike-share service?
- Are the bikes used as a conveyance or for recreation?
- What type of customer uses the service?
- How do riders use the bike-share service?
Note: the data can be found at: Bluebikes System Data | Blue Bikes Boston
Library Import
In [1]:
import numpy as np
#import scipy.stats as stats
#import seaborn as sns
#import matplotlib.pyplot as plt
import pandas as pd
import sqlite3
from pandas.io import sql
from sqlalchemy import create_engine
Using SQLAlchemy to create database
In [27]:
#SQL_STRING = '''PRAGMA encoding="UTF-8";'''
engine = create_engine('sqlite:///C:\Users\Owner\Documents\Capstone\database\hubway_full.db', echo=True)
#conn = sqlite3.connect("C:\Users\Owner\Documents\Capstone\database\hubway_full.db")
#c = conn.cursor()
#c.execute('''CREATE TABLE IF NOT EXISTS hubway_db
#(tripduration INTEGER, starttime DATETIME, stoptime DATETIME, start_station_id INTEGER, start_station_name VARCHAR,
#start_station_latitude FLOAT, start_station_longitude FLOAT, end_station_id INTEGER, end_station_name VARCHAR,
#end_station_latitude FLOAT, end_station_longitude FLOAT, bikeid INTEGER, usertype VARCHAR, birth_year INTEGER, gender INTEGER);''')
#SQL_STRING = '''PRAGMA table_info(hubway_db);'''
CSV file read and conversion to DataFrames
In [28]:
janfif = pd.read_csv('./data/201501-hubway-tripdata.csv', encoding = 'utf-8')
janfif_df = pd.DataFrame(janfif)
#janfif_df.to_sql(hubway_db, conn, if_exists='append', index=False)
mayfif = pd.read_csv('./data/201505-hubway-tripdata.csv', encoding = 'utf-8')
mayfif = pd.DataFrame(mayfif)
#mayfif.to_sql(hubway_db, conn, if_exists='append', index=False)
junfif = pd.read_csv('./data/201506-hubway-tripdata.csv', encoding = 'utf-8')
junfif = pd.DataFrame(junfif)
#junfif.to_sql(hubway_db, conn, if_exists='append', index=False)
julfif = pd.read_csv('./data/201507-hubway-tripdata.csv', encoding = 'utf-8')
julfif = pd.DataFrame(julfif)
#julfif.to_sql(hubway_db, conn, if_exists='append', index=False)
octfif = pd.read_csv('./data/201510-hubway-tripdata.csv', encoding = 'utf-8')
octfif = pd.DataFrame(octfif)
#octfif.to_sql(hubway_db, conn, if_exists='append', index=False)
jansix = pd.read_csv('./data/201601-hubway-tripdata.csv', encoding = 'utf-8')
jansix = pd.DataFrame(jansix)
#jansix.to_sql(hubway_db, conn, if_exists='append', index=False)
maysix = pd.read_csv('./data/201605-hubway-tripdata.csv', encoding = 'utf-8')
maysix = pd.DataFrame(maysix)
#maysix.to_sql(hubway_db, conn, if_exists='append', index=False)
junsix = pd.read_csv('./data/201606-hubway-tripdata.csv', encoding = 'utf-8')
junsix = pd.DataFrame(junsix)
#junsix.to_sql(hubway_db, conn, if_exists='append', index=False)
julsix = pd.read_csv('./data/201607-hubway-tripdata.csv', encoding = 'utf-8')
julsix = pd.DataFrame(julsix)
#julsix.to_sql(hubway_db, conn, if_exists='append', index=False)
octsix = pd.read_csv('./data/201610-hubway-tripdata.csv', encoding = 'utf-8')
octsix = pd.DataFrame(octsix)
#octsix.to_sql(hubway_db, conn, if_exists='append', index=False)
Checking conversion
In [29]:
janfif.head()
Out[29]:
In [30]:
janfif.info()
Concatenation of DataFrames and check
In [31]:
list_df= [janfif, mayfif, junfif, julfif, octfif, jansix, maysix, junsix, julsix, octsix]
hubway_df = pd.concat(list_df)
In [32]:
hubway_df.info()
In [33]:
hubway_df.tail()
Out[33]:
In [34]:
hubway_df.head()
Out[34]:
DataFrame to SQL table conversion
In [ ]:
hubway_df.to_sql(name = 'hubway_db', con = engine, if_exists = 'replace', index = False)
Testing table with a sort by the starting station
In [5]:
for start in c.execute('''SELECT DISTINCT "start station id", "start station name" FROM hubway_db ORDER BY "start station id"'''):
print start
In [ ]:
engine.close()
Relative to a laptop's resources, the database file is ~1.2M records 233 Mb is stored locally, but runs slowly and lags system.
No comments:
Post a Comment