Wednesday, December 12, 2018

Hubway Capstone Project-- Preliminary Exploration of Data

Hubway Capstone-- Preliminary Exploration of Data

Hubway Capstone-- Preliminary Exploration of Data

Hubway is a bike-share program collectively owned by the metro Boston cities; Boston, Cambridge, Somerville, and Brookline. It is operated by Motivate, who manages similar initiatives in NYC, Portland, Chicago, Washington DC, and several other metro areas in Ohio, Tennessee, and New Jersey. They are opening up operations in San Francisco during the month of June, 2017. Hubway currently exists as a system of 188 stations with 1,800 bikes.
  • For this project, I investigated shared data for the months of January, May, June, July, and October during the years of 2015 and 2016.
  • Of concern were the questions of;
    • How do riders use the bike-share service?
    • Are the bikes used as a conveyance or for recreation?
    • What type of customer uses the service?
Below is some preliminary exploration of the data that occured early in the project.

Once the database has been created and verified, first observations are made abut the data's structure and characteristics, such as outliers and null values. Also, at this time there's consideration into what hypotheses can be or can't be proven.

Import libraries
In [2]:
import numpy as np
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
#import sqlite3
from scipy import stats
Load SQL database
In [3]:
import sqlite3
hubway_db = './database/hubway_full.db'
conn = sqlite3.connect(hubway_db) 
c = conn.cursor()
In [6]:
#hubway_db.describe()
pd.options.display.max_rows = 1000
Looking at outliers in tripduration that were detected in various dataes
October 12, 2016
In [13]:
oct_12 = pd.read_sql('SELECT * FROM hubway_db WHERE "stoptime" LIKE "2016-10-12%"', con=conn)
oct_12 = oct_12.set_index('stoptime')
oct_12 = oct_12.drop(['starttime', 'start station name', 'start station id',
                            'start station latitude',
                            'start station longitude', 'end station name', 
                            'end station id',
                            'end station latitude',
                            'end station longitude', 
                            'bikeid', 'usertype',
                            'birth year', 'gender'], axis=1)
oct_12_list = oct_12.tripduration.values.tolist()
oct_12_list
Out[13]:
[7349465L,
 436691L,
 73506L,
 50253L,
 71101L,
 1187L,
 1394L,
 692L,
 856L,
 822L,
 271L,
 187L,
 355L,
 624L,
 346L,
 521L,
 497L,
 578L,
 346L,
 336L,
 ...
 602L,
 1230L,
 838L,
 703L,
 464L,
 319L,
 466L,
 627L,
 552L,
 735L,
 150L,
 499L,
 464L,
 628L,
 655L,
 1823L,
 196L,
 ...]
Boxplot of List
In [14]:
plt.figure()
plt.boxplot(oct_12_list, 0, '', 1, 0.75)
plt.show()
Plot for outliers
In [15]:
plt.figure()
plt.boxplot(oct_12_list, 0, 'rs', 1, 0.75)
plt.show()
January 14, 2016
In [16]:
jan_14 = pd.read_sql('SELECT * FROM hubway_db WHERE "stoptime" LIKE "2016-01-14%"', con=conn)
jan_14 = jan_14.set_index('stoptime')
jan_14 = jan_14.drop(['starttime', 'start station name', 'start station id',
                            'start station latitude',
                            'start station longitude', 'end station name', 
                            'end station id',
                            'end station latitude',
                            'end station longitude', 
                            'bikeid', 'usertype',
                            'birth year', 'gender'], axis=1)
jan_14_list = jan_14.tripduration.values.tolist()
In [18]:
plt.figure()
plt.boxplot(jan_14_list, 0, '', 1, 0.75)
plt.show()
In [27]:
plt.figure()
plt.boxplot(jan_14_list, 0, 'rs', 1, 0.75)
Out[27]:
{'boxes': [<matplotlib.lines.Line2D at 0x7f784ff7cf90>],
 'caps': [<matplotlib.lines.Line2D at 0x7f784ff53e90>,
  <matplotlib.lines.Line2D at 0x7f784ff6c510>],
 'fliers': [<matplotlib.lines.Line2D at 0x7f784ff571d0>],
 'means': [],
 'medians': [<matplotlib.lines.Line2D at 0x7f784ff6cb50>],
 'whiskers': [<matplotlib.lines.Line2D at 0x7f784ea22b50>,
  <matplotlib.lines.Line2D at 0x7f784ff53850>]}
July 23, 2016
In [20]:
jul_23 = pd.read_sql('SELECT * FROM hubway_db WHERE "stoptime" LIKE "2016-07-23%"', con=conn)
jul_23 = jul_23.set_index('stoptime')
jul_23 = jul_23.drop(['starttime', 'start station name', 'start station id',
                            'start station latitude',
                            'start station longitude', 'end station name', 
                            'end station id',
                            'end station latitude',
                            'end station longitude', 
                            'bikeid', 'usertype',
                            'birth year', 'gender'], axis=1)
jul_23_list = jul_23.tripduration.values.tolist()
In [21]:
plt.figure()
plt.boxplot(jul_23_list, 0, '', 1, 0.75)
plt.show()
<matplotlib.figure.Figure at 0xef8b7f0>
In [24]:
plt.figure()
plt.boxplot(jul_23_list, 0, 'rs', 1, 0.75)
plt.show()
Look at descriptive statistics of database
In [25]:
hubway_df = pd.read_sql('SELECT * FROM hubway_db', con=conn)
hubway_df.describe()
Out[25]:
tripduration start station id start station latitude start station longitude end station id end station latitude end station longitude bikeid gender
count 1.214424e+06 1.214424e+06 1.214424e+06 1.214424e+06 1.214424e+06 1.214424e+06 1.214424e+06 1.214424e+06 1.214424e+06
mean 1.230372e+03 7.125669e+01 4.235758e+01 -7.108546e+01 7.115549e+01 4.235760e+01 -7.108535e+01 7.958509e+02 9.813352e-01
std 2.757293e+04 4.452513e+01 4.133521e-02 6.922767e-02 4.443033e+01 4.131822e-02 6.921803e-02 4.586949e+02 6.491990e-01
min 6.100000e+01 1.000000e+00 0.000000e+00 -7.356692e+01 1.000000e+00 0.000000e+00 -7.356692e+01 1.000000e+00 0.000000e+00
25% 4.130000e+02 3.700000e+01 4.234871e+01 -7.110441e+01 3.700000e+01 4.234871e+01 -7.110441e+01 4.080000e+02 1.000000e+00
50% 6.800000e+02 6.700000e+01 4.235695e+01 -7.108592e+01 6.700000e+01 4.235714e+01 -7.108592e+01 7.970000e+02 1.000000e+00
75% 1.102000e+03 9.800000e+01 4.236610e+01 -7.106425e+01 9.800000e+01 4.236594e+01 -7.106335e+01 1.159000e+03 1.000000e+00
max 1.334659e+07 2.180000e+02 4.550509e+01 0.000000e+00 2.180000e+02 4.550509e+01 0.000000e+00 1.930000e+03 2.000000e+00
Look at the bike ids
In [26]:
stats.mode(hubway_df.bikeid)
Out[26]:
ModeResult(mode=array([351], dtype=int64), count=array([1317]))
Bike with id 351 is the most used out of 1317 trips
In [28]:
most_traveled_bike_df = pd.read_sql('''SELECT * FROM hubway_db WHERE "bikeid" LIKE 351''', con = conn)
most_traveled_bike_df.head()       
Out[28]:
tripduration starttime stoptime start station id start station name start station latitude start station longitude end station id end station name end station latitude end station longitude bikeid usertype birth year gender
0 316 2015-05-01 06:31:33 2015-05-01 06:36:49 42 Boylston St. at Arlington St. 42.352033 -71.070627 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 351 Subscriber \N 0
1 223 2015-05-01 07:26:23 2015-05-01 07:30:06 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 151 John F Fitzgerald - Surface Road at India Street 42.358155 -71.052163 351 Subscriber 1970 1
2 417 2015-05-01 09:58:49 2015-05-01 10:05:47 109 TD Garden - Causeway at Portal Park #1 42.365942 -71.060515 98 Charlestown - Warren St at Chelsea St 42.371848 -71.060292 351 Customer 1983 2
3 1150 2015-05-01 11:43:25 2015-05-01 12:02:35 98 Charlestown - Warren St at Chelsea St 42.371848 -71.060292 120 Charles St at Beacon St 42.356052 -71.069849 351 Subscriber \N 0
4 410 2015-05-01 12:34:09 2015-05-01 12:41:00 120 Charles St at Beacon St 42.356052 -71.069849 35 Franklin St. / Arch St. 42.355402 -71.058087 351 Customer 1958 1
Look at Birth Year
In [30]:
hubway_df["birth year"].unique()
Out[30]:
array([u'1984', u'1985', u'1974', u'1987', u'1986', u'1989', u'1964',
       u'1981', u'1993', u'1978', u'1988', u'1982', u'1991', u'1962',
       u'\\N', u'1977', u'1970', u'1960', u'1954', u'1980', u'1957',
       u'1990', u'1992', u'1973', u'1946', u'1967', u'1944', u'1983',
       u'1995', u'1966', u'1965', u'1948', u'1956', u'1972', u'1968',
       u'1975', u'1961', u'1979', u'1933', u'1971', u'1959', u'1963',
       u'1994', u'1947', u'1976', u'1969', u'1996', u'1953', u'1958',
       u'1949', u'1950', u'1942', u'1955', u'1900', u'1951', u'1952',
       u'1945', u'1899', u'1943', u'1997', u'1917', u'1940', u'1897',
       u'1939', u'1941', u'1937', u'1998', u'1999', u'1934', u'1901',
       u'1902', u'1886', u'1938', u'2000', u'1927'], dtype=object)
In [9]:
hubway_df["birth year"].value_counts()
Out[9]:
\N      265698
1988     54209
1989     50023
1986     49919
1990     48076
1987     47821
1984     44543
1985     43499
1991     39959
1983     37616
1992     35685
1982     34965
1993     28182
1981     27382
1980     26386
1994     23159
1978     22848
1979     20215
1995     16682
1977     15592
1974     15288
1970     14658
1975     14391
1976     13713
1972     12696
1971     12422
1996     11721
1973     11484
1969     11179
1961     11051
         ...  
1997      4370
1952      3326
1946      2668
1950      2301
1951      2196
1998      1596
1948      1537
1947      1244
1942       862
1945       822
1949       758
1944       740
1999       728
1900       331
1943       295
1934       199
2000       178
1886       132
1941        90
1937        84
1939        82
1940        54
1897        43
1933        32
1902        31
1899        29
1901        18
1917         9
1938         5
1927         1
Name: birth year, dtype: int64
Obviously there needs to be some cleaning of the data, for null values and customers inputing incorrect information
Looking at the start station id
In [31]:
hubway_df["start station id"].unique()
Out[31]:
array([115,  80,  91, 105,  88,  68, 117,  67,  75,  90, 116,  95, 110,
        73, 104, 140, 107,  74,  89, 143,  76, 141,  72,  96,  70,  87,
        97, 108, 118,  84, 142, 145,  36,   9, 138,  58, 100,  16,  17,
        47,  45,  71,  29, 109,  54,  27, 151,  46,  11,  10, 133,  41,
        42,  57,  25,  60,  12,  38,  98,  53,  21,  69, 114,  82,  40,
        94,  66,  61,  26,  78,  22,  20,  55,   4,   8,  39,  23,  31,
       122, 131,  33,   6, 130,  63,  86,  85,  32, 149,  64,  50,  81,
        43, 113, 137,  48, 120,  13,   3, 150, 134, 139, 119, 103, 112,
        99,  77,  51,   5, 152,  19, 121, 125,  44,  24,  14,  49,  35,
        52,  59, 126, 129, 111, 135, 123, 128, 124,  65,  30,  15, 136,
        93,   7, 102, 106,  37,  56,   1, 132,  79,  92, 161, 160, 162,
       167, 163, 170, 169, 174, 175, 159, 171, 178, 179, 177, 176, 180,
       189, 190, 192, 193, 186, 146, 194, 173, 195, 197, 196, 200, 199,
       153, 158, 201, 216, 209, 211, 183, 214, 215, 210, 217, 205, 208,
       212, 204, 202, 207, 203, 213, 218, 184, 185], dtype=int64)
In [32]:
stats.mode(hubway_df["start station id"])
Out[32]:
ModeResult(mode=array([22], dtype=int64), count=array([32415]))
Station id 22 (South Station) is the most used for begining trips, accounting for 32415 trips
In [34]:
most_visited_start_df = pd.read_sql('''SELECT * FROM hubway_db WHERE "start station id" LIKE 22''', con = conn)
most_visited_start_df.head()
Out[34]:
tripduration starttime stoptime start station id start station name start station latitude start station longitude end station id end station name end station latitude end station longitude bikeid usertype birth year gender
0 458 2015-05-01 06:21:41 2015-05-01 06:29:19 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 42 Boylston St. at Arlington St. 42.352033 -71.070627 1048 Subscriber 1971 1
1 293 2015-05-01 06:23:25 2015-05-01 06:28:19 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 48 Post Office Square 42.356755 -71.055407 800 Customer 1953 1
2 803 2015-05-01 06:39:45 2015-05-01 06:53:08 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 107 Ames St at Main St 42.362500 -71.088220 414 Subscriber 1967 1
3 605 2015-05-01 06:39:59 2015-05-01 06:50:05 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 135 ID Building East 42.344827 -71.028664 1120 Subscriber 1960 1
4 263 2015-05-01 06:40:30 2015-05-01 06:44:53 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 7 Fan Pier 42.353287 -71.044389 1186 Subscriber 1976 1
Looking at the end station id
In [35]:
stats.mode(hubway_df["end station id"])
Out[35]:
ModeResult(mode=array([22], dtype=int64), count=array([32783]))
End station with id 22 is also the most used, ending 32783 trips, just slightly more than the start station count.
In [36]:
most_visited_end_df = pd.read_sql('''SELECT * FROM hubway_db WHERE "end station id" LIKE 22''', con = conn)
most_visited_end_df
Out[36]:
tripduration starttime stoptime start station id start station name start station latitude start station longitude end station id end station name end station latitude end station longitude bikeid usertype birth year gender
0 938 2015-05-01 05:19:00 2015-05-01 05:34:38 90 Lechmere Station at Cambridge St / First St 42.370677 -71.076529 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 1120 Subscriber 1968 1
1 1070 2015-05-01 06:08:11 2015-05-01 06:26:01 11 Longwood Ave / Binney St 42.338629 -71.106500 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 1004 Subscriber \N 0
2 718 2015-05-01 06:20:21 2015-05-01 06:32:19 98 Charlestown - Warren St at Chelsea St 42.371848 -71.060292 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 1172 Customer 1961 1
3 337 2015-05-01 06:26:10 2015-05-01 06:31:48 4 Tremont St. at Berkeley St. 42.345392 -71.069616 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 190 Customer 1977 1
4 393 2015-05-01 06:27:02 2015-05-01 06:33:36 16 Back Bay / South End Station 42.348074 -71.076570 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 588 Customer 1988 1
5 316 2015-05-01 06:31:33 2015-05-01 06:36:49 42 Boylston St. at Arlington St. 42.352033 -71.070627 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 351 Subscriber \N 0
6 327 2015-05-01 06:33:23 2015-05-01 06:38:50 58 The Esplanade - Beacon St. at Arlington St. 42.355596 -71.072780 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 587 Customer 1986 1
7 880 2015-05-01 06:37:43 2015-05-01 06:52:23 94 Charlestown - Main St at Austin St 42.375603 -71.064608 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 980 Customer 1986 1
8 453 2015-05-01 07:03:18 2015-05-01 07:10:52 40 Lewis Wharf - Atlantic Ave. 42.363871 -71.050877 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 43 Subscriber 1957 1
9 879 2015-05-01 07:04:54 2015-05-01 07:19:34 45 Yawkey Way at Boylston St. 42.344706 -71.097515 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 450 Customer 1984 1
10 177 2015-05-01 07:05:04 2015-05-01 07:08:01 81 Boylston St / Washington St 42.352409 -71.062679 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 465 Customer 1972 1
11 832 2015-05-01 07:07:20 2015-05-01 07:21:12 113 Andrew Station - Dorchester Ave at Humboldt Pl 42.330716 -71.057044 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 246 Subscriber 1954 1
12 459 2015-05-01 07:20:48 2015-05-01 07:28:27 26 Washington St. at Waltham St. 42.341522 -71.068922 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 568 Customer 1971 1
13 343 2015-05-01 07:27:16 2015-05-01 07:32:59 20 Aquarium Station - 200 Atlantic Ave. 42.359770 -71.051601 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 27 Customer 1990 1
14 1048 2015-05-01 07:27:48 2015-05-01 07:45:17 53 Beacon St / Mass Ave 42.350851 -71.089886 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 772 Customer 1988 2
15 509 2015-05-01 07:32:03 2015-05-01 07:40:32 109 TD Garden - Causeway at Portal Park #1 42.365942 -71.060515 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 785 Customer 1973 1
16 412 2015-05-01 07:33:30 2015-05-01 07:40:23 20 Aquarium Station - 200 Atlantic Ave. 42.359770 -71.051601 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 1176 Customer 1987 2
17 1053 2015-05-01 07:38:09 2015-05-01 07:55:43 51 Washington St. at Lenox St. 42.334876 -71.079097 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 1327 Subscriber 1989 1
18 780 2015-05-01 07:41:39 2015-05-01 07:54:39 38 TD Garden - Causeway at Portal Park #2 42.366222 -71.059914 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 636 Subscriber 1970 1
19 505 2015-05-01 07:42:25 2015-05-01 07:50:51 61 Boylston at Fairfield 42.348762 -71.082383 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 483 Customer 1971 1
20 606 2015-05-01 07:42:40 2015-05-01 07:52:46 39 Washington St. at Rutland St. 42.338623 -71.074182 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 401 Customer 1988 1
21 372 2015-05-01 07:43:28 2015-05-01 07:49:40 24 Seaport Square - Seaport Blvd. at Boston Wharf 42.351205 -71.043983 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 1122 Customer 1990 1
22 500 2015-05-01 07:43:52 2015-05-01 07:52:13 47 Cross St. at Hanover St. 42.362811 -71.056067 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 324 Customer 1985 1
23 386 2015-05-01 07:44:02 2015-05-01 07:50:29 6 Cambridge St. at Joy St. 42.361174 -71.065142 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 369 Subscriber 1980 2
24 553 2015-05-01 07:44:09 2015-05-01 07:53:23 38 TD Garden - Causeway at Portal Park #2 42.366222 -71.059914 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 1017 Customer 1970 1
25 423 2015-05-01 07:47:31 2015-05-01 07:54:34 50 Boylston St / Berkeley St 42.350989 -71.073644 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 81 Customer 1974 2
26 786 2015-05-01 07:50:27 2015-05-01 08:03:34 39 Washington St. at Rutland St. 42.338623 -71.074182 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 1047 Customer 1963 1
27 484 2015-05-01 07:52:16 2015-05-01 08:00:20 47 Cross St. at Hanover St. 42.362811 -71.056067 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 570 Customer 1984 2
28 448 2015-05-01 08:03:05 2015-05-01 08:10:34 109 TD Garden - Causeway at Portal Park #1 42.365942 -71.060515 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 395 Customer 1984 1
29 619 2015-05-01 08:06:49 2015-05-01 08:17:09 38 TD Garden - Causeway at Portal Park #2 42.366222 -71.059914 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 1179 Customer 1957 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
32753 368 2016-10-31 17:03:25 2016-10-31 17:09:33 43 Rowes Wharf - Atlantic Ave 42.357143 -71.050699 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 1605 Subscriber 1973 1
32754 373 2016-10-31 17:03:54 2016-10-31 17:10:07 23 Mayor Martin J Walsh - 28 State St 42.358920 -71.057629 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 1091 Subscriber 1970 1
32755 1122 2016-10-31 17:04:03 2016-10-31 17:22:45 30 Brigham Cir / Huntington Ave 42.334073 -71.105221 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 1731 Subscriber 1994 1
32756 821 2016-10-31 17:05:24 2016-10-31 17:19:05 32 Landmark Centre 42.343912 -71.102221 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 1399 Subscriber 1986 1
32757 228 2016-10-31 17:06:47 2016-10-31 17:10:36 48 Post Office Square 42.356755 -71.055407 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 918 Subscriber 1961 1
32758 467 2016-10-31 17:07:39 2016-10-31 17:15:27 16 Back Bay / South End Station 42.348074 -71.076570 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 200 Subscriber 1987 1
32759 192 2016-10-31 17:07:56 2016-10-31 17:11:08 81 Boylston St / Washington St 42.352409 -71.062679 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 560 Subscriber 1957 1
32760 290 2016-10-31 17:08:04 2016-10-31 17:12:55 64 Congress / Sleeper 42.351100 -71.049600 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 1681 Subscriber 1953 1
32761 348 2016-10-31 17:08:55 2016-10-31 17:14:44 40 Lewis Wharf - Atlantic Ave. 42.363871 -71.050877 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 256 Subscriber 1992 1
32762 654 2016-10-31 17:12:01 2016-10-31 17:22:55 136 ID Building West 42.344796 -71.031614 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 1246 Subscriber 1957 2
32763 280 2016-10-31 17:12:08 2016-10-31 17:16:48 35 Franklin St. / Arch St. 42.355402 -71.058087 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 74 Customer \N 0
32764 834 2016-10-31 17:12:30 2016-10-31 17:26:25 12 Ruggles Station / Columbus Ave. 42.335911 -71.088496 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 1280 Subscriber 1966 1
32765 174 2016-10-31 17:15:06 2016-10-31 17:18:01 64 Congress / Sleeper 42.351100 -71.049600 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 252 Subscriber 1986 1
32766 772 2016-10-31 17:15:56 2016-10-31 17:28:48 136 ID Building West 42.344796 -71.031614 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 963 Customer \N 0
32767 569 2016-10-31 17:22:46 2016-10-31 17:32:15 36 Boston Public Library - 700 Boylston St. 42.349673 -71.077303 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 920 Subscriber 1986 1
32768 410 2016-10-31 17:26:14 2016-10-31 17:33:04 136 ID Building West 42.344796 -71.031614 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 1129 Subscriber 1961 1
32769 390 2016-10-31 17:31:24 2016-10-31 17:37:54 47 Cross St. at Hanover St. 42.362811 -71.056067 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 1883 Subscriber 1960 1
32770 1028 2016-10-31 17:31:46 2016-10-31 17:48:55 14 HMS / HSPH - Ave. Louis Pasteur at Longwood Ave. 42.337171 -71.102797 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 1474 Subscriber 1992 1
32771 636 2016-10-31 17:41:00 2016-10-31 17:51:36 61 Boylston at Fairfield 42.348762 -71.082383 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 928 Subscriber 1973 1
32772 141 2016-10-31 17:41:35 2016-10-31 17:43:57 192 Purchase St at Pearl St 42.354686 -71.053292 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 1023 Subscriber 1973 1
32773 231 2016-10-31 17:41:48 2016-10-31 17:45:39 23 Mayor Martin J Walsh - 28 State St 42.358920 -71.057629 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 851 Subscriber 1951 1
32774 619 2016-10-31 17:49:01 2016-10-31 17:59:20 21 Prudential Center / Belvidere 42.345959 -71.082578 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 208 Subscriber 1988 1
32775 61475 2016-10-31 18:09:20 2016-11-01 11:13:55 13 Boston Medical Center - East Concord at Harri... 42.336437 -71.073089 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 762 Subscriber 1980 2
32776 493 2016-10-31 18:18:07 2016-10-31 18:26:20 135 ID Building East 42.344827 -71.028664 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 1142 Subscriber 1956 1
32777 216 2016-10-31 19:17:15 2016-10-31 19:20:52 44 Faneuil Hall - Union St. at North St. 42.360613 -71.057452 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 216 Subscriber 1960 1
32778 1027 2016-10-31 19:41:00 2016-10-31 19:58:08 134 New Balance Store - Boylston at Dartmouth 42.350413 -71.076550 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 151 Subscriber 1981 2
32779 512 2016-10-31 20:16:37 2016-10-31 20:25:10 39 Washington St. at Rutland St. 42.338493 -71.074019 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 1023 Subscriber 1984 1
32780 173 2016-10-31 20:43:23 2016-10-31 20:46:17 64 Congress / Sleeper 42.351100 -71.049600 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 378 Subscriber 1963 1
32781 391 2016-10-31 21:06:04 2016-10-31 21:12:35 40 Lewis Wharf - Atlantic Ave. 42.363871 -71.050877 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 1303 Subscriber 1986 1
32782 401 2016-10-31 21:53:45 2016-10-31 22:00:27 151 John F Fitzgerald - Surface Road at India Street 42.358155 -71.052163 22 South Station - 700 Atlantic Ave. 42.352175 -71.055547 849 Subscriber 1972 2
32783 rows × 15 columns
Looking at recreational or tourist riders
In [38]:
day_trip_df = pd.read_sql('''SELECT * FROM hubway_db WHERE "end station id" LIKE "start station id"''', con = conn)
day_trip_df.describe()
Out[38]:
tripduration start station id start station latitude start station longitude end station id end station latitude end station longitude bikeid gender
count 3.910100e+04 39101.000000 39101.000000 39101.000000 39101.000000 39101.000000 39101.000000 39101.000000 39101.000000
mean 2.793716e+03 75.742129 42.357064 -71.085892 75.742129 42.357064 -71.085892 788.698729 0.549142
std 1.090920e+04 46.215144 0.023174 0.030698 46.215144 0.023174 0.030698 458.357826 0.686643
min 6.100000e+01 1.000000 42.303469 -73.566921 1.000000 42.303469 -73.566921 1.000000 0.000000
25% 6.170000e+02 41.000000 42.348203 -71.108279 41.000000 42.348203 -71.108279 401.000000 0.000000
50% 1.419000e+03 67.000000 42.356052 -71.085918 67.000000 42.356052 -71.085918 788.000000 0.000000
75% 2.947000e+03 107.000000 42.366222 -71.063348 107.000000 42.366222 -71.063348 1151.000000 1.000000
max 1.126543e+06 218.000000 45.505086 -71.006212 218.000000 45.505086 -71.006212 1930.000000 2.000000
Looking for recreational rider most used start station and end station
In [39]:
stats.mode(day_trip_df["start station id"])
Out[39]:
ModeResult(mode=array([58], dtype=int64), count=array([1341]))
In [40]:
stats.mode(day_trip_df["end station id"])
Out[40]:
ModeResult(mode=array([58], dtype=int64), count=array([1341]))
The station most used for recreation is 58, The Esplanade, for 1341 trips. This makes sense given the station location
In [41]:
day_trip_station_df = pd.read_sql('''SELECT * FROM hubway_db WHERE "start station id" LIKE 58 AND "end station id" LIKE 58''', con = conn)
day_trip_station_df.describe()
Out[41]:
tripduration start station id start station latitude start station longitude end station id end station latitude end station longitude bikeid gender
count 1341.000000 1341.0 1.341000e+03 1.341000e+03 1341.0 1.341000e+03 1.341000e+03 1341.000000 1341.000000
mean 3191.472782 58.0 4.235560e+01 -7.107278e+01 58.0 4.235560e+01 -7.107278e+01 785.177479 0.234154
std 7800.679654 0.0 8.103209e-13 1.265238e-12 0.0 8.103209e-13 1.265238e-12 439.940296 0.539815
min 61.000000 58.0 4.235560e+01 -7.107278e+01 58.0 4.235560e+01 -7.107278e+01 3.000000 0.000000
25% 1406.000000 58.0 4.235560e+01 -7.107278e+01 58.0 4.235560e+01 -7.107278e+01 425.000000 0.000000
50% 2412.000000 58.0 4.235560e+01 -7.107278e+01 58.0 4.235560e+01 -7.107278e+01 794.000000 0.000000
75% 3836.000000 58.0 4.235560e+01 -7.107278e+01 58.0 4.235560e+01 -7.107278e+01 1143.000000 0.000000
max 267608.000000 58.0 4.235560e+01 -7.107278e+01 58.0 4.235560e+01 -7.107278e+01 1928.000000 2.000000
In [42]:
day_trip_station_df.head(21)
Out[42]:
tripduration starttime stoptime start station id start station name start station latitude start station longitude end station id end station name end station latitude end station longitude bikeid usertype birth year gender
0 6780 2015-05-01 16:29:17 2015-05-01 18:22:18 58 The Esplanade - Beacon St. at Arlington St. 42.355596 -71.07278 58 The Esplanade - Beacon St. at Arlington St. 42.355596 -71.07278 133 Subscriber \N 0
1 1098 2015-05-02 08:02:20 2015-05-02 08:20:38 58 The Esplanade - Beacon St. at Arlington St. 42.355596 -71.07278 58 The Esplanade - Beacon St. at Arlington St. 42.355596 -71.07278 708 Customer 1983 1
2 505 2015-05-02 10:24:29 2015-05-02 10:32:54 58 The Esplanade - Beacon St. at Arlington St. 42.355596 -71.07278 58 The Esplanade - Beacon St. at Arlington St. 42.355596 -71.07278 1289 Customer 1972 1
3 1781 2015-05-02 10:48:07 2015-05-02 11:17:48 58 The Esplanade - Beacon St. at Arlington St. 42.355596 -71.07278 58 The Esplanade - Beacon St. at Arlington St. 42.355596 -71.07278 621 Subscriber \N 0
4 2967 2015-05-02 12:16:08 2015-05-02 13:05:36 58 The Esplanade - Beacon St. at Arlington St. 42.355596 -71.07278 58 The Esplanade - Beacon St. at Arlington St. 42.355596 -71.07278 1319 Subscriber \N 0
5 434 2015-05-02 12:17:57 2015-05-02 12:25:12 58 The Esplanade - Beacon St. at Arlington St. 42.355596 -71.07278 58 The Esplanade - Beacon St. at Arlington St. 42.355596 -71.07278 1221 Customer 1972 1
6 3562 2015-05-02 12:45:02 2015-05-02 13:44:25 58 The Esplanade - Beacon St. at Arlington St. 42.355596 -71.07278 58 The Esplanade - Beacon St. at Arlington St. 42.355596 -71.07278 600 Subscriber \N 0
7 3368 2015-05-02 12:48:16 2015-05-02 13:44:25 58 The Esplanade - Beacon St. at Arlington St. 42.355596 -71.07278 58 The Esplanade - Beacon St. at Arlington St. 42.355596 -71.07278 639 Subscriber \N 0
8 2759 2015-05-02 14:53:51 2015-05-02 15:39:50 58 The Esplanade - Beacon St. at Arlington St. 42.355596 -71.07278 58 The Esplanade - Beacon St. at Arlington St. 42.355596 -71.07278 639 Subscriber \N 0
9 1475 2015-05-02 15:15:18 2015-05-02 15:39:53 58 The Esplanade - Beacon St. at Arlington St. 42.355596 -71.07278 58 The Esplanade - Beacon St. at Arlington St. 42.355596 -71.07278 600 Subscriber \N 0
10 3542 2015-05-02 15:52:00 2015-05-02 16:51:03 58 The Esplanade - Beacon St. at Arlington St. 42.355596 -71.07278 58 The Esplanade - Beacon St. at Arlington St. 42.355596 -71.07278 989 Subscriber \N 0
11 4213 2015-05-02 15:59:23 2015-05-02 17:09:37 58 The Esplanade - Beacon St. at Arlington St. 42.355596 -71.07278 58 The Esplanade - Beacon St. at Arlington St. 42.355596 -71.07278 600 Subscriber \N 0
12 4186 2015-05-02 16:00:10 2015-05-02 17:09:57 58 The Esplanade - Beacon St. at Arlington St. 42.355596 -71.07278 58 The Esplanade - Beacon St. at Arlington St. 42.355596 -71.07278 614 Subscriber \N 0
13 3903 2015-05-02 16:04:43 2015-05-02 17:09:47 58 The Esplanade - Beacon St. at Arlington St. 42.355596 -71.07278 58 The Esplanade - Beacon St. at Arlington St. 42.355596 -71.07278 327 Subscriber \N 0
14 3876 2015-05-02 16:05:10 2015-05-02 17:09:47 58 The Esplanade - Beacon St. at Arlington St. 42.355596 -71.07278 58 The Esplanade - Beacon St. at Arlington St. 42.355596 -71.07278 32 Subscriber \N 0
15 3080 2015-05-03 07:59:38 2015-05-03 08:50:58 58 The Esplanade - Beacon St. at Arlington St. 42.355596 -71.07278 58 The Esplanade - Beacon St. at Arlington St. 42.355596 -71.07278 311 Customer 1952 2
16 847 2015-05-03 08:32:18 2015-05-03 08:46:26 58 The Esplanade - Beacon St. at Arlington St. 42.355596 -71.07278 58 The Esplanade - Beacon St. at Arlington St. 42.355596 -71.07278 1125 Customer 1983 1
17 7012 2015-05-03 11:20:55 2015-05-03 13:17:48 58 The Esplanade - Beacon St. at Arlington St. 42.355596 -71.07278 58 The Esplanade - Beacon St. at Arlington St. 42.355596 -71.07278 960 Subscriber \N 0
18 7029 2015-05-03 11:21:05 2015-05-03 13:18:15 58 The Esplanade - Beacon St. at Arlington St. 42.355596 -71.07278 58 The Esplanade - Beacon St. at Arlington St. 42.355596 -71.07278 252 Subscriber \N 0
19 3270 2015-05-03 12:17:23 2015-05-03 13:11:53 58 The Esplanade - Beacon St. at Arlington St. 42.355596 -71.07278 58 The Esplanade - Beacon St. at Arlington St. 42.355596 -71.07278 425 Subscriber \N 0
20 3257 2015-05-03 12:17:37 2015-05-03 13:11:55 58 The Esplanade - Beacon St. at Arlington St. 42.355596 -71.07278 58 The Esplanade - Beacon St. at Arlington St. 42.355596 -71.07278 760 Subscriber \N 0
In [47]:
#CLeaning of the birth year feature actually occured during the logistic regression
#when the years were converted to an age decile 


#hubway_df["birth year"] = hubway_df["birth year"].str.replace('''\\\\N''', '1970')
#hubway_df["birth year"] = hubway_df["birth year"].str.replace('\0','0')
#hubway_df["birth year"].unique()
In [48]:
#stats.mode(hubway_df['birth year'])
In [49]:
#np.mean(hubway_df['birth year'])
In [50]:
#if hubway_df['birth year'] <= '1974':
    #hubway_df['birth year'].replace(1974)
#hubway_df["birth year"] = pd.to_numeric(hubway_df["birth year"])

Cambridge has the highest volume of traffic

In [51]:
camb_trip_station_df = pd.read_sql('''SELECT * FROM hubway_db WHERE "start station name" LIKE "%Cambridge%" OR "end station name" LIKE "%Cambridge%"''', con = conn)
camb_trip_station_df.tail()
Out[51]:
tripduration starttime stoptime start station id start station name start station latitude start station longitude end station id end station name end station latitude end station longitude bikeid usertype birth year gender
190410 750 2016-10-31 22:48:16 2016-10-31 23:00:46 115 Porter Square Station 42.387995 -71.119084 96 Cambridge Main Library at Broadway / Trowbridg... 42.373379 -71.111075 1796 Subscriber 1990 2
190411 328 2016-10-31 23:19:55 2016-10-31 23:25:23 43 Rowes Wharf - Atlantic Ave 42.357143 -71.050699 6 Cambridge St. at Joy St. 42.361174 -71.065142 1139 Subscriber 1984 1
190412 396 2016-10-31 23:34:19 2016-10-31 23:40:56 90 Lechmere Station at Cambridge St / First St 42.370677 -71.076529 141 Kendall Street 42.363560 -71.082168 1738 Subscriber 1965 1
190413 625 2016-10-31 23:57:52 2016-11-01 00:08:18 10 B.U. Central - 725 Comm. Ave. 42.350406 -71.108279 8 Union Square - Brighton Ave. at Cambridge St. 42.353334 -71.137313 1158 Subscriber 1995 1
190414 184 2016-10-31 23:58:51 2016-11-01 00:01:56 139 Dana Park 42.361589 -71.107437 76 Central Sq Post Office / Cambridge City Hall a... 42.366426 -71.105495 603 Subscriber 1990 1
These results aren't accurate, SQL string needs to be refined-- for exapmle Cambridge Street in Boston is in the results

Qustions to investigate:

Peak usage can be tied to rush hour commuter usage

Involves breaking date, which is a string down to a time feature.

Other peaks in usage involve traveling to 'squares' that are both commercial and residential

Involves setting up a classification feature for locations of stations

A rider/customer profile can be predicted

Involves:
  • Logistic regression
  • Random forest
  • Adaboost
In [53]:
#conn.close()

Creation of categories

Categories for station will be classified as follows:
  • 1: Residential side streets
  • 2: Squares with businesses and residential elements
  • 3: Recreational and tourist areas
  • 4: Businesses (large enterprise) and institutions (hospitals, colleges, transit)
  • 5: Major shopping areas and plazas
Isolate and export list of station names
In [61]:
cat_df = hubway_df['start station name'].unique()
In [63]:
#cat_csv = cat_df.to_csv(path_or_buf='./database/cat_csv.csv')
In [64]:
cat_db = pd.read_csv('./database/cat_csv.csv')
#cat_csv = pd.DataFrame(cat_csv)
type(cat_db)
Out[64]:
pandas.core.frame.DataFrame
Join the category list to main dataframe and create new database
In [66]:
cat_db.head()

#hubway_cats_df = hubway_df.join(cat_csv, on='start station name', how='right')
#hubway_cats_df.tail(10)
#from sqlalchemy import create_engine

#engine = create_engine('sqlite:///C:\Users\Owner\Documents\Capstone\database\cats.db', echo=True)

#cats_db = cat_csv.to_sql(name = 'cats_db', con = engine, if_exists = 'replace', index = False)
Out[66]:
Unnamed: 0 end station name end station category
0 0 Porter Square Station 5
1 1 MIT Stata Center at Vassar St / Main St 4
2 2 One Kendall Square at Hampshire St / Portland St 4
3 3 Lower Cambridgeport at Magazine St/Riverside Rd 5
4 4 Inman Square at Vellucci Plaza / Hampshire St 2
In [35]:
#hubway_start_cats_db = './database/cats.db'
#connct = sqlite3.connect('cats_db') 
Create new table and merge categories to appropriate end station names
In [36]:
#hubway_cats_db = pd.read_sql('CREATE TABLE hubway_full.cats_db 
    #("start station name" VARCHAR "start station catagory INTEGER")', con=conn)
#hubway_cats_db
In [37]:
hubway_cats_df = pd.merge(hubway_df, cat_db, on='end station name')
In [38]:
hubway_cats_df.head()
Out[38]:
tripduration starttime stoptime start station id start station name start station latitude start station longitude end station id end station name end station latitude end station longitude bikeid usertype birth year gender Unnamed: 0 end station category
0 542 2015-01-01 00:21:44 2015-01-01 00:30:47 115 Porter Square Station 42.387995 -71.119084 96 Cambridge Main Library at Broadway / Trowbridg... 42.373379 -71.111075 277 Subscriber 1984 1 23 4
1 432 2015-01-01 00:53:46 2015-01-01 01:00:58 115 Porter Square Station 42.387995 -71.119084 96 Cambridge Main Library at Broadway / Trowbridg... 42.373379 -71.111075 1307 Subscriber 1987 1 23 4
2 589 2015-01-01 13:57:15 2015-01-01 14:07:05 72 One Broadway / Kendall Sq at Main St / 3rd St 42.362613 -71.084105 96 Cambridge Main Library at Broadway / Trowbridg... 42.373379 -71.111075 488 Subscriber \N 0 23 4
3 455 2015-01-01 17:42:12 2015-01-01 17:49:47 95 Cambridge St - at Columbia St / Webster Ave 42.372969 -71.094445 96 Cambridge Main Library at Broadway / Trowbridg... 42.373379 -71.111075 992 Customer 1992 2 23 4
4 672 2015-01-02 11:50:44 2015-01-02 12:01:56 74 Harvard Square at Mass Ave/ Dunster 42.373268 -71.118579 96 Cambridge Main Library at Broadway / Trowbridg... 42.373379 -71.111075 883 Subscriber \N 0 23 4
Perform same operation with start station names
In [39]:
cat2_db = pd.read_csv('./database/cat2_csv.csv')
#cat_csv = pd.DataFrame(cat_csv)
type(cat2_db)
Out[39]:
pandas.core.frame.DataFrame
In [40]:
cat2_db.head()
Out[40]:
Unnamed: 0 start station name start station category
0 0 Porter Square Station 5
1 1 MIT Stata Center at Vassar St / Main St 4
2 2 One Kendall Square at Hampshire St / Portland St 4
3 3 Lower Cambridgeport at Magazine St/Riverside Rd 5
4 4 Inman Square at Vellucci Plaza / Hampshire St 2
In [41]:
hubway_cats2_df = pd.merge(hubway_cats_df, cat2_db, on='start station name')
In [42]:
#hubway_cats2_df.to_csv(path_or_buf='./database/hubway_cats.csv')
In [43]:
hubway_cats2_df.head()
Out[43]:
tripduration starttime stoptime start station id start station name start station latitude start station longitude end station id end station name end station latitude end station longitude bikeid usertype birth year gender Unnamed: 0_x end station category Unnamed: 0_y start station category
0 542 2015-01-01 00:21:44 2015-01-01 00:30:47 115 Porter Square Station 42.387995 -71.119084 96 Cambridge Main Library at Broadway / Trowbridg... 42.373379 -71.111075 277 Subscriber 1984 1 23 4 0 5
1 432 2015-01-01 00:53:46 2015-01-01 01:00:58 115 Porter Square Station 42.387995 -71.119084 96 Cambridge Main Library at Broadway / Trowbridg... 42.373379 -71.111075 1307 Subscriber 1987 1 23 4 0 5
2 580 2015-01-04 14:29:05 2015-01-04 14:38:45 115 Porter Square Station 42.387995 -71.119084 96 Cambridge Main Library at Broadway / Trowbridg... 42.373379 -71.111075 673 Subscriber 1991 1 23 4 0 5
3 754 2015-01-08 16:17:04 2015-01-08 16:29:39 115 Porter Square Station 42.387995 -71.119084 96 Cambridge Main Library at Broadway / Trowbridg... 42.373379 -71.111075 1092 Customer 1961 2 23 4 0 5
4 667 2015-01-10 11:40:49 2015-01-10 11:51:57 115 Porter Square Station 42.387995 -71.119084 96 Cambridge Main Library at Broadway / Trowbridg... 42.373379 -71.111075 772 Subscriber 1980 1 23 4 0 5
Drop unneeded columns and check dataframe's structure
In [44]:
#hubway_cats2_df.drop('Unnamed: 0_x', 1)
#hubway_cats2_df.drop('Unnamed: 0_y', 1)
hubway_cats2_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1214424 entries, 0 to 1214423
Data columns (total 19 columns):
tripduration               1214424 non-null int64
starttime                  1214424 non-null object
stoptime                   1214424 non-null object
start station id           1214424 non-null int64
start station name         1214424 non-null object
start station latitude     1214424 non-null float64
start station longitude    1214424 non-null float64
end station id             1214424 non-null int64
end station name           1214424 non-null object
end station latitude       1214424 non-null float64
end station longitude      1214424 non-null float64
bikeid                     1214424 non-null int64
usertype                   1214424 non-null object
birth year                 1214424 non-null object
gender                     1214424 non-null int64
Unnamed: 0_x               1214424 non-null int64
end station category       1214424 non-null int64
Unnamed: 0_y               1214424 non-null int64
start station category     1214424 non-null int64
dtypes: float64(4), int64(9), object(6)
memory usage: 185.3+ MB
Final save and close SQL connection
In [47]:
hubway_cats2_df.to_csv(path_or_buf='./database/hubway_cats.csv')
In [46]:
conn.close()
In [ ]:
 
Back to Executive Summary

No comments:

Post a Comment