Hubway Capstone Project: Executive Summary by Erik C Ellis¶
This project was completed for the Data Scientist Immersive course at General Assembly in Boston in the spring of 2017.¶
Background¶
Launched in the City of Boston in 2011, Hubway is a bike-share program collectively owned by four metro Boston cities: Boston, Cambridge, Somerville, and Brookline. It is operated by Motivate, which manages similar initiatives in NYC, Portland, Chicago, Washington DC, and several other metro areas in Ohio, Tennessee, and New Jersey. Motivate is opening operations in San Francisco in June 2017. Hubway currently exists as a system of 188 stations with 1,800 bikes.
Addendum: In the spring of 2018, Hubway in Boston was re-launched as Blue Bike in a partnership with Blue Cross Blue Shield of Massachusetts (BCBS). This partnership promises an expansion from 1,800 bikes to 3,000, and 100 new docking stations in the four municipalities. The same customer experience is preserved, and the service is still operated by Motivate. The leadership in Boston, Cambridge, Brookline, and Somerville all consider the bike-share service a success. Boston's mayor, Martin J. Walsh, said it "quickly became integral to our transportation system," and thanked BCBS for expanding the service further into Boston's many neighborhoods. BCBS has also partnered with other bike-share companies outside of Boston, such as Zagster in Salem.
- For this project, I investigated shared data for the months of January, May, June, July, and October during the years of 2015 and 2016.
- Of concern were the questions of:
- How do riders use the bike-share service?
- Are the bikes used for commuting, as a conveyance for shopping, or for recreation?
- What type of customer uses the service?
Project parts¶
Essentially the project is presented in three parts:
- Acquiring the data, building a database, and doing some preliminary exploration
- Performing Empirical Data Analysis
- Machine Learning Models
Approach¶
The approach taken was to develop a feature that was a category system to apply to the stations based on their locations.
- 1: Stations located on residential side streets
- 2: Stations located in the area's many squares that have both a commercial presence and a residential population
- 3: Stations located in recreational and tourist areas
- 4: Stations located near large enterprise businesses or institutions, such as academic, government, hospitals, or transit stations
- 5: Stations located near major shopping areas or plazas
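The category system above can be sketched as a simple lookup from station to category. This is only an illustration under stated assumptions: the station names, the `StationCategory` enum, and the `label_trips` helper are all hypothetical and are not taken from the project's code or from the Hubway data.

```python
from collections import Counter
from enum import IntEnum


class StationCategory(IntEnum):
    """The five location-based categories listed above."""
    RESIDENTIAL = 1   # residential side streets
    MIXED_SQUARE = 2  # squares with commercial and residential presence
    RECREATION = 3    # recreational and tourist areas
    INSTITUTION = 4   # large businesses, institutions, transit stations
    SHOPPING = 5      # major shopping areas or plazas


# Hypothetical station names and labels, for illustration only.
STATION_TO_CATEGORY = {
    "Example Side St": StationCategory.RESIDENTIAL,
    "Example Square": StationCategory.MIXED_SQUARE,
    "Example Riverfront Park": StationCategory.RECREATION,
    "Example Transit Hub": StationCategory.INSTITUTION,
    "Example Shopping Plaza": StationCategory.SHOPPING,
}


def label_trips(trips):
    """Return a copy of each trip dict with start/end categories added."""
    labeled = []
    for trip in trips:
        row = dict(trip)
        row["start_category"] = int(STATION_TO_CATEGORY[trip["start_station"]])
        row["end_category"] = int(STATION_TO_CATEGORY[trip["end_station"]])
        labeled.append(row)
    return labeled


# Made-up trips, not taken from the Hubway data.
trips = [
    {"start_station": "Example Side St", "end_station": "Example Transit Hub"},
    {"start_station": "Example Square", "end_station": "Example Transit Hub"},
    {"start_station": "Example Transit Hub", "end_station": "Example Shopping Plaza"},
]

labeled = label_trips(trips)
end_counts = Counter(row["end_category"] for row in labeled)
print(end_counts)
```

Counting trips by end category in this way is the kind of tally that would feed a per-category comparison of riders.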
Results¶
Empirical Data Analysis revealed an interesting picture of user behavior. In particular, end station category 4 (businesses and institutions) was very strong, giving a baseline model prediction of 40.7%, about twice the 20% expected from a one-in-five random guess. Men made up the largest share of users in this category, while women edged out other groups in the mixed squares (2) and major shopping (5) categories. Casual customers, who pay at the docking station and don't report their gender, led in category 3 (recreational and tourist areas). Additionally, most users were between twenty and thirty-five years of age. Rider demand over all of the days in the dataset was extremely high during rush hours, with the evening peak extending into the late evening. Strangely, Thursdays easily had the most usage; I have no explanation for this behavior.
In terms of the Machine Learning portion of the project, the Multi-class Logistic Regression and the Random Forest unfortunately didn't perform very well, which I believe was my own fault. Initially, when the predictors were only gender, user type, age, and end station, I got very little lift in the prediction score relative to the baseline. When I expanded the predictor set to include the start station, the start station category, and the day of the week, I got scores in the range of ~97% to ~99%. These values were clearly due to overfitting; I suspect I made the model too complex. When I populated the predictors with dummy variables, the first model was based on eleven (11) columns. Including the other features as dummies expanded this to 403 columns, so the predictor set had 946,473 rows and 403 columns. The cross-validation scores, even after tweaking the parameters, and the classification report showed the same very high values, again signaling overfitting.
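To illustrate how dummy variables widen a predictor set, here is a minimal sketch in plain Python. The encoding code used in the project is not shown in this summary, so the `dummy_column_count` and `one_hot` helpers, the feature names, and the sample rows are all assumptions made up for this example; the idea is similar in spirit to what `pandas.get_dummies` does.

```python
def dummy_column_count(rows, categorical, numeric=()):
    """Count the columns produced by one-hot encoding.

    Each categorical feature contributes one column per distinct value;
    each numeric feature contributes a single column.
    """
    width = len(numeric)
    for feature in categorical:
        width += len({row[feature] for row in rows})
    return width


def one_hot(rows, categorical, numeric=()):
    """Encode rows as lists of numbers, with columns in a fixed order."""
    levels = {f: sorted({row[f] for row in rows}) for f in categorical}
    header = list(numeric) + [f"{f}={v}" for f in categorical for v in levels[f]]
    encoded = []
    for row in rows:
        values = [row[f] for f in numeric]
        for f in categorical:
            values += [1 if row[f] == v else 0 for v in levels[f]]
        encoded.append(values)
    return header, encoded


# Made-up rows, not taken from the Hubway data.
rows = [
    {"gender": "M", "user_type": "Subscriber", "age": 31, "start_station": "A", "weekday": "Thu"},
    {"gender": "F", "user_type": "Subscriber", "age": 27, "start_station": "B", "weekday": "Fri"},
    {"gender": "M", "user_type": "Customer", "age": 40, "start_station": "C", "weekday": "Thu"},
    {"gender": "F", "user_type": "Customer", "age": 35, "start_station": "A", "weekday": "Sat"},
]

small_set = ["gender", "user_type"]
large_set = ["gender", "user_type", "start_station", "weekday"]

small_width = dummy_column_count(rows, small_set, numeric=["age"])
large_width = dummy_column_count(rows, large_set, numeric=["age"])
header, encoded = one_hot(rows, small_set, numeric=["age"])
print(small_width, large_width)
print(header)
```

Every high-cardinality feature, such as a station identifier, adds one column per distinct value, which is how a predictor set can grow from a handful of columns to several hundred.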
I plan on revisiting both portions of the project to include the smaller predictor set for comparison in the near future.
I went so far as to post a question on Quora regarding the high scores: "How do I interpret the scores of a logistic regression and cross validation when they are both very high, ~99% or ~98% after tweaking the parameters?"
However, the AdaBoost Classifier score, 57.75%, was probably more realistic. Even when using the same large dummy set and tuning the parameters (random state seed, number of trees, and KFolds), the result held steady without improvement. The result represents a percent change of 41.88% over the baseline model, or roughly 189% over random guessing (20%).
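The baseline and percent-change comparisons can be reproduced with a few lines of arithmetic. This is a sketch, assuming the baseline is the share of the most common class and that "percent change" means relative change over a reference score; the helper names and the sample labels are made up for illustration, while 57.75, 40.7, and the one-in-five random guess are the figures quoted in this summary.

```python
from collections import Counter


def majority_baseline(labels):
    """Accuracy of always predicting the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def percent_change(score, reference):
    """Relative change of score over reference, in percent."""
    return (score - reference) / reference * 100.0


# Made-up labels: 2 of 5 trips end in category 4.
labels = [4, 4, 2, 3, 5]
toy_baseline = majority_baseline(labels)

adaboost_score = 57.75  # AdaBoost score quoted in the text, in percent
baseline = 40.7         # baseline quoted in the Results section, in percent
random_guess = 100 / 5  # five equally likely categories, in percent

lift_vs_baseline = percent_change(adaboost_score, baseline)
lift_vs_random = percent_change(adaboost_score, random_guess)
print(round(lift_vs_baseline, 2), round(lift_vs_random, 2))
```

With the rounded figures quoted here, the two comparisons work out to about 41.9% over the baseline and about 188.75% over a 20% random guess.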
I feel relatively confident in predicting that any given user will be a male subscriber near thirty years of age, using the service during rush hour as a commuter to areas of large businesses and/or institutions, most likely late in the work week and in the evening.
Takeaways¶
I enjoyed this project because I have lived in the Boston area most of my life and am an urban cyclist (in fact, I was a messenger for many years), and because the bike-share program has become so visible in the landscape of daily life in the streets. I have not used the service, but may in the future because it is so convenient and its cost seems competitive relative to other bike-share services like Lime Bikes and Ant Bikes. Currently, there is high competition for space in this niche Boston marketplace; see "A bike-share border war has started in Boston."
As far as the value of the data project, I still have many questions regarding the interpretation of the results and analysis. I feel prepared to use these tools in the workplace as a Data Analyst working with data in a SaaS/PaaS environment, particularly with my technical writing skills, subject matter expertise, and business acumen. Networking, seeking new projects, working with a team, and having good mentorship are definitely my future goals.
Comments and questions can be directed to me by email at: erik@erikcellis.com