Three Pain Points of Data Science by Erik Ellis
Data science is a tremendously effective discipline for corporations and institutions, and it has become increasingly necessary for anyone who wants insight into a studied population or marketplace. But practicing data science is not without its uncertainties. Even with the prerequisite skills in statistics, technology, and domain knowledge, there are several “pain points,” or areas of concern.
Three of the many concerns within the process are:
- The data itself: its source, completeness, and structure, and how much cleaning it needs.
- Correct analysis of the data against the hypothesis, and the choice of models.
- Resources, both technological and human, including collaboration and communication with others in the workflow.
The Data
After developing a hypothesis, a data scientist needs to ask whether the chosen dataset can support or refute it. Initial questions include: “Is the set relevant?”, “How was it gathered?”, and “How well does the sample represent the total population?” A dataset’s features are an important part of testing, and care must be taken to determine whether it has the right combination of completeness and diversity.
As datasets become more democratized through data collection APIs, and as the general population’s growing use of connected devices generates ever more data, there have never been more datasets available, and the drive to collect data keeps trending upward. Much of this data arrives less structured, for example poorly timestamped or undocumented, with no data dictionary, but that also gives the data scientist more to explore as the data is cleaned.
Cleaning and munging data are the most time-consuming parts of a data scientist’s workflow, because methods of data collection vary enormously and the scientist generally has no say in how the data was actually captured. As the data is cleaned, rudimentary exploratory data analysis (EDA) can begin to reveal its nuances and peculiarities. Many data science programming and scripting tools are available, most of them open source, and whether the choice is R or Python, most offer packages and libraries to munge and model data.
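As a rough illustration, here is a minimal cleaning sketch in Python with pandas. The file name and column names ("customers.csv", "signup_date", "age", "plan") are hypothetical placeholders, not a prescription for any particular dataset.

```python
# Minimal cleaning/munging sketch with pandas; file and column names are hypothetical.
import pandas as pd

df = pd.read_csv("customers.csv")                # hypothetical raw export
df = df.drop_duplicates()                        # remove exact duplicate rows
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")  # fix messy timestamps
df["age"] = pd.to_numeric(df["age"], errors="coerce")                   # coerce bad entries to NaN
df["age"] = df["age"].fillna(df["age"].median()) # simple imputation for missing ages
df["plan"] = df["plan"].str.strip().str.lower()  # normalize categorical labels

df.info()              # quick look at types and non-null counts
print(df.describe())   # rudimentary EDA on the numeric columns
```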
The Analysis
A thorough EDA is an important first step: it establishes a baseline from which the models are then tuned for better performance on the target variable. Descriptive statistics and simple visualizations can show how features depend on one another, and new features can be engineered at this point. Such plots usually make it easy to see which relationships or categories matter, as well as to spot bias or variance in the system.
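A small EDA sketch along these lines, assuming a cleaned pandas DataFrame `df` already exists and contains hypothetical "price" and "sqft" columns:

```python
# EDA sketch: descriptive statistics, a correlation heatmap, and a simple
# engineered feature. `df`, "price", and "sqft" are assumed/hypothetical.
import matplotlib.pyplot as plt
import seaborn as sns

print(df.describe())                           # descriptive statistics per feature

corr = df.select_dtypes("number").corr()       # pairwise correlations between numeric features
sns.heatmap(corr, annot=True, cmap="coolwarm") # visualize the interdependence of features
plt.title("Feature correlations")
plt.show()

# Example of feature engineering at this stage: price per square foot
df["price_per_sqft"] = df["price"] / df["sqft"]
```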
When the time comes to model the data, it’s important to acknowledge the nature of the target variable. If it’s continuous or binary, a regression is a natural starting point. Continuous relationships, for example the price of homes over time, call for linear regression, while logistic regression estimates the category of the dependent variable (such as whether a video is liked or disliked).
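A hedged sketch of these two starting points with scikit-learn; the arrays `X`, `y_price` (continuous) and `y_liked` (0/1) are assumed to already exist:

```python
# Linear regression for a continuous target, logistic regression for a binary one.
# X, y_price, and y_liked are assumed inputs, not defined here.
from sklearn.linear_model import LinearRegression, LogisticRegression

# Continuous target, e.g. home prices
lin = LinearRegression().fit(X, y_price)
price_predictions = lin.predict(X)

# Binary target, e.g. like vs. dislike
log = LogisticRegression(max_iter=1000).fit(X, y_liked)
like_probabilities = log.predict_proba(X)[:, 1]   # probability of the "like" class
```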
Overfitting, where a model learns the training data so closely that its metrics look better than they really are, is a concern as well. The key safeguard is cross-validation: train the model on one portion of the dataset, test it on the held-out portion, and repeat across folds so that every observation serves as validation data at some point.
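A minimal cross-validation sketch with scikit-learn, again assuming a feature matrix `X` and target vector `y` already exist:

```python
# Hold-out split plus 5-fold cross-validation to help expose overfitting.
# X and y are assumed inputs.
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
# Each of the 5 folds takes a turn as the validation split
scores = cross_val_score(model, X_train, y_train, cv=5)
print(scores.mean(), scores.std())

model.fit(X_train, y_train)
print(model.score(X_test, y_test))   # performance on data the model never saw
```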
The choice of model should reflect the relationship between variables. The EDA should have revealed whether the relationship between the target variable and its predictors is linear, categorical, classifiable, or best described by clustering. As stated above, examples of the first two are housing prices over time and the like/dislike rank of a video, respectively.
With classification and clustering, there’s supervised learning and unsupervised learning, respectively. Data classification is predictive, whereas data clustering is descriptive. Some classification applications like medical diagnostics can be solved with a decision tree or random forest model. Clustering algorithms such as k-means have applications in astronomy, agriculture and market segmentation.
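A short sketch contrasting the two, using scikit-learn’s bundled demo datasets rather than any dataset from this article:

```python
# Supervised classification vs. unsupervised clustering on toy datasets.
from sklearn.datasets import load_breast_cancer, load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans

# Supervised: predict a known label (a diagnostics-style example)
X_cls, y_cls = load_breast_cancer(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_cls, y_cls)
print("classification accuracy:", clf.score(X_cls, y_cls))

# Unsupervised: describe structure without labels (a segmentation-style example)
X_clu, _ = load_iris(return_X_y=True)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_clu)
print("cluster assignments:", km.labels_[:10])
```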
Analyzing how the question will be answered, and which variables will test the hypothesis, is as important as the models that are chosen. Alongside these concerns is reproducibility: coded models need to be as generalized as possible so they can be applied to other versions of the same type of problem. This is where technological resources and human resources can impact the practice of data science.
The Resources
Many models, such as a grid search, require a lot of time and memory. Running a database with millions of records on the average laptop simply isn’t feasible, and even large enterprises can struggle with similar technology issues at scale. This is where web services come in.
Web services give the data scientist far more compute and memory. Because they are always running, they make tasks such as natural language processing (NLP) of streamed daily media practical. These services also allow models to be run in parallel and can handle large datasets, so the data scientist can train models faster, develop better models, and apply highly specialized feature engineering to the entire dataset.
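A hedged sketch of the kind of time- and memory-hungry job mentioned above: an exhaustive grid search run in parallel across all available cores, the sort of workload that benefits from a large cloud instance. `X` and `y` are assumed inputs, and the parameter grid is purely illustrative:

```python
# Parallel grid search with scikit-learn; X, y, and the grid values are assumptions.
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 10, 20],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,
    n_jobs=-1,   # train candidate models in parallel across all cores
)

search.fit(X, y)
print(search.best_params_, search.best_score_)
```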
Such services provide computing, storage, and machine learning, and are typically marketed as software, platform, or infrastructure as a service (SaaS, PaaS, or IaaS, respectively).
At other times, collaborations break down because colleagues are unaware of what a data scientist actually does in the workplace, and clear roles can be hard to define. A data scientist’s background and perspective generally sit at the crossover of mathematics, computer science, and industry or sector knowledge, usually of a particular domain. But despite these strengths, data scientists do not often make great software developers or marketers.
In the technology sector, data scientists may share a workflow closely with software developers, but the current state of data science, with its data cleaning, EDA, modeling, and reporting, is hardly automated, hence the demand for reproducibility. It may be difficult to explain the numeric values of a correlation matrix to someone in marketing, but create a colorful heatmap visualization and there may be some converts. It has become increasingly important to build data into operations and to give it adequate resources.
“The sad status quo is most analytics work is done only on a local machine, with only the outputs (such as graphs, intermediate data, trained models) shared,” says Ben Hamner, Cofounder and CTO of Kaggle, in a Quora answer. “It doesn’t need to be this way: small steps such as setting up your environment with an end-to-end workflow and reproducibility in mind, committing to version control, and linking back to the source data/code when you share results make collaboration far easier.”
Communication and collaboration in data science are vital; otherwise the big questions go unanswered and opportunities are missed. Today’s marketplace is driven by data and the intelligence derived from it. Data science, though highly specialized, is not a black-box discipline; it is an increasingly important one that calls for specific qualities in its practitioners.