The Data Science Process
The data science process can be understood by asking a few questions at each stage:
Frame the problem: Who is your client? What exactly is the client asking you to solve? How can you translate their ambiguous request into a concrete, well-defined problem?
Collect the raw data needed to solve the problem: Is this data already available? If so, what parts of the data are useful? If not, what more data do you need? What kind of resources (time, money, infrastructure) would it take to collect this data in a usable form?
Process the data (data wrangling): Real, raw data is rarely usable out of the box. There are errors in data collection, corrupt records, missing values, and many other challenges you will have to manage. You will first need to clean the data to convert it to a form that you can further analyze.
Explore the data: Once you have cleaned the data, you have to understand the information contained within at a high level. What kinds of obvious trends or correlations do you see in the data? What are the high-level characteristics and are any of them more significant than others?
Perform in-depth analysis (machine learning, statistical models, algorithms): This step is usually the meat of your project, where you apply all the cutting-edge machinery of data analysis to unearth high-value insights and predictions.
Communicate results of the analysis: All the analysis and technical results that you come up with are of little value unless you can explain to your stakeholders what they mean, in a way that’s comprehensible and compelling. Data storytelling is a critical and underrated skill that you will build and use here.
1. Define the Business Objective
Step one of the data analysis process should be to state and understand the business objective.
This can be stated as simply as “we need to increase sales or increase revenues.”
Then, through discussions with business stakeholders such as executives, product management, sales, and marketing, the objective should become more specific and actionable. From “increase sales” it may become: “find the best product to offer customers based on their buying history.” The second statement is more specific and actionable and aligns with “increasing sales.”
This objective may be even further refined into very specific statements that lend themselves to analytical solutions.
2. Source and Collect Data
The second step is data sourcing and collection. The goal is to find data that is relevant to the problem and can support an analytical solution to the stated objective. This step involves reviewing existing data sources and finding out whether it is necessary to collect new data. It may involve any number of tasks to get the data in hand, such as querying databases, scraping data from data streams, submitting requests to other departments, or searching for third-party data sources.
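As one illustration of querying a database to get data in hand, here is a minimal sketch using Python's built-in sqlite3 module. The `orders` table, its columns, and the rows are hypothetical and exist only to make the example self-contained; a real project would connect to an existing data source instead.

```python
import sqlite3

# In-memory database standing in for an existing company data source (hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "widget", 20.0), (1, "gadget", 35.0), (2, "widget", 20.0)],
)

# Pull a per-customer buying history summary for later analysis.
rows = conn.execute(
    "SELECT customer_id, COUNT(*) AS n_orders, SUM(amount) AS total "
    "FROM orders GROUP BY customer_id ORDER BY customer_id"
).fetchall()
conn.close()

for customer_id, n_orders, total in rows:
    print(customer_id, n_orders, total)
```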
3. Process and Clean Data
In step three of the data analysis process, the data collected is processed and verified. Raw data must be converted into a usable format and this often requires parsing, transforming, and encoding. This is a good time to look for data errors, missing data, or extreme outliers. Basic statistical summary reports and charts can help reveal any serious issues or gaps in the data. How to fix the issues will depend on the type of problem and will likely need to be considered case-by-case, at least at first. Over time, company protocols may be developed for specific data issues. Especially in a new data science solution, the data almost always needs a little repair work.
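A basic summary report of the kind mentioned above could be sketched as follows. The `raw_sales` values are made up, `None` is assumed to mark a missing value, and the 1.5 × IQR rule is just one common convention for flagging extreme values, not the only option.

```python
from statistics import median, quantiles

# Hypothetical raw records; None marks a missing value.
raw_sales = [120.0, 135.5, None, 128.0, 9999.0, 131.2, None, 125.4, 122.3, 129.9]

def summarize(values):
    """Return counts and basic summary statistics, ignoring missing values."""
    present = [v for v in values if v is not None]
    return {
        "n_total": len(values),
        "n_missing": len(values) - len(present),
        "min": min(present),
        "median": median(present),
        "max": max(present),
    }

def flag_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    present = [v for v in values if v is not None]
    q1, _, q3 = quantiles(present, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in present if v < low or v > high]

print(summarize(raw_sales))
print(flag_outliers(raw_sales))
```

Whether a flagged value is an error or a genuine extreme still has to be decided case by case, as described above.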
4. Perform Exploratory Data Analysis (EDA)
In the exploratory data analysis step, the data is examined carefully for possible logical groupings and hidden relationships. Basic statistical methods and graphs can be used, as well as more advanced methods like clustering, principal component analysis, or other dimension reduction methods.
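One of the basic statistical methods that can surface relationships is a pairwise correlation check. The sketch below computes Pearson correlations by hand; the column names and values are hypothetical, and correlation is only a first look that captures linear relationships.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Hypothetical cleaned columns.
columns = {
    "ad_spend": [10, 20, 30, 40, 50],
    "sales": [12, 24, 33, 46, 55],
    "returns": [5, 3, 6, 2, 4],
}

# Compare every pair of columns and print the strength of the linear relationship.
for name_a, name_b in combinations(columns, 2):
    r = pearson(columns[name_a], columns[name_b])
    print(f"{name_a} vs {name_b}: r = {r:.2f}")
```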
5. Select, Build, and Test Models
The next step after exploratory data analysis is model selection, building, and testing. In this step, the analytical approach is put together and tested.
A few considerations will help select one or more appropriate statistical or machine learning models:
- What are the data types? Categorical, ordered, continuous, or mixed.
- Is there a time index to consider?
- Is the response multivariate?
- Are there rules and constraints that need to be incorporated into the model?
- What models have others used for similar problems?
With a few candidate models selected, the next step is model building, testing, and tuning. In this step the models are configured, validated, and fine-tuned to get better accuracy.
For model validation, a very popular approach is to train the model on one set of data and then, using the trained or fitted model, evaluate its predictive ability on a separate set of data. Through the train-validate-test approach, the best-performing models and configurations can be selected.
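The train-validate-test idea can be sketched in a few lines. Everything here is an illustrative assumption: the synthetic data, the 60/20/20 split, the two toy candidate models (a constant mean and a fitted line), and mean squared error as the score. The pattern is what matters: fit on the training set, choose between candidates on the validation set, and report the chosen model's error on the untouched test set.

```python
import random

def split(rows, frac_train=0.6, frac_val=0.2, seed=0):
    """Shuffle a copy of the rows and cut it into train, validation, and test sets."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * frac_train)
    n_val = int(len(rows) * frac_val)
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]

def fit_mean(train):
    """Baseline model: always predict the mean response seen in training."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def fit_line(train):
    """Least-squares line through the training points."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
             / sum((x - mean_x) ** 2 for x, _ in train))
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def mse(model, rows):
    """Mean squared prediction error of a model on the given rows."""
    return sum((model(x) - y) ** 2 for x, y in rows) / len(rows)

# Hypothetical data: y is roughly 2x + 1 with a small deterministic wobble.
data = [(x, 2 * x + 1 + ((x * 7) % 5 - 2) * 0.1) for x in range(50)]

train, val, test = split(data)
candidates = {"mean": fit_mean(train), "line": fit_line(train)}
val_scores = {name: mse(model, val) for name, model in candidates.items()}
best = min(val_scores, key=val_scores.get)   # select on validation data only
test_score = mse(candidates[best], test)     # report on data never used so far
print(best, test_score)
```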
6. Deploy Models
After selecting, building, and tuning models, the next step is model deployment. The goal of model deployment is to produce outputs that lead to a decision or action.
In a common scenario, model predictions and other variables are inputs to an optimization problem. The solution to that problem produces raw outputs that must be translated and communicated to business experts and decision-makers. If the recommendations make sense from their perspective, they can decide to put them into play.
Here are some examples of what those decisions might look like after evaluating and translating model outputs:
- Raise price
- Launch the promotion
- Change the policy
- Change the mixture
In a data science application, model deployment is often automated while still allowing analyst users to override and influence the model’s recommendations.
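The scenario described in this step, where model predictions feed an optimization problem and an analyst can still override the result, might look like the following minimal sketch. The demand function is a hypothetical stand-in for a fitted model, the candidate prices are invented, and maximizing revenue over a short list of prices is a deliberately simple optimization.

```python
def predicted_demand(price):
    """Stand-in for a fitted model's prediction (hypothetical linear demand curve)."""
    return max(0.0, 1000.0 - 50.0 * price)

def recommend_price(current_price, candidate_prices, override=None):
    """Pick the candidate price with the highest predicted revenue.

    An analyst-supplied override, if given, takes precedence over the model.
    """
    if override is not None:
        return override, "Analyst override"
    best = max(candidate_prices, key=lambda p: p * predicted_demand(p))
    if best > current_price:
        action = "Raise price"
    elif best < current_price:
        action = "Lower price"
    else:
        action = "Keep price"
    return best, action

candidates = [8.0, 9.0, 10.0, 11.0, 12.0]
print(recommend_price(9.0, candidates))
print(recommend_price(9.0, candidates, override=9.5))
```

Translating the raw optimum into a plain action label is what lets business experts judge whether the recommendation makes sense before acting on it.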
7. Monitor and Validate
The final step in a data analysis process is monitoring and validation. After decisions have been put into play and allowed a short time to work, it’s important to go back and check to see if outcomes are as expected.
Monitoring and validating results can take many forms, for example summary reports and simple charts of actuals versus targets, or of average revenue or sales over time.
The goal is to make sure the results are as expected. If they are not, review any assumptions and check for errors in the data feeds or unexpected changes to data attributes. Look to see if something unexpectedly changed in the market.
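A simple actual-versus-target check could be sketched like this. The period labels, figures, and the 10% tolerance are assumptions chosen for illustration; a missing actual is reported separately because it may point to a problem in the data feed rather than in the market.

```python
def check_against_targets(actuals, targets, tolerance=0.10):
    """Return (period, message) alerts for periods that are missing or off target.

    A period is off target when the relative gap exceeds the tolerance.
    Targets are assumed to be nonzero.
    """
    alerts = []
    for period, target in targets.items():
        actual = actuals.get(period)
        if actual is None:
            alerts.append((period, "missing data"))
            continue
        gap = (actual - target) / target
        if abs(gap) > tolerance:
            alerts.append((period, f"off target by {100 * gap:+.1f}%"))
    return alerts

# Hypothetical weekly sales figures.
targets = {"week 1": 100.0, "week 2": 110.0, "week 3": 120.0}
actuals = {"week 1": 98.0, "week 2": 85.0}  # week 3 has not arrived

alerts = check_against_targets(actuals, targets)
for period, message in alerts:
    print(period, message)
```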
By continually monitoring and going through the above data analysis process steps, problems can be detected early on and corrected before decision-makers find themselves trying to understand nonsensical outputs, or worse, the entire project is branded a disappointing failure. With a good process in place, finding and fixing issues will be routine, and with a good complement of software tools, quality assurance can be built into the system.