When I set out to write this post, I thought I would be talking about the scientific method and how it applies to the data science workflow. I had it in mind to line up the steps and methods of each for discussion.
Then I changed my mind.
I think the more interesting discussion is about a realization I had: one of the steps of the scientific method isn’t necessarily explicit in data science. It’s still there, but that wasn’t apparent to me at first glance, probably because my previous experience was more technical than scientific.
Maybe there are more people like me out there and this will resonate with them too.
Here is a brief refresher on the scientific method, for those who may not have thought about it since they learned it in school. Merriam-Webster’s definition is “principles and procedures for the systematic pursuit of knowledge involving the recognition and formulation of a problem, the collection of data through observation and experiment, and the formulation and testing of hypotheses.”
These are the general steps involved in the scientific method:
1. Define a question
2. Gather information & resources
3. Form an explanatory hypothesis
4. Test the hypothesis by performing an experiment and collecting data in a reproducible manner
5. Analyze and interpret the data. Draw conclusions that may serve as the starting point for a new hypothesis
6. Communicate results
The first couple of steps seem self-explanatory, and are the same between the two processes. I think it’s useful to mention that I would place exploratory data analysis here in step two. It may be obvious to some, but I would also say that asking clarifying questions about the goal of your project belongs there as well.
Step three is the point that wasn’t immediately obvious to me. I didn't think I was actually forming a hypothesis in the course of what I was doing. In fact, it was a more experienced data scientist who pointed out to me that I was. I just wasn't formally or explicitly stating it. There are a couple of things about this idea which I want to talk about.
The first is simply a definition of a hypothesis that made sense to me when I heard it: a hypothesis is an educated guess about the relationship between two or more variables. It seems to me that this educated guessing, followed by testing to validate (or invalidate) the guess, almost defines feature selection, feature engineering and model selection.
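To make that idea concrete for myself, here is a minimal sketch of stating one such guess explicitly and then testing it. The hypothesis is "this feature is related to the target," and the test is a simple permutation test on the Pearson correlation. Everything here is an assumption made for illustration: the data is synthetic, and the names `pearson` and `permutation_p_value` are hypothetical helpers, not part of any library.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def permutation_p_value(xs, ys, n_permutations=2000, seed=0):
    """Estimate how often shuffled targets correlate at least as strongly
    with the feature as the real targets do (two-sided)."""
    rng = random.Random(seed)
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any real feature/target relationship
        if abs(pearson(xs, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Synthetic data (made up for this sketch): target depends on feature plus noise.
data_rng = random.Random(42)
feature = list(range(30))
target = [2 * x + data_rng.gauss(0, 5) for x in feature]

# Hypothesis: "feature is related to target."
r = pearson(feature, target)
p_value = permutation_p_value(feature, target)
print(f"correlation={r:.2f}, permutation p-value={p_value:.4f}")
```

A small p-value here only says the guess survived this one test. In a real project the same pattern would be repeated for each feature or engineered variable I was quietly assuming mattered.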
The other thing is that we make a number of assumptions when we start trying to solve a problem. For instance, we make the initial assumption that the data we have gathered is all the data we need. We assume that the features we select have a relationship with the target we’re trying to predict. We assume that the underlying assumptions of the model we select have been met. When it comes down to it, we start with the very basic assumption that we can solve the problem we’re studying. This is not an exhaustive list. In a way, all of these assumptions are hypotheses, or at the least pieces of one.
It also seems to me that the line between steps four and five tends to be kind of blurry. I hadn’t really articulated this for myself before sitting down to write this, but each iteration of a model is an experiment. Each experiment should shed light on one or more of the assumptions (hypotheses) made earlier. Assessing the results of the model (experiment) often leads to immediate adjustments to the model or features, and the adjusted model then becomes a new experiment. We may cycle back and forth between these two steps fairly rapidly, especially in the early stages of a project.
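One way I could picture this loop is as a small experiment log, where each candidate model is fit on the same training data and scored on the same held-out data. This is only a rough sketch under my own assumptions: the data is synthetic, the two "models" are a mean baseline and a one-variable least-squares line, and the helper names (`fit_mean`, `fit_line`, `mse`, `run_experiments`) are hypothetical.

```python
import random


def fit_mean(xs, ys):
    """Baseline experiment: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Second experiment: ordinary least squares with a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def mse(model, xs, ys):
    """Mean squared error of a fitted model on the given data."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def run_experiments(candidates, train, test):
    """Fit every candidate on the same split and record its held-out error."""
    results = {}
    for name, fit in candidates.items():
        model = fit(*train)
        results[name] = mse(model, *test)
    return results


# Synthetic data (made up for this sketch), with a fixed seed for reproducibility.
rng = random.Random(7)
xs = [rng.uniform(0, 10) for _ in range(60)]
ys = [3 * x + 1 + rng.gauss(0, 1) for x in xs]
train = (xs[:40], ys[:40])
test = (xs[40:], ys[40:])

results = run_experiments({"mean": fit_mean, "line": fit_line}, train, test)
for name, error in results.items():
    print(f"{name}: held-out MSE = {error:.2f}")
```

Each entry in `results` is the outcome of one experiment. Comparing the baseline to the line is a test of the implicit hypothesis that the feature carries information about the target, and the next adjustment would simply become another entry in the dictionary.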
That back-and-forth points to another important realization: the scientific method and a general data science workflow are both iterative processes. It seems pretty unlikely that a person would follow the steps one time through and arrive at an accurate conclusion. Most likely, it will take several experiments, additional data collection, and reformulated hypotheses to get there.
Finally, we come to the last step: communicating your results or findings. There are things about this step that may not be immediately obvious. For example, it’s important to consider the audience you’re communicating your results to in order to present them effectively. A presentation for a room full of board members should look very different from a paper being presented for peer review.
Another consideration is your call to action. If you are making one, make certain it’s clear and compelling. If you’re not recommending a specific action to be taken, you should try to present your findings in a way that helps to spark ideas about the next steps to be taken. To do otherwise runs the risk of devaluing your work and your conclusions.
Articulating these thoughts in this post has helped me to realize that I am actually engaging in the scientific process. For me, this lends a bit more gravity to the things I’m learning and practicing. It also spurs me to put a little more thought into my assumptions than I have been up to now.
Hopefully, this has helped you in some way as well.
- GIFs were sourced from giphy.com