4 Lessons I learned using data science in a real scientific article

Over the past few years, interest in data science has grown enormously, that's a fact. However, there have been two main audiences for the "new" knowledge field:

  • Computer scientists who produce data-science-related articles (deep learning models, comparisons of transformation techniques).
  • People who wanted new skills to work in a high-paying industry.

Personally, I don't think I fit into either of those. I always felt it was good that data science's potential as a tool was finally being popularized, but one point bothered me: at the time this article was written, data science still wasn't being broadly used as a research tool in fields other than computer science.

So, being a Physics major at UFRJ (one of the best Brazilian universities) and a self-taught data scientist, I tried to find a research group at the university, aiming to use my data science skills to produce better evidence for a scientific work.

Soon, I joined a group that was already writing an article. The group used statistics to determine which comorbidities were associated with a higher chance of dying of Covid-19. Along with that, I also asked the question: how well can you predict someone's death based on the symptoms they have? The article was published internationally, and the results and discussion can be seen here.

Finally, here are 4 lessons I learned over the 8 months I worked on the article.

1 - Science isn't a straight line

What I mean by that is: science is not your beginner project, with a well-defined dataset and no responsibility attached. First of all, the data you have is usually a mess, and the longest part of the workflow is cleaning it. Second, the decision-making process on what to do with each variable is basically a way of molding reality before feeding your model. It is a tiring process, but it makes all the difference to the outcome if done well.
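To make that concrete, here is a minimal sketch of what those variable-by-variable decisions can look like in pandas. The file and column names are hypothetical, not the article's actual data:

```python
import pandas as pd

# Hypothetical raw dataset; real clinical data is usually far messier than this.
df = pd.read_csv("covid_records.csv")

# Decision 1: rows without a recorded outcome can't help answer the question.
df = df.dropna(subset=["outcome"])

# Decision 2: treat a missing ICU flag as "not admitted". This is a modelling
# choice that shapes reality before the model ever sees the data.
df["icu_admission"] = df["icu_admission"].fillna(0)

# Decision 3: keep ages inside a plausible range; anything else is likely an entry error.
df = df[df["age"].between(0, 110)]
```

Every one of those lines is a decision you will have to justify later, which is exactly why this part of the workflow takes so long.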

When working on an article, there is always a hypothesis. Those who originally came up with the hypothesis surely have a guess about what the result will turn out to be.

However, that should always be a main point of attention, since holding on too tightly to what you expect the result to be might lead you into confirmation bias (look it up if you haven't heard of it, it's important). That leads us to the second point:

2 - Science learns from bad results

If a model or analysis didn't result in pretty visualizations and excellent metrics, there's still something to learn from it.

Maybe the data just doesn't answer the question you're asking. Maybe there's something you don't know about how the data was collected, or about what each variable means exactly, that would make all the difference to the outcome. Data is not a human: it won't lie to you. So try to listen to what it is telling you, rather than only seeing what you expect.

3 - Model performance isn't always the main goal

Working on the article, our main goal was to measure the relative importance of our variables to the model's outcome. After a few weeks of work, we faced a problem: the model performed better when we removed two specific variables. If this were an industry problem, it wouldn't be an issue: results are the main goal, so just drop them (a rough sketch of that kind of comparison is below).
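This is not the exact method from the article, just a minimal sketch of the kind of comparison involved: scoring a model with and without two candidate variables, then checking how much each feature actually contributes. Dataset and column names are hypothetical:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import cross_val_score, train_test_split

df = pd.read_csv("covid_records_clean.csv")  # hypothetical cleaned dataset
X, y = df.drop(columns=["outcome"]), df["outcome"]

# Cross-validated score with every variable, and with two of them removed.
full = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5).mean()
reduced = cross_val_score(
    RandomForestClassifier(random_state=0),
    X.drop(columns=["var_a", "var_b"]),  # the two "problem" variables (hypothetical names)
    y,
    cv=5,
).mean()
print(f"all variables: {full:.3f} | without var_a, var_b: {reduced:.3f}")

# Permutation importance: how much does the score drop when a variable is shuffled?
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
```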

However, in our case, it was not wise to remove these variables. Although they were irrelevant to the model's performance, they were relevant healthcare indicators. Does that mean we designed bad features? Does it indicate a systematic problem in the way the data was collected?

In our case, the data was collected in hospitals by stressed-out healthcare workers in a developing country, during a pandemic. We can't simply close our eyes to that logistical and humanitarian challenge. Or maybe it just means that the variables we thought were important indicators of Covid-19 deaths turned out not to be. Don't forget the first point.

4 - Reproducibility matters

When working with data science, we (at least I did!) tend to develop models and analyses in a messy way. No one else is going to read it, so why bother making it readable and accessible?

First of all, reproducibility is not about making things look good: it is about making sure that every decision you made is clear, including:

  • where the data came from,
  • what feature engineering decisions you made along the way,
  • how you dealt with missing values,
  • how you chose a model,
  • and how and why you chose the metrics you report.

A sketch of what that can look like in code is below.
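One practical way to get there (just a minimal sketch, with illustrative names and a hypothetical dataset) is to keep the preprocessing steps, the model, and the metric in one explicit pipeline, so anyone with the same data can rerun exactly what you did:

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("covid_records_clean.csv")  # hypothetical cleaned dataset
X, y = df.drop(columns=["outcome"]), df["outcome"]

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),  # how missing values were handled
    ("scale", StandardScaler()),                           # how features were transformed
    ("model", LogisticRegression(max_iter=1000)),          # which model was used
])

# The metric is a decision too: report which one you used and why.
scores = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc")
print(scores.mean())
```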

The scenario you have to imagine is this: if someone had the same data you have, could they reproduce your exact results using only the information you put in the article? If not, it shouldn't be considered science. You're no better than a magician.

Don't get me wrong, it's not that I don't trust you: science isn't about trust. Would you trust a rocket that only one scientist said works? Or would you rather have it checked by dozens of experienced engineers?

It's only science if it can be reproduced.
