The Need for Transparency and Clarity in Data Science

#datascience #machinelearning

Before I started on my path to becoming a Data Scientist, I was a High School Special Education Teacher. At first glance, Teaching and Data Science may not seem to have much in common, but they in fact share many similarities. In this article, I would like to focus on one of them: their need for transparency and clarity.

Toward the end of my teaching career, a citizen’s complaint was filed against the Special Education Department. In the complaint, the family argued that the teachers were providing insufficient services to the student. Of course, as a department, we believed our methods were sound, so we continued to teach as usual during observations by the district. However, when the verdict came, the district agreed with the family. This came as a disappointment to our department, but we wanted what was best for the students and wanted to work with the district to serve them better. Unfortunately, the response we received basically boiled down to “teach better, and we are going to observe you again next year to make sure you fixed everything.” There was no transparent explanation of how they came to their conclusions, and there were no clear goals for us to achieve. How could we meet their expectations if we were given no clear path or parameters?

Both Data Science and Teaching benefit greatly from transparency and clarity. If no one knows how your model works, how do we know if it is accurate or equitable? If you don’t have a clear research goal, how do we know what kind of data we need? Below I have some areas of focus and tips I have learned that I believe are important to keep in mind while we develop and implement our research.

A major part of Special Education is having students achieve individualized goals. Each goal is based on the areas the student qualified in (Math, English, etc.) and is written by the Teacher who has that student on their caseload. Not all goals are created equal, though. One teacher could write a goal that reads, ‘Student will improve their Algebra skills from a C grade to a B grade’, and another could write, ‘Student will increase ability to multiply fractions from 2/10 attempts to 8/20 attempts’. The difference is that the first goal is vague and means different things depending on the teacher, while the second names a clear, measurable skill for the student to work on.

Data Science can have similar issues with research questions. It is extremely important to have clear goals so you know what data to use and what counts as a success. We All Count, a group focused on equity, has an article that talks about clear research questions for Data Science. In it, they highlight the following example:

Example: If we’re working for an organization that is trying to improve the school performance of students by making sure they have access to a healthy breakfast, one of our research questions might be:

“Did math test scores increase for the students who participated in our breakfast club at least once a week?” – This question is specific, measurable and can be answered with data.

Things that are not research questions:

“Is our program working?” – This question is too general, it’s more related to the overall goal or motivation of the data project. There are many specific research questions that could come out of this general question.

By having a clear research question, you can have a better understanding of what data you need, saving you from either getting unnecessary data, or finding yourself missing important data.
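To make this concrete, here is a minimal sketch of how the breakfast club question above tells you which data to collect. The field names, the attendance threshold parameter, and every number are made up for illustration. The sketch only computes an average score change for participants; it does not show that breakfast caused the change.

```python
from statistics import mean

# Hypothetical records, one per student. All values are invented.
# The research question names exactly the fields we need:
# how often the student attended, and math scores before and after.
students = [
    {"id": "s1", "breakfast_visits_per_week": 3, "math_before": 61, "math_after": 70},
    {"id": "s2", "breakfast_visits_per_week": 1, "math_before": 74, "math_after": 77},
    {"id": "s3", "breakfast_visits_per_week": 0, "math_before": 68, "math_after": 69},
    {"id": "s4", "breakfast_visits_per_week": 0, "math_before": 80, "math_after": 78},
]


def mean_score_change(records, min_visits_per_week=1):
    """Average math score change for students who attended at least
    `min_visits_per_week` times a week; None if no student qualifies."""
    changes = [
        r["math_after"] - r["math_before"]
        for r in records
        if r["breakfast_visits_per_week"] >= min_visits_per_week
    ]
    return mean(changes) if changes else None


participant_change = mean_score_change(students)
print(f"Average change for weekly participants: {participant_change}")
```

A question like “Is our program working?” gives you no such list of fields, which is exactly why it is hard to answer with data.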

While I studied for my teaching certificate, one study stuck with me, and it still does to this day. The exact article I read is lost to me, but I found an article that describes it well. The study wanted to find out whether paying students money to get better grades would actually increase their scores. Three schools took part: two of them paid students up to $500 a semester for good test scores, attendance, and grades, and one offered students $2 to read an approved book. While students could potentially earn more money at the schools that paid for grades, the only school that saw a meaningful improvement was the one that gave students $2 for reading a book. The reason was that the way to earn the money for reading a book was clear and transparent. Students knew exactly what they needed to do to receive the cash. Just telling students they could earn money by doing better in school did not give them any skills to actually improve, just a reward for figuring it out.

In her book Weapons of Math Destruction, author Cathy O’Neil focuses on the issue of transparency in Data Science models. One example she addresses is a model made to evaluate teachers based on a number of factors. Unfortunately, these factors were opaque, so teachers were unable to know how they would be graded until after the fact. Scores also fluctuated wildly from year to year. One teacher in particular scored a 6 out of 100 one year and a 96 the next, despite not changing his teaching practices at all. She argues that to prevent these types of events from happening, we need transparency in our models for greater equity and effectiveness. If no one knows how your model works, it is very difficult to critique it, point out its flaws, or improve it.

Simply creating a model that gives accurate predictions is not enough. We need to strive to create models that are clear and transparent in addition to being effective. Models can have a major impact on people’s lives, such as deciding whether someone is approved for a loan. We must keep this in mind and do our best to help people and companies, not harm them with opaque and vague models that are ineffective and difficult to improve.
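As one small illustration of what transparency can look like in practice, here is a sketch of a scoring model that reports how much each factor contributed to a decision. The factors, weights, and threshold are all hypothetical and chosen only for the example. This is not how any real lender scores applicants, and a simple weighted sum is just one of many ways to make a model easier to inspect.

```python
# Hypothetical weights and approval threshold, invented for illustration.
WEIGHTS = {
    "income_thousands": 0.5,
    "years_employed": 2.0,
    "missed_payments": -10.0,
}
THRESHOLD = 40.0


def explain_score(applicant):
    """Return the total score, the decision, and each factor's contribution,
    so the person affected can see what drove the outcome."""
    contributions = {name: WEIGHTS[name] * applicant[name] for name in WEIGHTS}
    total = sum(contributions.values())
    return {
        "score": total,
        "approved": total >= THRESHOLD,
        "contributions": contributions,
    }


result = explain_score(
    {"income_thousands": 60, "years_employed": 5, "missed_payments": 1}
)
for factor, value in result["contributions"].items():
    print(f"{factor}: {value:+.1f}")
print(f"score: {result['score']:.1f}, approved: {result['approved']}")
```

Because every contribution is visible, the applicant can see which factor pulled the score down, and a reviewer can question whether a given weight is fair.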
