
RCKettel

A discussion of the history of data science methodologies

I was curious about the history of data science, so I looked into the influence it has had on other schools of thought. While skimming through a list of scientific papers, I found one that interested me and thought I would discuss its contents. The paper is called Big Data, New Epistemologies, and Paradigm Shifts. In it, the author responds to the major changes Big Data methodologies have caused in the social sciences and the humanities, speaking specifically to the new methods put forth by practitioners of data science and how those methods and epistemologies affect these fields and their studies.

The first major subject discussed was the set of claims made by data scientists at the time: that Big Data was creating a new age of study in which the sheer volume of data and the methods used would allow the data to explain its own trends, without any need for methods specific to a single social science. Data science is not unique to any one discipline, and it is applied both to non-scientific questions and trends and to the study of natural phenomena and processes. Because of these methods, there was, and possibly still is, a culture among data scientists of claiming that the scientific method and the empirical processes of any single scientific school, such as models or hypotheses, are unnecessary when studying large datasets: that such datasets are exhaustively holistic in their framing, that correlations and trends need no explanation, and that the results require no further scientific exploration. In addition, these claims held that the data carries no inherent bias and no built-in explanation of its trends, so the patterns and trends are meaningful and factual in their own right, and anyone who can read a model can interpret them.

The author asserts that these beliefs are misguided. First, any dataset is a sample of a larger whole that could be collected; it cannot be holistic, nor can it provide inherently general or completely factual information, since many factors involved in collecting the data introduce bias, and data is rarely collected without a purpose in mind. Second, the algorithms engineered for these analyses were developed to answer questions grounded in philosophical constructs, such as testing a hypothesis. No algorithm used in data science was created without some scientific purpose, and each was put through extensive scientific rigor to prove its reliability; as such, they are scientific models. Third, the interpretation of data will always be framed by some form of question, and the algorithms used for analysis were created for a particular approach to answering hypotheses, making their results interpretable only in light of those questions. These algorithms therefore embody a particular scientific approach, which introduces further bias. In addition, patterns illuminated by the algorithms are not inherently meaningful and can be random or spurious. Finally, the claim that knowledge of a particular field is unnecessary to interpret results is unrealistic. Data can be interpreted without prior domain knowledge, but the findings will likely be weaker than those situated in a greater context. In reality, though these claims may hold for practitioners of data science outside the scientific community, they oversimplify a more complex approach in order to prove the value of Big Data analytics.

At this point the author asserted the need for a new epistemological process. First, data science uses a different method to define hypotheses within empirical methods, using guided techniques to identify potential questions worth examining. He explained that people who use Big Data methodologies do not follow the traditional empirical approach. Rather, meaningful data are generated using preplanned approaches to harvest data, and the material is then strategically evaluated to identify information worth further inquiry by people in a given field. In other words, the data collection and evaluation use abductive reasoning: finding data through a logical method, but not through a single definitive approach. Information and relationships in the data are then used to form hypotheses through induction, by studying the relationships that become apparent in the data and forming a hypothesis based on those insights. Both, he points out, differ from the usual scientific method.

The author argued that this new model would become the reigning paradigm in the age of Big Data, as it is better suited to extracting meaningful information from large datasets where the traditional model is less useful. This was especially true, and most challenging, in fields like the humanities, where emerging disciplines such as the Digital Humanities and the Computational Social Sciences have challenged the traditional methods used by scholars. Both fields have met similar resistance from more traditional scholars, since the methods used by practitioners of these schools of thought may yield results that lack depth and context relative to the established knowledge base and appear to require no real domain knowledge to be understood. Specifically, critics argued that these studies were too reductionist, taking the humanity out of the greater context of the subject by minimizing its role in the studies and ignoring the greater complexities of society when using only quantitative systems of explanation. The author defends these new schools somewhat, stating that although their methods have problems, highlighting those limitations should demonstrate their usefulness, even if only in reference to other studies that give the patterns context.

The model the author finds most promising uses radical statistics and GIS data. Its practitioners use current social theory to determine the empirical approach and to frame the eventual results. It accepts research as a system shaped by human influence, which places the information in a particular frame of reference, and it acknowledges that the researcher needs prior grounding in the subject matter. It situates the research in a greater context, and the data retains the original context of the work. In short, these methods accept the necessary reflexivity, or self-examination, in their practice.

The author concluded by stating that Big Data is a disruptive influence and will likely change the practical methods used by many in the scientific community. The historic methods will never really be replaced, but they will likely remain in conflict with the new data-driven empirical methods until a new theoretical framework emerges for the paradigm being presented. He also points out methods that show promise, drawing attention back to the humanities and the challenge Big Data poses by appearing to remove the need for a knowledge base, while addressing the epistemology of models that are reflexive and situated in the realities of the scientific community. For these reasons, the author called for an examination of the methods used in Big Data analytics, given their effects on the changing landscape.

I find this paper helpful and difficult at the same time. It points out a need for my own understanding of how best to practice my craft. While I would like to say that I can do whatever I like with the data presented to me, it is good to remember that there are particular guidelines I should pay attention to, such as what the dataset I am looking at was originally created for and whether it will be a good representation of the knowledge that suits my purpose. Doing this will allow me to avoid unexpected biases that could influence the outcome of my analysis. In addition, remembering that I have influence over how the data is framed and what I am using it to discuss is equally important, as I could place unnecessary constraints on it or make decisions that have unintended consequences. Finally, the article makes the good point that the data should be something the data scientist has grounding in. This grounding allows the analyst to give greater depth and understanding to their study and produce more specific, less broad outcomes. I hope this article helps the reader better understand the importance of basic methodologies. Though they can be difficult, they were created for a particular purpose, and I hope the article helps illustrate the reasoning behind them.

If you found this subject interesting, a link to the original article can be found below.

Big Data, New Epistemologies, and Paradigm Shifts
