DEV Community

Michelle Njuguna

Demystifying Data Science: A Beginner’s Guide!

Introduction

Hey there! I'm Michelle, and I like to call myself a data enthusiast! Data science might sound intimidating, but it’s not. I’ve been where you are now, staring at the screen with that “where do I even begin?” look. But don’t worry, I’m here to guide you through this journey.

What Is Data Science?
We first need to understand data. Data is a very broad term that can refer to raw facts, processed data or information.
There are two types of data:

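  • Traditional data: structured data, usually stored in tables and small enough to manage on a single computer.
  • Big data: extremely large volumes of data, much of it unstructured (text, images, audio and so on). The rise of big data drove the growth of data analysis roles and fields like data science and machine learning.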

According to Wikipedia, Data science is an interdisciplinary academic field that uses statistics, scientific computing, scientific methods, processes, scientific visualization, algorithms and systems to extract or extrapolate knowledge and insights from potentially noisy, structured, or unstructured data.

In simpler terms, data science is all about discovering hidden insights in data and making predictions for the future.

Data Science

  1. Business Intelligence (BI): studies the numbers and explains where and why some things went well and others not so well. With the business context in mind, the business intelligence analyst translates raw data into reports and dashboards.
  2. Traditional Methods: were designed before big data existed, when the technology simply wasn't as advanced as it is today. They involve applying statistical approaches to create predictive models.
  3. Machine Learning (ML): uses machine learning techniques and tools, often grouped under A.I., to predict behavior in ways that weren't possible before.

Traditional Data
A quick run-down of how to handle traditional data:

  1. Data Collection: Start with gathering your raw data. This could be survey responses, sales records and so on.
  2. Data Preprocessing: Now that you’ve got your data, it’s time to clean it up. This is like sorting through a pile of maize after harvesting. Part of this step is labeling each data point by type. For example, sales numbers are numerical data, while customer feedback is categorical.
  3. Data Cleansing: Sometimes your data is messy. Maybe someone wrote "two" instead of "2" or mistyped a name. Cleansing is all about fixing inconsistencies.
  4. Balancing: If your data is uneven (for example, one group is heavily overrepresented), you need to balance it out so your results aren’t skewed or biased.
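
Here is a minimal sketch of the cleansing and balancing steps using only the Python standard library. The records, the `WORD_TO_NUMBER` lookup and the helper names `cleanse_units` and `balance` are all made up for illustration, and random oversampling is just one simple way to balance data, not the only one.

```python
import random
from collections import Counter

# Hypothetical raw records: (units_sold, feedback) pairs with messy entries.
raw_records = [
    ("2", "positive"),
    ("two", "positive"),
    ("5", "positive"),
    ("3", "negative"),
    (" 4 ", "positive"),
]

# Illustrative lookup for numbers that were typed as words.
WORD_TO_NUMBER = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}


def cleanse_units(value):
    """Turn entries like '2', ' 4 ' or 'two' into an integer."""
    text = value.strip().lower()
    if text in WORD_TO_NUMBER:
        return WORD_TO_NUMBER[text]
    return int(text)


def balance(records, label_index, seed=0):
    """Randomly oversample smaller classes until all classes are the same size."""
    rng = random.Random(seed)
    by_label = {}
    for record in records:
        by_label.setdefault(record[label_index], []).append(record)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw extra copies (with replacement) to reach the target size.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


cleaned = [(cleanse_units(units), feedback) for units, feedback in raw_records]
balanced = balance(cleaned, label_index=1)
label_counts = Counter(feedback for _, feedback in balanced)

print(cleaned)
print(label_counts)
```

After cleansing, every units value is a number, and after balancing, the "positive" and "negative" groups have the same number of records.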

Big Data
According to Wikipedia, X (formerly known as Twitter) has 619 million active users and creates around 12 TB of data daily, or about 4.3 PB annually. The platform amasses around 500 million tweets daily, amounting to 560 GB of data.
This is just X, so you can imagine how much data is put out on other platforms every day! And it keeps growing. The data comes in many different forms, which is why much of it is unstructured. Maybe now you can understand why big data is described by the 3 V’s:

  • Volume refers to the amount of data.
  • Velocity refers to the speed at which data is generated and processed.
  • Variety refers to the number of types of data.
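
To get a feel for volume, here is a quick back-of-the-envelope check of the figures quoted above. It assumes decimal units (1 PB = 1,000 TB) and a 365-day year. The variable names are only for illustration.

```python
# Figures quoted above; decimal units assumed (1 PB = 1000 TB).
TB_PER_DAY = 12
DAYS_PER_YEAR = 365
TB_PER_PB = 1000

# Daily volume scaled up to a year, expressed in petabytes.
pb_per_year = TB_PER_DAY * DAYS_PER_YEAR / TB_PER_PB

# Average size of one tweet if 500 million tweets add up to 560 GB.
bytes_per_tweet = 560e9 / 500e6

print(f"{pb_per_year:.2f} PB per year")
print(f"{bytes_per_tweet:.0f} bytes per tweet")
```

12 TB a day works out to roughly 4.38 PB a year, which lines up with the "about 4.3 PB annually" figure, and the tweet numbers come to about 1,120 bytes per tweet on average.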

How to handle Big Data

  1. Text Mining: the process of deriving valuable information from unstructured text.
  2. Data Masking: hiding or obscuring confidential information so that the data can still be used for business or governmental activities without exposing sensitive details.
  3. Predictive Analytics: looking into the future using data. You can do this with traditional statistical methods or with machine learning.

Traditional Methods

  • Regression: a model used for quantifying the relationships among the different variables included in your analysis. Keep in mind that a relationship in the data does not by itself prove that one variable causes another.
  • Clustering: grouping similar things together.
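
To make regression a bit more concrete, here is a minimal sketch of simple linear regression fitted with ordinary least squares in plain Python. The `ad_spend` and `sales` numbers are invented for illustration (they follow sales = 2 × spend + 1 exactly), and `fit_line` is a hypothetical helper, not a library function.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept using ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up example data: advertising spend vs. sales.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(ad_spend, sales)

# Use the fitted line to predict sales for a spend we have not seen.
predicted = slope * 6.0 + intercept
print(slope, intercept, predicted)
```

The slope tells you how much sales change for each extra unit of spend in this toy data. In real projects you would more likely reach for a library such as scikit-learn or statsmodels, but the idea is the same.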

Machine Learning
Machine learning means training computers to learn from data and make predictions without being explicitly programmed for each task.

There are three main types of ML:

  • Supervised Learning: works with labeled data to predict outcomes. Examples: support vector machines, random forest models, Bayesian networks and neural networks (including deep learning) are commonly used for supervised learning.
  • Unsupervised Learning: works with unlabeled data to find patterns or groupings. Examples: some neural networks can be applied to unsupervised learning, but K-means is the most common unsupervised approach.
  • Reinforcement Learning: Training models with rewards and punishments.
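
Since K-means came up, here is a rough sketch of the idea on one-dimensional data, written in plain Python. The `ages` list, the starting centroids and the `k_means_1d` helper are assumptions made up for this example. Real projects would typically use a library implementation instead.

```python
def k_means_1d(points, centroids, iterations=10):
    """Very small K-means for 1D data: assign points, then move centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Made-up, unlabeled data: customer ages.
ages = [18, 19, 21, 22, 60, 62, 65]

centroids, clusters = k_means_1d(ages, centroids=[18, 65])
print(centroids)
print(clusters)
```

Notice that nobody told the algorithm which customers are "young" or "old". It only groups values that are close together, which is what makes this unsupervised.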

Deep learning can be applied in all three settings: supervised, unsupervised and reinforcement learning.

Wrapping Up
I know, it’s a lot. But remember, every expert was once a beginner. Keep experimenting, and don’t be afraid to make mistakes. That’s how you learn!

That’s it for now! Stay tuned for more articles where we’ll dive deeper into these topics.
