Josphat Kahihia

Data Science for Beginners: 2023 - 2024 Complete Roadmap
Introduction
Navigating the world of Data Science is particularly confusing for a beginner, who has to deal with terms such as data, data analysis, data engineering, SQL, Python and many more. Let’s dive in and make sense of it all, shall we?

If you already know why data science exists, feel free to jump ahead to the technical part in The Data Science Roadmap section below.
So, what is Data?
Data is (checks notes) “facts and statistics collected together for reference or analysis”. Simply put, any observation made is data. Examples include: the number of people wearing red t-shirts in a room, the weather today, the opinion provided by your friend on your favourite sport, and many more.

Since we are always observing the world around us, we are constantly encountering data. This means that data is not a new concept, but rather a term for something you’ve been doing all your life.

Why put the ‘science’ in there then?
Observations on their own have little or no impact on understanding the bigger picture of a topic or subject matter. For example: just observing today’s weather can’t help you decide how to make the most of the day. If you’re already out in the street and it starts raining out of nowhere, you can’t whip out an umbrella you never carried or expected to need. So you’re forced to seek shelter, or defy your cat instincts and brave the rain.

Thankfully, we have history
A combination of related observations and their occurrences in the past helps us find a pattern in how events (observations) have been occurring, and predict the future trend of that pattern (e.g. using past weather to accurately predict whether it will rain tomorrow, or whether you can dress as lightly as possible).

This is what makes data useful and important. We can choose a subject matter, topic or any general issue, take a look at the observations recorded, understand their causes and links, and predict future patterns. That gives us the gold: information.

I can already guess when it will rain and other things, so why have I searched for this - and gotten here?
Yes, that is true, and it’s a great skill to have. However, our natural storage and processing of data is limited: as you might know, you can’t recall everything, and we have a lot of things to think about, not just a single trend such as the weather.

To deal with this in the past, we wrote data down on paper, in books and in files. This obviously became messy quickly, since corrections were hard to make, files and papers were easy to lose, and creating backups was hectic - you spent more time checking a copy against the original document than actually writing it.

What a buzzkill, right? How could this issue be addressed?

Enter our digital assistants: computers. Just like the one you’re reading this on, they can store things for as long as we need - as long as you don’t delete them :) - which allows us to look back at observations we’ve made in the past, scrutinise them, analyse them and understand them well enough to make informed decisions and predictions on certain topics.

This is especially useful in businesses, since a misaligned decision or prediction can cause significant losses, or even shut the business down.
As many as the rocks on the beach were the different approaches to making sense of data
Everyone came up with their own style, and communicating results to other people became a hard task, as there was no common ground. Where communication fails due to differences in approach, science is used to set the common ground.

More on the digital assistants
Our computers don’t communicate in English, but rather in electric signals. Yes, even the letters you are reading now are a result of those electrical signals. We obviously don’t speak hertz and electricity, so we have to make computers understand what we want. Enter programming languages, which are basically translators of human-readable instructions into those electric signals. These, together with other concepts such as databases, help us store data and run analyses on it. More on those later.
How Science redefined the analysis and understanding of data
Science streamlined the process by defining procedures, guidelines and rules for understanding data, which eventually became Data Science.
It combines multiple disciplines such as:
Statistics
Computer science (programming)
Domain Knowledge
Advanced analytics
Machine learning and AI

And that’s how we now have data science :)

tl;dr,
Data Science is therefore the process by which we make sense of data (produce information) by understanding the patterns and trends in a topic and what drives them.

The Data Science Roadmap
Now that we know what data is, and where all this started - let’s see how we can become data scientists.

Technologies and terminologies
Database Technology
This is how data is stored in computers. A database structures data with references and metadata that describe how the data is organised and linked.

Database types are:
Hierarchical databases - developed in the 1960s, a hierarchical database looks similar to a family tree, with records linked in parent-child relationships.
Relational databases - a system designed in the 1970s, storing data in tables that can be linked (related) to one another.
Non-relational databases - these store data in more flexible formats, such as documents or key-value pairs.
Object-oriented databases - these store data as objects, mirroring object-oriented programming.

Popular relational databases include MySQL, PostgreSQL and SQLite. They are all queried using Structured Query Language (SQL), the standard language for asking a relational database for data.

Python

Python is a versatile programming language with support for many mathematical and statistical functions (tasks). These are stored conveniently in libraries, which you simply call from your code (mention them as a reference) instead of rewriting and cramming them yourself.
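For a quick taste, here is a minimal sketch of calling a library instead of rewriting the maths yourself. It uses Python’s built-in statistics module, and the temperature readings are made up for illustration:

```python
# Calling a library instead of rewriting the maths:
# Python's built-in statistics module already implements common measures.
import statistics

temperatures = [21.5, 23.0, 19.8, 22.4, 20.1]  # hypothetical daily readings

print(statistics.mean(temperatures))    # the average temperature
print(statistics.median(temperatures))  # the middle value
print(statistics.stdev(temperatures))   # how spread out the readings are
```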

It is through Python (or any other compatible language) that you are able to easily fetch data from databases and pass it on to the computer for analysis.
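Here is a rough sketch of that idea using Python’s built-in sqlite3 module. The weather.db file, the readings table and its columns are hypothetical names used purely for illustration:

```python
# A minimal sketch of fetching data from a database with Python,
# using the built-in sqlite3 module (table and column names are made up).
import sqlite3

conn = sqlite3.connect("weather.db")  # hypothetical database file
cursor = conn.cursor()

# Create a sample table and add one observation so the query has data.
cursor.execute("CREATE TABLE IF NOT EXISTS readings (day TEXT, rainfall_mm REAL)")
cursor.execute("INSERT INTO readings VALUES ('2023-10-01', 12.5)")

# SQL is the language we use to ask the database for the rows we want.
cursor.execute("SELECT day, rainfall_mm FROM readings WHERE rainfall_mm > 10")
for row in cursor.fetchall():
    print(row)

conn.close()
```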

ETL Pipelines

In an ideal world, data is stored in a consistent structure and is complete (in the sense that every observation is fully recorded).
However, in the real world, this isn’t the case.
Extract, Transform, Load (ETL) pipelines are therefore used to standardise the data as much as possible, while retaining as many of the recorded observations as possible.
This involves deleting incomplete records, filling gaps in the data, “rectifying” erroneous values, and so on.
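As a small illustration of the Transform step, here is a sketch using pandas, a popular Python library for tabular data. The dataset and column names are invented:

```python
# A rough sketch of the Transform step in an ETL pipeline using pandas.
import pandas as pd

raw = pd.DataFrame({
    "day": ["Mon", "Tue", "Wed", "Thu"],
    "rainfall_mm": [12.5, None, 3.0, None],  # None marks missing observations
})

# Option 1: delete incomplete records entirely.
cleaned = raw.dropna()

# Option 2: fill the gaps instead, e.g. with the column's average,
# so we retain as many of the recorded observations as possible.
patched = raw.fillna({"rainfall_mm": raw["rainfall_mm"].mean()})

print(cleaned)
print(patched)
```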

Machine Learning and AI

As we grow more reliant on our digital assistants, we want them to do more and more of the work for us, leaving us only what is deemed strictly necessary.
To achieve this, they had to be taught to “think” on their own, and to find anomalies and trends in the data fed to them.
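To make that concrete, here is a minimal sketch of “learning a trend” from past observations using scikit-learn’s LinearRegression. The numbers are made up, and real models are trained on far more data:

```python
# A minimal sketch of learning a trend from past data with scikit-learn.
from sklearn.linear_model import LinearRegression

# Past observations: hours of sunshine (input) vs. temperature (output).
hours = [[1], [2], [3], [4], [5]]
temps = [15.0, 17.1, 19.2, 21.0, 23.1]

model = LinearRegression()
model.fit(hours, temps)      # the model finds the trend in the past data

print(model.predict([[6]]))  # predict the temperature for an unseen day
```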

What to Learn

Learn Python; database management, design and principles; ETL pipeline technologies; and AI and machine learning.

There are plenty of resources such as online courses, YouTube tutorials, books, and even school-based courses.

At the end of the day, once you have learnt how to use the above technologies to make sense of large datasets and to make accurate decisions and predictions based on them, you will have become a Data Scientist. Happy learning!
