Data Scientist Roadmap for Complete Beginners 2023-2024 part 1.

But what exactly is data science?

According to DataCamp:

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. In simpler terms, data science is about obtaining, processing, and analyzing data to gain insights for many purposes.

To systematize the data science process, various data science life cycles or workflows have been developed, and most of them can be broken down more or less into the following ten steps:

[Figure: the data science (DS) lifecycle]

So today we are going to focus on the key skills needed at each stage of this DS lifecycle.

Before we dive into the skills required, it's important to know that many of a data scientist's skills overlap with those of other data professionals like data analysts, data engineers, and analytics engineers, who, depending on the scale of the data and other organisational factors, focus on a subset of the process above.

So what skills do you need to become a data scientist?

I. General soft skills

More often than not, soft skills are treated as less important, but without them it is virtually impossible to create value using the DS lifecycle. Why, you may ask? The core use of data science is to solve real-world problems, which, among other things, requires understanding what the problem is and communicating results in an easy-to-understand way, all while maintaining a good relationship with colleagues and other stakeholders.

The key skills you will need here at both ends of the cycle are:

  1. Domain expertise helps with understanding key metrics and features in whatever industry you are solving for.

  2. Strong communication skills to be able to ask questions to understand the problem as well as communicate results and tell stories with the data.

  3. Critical thinking and problem-solving skills allow you to choose what tools to use and the approach to take based on any number of constraints including budget and data availability.

II. First Principles

Most guides will have you dive directly into programming languages, but before we get to that, we need to separate the knowledge from the tool. A good data scientist should be able to explain the core concepts independently of the tool used to implement them, and this is where the following core math skills come in. Some of the skills below can even be practised in spreadsheet software.

  1. Descriptive Statistics
  2. Probability
  3. Inferential Statistics
  4. Linear Algebra
  5. Calculus

Here is an MIT cheat sheet for statistics that you can use to refresh your knowledge, and there are countless resources online to help you learn the rest.
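To see how a couple of these concepts translate into code, here is a minimal sketch using NumPy and SciPy; the sample of daily sales figures is made up purely for illustration.

```python
import numpy as np
from scipy import stats

# Made-up sample of daily sales figures, purely for illustration
sales = np.array([120, 135, 150, 110, 160, 145, 130, 155, 140, 125])

# Descriptive statistics
print("mean:", sales.mean())
print("median:", np.median(sales))
print("sample std dev:", sales.std(ddof=1))

# Inferential statistics: one-sample t-test against a claimed mean of 130
t_stat, p_value = stats.ttest_1samp(sales, popmean=130)
print("t-statistic:", t_stat, "p-value:", p_value)
```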

III. Programming and Basics of Computer Science

Programming

Once you have the above core skills, the logical next step is to learn a programming language that will allow you to create reproducible work that others can verify. It also allows you to collaborate with others and makes your work easier by using programs created by others to solve common problems or perform common actions. Even though there are countless programming languages you can learn, your odds are better if you use industry-standard languages and tools which include the following:

Language: Python or R and their related libraries for data ingestion, cleaning, and pre-processing (a small ingestion sketch follows this list of tools). The two are market leaders in analytics and have huge communities developing tools for them.

Version Control: Git is the obvious choice due to its ease of use and availability of platforms like GitHub and Gitlab that offer even easier ways to work with Git.

IDEs: You can always use Jupyter Notebooks or an online variant like Google Colab or GitHub Codespaces, or, if you want to work locally, VS Code with extensions that help specifically with data processing, like Microsoft's Data Wrangler.
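As a small taste of what data ingestion and pre-processing with a library looks like in practice, here is a minimal pandas sketch; the file name sales.csv and its columns (order_date, region) are hypothetical.

```python
import pandas as pd

# Hypothetical file and column names, purely for illustration
df = pd.read_csv("sales.csv", parse_dates=["order_date"])

print(df.shape)         # number of rows and columns
print(df.dtypes)        # check that each column has the expected type
print(df.isna().sum())  # count missing values per column

# Simple pre-processing: drop exact duplicates, normalise a text column
df = df.drop_duplicates()
df["region"] = df["region"].str.strip().str.lower()
```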

Computer Science Basics

Here you mostly need to get comfortable with data structures and algorithms, as well as understand time complexity and Big O notation. These will help you write performant and efficient code, an advantage that is most visible when dealing with large datasets, where inefficient code can directly lead to higher operational costs once your model is deployed with a cloud service provider.
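As a quick illustration of why this matters, membership tests on a Python list scan every element (O(n)), while a set uses hashing and is O(1) on average; the gap grows with the size of the data. This is only a minimal sketch, and the exact timings will vary from machine to machine.

```python
import timeit

n = 100_000
data_list = list(range(n))
data_set = set(data_list)

lookup = n - 1  # an element near the "end" of the collection

# O(n): the list is scanned element by element on every lookup
list_time = timeit.timeit(lambda: lookup in data_list, number=1_000)
# O(1) on average: the set uses a hash lookup
set_time = timeit.timeit(lambda: lookup in data_set, number=1_000)

print(f"list membership: {list_time:.4f}s, set membership: {set_time:.4f}s")
```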

IV. Machine Learning

Now that you have mastered the core math skills and tools used widely for data science, it's time to dive into building models, a key component of any data science project.

  1. EDA and Data Cleaning - Unless you have someone to assist with this, it is up to you to learn how to clean the datasets you intend to use for machine learning. Garbage in, garbage out: the quality of a model fundamentally depends on the data used to create it. Here you will mostly use a programming language; with Python in mind, you will have to learn NumPy and pandas for wrangling and basic visualisation, and Matplotlib and Seaborn for more advanced visualisation during EDA. You have to understand how to handle missing values, incorrect values, data types, etc., so you can mould the data appropriately (a short end-to-end sketch follows this list).

  2. Feature Engineering and Selection - Remember when we highlighted the importance of domain expertise? This is where it shines most. By understanding how the features relate to each other and how their transformations are linked to business goals, you will be able to choose the right features or engineer more relevant ones for the model to use.

  3. Model Selection - You will need to learn common machine learning algorithms to use in building your model, and also know when to use the different types of models and their benefits. To learn why this is important, you can read this article by the founder of PyCaret (itself an excellent tool for this type of work).
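To tie the three steps above together, here is a minimal sketch using scikit-learn: the data is synthetic, some values are deliberately blanked out so the cleaning step has something to do, and two common algorithms are compared with cross-validation. It is a toy illustration of the workflow, not a production recipe.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a real feature matrix
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Mimic missing values so the cleaning step is actually needed
rng = np.random.default_rng(42)
X[rng.random(X.shape) < 0.05] = np.nan

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

for name, model in candidates.items():
    pipeline = Pipeline([
        ("impute", SimpleImputer(strategy="median")),  # handle missing values
        ("scale", StandardScaler()),                   # put features on a common scale
        ("model", model),
    ])
    scores = cross_val_score(pipeline, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```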

V. Model Deployment

Now that you have built your perfect model, you want people to use it, but definitely not from your computer or by building it from source code. You need a way to expose the model to more people via standard APIs or visual interfaces. You can do this using libraries like Flask and FastAPI if you go the API route, or libraries like Streamlit and Dash for building interactive dashboards and interfaces.
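As an example of the API route, here is a minimal FastAPI sketch that exposes a trained model behind a single prediction endpoint; the model.pkl file and the feature names (age, income) are hypothetical stand-ins for whatever your own model expects.

```python
import pickle

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Hypothetical file: a model trained and pickled earlier in the lifecycle
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

class Features(BaseModel):
    # Hypothetical feature names; match them to your own model's inputs
    age: float
    income: float

@app.post("/predict")
def predict(features: Features):
    prediction = model.predict([[features.age, features.income]])
    return {"prediction": prediction.tolist()}
```

You would then run it locally with a command like `uvicorn main:app --reload` (assuming the file is named main.py) and send JSON to the /predict endpoint.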

You may also need to learn cloud technologies like Azure/GCP/AWS which provide cloud resources during the entire cycle and most importantly storage, compute, and deployment.

That's the basic approach I can recommend for those wanting to start out in data science. You will notice that I have left out many of the skills needed for the collection and transformation of data because, as much as they are core skills for all data professionals, a data scientist will most often work on the statistical part of the process, specifically building models.

In the next article, I will dive into the actions that can set you apart once you are ready to start applying for opportunities and the types of projects that will make you shine.
