DEV Community

SHEM MAINA

The Ultimate Guide to Getting Started in Data Science

I first heard the term "data science" in May 2019, when I arrived at college, but I didn't pay much attention to it until the beginning of 2022. I was clueless at first, but things are improving with time, thanks to excellent mentorship and connections with the right people. It's safe to say that without that guidance, I wouldn't have lasted more than two weeks in my attempt to break into this field.

As a novice, I'm sure you're wondering how to get started, what you'll need to do, how you'll accomplish it, and where you can find resources to help you. This post will serve as your guide.

What is data science?

Data science is an interdisciplinary field that uses scientific methods, procedures, algorithms, and systems to extract knowledge and insights from noisy, structured, and unstructured data, and to apply that knowledge and those actionable insights across a variety of application areas. Data mining, machine learning, and big data are all closely related to data science.

Why Data Science?

Data has the ability to produce magic. Industries need data to make informed decisions, and data science is what turns raw data into valuable insights. That is why industries need data science. A data scientist is the magician who knows how to use data to produce that magic.

A proficient data scientist can extract useful information from whatever data they encounter. They help the company make the right decisions, bringing the expertise in data-driven decision-making that the company needs.

Why do some find it difficult to start?

Impostor syndrome can make getting into any tech field feel overwhelming, especially if you use online platforms like Twitter and Discord. You'll see people posting about things you've never heard of, and you'll probably feel intimidated and want to give up.

## Roadmap
Language
First, you have to choose a programming language that you will use throughout your learning journey and your career. You can choose any object-oriented programming language, such as Java, C++, JavaScript, or Python. Personally, I prefer Python because of its many use cases and its ease of use. You can read more about getting started with Python in an article I published here (https://dev.to/mainashem/introduction-to-modern-python-42fo)
You also need an IDE suited to your language of choice. If you decide to use Python, I would recommend Jupyter Notebook.
A guide to getting started with Jupyter Notebook is here (https://medium.com/codingthesmartway-com-blog/getting-started-with-jupyter-notebook-for-python-4e7082bd5d46)

Database
A database is an organized, indexed collection of digital data. It can be searched, referenced, compared, altered, or otherwise handled quickly and with low processing overhead.

A database language is used to create and maintain databases. SQL is the most widely used database language, and you need it to query and manipulate your data. To learn more about SQL, you can start with Oracle's MySQL getting-started documentation here (https://docs.oracle.com/en-us/iaas/mysql-database/doc/getting-started.html)
There are also other database systems that you can use, including:
Postgres tutorial: https://www.postgresqltutorial.com/
MongoDB tutorial: https://www.mongodb.com/docs/manual/tutorial/getting-started/
DynamoDB tutorial: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Introduction.html
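
To make the idea of querying a database concrete, here is a minimal sketch that uses Python's built-in sqlite3 module, so there is nothing to install. The `students` table, its columns, and the rows are made up for illustration; a real project would connect to whichever database system you chose above.

```python
import sqlite3

# An in-memory database keeps the example self-contained (nothing is written to disk).
conn = sqlite3.connect(":memory:")

# Create a hypothetical table and load a few made-up rows.
conn.execute(
    "CREATE TABLE students ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, course TEXT NOT NULL, score REAL)"
)
conn.executemany(
    "INSERT INTO students (name, course, score) VALUES (?, ?, ?)",
    [
        ("Amina", "Data Science", 82.0),
        ("Brian", "Statistics", 64.0),
        ("Carol", "Data Science", 91.0),
        ("David", "Data Science", 58.0),
        ("Esther", "Statistics", 76.0),
    ],
)
conn.commit()

# Ask the database a question: who scored at least 70 in Data Science?
# The "?" placeholders keep values separate from the SQL text.
top_students = conn.execute(
    "SELECT name, score FROM students "
    "WHERE course = ? AND score >= ? ORDER BY score DESC",
    ("Data Science", 70),
).fetchall()

# Aggregate: the average score per course.
avg_by_course = conn.execute(
    "SELECT course, AVG(score) FROM students GROUP BY course ORDER BY course"
).fetchall()

conn.close()

print(top_students)
print(avg_by_course)
```

The same SELECT, WHERE, and GROUP BY ideas carry over to MySQL and Postgres, although each system has its own connection library and small differences in syntax.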

Python libraries for Exploratory Data Analysis
Exploratory data analysis, or EDA, is the process of becoming acquainted with your data. This includes inspecting samples of your dataset, examining its datatypes, evaluating the relationship between different combinations of variables using various charting options, and evaluating distinct variables using summary statistics.
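
As a rough illustration of those steps, the sketch below uses pandas on a tiny made-up dataset held in memory. The column names and values are assumptions for the example only, and pandas needs to be installed (for instance with `pip install pandas`).

```python
import io

import pandas as pd

# Hypothetical data: a small CSV kept in a string so the example is self-contained.
CSV_TEXT = """name,age,city,salary
Amina,29,Nairobi,5200
Brian,35,Mombasa,6100
Carol,41,Nairobi,7300
David,23,Kisumu,3900
Esther,31,Nairobi,5800
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))

# 1. Inspect a sample of the dataset.
print(df.head(3))

# 2. Examine the datatype of each column.
print(df.dtypes)

# 3. Summary statistics for the numeric columns.
print(df.describe())

# 4. Relationship between two variables (Pearson correlation).
age_salary_corr = df["age"].corr(df["salary"])
print(f"age vs salary correlation: {age_salary_corr:.2f}")

# 5. Summarise one variable across the categories of another.
mean_salary_by_city = df.groupby("city")["salary"].mean()
print(mean_salary_by_city)
```

In a Jupyter notebook you would usually follow this with a few charts, but the numeric summaries alone already show what the data looks like.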

Some of the tools used in data exploration and visualization, with resources to learn them, include:
Pandas: https://www.geeksforgeeks.org/data-analysis-visualization-python/?ref=rp
NumPy: https://medium.com/nerd-for-tech/a-complete-guide-on-numpy-for-data-science-c54f47dfef8d
Matplotlib: https://medium.com/analytics-vidhya/a-beginners-guide-to-matplotlib-for-data-visualization-and-exploration-in-python-3fb32d03c3cd

These are just a few of the resources you need to get started. There are many more available on the internet and in university programmes.
Having covered that, it is now easier to look at the data analysis process and what it entails.

Data Analysis process

  • Data Collection
    First and foremost, you need access to data. Whatever you want to do with it, having the skills to obtain it is an important first step.
    Get your feet wet with SQL if you haven't already. SQL stands for Structured Query Language, and it's all about getting information out of a database. Because the main aim is to ask a database for data, the code is actually quite straightforward.

  • Data Cleansing
    The goal of data cleaning is to get your data into a useful form for whatever analysis comes next.
    There are several parts to data cleaning: how do we manage missing values, are data types correct, is there any form of re-encoding of variables that has to be done, and so on – all of which are important to evaluate in light of the analysis ahead.

  • Data Wrangling
    Data wrangling comes after data cleaning. It is also about getting your data into the proper format so that it can be used.
    You may need to integrate a number of datasets into a single one. To do that, you might use a join or a union.

  • Exploratory Data Analysis
    Exploratory data analysis is a data exploration technique for gaining a better understanding of the data's many characteristics. It's a kind of data summary. It is one of the most crucial steps to take before starting any machine learning or deep learning work.

    Data scientists use data visualization methods during exploratory data analysis to investigate, break down, and summarize the essential properties of datasets. EDA helps them find the answers they need by discovering patterns in the data, spotting anomalies, checking assumptions, and testing hypotheses.

    Data scientists also use exploratory data analysis to see what datasets can reveal beyond formal modelling or hypothesis testing tasks. As a result, they are able to gain an in-depth understanding of the data.

  • Statistical Analysis
    This is where you get to exercise your statistics muscles once you have a decent understanding of your data. Examples include probability density functions, t-tests, linear regression, logistic regression, and hypothesis testing.
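
The steps above can be sketched end to end in plain Python. Everything here is illustrative: the sales and store records are made up, and `clean` and `welch_t` are hypothetical helper names, not functions from any library. In practice you would more likely use pandas and a statistics package, but the standard library is enough to show the flow.

```python
from math import sqrt
from statistics import mean, median, variance

# Data collection: hypothetical raw rows, as they might arrive from a CSV export.
raw_sales = [
    {"store_id": "1", "month": "Jan", "revenue": "1200"},
    {"store_id": "2", "month": "Jan", "revenue": ""},  # missing value
    {"store_id": "1", "month": "Feb", "revenue": "1500"},
    {"store_id": "2", "month": "Feb", "revenue": "900"},
    {"store_id": "3", "month": "Feb", "revenue": "700"},  # store not in `stores`
    {"store_id": "2", "month": "Mar", "revenue": "1100"},
]
stores = [
    {"store_id": 1, "city": "Nairobi"},
    {"store_id": 2, "city": "Mombasa"},
]


def clean(rows):
    """Data cleaning: drop rows with missing revenue and fix the data types."""
    cleaned = []
    for row in rows:
        try:
            revenue = float(row["revenue"])
        except ValueError:
            continue  # revenue is empty or not a number
        cleaned.append(
            {"store_id": int(row["store_id"]), "month": row["month"], "revenue": revenue}
        )
    return cleaned


sales = clean(raw_sales)

# Data wrangling: inner join of sales and stores on store_id.
city_by_store = {store["store_id"]: store["city"] for store in stores}
joined = [
    {**sale, "city": city_by_store[sale["store_id"]]}
    for sale in sales
    if sale["store_id"] in city_by_store
]

# Exploratory data analysis: summary statistics for revenue.
revenues = [row["revenue"] for row in joined]
summary = {"count": len(revenues), "mean": mean(revenues), "median": median(revenues)}
print(summary)


def welch_t(sample_a, sample_b):
    """Statistical analysis: Welch's t statistic for the difference in means."""
    standard_error = sqrt(
        variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b)
    )
    return (mean(sample_a) - mean(sample_b)) / standard_error


nairobi = [row["revenue"] for row in joined if row["city"] == "Nairobi"]
mombasa = [row["revenue"] for row in joined if row["city"] == "Mombasa"]
t_stat = welch_t(nairobi, mombasa)
print(f"Welch t statistic: {t_stat:.3f}")
```

With only two observations per city, the t statistic here is purely a demonstration of the calculation. A real analysis needs more data and a p-value before drawing any conclusion.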

With this roadmap, you'll be well on your way to becoming a data scientist, and you'll be able to apply for positions or take on projects. Once you've found your way around, you can use these data science skills as a foundation for learning how to develop machine learning models.
