👨‍🎓️📊 Data Scientist— 12 Steps From Beginner to Pro

Mikhail Raevskiy ・13 min read

12 steps for those looking to build a career in Data Science from scratch. Below is a practical guide with links to useful resources.

Source: [proglib.io](https://proglib.io/)

1. Decide who you want to become 💭

The field of data science is developing rapidly. But data science is not only neural networks: it also covers classical statistics and machine learning algorithms (which are easier to explain in terms of business processes), and more broadly everything related to the analysis, processing, and presentation of information in digital form.

It cannot yet be said that there is a clear division of labor in Data Science: this is still a non-specialized profession. A rough analogy: just as there used to be pure **Computer Scientists** (computer scientists and programmers) who understood everything related to computers, so now there are **Data Scientists** who deal with everything related to data. The first sign of a move towards specialization can be seen in online education.

One way or another, a data scientist works at the intersection of several areas:

  • ▶️ Mathematics (including linear algebra, machine learning algorithms)

  • ▶️ Programming (e.g. Python or R; SQL is usually a minimum requirement)

  • ▶️ Business problems (yes, apart from Computer Science, you should understand what business processes are and how you can improve them)

Depending on your role in the team, you will have to do more of some of these things than others. When choosing a direction, start from your own interests: learning will require significant resources, and without love for your work you will quickly burn out. A mathematical base is necessary, but your own tasks will most likely come down to applying existing tools and knowledge rather than inventing something new. As K. V. Vorontsov said in one interview:

People who know how to use ready-made algorithms are needed 50, 100, 500 times more. It seems that the problem of how to teach Computer Science and the problem of “more math or more engineering” has the following answer: you need both, but mathematics has to be taught to a carefully selected group of people who see themselves as creators, designers of new methods.

2. Build up your math base ➕

If you want to truly understand machine learning algorithms, you first need to understand linear algebra, multivariable calculus, probability theory, and mathematical statistics.

Stepik has suitable free video courses for each of these areas:

If you need more illustrations and visualization, I highly recommend the wonderful channel 3Blue1Brown. It has playlists on linear algebra, calculus, and differential equations.

By the way, there is a detailed course of 175 videos on multivariable calculus on the **Khan Academy channel**.

When watching video lectures, remember that you can speed up playback. To engage motor memory and work through the material more deeply, take notes.

3. Learn to program 👨‍💻️

Besides mathematics, you need to be able to program. Usually, Python or R is chosen as the main language for data analysis. Stepik has good courses in both languages, including some with an emphasis on data analysis:

Newcomers to Data Science often ask which language to choose as their main one: R, created specifically for data processing, or general-purpose Python. This is a hot topic. I personally started with R (people in computational biology prefer it), but now I know both languages and highly recommend starting with Python, since the transition from Python to R is smoother than the other way around.

In short: if you are planning a career in Data Science, I recommend you master both languages. Knowing R concepts and libraries will keep you one step ahead of Python-only users, and vice versa. Here’s how data analyst Irina Goloshchapova writes about it:

By combining the most powerful and stable R and Python libraries, in some cases you can improve the efficiency of calculations or avoid reinventing the wheel when implementing statistical models.
Secondly, it makes projects faster and more convenient to carry out if different people in your team (or you yourself) know different languages well. A reasonable combination of existing R and Python programming skills can help.

But if you want to take an easier, though still not simple, route, then Python alone is enough: you will find more courses and answers to all sorts of questions about it.

4. Learn to use the tools 🛠️

One of the most popular tools for sharing data analysis results is Jupyter notebooks:

Jupyter Notebooks and the JupyterLab platform allow you to combine code, Markdown text, LaTeX formulas, testing, and profiling in a single document. Alternatively, you can collaborate on notebooks using Google Colab or JupyterHub.

Learn to use Git as soon as possible. Along the way you will have to choose between a variety of models and architectural solutions, and version control is very useful here.

Plus, there are many great Data Science projects on GitHub. Remember that open source is one of the easiest ways to gain the necessary teamwork experience and contribute to a common cause.

You will naturally come across other popular tools as you progress through the courses. For example, in Python you need NumPy for fast array processing, tabular data is usually handled with Pandas dataframes, Matplotlib or Plotly is used for visualization, and ready-made classes for popular machine learning models are imported from Scikit-learn.
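To make the division of labor between these libraries concrete, here is a minimal sketch. The data is synthetic and noise-free (an assumption made purely for illustration), and the column names `x` and `y` are hypothetical. It assumes NumPy, pandas, and scikit-learn are installed.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# NumPy: build an array of inputs (synthetic data, assumed for illustration).
x = np.arange(10, dtype=float)

# pandas: hold the data as a table; here y follows y = 2x + 1 exactly.
df = pd.DataFrame({"x": x, "y": 2.0 * x + 1.0})

# scikit-learn: import a ready-made model class and fit it to the table.
model = LinearRegression()
model.fit(df[["x"]], df["y"])

slope = float(model.coef_[0])
intercept = float(model.intercept_)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

On real data the fit would not be exact, and you would typically plot the result with Matplotlib or Plotly before trusting it.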

Few courses focus on this, but in practice data is usually stored in databases, either SQL or NoSQL. For further work, you will need to learn how to communicate with them.

For deep learning, you need to get familiar with frameworks such as TensorFlow or PyTorch. There are others; we compared them in the article “Write your first Generative Adversarial Network Model on PyTorch”.
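As a small illustration of talking to a SQL database from Python, the sketch below uses the standard-library `sqlite3` module with an in-memory database. The `listings` table, its columns, and the sample rows are hypothetical; a production database would need its own driver and connection settings.

```python
import sqlite3


def average_price_by_city(rows):
    """Load (city, price) rows into an in-memory SQLite table and aggregate with SQL."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE listings (city TEXT, price REAL)")
        # Parameterized inserts keep values separate from the SQL text.
        conn.executemany("INSERT INTO listings VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT city, AVG(price) FROM listings GROUP BY city ORDER BY city"
        )
        return cursor.fetchall()
    finally:
        conn.close()


# Hypothetical sample data.
sample_rows = [("Berlin", 100.0), ("Berlin", 200.0), ("Paris", 300.0)]
result = average_price_by_city(sample_rows)
print(result)
```

Doing the aggregation in SQL rather than in Python means only the summarized rows leave the database, which matters once tables no longer fit in memory.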

5. Take as many Data Science courses as you can 🎓

Courses:

Textbooks:

6. Join the Open Data Science community 👥

A lot of interesting things can be learned from English-language news aggregators covering the world of data science:

7. Take part in competitions 🤼

Register on Kaggle. Not only is it the most famous machine learning competition platform with cash prizes, but it is also a large community with a registry of datasets, Jupyter notebooks, mini-courses, and discussions. Listing your Kaggle ranking on your resume can earn you extra credit in an interview.

8. Explore specific Data Science questions 👁️‍🗨️

Data science is an incredibly broad interdisciplinary field, and special skills are required to solve specific problems. After familiarizing yourself with Kaggle, it will become clearer which in-demand skills you are missing.

Also pay attention to the following courses:

YouTube channels also come in handy:

On the YouTube channel of the Computer Science Center, courses in special sections are conveniently organized into playlists:

Don’t stop learning. Browse the top posts and sidebars of subreddits related to machine learning:

9. At the end of each course, do a project 🏗️

Use your new Data Science knowledge to benefit yourself and others. Create something that will make others say “wow”! Lots of project ideas are listed in **awesome-ai-usecases**, **51 toy data proble**, and **practical-pandas-projects**.

You can also start not from a project idea but from an interesting dataset. List of popular registries:

Lots of discussions with project ideas can be found on Quora:

Create a public repository on GitHub for each project. Polish the results and share them on your blog and with the community. Contribute to side projects, post your ideas and thoughts. All this will help you build a portfolio and get to know people working on related tasks.

10. Read scientific articles 🔬

The main languages of data science are not Python or R, but English and the language of mathematics.

Preprints of articles are published on the arXiv website. The most useful sections for data scientists:

It is simply impossible to keep track of all publications. The subreddits mentioned above will help you pick out the most important texts (since the author became the head of the AI department at Tesla, the site began to break more often, but it’s still the best tool). There is also a list of articles with comments, and recordings of webinars on the Kaggle YouTube channel that walk through scientific articles on data science algorithms.

11. Take a Data Science Internship / Job 🕴

Data Science is an in-demand, highly competitive profession. But community members turn even interview results into data. There are many lists of questions to prepare for a data scientist interview:

This year it is more difficult, but we hope that summer schools and internships will return soon:

Be sure to use your data mining skills to analyze the job market: find out which skills appear in job postings most often so you can hone them. Estimate how much income you can expect, taking into account local living costs, rent, and moving to another city.
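One simple way to start such an analysis is to count how many postings mention each skill. The sketch below is only an illustration: the `SKILLS` list and the sample postings are hypothetical, and real postings would have to be collected from a job site you are allowed to use.

```python
import re
from collections import Counter

# Hypothetical list of skills to look for (lowercase).
SKILLS = ["python", "sql", "r", "pandas", "pytorch"]


def count_skills(postings, skills=SKILLS):
    """Count how many postings mention each skill at least once."""
    counts = Counter()
    for text in postings:
        # Split into lowercase word tokens so that "r" does not match inside "required".
        tokens = set(re.findall(r"[a-z+#]+", text.lower()))
        for skill in skills:
            if skill in tokens:
                counts[skill] += 1
    return counts


# Hypothetical sample postings.
postings = [
    "Python, SQL and pandas required",
    "R or Python; SQL a plus",
    "PyTorch and Python",
]
skill_counts = count_skills(postings)
print(skill_counts.most_common(2))
```

Token matching like this misses multi-word skills such as "machine learning"; for those you would need phrase matching or a proper text-processing library.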

12. Share your experience with the community 📢

Share your project with the Data Science community, or find one there to join. Prepare a talk and speak at a local meetup. Start a blog where you share your finds, your own ideas, and repositories.

Last but not least, enjoy how your skills help make the world a better place!

Discussion (1)

Dennis1107

Thanks for all the good resources! There is one thing I would like to add for starters: don't feel overwhelmed, just get started and learn on the go. There are so many problems in Data Science that it is not possible to learn everything beforehand. Get the basics and learn the rest when you need it.