12 steps for anyone looking to build a career in data science from scratch. Below you will find a practical guide and a collection of links to useful resources.
The field of data science is developing rapidly. But data science is not only neural networks: it also covers classical statistics and machine learning algorithms (which are easier to interpret in a business context) and, more broadly, everything related to analyzing, processing, and presenting information in digital form.
It cannot yet be said that there is a clear division of labor in data science: it is still a non-specialized profession. A rough analogy: just as there were once pure **Computer Scientists** (computer specialists and programmers) who understood everything related to computers, there are now **Data Scientists** who deal with everything related to data. The first signs of a move towards specialization can be seen in online education.
One way or another, a data scientist works at the intersection of several areas:
▶️ Mathematics (including linear algebra, machine learning algorithms)
▶️ Programming (e.g. Python or R; SQL is usually a minimum requirement)
▶️ Business problems (yes, apart from computer science, you should understand what business processes are and how you can improve them)
Depending on your role in the team, you will spend more time on some of these areas than on others. When choosing a direction to develop in, start from your own interests: learning will require significant resources, and without love for your work you will quickly burn out. A mathematical foundation is necessary, but your day-to-day tasks will most likely come down to applying existing tools and knowledge rather than inventing something new. As K. V. Vorontsov said in one interview:
People who know how to use ready-made algorithms are needed 50, 100, or 500 times more. It seems that the question of how to teach computer science, and the question of “more math or more engineering”, has the following answer: you need both, but mathematics should be taught to a carefully selected group of people who see themselves as creators, designers of new methods.
Stepik has suitable free video courses for each of these areas of knowledge:
Linear Algebra for Data Science in R (4 hours of lessons)
Introduction to Calculus (48 hours)
Foundations of Probability in Python (5 hours)
When watching video lectures, remember that you can fast-forward. Take notes to engage motor memory and work through the material more deeply.
Besides mathematics, you need to be able to program. Data analysts usually choose Python or R as their main language. Stepik has good courses in both languages, including some with an emphasis on data analysis:
Newcomers to data science often ask which language to choose as their main one: R, created specifically for data processing, or general-purpose Python. This is a hotly debated topic. I personally started with R (it is more popular in computational biology), but now that I know both languages, I strongly recommend starting with Python, since the transition from Python to R is smoother than the other way around.
In short: if you are planning a career in data science, I recommend mastering both languages. Knowing R concepts and libraries will keep you one step ahead of Python-only users, and vice versa. Here is what data analyst Irina Goloshchapova writes about it:
First, by combining the most powerful and stable R and Python libraries, you can in some cases improve computational efficiency or avoid reinventing the wheel when implementing statistical models.
Second, it makes projects faster and more convenient to carry out when different people on your team (or you yourself) know different languages well. A sensible combination of existing R and Python programming skills can help.
But if you want to take an easier path (though still not an easy one), Python alone is enough: you will find more courses and answers to all sorts of questions for it.
One of the most popular tools for sharing data analysis results is Jupyter notebooks:
Jupyter Notebook and JupyterLab let you combine code, Markdown text, LaTeX formulas, testing, and profiling in a single document. You can also collaborate on notebooks using Google Colab or JupyterHub.
Learn to use Git as soon as possible. Along the way you will have to choose among many models and architectural solutions, and version control is very useful for keeping track of those choices.
Plus, there are many great Data Science projects on GitHub. Remember that open source is one of the easiest ways to gain the necessary teamwork experience and contribute to a common cause.
You will naturally come across other popular tools as you progress through the courses. In Python, for example, NumPy is essential for fast processing of data arrays, pandas DataFrames are usually used for tabular data, Matplotlib or Plotly for visualization, and ready-made classes for popular machine learning models are imported from scikit-learn.
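As a rough illustration of how two of these tools fit together, here is a minimal sketch that assumes NumPy and pandas are installed. The measurements, column names, and group labels are invented for the example; Matplotlib and scikit-learn are left out to keep it short.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements, invented for this illustration.
heights_cm = np.array([160.0, 172.0, 181.0, 168.0])
weights_kg = np.array([55.0, 70.0, 90.0, 62.0])

# NumPy applies arithmetic element-wise to whole arrays, with no explicit loop.
bmi = weights_kg / (heights_cm / 100.0) ** 2

# A pandas DataFrame holds the same data as labeled columns.
people = pd.DataFrame(
    {
        "group": ["a", "b", "b", "a"],
        "height_cm": heights_cm,
        "weight_kg": weights_kg,
        "bmi": bmi,
    }
)

# Group rows by a label and aggregate each group.
mean_weight = people.groupby("group")["weight_kg"].mean()

print(people.round(1))
print(mean_weight)
```

The array step covers the numeric work and the DataFrame step covers the tabular bookkeeping; in a real project the resulting columns would typically be passed on to a plotting library or a model.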
Few courses focus on this, but in practice, data is usually stored in databases — SQL or NoSQL. For further work, you will need to learn how to communicate with them:
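To give a feel for what talking to a SQL database looks like from Python, here is a minimal sketch using the standard-library `sqlite3` module. The `sales` table, its columns, and the `total_by_region` helper are hypothetical and exist only for this illustration; a production database would usually need its own driver and connection settings.

```python
import sqlite3

# Hypothetical in-memory database and table, used purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 30.5), ("east", 55.0)],
)
conn.commit()


def total_by_region(connection):
    """Return (region, total_amount) rows, largest total first."""
    query = """
        SELECT region, SUM(amount) AS total
        FROM sales
        GROUP BY region
        ORDER BY total DESC
    """
    return connection.execute(query).fetchall()


totals = total_by_region(conn)
for region, total in totals:
    print(f"{region}: {total:.1f}")
```

The aggregation is done by the database rather than in Python, which is usually the point of learning SQL: only the summarized rows travel back to your analysis code.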
For deep learning, you need to get familiar with a framework such as TensorFlow or PyTorch. There are others; we compared them in the article “Write your first Generative Adversarial Network Model on PyTorch”.
Andrew Ng’s Machine Learning Course on Coursera is one of the most popular MOOCs out there. It is worth taking if only because other, more advanced courses often refer to it. However, it uses Octave/MATLAB instead of the more common Python and R.
Leskovec et al. Mining of Massive Datasets. The material is broken down by chapter: PDF, exercises, presentations, videos.
Courses on DataCamp
Harvard Data Science Course (edX)
Dive into Deep Learning: Free Interactive Book with Code, Math and Discussion http://d2l.ai
- Hastie et al. The Elements of Statistical Learning
- Shalev-Shwartz and Ben-David. Understanding Machine Learning: From Theory to Algorithms
- David Barber. Bayesian Reasoning and Machine Learning
- Tom Mitchell. Machine Learning
- Devroye et al. A Probabilistic Theory of Pattern Recognition
- Neatly designed editions with easily copyable code: R in Action: Data Analysis and Graphics with R, and Machine Learning in Action
A lot of interesting things can be learned from the English-language news aggregators from the world of data science:
Register on Kaggle. Not only is it the most famous machine learning competition platform with cash prizes, but it is also a large community with a registry of datasets, Jupyter notebooks, mini-courses, and discussions. A Kaggle ranking on your resume can earn you extra credit in an interview.
Data science is an incredibly broad interdisciplinary field, and solving specific problems requires specialized skills. Once you are familiar with Kaggle, it will become clearer where you have gaps in the knowledge that is in demand.
Also pay attention to the following courses:
Computer graphics: the basics (useful for working with models that process images).
YouTube channels also come in handy:
On the YouTube channel of the Computer Science Center, courses on specialized topics are conveniently organized into playlists:
- Machine Learning (second part)
- Image and Video Analysis (second part)
- Introduction to Natural Language Processing
- Data analysis in Python in examples and tasks (continued)
- Data analysis in R
- Technologies for storing and processing large amounts of data
- Mathematical statistics
Don’t stop learning. Browse the top posts and sidebars of subreddits related to machine learning:
Use your new data science knowledge to benefit yourself and others. Create something that will make others say “wow”! Lots of project ideas are listed in **awesome-ai-usecases**, **51 toy data proble**, and **practical-pandas-projects**.
You can also start not from a project idea but from an interesting dataset. Popular dataset registries:
Lots of discussions with project ideas can be found on Quora:
What Data Science Problems Can Be Solved Over the Weekend by One Programmer? I am studying Machine Learning and Statistics and am looking for something socially significant using public datasets and APIs
Create a public repository on GitHub for each project. Polish the results and share them on your blog and with the community. Contribute to side projects, and post your ideas and thoughts. All this will help you build a portfolio and get to know people working on related tasks.
The main languages of data science are not Python or R, but English and the language of mathematics.
Preprints of articles are published on the arXiv website. The most useful sections for data scientists:
It is simply impossible to keep track of all publications. The subreddits listed above will help you pick out the most important texts (since the author became the head of the AI department at Tesla, the site has been breaking more often, but it is still the best tool). There is also an annotated list of articles, as well as webinar recordings on the Kaggle YouTube channel that break down scientific papers on data science algorithms.
Data science is an in-demand and highly competitive profession. But even interview results are turned into data by community members. There are many lists of questions to help you prepare for a data scientist interview:
This year it is more difficult, but we hope that summer schools and internships will return soon:
Be sure to use your data mining skills to analyze the job market: find out which skills appear most often in job postings and hone those first. Estimate how much income you can expect, taking into account local living expenses, rent, and the cost of moving to another city.
Share your project with the data science community, or find one there to join. Prepare a talk and speak at a local meetup. Start a blog where you share your finds, your own ideas, and your repositories.
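The skill-frequency analysis suggested above can be sketched in a few lines. The postings, the skill vocabulary, and the `count_skills` helper below are all hypothetical; real input would come from whatever job board export or API you have access to.

```python
import re
from collections import Counter

# Hypothetical job postings, invented for this illustration.
postings = [
    "Data Scientist: Python, SQL, machine learning, pandas",
    "Analyst: SQL, Excel, Tableau",
    "ML Engineer: Python, PyTorch, SQL, Docker",
]

# Hypothetical vocabulary of skills to look for.
skills = ["python", "sql", "r", "pandas", "pytorch", "excel", "tableau", "docker"]


def count_skills(texts, vocabulary):
    """Count in how many postings each skill appears as a whole word."""
    counts = Counter()
    for text in texts:
        # Split into lowercase word tokens so that "r" does not match inside "pytorch".
        tokens = set(re.findall(r"[a-z+#]+", text.lower()))
        counts.update(skill for skill in vocabulary if skill in tokens)
    return counts


skill_counts = count_skills(postings, skills)
for skill, n in skill_counts.most_common():
    print(f"{skill}: {n}")
```

Matching on whole tokens rather than substrings is a deliberate choice here, since short skill names such as "r" would otherwise be counted in almost every posting.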
Last but not least, enjoy how your skills help make the world a better place!