- There is a lot to learn in data science.
- We can group the technologies by the subfields of data science.
- There are a few key technologies for each subfield to focus on.
- Creating this personal tech stack list was a fun and useful exercise.
If we tried to survey the technologies used by Data Scientists, we might get a picture like this:
This list is in no way comprehensive (it is already filtered by my personal interests). On top of that, the list will change from year to year. Keeping up with everything is impossible. Thankfully, no Data Scientist needs to know or use all of these tools.
But since we should strive to be T-shaped people, we should at least learn a good chunk of these technologies, right? But where do we begin? How much should we learn, and which technologies? First, we should briefly discuss the nature of the profession itself.
Data science is still relatively young and continues to evolve. Once it was popularised by the tagline "sexiest job of the 21st century", the profession attracted many people.
What may have begun as an application of statistics to business problems is now a name that encompasses big data engineering, visualisation, machine learning, deep learning and artificial intelligence. This rapid evolution is due in part to the breadth of areas in which data science can be applied, and in part to how quickly the technologies themselves have developed. The number of skills that a Data Scientist must possess has grown along with the nebulous definition of the job.
I imagine that a lone Data Scientist in a small organisation has a different set of tasks from a Data Scientist in a team within a large organisation. I also imagine that the exact job depends very much on the industry and the nature of the organisation. Combined with the rapid decrease in job tenure (or increase in job mobility), this variance in the job description requires the practising Data Scientist to keep up with a large number of skills and technologies.
If we had to group some of the subfields of data science, they would look something like this:
- Data Analysis: This part of the job is about understanding the data. It involves data wrangling, exploratory data analysis, and "explanatory" data analysis. In a larger team, dedicated Data Analysts will perform these tasks.
- Data Visualisation: This part of the job is about communicating the data, usually to a non-technical audience. In a larger team, dedicated Business Intelligence Analysts will perform these tasks, although this can be a part of the Data Analyst's duties.
- Machine Learning: This part of the profession is probably where the "sexy" comes from. It uses regression, classification, and clustering to solve a wide range of problems, including computer vision and natural language processing. Sometimes, the people who develop new and better ways of solving problems through machine learning are called Machine Learning Scientists, and the people who implement the solutions are called Machine Learning Engineers.
- Data Engineering: This part of the field has become so important that Data Engineers are more in demand than Data Scientists. To do data science, we need data and tools. Making these available is what data engineering is about.
- Cloud DevOps: More and more, both the data and the tools required to do data science are being made available on the cloud. Navigating a large number of cloud products, managing the scalable infrastructure, and managing the access and security are the duties of the Cloud DevOps Engineers.
- Web Development: This part might seem out of place, but if we consider end-to-end data science projects, then the web is most likely the prototyping or deployment solution. In larger teams, there may be a team of Front-End and Full-Stack Developers.
Sure, these groupings are not clear-cut and there are overlaps. At the very least, they give me a way to organise things. As I noted in the brief descriptions, these roles can be carried out by dedicated specialists in the team. But in a small organisation, it could be up to the one-person generalist data scientist to carry out all of these functions.
Whether or not we need to perform all of these roles, it is helpful to understand a little about what other people in the team do. Or perhaps you are looking to switch your career track, say from a data analyst to a data engineer or from a web developer to a machine learning engineer, in which case you will benefit from knowing something about everything.
Coming back to the tech stack, we can (loosely) group the technologies according to these roles.
EDIT: Notes on what these are have been added towards the end of the post.
Scanning through job descriptions and MOOCs, we can probably narrow down the very employable stack to something like the following:
Even this is too much to truly master. Even if I have touch points with all of these technologies, I wouldn't need to learn all of them. I would either be working with specialists or only use them to the extent that can be handled by reading through the documentation.
But I think I can remove some "duplicates" or remove some from my personal "core" stack from a learning standpoint.
After much consideration, my data science "core" stack for the coming years will look something like this:
Even though this is just a silly exercise, I still struggled to come up with the final, and personal, tech stack. I am not sure that this is the ideal minimalist stack. And even if it were, I wouldn't be able to use many of these at my current job. I would probably use this "data science core stack" if I were to embark on a personal project or start my own tech company. There are other technologies that I have learned and will continue to keep up with (Tidyverse, Spark, and Airflow), and additional technologies that I will learn a little about this year (Vue perhaps), knowing that I probably won't use them. I also recognise that one must choose the right tool for the job and that this list may look different in a couple of years' time.
Nevertheless, I think doing this kind of exercise once in a while is helpful for reassessing the pros and cons of each technology and getting a feel for the overall landscape. I have probably read over a hundred blog posts and watched a hundred videos covering the latest developments in, and commentary about, the technologies listed in the first diagram and many more. This in itself was a valuable exercise. It also helped me narrow my focus to what I wanted to learn and why I should prioritise it, because realistically, I won't be able to (nor do I need to) learn all of them.
So, what's your tech stack?
I've jotted down some thoughts that went through my head as I was reducing my tech stack. These are all personal opinions, but the final tech stack is a personal one. So I think that it's okay.
- There is a lot of knowledge that is a co-requisite for some packages like statsmodels, scikit-learn, and TensorFlow. We shouldn't forget that there are whole fields of mathematics such as probability, statistics, linear algebra, vector calculus, econometrics, and machine learning algorithms that form the foundations of data analysis and machine learning. These take up a very big part of the skillset of a data scientist and therefore provide motivation to reduce what is included in the core tech stack.
- Contrary to some popular narratives, data wrangling and cleaning do not make up 80% of a data scientist's job. Half of the job is meeting various people to understand requirements, communicate insights, and educate them about the benefits, along with administrative duties, data governance duties, and professional development. The technical parts are perhaps half the story, and professional Data Scientists should also develop their management, communication, and design thinking skills. This is yet another reason to reduce technology fatigue by minimising the tech stack.
- I think that some technologies are similar enough that learning one provides the transferable skills required to learn the others. For example, knowing Tableau will probably make learning Power BI much easier. So we can learn one and pick up the nuances of the other if the job requires it.
- Similarly, where two different technologies essentially do the same job but are just different implementations, we can consider them "duplicates". For example, PyTorch and TensorFlow are both very good deep learning packages, so picking either one would be a good choice.
- Some competing technologies are all worth keeping for different reasons. ggplot2 is the de facto visualisation tool for R. I have read some blog posts written in 2019 that still claim that R has better visualisation capabilities than Python. I think this is one of the reasons why we should take some time once in a while to update our knowledge of the data science tech landscape. Altair is arguably a better implementation of the grammar of graphics than ggplot2. But Altair uses Vega-Lite (built on top of D3), which is very suitable for the web. For "print", I think that seaborn is the best. Within notebook environments, I think that Plotly Express is a very good candidate. At this point, I don't think I could choose between Altair, seaborn, and Plotly Express. All three are declarative and really easy to learn and use. I would probably continue to use all three. I would consider Altair and seaborn to be a part of the pandas ecosystem, and Plotly Express to be a part of the Plotly ecosystem (together with Plotly Dash).
- Some technologies are easier to learn than others. For example, React and Angular are powerful front-end frameworks (or libraries), but may not be the easiest to master. Some say that Vue, another front-end framework, takes the best of both styles and is easier to learn. Given that I am not looking to specialise in web development, I think Vue or even Svelte will suffice.
- In fact, some technologies are so easy that they are almost not worth "learning" or keeping up to date with that much: HTML, CSS, Excel and Tableau, for example. I think I can put these under the "assumed skills" category.
- One could say the same about SQL, but I think there are enough dialects within SQL, and enough NoSQL "languages" built to resemble SQL, that it is worth continuing to read up on. In my diagram, I am including all these peripheral and related things within "SQL".
- Speaking of including all related things, much like Tidyverse contains a lot of packages within it, I am including a lot of related packages within the pandas logo in the diagram above: NumPy, SciPy, matplotlib, seaborn, Altair, pandas-profiling, and pyjanitor for example. But I have separated out statsmodels because of the magnitude of co-requisite knowledge required to wield this package.
- Some technologies are "closer together". For example, while R may be better for statistics and econometrics, Python's statsmodels has caught up significantly. Since Python is useful for web development as well as machine learning, an argument can be made for using statsmodels over R. This is a hard balance to strike. On the one hand, there are economic gains to be made by minimising the number of languages to strive for mastery in. On the other hand, R still appears to be more sophisticated. And in my experience, learning Tidyverse (in R) helped me become a better pandas (in Python) user. In this post, I am trying to be more economical, so if I had to pick, I would drop R and focus on Python. I would still happily use R if required for a specific job.
- The "cloud" technologies are developing too rapidly, and I am not sure about learning a core stack just in case a future job requires me to know how to use them. With the release of products like Google Cloud Run, I am not sure whether Kubernetes would be worth learning for the generalist data scientist.
Plus a whole bunch of packages not included in the image. Just use "Tidyverse". R tends to have an individual library for just about any statistical task. While I cannot list them all, Tidyverse should be central.
The pandas ecosystem is Python's equivalent of R's Tidyverse.
**Unlike R (ggplot), viz tools in Python do not have a king (yet). I included Plotly in the pandas ecosystem in the list, but it is an ecosystem of its own and extends beyond just Python. Sorry.
- GraphQL (Graphene, Apollo etc.)
- Keras (absorbed into TensorFlow)