All signs point towards an auspicious future for data engineering.
Dice's 2020 tech jobs report cites data engineering as the fastest-growing tech job of 2020, increasing by a staggering 50%, while data science roles grew by only 10%. You can rest assured that the influx of data engineering roles will not regress anytime soon. To bolster this supposition, the International Data Group (IDG) predicts that, at its forecast compound annual growth rate (CAGR), the data created from 2021 to 2024 will outweigh the total data created over the entirety of the last thirty years. Yes, you read that correctly: thirty years, dating back far before the origins of Facebook, YouTube, and Amazon.
If you are still not sold on the prospect of data engineering, let's look at earning potential. As of May 9th, 2021, with over eight thousand salaries reported, Indeed indicates that data engineers make $10,000 more per year than data scientists. The benefits of data engineering do not stop at pay alone, either: a study from The New Stack indicates that there is less competition for data engineering roles than for other tech positions.
The New Stack found that, across LinkedIn and Indeed job posts, every open data science position drew 4.76 viable applicants, while data engineering roles saw only 2.53 suitable competitors per opening, nearly doubling the chances that a qualified candidate lands a data engineering role.
We have established that data engineering is a well-paying position, in one of the fastest-growing tech fields, with relatively low competition. What is not to love?
However, merely graduating with a degree in a related field will not, on its own, qualify you for a data engineering position.
You'll need related real-world experience to fine-tune your hard skills. When it comes to your future job search, one of the best ways to develop and convey these skills is through data engineering portfolio projects. In this article, we will review five potential project ideas along with data sources. Before we cover the projects, though, you need to know which skills to showcase in them, so we will first explore the most in-demand skill sets for data engineers.
When you look to build a data engineering project, there are a few key areas you should focus on:
- Multiple Types Of Data Sources (APIs, Webpages, CSVs, JSON, etc.)
- Data Ingestion
- Data Storage
- Data Visualization (So you have something to show for your efforts).
- Use Of Multiple Tools (Even if some tools may not be the perfect solution, why not experiment with Kinesis or Spark to become familiar with them?)
Each of these areas will help you, as a data engineer, improve your skills and understand the data pipeline as a whole. In particular, creating some sort of end visual, especially if it involves building a basic website to host it, can be a fun way to show off your projects.
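To make those focus areas concrete, here is a minimal sketch of a pipeline touching three of them: ingesting a data source, storing it, and producing a visualization-ready summary. Everything here is invented for illustration; the JSON payload stands in for an API response, and SQLite stands in for whatever storage layer your project actually uses.

```python
import json
import sqlite3

# Hypothetical raw payload standing in for an API response (data source).
raw = json.dumps([
    {"symbol": "ABC", "price": 101.5, "day": "2021-05-01"},
    {"symbol": "ABC", "price": 103.2, "day": "2021-05-02"},
    {"symbol": "XYZ", "price": 55.0, "day": "2021-05-01"},
])

# Ingestion: parse the source into records.
records = json.loads(raw)

# Storage: load the records into a queryable table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (symbol TEXT, price REAL, day TEXT)")
conn.executemany("INSERT INTO prices VALUES (:symbol, :price, :day)", records)

# Output: a per-symbol summary ready to hand to a chart or dashboard.
summary = conn.execute(
    "SELECT symbol, AVG(price) FROM prices GROUP BY symbol ORDER BY symbol"
).fetchall()
print(summary)
```

Swapping SQLite for BigQuery or the JSON string for a live API call changes the tools but not the shape of the pipeline, which is exactly why small sketches like this scale up into portfolio projects.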
But enough talk, let's dig into some ideas for your data engineering projects.
With the expansion of cryptocurrency exchanges and the rise and fall of GameStop stock, trading has become a hot topic, gaining substantial outsider interest.
If you have also developed a zeal for the trading markets, I would suggest developing a project similar to CashTag, built by an engineer currently working at Reddit. The goal of this project was to develop a "Big data pipeline for user sentiment analysis on the US stock market". In short, the project scrapes social media with the intent of predicting how people may feel about particular stocks in real time. Below is a representation of the workflow used in this project.
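To give a feel for the sentiment-scoring stage of a pipeline like this, here is a toy sketch. CashTag itself uses proper NLP tooling; the keyword lists, tickers, and sample posts below are all invented, and the scoring is deliberately naive.

```python
import re
from collections import defaultdict

# Invented keyword lists; a real pipeline would use a trained model.
BULLISH = {"moon", "buy", "bullish", "calls"}
BEARISH = {"crash", "sell", "bearish", "puts"}

def score_post(text: str) -> int:
    # +1 for each bullish word, -1 for each bearish word.
    words = re.findall(r"[a-z]+", text.lower())
    return sum(w in BULLISH for w in words) - sum(w in BEARISH for w in words)

# Made-up (ticker, post) pairs standing in for scraped social media data.
posts = [
    ("GME", "To the moon! Buy buy buy"),
    ("GME", "Time to sell before the crash"),
    ("AMC", "Feeling bearish, grabbing puts"),
]

# Aggregate a running sentiment score per ticker, as a streaming
# pipeline would over a feed of scraped posts.
sentiment = defaultdict(int)
for ticker, text in posts:
    sentiment[ticker] += score_post(text)

print(dict(sentiment))  # {'GME': 2, 'AMC': -2}
```

In a real project this stage would sit between the scraper and the storage layer, with the keyword heuristic replaced by an actual sentiment model.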
This project is well documented and can serve as a base of inspiration for your own, which you can adapt to accommodate your interests.
To engage with some new technologies, you should try a project like sspaeti's 20-minute data engineering project. The goal of this project is to develop a tool to optimize your choice of house or rental property.
This project collects data using web scraping tools such as Beautiful Soup and Scrapy. Creating Python scripts that interact with HTML is something you should be exposed to as a data engineer, and web scraping is a great way to learn. Interestingly, this project covers both Delta Lake and Kubernetes, which are hot topics at the moment.
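The project itself uses Beautiful Soup and Scrapy, but the core idea of scraping, pulling structured fields out of markup, can be sketched with the standard library alone. The listing HTML below is a made-up stand-in for a real-estate page, and the class names are assumptions for the example.

```python
from html.parser import HTMLParser

# Invented markup standing in for a scraped real-estate listings page.
PAGE = """
<ul>
  <li class="listing"><span class="price">250000</span></li>
  <li class="listing"><span class="price">310000</span></li>
</ul>
"""

class PriceParser(HTMLParser):
    """Collect the text inside every <span class="price"> element."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # Flag when we enter a price span.
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        if self.in_price and data.strip():
            self.prices.append(int(data.strip()))

parser = PriceParser()
parser.feed(PAGE)
print(parser.prices)  # [250000, 310000]
```

Beautiful Soup wraps this same event-driven parsing in a friendlier API (e.g. selecting elements by class), which is why it is usually the better choice once you move past toy pages.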
Lastly, no good data engineering project is complete without a clean UI to show your work. This project dives into data visualization with Superset, and everything is orchestrated together with Dagster. The sheer variety of tools used in this project makes it perfect for a portfolio.
What if you could analyze all, or at least some, of the public GitHub repos? What questions would you ask?
With so much data, there are plenty of opportunities to work on some form of analytical project. Felipe, for example, analyzed questions like:
- Tabs vs Spaces?
- Which programming languages do developers commit to during the weekend?
- Analyzing GitHub Repos for comments and questions
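The first question on that list, tabs vs spaces, reduces to a simple aggregation once the source files are in hand. The snippet below is an invented miniature of that analysis over two inline samples rather than real repositories.

```python
# Made-up source files standing in for files pulled from GitHub repos.
samples = {
    "main.py": "def f():\n    return 1\n",    # space-indented
    "util.c": "int f() {\n\treturn 1;\n}\n",  # tab-indented
}

def classify(source: str) -> str:
    """Label a file by its dominant indentation character."""
    lines = source.splitlines()
    tabs = sum(line.startswith("\t") for line in lines)
    spaces = sum(line.startswith(" ") for line in lines)
    if tabs > spaces:
        return "tabs"
    if spaces > tabs:
        return "spaces"
    return "mixed"

verdicts = {name: classify(src) for name, src in samples.items()}
print(verdicts)  # {'main.py': 'spaces', 'util.c': 'tabs'}
```

At GitHub scale the same per-file classification would run as a SQL aggregation over the public data set rather than a Python loop, but the logic is identical.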
There are so many different angles you could take on this project, and it affords you, the data engineer, a lot of creativity in how you think about the data.
You can analyze the source code of 2.8 million projects.
Maybe you can write an article like What StackOverflow Code Snippets Can We Find In GitHub?
In addition, this project idea points out that there are plenty of interesting data sets out there on platforms like GCP and AWS. So if you don't feel like scraping data from an API, you can always sharpen your analytical chops on the hundreds of data sets these two cloud providers have to offer.
Extending beyond stock prediction, PredictIt makes market data available via an API. If you are unfamiliar with PredictIt, it is a New Zealand-based online prediction market that offers exchanges on global political and financial events. You may be familiar with the betting odds reported during the last election cycle; when those numbers are cited, they come from markets similar to PredictIt.
Using their live API data, you can cross-reference price spikes with the news, potentially tying in scraped data from social media, much like the CashTag project discussed earlier. You could find a way to tie online political chatter to a dollar value.
Of course, why stop there? Why not try to create a data storage system using something like BigQuery and add in other data like tweets, news, and so on?
Then spend time normalizing that data and trying to create tables that represent connections between all these disparate data sources.
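As a sketch of that normalization step: once disparate feeds share a common key, connecting them is a join. The article suggests BigQuery; SQLite works for a local sketch of the same idea. Both feeds below (contract prices and social-media mention counts) are invented sample data, normalized onto a shared (day, market) key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two disparate sources normalized into tables with a shared key.
conn.execute("CREATE TABLE market_prices (day TEXT, market TEXT, price REAL)")
conn.execute("CREATE TABLE chatter (day TEXT, market TEXT, mentions INTEGER)")

# Invented PredictIt-style contract prices.
conn.executemany("INSERT INTO market_prices VALUES (?, ?, ?)", [
    ("2021-05-01", "ELECTION", 0.62),
    ("2021-05-02", "ELECTION", 0.71),
])
# Invented social-media mention counts for the same market.
conn.executemany("INSERT INTO chatter VALUES (?, ?, ?)", [
    ("2021-05-01", "ELECTION", 120),
    ("2021-05-02", "ELECTION", 480),
])

# Join the normalized tables: does a spike in chatter line up with a
# move in the contract price?
rows = conn.execute("""
    SELECT p.day, p.price, c.mentions
    FROM market_prices p
    JOIN chatter c ON p.day = c.day AND p.market = c.market
    ORDER BY p.day
""").fetchall()
print(rows)
```

The hard engineering work in a real version is upstream of this query: deciding on the shared schema and cleaning each messy source until it fits.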
Now that would be a fun and challenging data engineering project.
Another interesting project comes from Dr. Usama Hussain, who measured the rate of inflation by tracking the change in prices of goods and services online. Considering that the BBC reports the United States has seen its largest inflation rate since 2008, this is a timely topic.
In this project, the author used petabytes of web page data contained in the Common Crawl.
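Stripped to its core, inferring inflation from scraped prices is a price-index calculation: compare what a fixed basket of goods costs now versus before. The basket and prices below are made up purely for illustration and are not from Dr. Hussain's data.

```python
# Invented basket prices for two periods (e.g. scraped a year apart).
basket_then = {"bread": 2.00, "milk": 3.00, "fuel": 2.50}
basket_now = {"bread": 2.10, "milk": 3.12, "fuel": 2.75}

def inflation_rate(old: dict, new: dict) -> float:
    """Percentage change in the total cost of a fixed basket."""
    old_cost = sum(old.values())
    new_cost = sum(new[item] for item in old)  # same basket, new prices
    return (new_cost / old_cost - 1) * 100

rate = inflation_rate(basket_then, basket_now)
print(f"{rate:.1f}%")  # 6.3%
```

The engineering challenge in the real project is everything before this arithmetic: extracting comparable products and prices from petabytes of Common Crawl pages, period after period.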
I also think this is another great example of putting together and displaying a data engineering project. One of the challenges I often reference is how hard it can be to show off your data engineering work.
But Dr. Hussain's project is documented in a way that shows off what work was done and the skills that he has, without having to dig into all of the code.
Dr. Hussain outlines the data pipeline below.
When it comes to selecting a project, the best one strikes a balance between industry interest and personal interest. Whether you like it or not, personal interest is conveyed through the topic you choose, so it is important to find a project that you like. If your interests include stocks, real estate, politics, or some other niche category, you can use the projects listed above as blueprints to apply to a topic of your own.
Thanks for reading! If you want to read more about data consulting, big data, and data science, then click below.