As a data engineer, I can’t imagine doing my job without Python. In this article, I’d like to share my thoughts on how Python makes my work easier and, in some cases, possible at all. Python is one of the most popular programming languages worldwide. It often ranks high in surveys—for instance, it claimed the first spot in the PYPL Popularity of Programming Language index and came second in the TIOBE index.
In the Stack Overflow Developer Survey, one of the most authoritative polls of the developer community, Python consistently ranks at the top: respondents in 2021 named it the most wanted and the third most loved programming language.
Python is also the go-to language for data scientists and a great alternative to specialist languages such as R for machine learning. Often branded the language of data, it’s indispensable in data engineering.
Data engineering in the cloud
Everyday challenges facing data engineers are similar to the ones facing data scientists. Processing data in its various forms is the center of attention for both specializations. In the data engineering context, however, we focus more on the industrial processes, such as data pipelines and ETL (extract-transform-load) jobs. Those need to be robust, reliable, and efficient, whether the solution is meant for an on-premises or a cloud platform.
When it comes to the cloud, Python has proven itself to be good enough to incentivize cloud platform providers to use it for implementing and controlling their services. If we look at the biggest players—namely Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure—they all accommodate Python users in their solutions to a number of problems.
First of all, the serverless computing principle enables triggering data ETL processes on demand, without the need to maintain and pay for a constantly running server. The physical processing infrastructure is shared transparently among users to optimize costs and keep the management overhead to a strict minimum.
Python is one of the few programming languages supported by the serverless computing services of all three platforms (AWS Lambda Functions, GCP Cloud Functions, and Azure Functions).
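A serverless ETL step boils down to a plain Python function that the platform invokes with an event payload. The sketch below uses the AWS Lambda handler signature; the event shape and field names are purely illustrative assumptions, and the function runs locally without any cloud resources.

```python
import json

def handler(event, context):
    """Entry point invoked by the serverless runtime (AWS Lambda signature).

    Expects an event carrying raw records; returns a cleaned summary.
    The event shape below is illustrative, not a real service contract.
    """
    records = event.get("records", [])
    # A toy "transform" step: keep valid rows and normalize the amounts.
    cleaned = [
        {"id": r["id"], "amount": round(float(r["amount"]), 2)}
        for r in records
        if r.get("amount") is not None
    ]
    return {
        "statusCode": 200,
        "body": json.dumps({"processed": len(cleaned), "items": cleaned}),
    }

# Local invocation with a fake event (no cloud resources needed):
result = handler(
    {"records": [{"id": 1, "amount": "19.991"}, {"id": 2, "amount": None}]},
    None,
)
```

Because the handler is just a function, the same transformation logic can be unit-tested locally and then deployed to Lambda, Cloud Functions, or Azure Functions with only the entry-point wiring changed.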
Parallel computing is, in turn, necessary for heavy-lifting ETL jobs on big data problems. Splitting the transformation workflows among many worker nodes is the only feasible way memory-wise (when the data can’t be kept in memory on one physical machine) and time-wise (when sequential processing would take too long) to accomplish the goal.
While Apache Spark is now the go-to technology for data parallelism, PySpark, a Python wrapper for the Spark engine, is supported by AWS Elastic MapReduce (EMR), Dataproc on GCP, and HDInsight on Azure.
As far as controlling and managing the resources in the cloud is concerned, appropriate Application Programming Interfaces (APIs) are exposed for each platform.
APIs are especially useful for performing programmatic data retrieval or job triggering. Those developed by AWS, GCP, and Azure are conveniently wrapped in Python SDKs: boto3, google-cloud-*, and azure-sdk-for-python, respectively, which makes them easy to integrate into Python applications.
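As a small illustration of programmatic data retrieval, here is a sketch that lists CSV objects in an S3 bucket with boto3. The bucket name and "raw/" prefix are hypothetical; the boto3 import is deferred inside the function so the pure filtering logic can run and be tested without AWS credentials or the SDK installed.

```python
def filter_csv_keys(keys, prefix="raw/"):
    """Pure helper: keep only CSV object keys under a given prefix."""
    return [k for k in keys if k.startswith(prefix) and k.endswith(".csv")]

def list_raw_csvs(bucket):
    """List CSV files in an S3 bucket via boto3 (requires AWS credentials).

    The bucket name is supplied by the caller; the import is deferred so
    the rest of this module works without the SDK installed.
    """
    import boto3  # AWS SDK for Python

    s3 = boto3.client("s3")
    response = s3.list_objects_v2(Bucket=bucket, Prefix="raw/")
    keys = [obj["Key"] for obj in response.get("Contents", [])]
    return filter_csv_keys(keys)

# The pure filtering logic can be exercised locally:
sample = ["raw/2021/sales.csv", "raw/2021/notes.txt", "archive/old.csv"]
```

The GCP and Azure SDKs follow the same pattern: authenticate a client object, then call service methods that return plain Python structures.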
Python is therefore widely available across all cloud computing platforms. But the language is also a useful tool for performing a data engineer’s job, which is to set up data pipelines and ETL jobs in order to retrieve data from different sources (ingestion), process/aggregate them (transformation), and finally render them available for users, typically business analysts, data scientists, and machine learning experts.
Focus on data ingestion with Python
Business data may come from various sources of different natures, including databases (both SQL and NoSQL), flat files (e.g. CSVs), other files used by companies (e.g. spreadsheets), external systems, APIs, and web documents.
The popularity of Python as a programming language results in an abundance of libraries and modules, including those used for accessing data: for example, SQLAlchemy for SQL databases, and Scrapy, Beautiful Soup, or Requests for data with web origins, among others.
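A typical ingestion step pulls rows out of a SQL source into Python structures for further processing. To keep the sketch self-contained, it uses the standard-library sqlite3 module with an in-memory database as a stand-in for a production SQL source; in real pipelines, SQLAlchemy provides the same pattern across database engines. Table and column names are invented for the example.

```python
import sqlite3

# Build a small in-memory database to stand in for a production SQL source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "acme", 120.0), (2, "acme", 80.0), (3, "globex", 50.0)],
)

# Ingestion step: pull the aggregated rows into plain Python structures.
rows = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
totals = {customer: total for customer, total in rows}
```

Swapping the connection object for a SQLAlchemy engine pointed at PostgreSQL or MySQL leaves the ingestion logic essentially unchanged.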
One particularly interesting library is Pandas. It enables reading data into “DataFrames” from a variety of different formats, including CSVs, TSVs, JSON, XML, HTML, LaTeX, SQL, Microsoft Excel and OpenDocument spreadsheets, and several other binary formats produced by exports from different business systems.
The library also supports column-oriented formats including Apache Parquet, which enables optimizing querying that data later on with tools such as AWS Athena.
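The read-transform-write cycle with Pandas takes only a few lines. The snippet below reads a small in-memory CSV (standing in for a file landed by an ingestion job), aggregates it, and shows where a Parquet export would go; the column names and data are invented for the example.

```python
import io
import pandas as pd

# A small CSV payload standing in for a file landed by an ingestion job.
raw = io.StringIO(
    "customer,amount\n"
    "acme,120.0\n"
    "acme,80.0\n"
    "globex,50.0\n"
)

df = pd.read_csv(raw)  # ingestion into a DataFrame
summary = df.groupby("customer", as_index=False)["amount"].sum()

# Writing the result in a column-oriented format is a one-liner
# (requires a Parquet engine such as pyarrow to be installed):
# summary.to_parquet("summary.parquet")
```

A Parquet file produced this way can then be queried in place with tools such as AWS Athena, without loading it into a database first.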
Pandas is built on top of other scientific and computationally optimized packages, offering a rich programming interface with a huge panel of functions necessary to process and transform data reliably and efficiently.
AWS Labs maintains the aws-data-wrangler library, described as “Pandas on AWS,” to facilitate well-known DataFrame operations on AWS. The package can be used, for instance, as a Layer for Lambda Functions, which makes deploying serverless functions much easier.
Parallel computing with PySpark
Apache Spark is an open-source engine for processing huge volumes of data that leverages the parallel computing principle in a highly efficient and fault-tolerant way. While originally implemented in Scala and natively supporting this language, it has a widely used interface in Python: PySpark.
PySpark supports most of Spark’s features, such as Spark SQL, DataFrame, Streaming, MLlib (Machine Learning), and Spark Core. This makes developing ETL jobs very approachable for those already familiar with Pandas.
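A minimal PySpark batch job might look like the sketch below, assuming a Spark installation and hypothetical input and output paths. The record-level transformation is kept as a plain Python function so it can be tested locally without Spark; the Spark imports are deferred into the job function for the same reason.

```python
def normalize_amount(raw):
    """Pure record-level transformation, usable as a Spark UDF
    or tested locally without any Spark installation."""
    return round(float(raw), 2)

def run_job(input_path, output_path):
    """Sketch of a PySpark batch job (requires pyspark to be installed).

    Paths and column names are hypothetical; imports are deferred so the
    pure logic above stays testable without Spark.
    """
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    spark = SparkSession.builder.appName("etl-sketch").getOrCreate()
    to_amount = F.udf(normalize_amount, DoubleType())

    (spark.read.option("header", True).csv(input_path)
        .withColumn("amount", to_amount(F.col("amount")))
        .groupBy("customer")
        .sum("amount")
        .write.mode("overwrite").parquet(output_path))
```

On a managed cluster (EMR, Dataproc, HDInsight), the same script is submitted as a job and Spark distributes the work across the worker nodes transparently.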
All of the cloud computing platforms I mentioned above support PySpark: Elastic MapReduce (EMR), Dataproc, and HDInsight for AWS, GCP, and Azure, respectively. In addition, you can connect a Jupyter Notebook to facilitate the development of the distributed processing Python code, for example with natively supported EMR Notebooks in AWS.
PySpark is therefore a powerful tool for transforming and aggregating huge volumes of data, making it ready for consumption by end users, such as business analysts, or by further components, for instance by involving machine learning.
Job scheduling with Apache Airflow
The existence of popular and well-regarded Python-based tools on on-premises systems motivates cloud platform providers to commercialize them in the form of “managed” services that are, as a result, easier to set up and operate.
This is true, among others, for Amazon’s Managed Workflows for Apache Airflow, which was launched in 2020 and facilitates using Airflow in some of the AWS regions (nine at the time of writing). Cloud Composer is the GCP alternative for a managed Airflow service.
Apache Airflow is an open-source workflow management platform written in Python. It allows you to programmatically author and schedule workflow processing sequences, and then monitor them via the built-in Airflow user interface.
The logic of transformations and the subservices invoked is implemented in Python, as well. A huge advantage for developers is that they can import other Python classes to extend the workflow management capabilities.
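In Airflow, a workflow is a Python module defining a DAG of tasks. The sketch below shows the shape of such a module, assuming Airflow 2.x; the DAG id, schedule, and task logic are invented for the example, and the Airflow imports are deferred into a builder function so the task callables remain plain, locally testable Python.

```python
from datetime import datetime

def extract():
    """Task callable: stand-in for pulling raw records from a source system."""
    return [{"id": 1, "amount": "19.991"}]

def transform(records):
    """Task callable: pure Python, testable without Airflow installed."""
    return [{"id": r["id"], "amount": round(float(r["amount"]), 2)}
            for r in records]

def build_dag():
    """Assemble an Airflow DAG (requires apache-airflow to be installed).

    Imports are deferred so the task logic above runs without Airflow.
    """
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    with DAG(
        dag_id="etl_sketch",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        PythonOperator(task_id="extract", python_callable=extract)
        # Real pipelines would chain further tasks and pass data via XCom.
    return dag
```

Because tasks are ordinary Python callables, the transformation logic can be unit-tested in isolation, while the scheduler handles retries, backfills, and monitoring.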
There are several alternatives to Airflow, including Prefect and Dagster. Both are Python-based data workflow orchestrators with a UI (via Dagit in Dagster’s case) used to build, run, and monitor pipelines. They aim to address some of the issues users have with Airflow, their more popular and better-known predecessor. In both of these tools, workflows are defined and managed with Python.