muriuki muriungi erick

Introduction to Python for Data Engineering


Setting up the tools.

To use Python, you need a code editor and the required libraries and modules installed on your PC. This article covers how to run Python in a Jupyter notebook inside VS Code. We will start by installing VS Code, which runs on all machines, whether Windows, Mac, or Linux. If you don't have VS Code installed on your computer, visit https://code.visualstudio.com/ to download and install it. VS Code is open source, so it is free and easy to install. After installing VS Code, open it and you will land on the welcome screen.


Click on the Extensions icon on the left side of the window and, in the search box, type "Jupyter".


Then click on the Jupyter extension and install it. Repeat the same procedure for any other extension you want to add to VS Code. After that, install Anaconda on your machine; if you haven't already installed it, visit https://anaconda.org/ for an installation guide. Then go to the Start menu on your Windows machine and type "Anaconda Prompt".


Click on Anaconda Prompt and you will be taken to the conda terminal. Then create your workspace environment. I created an environment called datascience_basics and will use Python 3.9 in this project. The command to create this environment in Anaconda is:

C:\Users\HP.DESKTOP-QMIMHR3>conda create --name datascience_basics python==3.9

After running the command, conda will download and install several packages. Once that completes, we have to activate our environment. To activate the environment, run the following command in your terminal:

C:\Users\HP.DESKTOP-QMIMHR3>conda activate datascience_basics

We can check whether Jupyter is properly installed by running:

conda list jupyter

If Jupyter was installed properly, the output will list the Jupyter packages available in the environment.

We shall then navigate to our Desktop and create a folder for our data science work:

(datascience_basics) C:\Users\HP.DESKTOP-QMIMHR3>cd Desktop
(datascience_basics) C:\Users\HP.DESKTOP-QMIMHR3\Desktop>mkdir datascience1


Now open VS Code from the terminal using:

(datascience_basics) C:\Users\HP.DESKTOP-QMIMHR3\Desktop>code .

In the VS Code menu bar, click View, then Command Palette, and run the "New Jupyter Notebook" command. You can switch a cell between Python code and Markdown using the dropdown at the top of the editor. Now your workspace is set up and ready to use, and all the Python libraries and packages installed in your environment will be available.
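As a quick sanity check, you can run a small cell to confirm the notebook is using the right interpreter. This is a minimal sketch; it assumes pandas has been installed into the datascience_basics environment (for example with conda install pandas):

import sys
import pandas as pd

# The interpreter path should point inside the datascience_basics environment
print(sys.executable)

# Confirm the Python and pandas versions available to the notebook
print(sys.version)
print(pd.__version__)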

Why Python in data engineering?

Data engineers collect data from different sources and convert it into the right format before delivering it to the right team. They prepare the data by carrying out activities such as removing duplicated data and filling in missing data, among other cleaning and pre-processing activities. The cleaned data is then forwarded to the analytics team. Below is a summary of the responsibilities of data engineers.

  1. Ingesting data from various data sources
  2. Optimizing data for analysis
  3. Removing corrupted data from datasets
  4. Developing, constructing, testing, and maintaining data structures

The growth of big data has facilitated the growth of data engineering. Big data refers to datasets so large that traditional data management systems cannot analyze them economically. Its growth has been driven by the spread of IoT, mobile applications, and smart sensors. As per 2021 IDC data, there were more than 10 billion connected devices, and this number is projected to rise to roughly 25.4 billion by 2030, which works out to millions of new devices coming online every day. Because of this, companies, organizations, and governments are investing heavily in ingesting and storing such data for economic purposes. In past years, data was mainly structured. Data from mobile apps, web pages, and IoT devices is mainly in the form of pictures, videos, or audio; such data is unstructured. We can also get data from these devices in JSON format; such data is described as semi-structured. Big data is described using the five Vs. The 5 Vs help data scientists deliver valuable insights from the data and, at the same time, help organizations of data scientists, analysts, and data engineers become customer-centric. These 5 Vs are:
    • Volume: the amount of data that exists. When the volume is large enough, the data is termed big data.
    • Variety: the diversity of data types. An organization can receive data from different sources, which often differ in type. The collected data can be structured, semi-structured, or unstructured.
    • Velocity: how fast the data is produced and moved. This aspect is very important for a company to track the movement of data and make it available at the right time.
    • Veracity: the quality and trustworthiness of the data collected. Collected data may contain missing values or wrong formats, making it messy and difficult to use.
    • Value: the usefulness of the data to an organization.

Sometimes data engineering and data science sound as if they are the same. However, the two are quite different: data engineers build and maintain the systems that collect, clean, and store data, while data scientists analyze the prepared data to extract insights.


Data pipelines.

Data is the new oil. Just as oil moves from crude to refined products, so does data: raw data gets into the hands of data engineers, who prepare and clean it before handing it to data scientists, and data scientists then manipulate and analyze it to extract insights. Companies ingest data from a variety of sources and need to store it. To achieve this, data engineers develop and construct data pipelines, which automate the flow of data from one location to another. Depending on the nature of the data source, the data can be processed either as streams or in batches.
Before doing anything with the system's data, engineers ensure that it flows efficiently through the system. The input can be anything from images, videos, streams of JSON and XML data, and timely batches of data, to data from deployed sensors. Data engineers design systems that take this data as input, transform it, and store it in the right format for use by data scientists, data analysts, and machine learning engineers, among other data personnel. These systems are often referred to as extract, transform, and load (ETL) pipelines; a minimal sketch of one appears below.
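The following is a minimal, illustrative ETL sketch in pandas; the raw_events.csv source file, its columns, and the SQLite database standing in for a warehouse are all hypothetical:

import pandas as pd
import sqlite3

# Extract: read raw data from a CSV source (hypothetical raw_events.csv)
raw = pd.read_csv("raw_events.csv")

# Transform: drop rows missing a user_id and standardize a column name
clean = raw.dropna(subset=["user_id"]).rename(columns={"ts": "timestamp"})

# Load: write the transformed data into a SQLite table
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("events", conn, if_exists="replace", index=False)

In a production pipeline the same three steps would typically target a real warehouse and run on a schedule, but the extract-transform-load shape stays the same.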
As data flows through the system, it needs to conform to certain architectural standards. To make the data more accessible to users, data normalization is done; typical normalization activities include removing duplicated data, fixing missing and conflicting data, and converting the data to the right format, as in the sketch below. Unstructured data is stored in data lakes, while data warehouses are used to store relational database information.
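A short sketch of these normalization steps in pandas; the DataFrame contents and column names are made up for illustration:

import pandas as pd

# Toy data with a duplicated row, a missing value, and dates stored as text
df = pd.DataFrame({
    "customer": ["Ann", "Ann", "Ben", None],
    "joined": ["2021-01-05", "2021-01-05", "2021-03-10", "2021-04-22"],
})

df = df.drop_duplicates()                          # remove duplicated rows
df["customer"] = df["customer"].fillna("unknown")  # fix missing values
df["joined"] = pd.to_datetime(df["joined"])        # convert to the right format
print(df)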
Data lakes and warehouses.

A data lake stores data from both internal and external sources. Data lakes and data warehouses are different: broadly, a lake holds raw data in its native format, with its purpose decided later, while a warehouse holds processed, structured data stored for a specific analytical use.


A data catalog for a data lake keeps records of:
• The sources of the data
• The location where the data is stored
• The owner of the data
• How often the data is updated
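As an illustration, a single catalog entry could be represented as a simple record; the field names and values below are hypothetical:

# A hypothetical data catalog entry for one dataset in a data lake
catalog_entry = {
    "source": "mobile_app_events",            # where the data comes from
    "location": "s3://company-lake/events/",  # where it is stored
    "owner": "analytics-team@example.com",    # who is responsible for it
    "update_frequency": "hourly",             # how often it is refreshed
}
print(catalog_entry["location"])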
Python libraries for data engineering.
Python is mainly used in data engineering because of its wealth of libraries and modules. Some of these data engineering libraries include:
Pandas: data engineers use the pandas library to read, query, write, and manipulate data. Pandas can read both JSON and CSV file formats, and it can also be used to fix issues such as missing values in datasets. Data engineers use pandas to convert data into a readable format; a small reading sketch follows.
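A minimal sketch of reading and inspecting data with pandas, assuming hypothetical files sales.csv and events.json exist in the working directory:

import pandas as pd

# Read a CSV file and a JSON file into DataFrames
sales = pd.read_csv("sales.csv")
events = pd.read_json("events.json")

# Inspect the first rows and the column types
print(sales.head())
print(sales.dtypes)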
Psycopg2/pyodbc/SQLAlchemy: data engineers often store structured data in a relational database such as PostgreSQL, and these libraries are used to connect to and query such databases; a SQLAlchemy sketch follows.
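A minimal sketch of querying PostgreSQL with SQLAlchemy and pandas; the connection string (user, password, host, database) and the events table are hypothetical:

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical PostgreSQL connection string; adjust user, password, host, and database
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/analytics")

# Load the result of a query into a DataFrame for further processing
df = pd.read_sql("SELECT * FROM events LIMIT 10", engine)
print(df)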
Elasticsearch: data engineers use the elasticsearch client library to work with Elasticsearch, a NoSQL document store and search engine of the same name.
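A minimal sketch using the official elasticsearch Python client, assuming an Elasticsearch server is running locally on the default port; the logs index and its documents are hypothetical:

from elasticsearch import Elasticsearch

# Connect to a locally running Elasticsearch instance
es = Elasticsearch("http://localhost:9200")

# Index a document, refresh the index, then search for matching documents
es.index(index="logs", document={"level": "error", "message": "disk full"})
es.indices.refresh(index="logs")
result = es.search(index="logs", query={"match": {"level": "error"}})
print(result["hits"]["total"])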
SciPy: this library offers fast mathematical routines. Data engineers use it to perform scientific calculations on problems related to their data.
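A minimal sketch of summarizing a numeric column with scipy.stats; the readings are made-up sample values:

from scipy import stats

# Hypothetical measurements to summarize
readings = [12.1, 11.8, 12.4, 13.0, 12.2, 11.9]

# Descriptive statistics: count, min/max, mean, variance, skewness, kurtosis
summary = stats.describe(readings)
print(summary.mean, summary.variance)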
Beautiful Soup: this library is used for data mining and web scraping. Data engineers use Beautiful Soup to extract data from specific websites; it can parse both HTML and XML documents.
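A minimal scraping sketch with requests and Beautiful Soup, using https://example.com as a stand-in URL:

import requests
from bs4 import BeautifulSoup

# Fetch a page and parse its HTML
response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, "html.parser")

# Extract the page title and all link targets
print(soup.title.get_text())
for link in soup.find_all("a"):
    print(link.get("href"))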
petl: this library is used to extract and transform tabular data. Data engineers use it when building extract, transform, and load (ETL) pipelines.
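A minimal petl sketch, assuming a hypothetical input.csv with an amount column:

import petl as etl

# Extract rows from a CSV file (hypothetical input.csv)
table = etl.fromcsv("input.csv")

# Transform: parse the amount column as a float and keep only large rows
table = etl.convert(table, "amount", float)
table = etl.select(table, lambda row: row.amount > 100)

# Load the result into a new CSV file
etl.tocsv(table, "output.csv")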
pygrametl: this library is used when developing and deploying ETL pipelines that load a data warehouse.
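A minimal pygrametl sketch that fills a dimension table and a fact table; the SQLite database (standing in for a real warehouse) and the product and sales tables are hypothetical and must already exist:

import sqlite3
import pygrametl
from pygrametl.tables import Dimension, FactTable

# Hypothetical warehouse connection (a real project might use psycopg2 instead)
conn = sqlite3.connect("warehouse.db")
cw = pygrametl.ConnectionWrapper(connection=conn)

# Describe the existing dimension and fact tables
productdim = Dimension(name="product", key="productid", attributes=["name", "category"])
facttbl = FactTable(name="sales", keyrefs=["productid"], measures=["amount"])

# ensure() inserts the dimension row if it is missing and returns its key
row = {"name": "Book", "category": "media", "amount": 120.0}
row["productid"] = productdim.ensure(row)
facttbl.insert(row)
cw.commit()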
From what we have covered, it is clear that Python is among the best languages to use in data engineering because of its simplicity and its wealth of data engineering libraries. Python is also open source, so everyone is free to use and improve the existing resources for their own purposes.
