In this lab we are going to create an EC2 instance running Ubuntu, and on it we are going to install Apache Airflow to programmatically author, schedule, and monitor workflows, using Postgres as the backend database. Why Postgres? Because, unlike the default SQLite backend, it allows Airflow to run multiple tasks at the same time.
Architecture Diagram
Creating an EC2 Instance
In the AWS Management Console search bar, enter EC2, and click the EC2 result under Services. You will be placed in the Amazon EC2 console.
To start creating a new EC2 instance, in the left-hand pane, click the Instances option. The EC2 Instances list will load.
Click Launch instances. The Launch an instance form will load.
-
Under Name and tags section, enter the following:
- Name: airflow-instance
Under the Application and OS Images (Amazon Machine Image) section, select the Ubuntu option.
-
Under the Instance type section, select the t2.small instance type from the drop-down list. We need it because we are going to use PostgreSQL as the Airflow database, and it needs a minimum of 2 GB of RAM to work properly.
- Warning: This instance type is not eligible for the AWS Free Tier; however, it is still relatively cheap.
For a production environment we would need to create a Key pair (login) to access our instance from a terminal; however, for this demo we will skip it, since we will access the instance through EC2 Instance Connect.
-
Under the Network settings section, select the following:
- Create security group: Checked (in this lab we are going to create a new security group; however, if you already have one, feel free to use it)
- Allow SSH traffic from: Checked (we do this for demo purposes; for production environments it is recommended to restrict the rule to specific source IPs)
- Note: As shown in the image above, this will also create a new security group called "launch-wizard-1". Keep it in mind, because later we are going to modify this security group to open port 8080 for inbound requests.
Finally, under the Summary section located on the top-right, click the Launch instance button.
The Create key pair dialog will appear to confirm or cancel the creation of a key pair. For demo purposes, select Proceed without key pair and then click the Proceed without key pair button.
Return to the list of instances by clicking the View all instances button that appears in the bottom-right corner.
The next step is to configure the "launch-wizard-1" security group created in this phase. To do that, go to the left navigation pane and click Security Groups under the Network & Security section. The Security Groups list will load.
Select the launch-wizard-1 security group, click the Actions button, and then click Edit inbound rules.
-
For demo purposes, enter the following values and click the Save rules button:
- Type: Custom TCP
- Protocol: TCP
- Port range: 8080
- Source: Anywhere - 0.0.0.0/0
Installing Airflow
Return to the list of instances by clicking the Instances option located in the left-hand navigation pane.
When the Instance state column shows Running, right-click our EC2 instance called airflow-instance and then click Connect. The Connect to instance form will load.
Select the EC2 Instance Connect tab. Under it, in the Connection type section, select Connect using EC2 Instance Connect.
Under the Public IP address section, copy the public IP and save it somewhere safe (we will use it later to connect to our Airflow instance).
-
Inside the new SSH browser tab, run the following to update the OS package index:
sudo apt update
-
Install pip for Python 3. The command below also installs the dependencies required for building Python modules.
sudo apt install python3-pip
We may be asked to confirm the installation of certain dependencies. If prompted, type Y and press Enter to continue.
After pip is installed, we will be asked to select the services we want to restart. Select all of them (you can toggle them one by one with the space bar) and press Enter.
-
Install SQLite 3. Airflow will use this package as its initial database until we switch to Postgres.
sudo apt install sqlite3
-
A good practice when installing project-specific packages is to work with virtual environments, which gives us better control over the packages we install:
sudo apt install python3.10-venv
We may be asked to confirm the installation of certain dependencies. If prompted, type Y and press Enter to continue.
-
Create a new virtual environment called venv:
python3 -m venv venv
-
Activate the virtual environment created in the previous step:
source venv/bin/activate
-
Because we are going to install Airflow with Postgres as its database, we need to install some additional system libraries first:
sudo apt-get install libpq-dev
We may be asked to confirm the installation of certain dependencies. If prompted, type Y and press Enter to continue.
After libpq-dev is installed, we will be asked to select the services we want to restart. Select all of them (you can toggle them one by one with the space bar) and press Enter.
-
Install the Airflow package with the postgres extra and its pinned dependencies. The version selected for this demo is 2.5.0:
pip install "apache-airflow[postgres]==2.5.0" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.5.0/constraints-3.7.txt"
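The constraints file pins every transitive dependency to versions tested against that Airflow release, and its URL follows a fixed pattern built from the Airflow version and the Python version. As a sketch (mirroring the pattern Airflow's installation docs use, with the version values taken from the command above), you can build the URL from variables so the two stay in sync:

```shell
# Build the constraints URL from the Airflow and Python versions
AIRFLOW_VERSION=2.5.0
PYTHON_VERSION=3.7   # or detect it: "$(python3 -V | cut -d' ' -f2 | cut -d. -f1,2)"
CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
echo "$CONSTRAINT_URL"

# Then the install command becomes:
# pip install "apache-airflow[postgres]==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"
```

One caveat: Ubuntu 22.04's default interpreter is Python 3.10, so if that is what your venv uses, constraints-3.10.txt would be the matching file; treat the 3.7 pin as carried over from the original command.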
-
Initialize the Airflow database backend (at this point it will still use the default SQLite database):
airflow db init
-
Install Postgres:
sudo apt-get install postgresql postgresql-contrib
We may be asked to confirm the installation of certain dependencies. If prompted, type Y and press Enter to continue.
-
Switch to the postgres system user created by the installation in the previous step:
sudo -i -u postgres
-
Start the psql client:
psql
-
Create the Airflow database, user, and permissions:
CREATE DATABASE airflow;
CREATE USER airflow WITH PASSWORD 'airflow';
GRANT ALL PRIVILEGES ON DATABASE airflow TO airflow;
-
Exit the psql client and return to the postgres user shell:
\q
-
Exit the postgres user session:
exit
-
The next step is to change the sql_alchemy_conn and executor values inside the airflow.cfg file. Open airflow.cfg with vim, nano, or any other preferred editor and replace both values:
cd ~/airflow/
vim airflow.cfg

# Locate the "sql_alchemy_conn =" line and replace it with:
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost/airflow

# Locate the "executor =" line and replace it with:
executor = LocalExecutor

# Save the changes and close the airflow.cfg file
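If you prefer not to edit the file by hand, the same two changes can be applied non-interactively with sed. This is a minimal sketch that demonstrates the substitutions on a throwaway copy of the two relevant lines (the sample values mimic Airflow's defaults); on the instance you would point cfg at ~/airflow/airflow.cfg instead:

```shell
# Demo file standing in for ~/airflow/airflow.cfg
cfg=$(mktemp)
printf 'sql_alchemy_conn = sqlite:////home/ubuntu/airflow/airflow.db\nexecutor = SequentialExecutor\n' > "$cfg"

# Replace whatever values are currently set on those two lines
sed -i 's|^sql_alchemy_conn = .*|sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost/airflow|' "$cfg"
sed -i 's|^executor = .*|executor = LocalExecutor|' "$cfg"

# Show the updated values
grep -E '^(sql_alchemy_conn|executor) =' "$cfg"
```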
-
Initialize the Airflow database backend again so it picks up the new configuration:
airflow db init
-
Create a new admin user called airflow:
airflow users create -u airflow -f airflow -l airflow -r Admin -e airflow@gmail.com
Where:
- -u β Username
- -f β First name
- -l β Last name
- -r β Role
- -e β Email
Note: We will be asked to enter a password. Type airflow and press Enter. At this point we have our username (airflow) and password (airflow) configured.
-
Start the web server in the background:
airflow webserver &
Note: After that, press Enter so you can type the command detailed in the next step. Don't worry, the webserver service will continue to run.
-
Start the scheduler in the foreground:
airflow scheduler
-
Do you remember the EC2 public IP we saved in step 4 of this section? We are going to use it right now. Open a new browser tab (in our case we chose Chrome) and go to a URL with the following format:
http://<my_ec2_public_ip>:8080
-
Enter the Airflow credentials created in step 25 and click the Sign In button.
- Username: airflow
- Password: airflow
-
Feel free to explore and run some of the example DAGs to make sure Airflow is working properly. After that, to stop the Airflow services, go back to the EC2 Instance Connect session and run the following commands:
# Press "ctrl + c" to stop the scheduler, then run the following:
kill $(ps -ef | grep "airflow scheduler" | grep -v grep | awk '{print $2}')
kill $(ps -ef | grep "airflow webserver" | grep -v grep | awk '{print $2}')
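As an aside, pkill -f can replace the whole ps/grep/awk pipeline: it matches against the full command line (so pkill -f "airflow webserver" would work here) and never matches its own process. A self-contained sketch, using a throwaway sleep process as a stand-in for the Airflow services:

```shell
# Stand-in for a long-running Airflow service
sleep 300 &
target=$!

# pkill -f matches the full command line, like grep over ps -ef output
pkill -f "sleep 300"

# wait returns once the process has actually exited
wait "$target" || true
echo "terminated"
```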
-
Finally, don't forget to delete the resources created as part of this lab to avoid unexpected charges on our bill. Basically, we need to delete the following resources:
- EC2 instance: airflow-instance
- Security Group: launch-wizard-1
Conclusion
AWS EC2 remains a cornerstone of cloud computing, and its flexibility makes it easy to transform how teams build and operate. In this article we explored that flexibility in practice: with a single instance we integrated a third-party tool like Airflow, standing up a complete workflow orchestration stack from the OS packages to the Postgres backend and the Airflow web server. With that foundation in place, organizations can take on data orchestration, workflow management, and beyond.
Attention, cloud and data enthusiasts! If you would like to join a community of like-minded people who are passionate about cloud and data topics, follow me on my social networks for more content like this.
Follow me:
- Medium: jandro898.23
- Github: jandroro
- Youtube: The Cloud Lover
- LinkedIn: jano-camacho-vicente