I spent quite some time figuring out how to install Python packages in AWS Glue inside a VPC without internet access, and I managed to get it working after some tinkering. As a recap, AWS introduced support for installing Python packages via the --additional-python-modules option. While this is a lifesaver for those of us who started working with Glue 1.0, it only works if your Glue Job can connect to the internet.
Given the emphasis on security, a number of customers choose to restrict egress traffic from their VPC to the public internet and need a way to manage the packages used by their data pipelines.
This article focuses on that challenge. It is a step-by-step guide on how to set up your Glue Job to connect to a PyPI mirror via AWS CodeArtifact, allowing you to install packages in a Private Subnet. For this tutorial, a working knowledge of AWS basics (e.g. networking, services) is recommended, but I'll try my best to explain each part.
Let's get started!
The core of the solution is AWS CodeArtifact, a service for securely storing, publishing, and sharing packages (in this case, PyPI packages) across your private network without connecting directly to the public PyPI repository. This is made possible by VPC Endpoints through PrivateLink connections.
You do need to create endpoints for S3 and CodeArtifact for this to work; otherwise, you'll get errors such as Connection timed out.
Here are some resources to help you out with that:
Create VPC endpoints for CodeArtifact - if you use the console, follow the same steps as for the S3 Endpoint.
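As a rough sketch of the endpoints described above, the snippet below builds the keyword arguments you could hand to boto3's `ec2.create_vpc_endpoint`. It assumes two interface endpoints for CodeArtifact (`codeartifact.api` and `codeartifact.repositories`) and a gateway endpoint for S3. The function name and all IDs are placeholders of mine, not values from this setup.

```python
def endpoint_requests(region, vpc_id, subnet_ids, security_group_ids, route_table_ids):
    """Build one kwargs dict per VPC endpoint for ec2.create_vpc_endpoint."""
    requests = []
    # CodeArtifact uses two interface endpoints: one for API calls
    # (such as fetching a token) and one for package downloads.
    for service in ("codeartifact.api", "codeartifact.repositories"):
        requests.append({
            "VpcEndpointType": "Interface",
            "VpcId": vpc_id,
            "ServiceName": f"com.amazonaws.{region}.{service}",
            "SubnetIds": list(subnet_ids),
            "SecurityGroupIds": list(security_group_ids),
            # Assumption: private DNS is enabled so the usual hostnames
            # resolve to the endpoint from inside the VPC.
            "PrivateDnsEnabled": True,
        })
    # S3 is reached through a gateway endpoint attached to route tables.
    requests.append({
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.s3",
        "RouteTableIds": list(route_table_ids),
    })
    return requests
```

Each dict could then be passed along as `ec2.create_vpc_endpoint(**request)` using a boto3 EC2 client.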
An AWS account, of course
Note: Test this on your dev environment first
I won't go over these tools one by one, as I believe ChatGPT can give you those definitions and explain their use better than I can.
In this section, I'll go over the step-by-step solution for each process.
Let's start by setting up our CodeArtifact Repository.
Public upstream repositories - I chose PyPI
Specify your domain name
You should have the following repositories after creation:
Now that's done, you can inspect the created repositories. The pypi-store repository was created automatically. <your-repo> is the one that we're interested in, since it will contain our Python packages.
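If you prefer scripting over the console, the setup above roughly corresponds to the CodeArtifact API calls sketched below. This is a minimal sketch that assumes none of the resources exist yet. The function name is mine, and `client` is expected to be a boto3 `codeartifact` client.

```python
def create_pypi_mirror(client, domain, repository, store="pypi-store"):
    """Create a domain, a store repository tied to public PyPI, and a repository on top of it."""
    client.create_domain(domain=domain)
    # The store repository holds packages pulled from public PyPI.
    client.create_repository(domain=domain, repository=store)
    client.associate_external_connection(
        domain=domain, repository=store, externalConnection="public:pypi"
    )
    # The repository you work with; it falls back to the store as its upstream.
    client.create_repository(
        domain=domain,
        repository=repository,
        upstreams=[{"repositoryName": store}],
    )
```

A call such as `create_pypi_mirror(boto3.client("codeartifact"), "my-domain", "my-repo")` would use hypothetical names; substitute your own.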
With that, let's proceed with configuring your local environment.
$ docker pull amazonlinux:latest
Run the container and open a shell inside it using the command below (replace image_name with the image you pulled, amazonlinux:latest):
$ docker run -it --rm -v /path/on/host:/path/in/container image_name /bin/bash
-v /path/on/host:/path/in/container: This is the volume mount option. It mounts a directory from your host (/path/on/host) into the container (/path/in/container). Any changes made in the mounted directory inside the container are reflected in the host directory and vice versa.
--rm: This tells Docker to automatically remove the container when it exits. Once you finish the bash session and exit, the container is cleaned up and no container filesystem is left on your host system. Feel free to remove this option if you want to keep the container.
$ wget https://www.python.org/ftp/python/3.10.0/Python-3.10.0.tgz
$ tar -xf Python-3.10.0.tgz
$ cd Python-3.10.0
$ ./configure --enable-optimizations
$ sudo make altinstall
AWS Glue 4.0 runs Python 3.10. For other Glue versions, refer to the documentation.
$ pip install awscli
Refer to this for creating your access keys:
After getting the values for the access keys, configure your AWS CLI:
$ aws configure
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name [None]: us-west-2
Default output format [None]: json
Go back to the AWS Console and click on your created repository.
View connection instructions
Copy and run the command in
Step 3 of the
$ aws codeartifact login \
--tool pip \
--repository <your-repo-name> \
--domain <your-domain-name> \
--domain-owner <your-account-id>
Once you are logged in, pip is configured to use this repository as its index, so any pip install command pulls packages through CodeArtifact instead of going straight to the public PyPI repository. Packages requested this way are also stored in the repository.
Install your packages!
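If you have a pinned list of packages, a small script can request each one so that CodeArtifact keeps a copy. The sketch below assumes pip has already been pointed at the repository by the login command above. The function names, the example directory, and the use of `pip download` rather than `pip install` are my own choices.

```python
import subprocess
import sys


def pip_download_command(spec, dest):
    """Command that fetches one pinned package (and its dependencies) via pip's configured index."""
    return [sys.executable, "-m", "pip", "download", "--dest", dest, spec]


def prime_repository(specs, dest="/tmp/primed"):
    """Request each package once; raises CalledProcessError if pip fails."""
    for spec in specs:
        subprocess.run(pip_download_command(spec, dest), check=True)
```

For example, `prime_repository(["<your-python-package>==<version>"])` would pull that one package through the mirror.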
Now that the repository is ready, we can install packages from AWS Glue using the PyPI mirror that we created!
This section discusses how you can point the installation of Python packages in AWS Glue to AWS CodeArtifact.
We need to generate an authorization token from AWS CodeArtifact. This is done using this command:
$ aws codeartifact get-authorization-token \
--domain my_domain \
--domain-owner 111122223333 \
--query authorizationToken \
--output text
Note that the maximum lifetime of this token is 12 hours. And yes, you do need to generate a new one every day if you plan to run your jobs daily.
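Since the token has to be regenerated regularly, it may help to fetch it from code. Here is a minimal sketch around the `get_authorization_token` call. It assumes `client` is a boto3 `codeartifact` client; the function name and default are mine.

```python
def fetch_token(client, domain, domain_owner, duration_seconds=43200):
    """Return (token, expiration); 43200 seconds is the 12-hour maximum."""
    response = client.get_authorization_token(
        domain=domain,
        domainOwner=domain_owner,
        durationSeconds=duration_seconds,
    )
    return response["authorizationToken"], response["expiration"]
```

With the values from the command above, this would be called as `fetch_token(boto3.client("codeartifact"), "my_domain", "111122223333")`.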
Store this into a
Navigate to your Glue Job
I'm assuming you have already configured the Data Connections. If not, configure them before proceeding to this step. The idea is that the Glue Job will run inside the Private Subnet of the VPC.
See screenshot below
Under Job Parameters, add the following:
Key - "--additional-python-modules" // without double quotes
Value - "<your-python-package>==<version>"
Key - "--python-modules-installer-option"
Value - "--no-cache-dir --verbose --index-url https://aws:<CODEARTIFACT-AUTH-TOKEN>@<DOMAIN-NAME>-<ACCOUNT-ID>.d.codeartifact.<REGION-NAME>.amazonaws.com/pypi/pypi-store/simple/"
Change the following values:
CODEARTIFACT-AUTH-TOKEN- refer to Step 1
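To avoid hand-editing the long index URL, the two job parameters above can be assembled in code. The sketch below follows the URL pattern shown in the Value field. The function name is hypothetical, and the repository defaults to pypi-store only because that is what the example URL uses.

```python
def glue_module_arguments(modules, token, domain, account_id, region, repository="pypi-store"):
    """Build the two Glue job parameters that point pip at CodeArtifact."""
    index_url = (
        f"https://aws:{token}@{domain}-{account_id}"
        f".d.codeartifact.{region}.amazonaws.com/pypi/{repository}/simple/"
    )
    return {
        # Glue expects a comma-separated list of pinned modules.
        "--additional-python-modules": ",".join(modules),
        "--python-modules-installer-option": (
            f"--no-cache-dir --verbose --index-url {index_url}"
        ),
    }
```

One possible use, which I have not covered in the steps above, is to pass this dict as `Arguments` when starting the job with `glue.start_job_run`, so that each run gets a fresh token.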
After configuring all of that, run your Glue Job and check the CloudWatch Logs to confirm if it's being installed correctly. You should see some text there that says:
Looking in indexes: https://aws:****@test-mirror-1234561234.d.codeartifact.ap-southeast-1.amazonaws.com/pypi/pypi-store/simple/
Make sure that the IAM_ROLE you are using for the Glue Job has access to CloudWatch Logs; engineers often forget this. Also tick Enable logs in CloudWatch on the Glue Job.
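For reference, the CloudWatch Logs access mentioned above could look like the policy document sketched here. It is an assumption modeled on the logging statement in the AWS managed Glue service role policy, not the exact policy used for this job, so narrow the resource to fit your account.

```python
import json


def glue_logging_policy():
    """IAM policy document that lets a Glue job write to its CloudWatch log groups."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "logs:CreateLogGroup",
                    "logs:CreateLogStream",
                    "logs:PutLogEvents",
                ],
                # Assumption: Glue's default log groups live under /aws-glue/.
                "Resource": "arn:aws:logs:*:*:*:/aws-glue/*",
            }
        ],
    }


def glue_logging_policy_json():
    """Same policy serialized for pasting into the IAM console."""
    return json.dumps(glue_logging_policy(), indent=2)
```

Printing `glue_logging_policy_json()` gives JSON that can be attached to the role as an inline policy.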
That's it! In this article, we demonstrated how to leverage CodeArtifact for managing Python packages and modules for AWS Glue jobs that run inside a Private Subnet that has no internet access.
Do let me know if you have any questions on this, happy to answer any queries you might have.
Happy Coding, builders!
This blog is authored solely by me and reflects my personal opinions, not those of my employer. All references to products, including names, logos, and trademarks, belong to their respective owners and are used for identification purposes only.