Introduction
The previous blog explored different use cases for a graph database, such as customer journey mapping, social networking and fraud detection.
A graph database is useful for reporting or describing relationships within your dataset.
My solution overview is shown below:
In this second blog, we will load data into Amazon Neptune.
Recap - Create a graph database
A graph database was created using AWS CloudFormation. The Neptune DB cluster is shown below:
Learning Outcomes
- Create an IAM role to allow access to data files in an S3 bucket
- Create an Amazon S3 bucket
- Load your data
- Query data using Amazon Athena
- Load data into Amazon Neptune
- Query data
Prerequisites to load data
The instructions to allow the Amazon Neptune DB cluster to access an Amazon S3 bucket via an IAM role are outlined here.
- Create an IAM role to allow Amazon Neptune to access Amazon S3 resources
- Add IAM Role to an Amazon Neptune Cluster
- Create the Amazon S3 VPC Endpoint
Tutorial 1: Create an IAM role to allow access to data files in an S3 bucket
We will follow the instructions provided here to create permissions for Amazon Neptune to access the S3 bucket.
Step 1: Login to the AWS Management Console with your IAM Admin User account.
Step 2: Navigate to the IAM console by typing 'IAM' in the search bar.
Click Roles on the left-hand menu.
Step 3: Click Create role.
Step 4: Type S3 under 'Use cases for other AWS Services' and click Next.
Step 5: Under the filter, type S3ReadOnlyAccess, select the AmazonS3ReadOnlyAccess policy and click Next.
Step 6: Enter a role name and a description.
Click Create role.
Step 7: After the role has been created, type NeptuneLoadFromS3 in the search bar and click the hyperlink.
Step 8: On the 'Trust relationships' tab, click Edit trust relationship.
Copy and paste this code:
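The policy below is the standard trust policy from the Amazon Neptune documentation: it lets Neptune (via the rds.amazonaws.com service principal) assume the role. The document inside the `trust_policy` dictionary is the JSON to paste into the console editor; the boto3 call is an optional scripted alternative, and it assumes the NeptuneLoadFromS3 role name used in this walkthrough.

```python
import json
import boto3

iam = boto3.client("iam")

# Trust policy allowing Amazon Neptune (service principal rds.amazonaws.com)
# to assume the role so the bulk loader can read from Amazon S3.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "rds.amazonaws.com"},
            "Action": "sts:AssumeRole"
        }
    ]
}

# Scripted equivalent of editing the trust relationship in the console.
iam.update_assume_role_policy(
    RoleName="NeptuneLoadFromS3",
    PolicyDocument=json.dumps(trust_policy),
)
```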
Click Update policy
Tutorial 2: Add IAM Role to an Amazon Neptune Cluster
Step 1: Add this IAM role to the Neptune cluster.
Navigate to the Neptune console by typing Neptune in the search bar and click Databases in the left-hand menu.
Step 2: Select the cluster and under the Actions drop-down menu, choose Manage IAM roles
Step 3: Select the new IAM role that was created in Tutorial 1: Step 6 and click Done.
Step 4: The cluster is now available.
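If you prefer scripting the same change, a minimal boto3 sketch might look like this; the cluster identifier and role ARN are placeholders to adjust for your own account:

```python
import boto3

neptune = boto3.client("neptune", region_name="ap-southeast-2")

# Attach the IAM role created in Tutorial 1 to the Neptune DB cluster so the
# bulk loader can read from Amazon S3.
neptune.add_role_to_db_cluster(
    DBClusterIdentifier="my-neptune-cluster",                     # placeholder cluster ID
    RoleArn="arn:aws:iam::123456789012:role/NeptuneLoadFromS3",   # placeholder ARN
)
```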
Tutorial 3: Create the Amazon S3 VPC Endpoint
You may follow the instructions here from the Amazon Neptune User Guide to create an Amazon S3 VPC endpoint.
Step 1: In the search bar, type 'VPC'
Step 2: In the navigation pane, click Endpoints.
Step 3: Click Create endpoint
Step 4: In this section, enter 's3' in the search field and select 'com.amazonaws.ap-southeast-2.s3'.
Step 5: Choose the VPC in which to create your endpoint:
Step 6: Check the box of the route table associated with the subnet
Step 7: Click Create endpoint.
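Equivalently, the gateway endpoint can be created programmatically; here is a minimal boto3 sketch with placeholder VPC and route table IDs:

```python
import boto3

ec2 = boto3.client("ec2", region_name="ap-southeast-2")

# Create a gateway endpoint so Neptune can reach Amazon S3 from inside the VPC.
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",              # placeholder VPC ID
    ServiceName="com.amazonaws.ap-southeast-2.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],    # placeholder route table ID
)
```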
Gremlin load data format
Data loaded into Amazon Neptune as a property graph and queried using Gremlin must be in CSV format.
The CSV file must be saved with UTF-8 encoding.
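As a rough sketch of that format, vertex files use the reserved ~id and ~label headers, while edge files also need ~from and ~to. The property columns and sample rows below are placeholders only, loosely based on the dataset described next:

```python
import csv

# Minimal, illustrative files in the Gremlin load data format.
with open("vertices.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["~id", "~label", "name", "city", "country"])
    writer.writerow(["cb1", "person", "Jane Doe", "Sydney", "Australia"])   # example row
    writer.writerow(["sydney", "city", "Sydney", "", ""])                   # example row

with open("edges.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["~id", "~from", "~to", "~label"])
    writer.writerow(["e1", "cb1", "sydney", "lives_in"])                    # example edge
```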
Dataset
I created this dataset from the AWS Community Builders Directory, which contains records for 457 Community Builders in the Asia-Pacific region, with details such as:
- Name
- City
- Country
- Category
- Year
- Location
This is a preview of a subset of data:
In this dataset, Name represents the nodes, and the other variables, which describe relationships, are the edges.
Tutorial 4: ETL Process and Data Prep for Loading data into Amazon Neptune
The following tutorial will help you transform the data into the Gremlin load data format.
Step 1: Create an AWS account and sign in with IAM Admin user. You may follow instructions here to create an account if you do not have one.
Step 2: Create an S3 bucket and name it aws-community-builders-apj in lowercase letters, and select a region that is geographically close to you, for example Asia Pacific (Sydney) ap-southeast-2.
You may follow the instructions here to create your first S3 bucket.
The S3 bucket was successfully created as shown below:
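For completeness, the same bucket can be created with boto3 (bucket names are globally unique, so you may need to vary the name):

```python
import boto3

s3 = boto3.client("s3", region_name="ap-southeast-2")

# Buckets outside us-east-1 need an explicit LocationConstraint.
s3.create_bucket(
    Bucket="aws-community-builders-apj",
    CreateBucketConfiguration={"LocationConstraint": "ap-southeast-2"},
)
```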
Step 3: Create two folders and name them raw and processed
Step 4: Add the 'apj_cb_list.csv' file into the raw folder and click upload.
The CSV file, saved in UTF-8 format, has been successfully uploaded to the 'raw' folder of the S3 bucket and is stored in the Standard storage class.
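Since S3 'folders' are just key prefixes, these steps can also be scripted; a minimal sketch, assuming the CSV sits in your working directory:

```python
import boto3

s3 = boto3.client("s3", region_name="ap-southeast-2")
bucket = "aws-community-builders-apj"

# Create the two "folders" (zero-byte objects with a trailing slash).
for prefix in ("raw/", "processed/"):
    s3.put_object(Bucket=bucket, Key=prefix)

# Upload the UTF-8 CSV into the raw folder using the Standard storage class.
s3.upload_file(
    "apj_cb_list.csv",
    bucket,
    "raw/apj_cb_list.csv",
    ExtraArgs={"StorageClass": "STANDARD"},
)
```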
Step 5: Navigate to AWS Glue console to create a crawler and obtain the schema within the Glue data catalog.
Click Crawlers in the left-hand menu.
You may follow all the steps from this ETL tutorial here to understand how to create a crawler and access a table from the AWS Glue data catalog.
Step 6: A crawler named 'aws-community-builder-apj-crawler' was created as shown below.
After the crawler job has completed, its status changes to 'Ready' and a table is created.
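For reference, an equivalent crawler can be created and started with boto3; the Glue service role name here is a placeholder:

```python
import boto3

glue = boto3.client("glue", region_name="ap-southeast-2")

# Crawl the raw folder and register the resulting table in the
# 'community' database of the Glue Data Catalog.
glue.create_crawler(
    Name="aws-community-builder-apj-crawler",
    Role="AWSGlueServiceRole-communitybuilders",   # placeholder Glue service role
    DatabaseName="community",
    Targets={"S3Targets": [{"Path": "s3://aws-community-builders-apj/raw/"}]},
)
glue.start_crawler(Name="aws-community-builder-apj-crawler")
```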
Step 7: Browse the data catalog by clicking Tables in the left-hand menu, under the 'Community' database that was previously created in the ETL job.
Step 8: Click into 'aws-community-builder-apj-crawler' and click Edit.
The CSV file saved in UTF-8 format was selected as the data store.
Step 9: Re-run the crawler to determine the schema for the dataset.
The crawler job has completed, as reflected by its 'Ready' status.
Step 10: Select Tables on the left-hand side and tick the box for the second table to inspect its schema.
Step 11: In the top-right corner, click Edit schema so that you can rename the column headers to match the original dataset.
Rename the columns and then click Save.
Step 12: Type 'Athena' in the search bar and navigate to the Amazon Athena console.
Step 13: You will be able to view the 'Community' database that was created from AWS Glue, along with two tables.
Step 14: Click Create and select the option to create a table from S3 bucket data.
Provide the table name APJ_CB.
Under Column details, enter the schema from the 'apj_cb_list_utf.csv' table.
Step 15: After the 'apj_cb' table is created, click on the three dots to preview the table.
You can inspect the result set of the first 10 records.
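The same preview can also be run programmatically; a small boto3 sketch, with a placeholder output location for the query results:

```python
import boto3

athena = boto3.client("athena", region_name="ap-southeast-2")

# Preview the first 10 records of the apj_cb table.
response = athena.start_query_execution(
    QueryString='SELECT * FROM "community"."apj_cb" LIMIT 10',
    QueryExecutionContext={"Database": "community"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},  # placeholder
)
print(response["QueryExecutionId"])
```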
Step 16: Create a graph data model from the 'apj_cb' table
In the next blog (Part 3), we will explore Gremlin statements in the Neptune data model in greater detail with four components:
- subject (S) - source vertex (i.e. node)
- predicate (P) - edge label
- object (O) - target vertex (i.e. node)
- graph (G)
Step 17: Use Amazon Athena to run a SQL query named Edge query and export the data as a CSV file by clicking Download results.
Step 18: Use Amazon Athena to run a SQL query named Node query and export the data as a CSV file by clicking Download results.
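The exact SQL depends on the graph model you design in Step 16; purely as an illustration, with made-up column mappings and an assumed 'based_in' edge label, the two queries could reshape the table into node and edge result sets with the reserved Gremlin headers:

```python
# Illustrative only - adjust the columns and edge labels to your own model.
node_query = """
SELECT name     AS "~id",
       'person' AS "~label",
       city,
       country
FROM "community"."apj_cb"
"""

edge_query = """
SELECT concat(name, '->', country) AS "~id",
       name       AS "~from",
       country    AS "~to",
       'based_in' AS "~label"
FROM "community"."apj_cb"
"""
```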
Step 19: Navigate to the S3 bucket that stores the CSV files exported by Amazon Athena from the two SQL queries, with prefixes matching the query names node_query and apj-cb-edges, as shown below:
Create a new S3 bucket for bulk data loading, e.g. workingbucketapjneptune, and move these CSV files into it.
Create two folders and move the CSV files into them.
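Moving the exported files can also be scripted; a sketch with placeholder bucket names and keys (Athena names its output files after the query execution ID, so check the actual keys in your results location first):

```python
import boto3

s3 = boto3.client("s3", region_name="ap-southeast-2")

# Copy the Athena query results into the bulk-load working bucket,
# one folder per file type.
copies = [
    ("node_query.csv", "nodes/node_query.csv"),      # placeholder source/destination keys
    ("apj-cb-edges.csv", "edges/apj-cb-edges.csv"),
]
for source_key, dest_key in copies:
    s3.copy_object(
        Bucket="workingbucketapjneptune",
        Key=dest_key,
        CopySource={"Bucket": "your-athena-results-bucket", "Key": source_key},
    )
```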
Step 20: Load data into a Neptune instance by following the instructions here
Navigate to the Neptune console by typing the word Neptune into the search bar.
On the left-hand menu, select Databases to find the Neptune cluster endpoint under the Connectivity & security tab.
Step 21: Make a REST API call to the Neptune loader to commence the bulk loading process.
Connect to the EC2 instance using SSH from Windows by following the Amazon documentation here, or you may follow the steps provided by Stephane Maarek in his YouTube video, SSH to EC2 Instances using Windows Tutorial.
I was able to successfully connect to a Linux instance from Windows using PuTTY, as shown below:
Step 22: Navigate to the Amazon Neptune console, click on the cluster, go to Actions and select Manage IAM roles. Check that you have an existing IAM role that allows Neptune to connect to S3 buckets for the bulk load API call (i.e. NeptuneFromS3role).
Select the role from the drop-down menu and add it as an additional IAM role.
Step 23: Make a REST API call to load the data into Neptune.
The sample code includes the following parameters for a data load.
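A minimal sketch of that request in Python, run from the EC2 instance inside the VPC; it assumes IAM database authentication is not enabled on the cluster (otherwise the request must be SigV4-signed), and the endpoint, S3 path and role ARN are placeholders:

```python
import json
import requests

NEPTUNE_ENDPOINT = "your-neptune-cluster-endpoint"  # from the Connectivity & security tab

payload = {
    "source": "s3://workingbucketapjneptune/nodes/",               # folder from Step 19
    "format": "csv",
    "iamRoleArn": "arn:aws:iam::123456789012:role/NeptuneLoadFromS3",  # placeholder ARN
    "region": "ap-southeast-2",
    "failOnError": "FALSE",
    "parallelism": "MEDIUM",
    "updateSingleCardinalityProperties": "FALSE",
    "queueRequest": "TRUE",
}

# Submit the bulk load job to the Neptune loader endpoint.
response = requests.post(
    f"https://{NEPTUNE_ENDPOINT}:8182/loader",
    data=json.dumps(payload),
    headers={"Content-Type": "application/json"},
)
print(response.json())  # returns a loadId you can poll at /loader/<loadId>
```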
Query Data
You may query data from a Neptune DB cluster or use a graph notebook linked to Amazon SageMaker using the following languages:
- Gremlin
- openCypher
- SPARQL
Open the sample graph notebooks with tutorials provided by Amazon, created from the Neptune cluster in a Jupyter notebook hosted on Python 3. You can also create your own Jupyter notebook.
Write Gremlin code for your vertex and edge statements, as sketched below.
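Outside the notebook cell magics, a minimal gremlinpython sketch (with a placeholder endpoint) to check what the bulk loader created:

```python
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal

NEPTUNE_ENDPOINT = "your-neptune-cluster-endpoint"  # replace with your cluster endpoint

# Open a remote traversal source against the Gremlin endpoint.
conn = DriverRemoteConnection(f"wss://{NEPTUNE_ENDPOINT}:8182/gremlin", "g")
g = traversal().withRemote(conn)

# Count the vertices and edges loaded by the bulk loader.
print(g.V().count().next())
print(g.E().count().next())

# Inspect a few vertices and their properties.
print(g.V().limit(5).valueMap(True).toList())

conn.close()
```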
Until the next lesson, happy learning! 😁
Next Lesson
In Part 3, the final instalment of the graph database series, you will learn how to visualize network graphs and explore relationships.
Additional Resources
Keep up to date with the latest news from AWS Database Blogs for Amazon Neptune
Amazon Neptune simplifies graph analytics and machine learning workflows with Python integration
AWS Innovate Online Conference APJ - 23 & 25 August 2022
Register at this link
🚀 Hot off the press: Introducing Amazon Neptune Global Database
From 27 July 2022, Amazon Neptune Global Database can build graph relationships in multiple Regions, including US East (N. Virginia), US East (Ohio), US West (N. California), US West (Oregon), Europe (Ireland), Europe (London), and Asia Pacific (Tokyo).
You can read about this announcement from the AWS Database Blog written by author Navtanay Sinha here.
Register for Australia's biggest data engineering conference DataEngBytes:
- Melbourne: 27 September 2022
- Sydney: 29 September 2022
Register here: https://dataengconf.com.au/