Introduction
The previous blog explored different use cases for a graph database, such as customer journey mapping, social networking and fraud detection.
A graph database is useful for reporting or describing relationships within your dataset.
My solution overview is shown below:
In this second blog, we will load data into Amazon Neptune.
Recap - Create a graph database
A graph database was created using AWS CloudFormation. The Neptune DB cluster is shown below:
Learning Outcomes
- Create an IAM role to allow access to data files in an S3 bucket
- Create an Amazon S3 bucket
- Load your data
- Query data using Amazon Athena
- Load data into Amazon Neptune
- Query data
Prerequisites to load data
The instructions to allow the Amazon Neptune DB cluster to access an Amazon S3 bucket via an IAM role are outlined here.
- Create an IAM role to allow Amazon Neptune to access Amazon S3 resources
- Add IAM Role to an Amazon Neptune Cluster
- Create the Amazon S3 VPC Endpoint
Tutorial 1: Create an IAM role to allow access to data files in an S3 bucket
We will follow the instructions provided here to create permissions for Amazon Neptune to access the S3 bucket.
Step 1: Login to the AWS Management Console with your IAM Admin User account.
Step 2: Navigate to the IAM console by typing 'IAM' in the search bar.
Click Roles on the left-hand menu.
Step 3: Click Create role.
Step 4: Type S3 under 'Use cases for other AWS Services' and click Next.
Step 5: Under the filter, type S3ReadOnlyAccess, select the AmazonS3ReadOnlyAccess policy and click Next.
Step 6: Enter a role name and a description.
Click Create role.
Step 7: After the role has been created, type NeptuneLoadFromS3 in the search bar and click the hyperlink.
Step 8: On the 'Trust relationships' tab, click Edit trust relationship.
Copy and paste this code:
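The policy below is the standard trust policy from the Amazon Neptune documentation: it lets Neptune (via the rds.amazonaws.com service principal) assume the role. The document inside the `trust_policy` dictionary is the JSON to paste into the console editor; the boto3 call is an optional scripted alternative, and it assumes the NeptuneLoadFromS3 role name used in this walkthrough.

```python
import json
import boto3

iam = boto3.client("iam")

# Trust policy allowing Amazon Neptune (service principal rds.amazonaws.com)
# to assume the role so the bulk loader can read from Amazon S3.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "rds.amazonaws.com"},
            "Action": "sts:AssumeRole"
        }
    ]
}

# Scripted equivalent of editing the trust relationship in the console.
iam.update_assume_role_policy(
    RoleName="NeptuneLoadFromS3",
    PolicyDocument=json.dumps(trust_policy),
)
```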
Click Update policy
Tutorial 2: Add IAM Role to an Amazon Neptune Cluster
Step 1: Add this IAM role to the Neptune cluster.
Navigate to the Neptune console by typing Neptune in the search bar and click Databases in the left-hand menu.
Step 2: Select the cluster and under the Actions drop-down menu, choose Manage IAM roles
Step 3: Select the new IAM role that was created in Tutorial 1: Step 6 and click Done.
Step 4: The cluster is now available.
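If you prefer scripting the same change, a minimal boto3 sketch might look like this; the cluster identifier and role ARN are placeholders to adjust for your own account:

```python
import boto3

neptune = boto3.client("neptune", region_name="ap-southeast-2")

# Attach the IAM role created in Tutorial 1 to the Neptune DB cluster so the
# bulk loader can read from Amazon S3.
neptune.add_role_to_db_cluster(
    DBClusterIdentifier="my-neptune-cluster",                     # placeholder cluster ID
    RoleArn="arn:aws:iam::123456789012:role/NeptuneLoadFromS3",   # placeholder ARN
)
```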
Tutorial 3: Create the Amazon S3 VPC Endpoint
You may follow the instructions here from the Amazon Neptune User Guide to create an Amazon S3 VPC endpoint.
Step 1: In the search bar, type 'VPC'
Step 2: In the navigation pane, click Endpoints.
Step 3: Click Create endpoint
Step 4: In this section, enter 's3' in the search field and select 'com.amazonaws.ap-southeast-2.s3'.
Step 5: Choose the VPC in which to create your endpoint:
Step 6: Check the box of the route table associated with the subnet
Step 7: Click Create endpoint.
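Equivalently, the gateway endpoint can be created programmatically; here is a minimal boto3 sketch with placeholder VPC and route table IDs:

```python
import boto3

ec2 = boto3.client("ec2", region_name="ap-southeast-2")

# Create a gateway endpoint so Neptune can reach Amazon S3 from inside the VPC.
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",              # placeholder VPC ID
    ServiceName="com.amazonaws.ap-southeast-2.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],    # placeholder route table ID
)
```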
Gremlin load data format
Data loaded into Amazon Neptune as a property graph and queried using Gremlin must be in CSV format.
The CSV file must be saved with UTF-8 encoding.
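As a rough sketch of that format, vertex files use the reserved ~id and ~label headers, while edge files also need ~from and ~to. The property columns and sample rows below are placeholders only, loosely based on the dataset described next:

```python
import csv

# Minimal, illustrative files in the Gremlin load data format.
with open("vertices.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["~id", "~label", "name", "city", "country"])
    writer.writerow(["cb1", "person", "Jane Doe", "Sydney", "Australia"])   # example row
    writer.writerow(["sydney", "city", "Sydney", "", ""])                   # example row

with open("edges.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["~id", "~from", "~to", "~label"])
    writer.writerow(["e1", "cb1", "sydney", "lives_in"])                    # example edge
```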
Dataset
I created this dataset from the AWS Community Builders Directory, which contains records for 457 Community Builders in the Asia-Pacific region, with details such as:
- Name
- City
- Country
- Category
- Year
- Location
This is a preview of a subset of data:
In this dataset, Name represents the nodes, and the other variables, which describe relationships, are the edges.
Tutorial 4: ETL Process and Data Prep for Loading data into Amazon Neptune
The following tutorial will help you transform the data into the Gremlin load data format.
Step 1: Create an AWS account and sign in with IAM Admin user. You may follow instructions here to create an account if you do not have one.
Step 2: Create an S3 bucket and name it aws-community-builders-apj in lowercase letters, and select a region that is geographically close to you, for example Asia Pacific (Sydney) ap-southeast-2.
You may follow the instructions here to create your first S3 bucket.
The S3 bucket was successfully created as shown below:
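For completeness, the same bucket can be created with boto3 (bucket names are globally unique, so you may need to vary the name):

```python
import boto3

s3 = boto3.client("s3", region_name="ap-southeast-2")

# Buckets outside us-east-1 need an explicit LocationConstraint.
s3.create_bucket(
    Bucket="aws-community-builders-apj",
    CreateBucketConfiguration={"LocationConstraint": "ap-southeast-2"},
)
```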
Step 3: Create two folders and name them raw and processed
Step 4: Add the 'apj_cb_list.csv' file into the raw folder and click upload.
The CSV file, saved in UTF-8 format, has been successfully uploaded to the 'raw' folder of the S3 bucket and is stored in the Standard storage class.
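Since S3 'folders' are just key prefixes, these steps can also be scripted; a minimal sketch, assuming the CSV sits in your working directory:

```python
import boto3

s3 = boto3.client("s3", region_name="ap-southeast-2")
bucket = "aws-community-builders-apj"

# Create the two "folders" (zero-byte objects with a trailing slash).
for prefix in ("raw/", "processed/"):
    s3.put_object(Bucket=bucket, Key=prefix)

# Upload the UTF-8 CSV into the raw folder using the Standard storage class.
s3.upload_file(
    "apj_cb_list.csv",
    bucket,
    "raw/apj_cb_list.csv",
    ExtraArgs={"StorageClass": "STANDARD"},
)
```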
Step 5: Navigate to AWS Glue console to create a crawler and obtain the schema within the Glue data catalog.
Click Crawlers in the left-hand menu.
You may follow all the steps from this ETL tutorial here to understand how to create a crawler and access a table from the AWS Glue data catalog.
Step 6: A crawler named 'aws-community-builder-apj-crawler' was created as shown below.
After the crawler job has completed, its status changes to 'Ready' and a table is created.
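For reference, an equivalent crawler can be created and started with boto3; the Glue service role name here is a placeholder:

```python
import boto3

glue = boto3.client("glue", region_name="ap-southeast-2")

# Crawl the raw folder and register the resulting table in the
# 'community' database of the Glue Data Catalog.
glue.create_crawler(
    Name="aws-community-builder-apj-crawler",
    Role="AWSGlueServiceRole-communitybuilders",   # placeholder Glue service role
    DatabaseName="community",
    Targets={"S3Targets": [{"Path": "s3://aws-community-builders-apj/raw/"}]},
)
glue.start_crawler(Name="aws-community-builder-apj-crawler")
```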
Step 7: Browse the data catalog by clicking Tables in the left-hand menu, under the 'Community' database that was previously created in the ETL job.
Step 8: Click into 'aws-community-builder-apj-crawler' and click Edit.
The CSV file saved in UTF-8 format was selected as the data store.
Step 9: Re-run the crawler to determine the schema for the dataset.
The crawler job has completed, as reflected by its 'Ready' status.
Step 10: Select Tables on the left-hand side and tick the box for the second table to inspect its schema.
Step 11: In the top-right corner, click Edit schema so that you can rename the column headers to match the original dataset.
Rename the columns and then click Save.
Step 12: Type 'Athena' in the search bar and navigate to the Amazon Athena console.
Step 13: You will be able to view the 'Community' database that was created from AWS Glue, along with two tables.
Step 14: Click Create and select the option to create a table from S3 bucket data.
Provide the table name APJ_CB.
Under Column details, enter the schema from the 'apj_cb_list_utf.csv' table.
Step 15: After the 'apj_cb' table is created, click on the three dots to preview the table.
You can inspect the result set of the first 10 records.
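The same preview can also be run programmatically; a small boto3 sketch, with a placeholder output location for the query results:

```python
import boto3

athena = boto3.client("athena", region_name="ap-southeast-2")

# Preview the first 10 records of the apj_cb table.
response = athena.start_query_execution(
    QueryString='SELECT * FROM "community"."apj_cb" LIMIT 10',
    QueryExecutionContext={"Database": "community"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},  # placeholder
)
print(response["QueryExecutionId"])
```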
Step 16: Create a graph data model from the 'apj_cb' table
In the next blog (Part 3), we will explore Gremlin statements in the Neptune data model in greater detail with four components:
- subject (S) - source vertex (i.e. node)
- predicate (P) - edge label
- object (O) - target vertex (i.e. node)
- graph (G)
Step 17: Use Amazon Athena to run a SQL query named Edge query and export the data as a CSV file by clicking Download results.
Step 18: Use Amazon Athena to run a SQL query named Node query and export the data as a CSV file by clicking Download results.
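The exact SQL depends on the graph model you design in Step 16; purely as an illustration, with made-up column mappings and an assumed 'based_in' edge label, the two queries could reshape the table into node and edge result sets with the reserved Gremlin headers:

```python
# Illustrative only - adjust the columns and edge labels to your own model.
node_query = """
SELECT name     AS "~id",
       'person' AS "~label",
       city,
       country
FROM "community"."apj_cb"
"""

edge_query = """
SELECT concat(name, '->', country) AS "~id",
       name       AS "~from",
       country    AS "~to",
       'based_in' AS "~label"
FROM "community"."apj_cb"
"""
```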
Step 19: Navigate to the S3 bucket that stores the CSV files exported by Amazon Athena from the two SQL queries, with prefixes matching the query names node_query and apj-cb-edges, as shown below:
Create a new S3 bucket for bulk data loading, e.g. workingbucketapjneptune, and move these CSV files into it.
Create two folders and move the CSV files into them.
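Moving the exported files can also be scripted; a sketch with placeholder bucket names and keys (Athena names its output files after the query execution ID, so check the actual keys in your results location first):

```python
import boto3

s3 = boto3.client("s3", region_name="ap-southeast-2")

# Copy the Athena query results into the bulk-load working bucket,
# one folder per file type.
copies = [
    ("node_query.csv", "nodes/node_query.csv"),      # placeholder source/destination keys
    ("apj-cb-edges.csv", "edges/apj-cb-edges.csv"),
]
for source_key, dest_key in copies:
    s3.copy_object(
        Bucket="workingbucketapjneptune",
        Key=dest_key,
        CopySource={"Bucket": "your-athena-results-bucket", "Key": source_key},
    )
```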
Step 20: Load data into a Neptune instance by following the instructions here
Navigate to the Neptune console by typing the word Neptune into the search bar.
On the left-hand menu, select Databases to find the Neptune cluster endpoint under the Connectivity & security tab.
Step 21: Make a REST API call to the Neptune loader to commence the bulk loading process.
Connect to the EC2 instance using SSH from Windows by following the Amazon documentation here, or you may follow the steps provided by Stephane Maarek in his YouTube video, SSH to EC2 Instances using Windows Tutorial.
I was able to successfully connect to a Linux instance from Windows using PuTTY, as shown below:
Step 22: Navigate to the Amazon Neptune console, click on the cluster, go to Actions and select Manage IAM roles. Check that you have an existing IAM role that allows Neptune to connect to S3 buckets for the bulk load API call (i.e. NeptuneFromS3role).
Select the role from the drop-down menu and add it as an additional IAM role.
Step 23: Make a REST API call to load the data into Neptune.
The sample code includes the following parameters for a data load.
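A minimal sketch of that request in Python, run from the EC2 instance inside the VPC; it assumes IAM database authentication is not enabled on the cluster (otherwise the request must be SigV4-signed), and the endpoint, S3 path and role ARN are placeholders:

```python
import json
import requests

NEPTUNE_ENDPOINT = "your-neptune-cluster-endpoint"  # from the Connectivity & security tab

payload = {
    "source": "s3://workingbucketapjneptune/nodes/",               # folder from Step 19
    "format": "csv",
    "iamRoleArn": "arn:aws:iam::123456789012:role/NeptuneLoadFromS3",  # placeholder ARN
    "region": "ap-southeast-2",
    "failOnError": "FALSE",
    "parallelism": "MEDIUM",
    "updateSingleCardinalityProperties": "FALSE",
    "queueRequest": "TRUE",
}

# Submit the bulk load job to the Neptune loader endpoint.
response = requests.post(
    f"https://{NEPTUNE_ENDPOINT}:8182/loader",
    data=json.dumps(payload),
    headers={"Content-Type": "application/json"},
)
print(response.json())  # returns a loadId you can poll at /loader/<loadId>
```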
Query Data
You may query data from a Neptune DB cluster or use a graph notebook linked to Amazon SageMaker using the following languages:
- Gremlin
- openCypher
- SPARQL
Open the sample graph notebooks with tutorials provided by Amazon, created from the Neptune cluster in a Jupyter notebook hosted on Python 3. You can also create your own Jupyter notebook.
Write Gremlin code for your vertex and edge statements, as sketched below.
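Outside the notebook cell magics, a minimal gremlinpython sketch (with a placeholder endpoint) to check what the bulk loader created:

```python
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal

NEPTUNE_ENDPOINT = "your-neptune-cluster-endpoint"  # replace with your cluster endpoint

# Open a remote traversal source against the Gremlin endpoint.
conn = DriverRemoteConnection(f"wss://{NEPTUNE_ENDPOINT}:8182/gremlin", "g")
g = traversal().withRemote(conn)

# Count the vertices and edges loaded by the bulk loader.
print(g.V().count().next())
print(g.E().count().next())

# Inspect a few vertices and their properties.
print(g.V().limit(5).valueMap(True).toList())

conn.close()
```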
Until the next lesson, happy learning! 😁
Next Lesson
In Part 3, the final instalment of the graph database series, you will learn how to visualize network graphs and explore relationships.
Additional Resources
Keep up to date with the latest news from AWS Database Blogs for Amazon Neptune
Amazon Neptune simplifies graph analytics and machine learning workflows with Python integration
AWS Innovate Online Conference APJ - 23 & 25 August 2022
Register at this link
🚀 Hot off the press: Introducing Amazon Neptune Global Database
From 27 July 2022, Amazon Neptune Global Database can build graph relationships in multiple Regions, including US East (N. Virginia), US East (Ohio), US West (N. California), US West (Oregon), Europe (Ireland), Europe (London), and Asia Pacific (Tokyo).
You can read about this announcement from the AWS Database Blog written by author Navtanay Sinha here.
Register for Australia's biggest data engineering conference DataEngBytes:
- Melbourne: 27 September 2022
- Sydney: 29 September 2022
Register here: https://dataengconf.com.au/