Wendy Wong for AWS Community Builders


Build a graph database with Amazon Neptune Part 2 - Load data

Introduction

The previous blog explored different use cases for a graph database, such as customer journey mapping, social networking and fraud detection.

A graph database is useful for reporting or describing relationships within your dataset.

My solution overview is shown below:

solution

In this second blog, we will load data into Amazon Neptune.

Recap - Create a graph database

A graph database was created using AWS CloudFormation. The Neptune DB cluster is shown below:

Neptune cluster

Learning Outcomes

  • Create an IAM policy to allow access to data files in an S3 bucket
  • Create an Amazon S3 bucket
  • Load your data
  • Query data using Amazon Athena
  • Load data into Amazon Neptune
  • Query data

Prerequisites to load data

The instructions to allow an IAM Role and Amazon S3 bucket to access the Amazon Neptune DB cluster are outlined here.

  • Create an IAM role to allow Amazon Neptune to access Amazon S3 resources
  • Add IAM Role to an Amazon Neptune Cluster
  • Create the Amazon S3 VPC Endpoint

Tutorial 1: Create an IAM policy to allow access to data files in an S3 bucket

We will follow the instructions provided here to create permissions for Amazon Neptune to access the S3 bucket.

Step 1: Log in to the AWS Management Console with your IAM admin user account.

Step 2: Navigate to the IAM console by typing 'IAM' in the search bar.

iam

Click Roles in the left-hand menu.

role

Step 3: Click Create role.

create role

Step 4: Type S3 under 'Use cases for other AWS Services' and click Next.

S3

Step 5: Under the filter, type AmazonS3ReadOnlyAccess, select the policy and click Next.

read only access

Step 6: Create a role name and enter a description

role name

Click Create role.

create

Step 7: After the role has been created, type NeptuneLoadFromS3 in the search bar and click the hyperlink.

type role

Step 8: Click Edit trust relationship on the Trust relationships tab.

edit

Copy and paste this code:

copy code
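In case the screenshot is hard to read, the trust relationship in the Amazon Neptune user guide looks like the following; Neptune runs on the RDS platform, so the role trusts rds.amazonaws.com:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "rds.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```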

Click Update policy

Update policy

Tutorial 2: Add IAM Role to an Amazon Neptune Cluster

Step 1: Add this IAM role to the Neptune cluster.

Navigate to the Neptune console by typing Neptune in the search bar and click Databases in the left-hand menu.

neptune search

Step 2: Select the cluster and, under the Actions drop-down menu, choose Manage IAM roles.

manage iam

Step 3: Select the new IAM role that was created in Tutorial 1: Step 6 and click Done.

iam done

Step 4: The cluster is now available.

available

Tutorial 3: Create the Amazon S3 VPC Endpoint

You may follow the instructions here from the Amazon Neptune User Guide to create an Amazon S3 VPC endpoint.

Step 1: In the search bar, type 'VPC'

VPC

Step 2: On the right-hand side click Endpoints

endpoints

Step 3: Click Create endpoint

create endpoints

Step 4: In this section, enter 's3' in the search field and select 'com.amazonaws.ap-southeast-2.s3'.

region endpoint

Step 5: Choose the VPC in which to create your endpoint:

VPC endpoint

Step 6: Check the box for the route table associated with your subnet.

Step 7: Click Create endpoint.
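If you prefer the AWS CLI, a roughly equivalent command (the VPC and route table IDs are placeholders for your own values) is:

```bash
# create a gateway endpoint for S3 in the cluster's VPC
aws ec2 create-vpc-endpoint \
    --vpc-id vpc-0123456789abcdef0 \
    --service-name com.amazonaws.ap-southeast-2.s3 \
    --route-table-ids rtb-0123456789abcdef0
```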

Gremlin load data format

Data loaded into Amazon Neptune as a property graph and queried with Gremlin must be in CSV format.

The CSV file must be saved with UTF-8 encoding.
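For reference, in the Gremlin load data format a vertex file needs ~id and ~label columns and an edge file needs ~id, ~from, ~to and ~label columns. A minimal sketch with made-up rows (the property columns, labels and values are illustrative only, not taken from the dataset) looks like this:

```csv
~id,~label,name:String
v1,builder,Jane Doe
v2,country,Australia
```

```csv
~id,~from,~to,~label
e1,v1,v2,lives_in
```

The first file defines the vertices (nodes) and the second defines an edge between them.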

Dataset

I created this dataset from the AWS Community Builders directory, which contains records for 457 Community Builders in the Asia-Pacific region, with details such as:

  • Name
  • City
  • Country
  • Category
  • Year
  • Location

Directory

This is a preview of a subset of data:

sample data

In this dataset, Name represents the nodes, while the other variables, which describe relationships, form the edges.

Tutorial 4: ETL Process and Data Prep for Loading data into Amazon Neptune

The following tutorial will help you transform the data into the Gremlin load data format.

Step 1: Create an AWS account and sign in as an IAM admin user. You may follow the instructions here to create an account if you do not have one.

Step 2: Create an S3 bucket, name it aws-community-builders-apj using lowercase letters, and select a Region that is geographically close to you, for example Asia Pacific (Sydney) ap-southeast-2.

You may follow the instructions here to create your first S3 bucket.

name bucket
The S3 bucket was successfully created as shown below:

created bucket
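If you prefer the AWS CLI, the same bucket can be created with a single command (Region as chosen above):

```bash
aws s3 mb s3://aws-community-builders-apj --region ap-southeast-2
```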

Step 3: Create two folders and name them raw and processed

create folder

folders

Step 4: Add the 'apj_cb_list.csv' file into the raw folder and click upload.

upload file
The CSV file has been successfully uploaded to the raw folder in the S3 bucket and is stored in the Standard storage class.

raw object

Upload the CSV file in UTF-8 format into the S3 bucket.

utf
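For reference, assuming the UTF-8 copy is saved locally as apj_cb_list_utf.csv (the name referenced later in Athena), the equivalent CLI upload is:

```bash
# upload the UTF-8 encoded CSV into the raw folder
aws s3 cp apj_cb_list_utf.csv s3://aws-community-builders-apj/raw/
```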

The file has been added to the S3 bucket in the 'raw' folder.

utf uploaded

Step 5: Navigate to the AWS Glue console to create a crawler and obtain the schema within the Glue Data Catalog.

glue navigation

Click Crawlers in the left-hand menu.

crawler

You may follow all the steps from this ETL tutorial here to understand how to create a crawler and access a table from the AWS Glue data catalog.

Step 6: A crawler named 'aws-community-builder-apj-crawler' was created as shown below.

running crawler

After the crawler job has completed, you will see a 'ready' status and a table is created.

crawler is ready
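If you prefer to drive the crawler from the command line, the equivalent AWS Glue CLI calls (crawler name as above) are:

```bash
# start the crawler and poll its state
aws glue start-crawler --name aws-community-builder-apj-crawler
aws glue get-crawler --name aws-community-builder-apj-crawler --query 'Crawler.State'
```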

Step 7: Browse the data catalog by clicking Tables in the left-hand menu, under the database Community that was previously created in the ETL job.

inspect table

Step 8: Click into the 'aws-community-builder-apj-crawler' crawler and click Edit.

The CSV file saved in UTF-8 format was selected as the data store.


Step 9: Re-run the crawler to determine the schema for the dataset.

re-run the crawler

The crawler job has completed as reflected with a 'ready' status.

crawler is ready

Step 10: Select Tables on the left-hand side and check the box for the second table to check the schema.

Tables

Check the schema of the table.

schema

Step 11: In the top-right corner, click Edit schema so that you can edit the column headers to match the original dataset.

edit schema

Rename the column names and then click Save.

rename

save schema

Step 12: Type 'Athena' in the search bar and navigate to the Amazon Athena console.

navigate to Athena

Step 13: You will be able to view the 'Community' database that was created from AWS Glue, along with two tables.

tables

Step 14: Click Create and select the option to create a table from S3 bucket data.

Provide a table name, APJ_CB.

Under Column details, enter the schema from the 'apj_cb_list_utf.csv' table.

Step 15: After the 'apj_cb' table is created, click on the three dots next to it to preview the table.

preview table

You can inspect the result set of the first 10 records.

result set
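Preview table simply generates a SELECT with a LIMIT; the SQL is along these lines (database and table names as created above):

```sql
SELECT * FROM "community"."apj_cb" LIMIT 10;
```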

Step 16: Create a graph data model from the 'apj_cb' table.

data modeling

In the next blog (Part 3), we will explore Gremlin statements in the Neptune data model in greater detail, with its four components:

  • subject (S) - the source vertex (i.e. node)
  • predicate (P) - the edge label or a property name
  • object (O) - the target vertex (i.e. node) or a property value
  • graph (G) - an optional named graph

For example, the fact that a builder is located in Australia would be stored with the builder vertex as the subject, the edge label as the predicate and the Australia vertex as the object.

Step 17: Use Amazon Athena to run a SQL query, named Edge query, and export the data as a CSV file by clicking Download results. A sketch of such a query is shown after the screenshots below.

edge query

download results
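The exact query depends on the relationships you want to model; purely as an illustration (the column aliases follow the Neptune bulk load format, and the label and identifier choices here are assumptions, not the author's actual query), an edge query linking each builder to their country could look like this:

```sql
-- hypothetical edge query: one 'lives_in' edge per builder
SELECT concat(name, '->', country) AS "~id",
       name                        AS "~from",
       country                     AS "~to",
       'lives_in'                  AS "~label"
FROM "community"."apj_cb";
```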

Step 18: Use Amazon Athena to run a SQL query, named Node query, and export the data as a CSV file by clicking Download results.

node query

node output
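Similarly, a hypothetical node query (again, the label and column choices are assumptions) that produces one vertex per builder might be:

```sql
-- hypothetical node query: one vertex per Community Builder
SELECT name      AS "~id",
       'builder' AS "~label",
       city,
       country,
       category,
       "year"
FROM "community"."apj_cb";
```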

Step 19: Navigate to the S3 bucket that stores the CSV files exported from the two Amazon Athena queries, with prefixes matching the query names node_query and apj-cb-edges, as shown below:

S3

Create a new S3 bucket for bulk data loading, e.g. workingbucketapjneptune, and move these CSV files into it.

workingbucket

Create two folders to hold these CSV files.

create folder
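The move can also be done with the AWS CLI; a sketch, assuming the folders are named nodes and edges and using placeholder source paths, is:

```bash
# copy the exported Athena results into the bulk-load bucket
aws s3 cp s3://<athena-query-results-bucket>/node_query.csv s3://workingbucketapjneptune/nodes/
aws s3 cp s3://<athena-query-results-bucket>/apj-cb-edges.csv s3://workingbucketapjneptune/edges/
```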

Step 20: Load data into the Neptune instance by following the instructions here.

Navigate to the Neptune console by typing the word Neptune into the search bar.

neptune

In the left-hand menu, select Databases to find the Neptune cluster endpoint under the Connectivity & security tab.

connect
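You can also retrieve the cluster endpoint from the AWS CLI:

```bash
# list Neptune clusters with their endpoints
aws neptune describe-db-clusters --query 'DBClusters[*].[DBClusterIdentifier,Endpoint]'
```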

Step 21: Make a REST API call to the Neptune loader to commence the bulk loading process.

Connect to your EC2 instance using SSH from Windows by following the Amazon documentation here, or follow the steps provided by Stephane Maarek in his YouTube video SSH to EC2 Instances using Windows Tutorial.

I was able to successfully connect to a Linux instance from Windows using PuTTY, as shown below:

Putty

AMI

Step 22: Navigate to the Amazon Neptune console, click on the cluster, go to Actions and select Manage IAM roles. Check to ensure that you have an existing IAM role that allows Neptune to connect to S3 buckets for the bulk load API call (i.e. NeptuneFromS3role).

Select the role from the drop-down menu to add this additional IAM role.

role

Step 23: Make a REST API call to load the data into Neptune.

The sample code includes the following parameters for a data load.

Bulk load
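If the screenshot is hard to read, the Neptune bulk loader is called with an HTTP POST to the /loader endpoint. A sketch of the curl command follows; the endpoint, S3 prefix, role ARN and Region are placeholders for your own values, and the folder name nodes/ is the assumption from Step 19:

```bash
curl -X POST \
    -H 'Content-Type: application/json' \
    https://<your-neptune-cluster-endpoint>:8182/loader -d '
    {
      "source" : "s3://workingbucketapjneptune/nodes/",
      "format" : "csv",
      "iamRoleArn" : "arn:aws:iam::<account-id>:role/NeptuneLoadFromS3",
      "region" : "ap-southeast-2",
      "failOnError" : "FALSE",
      "parallelism" : "MEDIUM",
      "updateSingleCardinalityProperties" : "FALSE",
      "queueRequest" : "TRUE"
    }'
```

The call returns a loadId that you can poll with a GET request to the same /loader endpoint; run the load once for the node file prefix and once for the edge file prefix.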

Query Data

You may query data from a Neptune DB cluster directly, or use a graph notebook linked to Amazon SageMaker, with the following languages:

  • Gremlin
  • openCypher
  • SPARQL

notebook

Open the sample graph notebooks with tutorials provided by Amazon, created for the Neptune cluster in a Jupyter notebook running Python 3. You can also create your own Jupyter notebook.

open

Write Gremlin code for your vertex and edge statements, as shown below.

edge statement
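Once the notebook is connected to the cluster, a quick sanity check that the bulk load worked is to count the vertices using the %%gremlin cell magic that ships with the Neptune workbench:

```
%%gremlin
g.V().count()
```

In a separate cell, a traversal such as g.V().hasLabel('builder').limit(5).valueMap() returns a few vertices with their properties (the 'builder' label is the assumption from the node query sketch earlier).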

Until the next lesson, happy learning! 😁

Next Lesson

In Part 3, the final instalment of the graph database series, you will learn how to visualize network graphs and explore relationships.

Additional Resources

AWS Innovate Online Conference APJ - 23 & 25 August 2022

AWS Innovate

Register at this link

🚀 Hot off the press: Introducing Amazon Neptune Global Database

From 27 July 2022, Amazon Neptune Global Database can build graph relationships across multiple Regions, including the US East (N. Virginia), US East (Ohio), US West (N. California), US West (Oregon), Europe (Ireland), Europe (London), and Asia Pacific (Tokyo) Regions.

You can read about this announcement from the AWS Database Blog written by author Navtanay Sinha here.

Register for Australia's biggest data engineering conference DataEngBytes:

  • Melbourne: 27 September 2022
  • Sydney: 29 September 2022

Register here: https://dataengconf.com.au/

Data Eng Bytes
