
Daniele Frasca for AWS Community Builders


Amazon Neptune

It all started because I was giving my Like to movies on Netflix, and someone asked me why. So I put my IT hat on and started explaining how services like Netflix have some complex logic that tracks the history of my interactions and cross-references my tastes with those of other members, considering information like titles, actors, genres, etc.

From this, I got curious and took the opportunity to learn a new service: Amazon Neptune. A graph database is a type of NoSQL database, and it is a good fit when data must be connected and when the connections are more important than the schema structure.

Amazon Neptune

Amazon Neptune is a fully-managed graph database service and supports the popular graph query languages Apache TinkerPop Gremlin and W3C's SPARQL.

Neptune's main components are:

  1. Primary DB instance – Supports read and write operations and performs all data modifications to the cluster volume.

  2. Neptune replica – Connects to the same storage volume as the primary DB instance and supports only read operations (up to 15 Neptune Replicas).

  3. Cluster volume – Neptune data is stored in the cluster volume. A cluster volume consists of copies of the data across multiple Availability Zones in a single AWS Region.

database.png

The cluster endpoint and reader endpoint support high-availability scenarios because Neptune uses them to reroute connections when some DB instances are unavailable. For example, if the primary DB instance of a DB cluster fails, Neptune automatically fails over to a new primary DB instance. It either promotes an existing replica to a new primary DB instance or creates a new primary DB instance.

It is not serverless, so you pay for the instances by the hour plus the storage consumed. Refer to Amazon Neptune pricing.

I configured Neptune in a private subnet with a security group that allows access only on port 8182 from the security group where my Lambda functions are configured.
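For reference, this is roughly how such an ingress rule looks with the AWS CLI. This is only a sketch with hypothetical security group IDs, not my exact setup:

# Hypothetical IDs: sg-aaaa1111 is the Neptune security group, sg-bbbb2222 the Lambda one
aws ec2 authorize-security-group-ingress \
  --group-id sg-aaaa1111 \
  --protocol tcp \
  --port 8182 \
  --source-group sg-bbbb2222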

vpc.jpeg

You can configure a Lambda function in your AWS account to connect to private subnets in a virtual private cloud (VPC).

lambda.png

Running a Lambda in a VPC takes work, especially if you have a particular type of load. Typical errors are:

  • EC2ThrottledException - when the VPC does not have sufficient ENIs or subnet IPs.
  • Client.NetworkInterfaceLimitExceeded - each Lambda function requires an ENI (Elastic Network Interface), and each region has a limit.
  • Client.RequestLimitExceeded - the Lambda functions hit the request rate limit for creating network interfaces (ENIs).

These quotas can be increased, but that only gets you so far. An option to avoid these problems is to export some data into a cache layer in front of Neptune, for example DynamoDB.

Amazon Neptune does not have a built-in user interface. I was expecting something similar to Neo4j:

image.png

Instead, on top of the Neptune cost, Amazon asks you to use a Jupyter notebook through the Neptune workbench. Of course, the workbench resources are billed through Amazon SageMaker, separately from the Neptune billing. Once it is all up and running, I can visualize my graph.

image.png

This should come for free, or at least be available as part of the cluster deployment without paying extra.

The Amazon Neptune graph data is represented as a four-position (quad) element:

  • subject (S)
  • predicate (P)
  • object (O)
  • graph (G)

Each quad is a statement that makes an assertion about one or more resources; the predicate describes the type of relationship or property being defined. Refer to the Neptune Graph Data Model.
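As an illustration (my own hypothetical example, not taken from the AWS documentation), a single RDF statement in N-Quads notation carries all four positions, subject, predicate, object and graph:

<http://example.org/tom_hanks> <http://example.org/acted_in> <http://example.org/forrest_gump> <http://example.org/movies> .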

From Relational to Graph

I have never worked with a graph database, so let me clarify the terminology.

A graph database has the following:

  • Node/Vertex: represents an item in the graph.
  • Edge/Relation: a connection between two nodes.
  • Label: to indicate the type of vertex or edge.
  • Property: key-value pairs.

If you want to see them side by side, it could be like this:

Relational database | Graph
Table               | Label
Row                 | Node/Vertex
Columns             | Property
Relationships       | Edge/Relation

Loading data

To load data, I could create vertices and edges on demand like this (the connection setup for g is sketched after the snippet):

// VERTEX
await g.addV("actor")
  .property("firstname", "Tom")
  .property("lastname", "Hanks")
  .next();

// EDGE
await g.V("tom_hanks_id").as('source')
  .V("forrest_gump_id").as('destination')
  .addE("actor_in")
  .from_('source').to('destination')
  .next();
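These snippets assume an already initialised traversal source g. A minimal sketch of how that could look with the gremlin npm package, assuming a placeholder endpoint (your-neptune-endpoint) and a Node.js runtime:

const gremlin = require("gremlin");
const { DriverRemoteConnection } = gremlin.driver;
const { Graph } = gremlin.structure;

// Placeholder: replace with your Neptune cluster endpoint
const connection = new DriverRemoteConnection(
  "wss://your-neptune-endpoint:8182/gremlin",
  {}
);

// The traversal source used as "g" in the snippets above
const g = new Graph().traversal().withRemote(connection);

async function main() {
  // Example read: count the vertices labelled "actor"
  const result = await g.V().hasLabel("actor").count().next();
  console.log(result.value);
  await connection.close();
}

main().catch(console.error);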

Instead, to load a lot of data in one go, there is the Neptune Bulk Loader.

image.png

  1. Copy the data files to an Amazon Simple Storage Service (Amazon S3) bucket.

  2. Create an IAM role with Read and List access to the bucket.

  3. Create an Amazon S3 VPC endpoint.

  4. Start the Neptune loader by sending a request via HTTP to the Neptune DB instance.

  5. The Neptune DB instance assumes the IAM role to load the data from the bucket.

I created a role using the AWS managed policy from the console, but I suggest restricting resource access to the specific bucket.
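A minimal sketch of such a restricted policy, assuming a hypothetical bucket name (neptune-load-data); the bulk loader only needs to read and list objects:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::neptune-load-data",
        "arn:aws:s3:::neptune-load-data/*"
      ]
    }
  ]
}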

role.png

And I added the role to the Neptune instance from the console.

add iam role.png

To prepare the data, I think you have two options:

  1. Build your own script, for example for CSV processing
  2. ETL Process for Transforming and Loading Data Into Amazon Neptune

Either way, you need to follow the specific formats for the Neptune loader API (a small Gremlin CSV sketch follows the list):

  • CSV format (csv) for property graph / Gremlin
  • CSV format (csv) for property graph / openCypher
  • N-Triples (ntriples) format for RDF / SPARQL
  • N-Quads (nquads) format for RDF / SPARQL
  • RDF/XML (rdfxml) format for RDF / SPARQL
  • Turtle (turtle) format for RDF / SPARQL
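For the Gremlin CSV format, here is a minimal sketch of what the two files could look like, reusing the hypothetical IDs from the earlier snippet. The lines starting with # are only labels for this example, not part of the format, and empty cells are allowed for properties that do not apply to a row:

# vertices.csv
~id,~label,firstname:String,lastname:String,title:String
tom_hanks_id,actor,Tom,Hanks,
forrest_gump_id,movie,,,Forrest Gump

# edges.csv
~id,~from,~to,~label
e1,tom_hanks_id,forrest_gump_id,actor_in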

Once the data is in the S3 bucket, I need to call the Bulk Load API. To do so, I created an EC2 instance in the same security group as Neptune and accessed it using AWS Systems Manager Session Manager, so I added an IAM role to my EC2 instance:

AWS Systems Manager Session Manager.png

And with the instance running, I can select it and click connect:

step1.png

step2.png

Once inside, I run this command:

curl -X POST \
    -H 'Content-Type: application/json' \
    https://your-neptune-endpoint:8182/loader -d '
    {
      "source" : "s3://bucket-name/object-key-name",
      "format" : "csv",
      "iamRoleArn" : "arn:aws:iam::account-id:role/role-name",
      "region" : "region",
      "failOnError" : "FALSE",
      "parallelism" : "MEDIUM",
      "updateSingleCardinalityProperties" : "FALSE",
      "queueRequest" : "TRUE"
    }'


This will start the process. Keep in mind that files of almost 1 GB could take hours (I think it depends on the instance size).

The previous command will return a response like:

{
    "status" : "200 OK",
    "payload" : {
        "loadId" : "635798ae-b306-4d56-a1d3-165dc3d56007"
    }
}

And you can use the loadId to see the progress:

curl -G 'https://your-neptune-endpoint:8182/loader/635798ae-b306-4d56-a1d3-165dc3d56007'

Availability

The Neptune cluster volume spans multiple Availability Zones in a single AWS Region, and each Availability Zone contains a copy of the cluster volume data. If read replicas are present and a failure happens, the service should automatically restore availability in less than 2 minutes. If there are no read replicas, the primary instance is recreated, which could take up to 10 minutes. Neptune automatically backs up the cluster volume and retains the backups for the backup retention period. Backups are continuous and incremental, so I can quickly restore to any point within the retention period, which can be set between 1 and 35 days. In addition, I can take manual snapshots and copy them across regions if needed.
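A hedged sketch of the manual snapshot part with the AWS CLI, assuming hypothetical names (my-neptune-cluster, my-manual-snapshot) and placeholder account/region values; Neptune mirrors the RDS-style snapshot commands, and encrypted snapshots additionally need a KMS key for cross-region copies:

# Take a manual snapshot of the cluster
aws neptune create-db-cluster-snapshot \
  --db-cluster-identifier my-neptune-cluster \
  --db-cluster-snapshot-identifier my-manual-snapshot

# Copy it to another region (run against the destination region)
aws neptune copy-db-cluster-snapshot \
  --source-db-cluster-snapshot-identifier arn:aws:rds:eu-west-1:123456789012:cluster-snapshot:my-manual-snapshot \
  --target-db-cluster-snapshot-identifier my-manual-snapshot-copy \
  --region eu-central-1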

There is more

Amazon Neptune comes with many features, and I still need to check each of them, but on my radar I have the following:

  • Export: export data to S3.
  • Streams: capture changes to a graph as they occur.
  • Full-text search: based on Amazon OpenSearch Service and Lucene query syntax.

Export: export data to S3

According to the AWS guidelines, there are several ways to export data from a Neptune DB cluster:

  • For small amounts of data, use the results of a query or queries
  • There is also a powerful and flexible open-source tool for exporting Neptune data, namely neptune-export

Because I am new to the Neptune journey, I trust AWS for best practices, so I followed the instructions and deployed the Neptune Export AWS CloudFormation template.

After the Neptune-Export installation is complete, I can see in my account:

  • Neptune Export API (APIGW)
  • AWS Batch

To start the export, I run this command:

curl \
  (your NeptuneExportApiUri) \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{
        "command": "export-pg",
        "outputS3Path": "s3://(your Amazon S3 bucket)/neptune-export",
        "params": { "endpoint": "(your Neptune endpoint DNS name)" }
      }'

I need to allow connectivity from an AWS Batch to my Neptune Cluster.

This means I need to attach the NeptuneExportSecurityGroup created by the AWS CloudFormation stack as inbound rules to the security group of my Neptune Cluster.

Without this, you will get this error:

java.util.concurrent.TimeoutException: Timed out while waiting for an available host - check the client configuration and connectivity to the server if this message persists

I had imported around 100M nodes because, as usual, I do not test AWS services with a hello-world scenario, and unsurprisingly, I got another error:

An error occurred while counting all nodes. Elapsed time: 121 seconds
java.util.concurrent.CompletionException: org.apache.tinkerpop.gremlin.driver.exception.ResponseException: {"code":"TimeLimitExceededException","detailedMessage":"A timeout occurred within the script during evaluation.", ...


You need to go into the Neptune console, under Parameter groups, and change the neptune_query_timeout parameter from the default 2 minutes to something much higher (there is also a CLI alternative, sketched below). Then you run the export again and hope for the best, but maybe you will find another error:

An error occurred while writing all nodes as CSV files. Elapsed time: 1200 seconds

And so on until the export works.
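Instead of clicking through the console every time, the neptune_query_timeout change can also be scripted. A hedged sketch with the AWS CLI, assuming a hypothetical custom parameter group (my-neptune-cluster-params) and a 10-minute timeout; the value is in milliseconds, and depending on your setup the parameter may live in the DB parameter group rather than the cluster one:

aws neptune modify-db-cluster-parameter-group \
  --db-cluster-parameter-group-name my-neptune-cluster-params \
  --parameters "ParameterName=neptune_query_timeout,ParameterValue=600000,ApplyMethod=immediate"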

Streams: capture changes to a graph as they occur

Why export all the data in one go when you can do the same with Streams?
That is what I thought, and this was my next stop after the export experience.

I was expecting out-of-the-box integration to services like:

  • AWS Lambda
  • Amazon Kinesis

It turned out that Neptune Streams is nothing like that, and in reality, you need to poll the data out. For example, suppose you want to integrate with AWS Lambda. In that case, Neptune provides a polling framework through a CloudFormation template. This framework allows you to author a stream handler and register it with a Lambda function provided by the polling framework. Furthermore, the framework uses an AWS Step Functions and DynamoDB-based workflow to schedule the host Lambda's execution and checkpoint the stream processing. For more, refer here.
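Under the hood, the stream is just an HTTP endpoint that you poll. A hedged sketch of a raw call, assuming streams are enabled on the cluster (the neptune_streams parameter set to 1) and a property-graph workload; the exact path can differ by engine version:

curl -s "https://your-neptune-endpoint:8182/propertygraph/stream?iteratorType=TRIM_HORIZON&limit=10"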

I did not try the polling framework because it would be another Hello World, and I bet it does not cover something; for sure I would need some customisation, so I would have to implement my own version of the polling framework.

Full-text search

Neptune integrates with Amazon OpenSearch Service, but also here it is not out of the box: you need to go through a lot of configuration and enable Streams to make it work.

Keep in mind that there are two clusters to maintain (Neptune and OpenSearch), increasing the costs.

Each document in OpenSearch corresponds to a node in the graph, including information about the edges, labels and properties.

Once it is all up and running, I need to change my full-text queries to use Apache Lucene query syntax, for example:

g.withSideEffect("Neptune#fts.endpoint", "es_endpoint")
  .withSideEffect("Neptune#fts.queryType", "query_string")
  .V()
  .has("*", "Neptune#fts predicates.name.value:\"Jane Austin\" AND entity_type:Book")

It is essential to read the section Full-Text-Search Query Execution in Amazon Neptune to avoid making too many OpenSearch calls.

Conclusion

In this post, I have described the basics. As I learn more, I will write more, especially about the extra features. As a serverless enthusiast, it feels like a step back because of all the setup required to work with Neptune: for example, configuring a Jupyter notebook to see your graph in a UI. Then, if you want full-text search, you have another cluster with the OpenSearch Service, and of course, I need to configure the VPC, subnets and security groups. I would call it a legacy world, but Neptune seems a powerful tool to solve complex networking and recommendation use cases. I will for sure learn more and post more about my experience.
Originally published in Feb 2022 on:
https://dfrasca.hashnode.dev/amazon-neptune
https://dfrasca.hashnode.dev/amazon-neptune-part-2

UPDATE

I have decided to revive those articles after the recent announcement:
https://aws.amazon.com/about-aws/whats-new/2022/10/amazon-neptune-serverless-generally-available/
As of today, Nov 2022, this release has some gaps:
🚩No CloudFormation support
🚩Only a few regions
🚩Still requires a VPC
🚩Doesn't scale to zero

It is a step forward in the serverless direction, but a genuinely serverless service should scale to zero, and I should only pay when I use it.
