
DynamoDB import from S3

I recently attended the AWS ANZ Database Roadshow 2022 at the Sydney AWS office. One of the sessions I was looking forward to was about DynamoDB, and as expected, this session, delivered by one of the Solutions Architects, was the highlight of the event.

And every year, I look forward to Jeff Barr's blog about Amazon Prime Day and the stats/metrics for the various AWS services that power the Amazon site to support the huge traffic on this day. DynamoDB powers multiple high-traffic Amazon properties and systems including Alexa, the Amazon.com sites, and all Amazon fulfilment centers. Over the course of Prime Day, these sources made trillions of calls to the DynamoDB API. DynamoDB maintained high availability while delivering single-digit millisecond responses and peaking at 105.2 million requests per second.

So I diligently follow all the updates related to DynamoDB and its features. One such feature, recently announced, is DynamoDB import from S3. This is a fully managed feature that doesn't require writing code or managing infrastructure. I wanted to explore this feature and get some hands-on experience.

Before this feature was announced, options for bulk importing data into DynamoDB were very limited. Such pipelines required building and operating custom data loaders on a fleet of virtual instances, along with monitoring and exception handling.

DynamoDB import from S3 helps you bulk import terabytes of data from Amazon S3 into a new DynamoDB table, with no code or servers required. The data in S3 should be in CSV, DynamoDB JSON or ION format, with GZIP or ZSTD compression, or no compression. Each record in the S3 data should have a Partition Key and, optionally, a Sort Key to match the key schema of the target table.
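As a rough sketch of what the source data can look like, here is one record expressed both as CSV lines and as a DynamoDB JSON line (the `show_id`/`title` attribute names come from the dataset used later in this post; the values are illustrative):

```python
import json

# A single catalogue record as it might appear in the source file.
record = {"show_id": "s1", "title": "Dick Johnson Is Dead", "type": "Movie"}

# CSV form: a header row plus one data row.
csv_lines = [",".join(record.keys()), ",".join(record.values())]

# DynamoDB JSON form: one "Item" object per line, with typed attribute values
# ("S" marks a string attribute).
ddb_json_line = json.dumps({"Item": {k: {"S": v} for k, v in record.items()}})

print(csv_lines[0])   # show_id,title,type
print(ddb_json_line)
```

Whichever format you choose, the attribute names in each record must line up with the key schema you define for the new table.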

If any errors are encountered while parsing or importing the data, a log entry is created for each error in Amazon CloudWatch Logs. If the number of errors exceeds 10,000, the logging stops but the import continues.

Another important thing to note is that the DynamoDB import from S3 feature does not consume any write capacity units, so you don't need to provision additional capacity when creating the new table.

For testing this feature, I downloaded a dataset from Kaggle. This dataset consists of listings of all the movies and TV shows available on Netflix, along with details such as cast, directors, ratings, release year and duration, and it has close to 9,000 records.

Now let us use this feature to import the above dataset into a DynamoDB table. I already have an S3 bucket called dynamodb-import-s3-demo, and the dataset CSV file is uploaded under the folder path /netflix-shows-movies as shown below:

Image description

From the dataset, I will use the columns title and show_id as the Partition Key and Sort Key respectively for the DynamoDB table. Below is a snapshot of the dataset being used.

Image description

Step 1: From the AWS Console, I choose the Imports from S3 option under the DynamoDB service.

Image description

Step 2: Click the Import from S3 button to navigate to the import options.

  • In the S3 URL field, enter the path to the source S3 bucket and the prefix in URI format.
  • Select **This AWS account** as the bucket owner.
  • Select the remaining fields as shown in the below image and click Next:

Image description

Step 3: On the next screen, choose Destination table - new table:

  • Table Name - enter a name for the DynamoDB table.
  • Partition key - as mentioned above, enter title.
  • Sort key - as mentioned above, enter show_id.
  • For Table Settings, leave Default settings selected. The DynamoDB table will be created with default RCUs and WCUs; as mentioned earlier, the import process will not consume any of the table's capacity.
  • Choose Next.

Image description

Image description
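The same import can also be started without the console, via the ImportTable API. Below is a minimal sketch of the parameters a boto3 `import_table` call would take, mirroring the console choices above; the table name `netflix_titles` is hypothetical, and the actual call is left commented out since it needs AWS credentials and would kick off a real import:

```python
# Parameters for DynamoDB's ImportTable API, mirroring the console walkthrough.
import_params = {
    "S3BucketSource": {
        "S3Bucket": "dynamodb-import-s3-demo",
        "S3KeyPrefix": "netflix-shows-movies/",
    },
    "InputFormat": "CSV",
    "InputCompressionType": "NONE",
    "TableCreationParameters": {
        "TableName": "netflix_titles",  # hypothetical name for the new table
        "KeySchema": [
            {"AttributeName": "title", "KeyType": "HASH"},     # partition key
            {"AttributeName": "show_id", "KeyType": "RANGE"},  # sort key
        ],
        "AttributeDefinitions": [
            {"AttributeName": "title", "AttributeType": "S"},
            {"AttributeName": "show_id", "AttributeType": "S"},
        ],
        "BillingMode": "PAY_PER_REQUEST",
    },
}

# Assumed usage (requires AWS credentials):
# import boto3
# response = boto3.client("dynamodb").import_table(**import_params)
# print(response["ImportTableDescription"]["ImportArn"])
```

The returned ImportArn identifies the job, which is handy for checking its status later.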

Step 4: Review the details and click Import

Image description

Image description

Step 5: An import job is created. It takes some time to complete the import; monitor the status of the job until it moves to Complete.

Image description
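Instead of refreshing the console, a small polling helper can watch the job. This sketch takes any status-returning callable so it can be tested without AWS; in practice you might pass a lambda wrapping `boto3.client("dynamodb").describe_import(ImportArn=...)` (an assumption on my part, requiring credentials and a real ImportArn):

```python
import time

def wait_for_import(get_status, poll_seconds=30, max_polls=120):
    """Poll get_status() until the import job reaches a terminal state."""
    terminal = {"COMPLETED", "FAILED", "CANCELLED"}
    for _ in range(max_polls):
        status = get_status()
        if status in terminal:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("import did not finish within the polling window")

# Assumed usage with boto3:
# client = boto3.client("dynamodb")
# final = wait_for_import(
#     lambda: client.describe_import(ImportArn=import_arn)
#                   ["ImportTableDescription"]["ImportStatus"]
# )
```

Keeping the AWS call behind a plain callable makes the loop easy to unit-test with a stub.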

The dataset has 8,808 records, and 8,807 of those were successfully imported. One record failed to import, and the failure was logged in CloudWatch Log groups, as shown below:

Image description

Image description

Below are the records that were imported into the DynamoDB table as items.

Image description

Common errors include syntax errors, formatting issues, and records that are missing the Partition Key or Sort Key. Please refer to the Validation errors section in the Developer Guide for more details.
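Since missing key values are a common cause of import failures, it can be worth scanning the CSV before uploading it. A stdlib-only sketch that flags rows missing the `title` or `show_id` values used in this walkthrough:

```python
import csv
import io

def find_invalid_rows(csv_text, key_columns=("title", "show_id")):
    """Return (line_number, row) pairs where any key column is empty or absent."""
    bad = []
    # Data starts on line 2 of the file, after the header row.
    for line_no, row in enumerate(csv.DictReader(io.StringIO(csv_text)), start=2):
        if any(not (row.get(col) or "").strip() for col in key_columns):
            bad.append((line_no, row))
    return bad

# Illustrative sample: the second data row has an empty title.
sample = "show_id,title,type\ns1,Dick Johnson Is Dead,Movie\ns2,,TV Show\n"
print(find_invalid_rows(sample))  # flags line 3
```

Catching these rows up front is cheaper than diagnosing them from CloudWatch Logs after the import has run.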

One limitation I see with this feature is that data can only be imported into a new table that is created during the import process; already existing DynamoDB tables cannot be the target of an import.

Cost-wise, the DynamoDB import from S3 feature is billed on the size of the source data processed, which typically works out much cheaper than paying normal write costs to load the same data with a custom solution.

Thanks for reading this blog. Please share your comments and feedback. It helps me to learn and grow.
