Have you ever tried to schedule an export of DynamoDB data to S3? I mean an automated, recurring task that runs every day, e.g. at 6 AM?
You went to the AWS console only to discover that it limits you to a single one-click export?
I did 😏
Therefore, in this article I'll cover the whole process of exporting AWS DynamoDB data to S3 as a recurring task. Additionally, I'd like my data to be filtered by a secondary index. I'll also answer why and how to do this, and compare the solutions AWS offers.
📔 One side note: I explore universal options, but keep in mind that my table size is below 1 GB.
✨ Let's go! ✨
- Why export data from DynamoDB to S3?
- How-to export data from DynamoDB to S3?
- Comparison table
- Final thoughts on export costs
- What's next?
First things first: Why?
Why export data from DynamoDB to S3?
The AWS documentation lists the benefits of, and reasons for, exporting data from DynamoDB to S3:
- ETL: perform ETL (Extract, Transform, Load) operations on the exported data in S3, and then import the transformed data back into DynamoDB
- Data archiving: retain historical snapshots for audit and compliance requirements
- Data integration: integrate the data with other services and applications
- Data lake: build a data lake in S3, allowing users to perform analytics across multiple data sources using services such as Amazon Athena, Amazon Redshift, and Amazon SageMaker
- Ad-hoc queries: query the data from Athena or Amazon EMR without affecting your DynamoDB capacity
In my case, the BI team asked for a daily snapshot of our DynamoDB table, but only a partial export. So I started the investigation: what are my options?
How to export data from DynamoDB to S3?
At the beginning, I excluded the idea of scanning the table at the Lambda level. Such a solution would be inefficient and costly, and since AWS has dedicated tools for this, it would also be a waste of time.
These are the 3 possible ways in 2023:
- "basic" Export DynamoDB to S3 feature
- AWS Glue Job
- AWS Data Pipeline (to be deprecated)
But before you start, prepare the following.
Requirements:
- Enable Point-in-time recovery (PITR) on the source table. It allows exporting table data from any point in time within the PITR window, up to 35 days. (If you'd rather do this from code, see the sketch after this list.)
- Add an IAM role with permissions to access the DynamoDB table and write to the S3 bucket; allow:
  - ExportTableToPointInTime (DynamoDB)
  - PutObject (S3)
```json
{
  "Effect": "Allow",
  "Action": [
    "dynamodb:ExportTableToPointInTime"
  ],
  "Resource": "*"
},
{
  "Effect": "Allow",
  "Action": [
    "s3:PutObject"
  ],
  "Resource": "*"
}
```
- S3 bucket - create a new bucket or select an existing one
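If you'd rather enable PITR from code than in the console, here is a minimal sketch using the AWS SDK v3 (MyTable is just a placeholder table name):

```typescript
// Minimal sketch: enable point-in-time recovery (PITR) on the source table.
import {
  DynamoDBClient,
  UpdateContinuousBackupsCommand,
} from "@aws-sdk/client-dynamodb";

const client = new DynamoDBClient({});

await client.send(
  new UpdateContinuousBackupsCommand({
    TableName: "MyTable", // hypothetical table name - use your own
    PointInTimeRecoverySpecification: { PointInTimeRecoveryEnabled: true },
  })
);
```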
Built-in Export DynamoDB to S3
The DynamoDB Export to S3 feature is the easiest way to dump table data to S3. It also doesn't run a scan against the whole table, so it is the efficient, cheaper way.
It is a simple, one-click feature in the DynamoDB console that exports the data in either DynamoDB JSON or Amazon Ion text format.
BUT
🚨 there is no way to filter the data before export
HOW-TO: Export DynamoDB → S3
Step-by-step instructions:
- Go to the DynamoDB console and select the table you want to export.
- Go to the "Export table" tab, click the export button and fill in the details:
  - S3 bucket
  - IAM role (created earlier)
  - Format: choose the format for the exported data (DynamoDB JSON or Amazon Ion)
- Start the export process and wait for it to complete.
- Check the S3 bucket to verify that the exported data is available in the specified format (see the sketch below if you want to run this check from code).
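For that last step, a minimal sketch with the S3 SDK v3 lists what landed in the destination bucket (the bucket name and prefix are placeholders):

```typescript
// List the objects the export wrote into the destination bucket/prefix.
import { S3Client, ListObjectsV2Command } from "@aws-sdk/client-s3";

const s3 = new S3Client({});

const { Contents } = await s3.send(
  new ListObjectsV2Command({
    Bucket: "my-export-bucket", // hypothetical bucket name
    Prefix: "exports/",         // hypothetical prefix used for the export
  })
);

for (const object of Contents ?? []) {
  console.log(object.Key, object.Size);
}
```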
Lambda
As you may have noticed, there's no option to schedule a recurring task at the AWS console level.
That's why we need a minimal Lambda function, triggered daily at the specified time (e.g. via an EventBridge rule), that calls ExportTableToPointInTime from the AWS SDK.
AWS SDK v3 DynamoDB Client | ExportTableToPointInTimeCommand
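A minimal sketch of such a handler, assuming the SDK v3 and that the table ARN and destination bucket are passed in as environment variables (TABLE_ARN and EXPORT_BUCKET are hypothetical names chosen for this example):

```typescript
// Lambda handler triggered once a day, e.g. by an EventBridge rule
// with the schedule expression cron(0 6 * * ? *).
import {
  DynamoDBClient,
  ExportTableToPointInTimeCommand,
} from "@aws-sdk/client-dynamodb";

const client = new DynamoDBClient({});

export const handler = async (): Promise<void> => {
  const response = await client.send(
    new ExportTableToPointInTimeCommand({
      TableArn: process.env.TABLE_ARN,     // ARN of the source table
      S3Bucket: process.env.EXPORT_BUCKET, // destination bucket created earlier
      S3Prefix: `exports/${new Date().toISOString().slice(0, 10)}/`, // one prefix per day
      ExportFormat: "DYNAMODB_JSON",       // or "ION"
    })
  );

  // The export runs asynchronously; here we only get its ARN back.
  console.log("Export started:", response.ExportDescription?.ExportArn);
};
```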
Monitoring
We need to configure it ourselves; it's not provided by default. For example, use AWS CloudTrail logs for the table export to enable logging, continuous monitoring, and auditing.
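Besides CloudTrail, you can also poll the export status from code with DescribeExportCommand. A minimal sketch, assuming you stored the ExportArn returned by the export call (here read from a hypothetical EXPORT_ARN environment variable):

```typescript
// Check the status of a previously started export: IN_PROGRESS, COMPLETED or FAILED.
import { DynamoDBClient, DescribeExportCommand } from "@aws-sdk/client-dynamodb";

const client = new DynamoDBClient({});

const { ExportDescription } = await client.send(
  new DescribeExportCommand({ ExportArn: process.env.EXPORT_ARN })
);

console.log(ExportDescription?.ExportStatus, ExportDescription?.FailureMessage ?? "");
```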
Limitations
- Always dumps the whole table data
- Recurring tasks need an extra Lambda that runs once per day
- Task number: up to 300 export tasks, or up to 100 TB of table size, can be exported concurrently. Doc
- Format: DynamoDB JSON format or Amazon Ion text format
Cost
Export to S3 is "free" to set up, as it's part of the DynamoDB service.
We are charged $0.10 per GB exported, plus additional S3 costs for data storage and upload, which vary depending on the region you're in.
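To put that into perspective: for a table like mine (under 1 GB), one export per day is at most about 30 GB per month, so roughly 30 × $0.10 ≈ $3 per month, plus S3 storage for the exported files.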
AWS Glue Jobs
Probably, (right now) it is the best way for data integration, especially when the source needs to stay alive while copying. AWS Glue is flexible, as it allows you to export data not just from DynamoDB, but also from other AWS services.
It is efficient for large datasets because the export feature uses the DynamoDB backup/export functionality (so it doesn't do a scan on the source table). In other words, it performs the Export to S3 (described above) under the hood.
AWS Glue can crawl a DynamoDB table, extract the data into Amazon S3, and perform analysis using SQL queries. Technically, AWS Glue runs jobs in an Apache Spark serverless environment.
📔 Side note: I'm not covering the ETL capabilities of Glue here. I only need data export, but if you plan to use Glue for ETL operations, you may want to create a Glue Data Catalog for your jobs.
HOW-TO: Export AWS Glue Jobs → S3
Step-by-step instructions:
- Go to AWS Glue Jobs:
  - Navigate to AWS Glue Studio
  - Click on the "Jobs" menu
  - Click the "Add job" button to create a new AWS Glue job
- Select the source: the DynamoDB table
- Select the destination: S3
- Confirm with the create button
This opens the Glue Studio editor.
The visual editor guides you through the job's properties, but you need to know what you want to do, because it is a powerful tool full of options.
Configure the job details:
- Data source: the DynamoDB table
  - AWS Glue uses DynamoDB's Export to S3 feature under the hood and creates a temporary S3 bucket
- Data transform: ApplyMapping, plus filtering in SQL
- Data target: set the format (e.g. JSON), then select the S3 bucket as the destination
- Set schedule (see the sketch below if you'd rather do this via the API)
After configuring the AWS Glue job, click the Run Job button to start the export process.
AWS Glue will automatically extract the data from DynamoDB and store it in the specified S3 bucket.
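If you'd rather define the daily schedule programmatically than in Glue Studio, a minimal sketch using the Glue CreateTrigger API could look like this (the trigger and job names are hypothetical; the cron expression fires every day at 6 AM UTC):

```typescript
// Attach a scheduled trigger to an existing Glue job so it runs daily at 6:00 UTC.
import { GlueClient, CreateTriggerCommand } from "@aws-sdk/client-glue";

const glue = new GlueClient({});

await glue.send(
  new CreateTriggerCommand({
    Name: "daily-dynamodb-export-trigger",           // hypothetical trigger name
    Type: "SCHEDULED",
    Schedule: "cron(0 6 * * ? *)",                   // every day at 6 AM UTC
    Actions: [{ JobName: "daily-dynamodb-export" }], // hypothetical Glue job name
    StartOnCreation: true,                           // activate the trigger immediately
  })
);
```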
👉 In my case, I also:
- set the timeout to 8 hours,
- set the number of retries to 3 per day; AWS Glue will automatically restart the job if it fails,
- narrowed down the number of workers from the default 10 to 2 (an experimental decision: the export takes 10 minutes and the total cost is lower than with 10 allocated workers; again, this may vary depending on the size of the input data).
Monitoring
Some logs & monitoring are created by default with the job, which is nice 👍
It's a good idea to add alerts in Amazon CloudWatch that can notify you by email/Slack/any channel you want when your job is failing.
Jobs Cost
$0.44 per DPU-hour + S3 storage
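A rough worked estimate for my setup, assuming the 2 workers are of type G.1X (1 DPU each): a 10-minute daily run is about 2 × 1/6 ≈ 0.33 DPU-hours, i.e. roughly $0.15 per run, or around $4.40 per month, plus S3 storage. Check the worker type and billing minimums of your Glue version before relying on this.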
AWS Data Pipeline [to be deprecated]
✨ I'll quickly go through the main aspects, but without detailed configuration, because I personally skipped this option as it's not worth delving into anymore.
AWS Data Pipeline is being deprecated and will no longer be available after January 1, 2025. AWS recommends alternative solutions.
Please note that Data Pipeline service is in maintenance mode and we are not planning to expand the service to new regions. We plan to remove console access by 04/30/2023.
Unfortunately, AWS Data Pipeline is still often recommended on StackOverflow 😉
AWS Data Pipeline is a more complex service that requires configuration, management, and monitoring of pipelines. It sounds similar to Glue when it comes to functionality (flexible - many data sources, large scale). So what's the difference?
Disadvantages
Deprecated, sure 🙈
Also, this approach is a bit old-fashioned, as it utilises EC2 instances and triggers an EMR cluster to perform the export activity. If the instance and cluster configuration are not properly provided in the pipeline, it could cost you... 💸 dearly 💸
HOW-TO: Export AWS Data Pipeline → S3
To export a DynamoDB table, we start with the AWS Data Pipeline console to create a new pipeline. The pipeline launches an Amazon EMR cluster to perform the actual export. Amazon EMR reads the data from DynamoDB, and writes the data to the export file in an Amazon S3 bucket.
AWS Data Pipeline — manages the import/export workflow for you.
Amazon S3 — contains the data that you export from DynamoDB, or import into DynamoDB.
Amazon EMR — runs a managed Hadoop cluster to perform reads and writes between DynamoDB and Amazon S3 (the cluster configuration is one m3.xlarge leader node and one m3.xlarge core node)
Pipeline Cost
It charges for pipeline creation, execution, and storage:
- $0.06 per low-frequency task (e.g. a daily activity)
- $1.00 per high-frequency task (e.g. an hourly activity)
- Amazon EMR (+EC2) cost for 120 minutes: about $17 per month
- additional S3 costs
(Diagram: overview of the steps to export a table from Amazon DynamoDB to an Amazon S3 bucket via AWS Data Pipeline.)

Comparison table
| | DynamoDB Export to S3 | AWS Glue Job | AWS Data Pipeline |
|---|---|---|---|
| Use for | Data transfer | ETL, Data Catalog, AWS Glue Crawlers | Data transfer, transform and process |
| Serverless | Yes | Yes | No (the default setting manages the lifecycle of Amazon EMR clusters and EC2 instances to execute jobs) |
| Allows filters / mapping | No | Yes | Yes |
| Cost | $0.10/GB + S3 storage | $0.44/DPU-hour + S3 storage | $1.00/high-freq task, $0.06/low-freq task + Amazon EMR (+EC2) + S3 storage |
| Data replication | Full table; export from a specific point in time | Full table; export from a specific point in time; incremental | Full table; incremental replication via timestamp |
| Output format | JSON, Ion (json.gz, compressed) | JSON, Ion, CSV, Parquet, XML, Avro, grokLog, ORC (compression optional) | CSV, JSON, custom formats |
Final thoughts on export costs 💰
As an AWS developer, you should have a little bit of an accountant in your heart too 🖤.
Try to keep your records as small as possible, and use on-demand pricing wisely. It is ✨ so convenient ✨, I know. While it may not seem expensive, and you don't need to think about scaling, it can sometimes be 4-6 times more expensive per request compared to provisioned capacity. Therefore, it's better to sit down and calculate before making a final decision.
For me, on-demand is cheaper than provisioned capacity, but please refer to the oldest programmer's answer: "IT DEPENDS".
What's next?
Now your data is in S3? It's time to think about a retention policy 🧹 and when to archive data to Amazon S3 Glacier. But maybe that's a subject for the next post.