DataSync is a powerful tool for moving data between AWS storage services such as S3, EFS, and FSx. However, there is a catch: as of November 2021, a scheduled task can run at most once per hour, and you cannot create a custom cron expression for a shorter interval such as */5 * * * *. My guess is that this feature was designed for data warehousing, not for active synchronization.
My challenge started when I had to read some XML files using RDS for SQL Server. RDS for SQL Server can read files from S3 natively, but my files came from several microservices running in Fargate that only have access to EFS as a volume (for S3, you must use a third-party plugin that requires your container to run as privileged, and this is not authorized). These files came from external services at different times of the day and amounted to several gigabytes to transfer.
In the beginning, I tried to find a way to read EFS from SQL Server, but it didn't work. RDS has no option to read EFS because it runs on Windows, and there is no Linux option available yet that could potentially give us access to EFS.
After several failed attempts, I created a workaround that involves:
- A DataSync task that defines the source and destination and synchronizes the data.
- A Lambda function for running the task.
- An EventBridge rule for triggering the Lambda function every 5 minutes.
Configure your data source (EFS, for instance):
Choose the destination (S3, for instance):
Configure what you want to move.
Review your new task and create it.
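If you prefer to script the task instead of using the console, the same setup can be sketched with boto3. This is a minimal sketch, not the method used in this post: it assumes the EFS and S3 locations already exist in DataSync and that you know their ARNs, and the helper names build_task_request and create_sync_task are hypothetical. The transfer options shown are one reasonable choice, not a requirement.

```python
def build_task_request(source_location_arn, destination_location_arn, name):
    """Assemble the keyword arguments for DataSync's create_task call."""
    return {
        "SourceLocationArn": source_location_arn,
        "DestinationLocationArn": destination_location_arn,
        "Name": name,
        "Options": {
            # Copy only data that differs between source and destination.
            "TransferMode": "CHANGED",
            # Verify only the files that were transferred in this run.
            "VerifyMode": "ONLY_FILES_TRANSFERRED",
        },
    }


def create_sync_task(source_location_arn, destination_location_arn, name, client=None):
    """Create the DataSync task and return its ARN."""
    if client is None:
        # Imported lazily so the helpers can be exercised with a stub client.
        import boto3
        client = boto3.client("datasync")
    request = build_task_request(source_location_arn, destination_location_arn, name)
    response = client.create_task(**request)
    return response["TaskArn"]
```

The last segment of the returned ARN is the task ID that the Lambda function needs later.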
This is the Python script that I wrote:
import boto3
client = boto3.client('datasync', region_name='YOUR_REGION')
response = client.start_task_execution(
    TaskArn='arn:aws:datasync:YOUR_REGION:YOUR_USER_ID:task/YOUR_TASK_ID')
- YOUR_REGION is the region where you want to run the task, such as eu-west-1.
- YOUR_USER_ID is the ID of the AWS account that owns the task.
- YOUR_TASK_ID is the ID of the task created in DataSync.
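For reference, the three placeholders above combine into a task ARN, and the call needs to live inside a Lambda handler. The following is a minimal sketch of what a complete function could look like, not necessarily the exact script used here: the helper names build_task_arn and start_sync are hypothetical, and the placeholders must be replaced with your own values before deploying.

```python
def build_task_arn(region, account_id, task_id):
    """Compose a DataSync task ARN from its three variable parts."""
    return f"arn:aws:datasync:{region}:{account_id}:task/{task_id}"


def start_sync(region, account_id, task_id, client=None):
    """Start one execution of the task and return the execution ARN."""
    if client is None:
        # Imported lazily so the helper can be exercised with a stub client.
        import boto3
        client = boto3.client("datasync", region_name=region)
    task_arn = build_task_arn(region, account_id, task_id)
    response = client.start_task_execution(TaskArn=task_arn)
    return response["TaskExecutionArn"]


def lambda_handler(event, context):
    """Entry point invoked by the EventBridge rule."""
    # Placeholders: replace with your region, account ID, and task ID.
    execution_arn = start_sync("YOUR_REGION", "YOUR_USER_ID", "YOUR_TASK_ID")
    return {"TaskExecutionArn": execution_arn}
```

The Lambda execution role also needs permission to call datasync:StartTaskExecution on the task; depending on your setup, additional permissions may be required.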
Create a new rule that runs on your desired schedule.
Create a new rule:
Configure your schedule:
Choose your Lambda function:
Review your new rule and create it.
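The console steps above can also be approximated in code. This is a minimal sketch under a few assumptions: the Lambda function already exists and you know its ARN, the rule name contains only letters, digits, hyphens, or underscores (it is reused in the permission statement ID), and the helper names rate_expression and schedule_lambda and the target ID "datasync-trigger" are hypothetical. It uses a rate expression instead of cron, since a fixed interval is all that is needed here.

```python
def rate_expression(minutes):
    """Build an EventBridge rate expression such as 'rate(5 minutes)'."""
    if minutes < 1:
        raise ValueError("minutes must be at least 1")
    # EventBridge expects the singular unit when the value is 1.
    unit = "minute" if minutes == 1 else "minutes"
    return f"rate({minutes} {unit})"


def schedule_lambda(rule_name, function_arn, minutes=5, events=None, lambda_client=None):
    """Create a scheduled rule that invokes the Lambda function; return the rule ARN."""
    if events is None or lambda_client is None:
        # Imported lazily so the helper can be exercised with stub clients.
        import boto3
        events = events or boto3.client("events")
        lambda_client = lambda_client or boto3.client("lambda")

    # 1. Create (or update) the scheduled rule.
    rule = events.put_rule(
        Name=rule_name,
        ScheduleExpression=rate_expression(minutes),
        State="ENABLED",
    )
    rule_arn = rule["RuleArn"]

    # 2. Allow EventBridge to invoke the function through this rule.
    lambda_client.add_permission(
        FunctionName=function_arn,
        StatementId=f"{rule_name}-invoke",
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule_arn,
    )

    # 3. Point the rule at the function.
    events.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "datasync-trigger", "Arn": function_arn}],
    )
    return rule_arn
```

When you add the target through the console, the invoke permission in step 2 is normally added for you; in code it has to be granted explicitly.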
And that's all. Now you can run your DataSync task on the schedule you need.