DEV Community

Jack Miras

Moving large amounts of data on AWS

Recently, my team was tasked with decoupling all the images from a monolith that used to be public and had to become privately accessible. At the same time, we were asked to keep the images publicly accessible so they could be consumed by a frontend app.

Our frontend app accesses the monolith API through an API Gateway. The Gateway is in the same VPC as the monolith that became private, and therefore it can reach it. In my team, this is a strategy to carefully break these major apps down into microservices and headless frontends.

To achieve our goal, we decided to move over 50 GB of images from an EFS storage disk into an S3 bucket and update the image references in the monolith database. That way, we would end up with a private app with public images.

At first, my approach was to connect through SSH to the server attached to the EFS and use the AWS CLI to sync all the images into the S3 bucket with the following command.

aws s3 sync /var/www/html/app/media s3://app-production-media

This approach works, and if you have been tasked with something similar, you will achieve your goal, but it isn't the fastest. It took us 2h to sync 7 GB of data into the S3 bucket. At that pace, we would have spent more than 14h synchronizing all the images, and that is assuming we didn't get a timeout error, which we did.

Keep in mind that the aws s3... command was executed on an AWS server inside the AWS network, which is way faster than the internet connection at my office, so these 14h weigh even more given the circumstances.
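As a quick sanity check on that estimate, here is a minimal sketch of the arithmetic. It assumes the throughput stays constant for the whole transfer and rounds the total down to 50 GB. The `estimate_hours` helper is just an illustrative name, not part of any AWS tooling.

```shell
# Observed figures from the sync attempt: 7 GB copied in 2 hours.
copied_gb=7
elapsed_hours=2
# Assumption: the total is rounded down to 50 GB ("over 50 GB" in practice).
total_gb=50

# Project the total duration at the observed rate (constant throughput assumed).
# $1: GB copied so far, $2: hours elapsed, $3: total GB to copy
estimate_hours() {
  awk -v c="$1" -v h="$2" -v t="$3" 'BEGIN { printf "%.1f\n", t / (c / h) }'
}

estimate_hours "$copied_gb" "$elapsed_hours" "$total_gb"
```

At 3.5 GB per hour, 50 GB works out to a bit over 14 hours, which matches the figure above.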

So, how could we move these 50 GB of images in an acceptable amount of time? That's where AWS DataSync comes in. It is an AWS service meant to move data between commonly used storage systems such as EFS, FSx, S3, NFS, object storage, and SMB.

You do it by creating a DataSync task that can run only once or be scheduled to run from time to time. My recommendation is that you read through the AWS DataSync documentation, so you can get a full view of DataSync's features.

This article shows how to create a task that runs manually. If you have a different use case, this article will serve you only as a base, and you will have to check the AWS docs to implement the extra aspects of your DataSync task.

Provisioning DataSync task

In the AWS Management Console, click on the search bar, search for DataSync, and click on the service indicated in the image below.


DataSync home page

After clicking on the service, you will be redirected to the following page. Notice that here you have access to a brief explanation of what DataSync is, how it works, and its benefits and features. Even though this article is straightforward, my recommendation is that you read everything on the DataSync home page. After you finish reading, you can proceed to the next step by clicking on the Tasks option in the left side menu.


Task dashboard

Once you've accessed the Tasks dashboard, you will see an empty dashboard with a Create Task button on the top-right side of the page. Click the button to be redirected to the page where you will start filling in the information about the DataSync task you are creating.


Source location

Here you must configure the source location from which the data will be copied. As mentioned, we will be copying data from an EFS disk into an S3 bucket. Hence, our source location will point to an EFS disk, and below you can check how to configure the source location.
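If you would rather script this step than click through the console, the source location could be sketched with the AWS CLI roughly like this. The `create_efs_location` wrapper is a hypothetical helper, and every ARN and the subdirectory are placeholders you have to supply from your own account. None of these values come from our setup.

```shell
# Register an EFS file system as a DataSync source location and print its ARN.
# All arguments are placeholders for values from your own AWS account.
#   $1: EFS file system ARN
#   $2: subdirectory inside the file system to copy from
#   $3: ARN of the subnet DataSync should use to reach the mount target
#   $4: ARN of a security group that is allowed to access the file system
create_efs_location() {
  aws datasync create-location-efs \
    --efs-filesystem-arn "$1" \
    --subdirectory "$2" \
    --ec2-config "SubnetArn=$3,SecurityGroupArns=$4" \
    --query 'LocationArn' --output text
}

# Usage (placeholders, not real values):
#   source_arn="$(create_efs_location "<efs-arn>" "<subdirectory>" "<subnet-arn>" "<sg-arn>")"
```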


Destination location

After configuring the source location, you will be asked to configure the destination location, which will be an S3 bucket. Make sure to create and configure your bucket in a way that your application can later read the files. Down below, you can check how to configure the destination location.
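The same step could be sketched with the AWS CLI as follows. The `create_s3_location` wrapper is a hypothetical helper, and the IAM role ARN is a placeholder for a role that lets DataSync write to your bucket. The bucket name in the usage comment is the one from the sync command earlier in this article.

```shell
# Register an S3 bucket as a DataSync destination location and print its ARN.
#   $1: bucket name
#   $2: ARN of an IAM role DataSync can assume to access the bucket (placeholder)
create_s3_location() {
  aws datasync create-location-s3 \
    --s3-bucket-arn "arn:aws:s3:::$1" \
    --s3-config "BucketAccessRoleArn=$2" \
    --query 'LocationArn' --output text
}

# Usage (the role ARN is a placeholder):
#   destination_arn="$(create_s3_location "app-production-media" "<bucket-access-role-arn>")"
```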


Configure settings

Once the source and destination locations are configured, you have to handle the general configuration settings such as task name, execution configuration, data transfer configuration, schedule, tags, and logging. Below, you can check in more detail how to handle these general configurations.

When configuring the Task Logging, you can use a previously created CloudWatch log group, or just hit the Autogenerate button.
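For reference, a minimal CLI sketch of creating the task could look like the one below. It only sets the name, the two locations, and the log group, and leaves every other setting at whatever the service picks by default, which may differ from what the console wizard preselects. The `create_sync_task` wrapper and all of its arguments are placeholders.

```shell
# Create a DataSync task between two existing locations and print its ARN.
# All arguments are placeholders for values from your own AWS account.
#   $1: task name
#   $2: source location ARN (the EFS location)
#   $3: destination location ARN (the S3 location)
#   $4: CloudWatch log group ARN used for task logging
create_sync_task() {
  aws datasync create-task \
    --name "$1" \
    --source-location-arn "$2" \
    --destination-location-arn "$3" \
    --cloud-watch-log-group-arn "$4" \
    --query 'TaskArn' --output text
}

# Usage (placeholders, not real values):
#   task_arn="$(create_sync_task "<task-name>" "<source-arn>" "<destination-arn>" "<log-group-arn>")"
```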


Review settings

Now that you've configured the task, it is time to review the configuration you've filled in. Read the review page carefully. If anything seems out of order, go back to the step where you found the mistake and fix it, then come back to the review page. Once everything is the way you want it to be, you can click the Create button in the bottom-right corner of the page.


Task start

Finally, the task has been created, and now it's time to start it and kick off the process of copying the files from EFS to S3. To accomplish that, you just need to click on Start > Start with default.
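Starting the task can also be done from the command line. This is a minimal sketch that starts an execution with the task's existing settings. The `start_sync_task` wrapper is a hypothetical helper, and the task ARN is a placeholder.

```shell
# Start an execution of an existing DataSync task and print the execution ARN.
#   $1: task ARN (placeholder for the ARN of the task you created)
start_sync_task() {
  aws datasync start-task-execution \
    --task-arn "$1" \
    --query 'TaskExecutionArn' --output text
}

# Usage (placeholder, not a real value):
#   execution_arn="$(start_sync_task "<task-arn>")"
```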


Task status

Now that the task has started running, you can just watch the Task status until the task finishes. In case you want to see more details about the data being copied from EFS to S3, you can click on the Running status under Task status.


Execution status

Once you are watching the execution status, you can check a wide variety of information, including performance and the amount of data transferred. You will notice that at first the execution status will be Launching, but it will change as the data sync process moves forward.
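If you prefer a terminal over the console, the status can be polled with the AWS CLI. The sketch below is a hypothetical helper, `wait_for_execution`, that prints each status it sees and stops when the execution reports SUCCESS or ERROR. Treating only those two statuses as final is an assumption of this sketch, and the execution ARN is a placeholder.

```shell
# Poll a DataSync task execution until it reports SUCCESS or ERROR.
#   $1: task execution ARN (placeholder)
#   $2: seconds to wait between polls (optional, defaults to 30)
# Returns 0 on SUCCESS, 1 on ERROR, 2 if the CLI call itself fails.
wait_for_execution() {
  local execution_arn="$1"
  local interval="${2:-30}"
  local status

  while true; do
    status="$(aws datasync describe-task-execution \
      --task-execution-arn "$execution_arn" \
      --query 'Status' --output text)" || return 2
    echo "status: $status"
    case "$status" in
      SUCCESS) return 0 ;;
      ERROR)   return 1 ;;
    esac
    sleep "$interval"
  done
}

# Usage (placeholder, not a real value):
#   wait_for_execution "<task-execution-arn>" 60
```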


Once your DataSync task is finally done, you will have all of your data in an S3 bucket in a matter of minutes instead of 14h.

Happy coding!
