Moving large amounts of data on AWS

#aws #datasync #devops #sre

Recently, my team was tasked with decoupling all the images from a public monolith that had to become private. At the same time, the images had to stay publicly accessible so that a frontend app could keep consuming them.

Our frontend app reaches the monolith's API through an API gateway. The gateway is in the same VPC as the monolith that became private, so it can still reach it. My team's broader strategy is to carefully break these large apps down into microservices and headless frontends.

To achieve our goal, we decided to move over 50 GB of images from an EFS file system into an S3 bucket and update the image references in the monolith's database. That way, we would end up with a private app and public images.
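To give an idea of what updating an image reference could look like, here is a minimal sketch that turns a file path under the media directory into an S3 URL. It reuses the media path and bucket name from the sync command shown later in this article; the region, the helper name media_path_to_s3_url, and the assumption that object keys mirror the relative file paths are mine, so adjust them to your setup.

```shell
# Hypothetical helper: map a file under the media directory to the URL the
# same file would have in the bucket, assuming keys mirror relative paths.
MEDIA_ROOT="/var/www/html/app/media"   # media directory on the server
BUCKET="app-production-media"          # destination bucket
REGION="us-east-1"                     # assumption: use your bucket's region

media_path_to_s3_url() {
  # Strip the media root prefix so the remainder becomes the object key.
  key="${1#"$MEDIA_ROOT"/}"
  printf 'https://%s.s3.%s.amazonaws.com/%s\n' "$BUCKET" "$REGION" "$key"
}
```

Calling media_path_to_s3_url with a path such as /var/www/html/app/media/products/shoe.jpg gives the virtual-hosted-style URL for the key products/shoe.jpg. Keys containing spaces or special characters would still need URL encoding, which this sketch does not handle.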

At first, my approach was to SSH into the server that had the EFS file system mounted and use the AWS CLI to sync all the images into the S3 bucket with the following command:

aws s3 sync /var/www/html/app/media s3://app-production-media

This approach works, and if you have been tasked with something similar, it will get you there, but it isn't the fastest. It took us 2 hours to sync 7 GB of data into the S3 bucket. At that pace, syncing all 50 GB would have taken more than 14 hours, and that assumes we didn't hit a timeout error, which we did.
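If you want to run the same back-of-the-envelope estimate for your own transfer, here is a minimal sketch. The function name estimate_hours is made up for illustration, and it assumes the transfer rate stays constant, which is rarely exactly true.

```shell
# Estimate the total transfer time from an observed sample.
# Arguments: total size (GB), sample size (GB), sample duration (hours).
estimate_hours() {
  awk -v total="$1" -v sample="$2" -v hours="$3" \
    'BEGIN { printf "%.1f\n", total / sample * hours }'
}

# Our numbers: 50 GB in total, 7 GB synced in 2 hours.
estimate_hours 50 7 2   # prints 14.3
```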

Keep in mind that the aws s3 command ran on an AWS server inside the AWS network, which is way faster than the internet connection at my office, so those 14 hours are even worse than they sound.

So, how could we move these 50 GB of images in an acceptable amount of time? That's where AWS DataSync comes in. This AWS service is meant to move data between storage systems: AWS storage services such as EFS, FSx, and S3, as well as NFS and SMB file servers and other object storage.

You do it by creating a DataSync task, which can run just once or be scheduled to run periodically. I recommend reading through the AWS DataSync documentation so you get a full view of DataSync's features.

This article shows how to create a task that you run manually. If you have a different use case, treat this article as a starting point and check the AWS documentation for the extra pieces your DataSync task needs.

Provisioning DataSync task

In the AWS Management Console, click the search bar, search for DataSync, and click the service indicated in the image below.

DataSync home page

After clicking it, you will be redirected to the DataSync home page. Notice that it gives you a brief explanation of what DataSync is, how it works, and its benefits and features. Even though this article is straightforward, I recommend reading everything on the DataSync home page. When you're done reading, proceed to the next step by clicking the Tasks option in the left-side menu.

Task dashboard

Once you've accessed the Tasks dashboard, you will see an empty dashboard with a Create Task button on the top-right side of the page. Click on the button to be redirected to the page that will allow you to start filling in the information about the DataSync task you will be creating.
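If you prefer the terminal, the same dashboard information is available through the AWS CLI. This is a minimal sketch, assuming AWS CLI v2 with credentials and a default region already configured; the wrapper name list_datasync_tasks is just for illustration.

```shell
# List the DataSync tasks in the current region as a table.
# An account without tasks returns an empty list, like the empty dashboard.
list_datasync_tasks() {
  aws datasync list-tasks \
    --query 'Tasks[].[Name,Status,TaskArn]' \
    --output table
}
```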

Source location

Here, you must configure the source location from which the data will be copied. As mentioned previously, we will be copying data from EFS into an S3 bucket. Hence, our source location will point to an EFS file system, and below, you can check how to configure the source location.
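For reference, the source location can also be created from the AWS CLI. Treat this as a sketch under assumptions: every ARN below is a placeholder, the function name is hypothetical, and the subdirectory is the path inside the EFS file system, which is not necessarily the path where the server mounts it.

```shell
# Placeholder ARNs: replace them with the values from your own account.
EFS_ARN="arn:aws:elasticfilesystem:us-east-1:111122223333:file-system/fs-0123456789abcdef0"
SUBNET_ARN="arn:aws:ec2:us-east-1:111122223333:subnet/subnet-0123456789abcdef0"
SG_ARN="arn:aws:ec2:us-east-1:111122223333:security-group/sg-0123456789abcdef0"

# $1 = directory inside the EFS file system to copy from (defaults to /).
# Prints the ARN of the new location, which the task needs later.
create_efs_source() {
  aws datasync create-location-efs \
    --efs-filesystem-arn "$EFS_ARN" \
    --subdirectory "${1:-/}" \
    --ec2-config "SubnetArn=$SUBNET_ARN,SecurityGroupArns=$SG_ARN" \
    --query 'LocationArn' --output text
}
```

The subnet and security group tell DataSync how to reach the file system's mount target, so the security group has to allow NFS traffic to the EFS.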

Destination location

After configuring the source location, you will be asked to configure the destination location, which will be an S3 bucket. Make sure to create and configure your bucket in such a way that your application can later read the files. Down below, you can check how to configure the destination location.
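The CLI equivalent for the destination could look like the sketch below. The bucket name comes from the sync command earlier in this article; the role ARN and the function name are hypothetical. DataSync needs an IAM role that it can assume and that is allowed to write to the bucket.

```shell
BUCKET_ARN="arn:aws:s3:::app-production-media"
# Hypothetical role: it must trust datasync.amazonaws.com and grant
# write access to the bucket above.
ROLE_ARN="arn:aws:iam::111122223333:role/datasync-s3-access"

# Prints the ARN of the new S3 location.
create_s3_destination() {
  aws datasync create-location-s3 \
    --s3-bucket-arn "$BUCKET_ARN" \
    --s3-config "BucketAccessRoleArn=$ROLE_ARN" \
    --query 'LocationArn' --output text
}
```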

Configure settings

Once the source and destination locations are configured, you've got to handle general configuration settings such as task name, execution configuration, data transfer configuration, schedule, tags, and logging. Down below, you can check in more detail how to handle these general configurations.

When configuring the Task Logging, you can use a previously created CloudWatch log group, or just hit the Autogenerate button.
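The same general settings map to a single CLI call. This is only a sketch: the task name efs-to-s3-media and the function name are made up, and the two options shown (verify only the transferred files, basic logging) are one reasonable choice, not necessarily the settings we used in the console.

```shell
# $1 = source location ARN
# $2 = destination location ARN
# $3 = CloudWatch log group ARN
# Prints the ARN of the new task.
create_sync_task() {
  aws datasync create-task \
    --name "efs-to-s3-media" \
    --source-location-arn "$1" \
    --destination-location-arn "$2" \
    --cloud-watch-log-group-arn "$3" \
    --options "VerifyMode=ONLY_FILES_TRANSFERRED,LogLevel=BASIC" \
    --query 'TaskArn' --output text
}
```

Leaving out a schedule, as done here, gives you a task that only runs when you start it.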

Review settings

Now that you've configured the task, it is time to review the configurations you've filled in. Read the review page carefully; if anything seems out of order, go back to the step where you found the mistake and fix it. Thereafter, come back to the review page, and once everything is the way you want it to be, you can click on the Create button in the bottom-right corner of the page.
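Once the task exists, you can read its settings back from the CLI for a second review. A minimal sketch, with a hypothetical function name:

```shell
# $1 = task ARN. Prints the main settings of the task as JSON.
review_task() {
  aws datasync describe-task --task-arn "$1" \
    --query '{Name:Name,Source:SourceLocationArn,Destination:DestinationLocationArn,LogGroup:CloudWatchLogGroupArn,Options:Options}' \
    --output json
}
```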

Task start

Finally, the task is created, and now it's time to start it and begin copying the files from EFS to S3. To do that, you just need to click on Start > Start with default.
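Starting the task from the CLI is a one-liner. In this sketch the function name is hypothetical; the call itself starts an execution with the task's saved settings and returns the ARN of that execution.

```shell
# $1 = task ARN. Prints the ARN of the execution that was started.
start_task() {
  aws datasync start-task-execution --task-arn "$1" \
    --query 'TaskExecutionArn' --output text
}
```

Keep the printed execution ARN around, since it is what you use to check on the progress afterwards.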

Task status

Now that the task has started running, you can just watch the Task status until it finishes. In case you want to see more details about the data being copied from EFS to S3, you can click on the Running status under Task status.
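Instead of refreshing the console, you could poll the execution status from a terminal. A minimal sketch, assuming you have the execution ARN at hand; the function name and the 30-second default interval are my own choices.

```shell
# $1 = task execution ARN, $2 = seconds between polls (default 30).
# Returns 0 on SUCCESS, 1 on ERROR, 2 if the status cannot be read.
wait_for_execution() {
  while :; do
    state="$(aws datasync describe-task-execution \
      --task-execution-arn "$1" \
      --query 'Status' --output text)" || return 2
    echo "status: $state"
    case "$state" in
      SUCCESS) return 0 ;;
      ERROR)   return 1 ;;
    esac
    sleep "${2:-30}"
  done
}
```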

Execution status

Once you are watching the execution status, you can check a wide variety of information, including performance and the amount of data transferred. You will notice that at first the execution status will be Launching, but it will change as the data sync process keeps going.
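The same figures are exposed by the CLI. The sketch below pulls the status together with the transfer counters; the function name and the labels in the query are hypothetical, while the fields come from the describe-task-execution output.

```shell
# $1 = task execution ARN. Prints the status and the transfer counters.
execution_progress() {
  aws datasync describe-task-execution --task-execution-arn "$1" \
    --query '{Status:Status,FilesTransferred:FilesTransferred,BytesTransferred:BytesTransferred,EstimatedBytes:EstimatedBytesToTransfer}' \
    --output table
}
```

Comparing BytesTransferred against the estimated total gives a rough idea of how far along the copy is.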