Recently, while working on a project, I came across the task of moving terabytes (1 TB or more) of data from one Amazon S3 bucket to another S3 bucket.
First of all, you can't realistically copy such a large number of objects using the Amazon S3 console. It isn't convenient, and copying that much data manually could take months.
For this particular use case, I chose the “Parallel uploads” option using the AWS Command Line Interface (AWS CLI).
Depending on your use case, you can perform the data transfer between buckets using one of the following options:
Run parallel uploads using the AWS Command Line Interface (AWS CLI)
Use an AWS SDK
Use cross-Region replication or same-Region replication
Use Amazon S3 batch operations
Use S3DistCp with Amazon EMR
Use AWS DataSync
Note: As a best practice, be sure that you're using the most recent version of the AWS CLI. For more information, see Installing the AWS CLI.
You can split the transfer into multiple mutually exclusive operations to improve the transfer time by multi-threading. For example, you can run multiple, parallel instances of aws s3 cp, aws s3 mv, or aws s3 sync using the AWS CLI.
You can create more upload threads while using the --exclude and --include parameters for each instance of the AWS CLI. These parameters filter operations by file name.
Note: The --exclude and --include parameters are processed on the client side. Because of this, the resources of your local machine might affect the performance of the operation.
For example, to copy a large amount of data from one bucket to another where all the file names begin with the text “logs” followed by a date, you can run the following commands on two instances of the AWS CLI. First, run this command to copy the files with names that begin with “logs2019-09-16”:
aws s3 cp s3://samplebucket-logs/ s3://sampledestbucket-logs/test --recursive --exclude "*" --include "logs2019-09-16*" --profile profile1
Note: If you receive errors when running AWS CLI commands, make sure that you’re using the most recent version of the AWS CLI.
Then, run this command to copy the files with names that begin with other dates, for example 2021-04-02 and 2021-04-03:
aws s3 cp s3://samplebucket-logs/ s3://sampledestbucket-logs/logs-audit-april-2021/ --recursive --exclude "*" --include "logs2021-04-02*" --include "logs2021-04-03*" --profile profile1
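Instead of opening several terminals, the same idea can be scripted. The following is a minimal sketch, not a tuned tool: the function name copy_prefixes_in_parallel is hypothetical, and it assumes that the prefixes you pass are mutually exclusive, so that no object is copied twice. It starts one aws s3 cp process per prefix in the background and then waits for all of them.

```shell
# Copy several key-name prefixes in parallel, one `aws s3 cp` process each.
# Usage: copy_prefixes_in_parallel SRC_URI DEST_URI PREFIX [PREFIX...]
copy_prefixes_in_parallel() {
  local src="$1" dest="$2" pids="" pid prefix rc=0
  shift 2
  for prefix in "$@"; do
    # Each background job handles one slice of the key space:
    # exclude everything, then re-include only names starting with the prefix.
    aws s3 cp "$src" "$dest" --recursive \
      --exclude "*" --include "${prefix}*" &
    pids="$pids $!"
  done
  # Wait for every job; report failure if any single copy failed.
  for pid in $pids; do
    wait "$pid" || rc=1
  done
  return "$rc"
}

# Example call (bucket names taken from the commands above):
# copy_prefixes_in_parallel s3://samplebucket-logs/ \
#   s3://sampledestbucket-logs/logs-audit-april-2021/ \
#   logs2021-04-02 logs2021-04-03
```

To use a named profile as in the commands above, you could export AWS_PROFILE=profile1 before calling the function. Keep in mind that every extra process adds load on the machine that runs the script.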
Additionally, you can customize the following AWS CLI configurations to speed up the data transfer:
multipart_chunksize: This value sets the size of each part that the AWS CLI uploads in a multipart upload for an individual file. This setting allows you to break down a larger file (for example, 300 MB) into smaller parts for quicker upload speeds.
Note: A multipart upload requires that a single file is uploaded in not more than 10,000 distinct parts. You must be sure that the chunk size that you set balances the part file size and the number of parts.
max_concurrent_requests: This value sets the number of requests that can be sent to Amazon S3 at a time. The default value is 10. You can increase it to a higher value like 50.
Note: Running more threads consumes more resources on your machine. Make sure that your machine has enough resources to support the maximum number of concurrent requests that you configure.
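As a rough illustration of the two settings, the sketch below computes the smallest chunk size that keeps a file within the 10,000-part limit and then writes both values with aws configure set. The function names min_chunksize_mb and tune_s3_transfer are hypothetical, and the floor of 8 MB is my own choice (it matches the AWS CLI default chunk size), so treat the numbers as a starting point rather than a recommendation.

```shell
# Smallest multipart_chunksize (in MB, rounded up) that keeps a file of
# SIZE_MB within the 10,000-part limit, never going below 8 MB.
# Usage: min_chunksize_mb SIZE_MB
min_chunksize_mb() {
  local size_mb="$1" max_parts=10000 floor_mb=8 chunk
  chunk=$(( (size_mb + max_parts - 1) / max_parts ))   # ceiling division
  if [ "$chunk" -lt "$floor_mb" ]; then
    chunk="$floor_mb"
  fi
  echo "$chunk"
}

# Write both settings to the default profile of the AWS CLI configuration.
# Usage: tune_s3_transfer CHUNK_MB CONCURRENCY
tune_s3_transfer() {
  local chunk_mb="$1" concurrency="$2"
  aws configure set default.s3.multipart_chunksize "${chunk_mb}MB"
  aws configure set default.s3.max_concurrent_requests "$concurrency"
}

# Example: a 500 GB file (512000 MB) and 50 concurrent requests.
# tune_s3_transfer "$(min_chunksize_mb 512000)" 50
```

These settings are saved in the CLI configuration file, so they also apply to later commands that use the same profile.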
Read more about --exclude and --include filters and how to use them: https://docs.aws.amazon.com/cli/latest/reference/s3/index.html#use-of-exclude-and-include-filters
Consider building a custom application using an AWS SDK to perform the data transfer for a very large number of objects. While the AWS CLI can perform the copy operation, a custom application might be more efficient at performing a transfer at the scale of hundreds of millions of objects.
After you set up cross-Region replication (CRR) or same-Region replication (SRR) on the source bucket, Amazon S3 automatically and asynchronously replicates new objects from the source bucket to the destination bucket. You can choose to filter which objects are replicated using a prefix or tag. For more information on configuring replication and specifying a filter, see the Replication configuration overview.
After replication is configured, only new objects are replicated to the destination bucket. Existing objects aren't replicated to the destination bucket. To replicate existing objects, you can run the following cp command after setting up replication on the source bucket:
aws s3 cp s3://samplebucket-logs s3://samplebucket-logs --recursive --storage-class STANDARD
This command copies objects in the source bucket back into the source bucket, which triggers replication to the destination bucket.
Note: It's a best practice to test the cp command in a non-production environment. Doing so allows you to configure the parameters for your exact use case.
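To check whether an object was picked up by replication after the copy, you can look at its replication status. This is a small sketch with a hypothetical helper name, replication_status; it assumes that replication is already configured on the bucket, and it only reads the ReplicationStatus field that aws s3api head-object returns for the object.

```shell
# Print the replication status of one object in the source bucket,
# for example PENDING, COMPLETED, or FAILED.
# Usage: replication_status BUCKET KEY
replication_status() {
  local bucket="$1" key="$2"
  aws s3api head-object --bucket "$bucket" --key "$key" \
    --query ReplicationStatus --output text
}

# Example call (the key name is illustrative):
# replication_status samplebucket-logs logs2021-04-02.gz
```

An object that is not covered by any replication rule has no replication status, so the command prints None for it.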
You can use Amazon S3 batch operations to copy multiple objects with a single request. When you create a batch operation job, you specify which objects to perform the operation on using an Amazon S3 inventory report. Or, you can use a CSV manifest file to specify a batch job. Then, Amazon S3 batch operations call the API to perform the operation.
After the batch operation job is complete, you get a notification and you can choose to receive a completion report about the job.
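A CSV manifest lists one object per row as bucket,key, and object keys in it must be URL-encoded. The sketch below builds such a file from a list of keys on standard input. The function name make_manifest is hypothetical, and to stay simple it assumes keys made only of letters, digits, and the characters . _ / - (which need no encoding); any other key is skipped with a warning instead of being encoded.

```shell
# Turn a list of object keys (one per line on stdin) into manifest rows
# of the form "bucket,key".
# Usage: make_manifest BUCKET < keys.txt > manifest.csv
make_manifest() {
  local LC_ALL=C bucket="$1" key
  while IFS= read -r key; do
    case "$key" in
      "")
        ;;                                   # ignore empty lines
      *[!A-Za-z0-9._/-]*)
        # Key contains a character that would need URL-encoding.
        echo "skipping key that needs URL-encoding: $key" >&2
        ;;
      *)
        printf '%s,%s\n' "$bucket" "$key"
        ;;
    esac
  done
}

# Example call (keys.txt is an illustrative file name):
# make_manifest samplebucket-logs < keys.txt > manifest.csv
```

For buckets with a very large number of objects, an Amazon S3 inventory report is usually the more practical input, because it is generated for you.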
The S3DistCp operation on Amazon EMR can perform parallel copying of large volumes of objects across Amazon S3 buckets. S3DistCp first copies the files from the source bucket to the worker nodes in an Amazon EMR cluster. Then, the operation writes the files from the worker nodes to the destination bucket. For more guidance on using S3DistCp, see Seven tips for using S3DistCp on Amazon EMR to move data efficiently between HDFS and Amazon S3.
Important: Because this option requires you to use Amazon EMR, be sure to review Amazon EMR pricing.
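If you already have an EMR cluster, one way to run S3DistCp is to submit it as a step from the AWS CLI. The following is a sketch under assumptions: submit_s3distcp is a hypothetical helper name, the cluster ID in the example is a placeholder, and the cluster is assumed to be on a release that provides command-runner.jar and the s3-dist-cp command.

```shell
# Submit an S3DistCp copy as a step on an existing EMR cluster.
# Usage: submit_s3distcp CLUSTER_ID SRC_URI DEST_URI
submit_s3distcp() {
  local cluster_id="$1" src="$2" dest="$3"
  aws emr add-steps --cluster-id "$cluster_id" \
    --steps "Type=CUSTOM_JAR,Name=S3DistCp,Jar=command-runner.jar,Args=[s3-dist-cp,--src=${src},--dest=${dest}]"
}

# Example call (j-EXAMPLE is a placeholder cluster ID):
# submit_s3distcp j-EXAMPLE s3://samplebucket-logs/ s3://sampledestbucket-logs/
```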
To move large amounts of data from one Amazon S3 bucket to another bucket with AWS DataSync, perform the following steps:
Open the AWS DataSync console.
Create a task.
Create a new location for Amazon S3.
Select your S3 bucket as the source location.
Update the source location configuration settings. Make sure to specify the AWS Identity and Access Management (IAM) role that will be used to access your source S3 bucket.
Select your S3 bucket as the destination location.
Update the destination location configuration settings. Make sure to specify the AWS Identity and Access Management (IAM) role that will be used to access your destination S3 bucket.
Configure settings for your task.
Review the configuration details.
Choose Create task.
Start your task.
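The console steps above can also be sketched with the AWS CLI. This is an illustration rather than a complete setup: datasync_copy is a hypothetical helper name, the role ARN in the example is a placeholder, and for brevity it assumes that a single IAM role can access both buckets and that the default task settings are acceptable.

```shell
# Create S3 locations for both buckets, create a task, and start it.
# Usage: datasync_copy SRC_BUCKET DEST_BUCKET ROLE_ARN
datasync_copy() {
  local src_bucket="$1" dest_bucket="$2" role_arn="$3"
  local src_loc dest_loc task_arn

  # Source location: the bucket plus the role DataSync uses to access it.
  src_loc="$(aws datasync create-location-s3 \
    --s3-bucket-arn "arn:aws:s3:::${src_bucket}" \
    --s3-config "BucketAccessRoleArn=${role_arn}" \
    --query LocationArn --output text)" || return 1

  # Destination location, created the same way.
  dest_loc="$(aws datasync create-location-s3 \
    --s3-bucket-arn "arn:aws:s3:::${dest_bucket}" \
    --s3-config "BucketAccessRoleArn=${role_arn}" \
    --query LocationArn --output text)" || return 1

  # Task that connects the two locations.
  task_arn="$(aws datasync create-task \
    --source-location-arn "$src_loc" \
    --destination-location-arn "$dest_loc" \
    --query TaskArn --output text)" || return 1

  # Start the transfer; prints the task execution details.
  aws datasync start-task-execution --task-arn "$task_arn"
}

# Example call (the account ID and role name are placeholders):
# datasync_copy samplebucket-logs sampledestbucket-logs \
#   arn:aws:iam::123456789012:role/example-datasync-role
```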