I want to transfer at least 1 TB of data from one Amazon Simple Storage Service (Amazon S3) bucket to another bucket.
Resolution
To transfer large amounts of data from one Amazon S3 bucket to another, use one of the following methods:
- AWS Command Line Interface (AWS CLI)
- Cross-Region Replication (CRR) or Same-Region Replication (SRR)
- Amazon S3 Batch Operations
- S3DistCp with Amazon EMR
- AWS DataSync
Note: If you receive errors when you run AWS CLI commands, then see Troubleshooting errors for the AWS CLI. Also, make sure that you're using the most recent AWS CLI version.
AWS CLI
To improve your data transfer time, split the transfer into multiple mutually exclusive operations. For example, use the AWS CLI to run multiple parallel instances of aws s3 cp, aws s3 mv, or aws s3 sync. You can create more upload threads when you use the --exclude and --include parameters to filter each operation by file name.
Note: Because the --exclude and --include parameters are processed on the client side, the resources of your local machine can affect the performance of the operation.
To copy a large amount of data from one bucket to another, run the following commands:
Note: This example assumes that the file names begin with a number. Update the --include filters to match your own object names.
- Run the following cp command to copy the files with names that begin with the numbers 0 through 4:
aws s3 cp s3://source-awsexamplebucket/ s3://destination-awsexamplebucket/ --recursive --exclude "*" --include "0*" --include "1*" --include "2*" --include "3*" --include "4*"
- Run the following cp command in a second AWS CLI instance to copy the files with names that begin with the numbers 5 through 9:
aws s3 cp s3://source-awsexamplebucket/ s3://destination-awsexamplebucket/ --recursive --exclude "*" --include "5*" --include "6*" --include "7*" --include "8*" --include "9*"
You can also customize the following AWS CLI S3 configuration values to improve your data transfer time:
- Use the multipart_chunksize value to set the size of each part that the AWS CLI uploads in a multipart upload for an individual file. You can break a larger file into smaller parts for quicker upload speeds.
Note: For a multipart upload, you can upload a single file in a maximum of 10,000 distinct parts. Verify that the chunk size that you set balances the part size and the number of parts. For example, a 64 MB chunk size supports individual files up to about 640 GB (10,000 parts × 64 MB).
- Use the max_concurrent_requests value to set the number of requests that you can send to Amazon S3 at one time. The default value is 10, but you can increase it to a higher value. Verify that your machine has sufficient resources to support your maximum number of concurrent requests.
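You can set both values with the aws configure set command. The following is a minimal sketch that uses example values of 64 MB and 20; adjust them to match your file sizes and your machine's resources:

aws configure set default.s3.multipart_chunksize 64MB
aws configure set default.s3.max_concurrent_requests 20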
CRR or SRR
Set up CRR or SRR on the source bucket to allow Amazon S3 to automatically replicate new objects from the source bucket to the destination bucket. To filter the objects that Amazon S3 replicates, use a prefix or tag. For more information, see Replication configuration file elements.
After you configure replication, Amazon S3 replicates only new objects to the destination bucket, not existing objects. For more information, see Replicating existing objects with Batch Replication and What isn't replicated with replication configurations?
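The following is a minimal sketch of a replication configuration that replicates new objects under an example logs/ prefix. The account ID, IAM role ARN, and rule ID are placeholders, and replication requires that versioning is turned on for both the source and destination buckets:

{
  "Role": "arn:aws:iam::111122223333:role/example-replication-role",
  "Rules": [
    {
      "ID": "ExampleReplicationRule",
      "Status": "Enabled",
      "Priority": 1,
      "Filter": { "Prefix": "logs/" },
      "DeleteMarkerReplication": { "Status": "Disabled" },
      "Destination": { "Bucket": "arn:aws:s3:::destination-awsexamplebucket" }
    }
  ]
}

To apply the configuration, save it to a file, and then run the put-bucket-replication command:

aws s3api put-bucket-replication --bucket source-awsexamplebucket --replication-configuration file://replication.json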
Amazon S3 Batch Operations
You can use Amazon S3 Batch Operations to copy multiple objects with a single request. When you create a batch operations job, use an Amazon S3 inventory report or a CSV manifest to specify the objects that Amazon S3 performs the operation on. Then, Amazon S3 Batch Operations calls the corresponding API operation to perform the job.
After the batch operation job completes, you get a notification and an optional completion report.
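The following is a minimal sketch of a Batch Operations copy job created from the AWS CLI with a CSV manifest. The account ID, role ARN, manifest location, and manifest ETag are placeholders that you must replace with your own values:

aws s3control create-job \
  --account-id 111122223333 \
  --operation '{"S3PutObjectCopy":{"TargetResource":"arn:aws:s3:::destination-awsexamplebucket"}}' \
  --manifest '{"Spec":{"Format":"S3BatchOperations_CSV_20180820","Fields":["Bucket","Key"]},"Location":{"ObjectArn":"arn:aws:s3:::source-awsexamplebucket/manifest.csv","ETag":"EXAMPLE-ETAG"}}' \
  --report '{"Bucket":"arn:aws:s3:::destination-awsexamplebucket","Prefix":"batch-reports","Format":"Report_CSV_20180820","Enabled":true,"ReportScope":"AllTasks"}' \
  --priority 10 \
  --role-arn arn:aws:iam::111122223333:role/example-batch-operations-role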
S3DistCp with Amazon EMR
The S3DistCp operation on Amazon EMR can copy large numbers of objects in parallel across Amazon S3 buckets. S3DistCp first copies the files from the source bucket to the worker nodes in an Amazon EMR cluster. Then, the operation writes the files from the worker nodes to the destination bucket. For more information, see Seven tips for using S3DistCp on Amazon EMR to move data efficiently between HDFS and Amazon S3.
Important: Because you must use Amazon EMR with S3DistCp, be sure to review Amazon EMR pricing.
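The following is a minimal sketch that adds an S3DistCp step to an existing EMR cluster with the add-steps command. The cluster ID is a placeholder:

aws emr add-steps \
  --cluster-id j-EXAMPLECLUSTERID \
  --steps 'Type=CUSTOM_JAR,Name=S3DistCpStep,Jar=command-runner.jar,Args=[s3-dist-cp,--src=s3://source-awsexamplebucket/,--dest=s3://destination-awsexamplebucket/]'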
AWS DataSync
To use AWS DataSync to move large amounts of data from one Amazon S3 bucket to another bucket, you must create a transfer location. For a general purpose bucket, see Creating your transfer location for an Amazon S3 general purpose bucket. For an Outpost bucket, see Creating your transfer location for an S3 on Outposts bucket.
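After you create the source and destination locations, create a task that connects them, and then start the task. The following is a minimal sketch in which the Region, account ID, IAM role ARN, and the location and task ARNs are placeholders:

# Create a location for each bucket (run once for the source and once for the destination).
aws datasync create-location-s3 \
  --s3-bucket-arn arn:aws:s3:::source-awsexamplebucket \
  --s3-config BucketAccessRoleArn=arn:aws:iam::111122223333:role/example-datasync-role

# Create a task that connects the two locations, and then run it.
aws datasync create-task \
  --source-location-arn arn:aws:datasync:us-east-1:111122223333:location/loc-sourceEXAMPLE \
  --destination-location-arn arn:aws:datasync:us-east-1:111122223333:location/loc-destEXAMPLE

aws datasync start-task-execution \
  --task-arn arn:aws:datasync:us-east-1:111122223333:task/task-EXAMPLE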
Related information
How do I identify data transfer costs in Amazon S3?