Upload a file to S3 using multipart upload with multithreading


Hi,

Given the volume of the compressed (zip, gz and tar) files in S3, I am trying to read them as a stream in Python and upload them back to another S3 bucket in uncompressed format.

  1. I read the compressed file in chunks, uncompress each chunk and upload it to S3 with upload_fileobj (https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/client/upload_fileobj.html), but every chunk overwrites the previously uploaded chunk, so only the last chunk is available/visible in S3. For example: my source compressed file is 10 GB and my chunk size is 5 MB, and I could see only the last 5 MB chunk in S3 once my Glue job completed successfully.
  2. To overcome this issue I used create_multipart_upload (https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/client/create_multipart_upload.html), but I am not sure how to use threads with this approach.

Could you please help me understand what could be the reason for #1 and how to use threads in #2?

asked 6 months ago · 1.3K views
1 Answer

If you call any of the regular upload APIs and specify a single chunk as the contents of a destination object, S3 takes that chunk to be the entire contents of the object, overwriting any existing object with that key. That is the reason for #1.
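As a minimal sketch of that behaviour (the bucket and key names are hypothetical), calling upload_fileobj once per chunk with the same destination key simply replaces the object on every call:

```python
import io
import boto3

s3 = boto3.client("s3")

# Each call replaces the whole object at this key, so only the last
# chunk survives -- this is what the question describes in #1.
for chunk in (b"chunk-1", b"chunk-2", b"chunk-3"):
    s3.upload_fileobj(io.BytesIO(chunk), "my-target-bucket", "uncompressed/output.csv")

# The object now contains only b"chunk-3".
```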

For S3 to consider the chunks parts of the same object, you first have to call create_multipart_upload (https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/client/create_multipart_upload.html) to establish a multipart upload session. Then use upload_part (https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/client/upload_part.html) to upload each chunk. Note that each chunk has to be a minimum of 5 MiB in size, except for the last part, and there cannot be more than 10,000 parts to one multipart upload. Once all the parts are uploaded, call complete_multipart_upload (https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/client/complete_multipart_upload.html) to tell S3 that it's time to assemble the full object by concatenating the parts and committing the combined result as the target object in the S3 bucket.
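Here is a rough sketch of that flow for a gzip source, assuming hypothetical bucket and key names and a 5 MiB part size; it is not a drop-in implementation of your Glue job:

```python
import gzip
import boto3

s3 = boto3.client("s3")

# Hypothetical names for illustration only.
src_bucket, src_key = "my-source-bucket", "input/data.csv.gz"
dst_bucket, dst_key = "my-target-bucket", "uncompressed/data.csv"
PART_SIZE = 5 * 1024 * 1024  # every part except the last must be >= 5 MiB

def decompressed_chunks():
    # Stream-decompress the source object without loading it all into memory.
    body = s3.get_object(Bucket=src_bucket, Key=src_key)["Body"]
    with gzip.GzipFile(fileobj=body) as gz:
        while True:
            chunk = gz.read(PART_SIZE)
            if not chunk:
                break
            yield chunk

upload_id = s3.create_multipart_upload(Bucket=dst_bucket, Key=dst_key)["UploadId"]
parts = []
try:
    for part_number, chunk in enumerate(decompressed_chunks(), start=1):
        resp = s3.upload_part(
            Bucket=dst_bucket, Key=dst_key, UploadId=upload_id,
            PartNumber=part_number, Body=chunk,
        )
        parts.append({"PartNumber": part_number, "ETag": resp["ETag"]})

    # Tell S3 to assemble the parts into the final object.
    s3.complete_multipart_upload(
        Bucket=dst_bucket, Key=dst_key, UploadId=upload_id,
        MultipartUpload={"Parts": parts},
    )
except Exception:
    # Abort so the staged parts don't linger (see the note further down).
    s3.abort_multipart_upload(Bucket=dst_bucket, Key=dst_key, UploadId=upload_id)
    raise
```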

There's no particular requirement to upload the parts in multiple threads in your code. It's just that the part uploads are completely independent of one another and can be sent to different S3 servers, so nearly unlimited throughput can be achieved by parallelising the upload. It's probably neither needed nor practical in your specific use case, since you are decompressing the file in a single thread anyway. Technically, there's nothing preventing you from uploading one part at a time.
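If you do want to parallelise the part uploads, one possible sketch (reusing the hypothetical names and the decompressed_chunks() generator from the example above) is to hand each chunk to a thread pool and collect the ETags before completing the upload:

```python
from concurrent.futures import ThreadPoolExecutor

import boto3

s3 = boto3.client("s3")  # a boto3 client can be shared across threads

# Hypothetical names; decompressed_chunks() is assumed to yield >= 5 MiB chunks.
dst_bucket, dst_key = "my-target-bucket", "uncompressed/data.csv"
upload_id = s3.create_multipart_upload(Bucket=dst_bucket, Key=dst_key)["UploadId"]

def upload_one_part(part_number, chunk):
    resp = s3.upload_part(
        Bucket=dst_bucket, Key=dst_key, UploadId=upload_id,
        PartNumber=part_number, Body=chunk,
    )
    return {"PartNumber": part_number, "ETag": resp["ETag"]}

# Hand each chunk to the pool as soon as it is produced; decompression
# itself stays single-threaded, so this mainly overlaps network time.
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [
        pool.submit(upload_one_part, n, chunk)
        for n, chunk in enumerate(decompressed_chunks(), start=1)
    ]
    parts = [f.result() for f in futures]

# Parts must be listed in ascending PartNumber order when completing.
parts.sort(key=lambda p: p["PartNumber"])
s3.complete_multipart_upload(
    Bucket=dst_bucket, Key=dst_key, UploadId=upload_id,
    MultipartUpload={"Parts": parts},
)
```

Note that submitting every chunk up front buffers all pending chunks in memory, so for a 10 GB file you would want to cap the number of outstanding futures (or use a bounded queue) rather than queue them all at once.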

Note that if the multipart upload doesn't get completed for any reason, the upload session and the parts that you uploaded will remain in S3's staging area for all eternity, until or unless you abort the multipart upload explicitly with abort_multipart_upload (https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/client/abort_multipart_upload.html). The parts in the staging area are charged at the S3 Standard storage class rate. However, there's a built-in mechanism that you can activate to abort incomplete uploads one or more days after they were started. This blog post explains how and why to set it up: https://aws.amazon.com/blogs/aws-cloud-financial-management/discovering-and-deleting-incomplete-multipart-uploads-to-lower-amazon-s3-costs/
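For reference, the lifecycle rule that the blog post describes can also be set with boto3; this is a sketch with a hypothetical bucket name and an arbitrary 7-day window:

```python
import boto3

s3 = boto3.client("s3")

# Abort any multipart upload still incomplete 7 days after it was initiated.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-target-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "abort-incomplete-multipart-uploads",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    },
)
```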

EXPERT
answered 6 months ago
  • Thank you for the details. I have already used the same approach you shared, but my doubt was how to use multithreading to make the transfer much faster (e.g., if there are 1000s of files in the zip).
