How can I explicitly specify the size of the files to be split or the number of files?

Situation: If I only specify the partition clause, the output is split into many files. Each file is less than 1 MB (about 40 files).

What I want: to explicitly specify the file size or the number of files when registering data with CTAS or INSERT INTO.

I have read this article: https://aws.amazon.com/premiumsupport/knowledge-center/set-file-number-size-ctas-athena/

Problem: Using the bucketing method (as described in the article above) lets me specify the number of files or the file size. However, the article also says: "Note: The INSERT INTO statement isn't supported on bucketed tables." I would like to register data daily with Athena's INSERT INTO.

Question: What is the best way to build a partitioned data mart without compromising query efficiency? Is it best to register the data with Glue and save it as a single file?

Asked 2 years ago · 1,702 views
1 Answer
Accepted Answer

Hello,

Yes, you are right that INSERT INTO is not yet supported on bucketed tables. For your use case, where you want to control the number of files and their sizes, Athena bucketing would be appropriate, but with the drawback that you cannot use INSERT INTO to load new incoming data.
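For reference, a bucketed CTAS along the following lines caps the number of files written per partition via bucket_count. This is only a sketch, wrapped in boto3 so it could be scheduled; the database, table, column names, and S3 locations are made-up placeholders, not part of your setup:

```python
import boto3

# Sketch only: database, table, columns, and S3 paths are placeholders.
athena = boto3.client("athena", region_name="us-east-1")

# CTAS with bucketing: bucket_count bounds the number of files written
# per partition (here, roughly 4 files for each dt partition).
ctas_query = """
CREATE TABLE my_datamart.daily_sales_bucketed
WITH (
    format = 'PARQUET',
    external_location = 's3://my-example-bucket/daily_sales_bucketed/',
    partitioned_by = ARRAY['dt'],
    bucketed_by = ARRAY['customer_id'],
    bucket_count = 4
) AS
SELECT customer_id, amount, dt
FROM my_datamart.daily_sales_raw
"""

athena.start_query_execution(
    QueryString=ctas_query,
    QueryExecutionContext={"Database": "my_datamart"},
    ResultConfiguration={"OutputLocation": "s3://my-example-bucket/athena-results/"},
)
```

Keep in mind that this table still cannot receive INSERT INTO, so daily data would have to be rewritten with a new CTAS rather than appended.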

However, I can recommend using the S3DistCp utility on Amazon EMR to merge small files into objects of roughly 128 MB to solve your small-file problem. You can use it to combine smaller files into larger objects, and you can also use S3DistCp to move large amounts of data in an optimized fashion from HDFS to Amazon S3, from Amazon S3 to Amazon S3, and from Amazon S3 to HDFS.
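If you take the EMR route, one way to run the compaction on a schedule is to submit S3DistCp as a step on a running cluster. Below is a minimal sketch using boto3; the cluster ID, bucket names, and --groupBy pattern are placeholders. Note that --groupBy concatenates the matching files, which suits row-oriented text formats:

```python
import boto3

# Sketch only: the cluster ID, S3 paths, and regex below are placeholders.
emr = boto3.client("emr", region_name="us-east-1")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # an existing, running EMR cluster
    Steps=[
        {
            "Name": "Merge small files with S3DistCp",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "s3-dist-cp",
                    "--src=s3://my-example-bucket/daily_sales/dt=2023-01-01/",
                    "--dest=s3://my-example-bucket/daily_sales_merged/dt=2023-01-01/",
                    # --groupBy concatenates files whose names match the regex;
                    # --targetSize aims for ~128 MiB per merged output file.
                    "--groupBy=.*(dt=2023-01-01).*",
                    "--targetSize=128",
                ],
            },
        }
    ],
)
```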

REFERENCES:

https://docs.aws.amazon.com/emr/latest/ReleaseGuide/UsingEMR_s3distcp.html

https://aws.amazon.com/blogs/big-data/seven-tips-for-using-s3distcp-on-amazon-emr-to-move-data-efficiently-between-hdfs-and-amazon-s3/

AWS
Support Engineer
answered 2 years ago
