Redshift Data Transfer Incapability


We are trying to copy a dataset from EMR to Redshift which consists of around 13 billion records and 20-25 columns. I tried copying the dataset with the traditional method, using the COPY command through an S3 bucket pointing to EMR, but it is taking more than 24 hours to copy the dataset. The current EMR configuration is 1 primary node of r5.4xlarge and 2 core nodes of r5.4xlarge, and the Redshift configuration is 2 ra3.xlplus nodes. We have also enabled a sort key, a dist key, and compression. The dataset is stored in Parquet format in an S3 bucket. Please suggest a way to copy the dataset in as little time as possible.
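
For context, a minimal sketch of the kind of COPY invocation described above, submitted through the boto3 Redshift Data API. The table name, S3 prefix, IAM role ARN, cluster identifier, database, and user below are all placeholders, not values from the question.

    import time
    import boto3

    # Hypothetical identifiers -- replace with your own cluster, database, role and S3 prefix.
    client = boto3.client("redshift-data")

    copy_sql = """
        COPY analytics.events
        FROM 's3://my-bucket/emr-output/events/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
        FORMAT AS PARQUET;
    """

    # Submit the COPY asynchronously; Redshift parallelizes the load across slices.
    response = client.execute_statement(
        ClusterIdentifier="my-redshift-cluster",
        Database="dev",
        DbUser="awsuser",
        Sql=copy_sql,
    )

    # Poll until the statement finishes (simplified; real code should back off and handle errors).
    while True:
        status = client.describe_statement(Id=response["Id"])["Status"]
        if status in ("FINISHED", "FAILED", "ABORTED"):
            print(status)
            break
        time.sleep(10)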

Noa0411
Asked 2 years ago · 467 views
3 Answers

Hello,

Thank you for writing on re:Post.

I understand that you would like to increase the speed at which your Spark application writes to S3.

I would recommend experimenting with increasing the size of the executors and/or the number of executors, since it looks like the executors are finishing quite quickly.
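
As a rough illustration of that tuning, here is a minimal PySpark sketch. The executor counts and sizes are placeholder values for 2 x r5.4xlarge core nodes, not figures from this answer; the same settings can also be passed at submit time with spark-submit --conf.

    from pyspark.sql import SparkSession

    # Placeholder executor sizing for 2 x r5.4xlarge core nodes (16 vCPU / 128 GB each);
    # tune the numbers against the Spark UI rather than taking them as given.
    spark = (
        SparkSession.builder
        .appName("emr-to-s3-write")
        .config("spark.executor.instances", "6")
        .config("spark.executor.cores", "4")
        .config("spark.executor.memory", "18g")
        .config("spark.executor.memoryOverhead", "2g")
        .getOrCreate()
    )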

Other things we may want to consider include:

  • How the data is being partitioned; partitioning the data can often improve write efficiency. This can be done by using partitionBy with the write (see the sketch after this list).
  • The data format being written; this can have a significant performance impact.
  • We can also consider using Redshift's UNLOAD function if the data isn't being heavily modified in Spark. Redshift's UNLOAD performance is typically quite good. For more information on UNLOAD: https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html#unload-parameters
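
A minimal PySpark sketch of the partitioning and file-count points above, assuming a hypothetical DataFrame df produced by the EMR job; the column name and S3 prefix are placeholders.

    # Assumes an existing SparkSession and a DataFrame `df` produced by the EMR job.
    # `event_date` and the S3 prefix are hypothetical; adjust to your schema and bucket.
    (
        df.repartition(200)              # control the number (and so the size) of output files
          .write
          .mode("overwrite")
          .partitionBy("event_date")     # lay the data out by a column useful for selective loads
          .parquet("s3://my-bucket/emr-output/events/")
    )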

Please try the above suggestions. Hope they improve your experience.

Thanks and have a great day!

AWS
Support Engineer
Answered 2 years ago
  • Hi, while the answer is correct and thorough, the user is asking about COPY command performance.


Without analyzing in detail, I know that using Redshift Spectrum to read the files may, in certain cases, be faster than COPY; e.g., it has the ability to automatically split the files. If you can catalog this data set as an external table and then do an INSERT ... SELECT, you may get better performance.
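
As a rough sketch of that approach, assuming the Parquet files have already been cataloged (for example by a Glue crawler) and an external schema already exists in Redshift; spectrum_schema, the table names, and the cluster details below are placeholders.

    import boto3

    client = boto3.client("redshift-data")

    # Placeholder names: spectrum_schema must already exist (CREATE EXTERNAL SCHEMA),
    # with spectrum_schema.events cataloged over the Parquet files in S3.
    insert_sql = """
        INSERT INTO analytics.events
        SELECT *
        FROM spectrum_schema.events;
    """

    client.execute_statement(
        ClusterIdentifier="my-redshift-cluster",
        Database="dev",
        DbUser="awsuser",
        Sql=insert_sql,
    )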

AWS
Expert
Rajiv_G
Answered 2 years ago
AWS
Expert
Reviewed 2 years ago

Hi,

I understand you have an application in EMR writing to S3 in Parquet format, and you are trying to load the generated data into Redshift using the COPY command.

You have built your table using best practices; at the same time, the size of your cluster is quite small: 2 ra3.xlplus nodes (8 vCPUs and 64 GB of memory in total).

As mentioned in the documentation, you should load from multiple files, which is certainly your situation, and ideally the number of files is a multiple of the number of slices in your cluster.

The nodes you are utilizing have 2 slices each, which means you will load 4 files at a time.

Load time will depend on 3 factors: the total size of the dataset to load (how many GB the Parquet table is), the number and size of the files, and the parallelism at which you can load.

With 13 billion rows I would expect the table to be large and to have thousands of files.
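
To make the arithmetic concrete, a small sketch: the 1 TB compressed dataset size is purely hypothetical, while the 2 nodes x 2 slices figure comes from this answer.

    # Hypothetical compressed dataset size; the slice count matches 2 x ra3.xlplus from this thread.
    dataset_gb = 1024          # assumed total size of the Parquet data after compression, in GB
    target_file_mb = 125       # upper end of the recommended 1-125 MB per-file range
    slices = 2 * 2             # 2 nodes x 2 slices each

    files_needed = (dataset_gb * 1024) // target_file_mb    # ~8388 files of ~125 MB
    load_batches = -(-files_needed // slices)                # ceiling: files are loaded 4 at a time
    print(files_needed, load_batches)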

If your files are already quite large (closer to the upper range, at least 125 MB, mentioned in the documentation above; for ease of reading, here is an excerpt: the files should be "about equal size, from 1 MB to 1 GB after compression. For optimum parallelism, the ideal file size is 1–125 MB after compression"), then the way to speed up the COPY command would be to scale out the cluster using fast classic resize, so that you can increase the number of parallel loads.
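
If you do go the resize route, a minimal boto3 sketch of kicking off a classic resize; the cluster identifier and target node count are placeholders.

    import boto3

    redshift = boto3.client("redshift")

    # Placeholder cluster identifier and target size; Classic=True requests a classic resize
    # rather than an elastic resize.
    redshift.resize_cluster(
        ClusterIdentifier="my-redshift-cluster",
        ClusterType="multi-node",
        NodeType="ra3.xlplus",
        NumberOfNodes=4,
        Classic=True,
    )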

If your files are small (less than 100 MB, or even worse less than 10 MB or 1 MB), then I would first recommend compacting the files into larger ones and retrying; if it is still too slow, you will have to consider resizing the cluster anyway to increase the load parallelism.
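
A minimal PySpark sketch of that compaction step, assuming hypothetical source and destination prefixes and an assumed ~1 TB total size; the target of roughly 125 MB per output file follows the excerpt quoted above.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("compact-parquet").getOrCreate()

    # Hypothetical prefixes; read the small files and rewrite them as fewer, larger ones.
    src = "s3://my-bucket/emr-output/events/"
    dst = "s3://my-bucket/emr-output/events-compacted/"

    df = spark.read.parquet(src)

    # Aim for ~125 MB per compressed output file; the partition count is estimated from an
    # assumed ~1 TB total dataset size rather than measured exactly.
    target_files = (1024 * 1024) // 125
    df.repartition(target_files).write.mode("overwrite").parquet(dst)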

Hope this helps.

AWS
Expert
Answered 2 years ago
