Redshift Data Transfer Incapability


We are trying to copy a dataset from EMR to Redshift which consists of around 13 billion records and 20-25 columns. I tried copying the dataset with the traditional method, using the COPY command through an S3 bucket pointing to EMR, but it is taking more than 24 hours to copy the dataset. The current EMR configuration is 1 primary node of r5.4xlarge and 2 core nodes of r5.4xlarge, and the Redshift configuration is 2 ra3.xlplus nodes. We have also enabled a sort key, a dist key, and compression. The dataset is stored in Parquet format in an S3 bucket. Please suggest a way to copy the dataset in as little time as possible.
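
For context, a minimal sketch of the kind of COPY invocation described above, submitted through the boto3 Redshift Data API. The table name, S3 prefix, IAM role ARN, cluster identifier, database, and user below are all placeholders, not values from the question.

    import time
    import boto3

    # Hypothetical identifiers -- replace with your own cluster, database, role and S3 prefix.
    client = boto3.client("redshift-data")

    copy_sql = """
        COPY analytics.events
        FROM 's3://my-bucket/emr-output/events/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
        FORMAT AS PARQUET;
    """

    # Submit the COPY asynchronously; Redshift parallelizes the load across slices.
    response = client.execute_statement(
        ClusterIdentifier="my-redshift-cluster",
        Database="dev",
        DbUser="awsuser",
        Sql=copy_sql,
    )

    # Poll until the statement finishes (simplified; real code should back off and handle errors).
    while True:
        status = client.describe_statement(Id=response["Id"])["Status"]
        if status in ("FINISHED", "FAILED", "ABORTED"):
            print(status)
            break
        time.sleep(10)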

Noa0411
Asked 2 years ago · 467 views
3 Answers

Hello,

Thank you for writing on re:Post.

I understand that you would like to increase the speed at which your Spark application writes to S3.

I would recommend experimenting with increasing the size of the executors and/or the number of executors, since it looks like the executors are finishing quite quickly.
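
As a rough illustration of that tuning, here is a minimal PySpark sketch. The executor counts and sizes are placeholder values for 2 x r5.4xlarge core nodes, not figures from this answer; the same settings can also be passed at submit time with spark-submit --conf.

    from pyspark.sql import SparkSession

    # Placeholder executor sizing for 2 x r5.4xlarge core nodes (16 vCPU / 128 GB each);
    # tune the numbers against the Spark UI rather than taking them as given.
    spark = (
        SparkSession.builder
        .appName("emr-to-s3-write")
        .config("spark.executor.instances", "6")
        .config("spark.executor.cores", "4")
        .config("spark.executor.memory", "18g")
        .config("spark.executor.memoryOverhead", "2g")
        .getOrCreate()
    )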

Other things we may want to consider include:

  • How the data is being partitioned; partitioning the data can often improve write efficiency. This can be done by using partitionBy with the write (see the sketch after this list).
  • The data format being written; this can have a significant performance impact.
  • We can also consider using Redshift's UNLOAD function if the data isn't being heavily modified in Spark. Redshift's UNLOAD performance is typically quite good. For more information on UNLOAD: https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html#unload-parameters
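
A minimal PySpark sketch of the partitioning and file-count points above, assuming a hypothetical DataFrame df produced by the EMR job; the column name and S3 prefix are placeholders.

    # Assumes an existing SparkSession and a DataFrame `df` produced by the EMR job.
    # `event_date` and the S3 prefix are hypothetical; adjust to your schema and bucket.
    (
        df.repartition(200)              # control the number (and so the size) of output files
          .write
          .mode("overwrite")
          .partitionBy("event_date")     # lay the data out by a column useful for selective loads
          .parquet("s3://my-bucket/emr-output/events/")
    )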

Please try the above suggestions. Hope they improve your experience.

Thanks and have a great day!

AWS
Support Engineer
Answered 2 years ago
  • Hi, while the answer is correct and thorough, the user is asking about COPY command performance.


Without analyzing in detail, I know that using Redshift Spectrum to read the files may, in certain cases, be faster than COPY; e.g., it has the ability to automatically split the files. If you can catalog this data set as an external table and then do an INSERT ... SELECT, you may get better performance.
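
As a rough sketch of that approach, assuming the Parquet files have already been cataloged (for example by a Glue crawler) and an external schema already exists in Redshift; spectrum_schema, the table names, and the cluster details below are placeholders.

    import boto3

    client = boto3.client("redshift-data")

    # Placeholder names: spectrum_schema must already exist (CREATE EXTERNAL SCHEMA),
    # with spectrum_schema.events cataloged over the Parquet files in S3.
    insert_sql = """
        INSERT INTO analytics.events
        SELECT *
        FROM spectrum_schema.events;
    """

    client.execute_statement(
        ClusterIdentifier="my-redshift-cluster",
        Database="dev",
        DbUser="awsuser",
        Sql=insert_sql,
    )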

AWS
Expert
Rajiv_G
Answered 2 years ago
AWS
Expert
Reviewed 2 years ago

Hi,

I understand you have an application in EMR writing to S3 in Parquet format, and you are trying to load the generated data into Redshift using the COPY command.

You have built your table using best practices; at the same time, the size of your cluster is quite small: 2 ra3.xlplus nodes (8 vCPUs and 64 GB of memory in total).

As mentioned in the documentation, you should load from multiple files, which is certainly your situation, and ideally the number of files is a multiple of the number of slices in your cluster.

The nodes you are utilizing have 2 slices each, which means you will load 4 files at a time.

Load time will depend on 3 factors: the total size of the dataset to load (how many GB the Parquet table is), the number and size of the files, and the parallelism at which you can load.

With 13 billion rows I would expect the table to be large and to have thousands of files.
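
To make the arithmetic concrete, a small sketch: the 1 TB compressed dataset size is purely hypothetical, while the 2 nodes x 2 slices figure comes from this answer.

    # Hypothetical compressed dataset size; the slice count matches 2 x ra3.xlplus from this thread.
    dataset_gb = 1024          # assumed total size of the Parquet data after compression, in GB
    target_file_mb = 125       # upper end of the recommended 1-125 MB per-file range
    slices = 2 * 2             # 2 nodes x 2 slices each

    files_needed = (dataset_gb * 1024) // target_file_mb    # ~8388 files of ~125 MB
    load_batches = -(-files_needed // slices)                # ceiling: files are loaded 4 at a time
    print(files_needed, load_batches)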

If your files are already quite large (closer to the upper range, at least 125 MB, mentioned in the documentation above; for ease of reading, here is an excerpt: the files should be "about equal size, from 1 MB to 1 GB after compression. For optimum parallelism, the ideal file size is 1–125 MB after compression"), then the way to speed up the COPY command would be to scale out the cluster using fast classic resize, so that you can increase the number of parallel loads.
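
If you do go the resize route, a minimal boto3 sketch of kicking off a classic resize; the cluster identifier and target node count are placeholders.

    import boto3

    redshift = boto3.client("redshift")

    # Placeholder cluster identifier and target size; Classic=True requests a classic resize
    # rather than an elastic resize.
    redshift.resize_cluster(
        ClusterIdentifier="my-redshift-cluster",
        ClusterType="multi-node",
        NodeType="ra3.xlplus",
        NumberOfNodes=4,
        Classic=True,
    )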

If your files are small (less than 100 MB, or even worse less than 10 MB or 1 MB), then I would first recommend compacting the files into larger ones and retrying; if it is still too slow, you will have to consider resizing the cluster anyway to increase the load parallelism.
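
A minimal PySpark sketch of that compaction step, assuming hypothetical source and destination prefixes and an assumed ~1 TB total size; the target of roughly 125 MB per output file follows the excerpt quoted above.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("compact-parquet").getOrCreate()

    # Hypothetical prefixes; read the small files and rewrite them as fewer, larger ones.
    src = "s3://my-bucket/emr-output/events/"
    dst = "s3://my-bucket/emr-output/events-compacted/"

    df = spark.read.parquet(src)

    # Aim for ~125 MB per compressed output file; the partition count is estimated from an
    # assumed ~1 TB total dataset size rather than measured exactly.
    target_files = (1024 * 1024) // 125
    df.repartition(target_files).write.mode("overwrite").parquet(dst)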

Hope this helps.

AWS
Expert
Answered 2 years ago
