Athena and Redshift Spectrum performance best practices

0

A customer is working on a PoC to validate both Athena & Redshift spectrum. Need help with following questions to provide some clarity.

  1. Is it fine to have a small Redshift cluster (e.g. 2-node dc2.large cluster) and have almost all data in S3 to be queried through Redshift Spectrum. Is this as recommended approach? What is the minimum recommended size of the cluster to use Redshift Spectrum?

  2. What is the correlation between performance and size of data stored in S3. Do we have any numbers that we shall share on what will be the general query performance for specific size of data using both Athena & Redshift Spectrum.

  3. Is there any major difference in query syntax between Athena & Redshift Spectrum. They want to understand how easy it's to swap between Athena & Redshift Spectrum. If there is a difference is there any known transition layer to support both of them?

질문됨 6년 전1.4천회 조회
1개 답변
1
수락된 답변

I would go through the Redshift Spectrum best practices blog here and plan to run some tests. It is hard to quantify such metrics as every customer workload is different.

Regarding your questions:

1/ Depends on a variety of factors as noted in the best practices blog. Such as parquet file format, Snappy compression, proper partitioning on S3 to help with query access patterns/filters, type of queries such as ORDER BY, DISTINCT which cannot be pushed down to Spectrum compute layer etc. Amazon Redshift Spectrum owns managed compute layer independent of your Redshift cluster. The number of Redshift Spectrum compute nodes that a query uses depends on the Redshift node type and the overall workload. Based on the demands of your queries and Redshift cluster configuration, Redshift Spectrum scales automatically in an intelligent fashion.

2/ Same as #1

3/ Regarding query syntax difference between Athena and Redshift Spectrum, yes.
Athena's query engine is Apache Presto and hence, it follows query syntax of Apache Presto. I would refer to Presto documentation here under "SQL Language" and "SQL Statement Syntax". As far as Spectrum goes, you will find that Spectrum follows pretty much the same syntax as Redshift except things like you cannot do DML operations on Spectrum tables due to the external table.

For the second part of your question, I would make sure that customer is aware when to use Athena versus Spectrum. They are not meant to replace each other but rather meant for different workloads. Athena is more like rent-a-car for adhoc/on-demand data explorations as and when needed without needing to spin up a cluster etc. Whereas Redshift Spectrum is more like a secondary car and Redshift is your primary car. A common pattern for Redshift Spectrum is to run queries that span both the frequently accessed “hot” data stored locally in Amazon Redshift and the “warm/cold” data stored cost-effectively in Amazon S3. This pattern serves to separate compute and storage, enabling independent scaling of both to match the use case without having to pay disproportionately for value. Athena and Redshift Spectrum query optimizers are completely different. There are also differences such as you can get the same rich compliance standards of Amazon Redshift.

AWS
답변함 6년 전
profile picture
전문가
검토됨 일 년 전

로그인하지 않았습니다. 로그인해야 답변을 게시할 수 있습니다.

좋은 답변은 질문에 명확하게 답하고 건설적인 피드백을 제공하며 질문자의 전문적인 성장을 장려합니다.

질문 답변하기에 대한 가이드라인

관련 콘텐츠