Hi TiwaryShashwat,
Depending on how complex (or not) the transforms in your Glue jobs are, it might be easier to simply export or unload the source data from your RDS instance to S3 in a format that can be loaded into Redshift with a COPY command. You can handle most lightweight transforms in the SELECT portion of your unload or export, and even partly in the COPY command itself.
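As a rough sketch of that flow, assuming RDS PostgreSQL with the aws_s3 export extension enabled and hypothetical bucket, role, and table names, it could look something like this:

```sql
-- On RDS PostgreSQL: export the (lightly transformed) result of a query to S3 as CSV.
-- Requires the aws_s3 extension and an IAM role attached to the instance with s3:PutObject.
SELECT *
FROM aws_s3.query_export_to_s3(
    'SELECT order_id, customer_id, amount, created_at::date AS order_date FROM orders',
    aws_commons.create_s3_uri('my-export-bucket', 'orders/orders.csv', 'us-east-1'),
    options := 'format csv'
);

-- On Redshift: load the exported files with a COPY command.
COPY staging.orders
FROM 's3://my-export-bucket/orders/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
FORMAT AS CSV;
```

For Aurora MySQL the equivalent would be SELECT ... INTO OUTFILE S3; the overall pattern (unload to S3, then COPY) stays the same.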
Another option is to export or unload from RDS into an external-table-compatible key structure in S3 and simply map an external table definition on top of it in the Glue Data Catalog. That way, with a simple CREATE EXTERNAL SCHEMA ... declaration in Redshift, you can query the data in S3 directly using Redshift Spectrum. If you choose this route, I'd suggest some form of time-based partitioning on that external table, both to make data management in S3 easier and to make your Redshift Spectrum queries faster when you apply a time-based filter predicate on the partitioning column.
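A minimal sketch of the Spectrum route, again with hypothetical schema, table, bucket, and role names:

```sql
-- Map a Glue Data Catalog database into Redshift as an external schema.
CREATE EXTERNAL SCHEMA rds_export
FROM DATA CATALOG
DATABASE 'rds_export_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- External table over the exported files, partitioned by date.
CREATE EXTERNAL TABLE rds_export.orders (
    order_id    BIGINT,
    customer_id BIGINT,
    amount      DECIMAL(12,2)
)
PARTITIONED BY (order_date DATE)
STORED AS PARQUET
LOCATION 's3://my-export-bucket/orders/';

-- Register each new partition as the corresponding export lands in S3.
ALTER TABLE rds_export.orders
ADD IF NOT EXISTS PARTITION (order_date = '2024-01-15')
LOCATION 's3://my-export-bucket/orders/order_date=2024-01-15/';

-- The partition column in the WHERE clause lets Spectrum prune what it scans.
SELECT customer_id, SUM(amount) AS total_amount
FROM rds_export.orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31'
GROUP BY customer_id;
```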
While the route you're taking now, using DataFrames in Glue Spark, can work, you may run into scaling problems depending on your budget for scaling the Glue job and the amount of data you're operating on, since your current process requires holding the whole DataFrame in memory on the Glue Spark cluster.
Please let me know if this helps.
-Kurt