Hi.
You can use the Glue DynamicFrame API - a DynamicFrame is similar to a DataFrame, except that each record is self-describing, so no schema is required initially [1]. You can read the CSV files, process them, and store the processed files in another folder in the S3 bucket. Here is an example of how to achieve this in an AWS Glue job:
- Create a new AWS Glue job and specify a data source for your input files using the create_dynamic_frame.from_catalog method. Provide the catalog database and table names where your crawler has stored the schema information.
- Filter the dynamic frame to only include records from files whose name starts with "DUP" using the Filter transformation.
- Perform any necessary transformations or processing on the dynamic frame using the various transformation functions available in Glue.
- Finally, use the write_dynamic_frame method to write the processed dynamic frame to the desired location in your S3 bucket.
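Below is a minimal sketch of these steps as a Glue PySpark job script. The database name ("my_database"), table name ("my_table"), and output path ("s3://my-bucket/processed/") are placeholders to replace with your own values. Because the Filter transform operates on records rather than files, this sketch converts to a Spark DataFrame and uses input_file_name() to filter on the source file name before converting back to a DynamicFrame.

```python
import sys

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql.functions import input_file_name

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# 1. Read the crawled table as a DynamicFrame.
#    "my_database" and "my_table" are placeholders for your Data Catalog entries.
source = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="my_table",
)

# 2. Keep only records coming from files whose name starts with "DUP".
#    input_file_name() exposes the S3 path of the file each record was read from.
df = source.toDF().withColumn("source_file", input_file_name())
dup_df = df.filter(df["source_file"].rlike(r"/DUP[^/]*$")).drop("source_file")

# 3. Apply any further transformations here, then convert back to a DynamicFrame.
dup_only = DynamicFrame.fromDF(dup_df, glueContext, "dup_only")

# 4. Write the processed records to another folder in the same bucket.
#    "s3://my-bucket/processed/" is a placeholder output path.
glueContext.write_dynamic_frame.from_options(
    frame=dup_only,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/processed/"},
    format="csv",
)

job.commit()
```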
With this setup you can read all the CSV files starting with "DUP", process them, and store the processed files in another folder within the same S3 bucket. Remember to set up appropriate IAM roles and permissions for your Glue job to access the necessary resources in your AWS environment. I hope this helps! Let me know if you have any further questions.
Thank you
References
[1] DynamicFrame Class: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame.html
Is there a reason you want to read the files one at a time? Generally it would be inefficient to do so in Spark.