Glue ETL drops columns on read. Is it the crawler's fault?

0

I am seeing some inconsistencies between my Glue crawler and my Glue ETL job, and I need some help figuring out the best setup. At the moment, I have an S3 bucket where I store CSV files, and I partition the data using prefixes like so: s3://my_bucket/interim/year=2018/version=270/example.csv.

Every time I upload a dataset to the bucket, I run a Glue crawler to update the Glue Data Catalog table with the new data and partitions. For the crawler, I have the Crawl all sub-folders option enabled. The Data Catalog table can then be used to query the data in Athena for ad-hoc analysis and to run an ETL job from a notebook in Glue Studio. In the ETL job, I have the job bookmark enabled, and I read the data from the table using the following piece of code.

# Read the catalog table as a DynamicFrame (transformation_ctx enables the job bookmark)
# and convert it to a Spark DataFrame.
glue_df = glueContext.create_dynamic_frame.from_catalog(
    database="my-database",
    table_name="crawler_interim",
    transformation_ctx="datasource0",
).toDF()

The problem I am facing is that when I upload new CSV files, create new partitions in the bucket, run the crawler, and then run the ETL job, the glue_df dataframe doesn't return all the columns in the data. The dataset should have 61 columns, but glue_df returns only 60.
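
As a quick check (a sketch, using the partition path from the example above as a placeholder for the newly uploaded files), the columns returned through the catalog can be compared with the columns Spark sees when reading the same files directly, to identify exactly which column is being dropped:

# Columns as seen through the Data Catalog table.
catalog_cols = set(glue_df.columns)

# Columns as seen when reading the new partition directly from S3.
raw_df = spark.read.option("header", "true").csv(
    "s3://my_bucket/interim/year=2018/version=270/"
)
raw_cols = set(raw_df.columns)

print("Missing from the catalog read:", raw_cols - catalog_cols)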

I suspect the problem is how the crawler interprets the schema of the new files. If the newly added dataset were read in full by the ETL job, I could fix the schema directly there, but with a column missing on read that's an issue. I am not sure how to get past that. My questions are:

  • Am I doing something wrong with the crawler?
  • Should I change the crawler's Subsequent crawler runs setting to Crawl new sub-folders only?
  • Should I have a Lambda function that runs on every file upload to S3 and "fixes" the schema of the interim files before they are processed by the ETL job?
  • Should I avoid reading the data through the Data Catalog table and just read directly from S3?

Any help will be much appreciated!

asked 2 months ago · 74 views
1 Answer
1

Hello, thank you for your question. Based on your description and code, the issue appears to be related to how the Glue crawler interprets the schema of the new CSV files being added to the S3 bucket. To solve the problem, you could try the following options:

  • Check the crawler configuration: Ensure that the crawler is configured to handle the schema of the new CSV files. By default, the crawler infers the schema by sampling the files it encounters; if the new files have a slightly different schema, the crawler may not pick up all of the columns. You can adjust the crawler's settings, such as the sampling behavior for the S3 data source, or specify a custom CSV classifier that declares the expected header so the schema does not have to be inferred from the file contents (see the classifier sketch after this list).

  • Use the Subsequent crawler runs: Crawl new sub-folders only option, as you suggested, so that subsequent runs are incremental and add new partitions under the existing table schema instead of re-inferring it.

  • Use a Lambda function to fix the schema: As you suggested, you could create an AWS Lambda function that triggers whenever a new file is uploaded to the S3 bucket. The function could read the file, fix the schema if necessary, and write the file back to the same location with the correct schema. A rough sketch of such a function is shown below, after the classifier example.
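
For the classifier option, a minimal sketch using boto3 is shown below. The classifier name, the crawler name, and the column list are placeholders (the real header has 61 columns); the idea is to declare the header explicitly so the crawler does not have to infer it from the first rows of each file.

import boto3

glue = boto3.client("glue")

# Placeholder column list -- replace with the real 61 column names.
expected_columns = ["col_01", "col_02", "col_03"]  # ... up to col_61

# Custom CSV classifier that declares the header explicitly.
glue.create_classifier(
    CsvClassifier={
        "Name": "interim-csv-classifier",
        "Delimiter": ",",
        "QuoteSymbol": '"',
        "ContainsHeader": "PRESENT",
        "Header": expected_columns,
    }
)

# Attach the classifier to the existing crawler (placeholder crawler name).
glue.update_crawler(
    Name="crawler-interim",
    Classifiers=["interim-csv-classifier"],
)

With the classifier attached, the next crawler run should pick up the declared header rather than whatever it samples from the files, subject to the crawler's schema change policy.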
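
For the Lambda option, a rough sketch of the handler is below. It assumes the expected 61-column header is known up front and that pandas is available in the function's deployment package; the column names and the trigger configuration are assumptions, not part of your current setup. It rewrites any uploaded CSV to match the expected header: missing columns are added as empty, extra columns are dropped, and the column order is fixed.

import io
import urllib.parse

import boto3
import pandas as pd

s3 = boto3.client("s3")

# Placeholder column list -- replace with the real 61 column names.
EXPECTED_COLUMNS = ["col_01", "col_02", "col_03"]  # ... up to col_61


def handler(event, context):
    """Triggered by S3 ObjectCreated events on the interim prefix."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Read the uploaded CSV into a DataFrame.
        obj = s3.get_object(Bucket=bucket, Key=key)
        df = pd.read_csv(io.BytesIO(obj["Body"].read()))

        # Force the expected schema: add missing columns as empty,
        # drop unexpected ones, and normalize the column order.
        df = df.reindex(columns=EXPECTED_COLUMNS)

        # Write the normalized file back to the same key.
        buffer = io.StringIO()
        df.to_csv(buffer, index=False)
        s3.put_object(Bucket=bucket, Key=key, Body=buffer.getvalue())

Note that writing back to the same key will fire the trigger again, so in practice you would either write to a separate "clean" prefix or add a guard so already-normalized files are skipped.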

answered a month ago
