Why is the output schema wrong on my visual ETL Glue job unless I use Data Preview?

The goal is to create an ETL job that can be altered and executed by non-technical users in our organization, which is why we are sticking to only visuals and not code.

The problem is that the nodes don't seem to update their output schema unless I click "Data Preview" and then "Use datapreview schema", which doesn't seem intuitive at all. Is this a bug?

For example, let's say my data source is an S3 bucket containing a CSV file with columns A, B, and C. Then I have a second node (node2), "Add Current Timestamp". The output schema of node2 is as follows:

Key    Data type
A      string
B      string
C      string

Then, if I run a data preview and click "Use datapreview schema", it becomes:

Key                  Data type
A                    string
B                    string
C                    string
current_timestamp    timestamp

Since the purpose of node2 is to add the current timestamp column, I would expect it to be added to the output schema without having to use data preview. This becomes very time-consuming when you have more than two nodes and have to run a preview on each one.
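
To make the expectation concrete, here is a minimal pure-Python sketch of the schema propagation I would expect from the editor. It is only an illustration of the idea, not Glue code: `add_current_timestamp` and `propagate` are hypothetical names invented for this example, and it makes no claim about how Glue Studio actually infers schemas.

```python
def add_current_timestamp(schema, column="current_timestamp"):
    """Hypothetical model of a node that appends one timestamp column."""
    out = dict(schema)          # copy so the upstream schema is not mutated
    out[column] = "timestamp"   # new column goes after the existing ones
    return out


def propagate(source_schema, transforms):
    """Apply each node's schema rule in order; return every node's output schema."""
    schemas = [dict(source_schema)]
    for transform in transforms:
        schemas.append(transform(schemas[-1]))
    return schemas


# Source node: CSV with columns A, B, and C (all read as strings).
source = {"A": "string", "B": "string", "C": "string"}

# node2: "Add Current Timestamp"
schemas = propagate(source, [add_current_timestamp])

for key, dtype in schemas[-1].items():
    print(f"{key}\t{dtype}")
```

In this model each node's output schema follows from its input schema alone, so no data has to be read to know that node2 emits a fourth column.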

Also, Data Preview doesn't work on Glue 4.0, so in that case I can't use the visual editor at all.

Has anyone else run into this and found a solution other than using code instead of the visual editor?

Thanks!

1 Answer
Accepted Answer

Hello,

I understand that when you add a new column with the "Add Current Timestamp" visual transform, the column does not appear in the node's output schema until you run Data Preview and click the "Use datapreview schema" button on the Output schema tab.

As of now, this is the default behaviour of the service. I agree that it is a limitation: it is not very intuitive, and it is time-consuming because each node has to be previewed separately when there are several.

Thank you for providing your valuable feedback on the service. I have raised a feature request with the service team on your behalf. While I am unable to comment on if or when this feature may be released, I request that you keep an eye on our What's New and Blog pages for any new feature announcements.

[1] https://aws.amazon.com/new/
[2] https://aws.amazon.com/blogs/aws/

There also seems to be a manual way of adding a column to the output schema without running Data Preview: on the Output schema tab, you can click Edit and then add a root key. Please check whether it helps your use case; if not, I will pass that on as feedback to the service team.

Regarding Data Preview not working on Glue 4.0: this feature is currently in the pipeline and should be available soon. I can't comment on the ETA; please keep an eye on our What's New and Blog pages for updates.

Thank you!

AWS
SUPPORT ENGINEER
answered a year ago
  • In addition, you can build your visual job using Glue 3.0 so that you can use the preview, and then switch to Glue 4.0 when you want to do end-to-end testing. In the vast majority of cases, the job should behave the same on both versions.

  • Thank you, this is helpful. Unfortunately, the edit-and-add-root-key method doesn't seem to work. It only allows you to add columns at the top of the dataset, which isn't where current_timestamp should be located, so the column is null in the output file. But I appreciate you raising the feature request!
