Bug report - Glue job: sometimes the Relationalize function doesn't recognize the temporary path


Need AWS tech team's help here.

I've used my job's temporary path, retrieved by the getResolvedOptions function, as the staging_path of the Relationalize function. I've found that the job fails sometimes - meaning NOT REGULARLY - because it can't retrieve the staged table after the Relationalize function executes.

For better understanding, I've added some explanation and code below. Please advise if you see anything wrong, and kindly confirm that we can keep using arguments retrieved by the getResolvedOptions function.

[Code 1]

args = getResolvedOptions(sys.argv, [..., "TempDir", ...])
...
# the name of the target field to be relationalized is "params"

flatten_dyc = dyc["post_log"].relationalize(
        root_table_name = 'root',
        staging_path = args["TempDir"],
        transformation_ctx = 'flatten_dyc'
)

flatten_dyc["root"].printSchema()
flatten_dyc["root_params"].printSchema()

This morning, I ran it and got the result below.

[Screenshot: wrong output - flatten_dyc["root_params"] prints an empty schema]

flatten_dyc["root_params"] is empty despite it should have had id field at least to join with flatten_dyc["root"] table.

[Code 2]

So I tried the same script with a hard-coded staging_path (please refer to the code below) and found the job read the staged tables - flatten_dyc["root"] - with all fields successfully.

...

flatten_dyc = dyc["post_log"].relationalize(
        root_table_name = 'root',
        staging_path = "s3://temp-glue-info/"
        transformation_ctx = 'flatten_dyc'
)

flatten_dyc["root"].printSchema()
flatten_dyc["root_params"].printSchema()

[Screenshot: correct output - both root and root_params schemas print with all fields]

My questions are:

1/ Why couldn't the function read the staged table properly when the path was soft-coded?

2/ Moreover, when I ran [Code 1] again, flatten_dyc["root_params"] was read successfully. That means the function is not reliable. Can you look into this?

hyunie
asked a year ago · 214 views
1 Answer
Accepted Answer

When you run relationalize, the job saves the child tables under a uuid folder created under the stage folder.
Either it's failing to write those files or something is deleting them.
I would suggest using a specific staging folder (better than a bucket or the base job temp directory) and checking the folder as the job runs (maybe put a sleep between steps so you can see the progress).
Also, consider specifying a bucket/path that you know no other job is using so nobody should delete from it.
Ultimately, you might need to enable S3 audit logging to monitor whether the files were added or deleted.
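
For example, something along these lines (just a sketch; the bucket and prefix are placeholders, and the sleep and listing are only there so you can inspect the staging prefix while the job runs):

import time
import boto3

# Dedicated staging prefix that only this job uses (placeholder values).
staging_bucket = "my-glue-bucket"
staging_prefix = "relationalize-staging/run-001/"
staging_path = "s3://" + staging_bucket + "/" + staging_prefix

flatten_dyc = dyc["post_log"].relationalize(
        root_table_name = 'root',
        staging_path = staging_path,
        transformation_ctx = 'flatten_dyc'
)

# Pause so you can open the S3 console and check the uuid folder(s)
# relationalize created under the staging prefix while the job is still running.
time.sleep(120)

# Or list what was actually written under the staging prefix from the job itself.
s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket=staging_bucket, Prefix=staging_prefix)
for obj in resp.get("Contents", []):
    print(obj["Key"])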

AWS
EXPERT
answered a year ago
  • @Gonzalo Herreros I'll try following your suggestion. Could you explain a bit more why you recommended using a different staging path instead of the base job temp dir, and why there is any chance the staged table gets deleted by other jobs even though there's no command to delete it? Thanks!

  • Normally the temporary dir is fine, but in your case it's better to move it somewhere else to eliminate the possibility of it being deleted by some other job (see the sketch below).
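
For example (a sketch only; the --staging_dir job parameter name is something you would add yourself, it is not a built-in argument):

import sys
from awsglue.utils import getResolvedOptions

# Resolve a dedicated staging location passed as its own job parameter
# (pass it to the job as --staging_dir; the name is just an example).
args = getResolvedOptions(sys.argv, ["JOB_NAME", "staging_dir"])

flatten_dyc = dyc["post_log"].relationalize(
        root_table_name = 'root',
        staging_path = args["staging_dir"],  # a path no other job writes to or cleans up
        transformation_ctx = 'flatten_dyc'
)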
