Hello,
I am trying to add a column to a data set as part of a transformation in an AWS Glue job. I am developing my script locally using interactive sessions and in my session I can see that the code I have written is adding the new column. However, when I write that data and view the resulting data, the column is not added.
Here is a brief walkthrough of how I am implementing the add-column transformation:
- I am working with a dynamic frame assigned to a variable named ChangeSchema_node1703083178011.
- I convert the dynamic frame to a Spark data frame by calling the toDF method on the ChangeSchema_node1703083178011 variable and assign the result to the sparkDf variable.
- I then call the withColumn method on the sparkDf variable, passing in two arguments, and assign the result to the dfNewColumn variable.
- I then convert back to a dynamic frame using the DynamicFrame.fromDF method and assign the result to the dyF variable.
sparkDf = ChangeSchema_node1703083178011.toDF()
dfNewColumn = sparkDf.withColumn("test_col", lit(None))
dyF = DynamicFrame.fromDF(dfNewColumn, glueContext, "convert")
As I said, I can verify that the column has been added by calling the show method on the dyF variable and seeing the printed result in my interactive session. However, when I write the dynamic frame to produce my output data files with the following code:
AmazonS3_node1702921197069 = glueContext.write_dynamic_frame.from_options(
    frame=dyF,
    connection_type="s3",
    format="glueparquet",
    connection_options={
        "path": "s3://smedia-data-processing-dev/google/cron_name/",
        "partitionKeys": [],
    },
    format_options={"compression": "snappy"},
    transformation_ctx="AmazonS3_node1702921197069",
)
job.commit()
...the job runs successfully, but the resulting output data does not include the added column.
However, interestingly, when I write an actual value to the added column instead of None in this line
dfNewColumn = sparkDf.withColumn("test_col", lit(None))
such as a string value like "hello", as in the following:
dfNewColumn = sparkDf.withColumn("test_col", lit("hello"))
then the output data does include the added column.