Change the file format written to the temp folder when using Glue write_dynamic_frame.from_options


Hi, I am reading a table from MySQL RDS and trying to write it to Redshift. Glue write_dynamic_frame.from_options implicitly writes the table data to a temporary folder before inserting it into Redshift. The job fails; when I check the error in the Redshift stl_load_errors table, it says 'unknown boolean format'.

I went to check those files in the temporary folder, and they are written in CSV format. When I opened one of them, the data in some rows was shifted between the columns, so some text data ends up in the boolean column.

Is there a way to change how write_dynamic_frame.from_options writes data to the temporary folder? Maybe changing CSV to another format like JSON or Parquet, something that does not mess up the data.

I tried adding format="json" to write_dynamic_frame.from_options, but it gave an error (com.amazonaws.services.glue.util.NonFatalException: Formats not supported for SparkSQL data sinks. Got json).

Then I tried adding "format": "json" to the connection_options parameter, but it gave me an error (TypeError: getSink() got multiple values for argument 'format'), although format is not specified anywhere else.

Code below. Thanks.

AmazonRedshift_node3 = glueContext.write_dynamic_frame.from_options(
    frame=dyn_frame,
    connection_type="redshift",
    connection_options={
        "redshiftTmpDir": "s3://aws-glue-assets-8942440245-eu-west-1/temporary/",
        "useConnectionProperties": "true",
        "dbtable": "dev.udt_order_temp",
        "connectionName": "Glue-to-redshift",
        "aws_iam_user": "arn:aws:iam:80423242422:role/glue_iam",
    },
    transformation_ctx="AmazonRedshift_node3",
)

asked a year ago · 606 views
1 Answer

No, you cannot change that because the "COPY" command on Redshift will expect CSV.
What you can do is fix the data before writing so it respects the format and Redshift can load it.
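
For example, a minimal sketch of that pre-write cleanup, assuming the root cause is a choice type or embedded newlines in string columns (the column name "is_active" is hypothetical, not taken from your job):

from awsglue.dynamicframe import DynamicFrame
from pyspark.sql import functions as F

# Resolve any ambiguous (choice) columns to a single type.
# "is_active" is a hypothetical column name used for illustration.
resolved = dyn_frame.resolveChoice(specs=[("is_active", "cast:boolean")])

# Strip characters that could break the intermediate CSV
# (embedded newlines / carriage returns) from every string column.
df = resolved.toDF()
for col_name, col_type in df.dtypes:
    if col_type == "string":
        df = df.withColumn(col_name, F.regexp_replace(F.col(col_name), "[\\r\\n]", " "))

cleaned = DynamicFrame.fromDF(df, glueContext, "cleaned")
# Then pass `cleaned` instead of `dyn_frame` to write_dynamic_frame.from_options.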

AWS
EXPERT
answered a year ago
  • The data in the dynamic frame is accurate; I inspected the distinct values of the boolean fields and they are only true or false (so there is nothing for me to correct there). But once the write_dynamic_frame.from_options job starts, it writes the data into the temporary folder as CSV, and the data in that CSV is not accurate (this is the part I cannot control). I opened one CSV file and, for some rows, the data does not appear in the correct columns (some rows have text in the boolean field).

  • That's odd. I wonder if something is breaking the CSV, or if the DynamicFrame is not really correct. Have you checked the schema to make sure there are no choice types? What happens if you print or fetch the specific rows that have issues in the CSV? A sketch of that check is below.
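
A quick way to do both checks, sketched under the assumption that "comments" is a hypothetical free-text column that might contain stray newlines or delimiters:

from pyspark.sql import functions as F

# Print the DynamicFrame schema; ambiguous columns show up as choice types.
dyn_frame.printSchema()

# Look for string values containing embedded newlines or commas, which
# would shift columns in the temporary CSV files.
# "comments" is a hypothetical column name used for illustration.
df = dyn_frame.toDF()
df.filter(F.col("comments").rlike("[\\r\\n,]")).show(20, truncate=False)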
