AWS Glue: combining multiple inputs into a single output CSV


I have an AWS Glue customer running a POC who found that each job creates a separate output CSV file. They want to be able to create a single CSV output file that combines multiple input jobs or files. Any guidance?

1 Answer

Hi, could you please clarify:

  1. Is it AWS Glue or Glue DataBrew?

  2. Can you describe the flow in the job in more detail?

When it comes to combining multiple files into a single output, there are a couple of possible methods:

  1. Joining the input files (if the files share a common key and the goal is to combine the fields of both files).
  2. Using a union (if the files have the same schema and the goal is simply to append one file to another).
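
The two methods above can be sketched in PySpark, which is what a Glue Spark job runs on. This is only a minimal sketch, assuming the inputs are already loaded as Spark DataFrames (a Glue DynamicFrame can be converted with toDF()); the function names and the `key` parameter are hypothetical and not part of any Glue API.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # only needed for type hints; a Glue job provides pyspark at runtime
    from pyspark.sql import DataFrame


def combine_by_join(left: "DataFrame", right: "DataFrame", key: str) -> "DataFrame":
    """Combine the fields of two inputs that share a common key column.

    Hypothetical helper: `key` is assumed to exist in both inputs.
    An inner join keeps only the rows whose key appears in both.
    """
    return left.join(right, on=key, how="inner")


def combine_by_union(first: "DataFrame", second: "DataFrame") -> "DataFrame":
    """Append the rows of one input to another that has the same schema.

    unionByName matches columns by name rather than by position, so a
    different column order in the two files does not scramble the data.
    """
    return first.unionByName(second)
```

Either function returns one DataFrame, which can then be written out through a single target.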

Once they have combined the files, they can use a single Target node to write the result out. Both AWS Glue and Glue DataBrew run on top of Spark, so even the output of a single target can be split into multiple files. If they need one single file:

  1. In AWS Glue, they could add a step that reduces the number of Spark partitions to 1 using the function coalesce(1).
  2. In DataBrew, unless this has changed recently, that option is not available; please check this other question/answer.
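
For the AWS Glue case, the coalesce(1) step might look roughly like the sketch below. It assumes the combined data is a Spark DataFrame; the function name, the output path, the header option and the overwrite mode are illustrative assumptions rather than anything the job requires.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # only needed for type hints; a Glue job provides pyspark at runtime
    from pyspark.sql import DataFrame


def write_single_csv(df: "DataFrame", output_path: str) -> None:
    """Write `df` as CSV using a single Spark partition.

    Hypothetical helper: `output_path` is an assumed location such as an
    S3 prefix. With one partition, Spark emits one part file under it.
    """
    (
        df.coalesce(1)             # collapse to one partition, hence one output file
        .write.mode("overwrite")   # assumption: replacing earlier output is acceptable
        .option("header", "true")  # assumption: a header row is wanted
        .csv(output_path)
    )
```

Note that Spark treats the path as a directory and names the file inside it itself (part-00000-...csv), so a specific file name needs a rename step afterwards. Because coalesce(1) sends all rows through a single task, this suits modest output sizes better than very large ones.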

Hope this helps.

AWS
EXPERT
answered 2 years ago
