Glue ETL job writes part-r-00 files to the same bucket as my input. Any way to change this?


I read files from an S3 bucket, convert them to a Spark DataFrame, transform it, convert back to a Dynamic DataFrame, and then write to the Data Catalog. This creates a bunch of part-r-00 files in the same bucket as my input, so my script then tries to read and process those files as well! Does it have to create these files? Is it possible to set a different bucket for them? If not, is it possible to have my ETL job read only files that end in .csv?

S3bucket_node1 = glueContext.create_dynamic_frame.from_options(
    format_options={"quoteChar": '"', "withHeader": True, "separator": ","},
    connection_type="s3",
    format="csv",
    connection_options={"paths": ["s3://bpf-load-forecast/lfo_data/"], "recurse": True},
    transformation_ctx="S3bucket_node1",
)
# ...
# convert from Dynamic DataFrame to Spark DataFrame
# ...
# transformations
# ...
# convert from Spark DataFrame to Dynamic DataFrame
# ...
DataCatalogtable_node2 = glueContext.write_dynamic_frame.from_catalog(
    frame = dynamic_df,
    database = db_name,
    table_name = tbl_name,
    transformation_ctx = "DataCatalogtable_node2",
)
bfeeny
asked 2 years ago · 1166 views
2 Answers
Accepted Answer

I figured this out. When the Glue Data Catalog asked for my "Data Store" folder (which is where it stores the part-r files), I had entered the same folder as my S3 source files. I simply changed this to a new, empty folder, and that fixed it.
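If moving the table location is not an option, the other idea from the question (having the job skip anything that is not an input file) might be handled with the `exclusions` connection option of the S3 source, which takes a JSON list of glob patterns encoded as a string. This is only a rough sketch: the helper name `build_s3_source_options` and the pattern `**/part-r-*` are my own assumptions, so check the pattern against your real object keys before relying on it.

```python
import json


def build_s3_source_options(path, exclude_globs):
    """Build connection_options for an S3 read that skips matching keys."""
    return {
        "paths": [path],
        "recurse": True,
        # Glue expects `exclusions` as a JSON-encoded list inside a string.
        "exclusions": json.dumps(list(exclude_globs)),
    }


# Assumed pattern: skip the part-r-* output files under the input prefix.
source_options = build_s3_source_options(
    "s3://bpf-load-forecast/lfo_data/", ["**/part-r-*"]
)
print(source_options["exclusions"])

# In the job, these options would replace the inline dict:
# S3bucket_node1 = glueContext.create_dynamic_frame.from_options(
#     ...
#     connection_options=source_options,
#     ...
# )
```

Keeping input and output in separate folders, as in the answer above, is still the simpler fix; the exclusion only stops the job from re-reading files that are already there.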

bfeeny
answered 2 years ago
AWS
EXPERT
reviewed 2 years ago

I am facing the same challenge now, but I don't see the "Data Store" section in the new interface. Can you kindly share some pointers?

Mike
answered 24 days ago
  • Actually, I have been able to resolve this with a classifier that uses OpenCSVSerDe and identifies the delimiter, quote character, etc. in the file.
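For anyone trying the classifier route, here is a rough sketch of what such a custom CSV classifier request might look like with boto3. The classifier name, the delimiter, the quote symbol, and the header setting are all assumptions on my part, not the values from the comment above, and the actual API call is left commented out because it needs AWS credentials.

```python
def build_csv_classifier_request(name, delimiter=",", quote_symbol='"'):
    """Build keyword arguments for the Glue create_classifier call."""
    return {
        "CsvClassifier": {
            "Name": name,
            "Delimiter": delimiter,
            "QuoteSymbol": quote_symbol,
            # Assumes the files have a header row, as in the question.
            "ContainsHeader": "PRESENT",
            "Serde": "OpenCSVSerDe",
        }
    }


# "my-csv-classifier" is a hypothetical name.
request = build_csv_classifier_request("my-csv-classifier")
print(request)

# With AWS credentials configured, the request could be sent like this:
# import boto3
# boto3.client("glue").create_classifier(**request)
```

The classifier still has to be attached to a crawler before it affects how the table is defined in the Data Catalog.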
