Glue job - issues when reading from a Glue Catalog table using a DynamicFrame

0

Hi All, I have an issue when running my Glue job. I landed my pipe-delimited CSV file in an S3 bucket, and after running the crawler pointed at the folder where the file is placed, a Glue Catalog table is created.

However, when I try to read the data from the catalog table in a Glue job (code below) for additional processing and conversion to Parquet, it is not picking up all the records.

dyf = glueContext.create_dynamic_frame.from_catalog(
    database=DATABASE,
    table_name=table_name,
    transformation_ctx="dyf-" + table_name,
)
rows = dyf.count()
print(f"DynamicFrame record count: {rows}")

Can someone please suggest what could be the reason for the missing records? I see that three columns in the catalog table have an incorrect data type (bigint in place of string). I manually corrected the data types and set infer_schema = True in the code above, but the job is still not picking up the correct number of records.

  • How big is the difference? Is it possible you have job bookmarks enabled? How many records do you get if you run spark.table(f"{DATABASE}.{table_name}").count()?

  • Hi,

    The Glue Data Catalog schema is not a constraint; it is informational only. You can force the schema in the job using Glue's resolveChoice function; you have to set it manually.

Pradeep
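Putting the two comments together, a minimal diagnostic sketch might look like this. It assumes the same DATABASE, table_name, glueContext, and spark objects as the job above; the column names passed to resolveChoice are placeholders for the three mistyped columns:

```python
# Sketch only; runs inside a Glue job with glueContext/spark already defined.

# 1) Read WITHOUT a transformation_ctx. Omitting it disables job bookmarks
#    for this read, so files processed by an earlier run are not skipped.
dyf_full = glueContext.create_dynamic_frame.from_catalog(
    database=DATABASE,
    table_name=table_name,
)
print(f"DynamicFrame count (no bookmark): {dyf_full.count()}")

# 2) Compare against a plain Spark read of the same catalog table,
#    as the first commenter suggests.
spark_count = spark.table(f"{DATABASE}.{table_name}").count()
print(f"Spark table count: {spark_count}")

# 3) Force the ambiguous columns to string with resolveChoice instead of
#    hand-editing the catalog ("col_a", "col_b" are placeholder names).
dyf_typed = dyf_full.resolveChoice(
    specs=[("col_a", "cast:string"), ("col_b", "cast:string")]
)
```

If the two counts already differ, the problem is in how the file is parsed rather than in the catalog types.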
Asked a year ago · 1432 views
1 answer
0

It sounds like your CSV has formatting issues. I would try to parse/validate it some other way to identify what's wrong. Sometimes invisible characters can break the parsing, but since records are missing, my guess is that a quote is not properly closed or escaped.
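One quick way to check for this, outside of Glue, is a plain-Python sketch: count the physical lines in the file and compare with how many records a CSV parser actually yields. An unclosed quote swallows line breaks and merges records, which shows up as a mismatch (the delimiter and sample data here are illustrative):

```python
import csv
import io

def count_records(text, delimiter="|"):
    """Return (physical_lines, parsed_records) for a delimited text blob.

    A mismatch usually means an unclosed or unescaped quote is swallowing
    newlines, which is one way records 'disappear' during parsing.
    """
    physical_lines = text.count("\n")
    parsed_records = sum(1 for _ in csv.reader(io.StringIO(text),
                                               delimiter=delimiter))
    return physical_lines, parsed_records

# Well-formed input: 3 lines -> 3 records.
good = 'a|b|c\n1|2|3\n4|5|6\n'
print(count_records(good))   # (3, 3)

# An unclosed quote makes the parser treat the newline as field data,
# merging two physical lines into one record.
bad = 'a|b|c\n1|"2|3\n4|5|6\n'
print(count_records(bad))    # (3, 2)
```

Running this over the real file (streamed from S3) and seeing parsed_records come up short would confirm a quoting problem rather than a catalog-schema problem.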

AWS
Expert
Answered a year ago
