MSCK REPAIR TABLE behaves differently when executed via Spark Context vs Athena Console/boto3


I have a Glue ETL job that creates partitions as it runs:

    additionalOptions = {"enableUpdateCatalog": True, "updateBehavior": "LOG"} 
    additionalOptions["partitionKeys"] = ["year", "month", "day"]

I don't have the job update the Data Catalog, because doing so changes all of my table's data types. Instead, once the job is done, I run MSCK REPAIR TABLE to get the correct partition information into the Data Catalog. If I do this inside the Glue ETL job using the Spark context, like so:

    spark.sql("use gp550_load_database_beta")
    spark.sql("msck repair table gp550_load_table_beta").show()

The following happens:

- The Serde properties of my table are updated with "serialization.format : 1".
- The table properties are updated with "EXTERNAL : TRUE" and "spark.sql.partitionProvider : catalog".
- ALL data types in my table are set to "string" with a comment of "from deserializer".

Basically it makes a mess.
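One way to pin down exactly which columns change is to snapshot the table definition from the Data Catalog before and after the Spark call and compare the two. The following is only a minimal sketch: the helper names `column_types` and `changed_columns` are hypothetical, and it assumes the response shape returned by the boto3 Glue client's `get_table` call.

```python
def column_types(get_table_response):
    """Map column name -> (type, comment) from a Glue get_table response."""
    columns = get_table_response["Table"]["StorageDescriptor"]["Columns"]
    return {c["Name"]: (c["Type"], c.get("Comment", "")) for c in columns}


def changed_columns(before, after):
    """Return {name: (old, new)} for columns whose type or comment differs."""
    return {
        name: (before.get(name), after.get(name))
        for name in sorted(set(before) | set(after))
        if before.get(name) != after.get(name)
    }


# Hypothetical usage inside the Glue job (requires boto3 and AWS credentials):
#   glue = boto3.client("glue")
#   args = {"DatabaseName": "gp550_load_database_beta",
#           "Name": "gp550_load_table_beta"}
#   before = column_types(glue.get_table(**args))
#   spark.sql("msck repair table gp550_load_table_beta")
#   after = column_types(glue.get_table(**args))
#   print(changed_columns(before, after))
```

An empty result would mean the column definitions were left alone; any entry shows the old and new type and comment side by side.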

If I instead run MSCK REPAIR TABLE from boto3, or manually from the Athena Console, there are no issues. No Serde properties, table properties, or data types are changed; it simply adds my partitions to the Data Catalog, which is what I want. I do it like so in my ETL job:

    client = boto3.client('athena')
    sql = 'MSCK REPAIR TABLE gp550_load_database_beta.gp550_load_table_beta'
    context = {'Database': 'gp550_load_database_beta'}

    client.start_query_execution(QueryString = sql, 
                                 QueryExecutionContext = context,
                                 ResultConfiguration= { 'OutputLocation': 's3://aws-glue-assets-977465404123-us-east-1/temporary/' })
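Note that `start_query_execution` is asynchronous: it returns a query ID right away, before the repair has finished. If later steps in the job depend on the partitions being registered, the call can be wrapped in a polling loop. This is a minimal sketch, not a definitive implementation; the function name `run_athena_query` and its `poll_seconds` / `max_polls` parameters are hypothetical, and `client` is assumed to be a boto3 Athena client.

```python
import time


def run_athena_query(client, sql, database, output_location,
                     poll_seconds=2.0, max_polls=150):
    """Start an Athena query and wait until it reaches a terminal state.

    Returns "SUCCEEDED" on success; raises RuntimeError if the query
    fails, is cancelled, or does not finish within max_polls checks.
    """
    response = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = response["QueryExecutionId"]

    for _ in range(max_polls):
        execution = client.get_query_execution(QueryExecutionId=query_id)
        status = execution["QueryExecution"]["Status"]
        state = status["State"]
        if state == "SUCCEEDED":
            return state
        if state in ("FAILED", "CANCELLED"):
            reason = status.get("StateChangeReason", "no reason reported")
            raise RuntimeError(
                f"Athena query {query_id} ended in state {state}: {reason}"
            )
        # Still QUEUED or RUNNING: wait before checking again.
        time.sleep(poll_seconds)

    raise RuntimeError(
        f"Athena query {query_id} did not finish after {max_polls} polls"
    )


# Hypothetical usage (requires boto3 and AWS credentials):
#   client = boto3.client("athena")
#   run_athena_query(
#       client,
#       "MSCK REPAIR TABLE gp550_load_database_beta.gp550_load_table_beta",
#       "gp550_load_database_beta",
#       "s3://aws-glue-assets-977465404123-us-east-1/temporary/",
#   )
```

Passing the client in as an argument keeps the helper easy to exercise with a stub, without touching AWS.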

Why does it behave so differently? Is it because somehow I need to tell Spark to work with HIVE? I had thought that since I already had a spark context it would be easy to use that to kick off the MSCK REPAIR TABLE, but obviously I was surprised at the result!

bfeeny
Asked 2 years ago · 111 views
No answers
