I can't use governed tables with Glue
I'm trying to use Lake Formation governed tables, but I've run into some issues when I use them with Glue. I followed these steps:
- First, I created a table in a database in Lake Formation.
- Then I created a Glue job that turns a non-governed table into a governed table, using the script AWS provides at https://docs.aws.amazon.com/lake-formation/latest/dg/transactions-code-examples. The script is:
```python
import sys

from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
logger = glueContext.get_logger()
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

db = "db_source_zone"
tbl = "software_review"
output_db = "db_curated_zone"
output_tbl = "software_review_gov"

# Start a Lake Formation transaction (False = read/write, not read-only).
tx_id = glueContext.start_transaction(False)

# Script generated for node AWS Glue Data Catalog
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database=db,
    table_name=tbl,
    transformation_ctx="datasource0",
)

nrows = datasource0.toDF().count()
logger.info(f"Number of rows: {nrows}")

# Write into the governed table inside the transaction.
datasink0 = glueContext.write_dynamic_frame.from_catalog(
    frame=datasource0,
    database=output_db,
    table_name=output_tbl,
    additional_options={
        "useGlueParquetWriter": True,
        "transactionId": tx_id,
        "partitionKeys": ["year_month"],
        "callDeleteObjectsOnCancel": "true",
    },
)

# glueContext.purge_table(output_db, output_tbl)
# wr.lakeformation.commit_transaction(tx_id)
is_committed = glueContext.commit_transaction(tx_id)
logger.info(f"Commit status: {is_committed}")
job.commit()
```
- The job runs successfully and I can see the commit result is True, but when I query the table in Athena it is empty. A second script I found was this one:
```python
import sys

from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Imports and job-argument parsing added for completeness; the original
# snippet started at the SparkContext setup.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
logger = glueContext.get_logger()
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

db = "db_source_zone"
tbl = "software_review"
output_db = "db_curated_zone"
output_tbl = "software_review_gov"

tx_id = glueContext.start_transaction(False)

# Script generated for node AWS Glue Data Catalog
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database=db,
    table_name=tbl,
    transformation_ctx="datasource0",
)

nrows = datasource0.toDF().count()
logger.info(f"Number of rows: {nrows}")

dest_path = "s3://bbs-data-lake/curated_zone/software_review/"

# Write through a Glue sink bound to the transaction instead of
# write_dynamic_frame.from_catalog.
sink = glueContext.getSink(
    connection_type="s3",
    path=dest_path,
    enableUpdateCatalog=True,
    updateBehavior="UPDATE_IN_DATABASE",
    transactionId=tx_id,
    additional_options={
        "useGlueParquetWriter": True,
        "partitionKeys": ["year_month"],
        "callDeleteObjectsOnCancel": "true",
    },
)
sink.setFormat("glueparquet")
sink.setCatalogInfo(catalogDatabase=output_db, catalogTableName=output_tbl)

try:
    sink.writeFrame(datasource0)
    glueContext.commit_transaction(tx_id)
except Exception:
    glueContext.cancel_transaction(tx_id)
    raise

job.commit()
```
In this case the job does not run at all; it fails with an error related to writing to S3. I tried different configurations, such as changing the Glue version and toggling the useGlueParquetWriter parameter, but without success. Note that if I update the governed tables with AWS Data Wrangler instead, it works perfectly, but then I can only use pandas, not Spark.
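For reference, the AWS Data Wrangler path that does work looks roughly like this (a minimal sketch, assuming awswrangler 2.x with Lake Formation support; the database, table, and S3 path names are taken from the scripts above):

```python
import awswrangler as wr

# Read the source (non-governed) table into pandas via Athena.
df = wr.athena.read_sql_query(
    "SELECT * FROM software_review",
    database="db_source_zone",
)

# Write into the governed table inside a Lake Formation transaction.
tx_id = wr.lakeformation.start_transaction(read_only=False)
try:
    wr.s3.to_parquet(
        df=df,
        path="s3://bbs-data-lake/curated_zone/software_review/",
        dataset=True,
        database="db_curated_zone",
        table="software_review_gov",
        table_type="GOVERNED",
        partition_cols=["year_month"],
        transaction_id=tx_id,
    )
    wr.lakeformation.commit_transaction(tx_id)
except Exception:
    wr.lakeformation.cancel_transaction(tx_id)
    raise
```

Since this goes through the same Lake Formation transaction APIs and the data does show up in Athena, the governed table itself seems fine; the problem appears to be in how the Glue sinks interact with the transaction.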
The stack trace in the logs should give more information about the cause of that InvocationTargetException.
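One way to surface it is to wrap the write in the job itself, log the full traceback, and cancel the transaction so no partial data is left behind. A minimal sketch, reusing sink, datasource0, tx_id, logger, and glueContext from the second script (the Java-side stack of the InvocationTargetException will still be in the Glue driver error log stream in CloudWatch):

```python
import traceback

try:
    sink.writeFrame(datasource0)
    glueContext.commit_transaction(tx_id)
except Exception:
    # Capture the Python-side traceback in the job log; the nested
    # Java stack trace lands in the CloudWatch error log stream.
    logger.error(traceback.format_exc())
    glueContext.cancel_transaction(tx_id)
    raise
```

For the first script, where the commit succeeds but Athena returns no rows, it may also be worth checking what the transaction actually registered against the governed table. A sketch using boto3 (this assumes the Lake Formation client's get_table_objects call, which lists the S3 objects a governed table considers part of its data):

```python
import boto3

lf = boto3.client("lakeformation")

# List the S3 objects Lake Formation has registered for the governed
# table; an empty list would explain the empty Athena result.
resp = lf.get_table_objects(
    DatabaseName="db_curated_zone",
    TableName="software_review_gov",
)
for partition in resp.get("Objects", []):
    print(partition["PartitionValues"], len(partition["Objects"]))
```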