AWS Glue ETL from S3 to RDS SQL Server succeeds but there is no data in the table


My job succeeds with no error, but when I query the table, it's empty. The CSV is a single column with no header. The table has a single column, record_id.

SELECT TOP 1000 [record_id]
FROM [h3-glue-db].[dbo].[h3_load]

Here's the job script:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Script generated for node S3 bucket
S3bucket_node1 = glueContext.create_dynamic_frame.from_options(
    format_options={
        "quoteChar": '"',
        "withHeader": False,
        "separator": ",",
        "optimizePerformance": False,
    },
    connection_type="s3",
    format="csv",
    connection_options={"paths": ["s3://h3-public/testfile2.csv"]},
    transformation_ctx="S3bucket_node1",
)

# Script generated for node ApplyMapping
ApplyMapping_node2 = ApplyMapping.apply(
    frame=S3bucket_node1,
    mappings=[("col0", "string", "record_id", "string")],
    transformation_ctx="ApplyMapping_node2",
)

# Script generated for node SQL Server table
SQLServertable_node3 = glueContext.write_dynamic_frame.from_catalog(
    frame=ApplyMapping_node2,
    database="h3-sql-database",
    table_name="h3_glue_db_dbo_h3_load",
    transformation_ctx="SQLServertable_node3",
)

job.commit()
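One thing worth noting about the script above: the sink can report success even when the frame it receives has zero rows. A minimal sanity check, sketched here as a plain helper function (not a Glue API; it only assumes the frame exposes a count() method, which both Glue DynamicFrames and Spark DataFrames do), would fail the job loudly before the write instead of succeeding silently:

```python
def assert_nonempty(frame, name):
    """Raise if `frame` has no rows.

    `frame` is any object with a count() method; Glue DynamicFrame
    and Spark DataFrame both provide one.
    """
    n = frame.count()
    if n == 0:
        raise RuntimeError(f"{name} is empty - nothing would be written to the sink")
    return n

# In the job above you might call, just before the write:
# assert_nonempty(S3bucket_node1, "S3bucket_node1")
# assert_nonempty(ApplyMapping_node2, "ApplyMapping_node2")
```

If the source read had silently returned zero records (as turned out to be the case here), this would have surfaced the problem as a job failure rather than an empty table.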

  • Have you verified if the S3 endpoint to the file is correct?

  • Have you done an ApplyMapping_node2.show() to make sure it has data loaded and the mapping was correctly applied?

  • vtjean Yes, the endpoint is correct; I verified it.

    Gonzalo Herreros ApplyMapping_node2.show() does not display anything; does that mean it is not mapping? I wrapped the .show() call in Start/End log statements; the log output is below (newest entries first).

    23/05/22 16:01:23 INFO DAGScheduler: Job 0 finished: save at JDBCUtils.scala:897, took 0.020324 s
    23/05/22 16:01:19 INFO GlueContext: The DataSink in action for the given format/connectionType (sqlserver) is com.amazonaws.services.glue.sinks.SQLServerDataSink
    23/05/22 16:01:19 INFO GlueContext: Glue secret manager integration: secretId is not provided.
    23/05/22 16:01:19 INFO GlueContext: Using location: h3-glue-db.dbo.h3_load
    23/05/22 16:01:19 INFO GlueContext: getCatalogSink: catalogId: null, nameSpace: h3-sql-database, tableName: h3_glue_db_dbo_h3_load, isRegisteredWithLF: false
    23/05/22 16:01:18 WARN main: End - .show()
    23/05/22 16:01:17 WARN main: Start - .show()
    23/05/22 16:01:16 INFO GlueContext: The DataSource in action : com.amazonaws.services.glue.HadoopDataSource
    23/05/22 16:01:15 INFO GlueContext: Glue secret manager integration: secretId is not provided.
    23/05/22 16:01:14 INFO GlueContext: GlueMetrics configured and enabled
    23/05/22 16:01:12 INFO Utils: Successfully started service 'sparkDriver' on port 44775.

asked a year ago · 365 views
2 Answers

Here is the documentation on how to view your AWS Glue Job Logs: https://docs.aws.amazon.com/glue/latest/dg/monitor-continuous-logging-view.html

vtjean (AWS)
answered a year ago

The problem ended up being a permissions issue. This video helped: https://www.youtube.com/watch?v=UUoQAe_NzaA&list=PL7bE4nSzLSWfYAc3q1vEYFi145Mt_DLcF&index=2

Changing my S3 source to a bucket whose name starts with "aws-glue-" solved it. When the AWSGlueServiceRole policy is attached, its S3 access defaults to buckets whose names start with aws-glue-.
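For context, the relevant S3 statement in the AWSGlueServiceRole managed policy is scoped roughly like this (paraphrased, not the exact policy document; check the policy in the IAM console for the authoritative version — the `aws-glue-*` resource pattern is the point):

```json
{
  "Effect": "Allow",
  "Action": ["s3:GetObject", "s3:PutObject"],
  "Resource": ["arn:aws:s3:::aws-glue-*/*"]
}
```

Alternatively, instead of renaming the bucket, you can attach an additional policy to the job's role granting the same actions on your existing bucket's ARN.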

Why it did not trigger an actual error and instead reported the job as succeeded, I don't know.

answered a year ago
EXPERT verified 22 days ago
