AWS Glue ETL from S3 to RDS SQL Server succeeds but there is no data in the table


My job succeeds with no errors, but when I query the table, it's empty. The CSV is a single column with no header, and the table has a single column, record_id:

SELECT TOP 1000 [record_id]
FROM [h3-glue-db].[dbo].[h3_load]

Here's the job script:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Script generated for node S3 bucket
S3bucket_node1 = glueContext.create_dynamic_frame.from_options(
    format_options={
        "quoteChar": '"',
        "withHeader": False,
        "separator": ",",
        "optimizePerformance": False,
    },
    connection_type="s3",
    format="csv",
    connection_options={"paths": ["s3://h3-public/testfile2.csv"]},
    transformation_ctx="S3bucket_node1",
)

# Script generated for node ApplyMapping
ApplyMapping_node2 = ApplyMapping.apply(
    frame=S3bucket_node1, mappings=[("col0", "string", "record_id", "string")], transformation_ctx="ApplyMapping_node2"
)

# Script generated for node SQL Server table
SQLServertable_node3 = glueContext.write_dynamic_frame.from_catalog(
    frame=ApplyMapping_node2,
    database="h3-sql-database",
    table_name="h3_glue_db_dbo_h3_load",
    transformation_ctx="SQLServertable_node3",
)

job.commit()
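One way to confirm whether data is actually being read before the write step is to count the records in each frame. This is a sketch to paste into the job script above, not a standalone script; it reuses the frames defined by the generated code and only runs inside a Glue job:

```python
# Sketch: add after ApplyMapping_node2 is defined in the script above.
# S3bucket_node1 and ApplyMapping_node2 are the frames from the generated
# script; only the count/show calls are new.

# Count records straight off the source. 0 here means the S3 read itself
# produced nothing (wrong path, or the job role cannot read the bucket).
print("source rows:", S3bucket_node1.count())

# Count after the mapping. 0 here with a non-zero source count would mean
# the mapping dropped everything, e.g. the field name "col0" did not match.
print("mapped rows:", ApplyMapping_node2.count())

# show() prints up to n records; an empty frame prints nothing at all,
# which is why a .show() on an empty frame produces no visible output.
ApplyMapping_node2.show(5)
```

If the source count is already 0, the problem is upstream of the mapping and the sink, which matches the symptom of a "successful" job that writes nothing.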

  • Have you verified that the S3 endpoint for the file is correct?

  • Have you run ApplyMapping_node2.show() to make sure it has data loaded and the mapping was applied correctly?

  • vtjean Yes, the endpoint is correct; I verified it.

    Gonzalo Herreros ApplyMapping_node2.show() does not display anything; does that mean it is not mapping? I wrapped the call in "Start - .show()" / "End - .show()" log markers; the output is below:

    23/05/22 16:01:23 INFO DAGScheduler: Job 0 finished: save at JDBCUtils.scala:897, took 0.020324 s
    23/05/22 16:01:19 INFO GlueContext: The DataSink in action for the given format/connectionType (sqlserver) is com.amazonaws.services.glue.sinks.SQLServerDataSink
    23/05/22 16:01:19 INFO GlueContext: Glue secret manager integration: secretId is not provided.
    23/05/22 16:01:19 INFO GlueContext: Using location: h3-glue-db.dbo.h3_load
    23/05/22 16:01:19 INFO GlueContext: getCatalogSink: catalogId: null, nameSpace: h3-sql-database, tableName: h3_glue_db_dbo_h3_load, isRegisteredWithLF: false
    23/05/22 16:01:18 WARN main: End - .show()
    23/05/22 16:01:17 WARN main: Start - .show()
    23/05/22 16:01:16 INFO GlueContext: The DataSource in action : com.amazonaws.services.glue.HadoopDataSource
    23/05/22 16:01:15 INFO GlueContext: Glue secret manager integration: secretId is not provided.
    23/05/22 16:01:14 INFO GlueContext: GlueMetrics configured and enabled
    23/05/22 16:01:12 INFO Utils: Successfully started service 'sparkDriver' on port 44775.
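Note that these log lines are printed newest-first. Reordered oldest-first, they show the driver starting, the source being read, the bracketed .show() producing no output between its markers, and the JDBC save finishing in ~0.02 s, which is consistent with zero rows being written. A small illustrative parser (not part of the original thread) makes the ordering explicit:

```python
from datetime import datetime

# A few of the log lines quoted above (timestamp prefix + message).
log_lines = [
    "23/05/22 16:01:23 INFO DAGScheduler: Job 0 finished: save at JDBCUtils.scala:897, took 0.020324 s",
    "23/05/22 16:01:18 WARN main: End - .show()",
    "23/05/22 16:01:17 WARN main: Start - .show()",
    "23/05/22 16:01:16 INFO GlueContext: The DataSource in action : com.amazonaws.services.glue.HadoopDataSource",
    "23/05/22 16:01:12 INFO Utils: Successfully started service 'sparkDriver' on port 44775.",
]

def ts(line):
    # The driver log here uses a yy/MM/dd HH:mm:ss prefix (17 characters).
    return datetime.strptime(line[:17], "%y/%m/%d %H:%M:%S")

# Sorting oldest-first recovers the real sequence: driver start, source
# read, the bracketed .show() that printed nothing, then the instant save.
for line in sorted(log_lines, key=ts):
    print(line)
```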

asked a year ago · 364 views
2 Answers

Here is the documentation on how to view your AWS Glue Job Logs: https://docs.aws.amazon.com/glue/latest/dg/monitor-continuous-logging-view.html

AWS
vtjean
answered a year ago

The problem ended up being a permissions issue. This video helped: https://www.youtube.com/watch?v=UUoQAe_NzaA&list=PL7bE4nSzLSWfYAc3q1vEYFi145Mt_DLcF&index=2

Changing my S3 source to a bucket whose name starts with "aws-glue-" solved it. When the AWSGlueServiceRole policy is attached, it by default only grants access to buckets whose names start with "aws-glue-".
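For reference, the managed AWSGlueServiceRole policy scopes its S3 object permissions to resources matching aws-glue-* (worth confirming against the current policy document in IAM). Rather than renaming the bucket, another option would be attaching an extra policy to the job role; a sketch for the bucket in this question (bucket name taken from the job script, everything else boilerplate):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::h3-public"]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": ["arn:aws:s3:::h3-public/*"]
    }
  ]
}
```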

Why it did not raise an actual error and instead reported the job as succeeded, I don't know.

answered a year ago
EXPERT
reviewed 22 days ago
