AWS Glue ETL from S3 to RDS Sql Server succeeds but there is no data in the table

0

My job succeeds with no error but when I query the table, its empty The CSV is a single column with no header The table has a single column, record_id

SELECT TOP 1000 [record_id]
  FROM 
 [h3-glue-db].[dbo].[h3_load]

Here's the job script

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Script generated for node S3 bucket
S3bucket_node1 = glueContext.create_dynamic_frame.from_options(
    format_options={
        "quoteChar": '"',
        "withHeader": False,
        "separator": ",",
        "optimizePerformance": False,
    },
    connection_type="s3",
    format="csv",
    connection_options={"paths": ["s3://h3-public/testfile2.csv"]},
    transformation_ctx="S3bucket_node1",
)

# Script generated for node ApplyMapping
ApplyMapping_node2 = ApplyMapping.apply(
    frame=S3bucket_node1, mappings=[("col0", "string", "record_id", "string")], transformation_ctx="ApplyMapping_node2"
)

# Script generated for node SQL Server table
SQLServertable_node3 = glueContext.write_dynamic_frame.from_catalog(
    frame=ApplyMapping_node2,
    database="h3-sql-database",
    table_name="h3_glue_db_dbo_h3_load",
    transformation_ctx="SQLServertable_node3",
)

job.commit()

  • Have you verified if the S3 endpoint to the file is correct?

  • Have you done a ApplyMapping_node2.show() to make sure it has data loaded and the mapping was correctly applied?

  • vtjean Yes the endpoint is correct, I verified

    Gonzalo Herreros ApplyMapping_node2.show() does not display anything, does that mean it is not mapping? I put the statement after the 23/05/22 16:01:18 WARN main: End - .show() below

    23/05/22 16:01:23 INFO DAGScheduler: Job 0 finished: save at JDBCUtils.scala:897, took 0.020324 s 23/05/22 16:01:19 INFO GlueContext: The DataSink in action for the given format/connectionType (sqlserver) is com.amazonaws.services.glue.sinks.SQLServerDataSink 23/05/22 16:01:19 INFO GlueContext: Glue secret manager integration: secretId is not provided. 23/05/22 16:01:19 INFO GlueContext: Using location: h3-glue-db.dbo.h3_load 23/05/22 16:01:19 INFO GlueContext: getCatalogSink: catalogId: null, nameSpace: h3-sql-database, tableName: h3_glue_db_dbo_h3_load, isRegisteredWithLF: false 23/05/22 16:01:18 WARN main: End - .show() 23/05/22 16:01:17 WARN main: Start - .show() 23/05/22 16:01:16 INFO GlueContext: The DataSource in action : com.amazonaws.services.glue.HadoopDataSource 23/05/22 16:01:15 INFO GlueContext: Glue secret manager integration: secretId is not provided. 23/05/22 16:01:14 INFO GlueContext: GlueMetrics configured and enabled 23/05/22 16:01:12 INFO Utils: Successfully started service 'sparkDriver' on port 44775.

asked a year ago339 views
2 Answers
0

Here is the documentation on how to view your AWS Glue Job Logs: https://docs.aws.amazon.com/glue/latest/dg/monitor-continuous-logging-view.html

AWS
vtjean
answered a year ago
0

The problem ended up being permissions problems This video helped https://www.youtube.com/watch?v=UUoQAe_NzaA&list=PL7bE4nSzLSWfYAc3q1vEYFi145Mt_DLcF&index=2

Changing my S3 source to one that started with "aws-glue-" solved it. When the AWSGlueService role is added it is defaulted to buckets that start with aws-glue

Why it did not trigger an actual error and said the job succeeded, I don't know

answered a year ago
profile picture
EXPERT
reviewed 5 days ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions