AWS Glue: input_file_name() returns an empty string when reading from the Data Catalog

We are using a crawler with a custom classifier to parse a fixed-length file. As part of our requirements, we need to extract the input file name for each record. The input files are stored in an S3 folder.

S3 folder ---> Crawler (custom classifier) ---> Data Catalog <--- AWS Glue job (ETL) ---> S3

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql.functions import input_file_name
from awsglue.dynamicframe import DynamicFrame

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="test-poc",
    table_name="test-raw",
    transformation_ctx="datasource0",
    groupFiles='none',
)

# Create a DataFrame and add a new column containing the source file name of every record

dataframe1 = datasource0.toDF().withColumn("filename", input_file_name())
dataframe1.show()

input_file_name() returns an empty string for every row.

Asked 5 months ago, 217 views
1 Answer

input_file_name() is a Spark DataFrame feature. You are creating a DynamicFrame and then converting it with toDF(), and I don't think the source files can be tracked once you do that.
Why not read the data as a DataFrame directly, using spark.table(), spark.sql(), or the GlueContext method for creating DataFrames?

AWS Expert, answered 5 months ago
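
The suggestion in the answer could be sketched roughly as follows. This is only an illustration, not a verified fix: it assumes the job is configured to use the Glue Data Catalog as the Spark metastore (for example via the --enable-glue-datacatalog job parameter) and that Spark can read the table produced by the custom classifier, which was not confirmed in this thread. The function names read_with_source_file and base_name are hypothetical helpers introduced here; the database and table names are the ones from the question.

```python
import posixpath
from urllib.parse import urlparse


def base_name(uri: str) -> str:
    """Return the last path component of a URI such as s3://bucket/dir/file.txt."""
    return posixpath.basename(urlparse(uri).path)


def read_with_source_file(spark, database: str, table: str):
    """Read a catalog table as a Spark DataFrame and tag each row with its source file.

    Hypothetical helper: assumes `spark` is a SparkSession that can resolve
    `database`.`table` through the Glue Data Catalog.
    """
    # Imported here so base_name() stays usable without a Spark installation.
    from pyspark.sql.functions import input_file_name

    # Backticks are needed because the names in the question contain hyphens.
    df = spark.table(f"`{database}`.`{table}`")

    # The column is added on a DataFrame that was read directly by Spark,
    # rather than on one converted from a DynamicFrame.
    return df.withColumn("filename", input_file_name())
```

Inside the job from the question, this might be called as read_with_source_file(spark, "test-poc", "test-raw").show(). Since input_file_name() yields the full object path, base_name() can be used afterwards (for example in a UDF or after collecting rows) if only the file name itself is needed.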
