I get an "Unable to infer schema" error when I run my AWS Glue job to process Parquet or ORC files that I store in Amazon Simple Storage Service (Amazon S3).
Short description
Parquet or ORC files must follow a Hive-style key=value partition path format. If the files use a hierarchical path structure instead, then AWS Glue can't infer the schema and the job fails.
For example, if your AWS Glue job processes files from s3://s3-bucket/parquet-data/, then the files must use the following partitioned format:
s3://s3-bucket/parquet-data/year=2018/month=10/day=10/file1.parquet
If the files use the following non-partitioned format, then the AWS Glue job fails:
s3://s3-bucket/parquet-data/year/month/day/file1.parquet
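To see why the layout matters, the following sketch shows how Hive-style paths encode partition columns as key=value segments. The parse_partitions helper is a hypothetical name for illustration, not part of the AWS Glue API:

```python
# Illustrative only: a Hive-style path carries the partition columns as
# key=value segments, which is what AWS Glue reads during schema inference.
def parse_partitions(s3_path):
    """Return the key=value partition segments of an S3 path as a dict."""
    partitions = {}
    for segment in s3_path.split('/'):
        if '=' in segment:
            key, _, value = segment.partition('=')
            partitions[key] = value
    return partitions

print(parse_partitions('s3://s3-bucket/parquet-data/year=2018/month=10/day=10/file1.parquet'))
# → {'year': '2018', 'month': '10', 'day': '10'}
```

A hierarchical path such as s3://s3-bucket/parquet-data/year/month/day/file1.parquet contains no key=value segments, so there is nothing for AWS Glue to infer partition columns from.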
Resolution
To resolve the "Unable to infer schema" error in AWS Glue, use one of the following methods for your use case.
Restructure your data
Copy the files into a new S3 bucket and use Hive-style partitioned paths. Then, run the job.
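One way to restructure the data is to rewrite each object key into the key=value layout and then copy the objects. The following sketch shows the key rewrite as a pure function; to_hive_key and the year/month/day column names are assumptions for this example:

```python
# Hypothetical sketch of the key rewrite needed to restructure the data.
# The partition column names (year, month, day) are assumed here; customize
# them for your own layout.
def to_hive_key(key, columns=('year', 'month', 'day')):
    prefix, *parts = key.split('/')
    values, filename = parts[:len(columns)], parts[len(columns):]
    hive_parts = ['{}={}'.format(c, v) for c, v in zip(columns, values)]
    return '/'.join([prefix] + hive_parts + filename)

new_key = to_hive_key('parquet-data/2018/10/10/file1.parquet')
# → 'parquet-data/year=2018/month=10/day=10/file1.parquet'

# In a real migration you would then copy each object, for example with boto3
# (requires AWS credentials, so shown commented out here):
# s3 = boto3.client('s3')
# s3.copy_object(Bucket='new-bucket', Key=new_key,
#                CopySource={'Bucket': 's3-bucket', 'Key': old_key})
```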
Replace partition column names with asterisks
If you can't restructure your data, then create the DynamicFrame directly from Amazon S3. Use asterisks (*) in place of partition column names. AWS Glue includes only the data in the DynamicFrame, not the partition columns.
For example, if you store your files in an S3 bucket with the s3://s3-bucket/parquet-data/year/month/day/files.parquet file path, then use the following DynamicFrame:
dynamic_frame0 = glueContext.create_dynamic_frame_from_options(
    's3',
    connection_options={'paths': ['s3://s3-bucket/parquet-data/*/*/*']},
    format='parquet',
    transformation_ctx='dynamic_frame0'
)
Use a map class transformation to add partition columns
To include the partition columns in the DynamicFrame, read the data into a DataFrame and add a column that contains the Amazon S3 file path. Then, apply the Map class transformation to derive the partition columns from that path.
Example code:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import input_file_name
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
df = spark.read.parquet("s3://s3-bucket/parquet-data/*/*/*")
modified_df = df.withColumn('partitions_column', input_file_name())
dyf_0 = DynamicFrame.fromDF(modified_df, glueContext, "dyf_0")
def modify_col(x):
    if x['partitions_column']:
        new_columns = x['partitions_column'].split('/')
        x['year'], x['month'], x['day'] = new_columns[4], new_columns[5], new_columns[6]
        del x['partitions_column']
    return x
modified_dyf = Map.apply(dyf_0, f=modify_col)
datasink2 = glueContext.write_dynamic_frame.from_options(
    frame=modified_dyf,
    connection_type="s3",
    connection_options={
        "path": "s3://my-output-bucket/output/",
        "partitionKeys": ["year", "month", "day"]
    },
    format="parquet",
    transformation_ctx="datasink2"
)
job.commit()
Note: Replace the example S3 paths with your S3 paths and customize the partition columns for your use case.
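You can sanity-check the index arithmetic in modify_col with plain Python: input_file_name() returns the full S3 path, so after split('/') the partition values start at index 4 ('s3:', an empty segment, the bucket, the prefix, then year/month/day). If your prefix is deeper or shallower, adjust the indices accordingly:

```python
# Quick check of the indices used in modify_col for the example path depth.
path = 's3://s3-bucket/parquet-data/2018/10/10/file1.parquet'
parts = path.split('/')
print(parts)
# → ['s3:', '', 's3-bucket', 'parquet-data', '2018', '10', '10', 'file1.parquet']
print(parts[4], parts[5], parts[6])  # → 2018 10 10
```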
Resolve files or prefixes that don't exist
If no files are in the path, then check whether you deleted or archived the files. If the files use a different prefix, then update the connection_options parameter in your AWS Glue script to point to the correct path. Also, check whether the catalog table references a missing or outdated S3 location. If the table points to missing files, then the job fails because there's no data to process.
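A quick pre-flight check can confirm that a prefix still contains objects before you point connection_options at it. The has_objects helper below is a hypothetical sketch; the S3 client is passed in, so in a real job you would supply boto3.client('s3') (the stub client here only lets the example run without AWS credentials):

```python
# Hypothetical pre-flight check: does the prefix contain at least one object?
# In a Glue job, pass boto3.client('s3') as s3_client.
def has_objects(s3_client, bucket, prefix):
    resp = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
    return resp.get('KeyCount', 0) > 0

# Stub client so the example runs without an AWS call; replace with a real client.
class StubS3:
    def list_objects_v2(self, Bucket, Prefix, MaxKeys):
        keys = ['parquet-data/year=2018/month=10/day=10/file1.parquet']
        matches = [k for k in keys if k.startswith(Prefix)]
        return {'KeyCount': len(matches[:MaxKeys])}

print(has_objects(StubS3(), 's3-bucket', 'parquet-data/'))   # → True
print(has_objects(StubS3(), 's3-bucket', 'archived-data/'))  # → False
```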
Resolve issues when a job with the job bookmark parameter scans old files
When you use a job bookmark, AWS Glue tracks previously processed files and skips files with older timestamps. If the job doesn't find new eligible files, then the job fails because there's no data to process.
To resolve this issue, take the following actions:
- Confirm that the files' modified timestamps are within the expected range.
- Turn off bookmarks to reprocess all files.
- Rename or update the files to have newer last-modified timestamps so that AWS Glue detects them as new files and includes them in the next run.
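To turn off bookmarks for a single run, you can override the --job-bookmark-option job argument when you start the run. The sketch below assumes a placeholder job name; the boto3 call requires AWS credentials, so it is shown commented out:

```python
# Override the bookmark behavior for one run so AWS Glue reprocesses all files.
# '--job-bookmark-option' accepts job-bookmark-enable, job-bookmark-pause, or
# job-bookmark-disable.
run_arguments = {'--job-bookmark-option': 'job-bookmark-disable'}

# Requires AWS credentials; 'my-glue-job' is a placeholder name:
# import boto3
# glue = boto3.client('glue')
# glue.start_job_run(JobName='my-glue-job', Arguments=run_arguments)
```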
Related information
Managing partitions for ETL output in AWS Glue