
How do I resolve the "Unable to infer schema" error in AWS Glue?


I get an "Unable to infer schema" error when I run my AWS Glue job to process Parquet or ORC files that I store in Amazon Simple Storage Service (Amazon S3).

Short description

Parquet or ORC files must follow a Hive-style key=value partition path format. If the files use a hierarchical path structure instead, then AWS Glue can't infer the schema and the job fails with this error.

For example, if your AWS Glue job processes files from s3://s3-bucket/parquet-data/, then the files must use the following partitioned format:

s3://s3-bucket/parquet-data/year=2018/month=10/day=10/file1.parquet

If the files use the following non-partitioned format, then the AWS Glue job fails:

s3://s3-bucket/parquet-data/year/month/day/file1.parquet

Resolution

To resolve the "Unable to infer schema" error in AWS Glue, use one of the following methods for your use case.

Restructure your data

Copy the files into a new S3 bucket and use Hive-style partitioned paths. Then, run the job.
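To illustrate the restructuring, a small helper like the following can rewrite a hierarchical key into a Hive-style key before you copy each object to its new location. This is a sketch with hypothetical names (to_hive_key isn't part of any AWS SDK); the prefix and partition column names are assumed from the example paths above:

```python
# Sketch: convert a hierarchical S3 key into a Hive-style partitioned key.
# Assumes the three path levels after the prefix are year, month, and day,
# matching the example paths in this article.

def to_hive_key(key: str, prefix: str = "parquet-data/",
                columns=("year", "month", "day")) -> str:
    """Rewrite 'parquet-data/2018/10/10/file1.parquet' to
    'parquet-data/year=2018/month=10/day=10/file1.parquet'."""
    relative = key[len(prefix):]
    parts = relative.split("/")
    values, filename = parts[:len(columns)], parts[len(columns):]
    hive_parts = [f"{c}={v}" for c, v in zip(columns, values)]
    return prefix + "/".join(hive_parts + filename)

print(to_hive_key("parquet-data/2018/10/10/file1.parquet"))
# parquet-data/year=2018/month=10/day=10/file1.parquet
```

You can then copy each object to its rewritten key, for example with the AWS CLI or an SDK, and point the job at the new bucket.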

Replace partition column names with asterisks

If you can't restructure your data, then create the DynamicFrame directly from Amazon S3. Use asterisks (*) in place of partition column names. AWS Glue includes only the data in the DynamicFrame, not the partition columns.

For example, if you store your files in an S3 bucket with the s3://s3-bucket/parquet-data/year/month/day/files.parquet file path, then use the following DynamicFrame:

dynamic_frame0 = glueContext.create_dynamic_frame_from_options(
    's3',
    # Each asterisk matches one level of the hierarchical path (year/month/day)
    connection_options={'paths': ['s3://s3-bucket/parquet-data/*/*/*']},
    format='parquet',
    transformation_ctx='dynamic_frame0'
)

Use a map class transformation to add partition columns

To include the partition columns in the DynamicFrame, read the data into a DataFrame and add a column for the Amazon S3 file path. Then, apply a map class transformation.

Example code: 

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import input_file_name

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read all files under the hierarchical path; each asterisk matches one path level
df = spark.read.parquet("s3://s3-bucket/parquet-data/*/*/*")

# Record each row's source file path in a temporary column
modified_df = df.withColumn('partitions_column', input_file_name())
dyf_0 = DynamicFrame.fromDF(modified_df, glueContext, "dyf_0")

def modify_col(x):
    if x['partitions_column']:
        # For s3://s3-bucket/parquet-data/year/month/day/file.parquet,
        # splitting on '/' puts year, month, and day at indexes 4, 5, and 6
        new_columns = x['partitions_column'].split('/')
        x['year'], x['month'], x['day'] = new_columns[4], new_columns[5], new_columns[6]
        del x['partitions_column']
    return x

modified_dyf = Map.apply(dyf_0, f=modify_col)

datasink2 = glueContext.write_dynamic_frame.from_options(
    frame=modified_dyf,
    connection_type="s3",
    connection_options={
        "path": "s3://my-output-bucket/output/",
        "partitionKeys": ["year", "month", "day"]
    },
    format="parquet",
    transformation_ctx="datasink2"
)

Note: Replace the example S3 paths with your S3 paths and customize the partition columns for your use case.

Resolve files or prefixes that don't exist

If no files are in the path, then check whether you deleted or archived the files. If the files use a different prefix, then update the connection_options parameter in your AWS Glue script to point to the correct path. Also, check whether the catalog table references a missing or outdated S3 location. If the table points to missing files, then the job fails because there's no data to process.
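A quick way to rule out an empty or incorrect prefix is to list the objects under it before the job runs. The following sketch runs the check against an in-memory list of keys that stands in for a real S3 listing (in practice you would get the keys from the AWS CLI or an SDK); the prefix names come from this article's examples:

```python
# Sketch: confirm that a prefix actually contains data files before the
# Glue job runs. The keys list stands in for a real S3 object listing.

def files_under_prefix(keys, prefix, extensions=(".parquet", ".orc")):
    """Return the data-file keys that start with the given prefix."""
    return [k for k in keys
            if k.startswith(prefix) and k.lower().endswith(extensions)]

listing = [
    "parquet-data/year=2018/month=10/day=10/file1.parquet",
    "archive/old-data/file2.parquet",
]

print(files_under_prefix(listing, "parquet-data/"))
# ['parquet-data/year=2018/month=10/day=10/file1.parquet']
print(files_under_prefix(listing, "json-data/"))
# [] -> the job fails because there's no data to process
```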

Resolve issues when a job with the job bookmark parameter scans old files 

When you use a job bookmark, AWS Glue tracks previously processed files and skips files with older timestamps. If the job doesn't find new eligible files, then the job fails because there's no data to process.

To resolve this issue, take the following actions:

  • Confirm that the files' modified timestamps are within the expected range.
  • Turn off bookmarks to reprocess all files.
  • Rename or update the files to have newer last-modified timestamps so that AWS Glue detects them as new files and includes them in the next run.
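The bookmark behavior can be illustrated with a simple timestamp filter. This is a simplified model (real job bookmarks track additional state): files modified at or before the bookmarked time are skipped, so a run finds no eligible files when nothing is newer.

```python
from datetime import datetime

# Sketch: how a job bookmark causes older files to be skipped.
# Simplified model: only files modified after the bookmarked time are eligible.

def eligible_files(files, bookmark_time):
    """files: dict of S3 key -> last-modified datetime."""
    return [key for key, mtime in files.items() if mtime > bookmark_time]

files = {
    "parquet-data/file1.parquet": datetime(2024, 1, 1),
    "parquet-data/file2.parquet": datetime(2024, 6, 1),
}

# Bookmark recorded after the last successful run:
print(eligible_files(files, datetime(2024, 3, 1)))
# ['parquet-data/file2.parquet']

# If the bookmark is newer than every file, the run has no data to process:
print(eligible_files(files, datetime(2024, 12, 1)))
# []
```

To reprocess everything, you can turn off the bookmark for the job or reset it, for example with the aws glue reset-job-bookmark AWS CLI command.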

Related information

Managing partitions for ETL output in AWS Glue

AWS OFFICIAL · Updated 10 months ago
2 Comments

There could be two reasons:

  1. AWS might not be able to verify the file. Make sure that the file is of the correct type and that you select the correct options (such as delimiters).

  2. [MOST PROBABLE REASON] The IAM permissions for the given S3 object might not be available. Check not only that the correct permissions are granted, but also that they are set for the correct resources.

e.g

"Effect" : "Allow",
"Action": [
     "s3:GetObject"
],
"Resource" :[
     "arn:aws:s3:::{your_bucket_name}",
     "arn:aws:s3:::{your_bucket_name}/*"
]

In your case, replace your_bucket_name with s3-bucket.

replied 2 years ago

Thank you for your comment. We'll review and update the Knowledge Center article as needed.

AWS
EXPERT
replied 2 years ago