[Glue Crawler, Parquet] - Crawler infers timestamp[ns] as bigint

I'm using Pandas to create a DataFrame with a single column of dates, which I then save as a Parquet file to an S3 bucket. Here's the code snippet I'm using:

import pandas as pd

# Single 'date' column of 10 daily timestamps; pandas defaults to nanosecond resolution
df = pd.DataFrame({'date': pd.date_range('2022-10-01', periods=10)})
# Write to S3 as Parquet, format version 2.6
df.to_parquet('s3://<bucket>/data/athena/test/test.parquet', version='2.6')

After uploading the Parquet file to S3, I use a crawler to populate the schema in a data catalog. However, I've noticed that the data type of the date column is inferred as bigint rather than a timestamp type.

Interestingly, when I explicitly convert the column to timestamp[us] before saving the DataFrame, the crawler correctly identifies the column as a timestamp type, and everything works as expected.
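
For reference, the conversion I'm describing is just a cast of the date column before writing (this relies on pandas 2.x, which supports non-nanosecond datetime64 dtypes); roughly:

df['date'] = df['date'].astype('datetime64[us]')  # downcast from nanosecond to microsecond resolution
df.to_parquet('s3://<bucket>/data/athena/test/test.parquet', version='2.6')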

However, I'm curious why this adjustment is necessary. Is there a known limitation with timestamp[ns] data types in Parquet files, or with how crawlers interpret them? How can I ensure that my timestamp[ns] columns are correctly identified as timestamp types without having to convert them to timestamp[us]?

Thank you for your help!

1 Answer

There is a known limitation where the Glue crawler and Athena may not correctly infer timestamp columns from Parquet files written in certain ways.

When writing the DataFrame to Parquet, pandas uses nanosecond-resolution timestamps, which Parquet can store either as INT96 or as INT64 with a nanosecond logical type. However, some data catalogs and query engines expect microsecond or coarser resolution and may fall back to treating the column as a plain integer (hence bigint). Explicitly converting the timestamp column to microsecond resolution before writing to Parquet ensures the data type is identified correctly.

df['date'] = df['date'].astype('datetime64[us]')  # requires pandas 2.x; earlier pandas only supports datetime64[ns]
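
Alternatively, assuming the pyarrow engine is being used, the same coercion can be requested at write time instead of casting the column, since to_parquet forwards extra keyword arguments to pyarrow. A sketch, not verified against your crawler setup:

# Coerce all timestamp columns to microsecond resolution when writing
df.to_parquet('s3://<bucket>/data/athena/test/test.parquet',
              engine='pyarrow',
              coerce_timestamps='us',
              allow_truncated_timestamps=True)  # allow dropping sub-microsecond precision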

Another option is to set the proper metadata in the Parquet file itself to specify nanosecond resolution timestamps. Tools like Spark handle this automatically, but you may need to configure other systems like DMS to do the same.
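
If you do want to keep pandas' nanosecond timestamps, one thing that may be worth trying (an assumption on my side, not something I have verified against the Glue crawler) is writing the legacy INT96 representation mentioned above, which many older readers recognize as a timestamp:

# Store timestamps using the legacy INT96 physical type instead of int64 nanoseconds
df.to_parquet('s3://<bucket>/data/athena/test/test.parquet',
              engine='pyarrow',
              use_deprecated_int96_timestamps=True)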

answered 3 months ago
  • I used Parquet format version 2.6, which stores nanosecond-resolution timestamps as int64.

    import pyarrow.parquet as pq

    # Read the file footer to inspect the Parquet schema and the pandas metadata embedded by pyarrow
    metadata = pq.read_metadata('s3://<bucket>/data/athena/test/test.parquet')
    print(metadata.schema)
    print(metadata.schema.to_arrow_schema().pandas_metadata)

    The previous code snippet prints the following:

    <pyarrow._parquet.ParquetSchema object at 0x7f25a0bdd680>
    required group field_id=-1 schema {
      optional int64 field_id=-1 date (Timestamp(isAdjustedToUTC=false, timeUnit=nanoseconds, is_from_converted_type=false, force_set_converted_type=false));
    }
    
    {'index_columns': [{'kind': 'range',
       'name': None,
       'start': 0,
       'stop': 10,
       'step': 1}],
     'column_indexes': [{'name': None,
       'field_name': None,
       'pandas_type': 'unicode',
       'numpy_type': 'object',
       'metadata': {'encoding': 'UTF-8'}}],
     'columns': [{'name': 'date',
       'field_name': 'date',
       'pandas_type': 'datetime',
       'numpy_type': 'datetime64[ns]',
       'metadata': None}],
     'creator': {'library': 'pyarrow', 'version': '14.0.2'},
     'pandas_version': '2.1.4'}
    

    So it's still strange to me why the crawler infers this column as bigint.
