Do Glue crawlers or catalog tables have a 50-column limit?


I am trying to use a Glue crawler to read CSV files from S3 and create a catalog table from them. The crawler runs successfully and creates the catalog table, but the table is empty (it has no columns) if the CSV has more than 50 columns. With 49 columns, for example, it works without problems and creates the catalog table with the correct columns. Is there some kind of limit in the crawler or in catalog tables? I did not find any documentation about this.

asked 2 years ago
1.4K views
3 Answers

Thanks for the answers. We found out that the problem was in the S3 files: they had been uploaded to S3 with the wrong encoding. They need to be encoded as UTF-8 so that Glue can read them properly.
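For anyone who hits the same symptom, here is a minimal sketch of how a file could be checked locally before uploading it. The function names are hypothetical, and the source encoding (UTF-16 in the usage note) is only an assumption for illustration; the code does not guess the original encoding, you have to know it.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a known source encoding to UTF-8.

    source_encoding is supplied by the caller; nothing is auto-detected.
    """
    return data.decode(source_encoding).encode("utf-8")
```

A possible usage: read the file with `open(path, "rb").read()`, and if `is_valid_utf8` returns False, write `to_utf8(data, "utf-16")` to a new file and upload that one instead. Replace `"utf-16"` with whatever encoding your export tool actually used.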

answered 2 years ago
  • Glad you were able to sort this out. Please close out this question by accepting an answer.


Hi,

Glue crawlers as such do not have a column limit. During the first crawler run, the crawler reads either the first 1,000 records or the first megabyte of each file to infer the schema. The amount of data read depends on the file format and the availability of a valid record. For CSV files, the crawler reads either the first 1,000 records or the first 1 MB of data, whichever comes first. If the crawler can't infer the schema after reading the first 1 MB, it reads up to a maximum of 10 MB of the file, in 1 MB increments. The crawler compares the schemas inferred from all the subfolders and files, and then creates one or more tables. When a crawler creates a table, it considers the following factors:

  1. Data compatibility: whether the data is of the same format, compression type, and include path
  2. Schema similarity: how similar the schemas are in terms of partition threshold and the number of different schemas
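To get a rough feel for what such a sample looks like for one of your own files, the sketch below takes the first 1 MB or 1,000 lines of a CSV and reports how many fields each sampled record has. It is only a local approximation built on the figures quoted above, not the crawler's actual inference logic; the function name and constants are hypothetical.

```python
import csv
from typing import Set

SAMPLE_BYTES = 1024 * 1024  # 1 MB, the figure quoted above
SAMPLE_LINES = 1000         # 1,000 records, the figure quoted above


def sample_column_counts(raw: bytes, encoding: str = "utf-8") -> Set[int]:
    """Return the distinct field counts found in the sampled part of a CSV.

    Simplification: lines are split before CSV parsing, so quoted fields
    that contain embedded newlines are not handled correctly.
    """
    chunk = raw[:SAMPLE_BYTES]
    # errors="replace" keeps going if the cut lands inside a multi-byte character
    lines = chunk.decode(encoding, errors="replace").splitlines()
    if len(raw) > SAMPLE_BYTES and lines:
        # The last line was probably cut off mid-record, so drop it
        lines = lines[:-1]
    reader = csv.reader(lines[:SAMPLE_LINES])
    return {len(row) for row in reader if row}
```

If the returned set contains a single number, every sampled record has the same number of fields as the header. More than one number means the sample is ragged, which is worth fixing before running the crawler again.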

However, please feel free to raise a support case to troubleshoot the issue further and find its root cause.

Thank you.

AWS
answered 2 years ago

I tested with a CSV file that has 63 columns, and the catalog table shows all of them. So I can say for sure that there is no 50-column limit.

Having said that, try using a smaller file with, say, 100 rows and verify manually that you see data for all columns. Thinking out loud here: if your files are spread across multiple partitions and not every file has data populated for every column, you may have been unlucky and the crawler read only rows where fewer than 50 columns were populated. Alternatively, try creating a completely new crawler that writes to a new database and table.
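If checking 100 rows by eye is tedious, something like the following could do it for you. It is a minimal sketch with a hypothetical function name: it lists the header names that have no non-blank value in the first 100 data rows, and it assumes the first row of the file is the header.

```python
import csv
import io
from typing import List


def empty_columns(csv_text: str, max_rows: int = 100) -> List[str]:
    """Return header names with no non-blank value in the first max_rows data rows."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)  # assumption: the first row is the header
    seen_value = [False] * len(header)
    for row_number, row in enumerate(reader):
        if row_number >= max_rows:
            break
        # Fields beyond the header width are ignored; short rows leave
        # their missing columns marked as unseen
        for index, value in enumerate(row[: len(header)]):
            if value.strip():
                seen_value[index] = True
    return [name for name, seen in zip(header, seen_value) if not seen]
```

An empty list means every column had at least one value in the sample. Any names it returns are the columns to look at first.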

AWS
EXPERT
answered 2 years ago
