Check this documentation out.
All the following conditions must be true for AWS Glue to create a partitioned table for an Amazon S3 folder:
- The schemas of the files are similar, as determined by AWS Glue.
- The data format of the files is the same.
- The compression format of the files is the same.
That document gives the following example, which likely explains what's happening.
"...you might own an Amazon S3 bucket named my-app-bucket, where you store both iOS and Android app sales data. The data is partitioned by year, month, and day. The data files for iOS and Android sales have the same schema, data format, and compression format. In the AWS Glue Data Catalog, the AWS Glue crawler creates one table definition with partitioning keys for year, month, and day.
The following Amazon S3 listing of my-app-bucket shows some of the partitions. The = symbol is used to assign partition key values.
my-app-bucket/Sales/year=2010/month=feb/day=1/iOS.csv
my-app-bucket/Sales/year=2010/month=feb/day=1/Android.csv
my-app-bucket/Sales/year=2010/month=feb/day=2/iOS.csv
my-app-bucket/Sales/year=2010/month=feb/day=2/Android.csv
...
my-app-bucket/Sales/year=2017/month=feb/day=4/iOS.csv
my-app-bucket/Sales/year=2017/month=feb/day=4/Android.csv
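To make the `key=value` convention in that listing concrete, here is a minimal Python sketch that pulls the partition keys out of an object key. The function name `partition_keys` is made up for illustration, and this only mimics the naming convention; it is not how the crawler itself is implemented.

```python
def partition_keys(s3_key: str) -> dict:
    """Return the key=value folder segments of an S3 object key, in path order."""
    parts = {}
    for segment in s3_key.split("/")[:-1]:  # skip the file name itself
        if "=" in segment:
            name, value = segment.split("=", 1)
            parts[name] = value
    return parts


keys = [
    "my-app-bucket/Sales/year=2010/month=feb/day=1/iOS.csv",
    "my-app-bucket/Sales/year=2010/month=feb/day=1/Android.csv",
    "my-app-bucket/Sales/year=2017/month=feb/day=4/iOS.csv",
]
for key in keys:
    print(key, "->", partition_keys(key))
```

Both the iOS and Android files under the same day folder resolve to the same year/month/day values, which is why they end up as rows in one partition of one table rather than as separate tables.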
It sounds like the conditions above are the deciding factor in how AWS Glue creates your table definitions. You can either lay out the data so that it meets those conditions, or manually create a table in the AWS Glue Data Catalog.
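As a rough sketch of the "format the data" option, the snippet below plans a layout where each dataset gets its own folder, so every folder holds files of a single schema. Everything here is an assumption for illustration: the helper names `dataset_name` and `regroup` are hypothetical, and the rule of deriving the dataset from the file name by stripping trailing digits only fits names like Employee1.csv or Sales2.csv. It only computes the new keys; it does not move anything in S3.

```python
import re


def dataset_name(file_name: str) -> str:
    """Guess a dataset name from a file name: 'Employee1.csv' -> 'employee' (assumed naming rule)."""
    stem = file_name.rsplit(".", 1)[0]
    return re.sub(r"\d+$", "", stem).lower()


def regroup(keys, base_prefix):
    """Map each existing key to a new key under base_prefix/<dataset>/<file name>."""
    plan = {}
    for key in keys:
        file_name = key.rsplit("/", 1)[-1]
        plan[key] = f"{base_prefix}/{dataset_name(file_name)}/{file_name}"
    return plan


mixed = [
    "my-app-bucket/data-store-db/emp-data/empfolder1/Employee1.csv",
    "my-app-bucket/data-store-db/emp-data/empfolder1/Sales1.csv",
]
for old, new in regroup(mixed, "my-app-bucket/data-store-db/emp-data").items():
    print(old, "->", new)
```

With one schema per folder, each folder should satisfy the conditions listed above on its own, though the tables the crawler actually produces still depend on how it judges schema similarity.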
Here's a great re:Post article called "How can I prevent the AWS Glue Crawler from creating multiple tables?" that dives into the weeds on how AWS Glue decides the schema and what you can do to stop it from creating multiple tables from a single data source.
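One setting that may be relevant here is the crawler's table grouping policy, which can be set through the crawler's `Configuration` JSON. The sketch below only builds that JSON string; the crawler name in the comment is a placeholder, and whether this policy gives the table layout you want depends on your data, so treat it as something to try rather than a guaranteed fix.

```python
import json


def crawler_grouping_config() -> str:
    """Build a crawler Configuration string that asks Glue to combine compatible schemas."""
    config = {
        "Version": 1.0,
        "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
    }
    return json.dumps(config)


print(crawler_grouping_config())

# Applying it with boto3 would look roughly like this (needs AWS credentials;
# "my-crawler" is a placeholder name, not taken from the question):
#   import boto3
#   glue = boto3.client("glue")
#   glue.update_crawler(Name="my-crawler", Configuration=crawler_grouping_config())
```

Note that this policy combines compatible schemas under a path into a common table definition, so it will not split mixed files into one table per schema.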
I would like to add some context.
1) I had these files in my S3 bucket:
my-app-bucket/data-store-db/emp-data/Employee1.csv
my-app-bucket/data-store-db/emp-data/Employee2.csv
When I ran the crawler on the my-app-bucket/data-store-db/emp-data/ location, it created one table for both files, since they have the same schema, format, and compression.
2) When I put the same files in different folders within the same directory:
my-app-bucket/data-store-db/emp-data/empfolder1/Employee1.csv
my-app-bucket/data-store-db/emp-data/empfolder2/Employee2.csv
the crawler gave me a table with a partition at the folder level.
3) When I put the same files, along with some more files that have a different schema, into the same folder:
my-app-bucket/data-store-db/emp-data/empfolder1/Employee1.csv
my-app-bucket/data-store-db/emp-data/empfolder1/Employee2.csv
my-app-bucket/data-store-db/emp-data/empfolder1/Sales1.csv
my-app-bucket/data-store-db/emp-data/empfolder1/Sales2.csv
The Employee files share one schema and the Sales files share another, so I expected the crawler to create only 2 tables, but it created 4. My question is why it created 4 tables and not 2: if two files have the same schema, why aren't they grouped into the same table?