Access S3 metadata during upload/download or at rest for analytics

Question

Is there a way to access the metadata of an S3 object using one of Amazon's analytics tools? AWS Glue currently supports crawlers for S3 data, but I was not able to find any information on metadata crawlers.

Answer

You could use AWS Lambda to extract the metadata and store it in Amazon DynamoDB, then use AWS Glue to catalog it so that you can query it with Athena. For example, you can get an object's metadata using `s3.head_object` in boto3:

```
import boto3

# Create an S3 client
s3 = boto3.client('s3')

# HeadObject returns the object's metadata without downloading its body
response = s3.head_object(Bucket='your-bucket-name', Key='path/to/your/object')

# Print the user-defined metadata (the x-amz-meta-* headers)
print(response['Metadata'])

# System-defined metadata is also available, for example:
print("Size:", response['ContentLength'])
print("Last Modified:", response['LastModified'])
print("ETag:", response['ETag'])
```
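The Lambda-to-DynamoDB step could look roughly like the sketch below. It assumes the function is triggered by S3 event notifications and that a DynamoDB table exists with `bucket` as partition key and `key` as sort key; the table name `s3-object-metadata`, the item attribute names, and the injectable `s3`/`table` parameters are hypothetical choices made for illustration and local testing, not something prescribed by AWS.

```python
import urllib.parse


def build_item(bucket, key, head):
    """Flatten a HeadObject response into a DynamoDB-friendly dict."""
    return {
        "bucket": bucket,
        "key": key,
        "size": head["ContentLength"],
        # datetime is not a DynamoDB type, so store an ISO 8601 string
        "last_modified": head["LastModified"].isoformat(),
        "etag": head["ETag"].strip('"'),
        "user_metadata": head.get("Metadata", {}),
    }


def lambda_handler(event, context, s3=None, table=None):
    # Clients can be injected for local testing; inside Lambda they default to boto3.
    if s3 is None or table is None:
        import boto3
        s3 = s3 or boto3.client("s3")
        # Hypothetical table name; replace with your own table.
        table = table or boto3.resource("dynamodb").Table("s3-object-metadata")

    records = event.get("Records", [])
    for record in records:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys in S3 event notifications are URL-encoded
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        head = s3.head_object(Bucket=bucket, Key=key)
        table.put_item(Item=build_item(bucket, key, head))
    return {"processed": len(records)}
```

Storing one item per object keeps lookups by bucket and key cheap; the function's role would also need permission to call `s3:GetObject` on the source bucket and `dynamodb:PutItem` on the table.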

Resources:
* [S3 Head Object](https://docs.aws.amazon.com/AmazonS3/latest/API/API_HeadObject.html)
* [Using Metadata](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingMetadata.html)
* [Reddit discussion: How to query S3 metadata using Athena](https://www.reddit.com/r/aws/comments/cexe8h/how_to_query_s3_metadata_using_athena/?rdt=60426)

You can also use S3 Inventory, which produces CSV, ORC, or Parquet files that list a bucket's objects and their metadata on a daily or weekly schedule. The inventory can include fields such as the object key, version ID, size, and last modified date. Note that these are system-level fields; user-defined metadata is not part of the inventory. You can analyze the inventory files using Amazon Athena, AWS Glue, or other analytics tools.

Other Resources:
* [Amazon S3 Inventory](https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory.html)
* [Querying Amazon S3 Inventory with Amazon Athena](https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory-athena-query.html)
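If you want to inspect an inventory report outside Athena, a small script can do it. The sketch below assumes a CSV-format report that has already been downloaded and decompressed, and that the column order is taken from the `fileSchema` string in the report's `manifest.json`. The function names and the sample rows are made up for illustration.

```python
import csv
import io
import urllib.parse


def parse_inventory_csv(text, file_schema):
    """Turn headerless inventory CSV text into dicts keyed by the manifest schema."""
    columns = [name.strip() for name in file_schema.split(",")]
    rows = []
    for values in csv.reader(io.StringIO(text)):
        row = dict(zip(columns, values))
        # Keys in CSV inventory reports are URL-encoded; check how spaces
        # appear in your own reports before relying on this decoding.
        row["Key"] = urllib.parse.unquote(row["Key"])
        rows.append(row)
    return rows


def total_size_by_prefix(rows):
    """Sum object sizes per top-level prefix."""
    totals = {}
    for row in rows:
        key = row["Key"]
        prefix = key.split("/", 1)[0] if "/" in key else "(root)"
        # Size can be empty, for example on delete markers
        totals[prefix] = totals.get(prefix, 0) + int(row["Size"] or 0)
    return totals


# Made-up sample data in the shape of a CSV inventory file
sample_schema = "Bucket, Key, Size, LastModifiedDate"
sample_csv = (
    '"my-bucket","logs/app%20one.log","1024","2024-01-02T03:04:05.000Z"\n'
    '"my-bucket","logs/app2.log","2048","2024-01-03T03:04:05.000Z"\n'
    '"my-bucket","readme.txt","10","2024-01-04T03:04:05.000Z"\n'
)
rows = parse_inventory_csv(sample_csv, sample_schema)
print(total_size_by_prefix(rows))
```

For large buckets, querying the inventory with Athena as described in the linked guide is likely a better fit than parsing files by hand.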

Answer

An AWS Glue crawler infers the schema of the data stored in Amazon S3, not the metadata associated with the S3 objects themselves. To use your S3 object metadata for analysis, first call the `HeadObject` API through an AWS SDK or the AWS CLI to extract the metadata of your objects, then store the retrieved metadata in S3 and query it there with an analytics service such as Athena.

[Head Object](https://docs.aws.amazon.com/AmazonS3/latest/API/API_HeadObject.html)
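As a rough sketch of that extract-then-store flow: the function below lists objects under a prefix, calls `HeadObject` on each, and writes the results back to S3 as JSON Lines, a format Athena can read with a JSON SerDe. The function name, the output field names, and the bucket and key arguments are hypothetical; the S3 client is passed in so the function can be exercised without AWS access.

```python
import json


def export_metadata(s3, source_bucket, prefix, dest_bucket, dest_key):
    """Write one JSON line per object under prefix and upload the result to S3."""
    lines = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=source_bucket, Prefix=prefix):
        # "Contents" is absent when a page has no objects
        for obj in page.get("Contents", []):
            head = s3.head_object(Bucket=source_bucket, Key=obj["Key"])
            lines.append(json.dumps({
                "key": obj["Key"],
                "size": head["ContentLength"],
                "last_modified": head["LastModified"].isoformat(),
                "content_type": head.get("ContentType"),
                "user_metadata": head.get("Metadata", {}),
            }))
    body = ("\n".join(lines) + "\n") if lines else ""
    s3.put_object(Bucket=dest_bucket, Key=dest_key, Body=body.encode("utf-8"))
    return len(lines)
```

With real credentials it could be called as `export_metadata(boto3.client("s3"), "your-bucket-name", "path/", "your-results-bucket", "metadata/objects.jsonl")`. One `HeadObject` request per object adds up on large buckets, so this suits small or medium prefixes best.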

Answer

AWS Glue can crawl S3 data to generate metadata such as file formats and schemas. This metadata is stored in the AWS Glue Data Catalog.
You can query the catalog to get this metadata about your S3 data, for example to see which tables hold a given file type. Athena can also use the catalog tables for SQL queries.

In summary, Glue crawlers populate the Glue Data Catalog, which enables discovery and analysis of S3 data through catalog queries, Athena, and more.
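To see what a crawler has put in the Data Catalog, you can read it back through the Glue API. The sketch below assumes a crawler has already run against a database you name; the function name and the summary fields are hypothetical, and the Glue client is passed in (for example `boto3.client("glue")`).

```python
def list_catalog_tables(glue, database):
    """Return name, S3 location, and columns for each table in a Glue database."""
    summaries = []
    paginator = glue.get_paginator("get_tables")
    for page in paginator.paginate(DatabaseName=database):
        for table in page["TableList"]:
            storage = table.get("StorageDescriptor", {})
            summaries.append({
                "table": table["Name"],
                "location": storage.get("Location"),
                "columns": [
                    (col["Name"], col.get("Type"))
                    for col in storage.get("Columns", [])
                ],
            })
    return summaries
```

This returns the inferred schema of the data, which is a different thing from the per-object metadata that `HeadObject` returns.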

Some sources:
* [Get started running AWS Glue crawlers and jobs using an AWS SDK - AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/example_glue_Scenario_GetStartedCrawlersJobs_section.html)
* [How AWS Glue crawlers work](https://docs.aws.amazon.com/glue/latest/dg/crawler-running.html)
* [Using AWS Glue crawlers](https://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html#schema-crawlers)
