Access S3 metadata during upload/download or at rest for analytics

0

Is there a way to access the metadata of an S3 object using one of Amazon's analytics tools? AWS Glue currently supports crawlers for S3 data, but I was not able to find any information on metadata crawlers.

AWS
asked 2 months ago, 156 views
3 Answers
1

You could use AWS Lambda to extract the metadata and store it in Amazon DynamoDB, then use AWS Glue to create a catalog and query it with Athena. For example, you can get the metadata using s3.head_object in boto3:

import boto3

# Create an S3 client
s3 = boto3.client('s3')

response = s3.head_object(Bucket='your-bucket-name', Key='path/to/your/object')

# Print the user-defined metadata (the x-amz-meta-* values)
print(response['Metadata'])

# System metadata is available as top-level fields of the response, such as:
print("Size:", response['ContentLength'])
print("Last Modified:", response['LastModified'])
print("ETag:", response['ETag'])
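The Lambda-to-DynamoDB idea described above could be sketched roughly as follows. This is only an illustration under assumptions: the table name `s3-object-metadata`, the `object_id` key attribute, and the helper names `build_metadata_item` and `handle_s3_event` are hypothetical and not taken from the answer. The S3 client and table are passed in as parameters so the logic can be exercised without AWS credentials.

```python
import urllib.parse


def build_metadata_item(bucket, key, head):
    """Flatten a HeadObject response into a dict that DynamoDB can store."""
    return {
        "object_id": f"{bucket}/{key}",  # hypothetical partition key attribute
        "bucket": bucket,
        "key": key,
        "size": head["ContentLength"],
        "last_modified": head["LastModified"].isoformat(),
        "etag": head["ETag"].strip('"'),
        "user_metadata": head.get("Metadata", {}),
    }


def handle_s3_event(event, s3_client, table):
    """Write one metadata item per record in an S3 event notification."""
    items = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        head = s3_client.head_object(Bucket=bucket, Key=key)
        item = build_metadata_item(bucket, key, head)
        table.put_item(Item=item)
        items.append(item)
    return items


def lambda_handler(event, context):
    # boto3 is imported here so the helpers above can be tested without it.
    import boto3

    s3_client = boto3.client("s3")
    # Hypothetical table name; replace with your own.
    table = boto3.resource("dynamodb").Table("s3-object-metadata")
    return {"written": len(handle_s3_event(event, s3_client, table))}
```

Note that this makes one HeadObject call and one DynamoDB write per object event, so cost grows with the event volume.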

You can also use S3 Inventory, which produces CSV, ORC, or Parquet files that list a bucket's objects and their metadata on a daily or weekly schedule. The inventory can include details such as the object key, version ID, size, and last modified date. You can analyze these inventory files using Amazon Athena, AWS Glue, or other analytics tools.
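As a rough sketch of how such an inventory could be enabled with boto3: the helper names `build_inventory_configuration` and `enable_inventory`, the configuration ID, and the bucket names are placeholders chosen for illustration. The call used is `put_bucket_inventory_configuration`, and the client is passed in so the configuration can be inspected before anything is sent to AWS.

```python
def build_inventory_configuration(config_id, destination_bucket_arn, prefix="inventory"):
    """Build a daily Parquet inventory configuration for current object versions."""
    return {
        "Id": config_id,
        "IsEnabled": True,
        "IncludedObjectVersions": "Current",
        "Schedule": {"Frequency": "Daily"},  # "Weekly" is the other option
        "OptionalFields": ["Size", "LastModifiedDate", "ETag", "StorageClass"],
        "Destination": {
            "S3BucketDestination": {
                "Bucket": destination_bucket_arn,  # ARN of the bucket receiving reports
                "Format": "Parquet",  # or "CSV" / "ORC"
                "Prefix": prefix,
            }
        },
    }


def enable_inventory(s3_client, source_bucket, config):
    """Attach the inventory configuration to the source bucket."""
    s3_client.put_bucket_inventory_configuration(
        Bucket=source_bucket,
        Id=config["Id"],
        InventoryConfiguration=config,
    )


if __name__ == "__main__":
    import boto3

    # Placeholder bucket names; the destination bucket also needs a policy
    # that allows S3 to deliver inventory reports to it.
    cfg = build_inventory_configuration("daily-parquet", "arn:aws:s3:::your-inventory-bucket")
    enable_inventory(boto3.client("s3"), "your-bucket-name", cfg)
```

As far as I know, inventory reports cover system-level fields like the ones listed here rather than user-defined metadata values, so it is worth checking the available fields against what you need before building a pipeline on them.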

EXPERT
answered 2 months ago
  • Thanks for the response. Lambda is definitely one of the options, but I think it will be really expensive to invoke it per event (we are talking about moving a couple of billion objects per day). S3 Inventory looks like a better candidate here. Does Glue support metadata crawling like it does for data? That would make it easier to build a pipeline that uses only Glue with some sort of batch processing, without adding another dependency to the pipeline in the form of a datastore like DynamoDB.

0

AWS Glue can crawl S3 data to generate metadata such as file types and schemas. This metadata is stored in the AWS Glue Data Catalog. You can query the catalog to get metadata about S3 objects, for example to list files by type. The metadata can also be used to create Athena tables for SQL queries.

So in summary, Glue crawlers populate metadata in the Glue Data Catalog, which enables discovery and analysis of S3 data via metadata queries, Athena, and more.

AWS
answered 2 months ago
  • I think we are talking about two different things. I understand that Glue can crawl the data in S3 and create metadata based on that data, but this does not cover the S3 metadata that was either created by the customer or added by default by the S3 SDK. I was asking whether Glue can crawl the S3 object metadata itself (not the data).

0

An AWS Glue crawler infers the schema of the data stored in Amazon S3, not the metadata associated with the S3 objects. To use your S3 metadata for further analysis, you can first call the HeadObject API through an AWS SDK or the AWS CLI to extract the metadata of your S3 objects, then store the retrieved metadata in S3 and analyze it directly from there using an analytics service such as Athena.

Head Object
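A minimal batch sketch of that approach, assuming the output is written as one JSON Lines file that Athena could later read. The helper names `collect_metadata_lines` and `export_metadata` and the bucket and key names in the usage note are hypothetical. The boto3 client is passed in as a parameter.

```python
import json


def collect_metadata_lines(s3_client, bucket, prefix=""):
    """Yield one JSON line per object with its system and user-defined metadata."""
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            head = s3_client.head_object(Bucket=bucket, Key=obj["Key"])
            yield json.dumps({
                "key": obj["Key"],
                "size": head["ContentLength"],
                "last_modified": head["LastModified"].isoformat(),
                "etag": head["ETag"].strip('"'),
                "content_type": head.get("ContentType"),
                "user_metadata": head.get("Metadata", {}),
            })


def export_metadata(s3_client, source_bucket, dest_bucket, dest_key, prefix=""):
    """Write the collected metadata to S3 as a single JSON Lines object."""
    lines = list(collect_metadata_lines(s3_client, source_bucket, prefix))
    body = "\n".join(lines).encode("utf-8")
    s3_client.put_object(Bucket=dest_bucket, Key=dest_key, Body=body)
    return len(lines)
```

It could be called as `export_metadata(boto3.client("s3"), "your-bucket-name", "your-metadata-bucket", "metadata/objects.jsonl")`. This version issues one HeadObject request per object and holds all lines in memory, so for very large buckets it would need to be split by prefix and written in chunks.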

AWS
BezuW
answered 2 months ago