How can I use Athena to query my Amazon S3 Inventory reports?
I want to use Amazon Athena to query my Amazon Simple Storage Service (Amazon S3) Inventory files.
Resolution
Follow these steps to query Amazon S3 inventory files with an ORC-formatted, Parquet-formatted, or CSV-formatted inventory report.
- Configure the Amazon S3 inventory for your S3 bucket. Note the values for the destination bucket where the inventory reports are saved.
- Open the Athena console.
Note: Before you run your first query, you might need to set up a query result location in Amazon S3.
- In the Query editor, run a DDL statement to create a database:
create database s3_inventory_db
- Create a table schema in the database.
Example query for an ORC-formatted inventory report:
Note: Replace the following placeholder values with your own values:
- For your_table_name, enter your Athena table name.
- For LOCATION, enter the S3 bucket and prefix path. Make sure to include a forward slash (/) at the end of the prefix, for example, s3://doc-example-bucket/prefix/. If you don't use a prefix, then include a forward slash (/) at the end of the bucket name, for example, s3://doc-example-bucket/.
- For the 2022-01-01-00-00 date under projection.dt.range, enter the first day of the time range within which you partition the data in Athena. Your queries can then filter on the dt partition key, as shown in the example after the table statements below.
- Remove any optional fields that you didn't choose for your inventory.
CREATE EXTERNAL TABLE your_table_name(
  bucket string,
  key string,
  version_id string,
  is_latest boolean,
  is_delete_marker boolean,
  size bigint,
  last_modified_date timestamp,
  e_tag string,
  storage_class string,
  is_multipart_uploaded boolean,
  replication_status string,
  encryption_status string,
  object_lock_retain_until_date bigint,
  object_lock_mode string,
  object_lock_legal_hold_status string,
  intelligent_tiering_access_tier string,
  bucket_key_status string,
  checksum_algorithm string,
  object_access_control_list string,
  object_owner string
)
PARTITIONED BY (
  dt string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION 's3://source-bucket/config-ID/hive/'
TBLPROPERTIES (
  "projection.enabled" = "true",
  "projection.dt.type" = "date",
  "projection.dt.format" = "yyyy-MM-dd-HH-mm",
  "projection.dt.range" = "2022-01-01-00-00,NOW",
  "projection.dt.interval" = "1",
  "projection.dt.interval.unit" = "HOURS"
);
Note: To query a Parquet-formatted inventory report, use the following Parquet SerDe in place of the ORC SerDe in the ROW FORMAT SERDE statement:
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
Example query for a CSV-formatted inventory report:
CREATE EXTERNAL TABLE your_table_name(
  bucket string,
  key string,
  version_id string,
  is_latest boolean,
  is_delete_marker boolean,
  size string,
  last_modified_date string,
  e_tag string,
  storage_class string,
  is_multipart_uploaded boolean,
  replication_status string,
  encryption_status string,
  object_lock_retain_until_date string,
  object_lock_mode string,
  object_lock_legal_hold_status string,
  intelligent_tiering_access_tier string,
  bucket_key_status string,
  checksum_algorithm string,
  object_access_control_list string,
  object_owner string
)
PARTITIONED BY (
  dt string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION 's3://source-bucket/config-ID/hive/'
TBLPROPERTIES (
  "projection.enabled" = "true",
  "projection.dt.type" = "date",
  "projection.dt.format" = "yyyy-MM-dd-HH-mm",
  "projection.dt.range" = "2022-01-01-00-00,NOW",
  "projection.dt.interval" = "1",
  "projection.dt.interval.unit" = "HOURS"
);
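The dt column from the note above is the partition key that partition projection creates. As a quick sanity check after you create the table, a minimal example query against a single partition might look like the following, assuming your table is named your_table_name and that an inventory report exists for the example dt value:

-- Hypothetical example: replace the dt value with an actual inventory
-- delivery date-time in the yyyy-MM-dd-HH-mm format.
SELECT bucket, key, size, storage_class
FROM your_table_name
WHERE dt = '2022-01-01-00-00'
LIMIT 10;

Filtering on dt limits the query to a single inventory delivery, which also reduces the amount of data that Athena scans.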
- In the left pane, under Tables, choose the ellipsis next to the table name, and then choose Preview table. If you see data from the inventory report in the Results window, then you successfully created the Athena table. The data includes values such as bucket, key, version_id, size, and object_owner.
You can now query the Amazon S3 inventory files.
For detailed instructions on how to query your Amazon S3 Inventory files, see Querying Amazon S3 Inventory with Amazon Athena.
Example queries
Use the following example queries to run Amazon S3 inventory reports. You can also create your own queries for your use case.
Sort objects by size and report the object count for each size
SELECT DISTINCT size FROM your_table_name ORDER BY 1 DESC LIMIT 10;
SELECT size, count(*) FROM your_table_name GROUP BY size;
Check the encryption status to identify security exposure and activate encryption where needed
SELECT encryption_status, count(*) FROM your_table_name GROUP BY encryption_status;
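To list the affected objects, you can also filter on a specific encryption_status value. The following sketch assumes that your inventory report uses NOT-SSE for objects without server-side encryption; confirm the exact values against the output of the previous query:

-- Assumes 'NOT-SSE' is the value your report uses for objects
-- that aren't server-side encrypted.
SELECT bucket, key
FROM your_table_name
WHERE encryption_status = 'NOT-SSE'
LIMIT 100;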
Count by last modified date to check for active data
SELECT last_modified_date, count(*) FROM your_table_name GROUP BY last_modified_date;
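To look for stale data, you can also count objects that haven't changed since a cutoff date. This sketch assumes the ORC-formatted or Parquet-formatted table, where last_modified_date is a timestamp (in the CSV-formatted table it's a string), and uses an arbitrary example cutoff:

-- The cutoff date is an example value; adjust it to your own threshold.
SELECT count(*)
FROM your_table_name
WHERE last_modified_date < TIMESTAMP '2022-01-01 00:00:00';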
Count or list objects greater than or equal to 5 GB from an inventory date
SELECT COUNT(*) FROM your_table_name WHERE size >= 5000000000;
SELECT bucket, key, size FROM your_table_name WHERE size >= 5000000000;
Report infrequently accessed data by tier
SELECT intelligent_tiering_access_tier, count(*) FROM your_table_name GROUP BY intelligent_tiering_access_tier;
Get the S3 keys that have Object ACL grants with public access
WITH grants AS (
  SELECT key,
    CAST(json_extract(from_utf8(from_base64(object_access_control_list)), '$.grants') AS ARRAY(MAP(VARCHAR, VARCHAR))) AS grants_array
  FROM your_table_name
)
SELECT key, grants_array, grant
FROM grants, UNNEST(grants_array) AS t(grant)
WHERE element_at(grant, 'uri') = 'http://acs.amazonaws.com/groups/global/AllUsers';
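This query and the ones that follow decode the object_access_control_list field, which the inventory report stores as base64-encoded JSON. If you want to inspect the decoded JSON directly before working with the grants array, a minimal sketch looks like this (assuming the same your_table_name table):

-- View the decoded Object ACL JSON for a few objects.
SELECT key, from_utf8(from_base64(object_access_control_list)) AS acl_json
FROM your_table_name
LIMIT 10;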
Get the S3 keys that have Object ACL grantees other than the object owner
WITH grants AS (
  SELECT key,
    from_utf8(from_base64(object_access_control_list)) AS object_access_control_list,
    object_owner,
    CAST(json_extract(from_utf8(from_base64(object_access_control_list)), '$.grants') AS ARRAY(MAP(VARCHAR, VARCHAR))) AS grants_array
  FROM your_table_name
)
SELECT key, grant, object_owner
FROM grants, UNNEST(grants_array) AS t(grant)
WHERE cardinality(grants_array) > 1
  AND element_at(grant, 'canonicalId') != object_owner;
Get the S3 keys with READ permission granted in the Object ACL
WITH grants AS (
  SELECT key,
    CAST(json_extract(from_utf8(from_base64(object_access_control_list)), '$.grants') AS ARRAY(MAP(VARCHAR, VARCHAR))) AS grants_array
  FROM your_table_name
)
SELECT key, grants_array, grant
FROM grants, UNNEST(grants_array) AS t(grant)
WHERE element_at(grant, 'permission') = 'READ';
Get the S3 keys that have Object ACL grants to a canonical user ID
WITH grants AS (
  SELECT key,
    CAST(json_extract(from_utf8(from_base64(object_access_control_list)), '$.grants') AS ARRAY(MAP(VARCHAR, VARCHAR))) AS grants_array
  FROM your_table_name
)
SELECT key, grants_array, grant
FROM grants, UNNEST(grants_array) AS t(grant)
WHERE element_at(grant, 'canonicalId') = 'user-canonical-id';
Get the number of grantees on the Object ACL
SELECT key, object_access_control_list, json_array_length(json_extract(from_utf8(from_base64(object_access_control_list)), '$.grants')) AS grants_count FROM your_table_name;
Related information
How do I use Amazon Athena to analyze my Amazon S3 server access logs?