How can I use Athena to query my Amazon S3 Inventory reports?

5 minute read
0

I want to use Amazon Athena to query my Amazon Simple Storage Service (Amazon S3) Inventory files.

Resolution

Follow these steps to query Amazon S3 inventory files with an ORC-formatted, Parquet-formatted, or CSV-formatted inventory report.

  1. Configure the Amazon S3 inventory for your S3 bucket. Note the values for the destination bucket where the inventory reports are saved.

  2. Open the Athena console.
    Note: Before you run your first query, you might need to set up a query result location in Amazon S3.

  3. In the Query editor, run a DDL statement to create a database.

    create database s3_inventory_db
  4. Create a table schema in the database.

    Example query for an ORC-formatted inventory report:

    Note: Replace the following variables with your variables:

    • For your_table_name, enter your Athena table name.
    • For LOCATION, enter the S3 bucket and prefix path. Make sure to include a forward slash (/) at the end of the prefix, for example, s3://doc-example-bucket/prefix/. If you don't use a prefix, then include a forward slash (/) at the end of the bucket name, for example, s3://doc-example-bucket/.
    • For the 2022-01-01-00-00 date under projection.dt.range, enter the first day of the time range within which you partition the data in Athena.
    • Remove any optional fields that you didn't choose for your inventory.
    CREATE EXTERNAL TABLE your_table_name(
             bucket string,
             key string,
             version_id string,
             is_latest boolean,
             is_delete_marker boolean,
             size bigint,
             last_modified_date timestamp,
             e_tag string,
             storage_class string,
             is_multipart_uploaded boolean,
             replication_status string,
             encryption_status string,
             object_lock_retain_until_date bigint,
             object_lock_mode string,
             object_lock_legal_hold_status string,
             intelligent_tiering_access_tier string,
             bucket_key_status string,
             checksum_algorithm string,
             object_access_control_list string,
             object_owner string
    ) PARTITIONED BY (
            dt string
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
      STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
      LOCATION 's3://source-bucket/config-ID/hive/'
      TBLPROPERTIES (
        "projection.enabled" = "true",
        "projection.dt.type" = "date",
        "projection.dt.format" = "yyyy-MM-dd-HH-mm",
        "projection.dt.range" = "2022-01-01-00-00,NOW",
        "projection.dt.interval" = "1",
        "projection.dt.interval.unit" = "HOURS"
      );

    Note: To query a Parquet-formatted inventory report, use the following Parquet SerDe in place of the ORC SerDe in the ROW FORMAT SERDE statement:

    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'

    Example query for an CSV-formatted inventory report:

    CREATE EXTERNAL TABLE your_table_name(
             bucket string,
             key string,
             version_id string,
             is_latest boolean,
             is_delete_marker boolean,
             size string,
             last_modified_date string,
             e_tag string,
             storage_class string,
             is_multipart_uploaded boolean,
             replication_status string,
             encryption_status string,
             object_lock_retain_until_date string,
             object_lock_mode string,
             object_lock_legal_hold_status string,
             intelligent_tiering_access_tier string,
             bucket_key_status string,
             checksum_algorithm string,
             object_access_control_list string,
             object_owner string
    ) PARTITIONED BY (
            dt string
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
      STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
      LOCATION 's3://source-bucket/config-ID/hive/'
      TBLPROPERTIES (
        "projection.enabled" = "true",
        "projection.dt.type" = "date",
        "projection.dt.format" = "yyyy-MM-dd-HH-mm",
        "projection.dt.range" = "2022-01-01-00-00,NOW",
        "projection.dt.interval" = "1",
        "projection.dt.interval.unit" = "HOURS"
      );
  5. In the left pane, under Tables, choose the ellipsis next to the table name, and then choose Preview table. If you see data from the server access logs in the Results window, then you successfully created the Athena table. The data looks like values such as bucket, key, version_id, size, and objectowner.

You can now query the Amazon S3 inventory files.

For detailed instructions on how to query your Amazon S3 Inventory files, see Querying Amazon S3 Inventory with Amazon Athena.

Example queries

Use the following example queries to run Amazon S3 inventory reports. You can also create your own queries for your use case.

Sort objects by size and report

SELECT DISTINCT size FROM your_table_name ORDER BY 1 DESC limit 10;
SELECT size, count(*) FROM your_table_name GROUP BY size;

Check the encryption status to identify security exposure to activate encryption

SELECT encryption_status, count(*) FROM your_table_name GROUP BY encryption_status;

Count by last modified date to check for active data

SELECT last_modified_date, count(*) FROM your_table_name GROUP BY last_modified_date;

Count or list objects greater than or equal to 5 GB from an inventory date

SELECT COUNT(*) FROM your_table_name WHERE size >= 5000000000;
SELECT bucket,key,size FROM your_table_name WHERE size>5000000000;

Report infrequently accessed data by tier

SELECT intelligent_tiering_tier,count (*) FROM your_table_name GROUP BY intelligent_tiering_tier;

Get the S3 keys that have Object ACL grants with public access

SELECT key,
CAST(
json_extract(from_utf8(from_base64(object_access_control_list)), '$.grants') AS ARRAY(MAP(VARCHAR, VARCHAR))
) AS grants_array
FROM your_table_name
)
SELECT key,
grants_array,
grant
FROM grants, UNNEST(grants_array) AS t(grant)
WHERE element_at(grant, 'uri') = 'http://acs.amazonaws.com/groups/global/AllUsers'

Get the S3 keys that have Object ACL grantees and object owner

WITH grants AS 
    (SELECT key,
    from_utf8(from_base64(object_access_control_list)) AS object_access_control_list,
         object_owner,
         CAST(json_extract(from_utf8(from_base64(object_access_control_list)),
         '$.grants') AS ARRAY(MAP(VARCHAR, VARCHAR))) AS grants_array
    FROM your_table_name)
SELECT key,
       grant,
       objectowner
FROM grants, UNNEST(grants_array) AS t(grant)
WHERE cardinality(grants_array) > 1 AND element_at(grant, 'canonicalId') != object_owner;

Get the S3 keys with READ permission that's granted in the Object ACL

WITH grants AS (
    SELECT key,
        CAST(
            json_extract(from_utf8(from_base64(object_access_control_list)), '$.grants') AS ARRAY(MAP(VARCHAR, VARCHAR))
        ) AS grants_array
    FROM your_table_name
)
SELECT key,
       grants_array,
       grant
FROM grants, UNNEST(grants_array) AS t(grant)
WHERE element_at(grant, 'permission') = 'READ';

Get the S3 keys that have Object ACL grants to a canonical user ID

WITH grants AS (
    SELECT key,
        CAST(
            json_extract(from_utf8(from_base64(object_access_control_list)), '$.grants') AS ARRAY(MAP(VARCHAR, VARCHAR))
        ) AS grants_array
    FROM your_table_name
)
SELECT key,
       grants_array,
       grant
FROM grants, UNNEST(grants_array) AS t(grant)
WHERE element_at(grant, 'canonicalId') = 'user-canonical-id';

Get the number of grantees on the Object ACL

SELECT key,
object_access_control_list,
json_array_length(json_extract(object_access_control_list,'$.grants')) AS grants_count
FROM your_table_name;

Related information

Partitioning data in Athena

Using Athena SQL

Locating your inventory list

How do I use Amazon Athena to analyze my Amazon S3 server access logs?

AWS OFFICIAL
AWS OFFICIALUpdated 7 months ago