如何使用 Athena 查询我的 Amazon S3 清单报告?

3 分钟阅读
0

我想使用 Amazon Athena 查询 Amazon Simple Storage Service(Amazon S3)清单文件。

解决方法

按照以下步骤查询 Amazon S3 清单文件,并生成 ORC 格式、Parquet 格式或 CSV 格式的清单报告。

  1. 为您的 S3 存储桶配置 Amazon S3 清单。记下保存清单报告的目标存储桶的值。

  2. 打开 Athena 控制台
    **注意:**在运行第一个查询之前,您可能需要在 Amazon S3 中设置查询结果位置

  3. 查询编辑器中运行 DDL 语句,以此来创建数据库

    create database s3\_inventory\_db
  4. 在数据库中创建表架构

    ORC 格式清单报告的示例查询:

    **注意:**将以下变量替换为您的变量:

    • 对于 your_table_name,请输入您的 Athena 表名。
    • 对于 LOCATION,输入 S3 存储桶和前缀路径。确保在前缀末尾加上一个正斜杠(/)(例如 s3://doc-example-bucket/prefix/)。如没有使用前缀,则在存储桶名称末尾加上一个正斜杠(/),例如 s3://doc-example-bucket/
    • 对于 projection.dt.range 下的 2022-01-01-00-00 日期,输入在 Athena 中对数据进行分区的时间范围内的第一天。
    • 删除您未为清单选择的所有可选字段。
    CREATE EXTERNAL TABLE your_table_name(
             bucket string,
             key string,
             version_id string,
             is_latest boolean,
             is_delete_marker boolean,
             size bigint,
             last_modified_date timestamp,
             e_tag string,
             storage_class string,
             is_multipart_uploaded boolean,
             replication_status string,
             encryption_status string,
             object_lock_retain_until_date bigint,
             object_lock_mode string,
             object_lock_legal_hold_status string,
             intelligent_tiering_access_tier string,
             bucket_key_status string,
             checksum_algorithm string,
             object_access_control_list string,
             object_owner string
    ) PARTITIONED BY (
            dt string
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
      STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
      LOCATION 's3://source-bucket/config-ID/hive/'
      TBLPROPERTIES (
        "projection.enabled" = "true",
        "projection.dt.type" = "date",
        "projection.dt.format" = "yyyy-MM-dd-HH-mm",
        "projection.dt.range" = "2022-01-01-00-00,NOW",
        "projection.dt.interval" = "1",
        "projection.dt.interval.unit" = "HOURS"
      );

    **注意:**要查询 Parquet 格式的清单报告,请在 ROW FORMAT SERDE 语句中使用以下 Parquet SerDe 代替 ORC SerDe:

    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'

    CSV 格式清单报告的示例查询:

    CREATE EXTERNAL TABLE your_table_name(
             bucket string,
             key string,
             version_id string,
             is_latest boolean,
             is_delete_marker boolean,
             size string,
             last_modified_date string,
             e_tag string,
             storage_class string,
             is_multipart_uploaded boolean,
             replication_status string,
             encryption_status string,
             object_lock_retain_until_date string,
             object_lock_mode string,
             object_lock_legal_hold_status string,
             intelligent_tiering_access_tier string,
             bucket_key_status string,
             checksum_algorithm string,
             object_access_control_list string,
             object_owner string
    ) PARTITIONED BY (
            dt string
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
      STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
      LOCATION 's3://source-bucket/config-ID/hive/'
      TBLPROPERTIES (
        "projection.enabled" = "true",
        "projection.dt.type" = "date",
        "projection.dt.format" = "yyyy-MM-dd-HH-mm",
        "projection.dt.range" = "2022-01-01-00-00,NOW",
        "projection.dt.interval" = "1",
        "projection.dt.interval.unit" = "HOURS"
      );
  5. 在左侧窗格的下,选择表名称旁边的省略号,然后选择预览表。如在结果窗口中看到来自服务器访问日志的数据,则表示成功创建了 Athena 表。此数据看起来像 bucketkeyversion_idsizeobjectowner 等值。

您现在可以查询 Amazon S3 清单文件。

有关如何查询 Amazon S3 清单文件的详细说明,请参阅使用 Amazon Athena 查询 Amazon S3 清单

查询示例

使用以下示例查询运行 Amazon S3 清单报告。您也可以为自己的用例创建查询。

按大小和报告对对象进行排序

SELECT DISTINCT size FROM your_table_name ORDER BY 1 DESC limit 10;
SELECT size, count(*) FROM your_table_name GROUP BY size;

检查加密状态以识别安全风险并激活加密

SELECT encryption_status, count(*) FROM your_table_name GROUP BY encryption_status;

按上次修改日期进行计数以检查活动数据

SELECT last_modified_date, count(*) FROM your_table_name GROUP BY last_modified_date;

统计或列出自清点之日起大于或等于 5 GB 的对象

SELECT COUNT(*) FROM your_table_name WHERE size >= 5000000000;
SELECT bucket,key,size FROM your_table_name WHERE size>5000000000;

按层级报告不经常访问的数据

SELECT intelligent_tiering_tier,count (*) FROM your_table_name GROUP BY intelligent_tiering_tier;

获取具有公共访问权限的对象 ACL 授权的 S3 密钥

SELECT key,
CAST(
json_extract(from_utf8(from_base64(object_access_control_list)), '$.grants') AS ARRAY(MAP(VARCHAR, VARCHAR))
) AS grants_array
FROM your_table_name
)
SELECT key,
grants_array,
grant
FROM grants, UNNEST(grants_array) AS t(grant)
WHERE element_at(grant, 'uri') = 'http://acs.amazonaws.com/groups/global/AllUsers'

获取拥有对象 ACL 被授权者和对象拥有者的 S3 密钥

WITH grants AS
    (SELECT key,
    from_utf8(from_base64(object_access_control_list)) AS object_access_control_list,
         object_owner,
         CAST(json_extract(from_utf8(from_base64(object_access_control_list)),
         '$.grants') AS ARRAY(MAP(VARCHAR, VARCHAR))) AS grants_array
    FROM your_table_name)
SELECT key,
       grant,
       objectowner
FROM grants, UNNEST(grants_array) AS t(grant)
WHERE cardinality(grants_array) > 1 AND element_at(grant, 'canonicalId') != object_owner;

获取对象 ACL 中授予读取权限的 S3 密钥

WITH grants AS (
    SELECT key,
        CAST(
            json_extract(from_utf8(from_base64(object_access_control_list)), '$.grants') AS ARRAY(MAP(VARCHAR, VARCHAR))
        ) AS grants_array
    FROM your_table_name
)
SELECT key,
       grants_array,
       grant
FROM grants, UNNEST(grants_array) AS t(grant)
WHERE element_at(grant, 'permission') = 'READ';

获取向规范用户 ID 授予对象 ACL 的 S3 密钥

WITH grants AS (
    SELECT key,
        CAST(
            json_extract(from_utf8(from_base64(object_access_control_list)), '$.grants') AS ARRAY(MAP(VARCHAR, VARCHAR))
        ) AS grants_array
    FROM your_table_name
)
SELECT key,
       grants_array,
       grant
FROM grants, UNNEST(grants_array) AS t(grant)
WHERE element_at(grant, 'canonicalId') = 'user-canonical-id';

获取对象 ACL 中的被授权者数量

SELECT key,
object_access_control_list,
json_array_length(json_extract(object_access_control_list,'$.grants')) AS grants_count
FROM your_table_name;

相关信息

在 Athena 中对数据进行分区

使用 Athena SQL

查找清单列表

如何使用 Amazon Athena 分析我的 Amazon S3 服务器访问日志?

AWS 官方
AWS 官方已更新 7 个月前