如何使用 Athena 查询我的 Amazon S3 清单报告?
我想使用 Amazon Athena 查询 Amazon Simple Storage Service(Amazon S3)清单文件。
解决方法
按照以下步骤查询 Amazon S3 清单文件,并生成 ORC 格式、Parquet 格式或 CSV 格式的清单报告。
-
为您的 S3 存储桶配置 Amazon S3 清单。记下保存清单报告的目标存储桶的值。
-
打开 Athena 控制台。
**注意:**在运行第一个查询之前,您可能需要在 Amazon S3 中设置查询结果位置。 -
在查询编辑器中运行 DDL 语句,以此来创建数据库。
create database s3\_inventory\_db
-
在数据库中创建表架构。
ORC 格式清单报告的示例查询:
**注意:**将以下变量替换为您的变量:
- 对于 your_table_name,请输入您的 Athena 表名。
- 对于 LOCATION,输入 S3 存储桶和前缀路径。确保在前缀末尾加上一个正斜杠(/)(例如 s3://doc-example-bucket/prefix/)。如没有使用前缀,则在存储桶名称末尾加上一个正斜杠(/),例如 s3://doc-example-bucket/。
- 对于 projection.dt.range 下的 2022-01-01-00-00 日期,输入在 Athena 中对数据进行分区的时间范围内的第一天。
- 删除您未为清单选择的所有可选字段。
CREATE EXTERNAL TABLE your_table_name( bucket string, key string, version_id string, is_latest boolean, is_delete_marker boolean, size bigint, last_modified_date timestamp, e_tag string, storage_class string, is_multipart_uploaded boolean, replication_status string, encryption_status string, object_lock_retain_until_date bigint, object_lock_mode string, object_lock_legal_hold_status string, intelligent_tiering_access_tier string, bucket_key_status string, checksum_algorithm string, object_access_control_list string, object_owner string ) PARTITIONED BY ( dt string ) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat' LOCATION 's3://source-bucket/config-ID/hive/' TBLPROPERTIES ( "projection.enabled" = "true", "projection.dt.type" = "date", "projection.dt.format" = "yyyy-MM-dd-HH-mm", "projection.dt.range" = "2022-01-01-00-00,NOW", "projection.dt.interval" = "1", "projection.dt.interval.unit" = "HOURS" );
**注意:**要查询 Parquet 格式的清单报告,请在 ROW FORMAT SERDE 语句中使用以下 Parquet SerDe 代替 ORC SerDe:
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
CSV 格式清单报告的示例查询:
CREATE EXTERNAL TABLE your_table_name( bucket string, key string, version_id string, is_latest boolean, is_delete_marker boolean, size string, last_modified_date string, e_tag string, storage_class string, is_multipart_uploaded boolean, replication_status string, encryption_status string, object_lock_retain_until_date string, object_lock_mode string, object_lock_legal_hold_status string, intelligent_tiering_access_tier string, bucket_key_status string, checksum_algorithm string, object_access_control_list string, object_owner string ) PARTITIONED BY ( dt string ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat' LOCATION 's3://source-bucket/config-ID/hive/' TBLPROPERTIES ( "projection.enabled" = "true", "projection.dt.type" = "date", "projection.dt.format" = "yyyy-MM-dd-HH-mm", "projection.dt.range" = "2022-01-01-00-00,NOW", "projection.dt.interval" = "1", "projection.dt.interval.unit" = "HOURS" );
-
在左侧窗格的表下,选择表名称旁边的省略号,然后选择预览表。如在结果窗口中看到来自服务器访问日志的数据,则表示成功创建了 Athena 表。此数据看起来像 bucket、key、version_id、size 和 objectowner 等值。
您现在可以查询 Amazon S3 清单文件。
有关如何查询 Amazon S3 清单文件的详细说明,请参阅使用 Amazon Athena 查询 Amazon S3 清单。
查询示例
使用以下示例查询运行 Amazon S3 清单报告。您也可以为自己的用例创建查询。
按大小和报告对对象进行排序
SELECT DISTINCT size FROM your_table_name ORDER BY 1 DESC limit 10; SELECT size, count(*) FROM your_table_name GROUP BY size;
检查加密状态以识别安全风险并激活加密
SELECT encryption_status, count(*) FROM your_table_name GROUP BY encryption_status;
按上次修改日期进行计数以检查活动数据
SELECT last_modified_date, count(*) FROM your_table_name GROUP BY last_modified_date;
统计或列出自清点之日起大于或等于 5 GB 的对象
SELECT COUNT(*) FROM your_table_name WHERE size >= 5000000000; SELECT bucket,key,size FROM your_table_name WHERE size>5000000000;
按层级报告不经常访问的数据
SELECT intelligent_tiering_tier,count (*) FROM your_table_name GROUP BY intelligent_tiering_tier;
获取具有公共访问权限的对象 ACL 授权的 S3 密钥
SELECT key, CAST( json_extract(from_utf8(from_base64(object_access_control_list)), '$.grants') AS ARRAY(MAP(VARCHAR, VARCHAR)) ) AS grants_array FROM your_table_name ) SELECT key, grants_array, grant FROM grants, UNNEST(grants_array) AS t(grant) WHERE element_at(grant, 'uri') = 'http://acs.amazonaws.com/groups/global/AllUsers'
获取拥有对象 ACL 被授权者和对象拥有者的 S3 密钥
WITH grants AS (SELECT key, from_utf8(from_base64(object_access_control_list)) AS object_access_control_list, object_owner, CAST(json_extract(from_utf8(from_base64(object_access_control_list)), '$.grants') AS ARRAY(MAP(VARCHAR, VARCHAR))) AS grants_array FROM your_table_name) SELECT key, grant, objectowner FROM grants, UNNEST(grants_array) AS t(grant) WHERE cardinality(grants_array) > 1 AND element_at(grant, 'canonicalId') != object_owner;
获取对象 ACL 中授予读取权限的 S3 密钥
WITH grants AS ( SELECT key, CAST( json_extract(from_utf8(from_base64(object_access_control_list)), '$.grants') AS ARRAY(MAP(VARCHAR, VARCHAR)) ) AS grants_array FROM your_table_name ) SELECT key, grants_array, grant FROM grants, UNNEST(grants_array) AS t(grant) WHERE element_at(grant, 'permission') = 'READ';
获取向规范用户 ID 授予对象 ACL 的 S3 密钥
WITH grants AS ( SELECT key, CAST( json_extract(from_utf8(from_base64(object_access_control_list)), '$.grants') AS ARRAY(MAP(VARCHAR, VARCHAR)) ) AS grants_array FROM your_table_name ) SELECT key, grants_array, grant FROM grants, UNNEST(grants_array) AS t(grant) WHERE element_at(grant, 'canonicalId') = 'user-canonical-id';
获取对象 ACL 中的被授权者数量
SELECT key, object_access_control_list, json_array_length(json_extract(object_access_control_list,'$.grants')) AS grants_count FROM your_table_name;
相关信息
相关内容
- AWS 官方已更新 2 年前
- AWS 官方已更新 1 年前
- AWS 官方已更新 1 年前
- AWS 官方已更新 3 年前