Help us improve the AWS re:Post Knowledge Center by sharing your feedback in a brief survey. Your input can influence how we create and update our content to better support your AWS journey.
如何使用 Athena 查询我的 Amazon S3 清单报告?
我想使用 Amazon Athena 查询 Amazon Simple Storage Service(Amazon S3)清单文件。
解决方法
按照以下步骤查询 Amazon S3 清单文件,并生成 ORC 格式、Parquet 格式或 CSV 格式的清单报告。
-
为您的 S3 存储桶配置 Amazon S3 清单。记下保存清单报告的目标存储桶的值。
-
打开 Athena 控制台。
**注意:**在运行第一个查询之前,您可能需要在 Amazon S3 中设置查询结果位置。 -
在查询编辑器中运行 DDL 语句,以此来创建数据库。
create database s3_inventory_db -
在数据库中创建表架构。
ORC 格式清单报告的示例查询:
**注意:**将以下变量替换为您的变量:
- 对于 your_table_name,请输入您的 Athena 表名。
- 对于 LOCATION,输入 S3 存储桶和前缀路径。确保在前缀末尾加上一个正斜杠(/)(例如 s3://doc-example-bucket/prefix/)。如没有使用前缀,则在存储桶名称末尾加上一个正斜杠(/),例如 s3://doc-example-bucket/。
- 对于 projection.dt.range 下的 2022-01-01-00-00 日期,输入在 Athena 中对数据进行分区的时间范围内的第一天。
- 删除您未为清单选择的所有可选字段。
CREATE EXTERNAL TABLE your_table_name( bucket string, key string, version_id string, is_latest boolean, is_delete_marker boolean, size bigint, last_modified_date timestamp, e_tag string, storage_class string, is_multipart_uploaded boolean, replication_status string, encryption_status string, object_lock_retain_until_date bigint, object_lock_mode string, object_lock_legal_hold_status string, intelligent_tiering_access_tier string, bucket_key_status string, checksum_algorithm string, object_access_control_list string, object_owner string ) PARTITIONED BY ( dt string ) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat' LOCATION 's3://source-bucket/config-ID/hive/' TBLPROPERTIES ( "projection.enabled" = "true", "projection.dt.type" = "date", "projection.dt.format" = "yyyy-MM-dd-HH-mm", "projection.dt.range" = "2022-01-01-00-00,NOW", "projection.dt.interval" = "1", "projection.dt.interval.unit" = "HOURS" );**注意:**要查询 Parquet 格式的清单报告,请在 ROW FORMAT SERDE 语句中使用以下 Parquet SerDe 代替 ORC SerDe:
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'CSV 格式清单报告的示例查询:
CREATE EXTERNAL TABLE your_table_name( bucket string, key string, version_id string, is_latest boolean, is_delete_marker boolean, size string, last_modified_date string, e_tag string, storage_class string, is_multipart_uploaded boolean, replication_status string, encryption_status string, object_lock_retain_until_date string, object_lock_mode string, object_lock_legal_hold_status string, intelligent_tiering_access_tier string, bucket_key_status string, checksum_algorithm string, object_access_control_list string, object_owner string ) PARTITIONED BY ( dt string ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat' LOCATION 's3://source-bucket/config-ID/hive/' TBLPROPERTIES ( "projection.enabled" = "true", "projection.dt.type" = "date", "projection.dt.format" = "yyyy-MM-dd-HH-mm", "projection.dt.range" = "2022-01-01-00-00,NOW", "projection.dt.interval" = "1", "projection.dt.interval.unit" = "HOURS" ); -
在左侧窗格的表下,选择表名称旁边的省略号,然后选择预览表。如在结果窗口中看到来自服务器访问日志的数据,则表示成功创建了 Athena 表。此数据看起来像 bucket、key、version_id、size 和 objectowner 等值。
您现在可以查询 Amazon S3 清单文件。
有关如何查询 Amazon S3 清单文件的详细说明,请参阅使用 Amazon Athena 查询 Amazon S3 清单。
查询示例
使用以下示例查询运行 Amazon S3 清单报告。您也可以为自己的用例创建查询。
按大小和报告对对象进行排序
SELECT DISTINCT size FROM your_table_name ORDER BY 1 DESC limit 10; SELECT size, count(*) FROM your_table_name GROUP BY size;
检查加密状态以识别安全风险并激活加密
SELECT encryption_status, count(*) FROM your_table_name GROUP BY encryption_status;
按上次修改日期进行计数以检查活动数据
SELECT last_modified_date, count(*) FROM your_table_name GROUP BY last_modified_date;
统计或列出自清点之日起大于或等于 5 GB 的对象
SELECT COUNT(*) FROM your_table_name WHERE size >= 5000000000; SELECT bucket,key,size FROM your_table_name WHERE size>5000000000;
按层级报告不经常访问的数据
SELECT intelligent_tiering_tier,count (*) FROM your_table_name GROUP BY intelligent_tiering_tier;
获取具有公共访问权限的对象 ACL 授权的 S3 密钥
SELECT key, CAST( json_extract(from_utf8(from_base64(object_access_control_list)), '$.grants') AS ARRAY(MAP(VARCHAR, VARCHAR)) ) AS grants_array FROM your_table_name ) SELECT key, grants_array, grant FROM grants, UNNEST(grants_array) AS t(grant) WHERE element_at(grant, 'uri') = 'http://acs.amazonaws.com/groups/global/AllUsers'
获取拥有对象 ACL 被授权者和对象拥有者的 S3 密钥
WITH grants AS (SELECT key, from_utf8(from_base64(object_access_control_list)) AS object_access_control_list, object_owner, CAST(json_extract(from_utf8(from_base64(object_access_control_list)), '$.grants') AS ARRAY(MAP(VARCHAR, VARCHAR))) AS grants_array FROM your_table_name) SELECT key, grant, objectowner FROM grants, UNNEST(grants_array) AS t(grant) WHERE cardinality(grants_array) > 1 AND element_at(grant, 'canonicalId') != object_owner;
获取对象 ACL 中授予读取权限的 S3 密钥
WITH grants AS ( SELECT key, CAST( json_extract(from_utf8(from_base64(object_access_control_list)), '$.grants') AS ARRAY(MAP(VARCHAR, VARCHAR)) ) AS grants_array FROM your_table_name ) SELECT key, grants_array, grant FROM grants, UNNEST(grants_array) AS t(grant) WHERE element_at(grant, 'permission') = 'READ';
获取向规范用户 ID 授予对象 ACL 的 S3 密钥
WITH grants AS ( SELECT key, CAST( json_extract(from_utf8(from_base64(object_access_control_list)), '$.grants') AS ARRAY(MAP(VARCHAR, VARCHAR)) ) AS grants_array FROM your_table_name ) SELECT key, grants_array, grant FROM grants, UNNEST(grants_array) AS t(grant) WHERE element_at(grant, 'canonicalId') = 'user-canonical-id';
获取对象 ACL 中的被授权者数量
SELECT key, object_access_control_list, json_array_length(json_extract(object_access_control_list,'$.grants')) AS grants_count FROM your_table_name;
相关信息
- 语言
- 中文 (简体)

相关内容
AWS 官方已更新 1 年前