Monitor glue tables queried via Athena

0

How can I monitor who is querying which glue tables?

After some trial and error I found that the BatchGetTable Glue API event is recorded using CloudTrail every time I run an Athena query, and stores the Glue databases and tables under requestParameters as an array , but this API event is not documented.

Any advice would be much appreciated.

Soumaya
asked 2 months ago493 views
2 Answers
0

Hello,

Thank you for reaching out about this. I have tried replicating the same, and can confirm your findings; when Athena queries are executed, the Glue 'BatchGetTable' API event/call for the same is listed in CloudTrail. Part of the 'requestParameters' field of this event is the 'entries' array, containing the Glue databases and tables as you have mentioned as well.

I also could not find documentation on the same at the moment; however, in order to monitor who is querying which Glue tables you can ensure that you have CloudTrail logs set up and that these are being delivered to S3 - if you create a Glue table for this data in Athena, you can query the CloudTrail logs in Athena to list the IAM users/roles that are involved with executing each query and the Glue database/tables associated with this event from the array. In you query, you could use filters such as the following when querying your CloudTrail table, to ensure that the API call is made by Athena:

WHERE eventname = 'BatchGetTable' AND useragent = 'athena.amazonaws.com'

Another option could be to monitor the StartQueryExecution API calls that are used to run queries in Athena, and the IAM principals associated with these calls. You could get these on an individual query-by-query basis as needed, or use the Athena "ListQueryExecutions" API call or CLI command for the same. For the query execution ID returned for each StartQueryExecution API call, you could then use the Athena 'GetQueryExecution' API or CLI command to get the query string and manually parse the query string to find the table scanned - another option is to use the Athena 'GetQueryRuntimeStatistics' API or CLI command, which would return details about the query execution and it's stages. Part of this output result would be the "QueryStagePlan" field, which contains a "Children" array field that mentions table identifiers used in the query in the format "awsdatacatalog:database_name:table_name". These approaches using API calls are much more involved, but offer some flexibility - it could be a useful approach if you would like to look into or monitor who is querying which tables on an ad-hoc basis or for a few specific query executions when needed, as you would not have to have CloudTrail data already captured in S3, and a table for the same in Glue to query with Athena.

More information on some of the Athena APIs mentioned can be found in the list here (https://docs.aws.amazon.com/athena/latest/APIReference/API_Operations.html) or on the CLI or relevant SDK documentation pages.

I hope that you find this helpful. Please let me know if you have additional questions, I'd be happy to help further wherever I can.

Regards, Zubair

AWS
SUPPORT ENGINEER
answered 2 months ago
0

Hello Zubair

Thanks for the detailed answer and the alternative options. We're looking to publish daily statistics on who is using what tables so we cannot rely on an ad-hoc approach. I was hoping there was an off-the-shelf solution but it seems we will have to build something ourselves. I think the simplest solution would be to use the Glue 'BatchGetTable' API event, provided I can get confirmation that this API event will not get deprecated!

Regards Soumaya

Soumaya
answered 2 months ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions