How do I see the Amazon S3 source file for a row in an Athena table?

3 minute read
2

I want the results from my Amazon Athena query to return the Amazon Simple Storage Service (Amazon S3) source file locations for each row in the results.

Short description

You can write Athena queries that return the paths of the source files for the data returned by your query.

  • To find the Amazon S3 file that contains data that's returned by a specific row in an Athena table, run a SELECT $path query.
  • To find the Athena table rows that are associated with a specific Amazon S3 file, run a SELECT query with the WHERE $path condition.

Resolution

Prerequisites: Before you run the queries in this article, you must have the following:

  • An Athena table. If you don't have a table, then run a CREATE TABLE statement.
  • An AWS Identity and Access Management (IAM) user or role that has permissions to run Athena queries.

Find the S3 file that's associated with a row from your Athena table

To find the path of the S3 source file for a row from your Athena table, run queries that are similar to the following examples:

  1. To return the data that you want, run a SELECT query against your table:

    SELECT * FROM "my_database"."my_table" WHERE year=2019;
  2. To find the Amazon S3 source file for the data, run a query similar to the following:

    SELECT "$path" FROM "my_database"."my_table" WHERE year=2019;

The previous query returns the Amazon S3 path for the data:

s3://awsexamplebucket/datasets_mytable/year=2019/data_file1.json

Find the Athena table rows that come from a specific S3 file

To find all the rows from your Athena table that have a specific S3 source file, refer to the following steps:

  1. Confirm the name and location of the Amazon S3 object that you want to retrieve rows for. If you don't know the object name and location, then run the ls command to list objects under a specific S3 bucket or prefix. In the following example, replace s3://awsexamplebucket/my_table/my_partition/ with your partition path.

    aws s3 ls s3://awsexamplebucket/my_table/my_partition/
  2. To return data that's associated with the Amazon S3 file, in Athena, run a SELECT query against your table:

    SELECT *, "$path" FROM "my_database"."my_table" WHERE regexp_like("$path", 's3://awsexamplebucket/my_table/my_partition/file-01.csv')

The previous query returns data that comes from the Amazon S3 path that you specified:

id    name    year    $path
3     John    1999    's3://awsexamplebucket/my_table/my_partition/file-01.csv'
4     Jane    2000    's3://awsexamplebucket/my_table/my_partition/file-01.csv'

Athena queries support wildcard functionality with the regexp_like function. For example, to return data that comes from all source files that share a string of characters in their path, run a query like this:

SELECT *, "$path" FROM "my_database"."my_table" WHERE regexp_like("$path", 'file-1')

This query returns all rows with file path names that contain file-1:

id    name    year    $path
13    Mia     2009    's3://awsexamplebucket/my_table/my_partition/file-10.csv'
14    Mary    2010    's3://awsexamplebucket/my_table/my_partition/file-10.csv'
15    Max     2011    's3://awsexamplebucket/my_table/my_partition/file-11.csv'
16    Merriam 2012    's3://awsexamplebucket/my_table/my_partition/file-12.csv'

Related information

Parquet SerDe

Connect to data sources

Getting the file locations for source data in Amazon S3

AWS OFFICIAL
AWS OFFICIALUpdated 4 months ago