I want the results from my Amazon Athena query to return the Amazon Simple Storage Service (Amazon S3) source file locations for each row in the results.
Short description
You can write Athena queries that return the paths of the source files for the data returned by your query.
- To find the Amazon S3 file that contains data that's returned by a specific row in an Athena table, run a SELECT $path query.
- To find the Athena table rows that are associated with a specific Amazon S3 file, run a SELECT query with the WHERE $path condition.
Resolution
Prerequisites: Before you run the queries in this article, you must have the following:
- An Athena table. If you don't have a table, then run a CREATE TABLE statement.
- An AWS Identity and Access Management (IAM) user or role that has permissions to run Athena queries.
Find the S3 file that's associated with a row from your Athena table
To find the path of the S3 source file for a row from your Athena table, run queries that are similar to the following examples:
-
To return the data that you want, run a SELECT query against your table:
SELECT * FROM "my_database"."my_table" WHERE year=2019;
-
To find the Amazon S3 source file for the data, run a query similar to the following:
SELECT "$path" FROM "my_database"."my_table" WHERE year=2019;
The previous query returns the Amazon S3 path for the data:
s3://awsexamplebucket/datasets_mytable/year=2019/data_file1.json
Find the Athena table rows that come from a specific S3 file
To find all the rows from your Athena table that have a specific S3 source file, refer to the following steps:
-
Confirm the name and location of the Amazon S3 object that you want to retrieve rows for. If you don't know the object name and location, then run the ls command to list objects under a specific S3 bucket or prefix. In the following example, replace s3://awsexamplebucket/my_table/my_partition/ with your partition path.
aws s3 ls s3://awsexamplebucket/my_table/my_partition/
-
To return data that's associated with the Amazon S3 file, in Athena, run a SELECT query against your table:
SELECT *, "$path" FROM "my_database"."my_table" WHERE regexp_like("$path", 's3://awsexamplebucket/my_table/my_partition/file-01.csv')
The previous query returns data that comes from the Amazon S3 path that you specified:
id name year $path
3 John 1999 's3://awsexamplebucket/my_table/my_partition/file-01.csv'
4 Jane 2000 's3://awsexamplebucket/my_table/my_partition/file-01.csv'
Athena queries support wildcard functionality with the regexp_like function. For example, to return data that comes from all source files that share a string of characters in their path, run a query like this:
SELECT *, "$path" FROM "my_database"."my_table" WHERE regexp_like("$path", 'file-1')
This query returns all rows with file path names that contain file-1:
id name year $path
13 Mia 2009 's3://awsexamplebucket/my_table/my_partition/file-10.csv'
14 Mary 2010 's3://awsexamplebucket/my_table/my_partition/file-10.csv'
15 Max 2011 's3://awsexamplebucket/my_table/my_partition/file-11.csv'
16 Merriam 2012 's3://awsexamplebucket/my_table/my_partition/file-12.csv'
Related information
Parquet SerDe
Connect to data sources
Getting the file locations for source data in Amazon S3