Skip to content

Athena (Iceberg tables) failed to query history table using PySpark from AWS Glue

0

Hi, having troubles querying Iceberg history table using PySpark from AWS Glue 4.0, same queries working in the Athena web query editor. The query that failing:

def get_last_snapshot_id_by_table(table, start_date, end_date):
    query = """
WITH snapshot_versions AS (
    SELECT
        snapshot_id,
        made_current_at,
        LEAD(made_current_at) OVER (ORDER BY made_current_at) as valid_until
    FROM {table}.history
    WHERE made_current_at BETWEEN TIMESTAMP '{start_date}' AND TIMESTAMP '{end_date}'
)
SELECT *
FROM snapshot_versions
ORDER BY made_current_at DESC
LIMIT 1
""".format(table=table, start_date=start_date, end_date=end_date)
    df = spark.sql(query)
    df.show()
    rows = df.collect()
    if len(rows) > 0:
        return rows[0]["snapshot_id"]

    return None

Getting error: Error Category: RESOURCE_NOT_FOUND_ERROR; Failed Line Number: 411; AnalysisException: Table or view not found: [table_name].history; line 3 pos 5;

pyspark.sql.utils.AnalysisException: Table or view not found: [table_name].history; line 3 pos 5;
'GlobalLimit 1
+- 'LocalLimit 1
   +- 'Sort ['made_current_at DESC NULLS LAST], true
      +- 'Project ['snapshot_id]
         +- 'UnresolvedRelation [my_iceberg_table, history], [], false

Query directly to [table_name] working fine as well.

Tried different steps with no luck:

  • using [table_name]$history
  • different tables, new or existing
asked a year ago529 views
1 Answer
0

The issue you're experiencing is likely due to differences in how Athena and PySpark in AWS Glue access Iceberg table metadata. While Athena has built-in support for querying Iceberg metadata tables like 'history', PySpark in AWS Glue may not have the same level of integration out of the box.

To resolve this, you can try the following approaches:

  1. Use the Iceberg API directly in PySpark: Instead of querying the history table using SQL, you can use the Iceberg API to access the table's history. This method bypasses the need for a SQL query and directly interacts with the Iceberg table metadata.

Here's an example of how you might modify your function:

from pyspark.sql import SparkSession
from iceberg.api import Expressions

def get_last_snapshot_id_by_table(table, start_date, end_date):
spark = SparkSession.builder.getOrCreate()
catalog = spark.catalog()
iceberg_table = catalog.loadTable(table)

snapshots = iceberg_table.snapshots()
filtered_snapshots = [s for s in snapshots if start_date <= s.timestampMillis() <= end_date]

if filtered_snapshots:
return max(filtered_snapshots, key=lambda s: s.timestampMillis()).snapshotId()
return None
  1. Use the Iceberg catalog API: If the above method doesn't work, you can try using the Iceberg catalog API to access the table's metadata directly:
from pyspark.sql import SparkSession
from iceberg import catalog

def get_last_snapshot_id_by_table(table, start_date, end_date):
spark = SparkSession.builder.getOrCreate()
iceberg_catalog = catalog.load_catalog(spark)
iceberg_table = iceberg_catalog.load_table(table)

snapshots = iceberg_table.snapshots()
filtered_snapshots = [s for s in snapshots if start_date <= s.timestamp_millis <= end_date]

if filtered_snapshots:
return max(filtered_snapshots, key=lambda s: s.timestamp_millis).snapshot_id
return None
  1. Ensure proper Iceberg dependencies: Make sure that your AWS Glue job has the necessary Iceberg dependencies. You might need to include the appropriate Iceberg JAR files in your job's extra JAR files.

  2. Use Athena JDBC driver: If the above methods don't work, you could consider using the Athena JDBC driver within your Glue job to execute the query directly against Athena, which you've confirmed works correctly.

Remember to replace the date comparisons with the appropriate timestamp comparisons based on your specific date format and requirements.

These approaches should help you access the Iceberg table history in your AWS Glue PySpark job. If you continue to face issues, you may need to consult with AWS support or the Iceberg community for more specific guidance related to your environment and configuration.
Sources
Query Iceberg table data - Amazon Athena
Query Apache Iceberg tables - Amazon Athena

answered a year ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.