The issue you're experiencing is likely due to differences in how Athena and PySpark in AWS Glue access Iceberg table metadata. Athena has built-in support for querying Iceberg metadata tables such as history (using the "table$history" syntax), while Spark in AWS Glue only exposes them once the Iceberg catalog and extensions are configured for the job, and it uses a different syntax (db.table.history).
To resolve this, you can try the following approaches:
- Read the Iceberg metadata tables directly in PySpark: instead of a plain SQL query against the history table, load the table's snapshots metadata table through the DataFrame reader. This interacts with the Iceberg table metadata directly and works once the Iceberg catalog is configured for the job.

Here's an example of how you might modify your function (this assumes `table` is fully qualified, e.g. `glue_catalog.db.tbl`, and that `start_date` and `end_date` are values comparable to the `committed_at` timestamp column):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

def get_last_snapshot_id_by_table(table, start_date, end_date):
    spark = SparkSession.builder.getOrCreate()
    # The snapshots metadata table exposes committed_at and snapshot_id columns
    snapshots = spark.read.format("iceberg").load(f"{table}.snapshots")
    latest = (
        snapshots
        .filter(F.col("committed_at").between(start_date, end_date))
        .orderBy(F.col("committed_at").desc())
        .select("snapshot_id")
        .first()
    )
    return latest["snapshot_id"] if latest else None
```
- Use the PyIceberg catalog API: if the Spark route doesn't work, the PyIceberg library (a separate dependency you would need to add to the Glue job) can read the table's metadata without going through Spark at all:

```python
from pyiceberg.catalog import load_catalog

def get_last_snapshot_id_by_table(table, start_ms, end_ms):
    # Catalog name and properties depend on your configuration;
    # "type": "glue" points PyIceberg at the AWS Glue Data Catalog
    catalog = load_catalog("default", **{"type": "glue"})
    iceberg_table = catalog.load_table(table)
    # Snapshot objects carry timestamp_ms and snapshot_id attributes
    filtered = [
        s for s in iceberg_table.snapshots()
        if start_ms <= s.timestamp_ms <= end_ms
    ]
    if filtered:
        return max(filtered, key=lambda s: s.timestamp_ms).snapshot_id
    return None
```
- Ensure proper Iceberg dependencies: make sure that your AWS Glue job has the necessary Iceberg dependencies. You might need to include the appropriate Iceberg JAR files in your job's extra JAR files.
- Use the Athena JDBC driver: if the above methods don't work, you could consider using the Athena JDBC driver within your Glue job to execute the query directly against Athena, which you've confirmed works correctly.
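On Glue 3.0 or later, the Iceberg integration is typically enabled through job parameters rather than hand-picked JARs. A sketch of the usual settings (the warehouse bucket and the glue_catalog name here are placeholders for your environment):

```
--datalake-formats iceberg
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.glue_catalog.warehouse=s3://your-bucket/warehouse/ --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO
```

With the catalog configured this way, metadata queries such as SELECT * FROM glue_catalog.db.tbl.history should also work through spark.sql.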
Remember to replace the date comparisons with appropriate timestamp comparisons; Iceberg snapshot timestamps are epoch milliseconds, so convert your start and end dates to match your specific date format and requirements.
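For example, a small helper like this (a sketch; adjust the format string to your actual input) converts ISO date strings to the epoch milliseconds that snapshot timestamps use:

```python
from datetime import datetime, timezone

def to_epoch_millis(date_str):
    """Convert an ISO date string (e.g. '2024-01-31') to epoch milliseconds in UTC."""
    dt = datetime.strptime(date_str, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)

start_ms = to_epoch_millis("2024-01-01")
end_ms = to_epoch_millis("2024-01-31")
```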
These approaches should help you access the Iceberg table history in your AWS Glue PySpark job. If you continue to face issues, you may need to consult with AWS support or the Iceberg community for more specific guidance related to your environment and configuration.
Sources
Query Iceberg table data - Amazon Athena
Query Apache Iceberg tables - Amazon Athena
