How to handle large result set pagination with Neptune-Athena connector without query timeout?


Hi AWS Community,

We're using the Neptune-Athena connector to query Neptune data through SQL and are running into query timeout issues when the dataset becomes large — even when fetching only one level of relationships.

Here's a simplified version of our SQL query:

    SELECT *
    FROM "product"
      JOIN "in" ON "in"."in" = "product"."id"
      JOIN "account" ON "account"."id" = "in"."out";

This works fine for small graphs, but times out when more data is returned.

We also tried using Gremlin pass-through queries via system.query:

    SELECT *
    FROM TABLE(
      system.query(
        DATABASE => 'neptune-db-identifier',
        COLLECTION => 'product',
        QUERY => 'g.V().hasLabel("account").out("in").hasLabel("product").valueMap(true)'
      )
    )

Still, the query times out when the dataset is large.

We understand we could paginate the data, but that would require custom logic to:

  • Track pagination tokens or cursors
  • Make repeated Athena calls until all data is fetched

This adds significant orchestration overhead.

Our questions:

  1. Is there a built-in way to enable automatic pagination or streaming of large result sets via the Neptune-Athena connector?
  2. Are there best practices for querying and retrieving large graph datasets efficiently through Athena?
  3. Would AWS Glue or exporting to S3 be a better alternative for this use case?

Any advice, recommendations, or links to relevant documentation would be greatly appreciated.

Thanks in advance!

asked 9 months ago · 215 views
1 Answer

Based on the available information, there isn't a built-in automatic pagination mechanism specifically for the Neptune-Athena connector when dealing with large result sets.

For comparison, DynamoDB's Query API paginates large result sets (those exceeding the 1 MB response limit) using the ExclusiveStartKey and LastEvaluatedKey elements. That mechanism is specific to DynamoDB, however, and is not available through the Neptune-Athena connector.

For your Neptune-Athena connector use case, here are some recommendations:

  1. Query Optimization:
  • Use more specific predicates in your queries to reduce the result set size
  • Add filters to narrow down the data being returned
  2. Alternative Approaches:
  • For large datasets, exporting to S3 avoids query timeout constraints and lets you work with the complete dataset
  • AWS Glue can transform and process the data in batches
  3. Gremlin Query Optimization:
  • When using pass-through queries, add limits or ranges to bound the size of each result set
  • Break complex queries down into smaller, more manageable parts
  4. Custom Pagination Implementation:
  • While it adds orchestration overhead, custom pagination logic may be necessary for very large datasets
  • Track the last processed node (or offset) at the application level and continue from there in subsequent queries
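As an illustration of the Gremlin-side suggestion above, a pass-through query can window the traversal with Gremlin's range() step. This is only a sketch: the DATABASE and COLLECTION values mirror the placeholders from the question, and the range bounds (0, 1000) are an arbitrary page size you would shift on each subsequent call (e.g. range(1000, 2000)):

```sql
SELECT *
FROM TABLE(
  system.query(
    DATABASE => 'neptune-db-identifier',
    COLLECTION => 'product',
    QUERY => 'g.V().hasLabel("account").out("in").hasLabel("product").range(0, 1000).valueMap(true)'
  )
)
```

Each call returns one bounded page, so no single Athena query has to materialize the full result set; your orchestration layer advances the window until a page comes back short.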

The Neptune connector for Athena does support pass-through queries for both Gremlin and SPARQL, which gives you flexibility in how you structure your queries, but managing large result sets will likely require some combination of query optimization and custom pagination logic.
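The application-level pagination loop described above can be sketched in Python. This is a minimal, hedged illustration: `run_page_query` is a hypothetical stand-in for whatever issues the windowed Athena/Gremlin query (here it is stubbed with in-memory data so the loop itself is demonstrable); a real implementation would call the Athena client in its place.

```python
PAGE_SIZE = 2

def run_page_query(offset, limit):
    # Stub: pretend a pass-through query like
    #   g.V().hasLabel("product").range(offset, offset + limit).valueMap(true)
    # returned these rows. A real implementation would call Athena here.
    all_rows = [{"id": i} for i in range(5)]
    return all_rows[offset:offset + limit]

def fetch_all(page_size=PAGE_SIZE):
    """Accumulate every page by advancing an offset window."""
    results, offset = [], 0
    while True:
        page = run_page_query(offset, page_size)
        results.extend(page)
        if len(page) < page_size:  # a short page means we hit the end
            break
        offset += page_size
    return results

print(len(fetch_all()))  # prints 5
```

The stop condition (a page shorter than the requested size) avoids needing a separate count query, at the cost of one possibly-empty final fetch when the total is an exact multiple of the page size.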
Sources
Community | 20 Days of DynamoDB
Paginating table query results in DynamoDB - Amazon DynamoDB
Amazon Athena Neptune connector - Amazon Athena

answered 9 months ago
