Traversing through complete path (Graph DB, Gremlin, Neptune Cluster)

Question

Hi Team,
I am working on a complex graph data that has central node and multiple nodes connected to it and further connected to multiple nodes and so on.
I would like to traverse through a complete path from central node to end.
 
I am using the query:
%%gremlin
g.V('12345').repeat(bothE().otherV()).simplePath().path()
 
and it throws below error:
{ 
     "code": "MemoryLimitExceededException",
     "requestId": "c83563f5-821b-45a3-8503-694617e4bec7",
     "detailedMessage": "Query cannot be completed due to memory limitations."
 }

The problem I have got is, I don’t know how many times to repeat the traversal. From manual observation, the query works when repeated 6 times (i.e, when I mention times(6)) for this specific ID.
I can’t specify any ‘until’ condition as I don’t know where the starting node traverses through.
I have got millions of IDs and I want to traverse the path through each of the IDs (see where the path ends) and work on centrality calculations of the nodes associated within the graphs of individual IDs.
Is there a way, I can implement this?

Any help is greatly appreciated.

Thanks.

Answer

It is likely going to come down to how big of a fan-out you see at each hop in the graph.  Neptune performs queries, by default, using a BFS pattern.  So when doing shortest-path traversals, it comes down to traversing all vertices at each level in the graph to return the shortest path.  When using a `limit(x)` step, the query will stop once it has found the first x results.  At present, Gremlin queries are natively run on a single execution thread.   There are a few things that you may want to try....

1. Each Neptune instance has a finite number of execution threads equal to 2 x the number of vCPUs on the instance.  About 2/3rds of the instance memory is reserved for buffer pool cache.  The remaining memory is allocated (some what) evenly across the worker threads.  Smaller instances (such as a t3/t4g.medium) will have very little memory per execution thread.   Memory per thread will increase slightly until you get to the 4xlarge sized instances.   The 4xlarge or larger instances will have the largest amount of thread memory allocated.  So increasing the instance up to a 4xlarge can improve OOM related queries.

2. Neptune has two different query execution engines - a native engine and a newer engine called the DFE engine.  Starting in version 1.1.1.0, DFE is enabled by default but it only used by default for openCypher or SPARQL queries.  To use this with Gremlin queries, you'll need to leverage a query hint [1].  DFE can improve certain parts of a query, so it is worth a try.

[1] https://docs.aws.amazon.com/neptune/latest/userguide/neptune-dfe-engine.html

If you're trying to get a sense for how many objects a query is having to access to perform a query, you can use the Neptune Gremlin Profile API (not the TinkerPop `profile()` step, this is different):  https://docs.aws.amazon.com/neptune/latest/userguide/gremlin-profile-api.html

Answer

I have simillar problem. My graph is very small, only 6 000 nodes and 30 000 edges, but when I try just to find one shortest path between two nodes I end with MemoryLimitExceededException.

If I run this query, that shoud find shortes path between node lru99309 and lru306193:
g.V('lru99309').repeat(outE().inV().simplePath()).until(hasId('lru306193')).path().limit(1).group().by(count(local)).next()

I got the response:
==>15=[path[v[lru99309], e[lru99309_lru168327][lru99309-dansecon->lru168327], v[lru168327], e[lru168327_lru140198][lru168327-dansecon->lru140198], v[lru140198], e[lru140198_lru170399][lru140198-dansecon->lru170399], v[lru170399], e[lru170399_lru69973][lru170399-dansecon->lru69973], v[lru69973], e[lru69973_lru316679][lru69973-dansecon->lru316679], v[lru316679], e[lru316679_lru316652][lru316679-dansecon->lru316652], v[lru316652], e[lru316652_lru306193][lru316652-dansecon->lru306193], v[lru306193]]]

But if I try only node that is next to lru306193, but a little far away from starting node I obtain MemoryLimitExceededException.

I do not understand if I am doing something wrong or if the Neptune is not possible to use for searching for shortest path.

Normally I use OSRM engine to search for shortest paths in the road network, but I was interested if I can use Neptune to solve basic road network questions.

My graph is:
neptunegraphtraversalsource[neptunegraph[], standard]

Maybe I need some other type of graph, such as graphtraversalsource[tinkergraph[], graphcomputer], but I think that Neptune does not support graphcomputer version of the graph.

Answer

Moving to a larger cluster instance really helps for these kind of issues.

Slightly modifying the query to exclude duplicates and moving to a larger cluster instance resolved memory limitation error.

Modified query:
g.V('12345').as('sourceID').repeat(both().as('destID').simplePath().dedup('sourceID', 'destID')).emit().path()

This query tries to find the shortest path between source node and destination nodes and outputs path excluding duplicates.

Thanks.

Traversing through complete path (Graph DB, Gremlin, Neptune Cluster)

Relevant content