ArangoDB Not Using Index During Traversal

I have a simple graph traversal query:
FOR e in 0..3 ANY 'Node/5025926' Edge
FILTER
e.ModelType == "A.Model" &&
e.TargetType == "A.Target" &&
e.SourceType == "A.Source"
RETURN e
The 'Edge' edge collection has a hash index defined for attributes ModelType, TargetType, SourceType, in that order.
When checking the execution plan, the results are:
Query string:
FOR e in 0..3 ANY 'Node/5025926' Edge
FILTER
e.ModelType == "A.Model" &&
e.TargetType == "A.Target" &&
e.SourceType == "A.Source"
RETURN e
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
2 TraversalNode 7 - FOR e /* vertex */ IN 0..3 /* min..maxPathDepth */ ANY 'Node/5025926' /* startnode */ Edge
3 CalculationNode 7 - LET #1 = (((e.`ModelType` == "A.Model") && (e.`TargetType` == "A.Target")) && (e.`SourceType` == "A.Source")) /* simple expression */
4 FilterNode 7 - FILTER #1
5 ReturnNode 7 - RETURN e
Indexes used:
none
Traversals on graphs:
Id Depth Vertex collections Edge collections Filter conditions
2 0..3 Edge
Optimization rules applied:
none
Notice that the execution plan indicates that no indices will be used to process the query.
Is there anything I need to do to make the engine use the index on the Edge collection to process the results?
Thanks

In ArangoDB 3.0 a traversal will always use the edge index to find connected vertices, regardless of which filter conditions are present in the query and regardless of which indexes exist.
In ArangoDB 3.1 the optimizer will try to find the best possible index for each level of the traversal. It will inspect the traversal's filter condition and for each level pick the index for which it estimates the lowest cost. If there are no user-defined indexes, it will still use the edge index to find connected vertices. Other indexes will be used if there are filter conditions on edge attributes which are also indexed and the index has a better estimated average selectivity than the edge index.
In 3.1.0 the explain output will always show "Indexes used: none" for traversals, even though a traversal will always use an index. The index display is just missing in the explain output. This has been fixed in ArangoDB 3.1.1, which will show the individual indexes selected by the optimizer for each level of the traversal.
For example, the following query shows the following explain output in 3.1:
Query string:
FOR v, e, p in 0..3 ANY 'v/test0' e
FILTER p.edges[0].type == 1 && p.edges[2].type == 2
RETURN p.vertices
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
2 TraversalNode 8000 - FOR v /* vertex */, p /* paths */ IN 0..3 /* min..maxPathDepth */ ANY 'v/test0' /* startnode */ e
3 CalculationNode 8000 - LET #5 = ((p.`edges`[0].`type` == 1) && (p.`edges`[2].`type` == 2)) /* simple expression */
4 FilterNode 8000 - FILTER #5
5 CalculationNode 8000 - LET #7 = p.`vertices` /* attribute expression */
6 ReturnNode 8000 - RETURN #7
Indexes used:
By Type Collection Unique Sparse Selectivity Fields Ranges
2 edge e false false 10.00 % [ `_from`, `_to` ] base INBOUND
2 edge e false false 10.00 % [ `_from`, `_to` ] base OUTBOUND
2 hash e false false 63.60 % [ `_to`, `type` ] level 0 INBOUND
2 hash e false false 64.40 % [ `_from`, `type` ] level 0 OUTBOUND
2 hash e false false 63.60 % [ `_to`, `type` ] level 2 INBOUND
2 hash e false false 64.40 % [ `_from`, `type` ] level 2 OUTBOUND
Additional indexes are present on [ "_to", "type" ] and [ "_from", "type" ]. Those are used on levels 0 and 2 of the traversal because there are filter conditions for the edges on these levels that can use these indexes. For all other levels, the traversal will use the indexes labeled with "base" in the "Ranges" column.
The explain output fix will become available with 3.1.1, which will be released soon.

Related

Strange behaviour with vertex centric indexes

I have some trouble understanding how to properly use vertex centric indexes in ArangoDB.
In my cooking app, I have the following graph schema : (recipe)-[hasConstituent]->(ingredient)
Let's say I want all the recipes that need less than 0g of carrots. The result will be empty, of course.
FOR recipe, constituent, p IN INBOUND 'ingredients/carrot' hasConstituent
FILTER constituent.quantity.value < 0
RETURN recipe._key
With carrot having 400.000 recipes associated, this query takes ~3.9s. Fine.
Now I create a vertex-centric index on the hasConstituent collection over the _to and quantity.value attributes, with an estimated selectivity of 100%.
I expected it to sort indexes in a numeric order, and then to significantly increase the speed of FILTER or SORT/LIMIT requests, but now the previous request takes ~7.9s... If I make the index "sparse", it takes the same time as without index (~3.9s)
What am I missing here?
The hardest part to understand is that the execution plan given by the explain result is different from the profile result. Here is the explain output, where everything looks fine and the result should be fetched instantly:
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
5 TraversalNode 1 - FOR recipe /* vertex */, constituent /* edge */ IN 1..1 /* min..maxPathDepth */ INBOUND 'ingredients/carrot' /* startnode */ hasConstituent
6 CalculationNode 1 - LET #8 = (constituent.`quantity`.`value` < 0) /* simple expression */
7 FilterNode 1 - FILTER #8
8 CalculationNode 1 - LET #10 = recipe.`_key` /* attribute expression */
9 ReturnNode 1 - RETURN #10
But in the profile :
Execution plan:
Id NodeType Calls Items Runtime [s] Comment
1 SingletonNode 1 1 0.00000 * ROOT
5 TraversalNode 433 432006 7.64893 - FOR recipe /* vertex */, constituent /* edge */ IN 1..1 /* min..maxPathDepth */ INBOUND 'ingredients/carrot' /* startnode */ hasConstituent
6 CalculationNode 433 432006 0.28761 - LET #8 = (constituent.`quantity`.`value` < 0) /* simple expression */
7 FilterNode 1 0 0.08704 - FILTER #8
8 CalculationNode 1 0 0.00000 - LET #10 = recipe.`_key` /* attribute expression */
9 ReturnNode 1 0 0.00001 - RETURN #10
Note that the index is used in both results:
Indexes used:
By Name Type Collection Unique Sparse Selectivity Fields Ranges
5 recipeByIngrQty persistent hasConstituent false false 100.00 % [ `_to`, `quantity.value` ] base INBOUND
Any help is very welcome
For a traversal FOR vertex, edge, path IN ..., filtering on either vertex or edge only applies to the results, but not what's actually visited. As to why that makes sense, keep in mind that generally, not all vertices or edges visited during the traversal are actually part of the result: For example, if min in the IN min..max argument is larger than zero - it's one by default - vertices (and their incoming edges) with distance lower than that are not part of the result, but have to be visited.
That's why, if you want to restrict the edges visited during a traversal, you must add the filter on the path variable instead. For your example:
FOR recipe, constituent, p IN INBOUND 'ingredients/carrot' hasConstituent
FILTER p.edges[*].quantity.value ALL < 0
RETURN recipe._key
That should make use of the index as you expected. See vertex centric indexes and the AQL graph traversal documentation for more details.
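To make the difference concrete, here is a minimal in-memory sketch in Python. It is not ArangoDB code: the edge data and the helper names `post_filter` and `index_range` are made up for illustration. It assumes a sorted list of (_to, quantity.value, _from) tuples behaves roughly like a persistent index on [_to, quantity.value], and compares "visit every edge, then filter" with "use the sort order to touch only matching edges".

```python
from bisect import bisect_left

# Hypothetical edges as (_to, quantity.value, _from), kept in the order a
# sorted index on [_to, quantity.value] would store its entries.
edges = sorted(
    [("ingredients/carrot", float(q), f"recipes/{q}") for q in range(1, 1001)]
    + [("ingredients/leek", -5.0, "recipes/x")]
)

def post_filter(target, limit):
    """Visit every edge pointing at target, then filter the results."""
    visited = [e for e in edges if e[0] == target]
    return len(visited), [e[2] for e in visited if e[1] < limit]

def index_range(target, limit):
    """Use the sort order to touch only edges with value < limit."""
    lo = bisect_left(edges, (target,))        # first edge for this target
    hi = bisect_left(edges, (target, limit))  # first edge with value >= limit
    return hi - lo, [e[2] for e in edges[lo:hi]]

print(post_filter("ingredients/carrot", 0))  # edges touched, matching recipes
print(index_range("ingredients/carrot", 0))
```

Both functions return the same recipes, but the first number in each result (edges touched) differs: the post-filter looks at all 1000 carrot edges, while the range lookup looks at none of them.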
I think this answers the core of your question, now to clear up some of the ones you found on the way.
I expected it to sort indexes in a numeric order, and then to
significantly increase the speed of FILTER or SORT/LIMIT requests, but
now the previous request takes ~7.9s... If I make the index "sparse",
it takes the same time as without index (~3.9s)
Two things here.
First, it sounds like the optimizer preferred your index over the edge index. That probably shouldn't be the case, as (without the change I described above) it's not more specific than the edge index, but just somewhat slower. You haven't specified the version of ArangoDB you're using, so I can't comment specifically. But if you are using the most recent patch release of a supported minor version, e.g. 3.7.10 or 3.6.12 at the time of this writing, you can report this as an issue on GitHub.
Second, a sparse index does not index non-existent or null values. It thus cannot be used for a query that could report null values. Now note that null < 0 is true, see type and value order in the documentation for details. So your query constituent.quantity.value < 0 could report null values, and that's why the sparse index is treated differently (i.e. cannot be used at all).
Now to the final point:
The hardest part to understand is that the execution plan given by the explain result is different from the profile result.
The explain output shows a column "Est.", which is an estimate of the number of rows / iterations this node will emit / do. The column "Items" in the profile output in contrast is the corresponding exact number. Now this estimate can be good in some cases, but bad in others. This is not necessarily a problem, and it cannot be exact without actually executing the query. If it happens to result in a problem, because the estimates get the optimizer to choose the wrong index for the job, you can use index hints. But this isn't the case here.
Apart from that, the two plans you showed seem to be exactly identical.

AQL's PRUNE: How to combine conditions?

I am running ArangoDB 3.4.5 and I've been playing around with the PRUNE statements. I am having some difficulties combining conditions.
Assume some vertices v on my path p have an integer attribute ia and some have a boolean attribute ba. The vertices at even indexes along p, such as p.vertices[2], all have ba.
PRUNE HAS(v, "ia") AND v.ia != 5
works by itself.
PRUNE p.vertices[2].ba == false OR p.vertices[4].ba == false
also works by itself.
I observe, that I cannot combine them in one query, neither by multiple PRUNE statements nor by putting them in one
PRUNE (condition_1) OR (condition_2). Also I cannot put one in a PRUNE and the next in a FILTER statement.
Is anyone else experiencing this or is it just me?
UPDATE:
The FILTER and PRUNE statements did not return the desired results, but the reason was the missing OPTIONS {uniqueEdges: "none"}. Unlike for uniqueVertices, "none" is not the default for uniqueEdges.
I can't reproduce your issue in ArangoDB 3.4.5.
If you create collections edge and vertex and populate these with an example tree:
FOR n in 0..100000
INSERT {_key: TO_STRING(n), val: n, modulo: n%2} INTO vertex
FILTER n > 0
INSERT {_from: CONCAT("vertex/", FLOOR((n-1)/2)), _to: NEW._id} INTO edge
Now I run a traversal:
WITH vertex
FOR v,e,p IN 0..5 OUTBOUND "vertex/0" edge
RETURN TO_STRING(p.vertices[*].val)
Result:
[
"[0]",
"[0,1]",
"[0,1,3]",
"[0,1,3,7]",
"[0,1,3,7,15]",
"[0,1,3,7,15,31]",
"[0,1,3,7,15,32]",
"[0,1,3,7,16]",
"[0,1,3,7,16,33]",
"[0,1,3,7,16,34]",
"[0,1,3,8]",
"[0,1,3,8,17]",
"[0,1,3,8,17,35]",
"[0,1,3,8,17,36]",
"[0,1,3,8,18]",
"[0,1,3,8,18,37]",
"[0,1,3,8,18,38]",
"[0,1,4]",
...
Next, I add "stop": true and "hide": 1 to the vertex with _key 7, and some other combinations to vertices 17 and 18. Now a PRUNE should stop traversing if the condition is met. Be careful: the vertex itself is included in the results.
WITH vertex
FOR v,e,p IN 0..5 OUTBOUND "vertex/0" edge
PRUNE v.hide == 1 AND v.stop == true
RETURN TO_STRING(p.vertices[*].val)
Result:
[
"[0]",
"[0,1]",
"[0,1,3]",
"[0,1,3,7]", <-- stop: true, hide: 1
"[0,1,3,8]",
"[0,1,3,8,17]", <-- stop: true, hide: 1
"[0,1,3,8,18]",
"[0,1,3,8,18,37]",
"[0,1,3,8,18,38]",
...
The PRUNE condition can use AND / OR, but just one PRUNE condition is supported (in contrast to FILTER statements).
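The pruning behaviour shown above can be mimicked with a small in-memory Python sketch. This is not ArangoDB code: `children`, `traverse`, and `cond` are hypothetical helpers, and the attributes given to vertex 18 are an assumption (the answer only says it got "some other combination"). The tree uses the same rule as the INSERT query, where the parent of vertex k is floor((k-1)/2).

```python
def children(n, max_key=100000):
    """Children of n in the example tree (parent of k is floor((k-1)/2))."""
    return [c for c in (2 * n + 1, 2 * n + 2) if c <= max_key]

def traverse(start, max_depth, prune):
    """Depth-first traversal that emits every path, including the one ending
    at a pruned vertex, but does not descend below a pruned vertex."""
    paths = []

    def visit(path):
        paths.append(list(path))
        if len(path) - 1 >= max_depth or prune(path[-1]):
            return
        for c in children(path[-1]):
            visit(path + [c])

    visit([start])
    return paths

# Assumed attribute data: only vertices 7 and 17 satisfy both parts of the
# combined condition; vertex 18 satisfies just one part.
attrs = {
    7: {"stop": True, "hide": 1},
    17: {"stop": True, "hide": 1},
    18: {"stop": True},
}

def cond(v):
    a = attrs.get(v, {})
    return a.get("hide") == 1 and a.get("stop") is True

paths = traverse(0, 5, cond)
for p in paths[:10]:
    print(p)
```

The first paths printed follow the same order as the result listing above: the paths ending at 7 and 17 are emitted, but nothing below them is.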

ArangoDB: Find all shortest paths

I want to get all shortest paths between two vertices.
Example: asking for all shortest paths between nodes A and B should return only the 2 blue paths.
This is what I have so far:
LET source = (FOR x IN Entity FILTER x.objectID == "organization_1"
return x)[0]
LET destination = (FOR x IN Entity FILTER x.objectID == "organization_129"
return x)[0]
FOR node, edge, path IN 1..2 ANY source._id GRAPH "m"
FILTER LAST(path.vertices)._id == destination._id
LIMIT 100
RETURN path
Problems:
1. It is very slow (it took 18 seconds on a graph with about 70 million nodes)
2. It finds every path, but I want only the shortest paths
UPDATE
I tried the 2-step query solution from the comments.
The problem is that the second query is also very slow.
Query string:
FOR source IN Entity FILTER source.objectID == "organization_1"
LIMIT 1
FOR node, edge, path
IN 1..@depth ANY source._id
GRAPH "m"
OPTIONS {uniqueVertices: "path"}
FILTER node.objectID == "organization_129"
RETURN path
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
11 IndexNode 1 - FOR source IN Entity /* hash index scan */
5 LimitNode 1 - LIMIT 0, 1
6 CalculationNode 1 - LET #6 = source.`_id` /* attribute expression */ /* collections used: source : Entity */
7 TraversalNode 346 - FOR node /* vertex */, path /* paths */ IN 1..2 /* min..maxPathDepth */ ANY #6 /* startnode */ GRAPH 'm'
8 CalculationNode 346 - LET #10 = (node.`objectID` == "organization_129") /* simple expression */
9 FilterNode 346 - FILTER #10
10 ReturnNode 346 - RETURN path
Indexes used:
By Type Collection Unique Sparse Selectivity Fields Ranges
11 hash Entity false false 100.00 % [ `objectID` ] (source.`objectID` == "organization_1")
7 edge ACTIVITYPARTY false false 100.00 % [ `_from`, `_to` ] base INBOUND
7 edge ACTIVITYPARTY false false 100.00 % [ `_from`, `_to` ] base OUTBOUND
7 edge ACTIVITY_LINK false false 100.00 % [ `_from`, `_to` ] base INBOUND
7 edge ACTIVITY_LINK false false 100.00 % [ `_from`, `_to` ] base OUTBOUND
7 edge ENTITY_LINK false false 70.38 % [ `_from`, `_to` ] base INBOUND
7 edge ENTITY_LINK false false 70.38 % [ `_from`, `_to` ] base OUTBOUND
7 edge RELATION false false 20.49 % [ `_from`, `_to` ] base INBOUND
7 edge RELATION false false 20.49 % [ `_from`, `_to` ] base OUTBOUND
7 edge SOFT_LINK false false 100.00 % [ `_from`, `_to` ] base INBOUND
7 edge SOFT_LINK false false 100.00 % [ `_from`, `_to` ] base OUTBOUND
Traversals on graphs:
Id Depth Vertex collections Edge collections Options Filter conditions
7 1..2 Activity, Entity, SOFT_LINK, Property ACTIVITYPARTY, ENTITY_LINK, SOFT_LINK, RELATION, ACTIVITY_LINK uniqueVertices: path, uniqueEdges: path
Optimization rules applied:
Id RuleName
1 move-calculations-up
2 move-filters-up
3 move-calculations-up-2
4 move-filters-up-2
5 use-indexes
6 remove-filter-covered-by-index
7 remove-unnecessary-calculations-2
8 optimize-traversals
9 move-calculations-down
First of all, you need a hash index on the field objectID in the collection Entity to avoid full collection scans, which heavily slow down your query.
To get all shortest paths, I would first search for one shortest path with the AQL SHORTEST_PATH and return its length (the number of visited vertices minus one). There is also no need for subqueries (like in your query).
FOR source IN Entity FILTER source.objectID == "organization_1"
LIMIT 1
FOR destination IN Entity FILTER destination.objectID == "organization_129"
LIMIT 1
RETURN sum(
FOR v, e
IN ANY
SHORTEST_PATH source._id TO destination._id
GRAPH "m"
RETURN 1)-1
After that I would execute another query with the result from the first query as bind parameter @depth, which is used to limit the depth of the traversal.
FOR source IN Entity FILTER source.objectID == "organization_1"
LIMIT 1
FOR node, edge, path
IN 1..@depth ANY source._id
GRAPH "m"
OPTIONS {uniqueVertices: "path"}
FILTER node.objectID == "organization_129"
RETURN path
Note: To filter the last vertex in the path you don't have to use LAST(path.vertices), you can simply use node because it is already the last vertex (the same applies for edge).
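The two-step idea can be sketched outside the database as well. The following Python snippet is only an in-memory illustration, not ArangoDB code: the adjacency map `graph` and the function names are hypothetical. Step 1 finds the length of one shortest path with a breadth-first search. Step 2 enumerates all paths up to that depth that end at the target, never repeating a vertex within a path (similar in spirit to uniqueVertices: "path").

```python
from collections import deque

def shortest_distance(graph, source, target):
    """Step 1: BFS for the length (in edges) of one shortest path."""
    seen, queue = {source}, deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # target not reachable

def paths_up_to_depth(graph, source, target, depth):
    """Step 2: all paths with at most `depth` edges that end at target,
    without repeating a vertex inside a single path."""
    found = []

    def walk(path):
        if path[-1] == target and len(path) > 1:
            found.append(path)
        if len(path) - 1 < depth:
            for nxt in graph.get(path[-1], ()):
                if nxt not in path:
                    walk(path + [nxt])

    walk([source])
    return found

# Hypothetical undirected graph: A-C-B and A-D-B are shortest, A-E-F-B is longer.
graph = {
    "A": ["C", "D", "E"],
    "B": ["C", "D", "F"],
    "C": ["A", "B"],
    "D": ["A", "B"],
    "E": ["A", "F"],
    "F": ["E", "B"],
}

depth = shortest_distance(graph, "A", "B")
all_shortest = paths_up_to_depth(graph, "A", "B", depth)
print(depth, all_shortest)
```

Because the depth limit equals the shortest distance, every path that reaches the target within the limit is itself a shortest path, so no extra length check is needed.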

Is my recurrence relation right for subset sum?

Is this recurrence relation correct for the subset sum problem?
Statement: Print Yes or No depending on whether there is a subset of the given array a[ ] which sums up to a given number n.
dp[i][j] = true, if 0 to j elements in array sum up to i and false otherwise.
dp[i][j] = min(dp[i-a[j]][j], dp[i][j-1])
Base case values :
dp[0][0] = true
dp[1...i][0] = false
Just trying to see if I have the recurrence relation right or not. Thanks for guiding.
You are almost correct (not sure why you used min). But let dp[i][j] store the answer to whether a subset of arr[0], arr[1], ..., arr[j] (here arr[] is the array of elements) can sum up to i.
That is, dp[i][j] is 1 if the answer is yes and 0 if the answer is no. Ignoring the base cases, the recurrence relation is dp[i][j] = (dp[i][j-1] | dp[i-arr[j]][j-1]), where the second term applies only when i >= arr[j]. For the exact code, base cases, and implementation, you can have a look here: http://www.geeksforgeeks.org/dynamic-programming-subset-sum-problem/.
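For reference, here is a minimal Python sketch of that recurrence. It is my own illustration rather than the code from the linked page, and it assumes non-negative integers. Note that the column index is shifted by one compared to the formula above: column 0 stands for the empty prefix, so dp[i][j] refers to the first j elements and the j-th element is arr[j-1].

```python
def subset_sum(arr, n):
    """Return True if some subset of arr (non-negative ints) sums to n."""
    m = len(arr)
    # dp[i][j]: can a subset of the first j elements sum to exactly i?
    dp = [[False] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        dp[0][j] = True  # base case: the empty subset sums to 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Either skip the j-th element ...
            dp[i][j] = dp[i][j - 1]
            # ... or take it, if it does not exceed the remaining sum.
            if arr[j - 1] <= i:
                dp[i][j] = dp[i][j] or dp[i - arr[j - 1]][j - 1]
    return dp[n][m]

print("Yes" if subset_sum([3, 34, 4, 12, 5, 2], 9) else "No")
```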

Find the cross node for number of nodes in ArangoDB?

I have a number of nodes connected through an intermediate node of another type, as in the picture. There can be multiple middle nodes. I need to find all the middle nodes for a given set of nodes and sort them by the number of links to my initial nodes. In my example, given A, B, C, D, it should return node E (4 links) followed by node F (3 links). Is this possible? If not, maybe it can be done using multiple requests? I was thinking about using the SHORTEST_PATH function, but it seems it can only find paths between nodes from the same collection?
Very nice question, it challenged the AQL part of my brain ;)
Good news: it is totally possible with only one query, utilizing GRAPH_COMMON_NEIGHBORS and a bit of math.
Common neighbors will count for how many pairs of your selected vertices a cross is the connecting component. Ordering is taken into account (A-E-B is different from B-E-A), so by combinatorics a cross connected to a of your vertices is counted a*(a-1) = c times, where c is the computed count. We use the p/q formula to solve for a (the number of connected vertices from your set), which gives a = 0.5 + SQRT(0.25 + c).
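As a quick sanity check of that bit of math, here is a small Python sketch (the function name `connected_count` is made up for illustration). It inverts c = a*(a-1) by taking the positive root of a^2 - a - c = 0.

```python
import math

def connected_count(c):
    """Solve c = a * (a - 1) for the positive root a."""
    return 0.5 + math.sqrt(0.25 + c)

# A cross linked to 4 of the given vertices appears in 4 * 3 = 12 ordered pairs,
# one linked to 3 of them appears in 3 * 2 = 6 ordered pairs.
print(connected_count(12), connected_count(6))
```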
If the type of vertex is encoded in an attribute of the vertex object
the resulting AQL looks like this:
FOR x in (
(
let nodes = ["nodes/A","nodes/B","nodes/C","nodes/D"]
for n in GRAPH_COMMON_NEIGHBORS("myGraph",nodes , nodes)
for f in VALUES(n)
for s in VALUES(f)
for candidate in s
filter candidate.type == "cross"
collect crosses = candidate._key into counter
return {crosses: crosses, connections: 0.5 + SQRT(0.25 + LENGTH(counter))}
)
)
sort x.connections DESC
return x
If you put the crosses in a different collection and filter by collection name, the query will be even more efficient, because we do not need to open any vertices that are not of type cross at all.
FOR x in (
(
let nodes = ["nodes/A","nodes/B","nodes/C","nodes/D"]
for n in GRAPH_COMMON_NEIGHBORS("myGraph",nodes, nodes,
{"vertexCollectionRestriction": "crosses"}, {"vertexCollectionRestriction": "crosses"})
for f in VALUES(n)
for s in VALUES(f)
for candidate in s
collect crosses = candidate._key into counter
return {crosses: crosses, connections: 0.5 + SQRT(0.25 + LENGTH(counter))}
)
)
sort x.connections DESC
return x
Both queries will yield the result on your dataset:
[
{
"crosses": "E",
"connections": 4
},
{
"crosses": "F",
"connections": 3
}
]

Resources