I have some trouble understanding how to properly use vertex-centric indexes in ArangoDB.
In my cooking app, I have the following graph schema: (recipe)-[hasConstituent]->(ingredient)
Let's say I want all the recipes that need less than 0g of carrots. The result will be empty, of course.
FOR recipe, constituent, p IN INBOUND 'ingredients/carrot' hasConstituent
FILTER constituent.quantity.value < 0
RETURN recipe._key
With carrot having 400,000 recipes associated, this query takes ~3.9s. Fine.
Now I create a vertex-centric index on the hasConstituent collection over the _to and quantity.value properties, with an estimated selectivity of 100%.
I expected it to sort indexes in a numeric order, and then to significantly increase the speed of FILTER or SORT/LIMIT requests, but now the previous request takes ~7.9s... If I make the index "sparse", it takes the same time as without index (~3.9s)
What am I missing here?
The hardest part to understand is that the execution plan given by the explain result is different from the profile result. Here is the explain output, where everything looks fine and the result should be fetched instantly:
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
5 TraversalNode 1 - FOR recipe /* vertex */, constituent /* edge */ IN 1..1 /* min..maxPathDepth */ INBOUND 'ingredients/carrot' /* startnode */ hasConstituent
6 CalculationNode 1 - LET #8 = (constituent.`quantity`.`value` < 0) /* simple expression */
7 FilterNode 1 - FILTER #8
8 CalculationNode 1 - LET #10 = recipe.`_key` /* attribute expression */
9 ReturnNode 1 - RETURN #10
But in the profile :
Execution plan:
Id NodeType Calls Items Runtime [s] Comment
1 SingletonNode 1 1 0.00000 * ROOT
5 TraversalNode 433 432006 7.64893 - FOR recipe /* vertex */, constituent /* edge */ IN 1..1 /* min..maxPathDepth */ INBOUND 'ingredients/carrot' /* startnode */ hasConstituent
6 CalculationNode 433 432006 0.28761 - LET #8 = (constituent.`quantity`.`value` < 0) /* simple expression */
7 FilterNode 1 0 0.08704 - FILTER #8
8 CalculationNode 1 0 0.00000 - LET #10 = recipe.`_key` /* attribute expression */
9 ReturnNode 1 0 0.00001 - RETURN #10
To be precise, the index is used in both results:
Indexes used:
By Name Type Collection Unique Sparse Selectivity Fields Ranges
5 recipeByIngrQty persistent hasConstituent false false 100.00 % [ `_to`, `quantity.value` ] base INBOUND
Any help is very welcome
For a traversal FOR vertex, edge, path IN ..., filtering on either vertex or edge only applies to the results, but not what's actually visited. As to why that makes sense, keep in mind that generally, not all vertices or edges visited during the traversal are actually part of the result: For example, if min in the IN min..max argument is larger than zero - it's one by default - vertices (and their incoming edges) with distance lower than that are not part of the result, but have to be visited.
That's why, if you want to restrict the edges visited during a traversal, you must add the filter on the path variable instead. For your example:
FOR recipe, constituent, p IN INBOUND 'ingredients/carrot' hasConstituent
FILTER p.edges[*].quantity.value ALL < 0
RETURN recipe._key
That should make use of the index as you expected. See vertex centric indexes and the AQL graph traversal documentation for more details.
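To illustrate why a combined index on (_to, quantity.value) can help once the condition is applied while edges are being looked up, here is a minimal Python sketch. It is only a conceptual model of a sorted composite index, not how ArangoDB implements it; the edge data and the helper name edges_below are made up for the example.

```python
import bisect

# Hypothetical edge data: (_to, quantity.value, _from), kept sorted like a
# composite index on [_to, quantity.value].
edges = sorted([
    ("ingredients/carrot", 50, "recipes/soup"),
    ("ingredients/carrot", 200, "recipes/cake"),
    ("ingredients/carrot", 400, "recipes/stew"),
    ("ingredients/onion", 30, "recipes/soup"),
])

def edges_below(index, to, limit):
    """Return entries with the given _to and quantity.value < limit.

    Two binary searches locate the range, so entries outside it are
    never touched.
    """
    lo = bisect.bisect_left(index, (to,))        # first entry for this _to
    hi = bisect.bisect_left(index, (to, limit))  # first entry with value >= limit
    return index[lo:hi]

print(edges_below(edges, "ingredients/carrot", 0))    # nothing is below 0
print(edges_below(edges, "ingredients/carrot", 100))  # only the 50g edge
```

With the condition on the path variable, the lookup can be narrowed to such a range instead of fetching every carrot edge and discarding it afterwards.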
I think this answers the core of your question, now to clear up some of the ones you found on the way.
I expected it to sort indexes in a numeric order, and then to
significantly increase the speed of FILTER or SORT/LIMIT requests, but
now the previous request takes ~7.9s... If I make the index "sparse",
it takes the same time as without index (~3.9s)
Two things here.
First, it sounds like the optimizer preferred your index over the edge index. That probably shouldn't be the case, as (without the change I described above) it's not more specific than the edge index, just somewhat slower. You haven't specified the version of ArangoDB you're using, so I can't comment specifically. But if you are using the most recent patch release of a supported minor version, e.g. 3.7.10 or 3.6.12 at the time of this writing, you can report this as an issue on GitHub.
Second, a sparse index does not index non-existent or null values. It thus cannot be used for a query that could report null values. Now note that null < 0 is true, see type and value order in the documentation for details. So your query constituent.quantity.value < 0 could report null values, and that's why the sparse index is treated differently (i.e. cannot be used at all).
Now to the final point:
The hardest part to understand is that the execution plan given by the explain result is different from the profile result.
The explain output shows a column "Est.", which is an estimate of the number of rows / iterations this node will emit / do. The column "Items" in the profile output in contrast is the corresponding exact number. Now this estimate can be good in some cases, but bad in others. This is not necessarily a problem, and it cannot be exact without actually executing the query. If it happens to result in a problem, because the estimates get the optimizer to choose the wrong index for the job, you can use index hints. But this isn't the case here.
Apart from that, the two plans you showed seem to be exactly identical.
I am running ArangoDB 3.4.5 and I've been playing around with PRUNE statements. I am having some difficulties combining conditions.
Assume some vertices v on my path p have an integer attribute ia and some have a boolean attribute ba. The vertices at even indexes along p, such as p.vertices[2], all have ba.
PRUNE HAS(v, "ia") AND v.ia != 5
works by itself.
PRUNE p.vertices[2].ba == false OR p.vertices[4].ba == false
also works by itself.
I observe that I cannot combine them in one query, neither by using multiple PRUNE statements nor by putting them in a single
PRUNE (condition_1) OR (condition_2). I also cannot put one in a PRUNE and the other in a FILTER statement.
Is anyone else experiencing this or is it just me?
UPDATE:
The FILTER and PRUNE statements did not return the desired results; the reason, however, was the missing OPTIONS {uniqueEdges: "none"}. As opposed to uniqueVertices, "none" is not the default for uniqueEdges.
I can't reproduce your issue in ArangoDB 3.4.5.
First, I create the collections edge and vertex and populate them with an example tree:
FOR n in 0..100000
INSERT {_key: TO_STRING(n), val: n, modulo: n%2} INTO vertex
FILTER n > 0
INSERT {_from: CONCAT("vertex/", FLOOR((n-1)/2)), _to: NEW._id} INTO edge
Now I run a traversal:
WITH vertex
FOR v,e,p IN 0..5 OUTBOUND "vertex/0" edge
RETURN TO_STRING(p.vertices[*].val)
Result:
[
"[0]",
"[0,1]",
"[0,1,3]",
"[0,1,3,7]",
"[0,1,3,7,15]",
"[0,1,3,7,15,31]",
"[0,1,3,7,15,32]",
"[0,1,3,7,16]",
"[0,1,3,7,16,33]",
"[0,1,3,7,16,34]",
"[0,1,3,8]",
"[0,1,3,8,17]",
"[0,1,3,8,17,35]",
"[0,1,3,8,17,36]",
"[0,1,3,8,18]",
"[0,1,3,8,18,37]",
"[0,1,3,8,18,38]",
"[0,1,4]",
...
Next, I add "stop": true and "hide": 1 to the vertex with _key: 7, and some other combinations to vertices 17 and 18. Now a PRUNE should stop traversing if the condition is met. Be careful: the vertex itself is still included in the results.
WITH vertex
FOR v,e,p IN 0..5 OUTBOUND "vertex/0" edge
PRUNE v.hide == 1 AND v.stop == true
RETURN TO_STRING(p.vertices[*].val)
Result:
[
"[0]",
"[0,1]",
"[0,1,3]",
"[0,1,3,7]", <-- stop: true, hide: 1
"[0,1,3,8]",
"[0,1,3,8,17]", <-- stop: true, hide: 1
"[0,1,3,8,18]",
"[0,1,3,8,18,37]",
"[0,1,3,8,18,38]",
...
The PRUNE condition can use AND / OR, but only one PRUNE statement is supported per traversal (in contrast to FILTER).
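The pruning behaviour shown above can be mimicked with a small Python sketch. This is only a model under stated assumptions, not ArangoDB's implementation: the tree is the same one built by the insert query (the children of n are 2n+1 and 2n+2), the set flagged stands in for the vertices where hide == 1 AND stop == true, and the function name traverse is made up.

```python
MAX_KEY = 100000  # same upper bound as the insert query above

def traverse(start, max_depth, prune):
    """Depth-first traversal that emits every visited path.

    A vertex matching `prune` is still emitted, but its children are
    not visited.
    """
    paths = []

    def visit(path):
        paths.append(path)
        if len(path) - 1 == max_depth or prune(path[-1]):
            return
        v = path[-1]
        for child in (2 * v + 1, 2 * v + 2):
            if child <= MAX_KEY:
                visit(path + [child])

    visit([start])
    return paths

flagged = {7, 17}  # assumed: vertices with stop == true and hide == 1
paths = traverse(0, 5, lambda v: v in flagged)
for p in paths[:9]:
    print(p)
```

The first nine paths match the listing above: [0,1,3,7] and [0,1,3,8,17] appear, but nothing below them does.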
I want to get all shortest paths between two vertices.
Example: "Give me all shortest paths between node A and B" should only return the 2 blue paths.
This is what I have got so far:
LET source = (FOR x IN Entity FILTER x.objectID == "organization_1"
return x)[0]
LET destination = (FOR x IN Entity FILTER x.objectID == "organization_129"
return x)[0]
FOR node, edge, path IN 1..2 ANY source._id GRAPH "m"
FILTER LAST(path.vertices)._id == destination._id
LIMIT 100
RETURN path
Problems:
1. It is very slow (it took 18 seconds on a graph with about 70 million nodes).
2. It finds every path, but I only want all the shortest paths.
UPDATE
I tried the 2-step query solution from the comments.
The problem is that the second query is also very slow.
Query string:
FOR source IN Entity FILTER source.objectID == "organization_1"
LIMIT 1
FOR node, edge, path
IN 1..@depth ANY source._id
GRAPH "m"
OPTIONS {uniqueVertices: "path"}
FILTER node.objectID == "organization_129"
RETURN path
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
11 IndexNode 1 - FOR source IN Entity /* hash index scan */
5 LimitNode 1 - LIMIT 0, 1
6 CalculationNode 1 - LET #6 = source.`_id` /* attribute expression */ /* collections used: source : Entity */
7 TraversalNode 346 - FOR node /* vertex */, path /* paths */ IN 1..2 /* min..maxPathDepth */ ANY #6 /* startnode */ GRAPH 'm'
8 CalculationNode 346 - LET #10 = (node.`objectID` == "organization_129") /* simple expression */
9 FilterNode 346 - FILTER #10
10 ReturnNode 346 - RETURN path
Indexes used:
By Type Collection Unique Sparse Selectivity Fields Ranges
11 hash Entity false false 100.00 % [ `objectID` ] (source.`objectID` == "organization_1")
7 edge ACTIVITYPARTY false false 100.00 % [ `_from`, `_to` ] base INBOUND
7 edge ACTIVITYPARTY false false 100.00 % [ `_from`, `_to` ] base OUTBOUND
7 edge ACTIVITY_LINK false false 100.00 % [ `_from`, `_to` ] base INBOUND
7 edge ACTIVITY_LINK false false 100.00 % [ `_from`, `_to` ] base OUTBOUND
7 edge ENTITY_LINK false false 70.38 % [ `_from`, `_to` ] base INBOUND
7 edge ENTITY_LINK false false 70.38 % [ `_from`, `_to` ] base OUTBOUND
7 edge RELATION false false 20.49 % [ `_from`, `_to` ] base INBOUND
7 edge RELATION false false 20.49 % [ `_from`, `_to` ] base OUTBOUND
7 edge SOFT_LINK false false 100.00 % [ `_from`, `_to` ] base INBOUND
7 edge SOFT_LINK false false 100.00 % [ `_from`, `_to` ] base OUTBOUND
Traversals on graphs:
Id Depth Vertex collections Edge collections Options Filter conditions
7 1..2 Activity, Entity, SOFT_LINK, Property ACTIVITYPARTY, ENTITY_LINK, SOFT_LINK, RELATION, ACTIVITY_LINK uniqueVertices: path, uniqueEdges: path
Optimization rules applied:
Id RuleName
1 move-calculations-up
2 move-filters-up
3 move-calculations-up-2
4 move-filters-up-2
5 use-indexes
6 remove-filter-covered-by-index
7 remove-unnecessary-calculations-2
8 optimize-traversals
9 move-calculations-down
First of all, you need a hash index on the field objectID in the collection Entity to avoid full collection scans, which heavily slow down your query.
To get all shortest paths, I would first search for one shortest path with the AQL SHORTEST_PATH and return its length, i.e. the number of visited vertices minus one. There is also no need for subqueries (like in your query).
FOR source IN Entity FILTER source.objectID == "organization_1"
LIMIT 1
FOR destination IN Entity FILTER destination.objectID == "organization_129"
LIMIT 1
RETURN sum(
FOR v, e
IN ANY
SHORTEST_PATH source._id TO destination._id
GRAPH "m"
RETURN 1)-1
After that, I would execute another query with the result from the first query as bind parameter @depth, which is used to limit the depth of the traversal.
FOR source IN Entity FILTER source.objectID == "organization_1"
LIMIT 1
FOR node, edge, path
IN 1..@depth ANY source._id
GRAPH "m"
OPTIONS {uniqueVertices: "path"}
FILTER node.objectID == "organization_129"
RETURN path
Note: To filter the last vertex in the path you don't have to use LAST(path.vertices), you can simply use node because it is already the last vertex (the same applies for edge).
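The two-step idea (find the length of one shortest path, then enumerate all paths up to that depth that end at the destination) can be sketched in plain Python. This is a minimal illustration on an in-memory graph, not what ArangoDB executes; the sample graph and both function names are made up for the example.

```python
from collections import deque

def shortest_distance(graph, source, target):
    """Step 1: breadth-first search; returns the edge count of a shortest path."""
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # target not reachable

def paths_up_to_depth(graph, source, target, depth):
    """Step 2: all paths with 1..depth edges ending at target.

    No vertex is repeated within a path (like uniqueVertices: "path").
    """
    results = []

    def walk(path):
        if len(path) > 1 and path[-1] == target:
            results.append(path)
        if len(path) - 1 == depth:
            return
        for nxt in graph.get(path[-1], ()):
            if nxt not in path:
                walk(path + [nxt])

    walk([source])
    return results

# Hypothetical undirected graph, stored as adjacency lists in both directions.
graph = {
    "A": ["C", "D", "E"], "B": ["C", "D", "F"],
    "C": ["A", "B"], "D": ["A", "B"],
    "E": ["A", "F"], "F": ["E", "B"],
}
depth = shortest_distance(graph, "A", "B")
print(depth, paths_up_to_depth(graph, "A", "B", depth))
```

Because the depth limit equals the shortest distance, every path returned by the second step is a shortest path; the longer route A-E-F-B is never reported.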
Is this recurrence relation correct for the subset sum problem?
Statement: Print Yes or No depending on whether there is a subset of the given array a[] which sums up to a given number n.
dp[i][j] = true if some subset of elements 0 to j of the array sums up to i, and false otherwise.
dp[i][j] = min(dp[i-a[j]][j], dp[i][j-1])
Base case values :
dp[0][0] = true
dp[1...i][0] = false
Just trying to see if I have the recurrence relation right or not. Thanks for guiding.
You are almost correct (not sure why you used min). Let dp[i][j] store the answer to whether a subset of arr[0], arr[1], ..., arr[j] (where arr[] is the array of elements) can sum up to i.
That is, dp[i][j] is 1 if the answer is yes and 0 if the answer is no. Ignoring the base cases, the recurrence relation is dp[i][j] = (dp[i][j-1] | dp[i-arr[j]][j-1]), where the second term only applies when i >= arr[j]. For the exact code, base cases and implementation, have a look here: http://www.geeksforgeeks.org/dynamic-programming-subset-sum-problem/.
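For reference, here is a short Python sketch of that recurrence including base cases. It assumes non-negative integers and, to keep the base case simple, shifts the second index by one so that dp[i][j] refers to the first j elements (j = 0 meaning no elements); the function name subset_sum is just for illustration.

```python
def subset_sum(arr, target):
    """Return True if some subset of arr (non-negative ints) sums to target."""
    n = len(arr)
    # dp[i][j]: can a subset of the first j elements sum to i?
    dp = [[False] * (n + 1) for _ in range(target + 1)]
    for j in range(n + 1):
        dp[0][j] = True  # the empty subset always sums to 0
    for i in range(1, target + 1):
        for j in range(1, n + 1):
            # Either skip element j-1 ...
            dp[i][j] = dp[i][j - 1]
            # ... or take it, if it fits into the remaining sum.
            if arr[j - 1] <= i:
                dp[i][j] = dp[i][j] or dp[i - arr[j - 1]][j - 1]
    return dp[target][n]

print("Yes" if subset_sum([3, 34, 4, 12, 5, 2], 9) else "No")
```

Note that both terms look at column j-1, which is what prevents an element from being used twice.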
I have a number of nodes connected through an intermediate node of another type, as in the picture. There can be multiple middle nodes. I need to find all the middle nodes for a given set of nodes and sort them by the number of links to my initial nodes. In my example, given A, B, C, D, it should return node E (4 links) followed by node F (3 links). Is this possible? If not, maybe it can be done using multiple requests? I was thinking about using the SHORTEST_PATH function, but it seems it can only find paths between nodes from the same collection?
Very nice question, it challenged the AQL part of my brain ;)
Good news: it is totally possible with only one query, using GRAPH_COMMON_NEIGHBORS and a bit of math.
Common neighbors counts, for each cross, how many pairs of your selected vertices it connects. Ordering is taken into account (A-E-B is different from B-E-A), so combinatorics gives a*(a-1) = c combinations, where c is the computed count and a is the number of vertices from your set connected to that cross. We use the p/q formula to solve for a, which gives a = 0.5 + SQRT(0.25 + c).
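As a quick sanity check of that formula, the following Python sketch computes c = a*(a-1) for a few values of a and recovers a again. The function name connected_count is made up for this example.

```python
import math

def connected_count(ordered_pairs):
    """Solve a*(a-1) = c for a, i.e. the positive root of a^2 - a - c = 0."""
    return 0.5 + math.sqrt(0.25 + ordered_pairs)

for a in range(2, 6):
    c = a * (a - 1)  # number of ordered pairs among a connected vertices
    print(a, c, connected_count(c))
```

For node E in the question, 4 connected vertices give 12 ordered pairs, and the formula maps 12 back to 4.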
If the type of vertex is encoded in an attribute of the vertex object
the resulting AQL looks like this:
FOR x in (
(
let nodes = ["nodes/A","nodes/B","nodes/C","nodes/D"]
for n in GRAPH_COMMON_NEIGHBORS("myGraph",nodes , nodes)
for f in VALUES(n)
for s in VALUES(f)
for candidate in s
filter candidate.type == "cross"
collect crosses = candidate._key into counter
return {crosses: crosses, connections: 0.5 + SQRT(0.25 + LENGTH(counter))}
)
)
sort x.connections DESC
return x
If you put the crosses in a different collection and filter by collection name, the query gets even more efficient, because we do not need to open any vertices that are not of type cross at all.
FOR x in (
(
let nodes = ["nodes/A","nodes/B","nodes/C","nodes/D"]
for n in GRAPH_COMMON_NEIGHBORS("myGraph",nodes, nodes,
{"vertexCollectionRestriction": "crosses"}, {"vertexCollectionRestriction": "crosses"})
for f in VALUES(n)
for s in VALUES(f)
for candidate in s
collect crosses = candidate._key into counter
return {crosses: crosses, connections: 0.5 + SQRT(0.25 + LENGTH(counter))}
)
)
sort x.connections DESC
return x
Both queries will yield the result on your dataset:
[
{
"crosses": "E",
"connections": 4
},
{
"crosses": "F",
"connections": 3
}
]