Strange behaviour with vertex centric indexes in ArangoDB

I have some trouble understanding how to properly use vertex centric indexes in ArangoDB.
In my cooking app, I have the following graph schema: (recipe)-[hasConstituent]->(ingredient)
Let's say I want all the recipes that need less than 0g of carrots. The result will be empty, of course.
FOR recipe, constituent, p IN INBOUND 'ingredients/carrot' hasConstituent
FILTER constituent.quantity.value < 0
RETURN recipe._key
With about 400,000 recipes associated with carrot, this query takes ~3.9s. Fine.
Now I create a vertex-centric index on the hasConstituent collection over the _to and quantity.value attributes, with an estimated selectivity of 100%.
I expected it to keep the values in numeric order and thus to significantly speed up FILTER or SORT/LIMIT queries, but now the previous query takes ~7.9s... If I make the index "sparse", it takes the same time as without the index (~3.9s).
What am I missing here?
The hardest part to understand is that the execution plan given by the explain result is different from the profile result. Here is the explain, where everything looks fine and the result should be fetched almost instantly:
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
5 TraversalNode 1 - FOR recipe /* vertex */, constituent /* edge */ IN 1..1 /* min..maxPathDepth */ INBOUND 'ingredients/carrot' /* startnode */ hasConstituent
6 CalculationNode 1 - LET #8 = (constituent.`quantity`.`value` < 0) /* simple expression */
7 FilterNode 1 - FILTER #8
8 CalculationNode 1 - LET #10 = recipe.`_key` /* attribute expression */
9 ReturnNode 1 - RETURN #10
But in the profile :
Execution plan:
Id NodeType Calls Items Runtime [s] Comment
1 SingletonNode 1 1 0.00000 * ROOT
5 TraversalNode 433 432006 7.64893 - FOR recipe /* vertex */, constituent /* edge */ IN 1..1 /* min..maxPathDepth */ INBOUND 'ingredients/carrot' /* startnode */ hasConstituent
6 CalculationNode 433 432006 0.28761 - LET #8 = (constituent.`quantity`.`value` < 0) /* simple expression */
7 FilterNode 1 0 0.08704 - FILTER #8
8 CalculationNode 1 0 0.00000 - LET #10 = recipe.`_key` /* attribute expression */
9 ReturnNode 1 0 0.00001 - RETURN #10
Note that the index is used in both results:
Indexes used:
By Name Type Collection Unique Sparse Selectivity Fields Ranges
5 recipeByIngrQty persistent hasConstituent false false 100.00 % [ `_to`, `quantity.value` ] base INBOUND
Any help is very welcome

For a traversal FOR vertex, edge, path IN ..., filtering on either vertex or edge only applies to the results, not to what's actually visited. As to why that makes sense, keep in mind that, in general, not all vertices or edges visited during the traversal are actually part of the result: for example, if min in the IN min..max argument is larger than zero (it's one by default), vertices (and their incoming edges) with a distance lower than that are not part of the result, but still have to be visited.
That's why, if you want to restrict the edges visited during a traversal, you must add the filter on the path variable instead. For your example:
FOR recipe, constituent, p IN INBOUND 'ingredients/carrot' hasConstituent
FILTER p.edges[*].quantity.value ALL < 0
RETURN recipe._key
That should make use of the index as you expected. See vertex centric indexes and the AQL graph traversal documentation for more details.
I think this answers the core of your question; now to clear up some of the other issues you ran into along the way.
I expected it to keep the values in numeric order and thus to
significantly speed up FILTER or SORT/LIMIT queries, but now the
previous query takes ~7.9s... If I make the index "sparse", it takes
the same time as without the index (~3.9s)
Two things here.
First, it sounds like the optimizer preferred your index over the edge index. That probably shouldn't be the case, as (without the change I described above) it's not more specific than the edge index, but just somewhat slower. You haven't specified the version of ArangoDB you're using, so I can't comment specifically. But if you are using the most recent patch release of a supported minor version, e.g. 3.7.10 or 3.6.12 at the time of this writing, you can report this as an issue on GitHub.
Second, a sparse index does not index non-existent or null values. It thus cannot be used for a query that might have to match null values. Now note that null < 0 is true; see type and value order in the documentation for details. So your filter constituent.quantity.value < 0 could match documents where quantity.value is null or missing, and that's why the sparse index is treated differently (i.e. cannot be used at all).
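You can check this behaviour directly with a trivial query (a minimal sketch; the comments show the expected result):
LET doc = { name: "carrot" }                  /* no quantity attribute */
RETURN [ null < 0, doc.quantity.value < 0 ]   /* [ true, true ] */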
Now to the final point:
The hardest part to understand is that the execution plan given by the explain result is different from the profile result.
The explain output shows a column "Est.", which is an estimate of the number of rows/iterations this node will emit/do. The column "Items" in the profile output, in contrast, is the corresponding exact number. This estimate can be good in some cases and bad in others. That is not necessarily a problem, and it cannot be exact without actually executing the query. If it does become a problem, because the estimates lead the optimizer to choose the wrong index for the job, you can use index hints. But this isn't the case here.
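For illustration, this is what an index hint looks like on a plain collection scan in newer ArangoDB versions (3.5 and later; hints for the traversal itself came in later versions still), here using the recipeByIngrQty index from your explain output:
FOR edge IN hasConstituent OPTIONS { indexHint: "recipeByIngrQty" }
    FILTER edge._to == 'ingredients/carrot' AND edge.quantity.value < 0
    RETURN edge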
Apart from that, the two plans you showed seem to be exactly identical.


Reaching nth Stair

Total number of ways to reach the nth floor with the following types of moves:
Type 1: in a single move you can move from floor i to i+1 – you can use this move any number of times
Type 2: in a single move you can move from floor i to i+2 – you can use this move any number of times
Type 3: in a single move you can move from floor i to i+3 – but you can use this move at most k times
I know how to count the ways to reach the nth floor when moves of type 1, 2 and 3 can each be used any number of times, using DP like dp[i]=dp[i-1]+dp[i-2]+dp[i-3]. I am stuck on the condition that the Type 3 move can be used at most k times.
Can someone tell me the approach here?
While modeling any recursion or dynamic programming problem, it is important to identify the goal, constraints, states, state function, state transitions, possible state variables and initial condition aka base state. Using this information we should try to come up with a recurrence relation.
In our current problem:
Goal: Our goal here is to calculate the number of ways to reach floor n while beginning from floor 0.
Constraints: We can move from floor i to i+3 at most K times. We name this a special move. So, one can perform this special move at most K times.
State: In this problem, our situation of being at a floor is one way to model a state. The exact situation is defined by the state variables.
State variables: State variables are properties of the state and are important to identify a state uniquely. Being at floor i alone is not enough in itself, as we also have the constraint K. So to identify a state uniquely we need 2 state variables: i, indicating the floor, ranging over 0..n, and k, indicating the number of special moves used out of K (capital K).
State functions: In our current problem, we are concerned with finding the number of ways to reach a floor i from floor 0. We only need to define one function, number_of_ways, associated with the corresponding state to describe the problem. Depending on the problem, we may need to define more state functions.
State Transitions: Here we identify how we can transition between states. We can come freely to floor i from floor i-1 and floor i-2 without consuming our special move. We can only come to floor i from floor i-3 by consuming a special move, and only if i >= 3 and the number of special moves used so far k < K.
In other words, possible state transitions are:
state[i,k] <== state[i-1,k] // doesn't consume special move k
state[i,k] <== state[i-2,k] // doesn't consume special move k
state[i,k+1] <== state[i-3,k] // consumes a special move; only if k < K and i >= 3
We should now be able to form the following recurrence relation using the above information. While coming up with a recurrence relation, we must ensure that all the previous states needed for the computation of the current state are computed first. We can ensure this order by computing our states in the topological order of the directed acyclic graph (DAG) formed by the defined states as its vertices and the possible transitions as directed edges. It is important to note that such an ordering is only possible if the directed graph formed by the defined states is acyclic; otherwise we need to rethink whether the states are correctly and uniquely defined by their state variables.
Recurrence Relation:
number_of_ways[i,k] = (number_of_ways[i-1,k] if i >= 1 else 0)
                    + (number_of_ways[i-2,k] if i >= 2 else 0)
                    + (number_of_ways[i-3,k-1] if i >= 3 and k > 0 else 0)
Base cases:
Base cases or solutions to initial states kickstart our recurrence relation and are sufficient to compute the solutions of the remaining states. These are usually trivial cases or the smallest subproblems, which can be solved without the recurrence relation.
We can have as many base conditions as we require and there is no specific limit. Ideally we want a minimal set of base conditions, enough to compute the solutions of all remaining states. For the current problem, after initializing all not-yet-computed solutions to 0:
number_of_ways[0, 0] = 1
number_of_ways[0,k] = 0 where 0 < k <= K
Our required final answer will be sum(number_of_ways[n,k], for all 0<=k<=K).
You can use two-dimensional dynamic programming:
dp[i,j] is the solution value when exactly j Type-3 steps are used. Then
dp[i,j]=dp[i-1,j]+dp[i-2,j]+dp[i-3,j-1], where any term with a negative index counts as 0, and the initial value is dp[0,0]=1. You can build up first the dp[i,0] values, then the dp[i,1] values, etc. Or you can use a different order, as long as all necessary values are already computed.
Following @LaszloLadanyi's approach, below is the code snippet in Python:
def solve(n, k):
    # dp[i][j] = number of ways to reach floor i using exactly j Type-3 moves
    dp = [[0 for _ in range(k + 1)] for _ in range(n + 1)]
    dp[0][0] = 1  # one way to be on floor 0: do nothing
    for j in range(k + 1):
        for i in range(1, n + 1):
            dp[i][j] += dp[i - 1][j]          # Type 1 move from floor i-1
            if i > 1:
                dp[i][j] += dp[i - 2][j]      # Type 2 move from floor i-2
            if i > 2 and j > 0:
                dp[i][j] += dp[i - 3][j - 1]  # Type 3 move from floor i-3
    # sum over all possible counts of Type-3 moves (0..k)
    return sum(dp[n])
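As a quick check, solve(4, 1) returns 7: the five ways to reach floor 4 using only Type 1 and Type 2 moves (1+1+1+1, 1+1+2, 1+2+1, 2+1+1, 2+2), plus the two ways that use one Type 3 move (1+3, 3+1).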

Recursive method in Pharo produces SubscriptOutOfBounds: 8

I created a class in Pharo called BinarySearchTree and implemented a method BinarySearchTree>>Preorder:index:
Preorder: myArray index: position
    (myArray at: position) ~= -1
        ifTrue: [
            Transcript show: (myArray at: position).
            self Preorder: myArray index: (position * 2).
            self Preorder: myArray index: (position * 2) + 1.
        ].
I then ran this with the array #(90 60 95 50) and index 1 to make a preorder traversal of my binary tree, which I implemented using an array, but it does not work.
Please help...
#at: will signal SubscriptOutOfBounds when the index is < 1 or greater than the size of the array (Smalltalk collections are 1-based, i.e. the first index is 1, not 0). 8 is clearly larger than 4 (the size of myArray).
The check at the start will never evaluate to false, as your array has no entry -1, and hence the conditional block is evaluated every time: starting at position 1, the recursion descends to position 2, then 4, and from there tries (myArray at: 8), which is past the end of your 4-element array.
I can't really say where your problem lies as you've excluded all the code that is actually of interest. If you add that I can tell you more.

Find path following edges with greatest value in ArangoDB

Let's say that in my graph I've got edges that have a field called value. After selecting a start vertex, I would like to find a path by always selecting the edge that has the highest value. Unfortunately I can't figure out how to write a proper query; is it possible in ArangoDB?
Hi, I am unsure what you would like to achieve; there are two possible scenarios that I can imagine from your description:
First: Shortest Path
The use-case here is you know the starting vertex and the target vertex, and you want to find the shortest (or cheapest) path between those two.
The built-in SHORTEST_PATH (https://docs.arangodb.com/3.1/AQL/Graphs/ShortestPath.html#shortest-path-in-aql) feature can serve this by defining the weight attribute in the options, like this:
FOR v IN OUTBOUND SHORTEST_PATH @start TO @end @@edgeCollections OPTIONS {weightAttribute: "value", defaultWeight: 1}
RETURN v
This will give you all vertices on the path from start to end which has the lowest sum of value attributes. If you need the "highest value", you could copy each value and store 1/value in a different field, and use that as the weight attribute: minimizing the sum of 1/value favours paths over edges with high values.
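If you go that route, a one-off AQL statement along these lines could populate such a field (a sketch; it assumes value is always a non-zero number, and edges stands in for your edge collection name):
FOR e IN edges
    FILTER e.value != null AND e.value != 0
    UPDATE e WITH { inverseValue: 1 / e.value } IN edges
You could then pass weightAttribute: "inverseValue" to SHORTEST_PATH instead.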
Second: Sorting of edges
The use case is you only have the starting vertex and want to get the connected vertices, ordered by the value on the edges. There you can simply combine the traversal statement with a sort (https://docs.arangodb.com/3.1/AQL/Graphs/Traversals.html#graph-traversals-in-aql):
FOR v, e IN OUTBOUND @start @@edgeCollection
SORT e.value DESC
LIMIT 1 /* Only pick the highest one */
RETURN {v: v, e: e}
Third use-case: Iterating several depths, only using the highest value
The AQL in use-case 2 can be chained up to an arbitrary depth, which has to be known a-priori. So, say you would like to iterate 3 steps, only using the edge with the highest value:
FOR v1, e1 IN OUTBOUND @start @@edgeCollection
    SORT e1.value DESC
    LIMIT 1 /* Only pick the highest one */
    /* Depth 1 done, now depth 2 */
    FOR v2, e2 IN OUTBOUND v1 @@edgeCollection
        SORT e2.value DESC
        LIMIT 1 /* Only pick the highest one */
        FOR v3, e3 IN OUTBOUND v2 @@edgeCollection
            SORT e3.value DESC
            LIMIT 1 /* Only pick the highest one */
            RETURN [v1, v2, v3]
Fourth use-case: the depth is not known a-priori
In this case pure AQL in the currently released version (3.1) cannot formulate this. It is easier to use a Foxx service (https://docs.arangodb.com/3.1/Manual/Foxx/#foxx) with the traversal module (https://docs.arangodb.com/3.1/Manual/Graphs/Traversals/UsingTraversalObjects.html#getting-started), which is a bit more flexible, but has to be implemented in JavaScript.

ArangoDB Not Using Index During Traversal

I have a simple graph traversal query:
FOR e in 0..3 ANY 'Node/5025926' Edge
FILTER
e.ModelType == "A.Model" &&
e.TargetType == "A.Target" &&
e.SourceType == "A.Source"
RETURN e
The 'Edge' edge collection has a hash index defined for attributes ModelType, TargetType, SourceType, in that order.
When checking the execution plan, the results are:
Query string:
FOR e in 0..3 ANY 'Node/5025926' Edge
FILTER
e.ModelType == "A.Model" &&
e.TargetType == "A.Target" &&
e.SourceType == "A.Source"
RETURN e
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
2 TraversalNode 7 - FOR e /* vertex */ IN 0..3 /* min..maxPathDepth */ ANY 'Node/5025926' /* startnode */ Edge
3 CalculationNode 7 - LET #1 = (((e.`ModelType` == "A.Model") && (e.`TargetType` == "A.Target")) && (e.`SourceType` == "A.Source")) /* simple expression */
4 FilterNode 7 - FILTER #1
5 ReturnNode 7 - RETURN e
Indexes used:
none
Traversals on graphs:
Id Depth Vertex collections Edge collections Filter conditions
2 0..3 Edge
Optimization rules applied:
none
Notice that the execution plan indicates that no indices will be used to process the query.
Is there anything I need to do to make the engine use the index on the Edge collection to process the results?
Thanks
In ArangoDB 3.0 a traversal will always use the edge index to find connected vertices, regardless of which filter conditions are present in the query and regardless of which indexes exist.
In ArangoDB 3.1 the optimizer will try to find the best possible index for each level of the traversal. It will inspect the traversal's filter condition and for each level pick the index for which it estimates the lowest cost. If there are no user-defined indexes, it will still use the edge index to find connected vertices. Other indexes will be used if there are filter conditions on edge attributes which are also indexed and the index has a better estimated average selectivity than the edge index.
In 3.1.0 the explain output will always show "Indexes used: none" for traversals, even though a traversal will always use an index. The index display is just missing in the explain output. This has been fixed in ArangoDB 3.1.1, which will show the individual indexes selected by the optimizer for each level of the traversal.
For example, the following query produces this explain output in 3.1.1:
Query string:
FOR v, e, p in 0..3 ANY 'v/test0' e
FILTER p.edges[0].type == 1 && p.edges[2].type == 2
RETURN p.vertices
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
2 TraversalNode 8000 - FOR v /* vertex */, p /* paths */ IN 0..3 /* min..maxPathDepth */ ANY 'v/test0' /* startnode */ e
3 CalculationNode 8000 - LET #5 = ((p.`edges`[0].`type` == 1) && (p.`edges`[2].`type` == 2)) /* simple expression */
4 FilterNode 8000 - FILTER #5
5 CalculationNode 8000 - LET #7 = p.`vertices` /* attribute expression */
6 ReturnNode 8000 - RETURN #7
Indexes used:
By Type Collection Unique Sparse Selectivity Fields Ranges
2 edge e false false 10.00 % [ `_from`, `_to` ] base INBOUND
2 edge e false false 10.00 % [ `_from`, `_to` ] base OUTBOUND
2 hash e false false 63.60 % [ `_to`, `type` ] level 0 INBOUND
2 hash e false false 64.40 % [ `_from`, `type` ] level 0 OUTBOUND
2 hash e false false 63.60 % [ `_to`, `type` ] level 2 INBOUND
2 hash e false false 64.40 % [ `_from`, `type` ] level 2 OUTBOUND
Additional indexes are present on [ "_to", "type" ] and [ "_from", "type" ]. Those are used on levels 0 and 2 of the traversal because there are filter conditions for the edges on these levels that can use these indexes. For all other levels, the traversal will use the indexes labeled with "base" in the "Ranges" column.
The explain output fix will become available with 3.1.1, which will be released soon.

scull driver from LDD - scull_read and scull_write

I am going through LDD by Rubini to learn driver programming. Currently, I am on the 3rd chapter, writing the character driver "scull". However, in the example code provided by the authors, I am not able to understand the following lines in scull_read() and scull_write():
item = (long)*f_pos / itemsize;  /* which qset list item the position falls into */
rest = (long)*f_pos % itemsize;  /* byte offset within that list item */
s_pos = rest / quantum;          /* which quantum pointer within the item's array */
q_pos = rest % quantum;          /* byte offset within that quantum */
I have spent quite some time on it in vain (and am still working on it). Can someone please help me understand the functionality of the above code snippet?
Regards,
Roy
Suppose you have set the quantum size to 4000 bytes in the scull driver and the qset array size to 10. In that case, the value of itemsize would be 40000. f_pos is the position from where the read/write should start, which comes in as a parameter to the read/write function. Suppose a read request has come in and f_pos is 50000.
Now,
item = (long)*f_pos / itemsize; so item would be 50000/40000 = 1
rest = (long)*f_pos % itemsize; so rest would be 50000%40000 = 10000
s_pos = rest / quantum; so s_pos would be 10000/4000 = 2
q_pos = rest % quantum; so q_pos would be 10000%4000 = 2000
If you have read the description of the scull driver in chapter 3 carefully, each scull device is a linked list of scull_qset nodes, and in our case each scull_qset points to an array of 10 pointers, each of which points to a quantum area of 4000 bytes (as we set the quantum size to 4000 bytes and the array size to 10). So each scull_qset is an array of 10 pointers, each pointing to 4000 bytes, giving one scull_qset a capacity of 40000 bytes.
In our read request, f_pos is 50000, so obviously this position is not in the first scull_qset, which is proven by the calculation of item. As item is 1, it points to the second scull_qset (the value of item would be 0 for the first scull_qset; for more information see the scull_follow function definition).
The value of rest helps to find at which position within the second scull_qset the read should start. As each quantum area is 4000 bytes, s_pos tells which of the 10 pointers of the second scull_qset should be used, and q_pos tells at which location within the quantum area pointed to by that pointer the read should start.
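As a sanity check, the decomposition reconstructs the original position: item * itemsize + s_pos * quantum + q_pos = 1 * 40000 + 2 * 4000 + 2000 = 50000 = f_pos.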
