I want to get all shortest paths between two vertices.
Example: "Give me all shortest paths between node A and node B" should only return the two blue paths.
This is what I have got so far:
LET source = (FOR x IN Entity FILTER x.objectID == "organization_1"
return x)[0]
LET destination = (FOR x IN Entity FILTER x.objectID == "organization_129"
return x)[0]
FOR node, edge, path IN 1..2 ANY source._id GRAPH "m"
FILTER LAST(path.vertices)._id == destination._id
LIMIT 100
RETURN path
Problems:
1. It is very slow (took 18 seconds on a graph with about 70 million nodes).
2. It finds every path, but I want only the shortest paths.
UPDATE
I tried the two-step query solution from the comments.
The problem is that the second query is also very slow.
Query string:
FOR source IN Entity FILTER source.objectID == "organization_1"
LIMIT 1
FOR node, edge, path
IN 1..#depth ANY source._id
GRAPH "m"
OPTIONS {uniqueVertices: "path"}
FILTER node.objectID == "organization_129"
RETURN path
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
11 IndexNode 1 - FOR source IN Entity /* hash index scan */
5 LimitNode 1 - LIMIT 0, 1
6 CalculationNode 1 - LET #6 = source.`_id` /* attribute expression */ /* collections used: source : Entity */
7 TraversalNode 346 - FOR node /* vertex */, path /* paths */ IN 1..2 /* min..maxPathDepth */ ANY #6 /* startnode */ GRAPH 'm'
8 CalculationNode 346 - LET #10 = (node.`objectID` == "organization_129") /* simple expression */
9 FilterNode 346 - FILTER #10
10 ReturnNode 346 - RETURN path
Indexes used:
By Type Collection Unique Sparse Selectivity Fields Ranges
11 hash Entity false false 100.00 % [ `objectID` ] (source.`objectID` == "organization_1")
7 edge ACTIVITYPARTY false false 100.00 % [ `_from`, `_to` ] base INBOUND
7 edge ACTIVITYPARTY false false 100.00 % [ `_from`, `_to` ] base OUTBOUND
7 edge ACTIVITY_LINK false false 100.00 % [ `_from`, `_to` ] base INBOUND
7 edge ACTIVITY_LINK false false 100.00 % [ `_from`, `_to` ] base OUTBOUND
7 edge ENTITY_LINK false false 70.38 % [ `_from`, `_to` ] base INBOUND
7 edge ENTITY_LINK false false 70.38 % [ `_from`, `_to` ] base OUTBOUND
7 edge RELATION false false 20.49 % [ `_from`, `_to` ] base INBOUND
7 edge RELATION false false 20.49 % [ `_from`, `_to` ] base OUTBOUND
7 edge SOFT_LINK false false 100.00 % [ `_from`, `_to` ] base INBOUND
7 edge SOFT_LINK false false 100.00 % [ `_from`, `_to` ] base OUTBOUND
Traversals on graphs:
Id Depth Vertex collections Edge collections Options Filter conditions
7 1..2 Activity, Entity, SOFT_LINK, Property ACTIVITYPARTY, ENTITY_LINK, SOFT_LINK, RELATION, ACTIVITY_LINK uniqueVertices: path, uniqueEdges: path
Optimization rules applied:
Id RuleName
1 move-calculations-up
2 move-filters-up
3 move-calculations-up-2
4 move-filters-up-2
5 use-indexes
6 remove-filter-covered-by-index
7 remove-unnecessary-calculations-2
8 optimize-traversals
9 move-calculations-down
First of all, you need a hash index on the field objectID in the collection Entity to avoid full collection scans, which heavily slow down your performance.
To get all shortest paths, I would first search for one shortest path with the AQL SHORTEST_PATH and return the number of visited vertices. There is also no need for subqueries (like in your query).
FOR source IN Entity FILTER source.objectID == "organization_1"
LIMIT 1
FOR destination IN Entity FILTER destination.objectID == "organization_129"
LIMIT 1
RETURN sum(
FOR v, e
IN ANY
SHORTEST_PATH source._id TO destination._id
GRAPH "m"
RETURN 1)-1
After that I would execute another query with the result from the first query as bind parameter #depth, which is used to limit the depth of the traversal.
FOR source IN Entity FILTER source.objectID == "organization_1"
LIMIT 1
FOR node, edge, path
IN 1..#depth ANY source._id
GRAPH "m"
OPTIONS {uniqueVertices: "path"}
FILTER node.objectID == "organization_129"
RETURN path
Note: To filter the last vertex in the path you don't have to use LAST(path.vertices), you can simply use node because it is already the last vertex (the same applies for edge).
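If you run the two queries from a driver, the first query's result becomes an ordinary numeric bind parameter (written @depth in AQL) for the second one. A minimal sketch, assuming the python-arango driver and hypothetical connection details (database name, credentials):

from arango import ArangoClient  # assumes the python-arango driver

db = ArangoClient(hosts="http://localhost:8529").db(
    "mydb", username="root", password="")  # hypothetical connection details

# Step 1: length of one shortest path between the two entities.
depth_query = """
FOR source IN Entity FILTER source.objectID == @src
    LIMIT 1
    FOR destination IN Entity FILTER destination.objectID == @dst
        LIMIT 1
        RETURN SUM(
            FOR v, e IN ANY SHORTEST_PATH source._id TO destination._id GRAPH "m"
            RETURN 1) - 1
"""
depth = list(db.aql.execute(
    depth_query,
    bind_vars={"src": "organization_1", "dst": "organization_129"}))[0]

# Step 2: all paths of exactly that length, i.e. all shortest paths.
paths_query = """
FOR source IN Entity FILTER source.objectID == @src
    LIMIT 1
    FOR node, edge, path IN 1..@depth ANY source._id GRAPH "m"
        OPTIONS {uniqueVertices: "path"}
        FILTER node.objectID == @dst
        RETURN path
"""
paths = list(db.aql.execute(
    paths_query,
    bind_vars={"src": "organization_1", "dst": "organization_129", "depth": depth}))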
I have implemented Dijkstra's algorithm using the PriorityQueue class of the queue module in Python.
But I am not always getting the correct result according to the online judge. Something must be missing in the below-given code, but I have no idea what.
What is wrong with my code?
from queue import PriorityQueue

class Solution:
    # Function to find the shortest distance of all the vertices
    # from the source vertex S.
    def dijkstra(self, V, adj, S):
        # code here
        q = PriorityQueue()
        distance = [-1] * V
        distance[S] = 0
        visited = set()
        visited.add(S)
        for i in adj[S]:
            distance[i[0]] = distance[S] + i[1]
            q.put([i[1], i[0]])
        while not q.empty():
            w, s = q.get()
            visited.add(s)
            for i in adj[s]:
                d = distance[s] + i[1]
                if distance[i[0]] == -1:
                    distance[i[0]] = d
                elif distance[i[0]] > d:
                    distance[i[0]] = d
                if i[0] not in visited:
                    q.put([i[1], i[0]])
        return distance
#{
# Driver Code Starts
# Initial Template for Python 3

import atexit
import io
import sys

if __name__ == '__main__':
    test_cases = int(input())
    for cases in range(test_cases):
        V, E = map(int, input().strip().split())
        adj = [[] for i in range(V)]
        for i in range(E):
            u, v, w = map(int, input().strip().split())
            adj[u].append([v, w])
            adj[v].append([u, w])
        S = int(input())
        ob = Solution()
        res = ob.dijkstra(V, adj, S)
        for i in res:
            print(i, end=" ")
        print()
# } Driver Code Ends
Sample Input for one test case:
9 14
0 1 4
0 7 8
1 7 11
1 2 8
7 6 1
7 8 7
2 8 2
8 6 6
2 5 4
2 3 7
6 5 2
3 5 14
3 4 9
5 4 10
0
Expected Output:
0 4 12 19 21 11 9 8 14
Problem:
My code returns this instead:
0 4 12 19 26 16 18 8 14
The problem is that you are giving priority to the edge with the least weight, but you should give priority to the path with the least total weight.
So near the end of your code change:
q.put([i[1],i[0]])
to:
q.put([d,i[0]])
This will solve it.
However, some comments:
If you use a priority queue it should not be necessary to compare a previously stored distance for a node with a new distance, as the priority queue's role is to make sure you visit a node via the shortest path upon its first visit. With a bit of code reorganisation, you can get rid of that minimal-distance test.
Once you have that in place, you also do not need to have visited, as it is enough to check that the node's distance is still at -1 (assuming weights are never negative). When that is the case, it means you haven't visited it yet.
It is also a bit more efficient if you store tuples on the queue instead of lists.
And you can reorganise the code so that you only need to push the initial cell to the queue before starting the traversal loop.
Finally, instead of one letter variables, it is more readable if you use descriptive names, like node and weight:
from queue import PriorityQueue

class Solution:
    def dijkstra(self, V, adj, S):
        queue = PriorityQueue()
        distances = [-1] * V
        queue.put((0, S))
        while not queue.empty():
            dist, node = queue.get()
            if distances[node] == -1:  # first visit is via the shortest path
                distances[node] = dist
                for neighbor, weight in adj[node]:
                    queue.put((dist + weight, neighbor))
        return distances
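As a quick check (outside the judge's driver code, reusing the Solution class above), feeding the sample graph from the question into this version reproduces the expected distances:

# Sample graph from the question: 9 vertices, 14 undirected weighted edges.
edges = [
    (0, 1, 4), (0, 7, 8), (1, 7, 11), (1, 2, 8), (7, 6, 1), (7, 8, 7),
    (2, 8, 2), (8, 6, 6), (2, 5, 4), (2, 3, 7), (6, 5, 2), (3, 5, 14),
    (3, 4, 9), (5, 4, 10),
]
V = 9
adj = [[] for _ in range(V)]
for u, v, w in edges:
    adj[u].append([v, w])
    adj[v].append([u, w])

print(Solution().dijkstra(V, adj, 0))
# [0, 4, 12, 19, 21, 11, 9, 8, 14]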
I have some trouble understanding how to properly use vertex-centric indexes in ArangoDB.
In my cooking app, I have the following graph schema: (recipe)-[hasConstituent]->(ingredient)
Let's say I want all the recipes that need less than 0 g of carrots. The result will be empty, of course.
FOR recipe, constituent, p IN INBOUND 'ingredients/carrot' hasConstituent
FILTER constituent.quantity.value < 0
RETURN recipe._key
With carrot having 400,000 associated recipes, this query takes ~3.9 s. Fine.
Now I create a vertex-centric index in the hasConstituent collection on the _to, quantity.value properties, with an estimated selectivity of 100%.
I expected it to sort the index entries in numeric order and thus to significantly speed up FILTER or SORT/LIMIT requests, but now the previous request takes ~7.9 s... If I make the index sparse, it takes the same time as without the index (~3.9 s).
What am I missing here?
The hardest part to understand is that the execution plan given by the explain result differs from the profile result. Here is the explain, where everything looks fine and the result should come back almost instantly:
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
5 TraversalNode 1 - FOR recipe /* vertex */, constituent /* edge */ IN 1..1 /* min..maxPathDepth */ INBOUND 'ingredients/carrot' /* startnode */ hasConstituent
6 CalculationNode 1 - LET #8 = (constituent.`quantity`.`value` < 0) /* simple expression */
7 FilterNode 1 - FILTER #8
8 CalculationNode 1 - LET #10 = recipe.`_key` /* attribute expression */
9 ReturnNode 1 - RETURN #10
But in the profile :
Execution plan:
Id NodeType Calls Items Runtime [s] Comment
1 SingletonNode 1 1 0.00000 * ROOT
5 TraversalNode 433 432006 7.64893 - FOR recipe /* vertex */, constituent /* edge */ IN 1..1 /* min..maxPathDepth */ INBOUND 'ingredients/carrot' /* startnode */ hasConstituent
6 CalculationNode 433 432006 0.28761 - LET #8 = (constituent.`quantity`.`value` < 0) /* simple expression */
7 FilterNode 1 0 0.08704 - FILTER #8
8 CalculationNode 1 0 0.00000 - LET #10 = recipe.`_key` /* attribute expression */
9 ReturnNode 1 0 0.00001 - RETURN #10
Note that the index is used in both results:
Indexes used:
By Name Type Collection Unique Sparse Selectivity Fields Ranges
5 recipeByIngrQty persistent hasConstituent false false 100.00 % [ `_to`, `quantity.value` ] base INBOUND
Any help is very welcome
For a traversal FOR vertex, edge, path IN ..., filtering on either vertex or edge only applies to the results, not to what is actually visited. As to why that makes sense, keep in mind that generally not all vertices or edges visited during the traversal are actually part of the result: for example, if min in the IN min..max argument is larger than zero (it is one by default), vertices (and their incoming edges) with a distance lower than that are not part of the result, but still have to be visited.
That's why, if you want to restrict the edges visited during a traversal, you must add the filter on the path variable instead. For your example:
FOR recipe, constituent, p IN INBOUND 'ingredients/carrot' hasConstituent
FILTER p.edges[*].quantity.value ALL < 0
RETURN recipe._key
That should make use of the index as you expected. See vertex centric indexes and the AQL graph traversal documentation for more details.
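For completeness, here is a minimal sketch of setting up that vertex-centric index and running the rewritten query from a client, assuming the python-arango driver and hypothetical connection details (database name, credentials):

from arango import ArangoClient  # assumes the python-arango driver

db = ArangoClient(hosts="http://localhost:8529").db(
    "cooking", username="root", password="")  # hypothetical connection details

# Vertex-centric index on the edge collection: _to first, then the filtered attribute.
db.collection("hasConstituent").add_persistent_index(
    fields=["_to", "quantity.value"])

# Filter on the path so the traversal itself can use the index.
query = """
FOR recipe, constituent, p IN INBOUND 'ingredients/carrot' hasConstituent
    FILTER p.edges[*].quantity.value ALL < @maxQty
    RETURN recipe._key
"""
recipe_keys = list(db.aql.execute(query, bind_vars={"maxQty": 0}))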
I think this answers the core of your question, now to clear up some of the ones you found on the way.
I expected it to sort indexes in a numeric order, and then to
significantly increase the speed of FILTER or SORT/LIMIT requests, but
now the previous request takes ~7.9s... If I make the index "sparse",
it takes the same time as without index (~3.9s)
Two things here.
First, it sounds like the optimizer preferred your index over the edge index. That probably shouldn't be the case, as (without the change I described above) it's not more specific than the edge index, just somewhat slower. You haven't specified the version of ArangoDB you're using, so I can't comment specifically. But if you are using the most recent patch release of a supported minor version, e.g. 3.7.10 or 3.6.12 at the time of this writing, you can report this as an issue on GitHub.
Second, a sparse index does not index non-existent or null values. It thus cannot be used for a query that could report null values. Now note that null < 0 is true, see type and value order in the documentation for details. So your query constituent.quantity.value < 0 could report null values, and that's why the sparse index is treated differently (i.e. cannot be used at all).
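You can see the null < 0 behaviour in isolation; reusing the db handle from the sketch above, the following prints [True], which is why edges without a quantity.value would match the filter even though a sparse index never contains them:

# null sorts before all numbers in AQL's type and value order.
print(list(db.aql.execute("RETURN null < 0")))  # [True]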
Now to the final point:
The most hard part to understand is that the execution plan given by the explain result is different from the profile result.
The explain output shows a column "Est.", which is an estimate of the number of rows / iterations this node will emit / do. The column "Items" in the profile output in contrast is the corresponding exact number. Now this estimate can be good in some cases, but bad in others. This is not necessarily a problem, and it cannot be exact without actually executing the query. If it happens to result in a problem, because the estimates get the optimizer to choose the wrong index for the job, you can use index hints. But this isn't the case here.
Apart from that, the two plans you showed seem to be exactly identical.
The data consists of weights of freshly harvested plants under four different treatments. The data is normally distributed and homogeneity of variances is given as well.
The ANOVA shows significant differences:
anova_Ernte <- aov(Gewicht ~ Variante, data=Daten_Ernte)
Anova(anova_Ernte)
Anova Table (Type II tests)
Response: Gewicht
Sum Sq Df F value Pr(>F)
Variante 57213 3 2.9778 0.03226 *
Residuals 1511436 236
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
However, the post-hoc test HSD.test() doesn't show any significant differences:
HSD.test(anova_Ernte, "Variante", group = TRUE, console = TRUE, main = "")
4 434.70 a
1 426.90 a
3 400.95 a
2 398.20 a
Gewicht std r Min Max
1 426.90 80.08929 80 234 596
2 398.20 79.90095 80 216 561
3 400.95 74.87869 40 228 568
4 434.70 84.98754 40 264 647
TukeyHSD shows the following:
TukeyHSD(anova_Ernte)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = Gewicht ~ Variante, data = Daten_Ernte)
$Variante
diff lwr upr p adj
2-1 -28.70 -61.439904 4.039904 0.1085058
3-1 -25.95 -66.048029 14.148029 0.3394912
4-1 7.80 -32.298029 47.898029 0.9582293
3-2 2.75 -37.348029 42.848029 0.9980106
4-2 36.50 -3.598029 76.598029 0.0887876
4-3 33.75 -12.551216 80.051216 0.2368136
And finally, the Kruskal-Wallis test does not show significant differences between the groups:
kruskal(y=Daten_Ernte$Gewicht, trt=Daten_Ernte$Variante, p.adj = "bonferroni", console = TRUE)
kruskal.test(Daten_Ernte$Gewicht ~ Daten_Ernte$Variante)
4 133.9000 a
1 131.2063 a
3 109.8875 a
2 108.4000 a
Am I now safe to say that there are no significant differences between the groups, or do I have options to find out which groups differ according to the ANOVA?
I've set up firms (turtles) in an industry (world) which either produce at home (firms located at ycor > 0) or have offshored their production (firms located at ycor < 0). I have given them a firms-own variable called offshored? which is set to either true or false.
I have a monitor on my interface which shows the share of firms that have offshored and of those that produce at home, as a percentage of all firms in the world:
breed [ firms firm ]
firms-own [
offshored? ;; true or false
]
to-report percentage-of-firms-at-home ;; monitors the % of firms which produce at home
report ( ( count firms with [ ycor > 0 ] ) / count firms ) * 100
end
to-report percentage-of-offshored-firms ;; monitors the % of offshored firms
report ( ( count firms with [ ycor < 0 ] ) / count firms ) * 100
end
I then plugged percentage-of-offshored-firms into a monitor on the interface. Now, I would like to have an absolute number show up for my reported percentage. How can I change the decimal number I receive so far to an absolute one?
I have a simple graph traversal query:
FOR e in 0..3 ANY 'Node/5025926' Edge
FILTER
e.ModelType == "A.Model" &&
e.TargetType == "A.Target" &&
e.SourceType == "A.Source"
RETURN e
The 'Edge' edge collection has a hash index defined for attributes ModelType, TargetType, SourceType, in that order.
When checking the execution plan, the results are:
Query string:
FOR e in 0..3 ANY 'Node/5025926' Edge
FILTER
e.ModelType == "A.Model" &&
e.TargetType == "A.Target" &&
e.SourceType == "A.Source"
RETURN e
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
2 TraversalNode 7 - FOR e /* vertex */ IN 0..3 /* min..maxPathDepth */ ANY 'Node/5025926' /* startnode */ Edge
3 CalculationNode 7 - LET #1 = (((e.`ModelType` == "A.Model") && (e.`TargetType` == "A.Target")) && (e.`SourceType` == "A.Source")) /* simple expression */
4 FilterNode 7 - FILTER #1
5 ReturnNode 7 - RETURN e
Indexes used:
none
Traversals on graphs:
Id Depth Vertex collections Edge collections Filter conditions
2 0..3 Edge
Optimization rules applied:
none
Notice that the execution plan indicates that no indices will be used to process the query.
Is there anything I need to do to make the engine use the index on the Edge collection to process the results?
Thanks
In ArangoDB 3.0 a traversal will always use the edge index to find connected vertices, regardless of which filter conditions are present in the query and regardless of which indexes exist.
In ArangoDB 3.1 the optimizer will try to find the best possible index for each level of the traversal. It will inspect the traversal's filter condition and for each level pick the index for which it estimates the lowest cost. If there are no user-defined indexes, it will still use the edge index to find connected vertices. Other indexes will be used if there are filter conditions on edge attributes which are also indexed and the index has a better estimated average selectivity than the edge index.
In 3.1.0 the explain output will always show "Indexes used: none" for traversals, even though a traversal will always use an index. The index display is just missing in the explain output. This has been fixed in ArangoDB 3.1.1, which will show the individual indexes selected by the optimizer for each level of the traversal.
For example, the following query produces this explain output in 3.1:
Query string:
FOR v, e, p in 0..3 ANY 'v/test0' e
FILTER p.edges[0].type == 1 && p.edges[2].type == 2
RETURN p.vertices
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
2 TraversalNode 8000 - FOR v /* vertex */, p /* paths */ IN 0..3 /* min..maxPathDepth */ ANY 'v/test0' /* startnode */ e
3 CalculationNode 8000 - LET #5 = ((p.`edges`[0].`type` == 1) && (p.`edges`[2].`type` == 2)) /* simple expression */
4 FilterNode 8000 - FILTER #5
5 CalculationNode 8000 - LET #7 = p.`vertices` /* attribute expression */
6 ReturnNode 8000 - RETURN #7
Indexes used:
By Type Collection Unique Sparse Selectivity Fields Ranges
2 edge e false false 10.00 % [ `_from`, `_to` ] base INBOUND
2 edge e false false 10.00 % [ `_from`, `_to` ] base OUTBOUND
2 hash e false false 63.60 % [ `_to`, `type` ] level 0 INBOUND
2 hash e false false 64.40 % [ `_from`, `type` ] level 0 OUTBOUND
2 hash e false false 63.60 % [ `_to`, `type` ] level 2 INBOUND
2 hash e false false 64.40 % [ `_from`, `type` ] level 2 OUTBOUND
Additional indexes are present on [ "_to", "type" ] and [ "_from", "type" ]. Those are used on levels 0 and 2 of the traversal because there are filter conditions for the edges on these levels that can use these indexes. For all other levels, the traversal will use the indexes labeled with "base" in the "Ranges" column.
The explain output fix will become available with 3.1.1, which will be released soon.
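As an aside, such per-direction vertex-centric indexes can be created from any driver; a minimal sketch with the python-arango driver (connection details hypothetical) for the e edge collection from the example above:

from arango import ArangoClient  # assumes the python-arango driver

db = ArangoClient(hosts="http://localhost:8529").db(
    "example", username="root", password="")  # hypothetical connection details

edge_col = db.collection("e")
# Hash indexes covering the filtered attribute per direction, so the optimizer
# can pick them for individual levels of the traversal.
edge_col.add_hash_index(fields=["_from", "type"])  # candidates for OUTBOUND steps
edge_col.add_hash_index(fields=["_to", "type"])    # candidates for INBOUND steps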