How to do an efficient random walk in ArangoDB?

I am working on a graph database in ArangoDB and I'm trying to generate paths through a random walk. My goal is, starting from a vertex V, to get for instance 4 random paths of a specified depth.
So far, I've found the following query, which works:
FOR vertex, edge, path IN 4..4 ANY 'Vertex/417438' edge_collection OPTIONS {bfs: TRUE, uniqueVertices : True, uniqueEdges : True}
SORT RAND()
LIMIT 3
FILTER IS_SAME_COLLECTION('Vertex', vertex)
RETURN path
This does give me 3 paths with a depth of 4, but it takes quite a long time because of the SORT RAND() at the beginning. I guess it first enumerates all possible paths, sorts them randomly, and only then returns the first few.
Do you think there is a way to get random paths that costs less time?
Thanks for your time and for your future answers.

I've just discussed this with someone from the ArangoDB dev team. Getting a random node in AQL is not possible for the moment. It's on their roadmap, however.
I found a way to do a random Walk on ArangoDB though.
Let's take v, the starting vertex.
Query the number of neighbors N of v with this command :
FOR clip, edge, path IN ANY '${vertex}' hasClipped OPTIONS {bfs: TRUE, uniqueVertices : True, uniqueEdges : True}
COLLECT WITH COUNT INTO len
RETURN len
Then choose a random integer rd with 0 <= rd < N (rd is used as a zero-based offset, so rd = N would skip every neighbor) and execute the following query:
FOR obj IN ANY '${vertex}' hasClipped OPTIONS {bfs : true, uniqueVertices : 'path'}
LIMIT ${rd}, 1
RETURN obj
This request will return a random neighbor among the existing ones.
Iterate to get a random walk with the depth you want.
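The iteration described above can be sketched client-side as follows. This is only an illustration under assumptions: the `neighbors` dictionary is a hypothetical in-memory stand-in for the two AQL queries (count the neighbors, then fetch the one at offset rd), and `random_walk` is a made-up helper name, not part of any ArangoDB driver.

```python
import random

def random_walk(neighbors, start, depth, rng=None):
    """Walk `depth` steps from `start`, picking a uniformly random neighbor each time."""
    rng = rng or random.Random()
    walk = [start]
    current = start
    for _ in range(depth):
        options = neighbors.get(current, [])  # stands in for the neighbor query
        if not options:
            break  # dead end: the walk stops early
        rd = rng.randrange(len(options))  # 0 <= rd < N, same role as LIMIT rd, 1
        current = options[rd]
        walk.append(current)
    return walk

# Hypothetical toy graph: every vertex is linked to the two others.
graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
walk = random_walk(graph, "a", 4, random.Random(0))
```

In a real setup each loop iteration would issue the count query and the LIMIT query against the database, so a walk of depth d costs 2*d round trips.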

Related

PageRank with custom initial scores

I am trying to implement a simple algorithm that will calculate PageRank on a directed network generated and handled with NetworkX. However, I'd like to add a simple change: rather than having the initial PageRank for each node be equal to 1/n, where n is the number of nodes in the graph, I want each node to have rank 1.
So far I have checked the official documentation on PageRank, but I found nothing that seems to help. Apparently the 'personalization' parameter is of no use either. I tried using nstart, but to no avail. The code currently looks like this:
import networkx as nx
D=nx.DiGraph()
D.add_weighted_edges_from([('1','2',0.5),('1','3',0.5)])
nst = {n: 1 for n in D.nodes}
print(nx.pagerank(D, alpha = 0.95, nstart=nst))
At the moment, the ranks given to each node at the end of the calculation still sum up to 1, while they should sum up to 3.
Is such a thing even feasible to begin with? Should I look elsewhere to implement such an algorithm? Could there be problems with convergence if such a change is applied? Thanks in advance.
The pagerank function in networkx has a parameter nstart:
nstart (dictionary, optional) – Starting value of PageRank iteration for each node.
Here is source code for this:
# Choose fixed starting vector if not given
if nstart is None:
    x = dict.fromkeys(W, 1.0 / N)
else:
    # Normalized nstart vector
    s = float(sum(nstart.values()))
    x = dict((k, v / s) for k, v in nstart.items())
You can just specify nstart in your code, like this:
nst = {n: 1 for n in G.nodes}
pr = nx.pagerank(G, nstart=nst)
Edit 1: The modern PageRank algorithm forcibly normalizes the start vector (you can see it in the code above). The whole algorithm relies on this, and if you force the nstart values to be 1 instead of 1/N, it breaks because the convergence condition:
will never be satisfied (e increases with each iteration). If you want to use 1 as the starting value, as in the original PageRank algorithm:
In the original form of PageRank, the sum of PageRank over all pages was the total number of pages on the web at that time, so each page in this example would have an initial value of 1.
you will have to implement the whole algorithm yourself, because that original form is deprecated.
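A minimal sketch of such a manual implementation, assuming the classic update rule PR(p) = (1 - d) + d * sum(PR(q) / outdegree(q)) over the pages q linking to p. The function name and its parameters are hypothetical, edge weights are ignored, and this is not NetworkX code. Note that the ranks only keep summing to N when every node has at least one outgoing edge; with dangling nodes the total drifts below N.

```python
def pagerank_unnormalized(edges, damping=0.85, iterations=50):
    """Original-style PageRank: every node starts at 1 and ranks sum to N."""
    nodes = set()
    out_links = {}
    for src, dst in edges:
        nodes.update((src, dst))
        out_links.setdefault(src, []).append(dst)

    rank = {node: 1.0 for node in nodes}  # initial rank 1 instead of 1/N
    for _ in range(iterations):
        new_rank = {node: 1.0 - damping for node in nodes}
        for src, targets in out_links.items():
            share = damping * rank[src] / len(targets)  # split evenly over out-links
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank

# Toy directed cycle with no dangling nodes, so the total stays at N = 3.
ranks = pagerank_unnormalized([("1", "2"), ("2", "3"), ("3", "1")])
```

A fixed iteration count is used here instead of a tolerance test, which sidesteps the convergence check discussed above.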

Running time for a shortestP alg in unweighted undirected graph in python3

For this kind of problem I thought it would be best to write a BFS-like implementation. I don't know why, but after some running-time tests the resulting plot resembles an exponential function.
So I began to doubt my code: is it really correct and efficient? Can you help me with a running-time analysis for a good algorithm?
I've plotted the x-axis on a log scale in base 1.5 (since in the code I use a list of the first 30 powers of 1.5 as the number of vertices given to a random graph generator). It still looks exponential...
import collections

def bfs_short(graph, source, target):
    visited = set()
    queue = collections.deque([source])
    d = {}
    d[source] = 0
    while queue:
        u = queue.pop()
        if u == target:
            return d[target]
        if u not in visited:
            visited.add(u)
            for w in graph[u]:
                if w not in visited:
                    queue.appendleft(w)
                    d[w] = d[u] + 1
The thing is... I haven't posted the benchmark input trials, which may also cause problems, but first of all I want to be sure that the code works fine. Solving the problems related to testing is my final goal.
A problem in your code is that your queue does not take into account the distance to the origin in order to prioritize the next vertex it will pop.
Also, collections.deque is not a priority queue, and thus will not give you the closest unvisited vertex seen so far when you pop an element from it.
This should do it, using heapq, a built-in heap implementation:
import heapq

def bfs_short(graph, source, target):
    visited = set()
    # Heap of (distance from source, vertex) pairs
    queue = [(0, source)]
    heapq.heapify(queue)
    while queue:
        # Pop the pair with the smallest distance
        dist, vertex = heapq.heappop(queue)
        if vertex == target:
            return dist
        if vertex not in visited:
            visited.add(vertex)
            for neighbor in graph[vertex]:
                if neighbor not in visited:
                    heapq.heappush(queue, (dist + 1, neighbor))
In this version, pairs are pushed onto the queue. To compare tuples, Python compares their first elements, then the second in case of equality, and so on.
So by pushing dist(vertex, origin) as the first member of the tuple, the smallest pair (dist, vertex) in the heap will also be the closest to the origin vertex.
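The tuple ordering described above can be seen in isolation with a short example. The pairs below are made-up values, used only to show that heapq pops by distance first and breaks ties on the second element.

```python
import heapq

heap = []
for item in [(2, "c"), (0, "a"), (1, "z"), (1, "b")]:
    heapq.heappush(heap, item)

# Pop everything: the smallest distance comes first, ties are broken by vertex name.
popped = [heapq.heappop(heap) for _ in range(len(heap))]
```

Because ties fall through to the second element, the vertices themselves must be comparable (strings or integers, for example).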

Find path following edges with greatest value in ArangoDB

Let's say that the edges in my graph have a field called value. After selecting a start vertex, I would like to find a path by always following the edge that has the highest value. Unfortunately I can't figure out how to write a proper query. Is it possible in ArangoDB?
Hi, I am unsure what you would like to achieve. There are two possible scenarios that I can imagine from your description:
First: Shortest Path
The use-case here is you know the starting vertex and the target vertex, and you want to find the shortest (or cheapest) path between those two.
The built in SHORTEST_PATH (https://docs.arangodb.com/3.1/AQL/Graphs/ShortestPath.html#shortest-path-in-aql) feature can serve it by defining the distance attribute in the options like this:
FOR v IN OUTBOUND SHORTEST_PATH @start TO @end @@edgeCollections OPTIONS {weightAttribute: "value", defaultWeight: 1}
RETURN v
This will give you all vertices on the path from start to end that has the lowest sum of value attributes. If you need the "highest value", you could store 1/value in a separate field and use that field as the weight, to find the path with the fewest edges having in total the highest sum of values.
Second: Sorting of edges
The use case is you only have the starting vertex and want to get the connected vertices, ordered by the value on the edges. There you can simply combine the traversal statement with a simple sort. (https://docs.arangodb.com/3.1/AQL/Graphs/Traversals.html#graph-traversals-in-aql):
FOR v, e IN OUTBOUND @start @@edgeCollection
SORT e.value DESC
LIMIT 1 /* Only pick the highest one */
RETURN {v: v, e: e}
Third use-case: Iterating over several depths, only using the highest value
The AQL in use-case 2 can be chained up to an arbitrary depth, which has to be known a priori. So say you would like to iterate 3 steps, only using the edge with the highest value:
FOR v1, e1 IN OUTBOUND @start @@edgeCollection
SORT e1.value DESC
LIMIT 1 /* Only pick the highest one */
/* Depth 1 done. Now depth 2 */
FOR v2, e2 IN OUTBOUND v1 @@edgeCollection
SORT e2.value DESC
LIMIT 1 /* Only pick the highest one */
/* Depth 2 done. Now depth 3 */
FOR v3, e3 IN OUTBOUND v2 @@edgeCollection
SORT e3.value DESC
LIMIT 1 /* Only pick the highest one */
RETURN [v1,v2,v3]
Fourth use-case:
The depth is not known a priori. In this case, pure AQL in the currently released version (3.1) cannot express this. It will be easier to use a Foxx service (https://docs.arangodb.com/3.1/Manual/Foxx/#foxx) with the traversal module (https://docs.arangodb.com/3.1/Manual/Graphs/Traversals/UsingTraversalObjects.html#getting-started), which is a bit more flexible but can only be implemented in JavaScript.
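To make the fourth use-case concrete, here is a hedged sketch of the greedy logic itself, written in Python against a hypothetical in-memory dictionary rather than the Foxx traversal API. The name `greedy_path`, the `out_edges` layout, and the rule of never revisiting a vertex are all assumptions made for the illustration.

```python
def greedy_path(out_edges, start, max_steps=100):
    """Follow the highest-valued outgoing edge until no unvisited target is left."""
    path = [start]
    seen = {start}
    current = start
    for _ in range(max_steps):
        # Each entry is (target vertex, edge value); skip targets already on the path.
        candidates = [(target, value) for target, value in out_edges.get(current, [])
                      if target not in seen]
        if not candidates:
            break
        current, _value = max(candidates, key=lambda pair: pair[1])
        path.append(current)
        seen.add(current)
    return path

# Hypothetical graph: from "a" the best edge goes to "c", and from "c" back to "a".
edges = {"a": [("b", 1), ("c", 5)], "c": [("d", 2), ("a", 9)], "d": []}
path = greedy_path(edges, "a")
```

Skipping visited vertices keeps the walk from looping forever on cycles; without that rule, `max_steps` would be the only stopping condition.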

retrieve vertices with no linked edge in arangodb

What is the best way to retrieve all vertices that do not have an edge in a related edge_collection
I've tried the following code, but it has become incredibly slow since ArangoDB 2.8 (it was not really fast in previous versions, but roughly 10 times faster than now). It takes more than 30 seconds on collections of around 1000 edges and around 3000 vertices.
FOR v IN vertex_collection
FILTER LENGTH( EDGES(edge_collection, v._id, "outbound"))==0
RETURN v._id
...
update
...
After playing around a bit I came to the following query
LET vIDs = (FOR v IN vertex_collection
RETURN v._id)
LET vEdgesFrom = (FOR e IN edge_collection
FILTER e._from IN vIDs
RETURN e._from)
FOR v IN vertex_collection
FILTER v._id IN MINUS(vIDs, vEdgesFrom)
RETURN v._id
This one is much faster (around 0.05s) but still looks like some kind of work around (just thinking of more than one edge collections we need to query against).
So I'm still looking for the best method to find vertices having no edge in specific edge collections.
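The idea behind the faster query is a plain set difference: all vertex ids minus the ids that appear as _from on some edge. A small in-memory illustration, with made-up ids and a hypothetical helper name, might look like this:

```python
def vertices_without_outbound_edges(vertex_ids, edges):
    """Return the ids that never appear as the _from side of an edge."""
    sources = {edge["_from"] for edge in edges}
    return sorted(set(vertex_ids) - sources)

# Hypothetical documents shaped like ArangoDB vertex ids and edge documents.
vertex_ids = ["v/1", "v/2", "v/3"]
edges = [{"_from": "v/1", "_to": "v/2"}]
isolated = vertices_without_outbound_edges(vertex_ids, edges)
```

Supporting several edge collections would then amount to building `sources` from the union of their edges.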
My suggestion was going to be similar: use joins rather than graph features.
FOR oneEdge IN edges
LET vertices=(FOR oneVertex IN vertices
FILTER oneEdge._from == oneVertex._id OR
oneEdge._to == oneVertex._id
RETURN 1)
FILTER LENGTH(vertices) < 2
RETURN {v: vertices, e: oneEdge}
This finds all edges where either _from or _to points to a nonexistent vertex, so that they can subsequently be deleted.
Note the RETURN 1 which will reduce the amount of data passed up from the inner query.

Find the cross node for number of nodes in ArangoDB?

I have a number of nodes connected through an intermediate node of another type, as in the picture. There can be multiple middle nodes. I need to find all the middle nodes for a given set of nodes and sort them by the number of links to my initial nodes. In my example, given A, B, C, D, it should return node E (4 links) followed by node F (3 links). Is this possible? If not, could it be done using multiple requests? I was thinking about using the SHORTEST_PATH function, but it seems it can only find paths between nodes from the same collection?
Very nice question, it challenged the AQL part of my brain ;)
Good news: it is totally possible with only one query utilizing GRAPH_COMMON_NEIGHBORS and a portion of math.
Common neighbors counts, for each cross, how many pairs of your selected vertices it connects (taking ordering into account, so A-E-B is different from B-E-A). By combinatorics we end up with a*(a-1) = c combinations, where c is the computed count. We use the p-q formula to solve for a (the number of vertices from your set that are connected to the cross): a = 0.5 + SQRT(0.25 + c).
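The arithmetic can be checked with a few lines of Python. The helper name is hypothetical; it only evaluates the positive root of a*(a-1) = c, which is the same expression that appears in the AQL RETURN clause.

```python
import math

def connected_vertices(pair_count):
    """Solve a * (a - 1) = pair_count for the positive root a."""
    return 0.5 + math.sqrt(0.25 + pair_count)

# 4 vertices give 4 * 3 = 12 ordered pairs, 3 vertices give 3 * 2 = 6.
four = connected_vertices(12)
three = connected_vertices(6)
```

Here 12 ordered pairs map back to 4 connected vertices and 6 ordered pairs map back to 3, matching the E and F rows of the expected result.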
If the type of vertex is encoded in an attribute of the vertex object
the resulting AQL looks like this:
FOR x in (
(
let nodes = ["nodes/A","nodes/B","nodes/C","nodes/D"]
for n in GRAPH_COMMON_NEIGHBORS("myGraph",nodes , nodes)
for f in VALUES(n)
for s in VALUES(f)
for candidate in s
filter candidate.type == "cross"
collect crosses = candidate._key into counter
return {crosses: crosses, connections: 0.5 + SQRT(0.25 + LENGTH(counter))}
)
)
sort x.connections DESC
return x
If you put the crosses in a different collection and filter by collection name, the query becomes even more efficient, because we do not need to open any vertices that are not of type cross at all.
FOR x in (
(
let nodes = ["nodes/A","nodes/B","nodes/C","nodes/D"]
for n in GRAPH_COMMON_NEIGHBORS("myGraph",nodes, nodes,
{"vertexCollectionRestriction": "crosses"}, {"vertexCollectionRestriction": "crosses"})
for f in VALUES(n)
for s in VALUES(f)
for candidate in s
collect crosses = candidate._key into counter
return {crosses: crosses, connections: 0.5 + SQRT(0.25 + LENGTH(counter))}
)
)
sort x.connections DESC
return x
Both queries will yield the result on your dataset:
[
{
"crosses": "E",
"connections": 4
},
{
"crosses": "F",
"connections": 3
}
]

Resources