Broadcasting a graph in the Pregel API in GraphX? - apache-spark

What I would like to do is broadcast the graph I created to all of the vertices, so that each vertex can do its own computation on this graph and compute shortest paths with itself as the source vertex. Whenever I try to access the graph inside the compute method, the code below gives me:
java.lang.NullPointerException
val result=graph.pregel(graph,Int.MaxValue,EdgeDirection.Out)((id, value, msg) => compute(msg,id),triplet => Iterator.empty,(a, b) => a)

Unless you have iterative limitations, or want to compute the shortest path to a (temporally) changing node, it might be far easier to compute this with the help of org.apache.spark.graphx.lib.ShortestPaths [1], calling it for each of your vertices.
Either way, memory consumption will blow out of proportion even for a medium-sized graph. Unless you have a really large cluster, or only a small graph, this will most likely be an insurmountable task.
Providing further information on your setup might improve the answers given.

Here is an answer to this question, in case someone out there is trying to do the same thing.
First, because GraphX uses RDDs to store the graph's vertices and edges, it is not possible to broadcast a graph this way: an RDD cannot be accessed from inside an operation on another RDD.
This is why you are getting a java.lang.NullPointerException.
Second, broadcasting the graph like that is a bad idea. You should probably think of a distributed way to compute the shortest paths for each vertex. For example, instead of having one source vertex, you can trigger the shortest path computation from every single vertex and label your messages with both length and source to distinguish between different paths.
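The idea of labelling each message with both source and length can be sketched outside Spark. The following is a minimal single-machine simulation in plain Python, not GraphX code: the function name, the (src, dst, weight) edge format, and the superstep loop are assumptions made for illustration, and weights are assumed to be positive.

```python
from collections import defaultdict


def all_sources_shortest_paths(vertices, edges):
    """Simulate Pregel-style supersteps; every vertex acts as a source.

    vertices: iterable of vertex ids.
    edges: iterable of (src, dst, weight) with positive weights (assumed),
           where every endpoint also appears in `vertices`.
    Returns dist with dist[v][s] = length of the shortest path from s to v.
    """
    out_edges = defaultdict(list)
    for src, dst, weight in edges:
        out_edges[src].append((dst, weight))

    # Each vertex starts out knowing only the empty path from itself.
    dist = {v: {v: 0} for v in vertices}
    # Entries that changed in the previous superstep and must be forwarded.
    changed = {v: {v: 0} for v in vertices}

    while changed:
        # Send phase: every message carries a (source, length) label.
        inbox = defaultdict(dict)
        for u, updates in changed.items():
            for dst, weight in out_edges[u]:
                for source, length in updates.items():
                    candidate = length + weight
                    # Merge phase: keep the shortest length per source.
                    if candidate < inbox[dst].get(source, float("inf")):
                        inbox[dst][source] = candidate

        # Vertex phase: accept only improvements and forward those next round.
        changed = {}
        for v, messages in inbox.items():
            improved = {s: d for s, d in messages.items()
                        if d < dist[v].get(s, float("inf"))}
            if improved:
                dist[v].update(improved)
                changed[v] = improved
    return dist
```

Each vertex ends up holding one entry per source that can reach it, so the total state can grow with the square of the number of vertices, which matches the memory concern raised above.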

Related

Dynamic programming efficient network

Hello, I have a dynamic programming related question. How can I compute the shortest path in hops from a starting node to an ending node, with the constraint that every vertex and edge on the path has a value equal to or higher than a predefined value? An example of such a value is the data rate in a network. Could someone provide some pseudo-code or any thoughts? Thank you in advance.
Build a new graph from the given network that does not contain the vertices and edges whose value is less than the predefined value. Then, starting from the start node, run a shortest-path algorithm on the new graph to find the end node, such as BFS, Dijkstra (greedy, not dynamic programming), Bellman-Ford, etc.
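A minimal sketch of the filter-then-search approach, using BFS because the path length is counted in hops. The input format (a dict of node values and a dict of edge values keyed by node pairs), the function name, and the assumption that edges are undirected are all illustrative choices, not part of the question.

```python
from collections import deque


def fewest_hops(node_value, edge_value, start, goal, threshold):
    """Return a fewest-hop path that only uses nodes and edges whose value
    is at least `threshold`, or None if no such path exists.

    node_value: {node: value}
    edge_value: {(u, v): value}, edges assumed to be undirected.
    """
    if node_value[start] < threshold or node_value[goal] < threshold:
        return None

    # Step 1: build the filtered graph.
    adj = {n: [] for n, value in node_value.items() if value >= threshold}
    for (u, v), value in edge_value.items():
        if value >= threshold and u in adj and v in adj:
            adj[u].append(v)
            adj[v].append(u)

    # Step 2: plain BFS on the filtered graph.
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in adj[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None
```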

Getting a random simple path between two nodes in a graph

Given the start node and goal node in a graph, I want to find one simple path between these two nodes. I do not want the shortest path, but need any random simple path.
I tried using all_simple_paths from networkx, but this module seems to calculate all the simple paths before returning anything. This takes a long time to run.
Is there a way to find just one simple path?
Also, I would ideally like to make sure this path does not cross any "obstacles". These obstacles are a predefined set of nodes from the same graph. Is there a way to add in this constraint?
PS: I don't necessarily need to use networkx. The code I am writing is in Python.
You could treat this as a min cost network flow problem where your start node wants to send a unit of flow (demand = -1) to your goal node (demand = 1). You can set the edge capacities to 1 and you can set all the edge weights to 0 except for those around "obstacle" nodes. For those obstacle nodes you can set all the edges either coming into or going out of them to have a weight of 1. The algorithm will try to find any arbitrary path using only edges with weight 0, but will use weight 1 edges if no path exists with only weight 0 edges.
See the nx.min_cost_flow function. This function requires you to make your graph a directed graph nx.DiGraph if it's not already.
I managed to solve this problem by using the RRT algorithm. It gives a random path between the source and destination nodes and also avoids obstacles.
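For comparison, here is a sketch of a different and simpler technique than either answer above: a depth-first search that visits neighbours in random order and never steps onto an obstacle. It is neither the min-cost-flow formulation nor RRT. It uses only the standard library, the adjacency-dict input and the function name are assumptions, and the resulting path is arbitrary rather than uniformly random among all simple paths.

```python
import random


def random_simple_path(adj, start, goal, obstacles=(), seed=None):
    """Return one simple path from start to goal that avoids `obstacles`,
    or None if no such path exists.

    adj: {node: iterable of neighbour nodes}
    """
    rng = random.Random(seed)
    blocked = set(obstacles)
    if start in blocked or goal in blocked:
        return None

    def shuffled_neighbours(node):
        neighbours = list(adj.get(node, ()))
        rng.shuffle(neighbours)
        return iter(neighbours)

    visited = {start}
    path = [start]                         # current DFS path, always simple
    stack = [shuffled_neighbours(start)]   # one neighbour iterator per path node
    while stack:
        if path[-1] == goal:
            return path
        for nxt in stack[-1]:
            if nxt not in visited and nxt not in blocked:
                visited.add(nxt)
                path.append(nxt)
                stack.append(shuffled_neighbours(nxt))
                break
        else:
            # Dead end: backtrack one step.
            stack.pop()
            path.pop()
    return None
```

Unlike the weighted flow formulation, this treats obstacles as hard constraints: if every route crosses an obstacle, it returns None instead of falling back to a path through one.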

Find shortest paths through predefined set of vertices and edges in arangodb

I need to find shortest paths that pass through several given nodes and edges. A few details:
The paths should be shortest according to weights.
The include set can be ordered or unordered.
Graph size - 50 000 vertices and 450 0000 edges
Is there any way to find paths like this using arangodb?
I've tried K_SHORTEST_PATHS but it is too slow for some cases.
Without a data set, this is tricky to test. Unfortunately, K_SHORTEST_PATHS is the only built-in way to add "weight" to edges, unless you build something yourself. Also, neither of the SHORTEST_PATH methods implements PRUNE, which is the best way to speed up graph traversal.
My suggestion would be to use a directed graph method (FOR v,e,p IN 1..9 INBOUND x...), implementing both PRUNE and FILTER clauses to reduce the number of hops, and something like COLLECT path = p AGGREGATE weight = SUM(e.weight) to calculate weight.

Clustering of facebook-users with k-means

I got a list of Facebook user IDs from the following page:
Stanford Facebook-Data
If you look at the facebook_combined data, you can see that it is a list of user-connections (edges). So for instance user 0 has something to do with user 1,2,3 and so on.
Now my work is to find clusters in the dataset.
In the first step I used Node.js to read the file and save the data in an array like this:
array=[[0,1],[0,2], ...]
In the second step I used a k-means plugin for Node.js to cluster the data:
Cluster-Plugin
But I don't know if the result is right, because now I get clusters of edges and not clusters of users.
UPDATE:
I am trying out a Markov implementation for Node.js. The Markov plugin, however, needs an adjacency matrix to build clusters. I implemented an algorithm in Java to save the matrix in a file.
Maybe you have other suggestions for how I could get clusters out of edges.
K-means assumes your input data is in an R^d vector space.
In fact, it requires the data to be this way, because it computes means as cluster centers, hence the name k-means.
So if you want to use k-means, then you need:
One row per data point (not an edge list).
A fixed-dimensionality data space where the mean is a useful center (usually, you should have continuous attributes; on binary data the mean does not make too much sense) and where least-squares is a meaningful optimization criterion (again, on binary data, least-squares does not have strong theoretical support).
On your Facebook data, you could try some embedding, but I'd have doubts about the trustworthiness.
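The "one row per data point" requirement means reshaping the edge list into one row per user, for example the rows of an adjacency matrix as mentioned in the question's update. The sketch below only shows that reshaping in Python. The function name is hypothetical, connections are assumed to be undirected, and the resulting rows are binary, so the caveats above about means and least-squares on binary data still apply.

```python
def adjacency_rows(edge_list):
    """Turn an edge list such as [[0, 1], [0, 2]] into one 0/1 row per user.

    Returns (users, rows) where rows[i][j] == 1 if users[i] and users[j]
    are connected. Connections are treated as undirected (an assumption).
    """
    users = sorted({user for edge in edge_list for user in edge})
    index = {user: i for i, user in enumerate(users)}
    rows = [[0] * len(users) for _ in users]
    for a, b in edge_list:
        rows[index[a]][index[b]] = 1
        rows[index[b]][index[a]] = 1
    return users, rows
```

Each row now describes a user by who they are connected to, so any clustering applied to the rows produces clusters of users rather than clusters of edges.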

Directed graph linear algorithm

I would like to know the best way to calculate the length of the shortest path between vertex s and every other vertex of the graph in linear time using dynamic programming.
The graph is weighted DAG.
What you can hope for is an algorithm linear in the number of edges and vertices, i.e. O(|E| + |V|), which also works correctly in presence of negative weights.
This is done by first computing a topological order and then 'exploring' the graph in the order given by this topological order.
Some notation: let's call d'(s,v) the shortest distance from s to v and d(u,v) the length/weight of the arc from u to v (if it exists).
Then, for a node v that is currently being visited, the shortest distance d'(s,v) is the minimum of d'(s,u)+d(u,v) over all in-neighbours u of v.
In principle, this is very similar to Dijkstra's algorithm except that we already know in which order to traverse the vertices.
The topological sorting ensures that all in-neighbours of v have already been visited and will not be updated again. So, once a node v has been visited, the distance assigned to it is the correct shortest-path length from s to v. Therefore, you end up with a shortest s-v-path for each v.
A full description and implementation can be found here, which links to these lecture notes. I'm not sure where the algorithmic idea for this DAG algorithm was originally published in the literature.
This algorithm works for DAGs, even in the presence of negative weights/distances.
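A minimal Python sketch of the procedure described above, assuming vertices are numbered 0..n-1 and edges are given as (u, v, weight) triples (both are illustrative choices). It computes a topological order with Kahn's algorithm and then relaxes the outgoing arcs of each vertex in that order, which yields the same result as taking the minimum over in-neighbours.

```python
from collections import deque
import math


def dag_shortest_paths(num_vertices, edges, s):
    """Single-source shortest distances in a weighted DAG in O(|V| + |E|).

    edges: list of (u, v, weight); negative weights are allowed.
    Returns a list dist where dist[v] is math.inf if v is unreachable from s.
    """
    out_edges = [[] for _ in range(num_vertices)]
    in_degree = [0] * num_vertices
    for u, v, w in edges:
        out_edges[u].append((v, w))
        in_degree[v] += 1

    # Kahn's algorithm: repeatedly remove vertices with no remaining in-arcs.
    order = []
    ready = deque(v for v in range(num_vertices) if in_degree[v] == 0)
    while ready:
        u = ready.popleft()
        order.append(u)
        for v, _ in out_edges[u]:
            in_degree[v] -= 1
            if in_degree[v] == 0:
                ready.append(v)
    if len(order) != num_vertices:
        raise ValueError("graph contains a cycle, so it is not a DAG")

    dist = [math.inf] * num_vertices
    dist[s] = 0
    for u in order:
        if dist[u] == math.inf:
            continue  # u is not reachable from s
        for v, w in out_edges[u]:
            # dist[u] is already final here, so one relaxation per arc suffices.
            dist[v] = min(dist[v], dist[u] + w)
    return dist
```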
While a typical implementation of this algorithm will most likely not be done using dynamic programming explicitly, it can still be interpreted as such since the problem of finding a shortest path to a node v is computed using the shortest paths to the in-neighbours of v.
For further discussion on if/how this type of algorithm counts as dynamic programming, let me refer you to this question.
It's possible that what you're looking for is the Bellman-Ford algorithm, which is O(|V||E|) in terms of time complexity (not really linear).
Not sure if some witty dynamic-programming approach could improve on that though.
As hauron said, Bellman-Ford will give you what you're looking for in time O(|V||E|). This works even if your graph contains negative weighted edges, and Bellman-Ford uses dynamic programming at its core.
However, I must add that if your weights are non-negative, you can do Dijkstra from your vertex s in time O(|E| log |E|).
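A minimal sketch of Dijkstra with a binary heap in Python, assuming the graph is given as an adjacency dict of (neighbour, weight) pairs (an illustrative format). Stale heap entries are skipped instead of being decreased in place, so the heap holds at most one entry per edge relaxation, which gives the O(|E| log |E|) bound mentioned above.

```python
import heapq
import math


def dijkstra(adj, s):
    """Shortest distances from s; all weights must be non-negative.

    adj: {u: [(v, weight), ...]}
    Returns {vertex: distance} for every vertex reachable from s.
    """
    dist = {s: 0}
    heap = [(0, s)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue  # stale entry: a shorter distance to u was found earlier
        for v, w in adj.get(u, ()):
            candidate = d + w
            if candidate < dist.get(v, math.inf):
                dist[v] = candidate
                heapq.heappush(heap, (candidate, v))
    return dist
```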
Initialize d[s] = 0.
For every other vertex v, taken in topological order, calculate:
d[v] = min {d[u] + w(u,v) | (u,v) is an edge}
d[v] = ∞ if v has no incoming edges.
(The algorithm always halts since the graph is acyclic.)
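The recurrence above can also be evaluated top-down with memoization, which makes the dynamic-programming reading explicit. This is a sketch under assumptions: the function name and the (u, v, weight) edge format are illustrative, and plain recursion can hit Python's recursion limit on DAGs with very long paths.

```python
import math
from functools import lru_cache


def dag_distances_memo(vertices, edges, s):
    """Evaluate d[v] = min{d[u] + w(u, v)} over incoming edges, with d[s] = 0.

    vertices: iterable of hashable vertex ids.
    edges: list of (u, v, weight) forming a DAG.
    """
    vertices = list(vertices)
    in_edges = {v: [] for v in vertices}
    for u, v, w in edges:
        in_edges[v].append((u, w))

    @lru_cache(maxsize=None)
    def d(v):
        if v == s:
            return 0
        # No incoming edges means an empty minimum, i.e. infinity.
        return min((d(u) + w for u, w in in_edges[v]), default=math.inf)

    return {v: d(v) for v in vertices}
```

Because the graph is acyclic, each call to d only depends on vertices earlier in the topological order, so the recursion terminates and every d[v] is computed once.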

Resources