Efficient path traversal using AQL on a large bipartite graph in ArangoDB

I have a large bipartite graph (300M+ nodes) stored in ArangoDB using two collections and an edge collection. I'm trying to write an efficient AQL traversal that starts from a node of one type with a particular label and finds all other connected nodes of the same type with the same label. The traversal could find anywhere between 2 and 150K nodes, though on average it will be around 10-20 nodes. It is important that a) I specify a large default maximum traversal depth (i.e. 0..50) to ensure I find everything, but that b) AQL prunes paths so that most of the time it never reaches this maximum depth.
I have a query that gets the right results, but it does not appear to prune the paths, as it gets slower as I increase the max depth, even though the results do not change.
Here is the problem in miniature (picture here):
var cir = db._create("circles");
var dia = db._create("diamonds");
var owns = db._createEdgeCollection("owns");
var A = cir.save({_key: "A", color:'blue'});
var B = cir.save({_key: "B", color:'blue'});
var C = cir.save({_key: "C", color:'blue'});
var D = cir.save({_key: "D", color:'yellow'});
var E = cir.save({_key: "E", color:'yellow'});
var F = cir.save({_key: "F", color:'yellow'});
var G = cir.save({_key: "G", color:'red'});
var H = cir.save({_key: "H", color:'red'});
var d1 = dia.save({_key: "1"})._id;
var d2 = dia.save({_key: "2"})._id;
var d3 = dia.save({_key: "3"})._id;
var d4 = dia.save({_key: "4"})._id;
var d5 = dia.save({_key: "5"})._id;
var d6 = dia.save({_key: "6"})._id;
owns.save(A, d2, {});
owns.save(A, d5, {});
owns.save(A, d4, {});
owns.save(B, d4, {});
owns.save(C, d5, {});
owns.save(C, d6, {});
owns.save(D, d1, {});
owns.save(D, d2, {});
owns.save(E, d1, {});
owns.save(E, d3, {});
owns.save(F, d3, {});
owns.save(F, d4, {});
owns.save(G, d6, {});
owns.save(H, d6, {});
owns.save(H, d2, {});
Starting at the node circles/A, I want to find all connected vertices, stopping only when I encounter a circle that is not blue.
The following AQL does what I want:
FOR v, e, p IN 0..5 ANY "circles/A" owns
FILTER p.vertices[* filter has(CURRENT, 'color')].color ALL == 'blue'
return v._id
But the FILTER clause does not cause any pruning to occur. At least, as I said above, in the large database I have, increasing the max depth makes it very slow, without changing the results.
So how do I ensure that the filtering of the paths causes the algorithm to prune the paths? The docs are a little thin on this. I can only find examples where exact path lengths are used (p.vertices[1] for example).

As far as I know, there is only one pattern the optimizer is currently capable of recognizing to prune paths instead of post-filtering, and that is a plain filter on the path variable in combination with the ALL operator.
The inline filter you added may prevent this optimization from being applied. I don't see why you added it in the first place: a vertex without a color attribute has an implicit value of null, which is not equal to 'blue', so the inline filter should be unnecessary.
Does this query produce the same results, but faster as you increase the traversal depth?
FOR v, e, p IN 0..5 ANY "circles/A" owns
FILTER p.vertices[*].color ALL == 'blue'
return v._id
There is an open feature request for an explicit way to prune paths. Feel free to add your use case.
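To make the difference between pruning and post-filtering concrete, here is a minimal sketch in plain JavaScript that runs outside the database on the miniature graph from the question. It is not how ArangoDB implements traversals; the names isAllowed and prunedTraversal are hypothetical, and it visits each vertex only once instead of enumerating paths. A non-blue circle is rejected the moment it is reached, so nothing behind it is ever expanded and the result stops changing once the maximum depth exceeds the longest all-blue path.

```javascript
// Miniature graph from the question: circle colors and "owns" edges (circle -> diamond).
const colors = { A: "blue", B: "blue", C: "blue", D: "yellow",
                 E: "yellow", F: "yellow", G: "red", H: "red" };
const owns = [["A", "2"], ["A", "5"], ["A", "4"], ["B", "4"], ["C", "5"],
              ["C", "6"], ["D", "1"], ["D", "2"], ["E", "1"], ["E", "3"],
              ["F", "3"], ["F", "4"], ["G", "6"], ["H", "6"], ["H", "2"]];

// Build an undirected adjacency list (the equivalent of direction ANY).
const adjacency = new Map();
function link(a, b) {
  if (!adjacency.has(a)) adjacency.set(a, []);
  adjacency.get(a).push(b);
}
for (const [circle, diamond] of owns) {
  link("circles/" + circle, "diamonds/" + diamond);
  link("diamonds/" + diamond, "circles/" + circle);
}

// Hypothetical predicate: diamonds carry no color and always pass; circles must be blue.
function isAllowed(id) {
  if (!id.startsWith("circles/")) return true;
  return colors[id.slice("circles/".length)] === "blue";
}

// Breadth-first traversal that prunes: a rejected vertex is never expanded.
function prunedTraversal(start, maxDepth) {
  const seen = new Set();
  const queue = [[start, 0]];
  for (let i = 0; i < queue.length; i++) {
    const [id, depth] = queue[i];
    if (seen.has(id) || !isAllowed(id)) continue; // prune here
    seen.add(id);
    if (depth === maxDepth) continue;
    for (const next of adjacency.get(id) || []) {
      queue.push([next, depth + 1]);
    }
  }
  return [...seen].sort();
}

console.log(prunedTraversal("circles/A", 5));
console.log(prunedTraversal("circles/A", 50)); // same vertices, no extra expansion
```

A post-filtering traversal would instead walk through the yellow and red circles as well and discard those paths afterwards, which is why its cost keeps growing with the maximum depth.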

Related

ArangoDB: Get every node, which is in any way related to a selected node

I have a simple node-links graph in ArangoDB. How can I traverse from 1 preselected node and return all nodes which are related to it?
For example:
A→B, B→C, C→D, C→E, F→B, F→E
Selecting any of them should return the same result (all of them).
I am very new to ArangoDB.
What you need is AQL graph traversal, available since ArangoDB 2.8. Older versions provided a set of graph-related functions, but native AQL traversal is faster, more flexible and the graph functions are no longer available starting with 3.0.
AQL traversal lets you follow edges connected to a start vertex, up to a variable depth. Each encountered vertex can be accessed, e.g. for filtering or to construct a result, as well as the edge that led you to this vertex and the full path from start to finish, including both vertices and edges.
In your case, only the names of the visited vertices need to be returned. You can run the following AQL queries, assuming there's a document collection node and an edge collection links and they contain the data for this graph:
// follow edges ("links" collection) in outbound direction, starting at A
FOR v IN OUTBOUND "node/A" links
// return the key (node name) for every vertex we see
RETURN v._key
This will only return [ "B" ], because the traversal depth is implicitly 1..1 (min=1, max=1). If we increase the max depth, then we can include nodes that are indirectly connected as well:
FOR v IN 1..10 OUTBOUND "node/A" links
RETURN v._key
This will give us [ "B", "C", "D", "E"]. If we look at the graph, this is correct: we only follow edges that point from the vertex we come from to another vertex (direction of the arrow). To do the reverse, we could use INBOUND, but in your case, we want to ignore the direction of the edge and follow anyway:
FOR v IN 1..10 ANY "node/A" links
RETURN v._key
The result might be a bit surprising at first:
[ "B", "C", "D", "E", "F", "B", "F", "E", "C", "D", "B" ]
We see duplicate nodes returned. The reason is that there are multiple paths from A to C for instance (via B and also via B-F-E), and the query returns the last node of every path as variable v. (It doesn't actually process all possible paths up to the maximum depth of 10, but you could set the traversal option OPTIONS {uniqueEdges: "none"} to do so.)
It can help to return formatted traversal paths to better understand what is going on (i.e. how nodes are reached):
FOR v, e, p IN 1..10 ANY "node/A" links OPTIONS {uniqueEdges: "path"}
RETURN CONCAT_SEPARATOR(" - ", p.vertices[*]._key)
Result:
[
"A - B",
"A - B - C",
"A - B - C - D",
"A - B - C - E",
"A - B - C - E - F",
"A - B - C - E - F - B",
"A - B - F",
"A - B - F - E",
"A - B - F - E - C",
"A - B - F - E - C - D",
"A - B - F - E - C - B"
]
There is a cycle in the graph, but there can't be an infinite loop because the traversal stops once the maximum depth of 10 hops is reached. As you can see above, it doesn't even reach a depth of 10: it stops earlier because the default option is not to follow an edge twice per path (uniqueEdges: "path").
Anyway, this is not the desired result. A cheap trick would be to use RETURN DISTINCT, COLLECT or something like that to remove duplicates. But we are better off tweaking the traversal options, to not follow edges unnecessarily.
uniqueEdges: "global" would still include the B node twice, but uniqueVertices: "global" gives the desired result. In addition, bfs: true for breadth-first search can be used in this case. The difference is that the path to the F node is shorter (A-B-F instead of A-B-C-E-F). In general, the exact options you should use largely depend on the dataset and the questions you have.
There's one more problem to solve: the traversal does not include the start vertex (other than in p.vertices[0] for every path). This can easily be solved using ArangoDB 3.0 or later by setting the minimum depth to 0:
FOR v IN 0..10 ANY "node/A" links OPTIONS {uniqueVertices: "global"}
RETURN v._key
[ "A", "B", "C", "D", "E", "F" ]
To verify that all nodes from A through F are returned, regardless of the start vertex, we can issue the following test query:
FOR doc IN node
RETURN (
FOR v IN 0..10 ANY doc links OPTIONS {uniqueVertices: "global"}
SORT v._key
RETURN v._key
)
All sub-arrays should look the same. Remove the SORT operation if you want the node names returned in traversal order. Hope this helps =)
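For intuition only, the effect of direction ANY combined with uniqueVertices: "global" can be mimicked in plain JavaScript on the example graph. This is a rough sketch, not ArangoDB's implementation; the function name reachable is hypothetical, and it has no depth limit, which makes no difference on a graph this small.

```javascript
// Example graph from the question: A→B, B→C, C→D, C→E, F→B, F→E.
const links = [["A", "B"], ["B", "C"], ["C", "D"],
               ["C", "E"], ["F", "B"], ["F", "E"]];

// Hypothetical helper: every vertex reachable from `start`, ignoring edge
// direction (ANY) and visiting each vertex at most once (global uniqueness).
function reachable(start) {
  const neighbors = new Map();
  const add = (a, b) => {
    if (!neighbors.has(a)) neighbors.set(a, []);
    neighbors.get(a).push(b);
  };
  for (const [from, to] of links) {
    add(from, to);
    add(to, from);
  }
  const visited = new Set([start]);
  const queue = [start]; // breadth-first order, similar in spirit to bfs: true
  for (let i = 0; i < queue.length; i++) {
    for (const next of neighbors.get(queue[i]) || []) {
      if (!visited.has(next)) {
        visited.add(next);
        queue.push(next);
      }
    }
  }
  return queue;
}

// Every start vertex yields the same set of six nodes.
for (const start of ["A", "B", "C", "D", "E", "F"]) {
  console.log(start, "=>", reachable(start).sort().join(", "));
}
```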

Incrementally load big RDD file into memory

val locations = filelines.map(line => line.split("\t")).map(t => (t(5).toLong, (t(2).toDouble, t(3).toDouble))).distinct().collect()
val cartesienProduct=locations.cartesian(locations).map(t=> Edge(t._1._1,t._2._1,distanceAmongPoints(t._1._2._1,t._1._2._2,t._2._2._1,t._2._2._2)))
The code executes fine up to this point, but when I try to use cartesienProduct it gets stuck, e.g. on:
val count =cartesienProduct.count()
Any help to efficiently do this will be highly appreciated.
First, the map transformation can be made more readable if written as:
locations.cartesian(locations).map {
case ((a1, (b1, c1)), (a2, (b2, c2))) =>
Edge(a1, a2, distanceAmongPoints(b1, c1, b2, c2))
}
It seems the objective is to calculate the distance between two points for all pairs. cartesian will give each pair twice, effectively computing the same distance twice.
To avoid that, one approach is to broadcast a copy of all points and then compare each point only with the points that come after it.
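// Placeholder: supply the array of points (the element type here is an assumption).
val points: Array[(Double, Double)] = ???
val pointsRDD = sc.parallelize(points.zipWithIndex)
val bPoints = sc.broadcast(points)
// Compare each point only with the points that follow it, so every pair is computed once.
pointsRDD.map { case (point, index) =>
  (index + 1 until bPoints.value.length).map { i =>
    distanceBetweenPoints(point, bPoints.value(i))
  }
}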
If the size of points is N, it will compare point-0 with (point-1 to point-N-1), point-1 with (point-2 to point-N-1), etc.
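The index pattern described above can be checked in isolation. The sketch below uses plain JavaScript rather than Spark and assumes a Euclidean distance, since the original distance function is not shown; euclidean and pairwiseDistances are hypothetical names. For N points it produces N*(N-1)/2 pairs, half of what the Cartesian product yields once self-pairs are excluded.

```javascript
// Assumed distance function (the original distanceAmongPoints is not shown).
function euclidean(p, q) {
  return Math.hypot(p[0] - q[0], p[1] - q[1]);
}

// Compare point i only with points i+1 .. N-1, so each unordered pair appears once.
function pairwiseDistances(points) {
  const result = [];
  for (let i = 0; i < points.length; i++) {
    for (let j = i + 1; j < points.length; j++) {
      result.push({ from: i, to: j, distance: euclidean(points[i], points[j]) });
    }
  }
  return result;
}

const samplePoints = [[0, 0], [3, 4], [6, 8], [0, 8]];
const pairs = pairwiseDistances(samplePoints);
console.log(pairs.length); // 4 points give 4 * 3 / 2 pairs
console.log(pairs[0]);     // the pair (0, 1)
```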

Find the cross node for number of nodes in ArangoDB?

I have a number of nodes connected through an intermediate node of another type, as in the picture. There can be multiple middle nodes. I need to find all the middle nodes for a given set of nodes and sort them by the number of links to my initial nodes. In my example, given A, B, C, D, it should return node E (4 links) followed by node F (3 links). Is this possible? If not, maybe it can be done using multiple requests? I was thinking about using the SHORTEST_PATH function, but it seems it can only find paths between nodes from the same collection?
Very nice question, it challenged the AQL part of my brain ;)
Good news: it is totally possible with only one query utilizing GRAPH_COMMON_NEIGHBORS and a portion of math.
Common neighbors counts, for each cross, how many pairs of your selected vertices it connects. Because ordering is taken into account (A-E-B is different from B-E-A), a cross connected to a of your vertices appears in a*(a-1) = c combinations, where c is the computed count. Solving this quadratic for a with the p-q formula gives a = 0.5 + SQRT(0.25 + c), the number of vertices in your set that are connected to the cross.
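The arithmetic can be verified on its own with a small JavaScript sketch (the function names are hypothetical): count the ordered pairs for a given number of connected vertices, then invert the count with the same expression the query uses.

```javascript
// Number of ordered pairs (x, y), x != y, among `a` vertices connected to one cross.
function orderedPairCount(a) {
  return a * (a - 1);
}

// Inverse: recover `a` from the pair count c by solving a^2 - a - c = 0.
function connectedCount(c) {
  return 0.5 + Math.sqrt(0.25 + c);
}

for (const a of [2, 3, 4]) {
  const c = orderedPairCount(a);
  console.log(`a = ${a}, c = ${c}, recovered a = ${connectedCount(c)}`);
}
```

For the example data, node E is shared by four of the selected vertices and therefore shows up in 12 ordered pairs, while node F with three vertices shows up in 6.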
If the type of vertex is encoded in an attribute of the vertex object, the resulting AQL looks like this:
FOR x in (
(
let nodes = ["nodes/A","nodes/B","nodes/C","nodes/D"]
for n in GRAPH_COMMON_NEIGHBORS("myGraph",nodes , nodes)
for f in VALUES(n)
for s in VALUES(f)
for candidate in s
filter candidate.type == "cross"
collect crosses = candidate._key into counter
return {crosses: crosses, connections: 0.5 + SQRT(0.25 + LENGTH(counter))}
)
)
sort x.connections DESC
return x
If you put the crosses in a different collection and filter by collection name, the query becomes even more efficient, because we do not need to open any vertices that are not of type cross at all.
FOR x in (
(
let nodes = ["nodes/A","nodes/B","nodes/C","nodes/D"]
for n in GRAPH_COMMON_NEIGHBORS("myGraph",nodes, nodes,
{"vertexCollectionRestriction": "crosses"}, {"vertexCollectionRestriction": "crosses"})
for f in VALUES(n)
for s in VALUES(f)
for candidate in s
collect crosses = candidate._key into counter
return {crosses: crosses, connections: 0.5 + SQRT(0.25 + LENGTH(counter))}
)
)
sort x.connections DESC
return x
Both queries will yield the result on your dataset:
[
{
"crosses": "E",
"connections": 4
},
{
"crosses": "F",
"connections": 3
}
]

D3 ordinal scale only returning extremes. Why isn't it interpolating between range and domain?

I'm trying to use d3.scale.ordinal(). I am having an issue where the function only returns the minimum and maximum scale values.
I am trying to use d3.map() to construct the domain. Then I call the xScale function on the same values.
My data looks like this:
key,to_state,value,type
Populate,District of Columbia,836,Populate
Populate,Maryland,1938,Populate
Populate,New Jersey,836,Populate
Populate,Pennsylvania,939,Populate
Populate,New York,3455,Populate
My scale function looks like this:
xScale = d3.scale.ordinal()
.domain(d3.map(thedata, function(d){console.log(d.to_state); return d.to_state;}))
.range([0, w - marginleft - marginright]);
My map selection looks like this. The Y and height values are all being calculated properly. Just the X is giving me trouble.
var thechart = chart.selectAll("div")
.data(thedata)
.enter()
console.log(xScale("New Jersey") + " " + xScale("Pennsylvania"));
thechart.append("rect").attr("x", function(d, i) {
return xScale(d.to_state) + marginleft;
})
An ordinal scale expects the same number of items in the .range array as there are in the .domain array. It is not like a linear scale which interpolates the in-between values when given the boundaries of the range as an array. Since you give only two items in the range, they are alternated as the return values for each value given in the domain.
If you want to have a different return value for each value in your domain, you need to provide the corresponding value in the range. Otherwise, if you just want to provide the boundaries of your range, and have the in-between values calculated for you, use .rangePoints() instead of .range().
HERE is an example that demonstrates the result of using both .range() and .rangePoints() when given 10 values in the domain and 2 values in the range.
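The difference can also be modelled without D3. The following is a simplified sketch of the behaviour described above, not D3's source code: ordinalWithRange and ordinalWithRangePoints are hypothetical names, padding is ignored, and the domain is assumed to contain at least two known values.

```javascript
// Model of .range(): range values are assigned in order and reused cyclically.
function ordinalWithRange(domain, range) {
  return (value) => range[domain.indexOf(value) % range.length];
}

// Model of .rangePoints(): evenly spaced points between start and stop.
function ordinalWithRangePoints(domain, [start, stop]) {
  const step = (stop - start) / (domain.length - 1);
  return (value) => start + step * domain.indexOf(value);
}

const states = ["District of Columbia", "Maryland", "New Jersey",
                "Pennsylvania", "New York"];

const cycling = ordinalWithRange(states, [0, 400]);
const spaced = ordinalWithRangePoints(states, [0, 400]);

// With two range values, the first scale only ever returns the two extremes.
console.log(states.map(cycling));
// The second scale spreads the five states evenly across the interval.
console.log(states.map(spaced));
```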
To get this to work in your case, you will also want to take @AmeliaBR's suggestion of using Array.prototype.map rather than d3.map, and combine this with .rangePoints() to tween the values of your range.
xScale = d3.scale.ordinal()
.domain(thedata.map(function(d){ return d.to_state;}) )
.rangePoints([0, w - marginleft - marginright]);
For D3 v4 and later, you'll want to use d3.scalePoint() instead.
E.g.
xScale = d3.scalePoint().domain([0, 1, 2]).range([margin, width - margin])
You are confusing d3.map(object), which creates a hash-map data structure, with Array.prototype.map(function), which transforms one array into another.
Try:
xScale = d3.scale.ordinal()
.domain(thedata.map(function(d){ return d.to_state;}) )
.range([0, w - marginleft - marginright]);

FLOT: How to make different colored points in same data series, connected by a line?

I think I may have stumbled onto a limitation of Flot, but I'm not sure. I'm trying to represent a single data series over time. The items' "State" is represented on the Y-Axis (there are 5 of them), and time is on the X-Axis (items can change states over time). I want the graph to have points and lines connecting those points for each data series.
In addition to tracking an item's State over time, I'd also like to represent its "Status" at any of the particular points. This I would like to do by changing the color of the points. What this means is a single item may have different Statuses at different times, meaning for a single data series I need a line that connects different points (dots) of different colors.
The only thing I've seen so far is the ability to specify the color for all points in a given dataseries. Does anyone know if there's a way to specify colors individually?
Thanks.
There you go mate. You need to use a draw hook.
$(function () {
var d2 = [[0, 3], [4, 8], [8, 5], [9, 13]];
var colors = ["#cc4444", "#ff0000", "#0000ff", "#00ff00"];
var radius = [10, 20, 30, 40];
function raw(plot, ctx) {
var data = plot.getData();
var axes = plot.getAxes();
var offset = plot.getPlotOffset();
for (var i = 0; i < data.length; i++) {
var series = data[i];
for (var j = 0; j < series.data.length; j++) {
var color = colors[j];
var d = (series.data[j]);
var x = offset.left + axes.xaxis.p2c(d[0]);
var y = offset.top + axes.yaxis.p2c(d[1]);
var r = radius[j];
ctx.lineWidth = 2;
ctx.beginPath();
ctx.arc(x,y,r,0,Math.PI*2,true);
ctx.closePath();
ctx.fillStyle = color;
ctx.fill();
}
}
};
var plot = $.plot(
$("#placeholder"),
[{ data: d2, points: { show: true } }],
{ hooks: { draw : [raw] } }
);
});
With 3 views, it may not be worth answering my own question, but here's the solution:
My original problem was how to plot a dataseries of points and a line, but with each point being a color that I specify.
Flot only allows specifying colors of the dots at the dataseries level, meaning each color must be its own dataseries. With this in mind, the solution is to make a single dataseries for each color, and draw that dataseries with only points, and no lines. Then I must make a separate dataseries that is all of the dots I want connected by the line, and draw that one with no points, and only a line.
So if I want to show a line going through 5 points with five different colors, I need 6 dataseries: 5 for each point, and 1 for the line that connects them. Flot will simply draw everything on top of each other, and I believe there's a way to specify what gets shown on top (to make sure the dots are shown above the line).
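A minimal sketch of that approach, assuming Flot's usual per-series options (data, color, lines.show, points.show): one series draws only the line, and one single-point series per dot draws only the colored point. The helper name buildSeries and the sample data are hypothetical. The line series is placed first on the assumption that Flot draws series in array order, which should leave the dots on top.

```javascript
// Build one line-only series plus one point-only series per data point.
function buildSeries(points, colors, lineColor) {
  const line = {
    data: points,
    color: lineColor,
    lines: { show: true },
    points: { show: false }
  };
  const dots = points.map((point, i) => ({
    data: [point],
    color: colors[i],
    lines: { show: false },
    points: { show: true }
  }));
  return [line, ...dots];
}

const statePoints = [[0, 1], [1, 3], [2, 2], [3, 5], [4, 4]];
const statusColors = ["#cc4444", "#ff0000", "#0000ff", "#00ff00", "#ffaa00"];
const allSeries = buildSeries(statePoints, statusColors, "#999999");

console.log(allSeries.length); // five dots plus one line

// In a page that has jQuery and Flot loaded, this would be plotted with:
// $.plot($("#placeholder"), allSeries);
```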
Actually, it's not very difficult to add a feature to Flot that would call back into your code to get the color for each point. It took me about an hour, and I'm not a JavaScript expert by any measure.
If you look at drawSeriesPoints(), all you have to do is pass a callback parameter to plotPoints() which will be used to set ctx.strokeStyle. I added an option called series.points.colorCallback, and drawSeriesPoints() either uses that, or a simple function that always returns the series.color.
One tricky point: the index you should pass to your callback probably isn't the i in plotPoints(), but rather i/ps.
Hope this helps!
