Performance drops dramatically as traversal depth increases in ArangoDB graph traversal

I've been working on a config management system using ArangoDB. It collects config data for some common software and streams it to a program which generates the relationships among those software components based on some pre-defined rules, then saves the relations into ArangoDB. Once the relations are established, I provide APIs to query the data. One important query generates the topology of these software components. I use a graph traversal with the following AQL:
FOR n IN nginx
  FOR v, e, p IN 0..4 OUTBOUND n forward, dispatch, route, INBOUND deployto, referto, monitoron
    FILTER #domain IN p.edges[0].server_name
    RETURN {id: v._id, type: v.ci_type}
which can generate the following topology:
[image: software relation topology]
Which looks fine. However, it takes around 10 seconds to finish the query, which is not acceptable because the data volume is not very large. I checked all the collections, and the largest one, the "forward" edge collection, only has around 28,000 documents. So I did some tests:
I changed the depth from 0..4 to 0..2 and it only takes 0.3 seconds to finish the query
I changed the depth from 0..4 to 0..3 and it takes around 3 seconds
For 0..4, it takes around 10 seconds
Since there is a server_name property on the "forward" edges, I added a hash index on server_name[*], but judging from the explain output, ArangoDB doesn't use the index.
Any tips on how I can optimize the query? And why can't the index be used in this case?
Hope someone can help me out with this. Thanks in advance.

First of all, I have tried your query and I could see that for some reason the filter
filter #domain in p.edges[0].server_name
is not optimized correctly. This seems to be an internal issue with the optimization rule not being good enough; I will take a detailed look into this and try to make sure that it works as expected.
For this reason the traversal is not yet able to use a different index for this case, and will not correctly short-circuit to abort the search at depth 1.
I am very sorry for the inconvenience, as the way you did it should be the correct one.
To have a quick workaround for now, you could split the first part of the query into a separate step.
This is the fast version of my modified query (which will not include the nginx vertices themselves; see the slower version):
FOR n IN nginx
  FOR forwarded, e IN 1 OUTBOUND n forward
    FILTER #domain IN e.server_name
    /* At this point we only have the relevant first-depth vertices */
    FOR v IN 0..3 OUTBOUND forwarded forward, dispatch, route, INBOUND deployto, referto, monitoron
      RETURN {id: v._id, type: v.ci_type}
This is a slightly slower version of my modified query (keeping your output format; I think it will still be faster than the one you are working with):
FOR tmp IN (
  FOR n IN nginx
    FOR forwarded, e IN 1 OUTBOUND n forward
      FILTER #domain IN e.server_name
      /* At this point we only have the relevant first-depth vertices */
      RETURN APPEND([{id: n._id, type: n.ci_type}], (
        FOR v IN 0..3 OUTBOUND forwarded forward, dispatch, route, INBOUND deployto, referto, monitoron
          RETURN {id: v._id, type: v.ci_type}
      ))
)[**]
RETURN tmp
In addition, I can give some general advice:
(This will work after we have fixed the optimizer.) Usage of the index: ArangoDB uses statistics/assumptions about index selectivity (how well an index narrows down the data) to decide which index is better. In your case it may assume that the edge index is better than your hash index. You could try to create a combined hash index on ["_from", "server_name[*]"], which is likely to get a better selectivity estimate than the edge index and could then be used.
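A minimal arangosh sketch of creating such a combined index (the collection name forward is taken from the question; treat the exact call as a sketch to verify against your ArangoDB version):

```js
// Combined hash index over _from and the server_name array attribute,
// so the first traversal step can be narrowed down by both at once
db.forward.ensureIndex({ type: "hash", fields: ["_from", "server_name[*]"] });
```

After creating the index, re-run the query with explain to check whether the optimizer actually picks it up.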
In the example you have given I can see that there is a "large" right part starting at the apppkg node. In the query this right part can be reached in two ways:
a) nginx -> tomcat <- apppkg
b) nginx -> varnish -> lvs -> tomcat <- apppkg
This means the query could walk through the subtree starting at apppkg multiple times (once for every path leading there). With a query depth of 4 and only this topology it does not happen, but if there are shorter paths this may become an issue. If I am not mistaken, you are only interested in the distinct vertices in the graph and the path itself is not important, right? If so, you can add OPTIONS to the query to make sure that no vertex (and its dependent subtree) is analysed twice. The modified query would look like this:
FOR n IN nginx
  FOR v, e, p IN 0..4 OUTBOUND n forward, dispatch, route, INBOUND deployto, referto, monitoron
    OPTIONS {bfs: true, uniqueVertices: "global"}
    FILTER #domain IN p.edges[0].server_name
    RETURN {id: v._id, type: v.ci_type}
The change I made is adding OPTIONS to the traversal:
bfs: true => We do a breadth-first search instead of a depth-first search. We only need this to make the result deterministic and to make sure that all vertices reachable by a path of depth 4 will be found correctly.
uniqueVertices: "global" => Whenever a vertex is found in one traversal (in your case, separately for every nginx start vertex), it is flagged and will not be looked at again.
If you need the list of all distinct edges as well, you should use uniqueEdges: "global" instead of uniqueVertices: "global", which performs the uniqueness check on the edge level.

Related

ArangoDB AQL: can I traverse a graph from multiple start vertices, but ensure uniqueVertices across all traversals?

I have a graph dataset with a large number of relatively small disjoint graphs. I need to find all vertices reachable from a set of vertices matching certain search criteria. I use the following query:
FOR startnode IN nodes
FILTER startnode._key IN [...set of values...]
FOR node IN 0..100000 OUTBOUND startnode edges
COLLECT k = node._key
RETURN k
The query is very slow, even though it returns the correct result. This is because Arango actually ends up traversing the same subgraphs many times. For example, say there is the following subgraph:
a -> b -> c -> d -> e
When vertices a and c are selected by the filter condition, Arango ends up doing two independent traversals starting from a and c. It visits vertices d and e during both of these traversals, which wastes time. Adding the uniqueVertices option doesn't help, because vertex uniqueness is not checked across different traversals.
To confirm the impact on performance, I created an extra root document and added links from it to all the documents found by my filter:
FOR startnode IN nodes
FILTER startnode._key IN [...set of values...]
INSERT { _from: 'fakeVertices/0', _to: startnode._id } IN fakeEdges
Now the following query runs 4x faster than my original query, while producing the same result:
FOR node IN 1..1000000 OUTBOUND 'fakeVertices/0' edges, fakeEdges
OPTIONS { uniqueVertices: 'global', bfs: true }
COLLECT k = node._key
RETURN k
Unfortunately, I cannot create fake vertices/edges for all of my queries, as creating them takes even more time.
My question is: does Arango provide a way to ensure uniqueness of the vertices visited across all traversals in a given query? If not, is there a better way to solve the problem described above?
From what I understand, this is what the uniqueVertices option is for, but for each iteration of the FOR ... statement it considers vertices unique only within the traversal from that start node. It doesn't know about traversals that started from other nodes in the FOR ... statement. It appears that you will traverse LOTS of vertices each time, and this happens from each new start node.
Just throwing this at the wall to see if it sticks, but what about a combination of the two queries, adding OPTIONS to the original?
FOR startnode IN nodes
FILTER startnode._key IN [...set of values...]
FOR node IN 0..100000 OUTBOUND startnode edges
OPTIONS { uniqueVertices: 'global', bfs: true }
COLLECT k = node._key
RETURN k
Also, I would highly recommend a named graph instead of specifying edge collections. Not only is it far more flexible, it allows you to use shortest-path calculations as well, which might help here.
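As a sketch of that suggestion: a named graph can be registered once in arangosh and then referenced in AQL. The collections "nodes" and "edges" are taken from the question; the graph name "reachability" is an assumption:

```js
// Register the vertex and edge collections as a named graph
var graphModule = require("@arangodb/general-graph");
graphModule._create("reachability", [
  graphModule._relation("edges", "nodes", "nodes")
]);
```

The traversal can then be written as FOR node IN 0..100000 OUTBOUND startnode GRAPH 'reachability' instead of listing the edge collections explicitly.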

How to display all the child/parent nodes of a given node in a fully nested hierarchy in ArangoDB

We have an edge collection "ChildEvents" with 999999 records.
Structure:
_from: events/1, _to: events/2
_from: events/2, _to: events/3
In this structure node-1 is the super parent, node-2 is a child of node-1, and node-3 is a child of node-2 (a nested hierarchy from top to bottom):
1-->2-->3-->4-->....999999
My requirement is to display all the nested parents/children of a given node.
E.g. if I provide node-4, the query should display the children from node-5 to node-999999.
(I had this working in Neo4j using a MATCH query.)
But in ArangoDB, when I tried to achieve this using the query below, it returns only 2 records.
FOR v IN OUTBOUND "events/350" any ChildEvents RETURN v
Could someone help with this? Your help is greatly appreciated.
Traversal queries require that you declare a min value ("0" in my example), and to satisfy your requirements you need to declare a max value (999999) as well (see the docs here).
Also, the keyword ANY is not used in this fashion, especially since you are looking for outbound relations (outbound from the parent to a child).
FOR v IN 0..999999 OUTBOUND 'events/350'
ChildEvents
RETURN v
Finds node with _key = '350' from the 'events' collection
Returns the start node ('events/350')
Returns all child nodes of 'events/350' (all 999,648 of them)
You do not need to provide an exact number for the "max" value; it is simply an upper bound on the traversal depth (the number of edges followed along a path).
If this query does not work for you, then I suggest taking a close look at your edges. You might either have a gap in the chain or an incorrect direction (swapped _from and _to).
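If you want to sanity-check those edges, a quick AQL probe (using the collection and key from the question) could look like this:

```aql
// List every edge touching events/350 to spot a gap
// or a swapped _from/_to direction around that node
FOR e IN ChildEvents
  FILTER e._from == 'events/350' OR e._to == 'events/350'
  RETURN { from: e._from, to: e._to }
```

In the chain described above you would expect exactly one edge arriving at events/350 and one edge leaving it.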

Does neo4j have the ability to LIMIT collected data?

I have 2 types of nodes in my neo4j db: Users and Posts. Users relate to Posts as -[:OWNER]->.
My aim is to query users by ids together with their posts, and the posts should be limited (for example, LIMIT 10). Is it possible to limit them using COLLECT and order them by some parameter?
MATCH (c:Challenge)<-[:OWNER]-(u:User)
WHERE u.id IN ["c5db0d7b-55c2-4d6d-ade2-2265adee7327", "87e15e39-10c6-4c8d-934a-01bc4a1b0d06"]
RETURN u, COLLECT(c) as challenges
You can use slice notation to indicate you want to take only the first 10 elements of a collection:
MATCH (c:Challenge)<-[:OWNER]-(u:User)
WHERE u.id IN ["c5db0d7b-55c2-4d6d-ade2-2265adee7327", "87e15e39-10c6-4c8d-934a-01bc4a1b0d06"]
RETURN u, COLLECT(c)[..10] as challenges
Alternately you can use APOC's aggregation functions:
MATCH (c:Challenge)<-[:OWNER]-(u:User)
WHERE u.id IN ["c5db0d7b-55c2-4d6d-ade2-2265adee7327", "87e15e39-10c6-4c8d-934a-01bc4a1b0d06"]
RETURN u, apoc.agg.slice(c, 0, 10) as challenges
The APOC approach is supposed to be more efficient, but try out both first and see which works best for you.
EDIT
As far as sorting, that must happen prior to aggregation, so use a WITH on what you need, ORDER BY whatever, and then afterwards perform your aggregation.
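A sketch of that shape, assuming the challenges should be ordered by a name property (an assumption; substitute whatever property you actually sort on):

```cypher
MATCH (c:Challenge)<-[:OWNER]-(u:User)
WHERE u.id IN ["c5db0d7b-55c2-4d6d-ade2-2265adee7327", "87e15e39-10c6-4c8d-934a-01bc4a1b0d06"]
WITH u, c ORDER BY c.name ASC
RETURN u, COLLECT(c)[..10] AS challenges
```

Because the rows are ordered before the aggregation, COLLECT(c) preserves that order, and the [..10] slice keeps the first 10 per user.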
If you don't see good results, we may need to make use of LIMIT, but since we want that per u instead of across all rows, you'd need to use that within an apoc.cypher.run() subquery (this would be an independent query executed per u, so we would be allowed to use LIMIT that way):
MATCH (u:User)
WHERE u.id IN ["c5db0d7b-55c2-4d6d-ade2-2265adee7327", "87e15e39-10c6-4c8d-934a-01bc4a1b0d06"]
CALL apoc.cypher.run("MATCH (c:Challenge)<-[:OWNER]-(u) WITH c ORDER BY c.name ASC LIMIT 10 RETURN collect(c) as challenges", {u:u}) YIELD value
RETURN u, value.challenges as challenges

Traversing the optimum path between nodes

In a graph where there are multiple paths from node (:A) to (:B) through a node (:C), I'd like to extract the paths from (:A) to (:B) through nodes of type (c:C) where c.Value is maximal. For instance, connect all movies with only their oldest common actors.
match p=(m1:Movie) <-[:ACTED_IN]- (a:Actor) -[:ACTED_IN]-> (m2:Movie)
return m1.Name, m2.Name, a.Name, max(a.Age)
The above query returns the proper age for the oldest actor, but not always his correct name.
Conversely, I noticed that the following query returns both correct age and name.
match p=(m1:Movie) <-[:ACTED_IN]- (a:Actor) -[:ACTED_IN]-> (m2:Movie)
with m1, m2, a order by a.age desc
return m1.name, m2.name, a.name, max(a.age), head(collect(a.name))
Would this always be true? I guess so.
Is there a better way to do the job without the sorting, which may be costly?
You need to use ORDER BY ... LIMIT 1 for this:
match p=(m1:Movie) <-[:ACTED_IN]- (a:Actor) -[:ACTED_IN]-> (m2:Movie)
return m1.Name, m2.Name, a.Name, a.Age order by a.Age desc limit 1
Be aware that you basically want to do a weighted shortest path. Neo4j can do this more efficiently using java code and the GraphAlgoFactory, see the chapter on this in the reference manual.
For those who are willing to do similar things, consider reading this post from #_nicolemargaret, which describes how to extract the n oldest actors acting in pairs of movies rather than just the first, as with head(collect()).

Graphite: Aggregation Rules not working

I have added many aggregation rules like
app.email.server1.total-sent.1d.sum (86400) = sum app.email.server1.total-sent.1h.sum
I want to know whether there is any limit on the number of aggregation rules. Other aggregation rules of the same kind are working.
I also checked with tcpdump; packets containing the metric app.email.server1.total-sent.1h.sum are arriving.
Can we debug this by checking the logs? I tried, but the logs don't mention anything about which metrics are getting aggregated.
You want to sum up all 1h values into 1d, so in the rule, on the RHS, put * instead of 1h:
app.email.server1.total-sent.1d.sum (86400) = sum app.email.server1.total-sent.*.sum
