Traversing the optimum path between nodes - node.js

In a graph where there are multiple paths from (:A) to (:B) through a node of type (:C), I'd like to extract only the paths from (:A) to (:B) that pass through the node (c:C) whose c.Value is maximal. For instance, connect all movies through only their oldest common actor.
match p=(m1:Movie) <-[:ACTED_IN]- (a:Actor) -[:ACTED_IN]-> (m2:Movie)
return m1.Name, m2.Name, a.Name, max(a.Age)
The above query returns the proper age for the oldest actor, but not always his correct name.
Conversely, I noticed that the following query returns both correct age and name.
match p=(m1:Movie) <-[:ACTED_IN]- (a:Actor) -[:ACTED_IN]-> (m2:Movie)
with m1, m2, a order by a.age desc
return m1.name, m2.name, a.name, max(a.age), head(collect(a.name))
Would this always be true? I guess so.
Is there a better way to do the job without sorting, which may be costly?

You need to use ORDER BY ... LIMIT 1 for this:
match p=(m1:Movie) <-[:ACTED_IN]- (a:Actor) -[:ACTED_IN]-> (m2:Movie)
return m1.Name, m2.Name, a.Name, a.Age order by a.Age desc limit 1
Be aware that you basically want to do a weighted shortest path. Neo4j can do this more efficiently using Java code and the GraphAlgoFactory; see the chapter on this in the reference manual.

For those who want to do something similar, consider reading this post from @_nicolemargaret, which describes how to extract the n oldest actors acting in pairs of movies rather than just the first, as with head(collect()).
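The "oldest common actor per pair of movies" logic can also be illustrated outside Cypher. The following is a minimal Python sketch, not Neo4j code: the sample rows and the function name oldest_actor_per_pair are made up for illustration. It keeps, for each (movie, movie) pair, the actor with the highest age in a single pass, which returns the name together with the maximum and needs no sort.

```python
from typing import Dict, Iterable, Tuple

# One row per matched path: (movie1, movie2, actor_name, actor_age).
# The data below is hypothetical and only serves the illustration.
Row = Tuple[str, str, str, int]


def oldest_actor_per_pair(rows: Iterable[Row]) -> Dict[Tuple[str, str], Tuple[str, int]]:
    """Return the (name, age) of the oldest actor for every movie pair."""
    best: Dict[Tuple[str, str], Tuple[str, int]] = {}
    for m1, m2, name, age in rows:
        key = (m1, m2)
        # Keep the row only if it beats the current maximum for this pair.
        if key not in best or age > best[key][1]:
            best[key] = (name, age)
    return best


rows = [
    ("Alien", "Aliens", "Ann", 61),
    ("Alien", "Aliens", "Bob", 74),
    ("Alien", "Blade Runner", "Cy", 58),
    ("Alien", "Aliens", "Dee", 49),
]
result = oldest_actor_per_pair(rows)
```

The point of the sketch is that the name and the age travel together, which is what head(collect(a.name)) after an ORDER BY achieves in the query above.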

Related

Get an incrementing number in Logic App Select

I am using a Logic App to transform some data for an integration. I am trying to avoid using For Each loops as the amount of data I am working with is high, and these incur a cost for each action and iteration of the for each loop.
However, the integration I am working with requires a unique incrementing number for each line. The numbers don't have to be sequential, or even start at 1, but the order should be kept the same.
So with the above, the first one would get LineNumber 1, the second LineNumber 2, etc. (or like I said, it could be 67829, 67835, etc.)
I tried to set a variable with ticks(utcNow()) before the start of the mapping, and then use sub(ticks(utcNow()), variables('startTicks')) but this is evaluated once and the same number is applied to all.
My next thought is to use an Azure Function or inline JavaScript to go through afterward and assign them, but I'm wondering if there is a way to accomplish this in the Select.
Answering this requirement ("or like I said, it could be 67829, 67835, etc."), use the following expression inside the Select action:
indexOf(string(variables('<DATA Variable>')),string(item()))
Explanation:
item() is the current item in the Select. The expression stringifies it and searches for it in the stringified version of the entire data; the character index of the match is returned.
Please note:
I did not get a chance to check this on a very large dataset.
This may fail if a specific row (all values in the row) is repeated. I assume this is not your case (the order number is probably unique).
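The idea behind the indexOf expression can be tried out locally. This is a minimal Python sketch under one loud assumption: json.dumps with compact separators stands in for the Logic Apps string() function, whose exact serialization may differ. The function name line_numbers and the sample rows are made up for illustration.

```python
import json
from typing import Any, List


def line_numbers(rows: List[Any]) -> List[int]:
    """Character offset of each serialized row inside the serialized array."""
    # Assumption: compact JSON approximates what string() produces.
    whole = json.dumps(rows, separators=(",", ":"))
    return [whole.find(json.dumps(row, separators=(",", ":"))) for row in rows]


rows = [
    {"order": "A-1", "qty": 2},
    {"order": "A-2", "qty": 5},
    {"order": "A-3", "qty": 1},
]
numbers = line_numbers(rows)

# A fully repeated row finds the first occurrence again, so it gets the
# same number as the original row. This is the failure case noted above.
duplicated = line_numbers(rows + [rows[0]])
```

For distinct rows the offsets are unique and increase with position, so they keep the order even though they are not sequential.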

How to display all the child/parent nodes of a given node in a fully nested hierarchy in ArangoDB

We have an edge collection "ChildEvents" with 999999 records
Structure:
_from:events/1, _to: events/2
_from:events/2, _to: events/3
In this structure node-1 is super parent, and node-2 is child of node-1, and node-3 is child of 2. ( Nested hierarchy from top to bottom)
1-->2-->3-->4-->....999999
My requirement is to display all the nested parents or children of a given node.
E.g., if I provide node 4, the query should display the children from node 5 to node 999999.
(I had this working in a Neo4j database using a match query.)
But in ArangoDB, when I tried to achieve this using the query below, it returns only 2 records.
FOR v IN OUTBOUND "events/350" any ChildEvents RETURN v
Could someone help with this? Your help is greatly appreciated.
A traversal without a depth range only goes one level deep (min defaults to 1 and max defaults to min), so you get just the direct neighbors. To satisfy your requirements, you need to declare a min value ("0" in my example, which also returns the start node) and a max value (999999) as well (see docs here).
Also, the keyword ANY is not used in this fashion, especially since you are looking for outbound relations (outbound from the parent to a child).
FOR v IN 0..999999 OUTBOUND 'events/350'
ChildEvents
RETURN v
Finds node with _key = '350' from the 'events' collection
Returns the start node ('events/350')
Returns all child nodes of 'events/350' (all 999,648 of them)
You do not need to provide an exact number for the "max" value - this is simply an upper limit on the number of edges to traverse.
If this query does not work for you, then I suggest taking a close look at your edges. You might either have a gap or an incorrect direction (swapped _from & _to).
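To see what the min..max depth range does, here is a minimal Python sketch of a depth-bounded outbound walk over an edge list. It is a simplification and not how ArangoDB implements traversals: it walks level by level and does no uniqueness checks, which is enough for a chain like the one in the question. The function name outbound and the ten-node chain are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def outbound(edges: Iterable[Tuple[str, str]], start: str,
             min_depth: int, max_depth: int) -> List[str]:
    """Vertices reachable from start via _from -> _to within the depth range."""
    children: Dict[str, List[str]] = defaultdict(list)
    for src, dst in edges:
        children[src].append(dst)

    found: List[str] = []
    frontier = [start]
    depth = 0
    # Stop early when nothing is left to expand, so max_depth is only a cap.
    while frontier and depth <= max_depth:
        if depth >= min_depth:
            found.extend(frontier)
        frontier = [child for vertex in frontier for child in children[vertex]]
        depth += 1
    return found


# Hypothetical chain events/1 -> events/2 -> ... -> events/10
edges = [(f"events/{i}", f"events/{i + 1}") for i in range(1, 10)]

direct_only = outbound(edges, "events/4", 1, 1)
with_start = outbound(edges, "events/4", 0, 999999)
without_start = outbound(edges, "events/4", 1, 999999)
```

A range of 1..1 yields only the direct child, 0..999999 includes the start node, and 1..999999 returns the descendants only.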

Does Neo4j have a way to LIMIT collected data?

I have 2 types of nodes in my neo4j db: Users and Posts. Users relate to Posts as -[:OWNER]->
My aim is to query users by ids with their posts and posts should be limited (for example LIMIT 10). Is it possible to limit them using the COLLECT method and order by some parameter?
MATCH (c:Challenge)<-[:OWNER]-(u:User)
WHERE u.id IN ["c5db0d7b-55c2-4d6d-ade2-2265adee7327", "87e15e39-10c6-4c8d-934a-01bc4a1b0d06"]
RETURN u, COLLECT(c) as challenges
You can use slice notation to indicate you want to take only the first 10 elements of a collection:
MATCH (c:Challenge)<-[:OWNER]-(u:User)
WHERE u.id IN ["c5db0d7b-55c2-4d6d-ade2-2265adee7327", "87e15e39-10c6-4c8d-934a-01bc4a1b0d06"]
RETURN u, COLLECT(c)[..10] as challenges
Alternately you can use APOC's aggregation functions:
MATCH (c:Challenge)<-[:OWNER]-(u:User)
WHERE u.id IN ["c5db0d7b-55c2-4d6d-ade2-2265adee7327", "87e15e39-10c6-4c8d-934a-01bc4a1b0d06"]
RETURN u, apoc.agg.slice(c, 0, 10) as challenges
The APOC approach is supposed to be more efficient, but try out both first and see which works best for you.
EDIT
As far as sorting, that must happen prior to aggregation, so use a WITH on what you need, ORDER BY whatever, and then afterwards perform your aggregation.
If you don't see good results, we may need to make use of LIMIT, but since we want that per u instead of across all rows, you'd need to use that within an apoc.cypher.run() subquery (this would be an independent query executed per u, so we would be allowed to use LIMIT that way):
MATCH (u:User)
WHERE u.id IN ["c5db0d7b-55c2-4d6d-ade2-2265adee7327", "87e15e39-10c6-4c8d-934a-01bc4a1b0d06"]
CALL apoc.cypher.run("MATCH (c:Challenge)<-[:OWNER]-(u) WITH c ORDER BY c.name ASC LIMIT 10 RETURN collect(c) as challenges", {u:u}) YIELD value
RETURN u, value.challenges as challenges
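The "sort first, then aggregate, then slice" order described above can be sketched in plain Python. This is only an illustration of the data flow, not Neo4j or APOC code; the function name top_n_per_owner and the sample rows are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def top_n_per_owner(rows: Iterable[Tuple[str, str]], n: int) -> Dict[str, List[str]]:
    """rows are (user_id, challenge_name); keep the first n names per user."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    # Sorting happens before the grouping, like WITH ... ORDER BY before COLLECT.
    for owner, name in sorted(rows, key=lambda row: row[1]):
        grouped[owner].append(name)
    # Slicing each collected list corresponds to COLLECT(c)[..n].
    return {owner: names[:n] for owner, names in grouped.items()}


rows = [
    ("u1", "delta"),
    ("u2", "bravo"),
    ("u1", "alpha"),
    ("u1", "charlie"),
    ("u2", "echo"),
]
top = top_n_per_owner(rows, 2)
```

Note that the whole list is still collected per user before it is cut, which is why a per-user LIMIT inside a subquery can be cheaper on large data.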

Performance drops dramatically when levels get deeper in graph traversal

I've been working on a config management system using ArangoDB. It collects config data for some common software and streams it to a program that generates the relationships among those pieces of software based on some pre-defined rules and then saves the relations into ArangoDB. After the relations are established, I provide APIs to query the data. One important query generates the topology of the software. I use a graph traversal to generate the topology with the following AQL:
for n in nginx
  for v, e, p in 0..4 outbound n forward, dispatch, route, INBOUND deployto, referto, monitoron
    filter @domain in p.edges[0].server_name
    return {id: v._id, type: v.ci_type}
which can generate the following topology:
[Image: software relation topology]
This looks fine. However, it takes around 10 seconds to finish the query, which is not acceptable because the volume is not very large. I checked all the collections, and the largest one, the "forward" edge collection, only has around 28000 documents. So I did some tests:
I changed depth from 0..4 to 0..2 and it only takes 0.3 second to finish the query
I changed depth from 0..4 to 0..3, it takes around 3 seconds
for 0..4, it takes around 10 seconds
Since there is a server_name property on the "forward" edge, I added a hash index (server_name[*]), but according to the explain output ArangoDB doesn't use the index.
Any tips on how I can optimize the query? And why can't the index be used in this case?
Hope someone can help me out with this. Thanks in advance,
First of all, I have tried your query and I could see that for some reason the condition
filter @domain in p.edges[0].server_name
is not optimized correctly. This seems to be an internal issue with the optimization rule not being good enough. I will take a detailed look into this and try to make sure that it works as expected.
For this reason it is not yet able to use a different index for this case, and it will not short-circuit to abort the search on level 1 correctly.
I am very sorry for the inconvenience, as the way you did it should be the correct one.
To have a quick workaround for now you could split the first part of the query in a separate step:
This is the fast version of my modified query (which will not include the nginx, see slower version)
FOR n IN nginx
  FOR forwarded, e IN 1 OUTBOUND n forward
    FILTER @domain IN e.server_name
    /* At this point we only have the relevant first depth vertices */
    FOR v IN 0..3 OUTBOUND forwarded forward, dispatch, route, INBOUND deployto, referto, monitoron
      RETURN {id: v._id, type: v.ci_type}
This is a slightly slower version of my modified query (it preserves your output format, and I think it will still be faster than the one you are working with):
FOR tmp IN (
  FOR n IN nginx
    FOR forwarded, e IN 1 OUTBOUND n forward
      FILTER @domain IN e.server_name
      /* At this point we only have the relevant first depth vertices */
      RETURN APPEND([{id: n._id, type: n.ci_type}], (
        FOR v IN 0..3 OUTBOUND forwarded forward, dispatch, route, INBOUND deployto, referto, monitoron
          RETURN {id: v._id, type: v.ci_type}
      ))
)[**]
RETURN tmp
In addition, I can give some general advice:
(This will work after we have fixed the optimizer.) Usage of the index: ArangoDB uses statistics and assumptions about index selectivity (how well the index narrows down the data) to decide which index is better. In your case it may assume that the edge index is better than your hash index. You could try to create a combined hash index on ["_from", "server_name[*]"], which is more likely to have a better estimate than the edge index and could then be used.
In the example you have given, I can see that there is a "large" right part starting at the apppkg node. In the query this right part can be reached in two ways:
a) nginx -> tomcat <- apppkg
b) nginx -> varnish -> lvs -> tomcat <- apppkg
This means the query could walk through the subtree starting at apppkg multiple times (once for every path leading there). With a query depth of 4 and only this topology it does not happen, but if there are shorter paths this may also be an issue. If I am not mistaken, you are only interested in the distinct vertices in the graph and the path is not important, right? If so, you can add OPTIONS to the query to make sure that no vertex (and its dependent subtree) is analysed twice. The modified query would look like this:
for n in nginx
  for v, e, p in 0..4 outbound n forward, dispatch, route, INBOUND deployto, referto, monitoron
    OPTIONS {bfs: true, uniqueVertices: "global"}
    filter @domain in p.edges[0].server_name
    return {id: v._id, type: v.ci_type}
The change I made is adding options to the traversal:
bfs: true => We do a breadth-first search instead of a depth-first search. We only need this to make the result deterministic and to make sure that all vertices with a path of depth 4 are reached correctly.
uniqueVertices: "global" => Whenever a vertex is found in one traversal (so in your case for every nginx separately), it is flagged and will not be looked at again.
If you need the list of all distinct edges as well, you should use uniqueEdges: "global" instead of uniqueVertices: "global", which applies the uniqueness check at the edge level.
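The effect of global vertex uniqueness can be illustrated with a small breadth-first walk in Python. This is a rough sketch under assumptions, not ArangoDB's implementation: the adjacency map is a hypothetical topology loosely modelled on the two routes to apppkg mentioned above (the "db" node and the depth of 5 are invented), and the non-unique mode simply follows every path up to the depth limit.

```python
from collections import deque
from typing import Dict, List


def traverse(neighbors: Dict[str, List[str]], start: str,
             max_depth: int, unique_global: bool) -> List[str]:
    """Breadth-first walk returning every emitted vertex in visit order."""
    emitted: List[str] = []
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        vertex, depth = queue.popleft()
        emitted.append(vertex)
        if depth == max_depth:
            continue
        for nxt in neighbors.get(vertex, []):
            if unique_global:
                # A vertex that was already discovered is never expanded again.
                if nxt in seen:
                    continue
                seen.add(nxt)
            queue.append((nxt, depth + 1))
    return emitted


# Hypothetical graph: tomcat is reachable directly and via varnish -> lvs.
neighbors = {
    "nginx": ["tomcat", "varnish"],
    "varnish": ["lvs"],
    "lvs": ["tomcat"],
    "tomcat": ["apppkg"],
    "apppkg": ["db"],
}

per_path = traverse(neighbors, "nginx", 5, unique_global=False)
unique = traverse(neighbors, "nginx", 5, unique_global=True)
```

In this toy graph the per-path walk emits tomcat, apppkg and db twice, while the globally unique walk emits each vertex once; on a large subtree that difference is what drives the runtime.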

cypher pagination total result count

I have a monstrosity of a cypher query and I need to paginate the results of it. What I am trying to do is to get the total number of results before limit is done.
Here is my test graph: http://console.neo4j.org/?id=6hq9tj
I tried to use count(o) in all parts of the query but I always get the same result: 'total_count: 1'. Like in here: http://console.neo4j.org/?id=konr7. The result what I am trying to get should be: 'total_count: 6'.
I could always make another query just to count the results, but it makes no sense to execute two queries.
Can anyone please help me with this? Thanks!
Something like this should work:
MATCH (o:Brand)
WITH o
ORDER BY o.name
WITH collect({uuid:o.uuid, name:o.name}) AS brands, COUNT(distinct o.uuid) AS total
UNWIND brands AS brand_row
WITH total, brand_row
SKIP 5
LIMIT 5
RETURN COLLECT(brand_row) AS brands, total;
Note: this is untested, something similar worked for me. Also, not sure how performant it is.
The only way I've gotten this to work is by defining the query twice. I'm not sure what the impact on performance is; I would guess or hope the result is cached the first time. Be warned: this is not a real solution. As my comment on the question states, if you use an offset that is out of range, nothing is returned!
// first query only to get count
MATCH (x:Brand)
WITH count(*) as total
// query again to get results :(
MATCH (o:Brand)
WITH total, o
ORDER BY o.name SKIP 5 LIMIT 5
WITH total, collect({uuid:o.uuid, name:o.name}) AS brands
RETURN {total:total, brands:brands}
If anyone comes up with a better solution, I as well would love to see it, spent enough time trying to get this to work properly.
Slightly better solution that can handle offset out of range...
// first query to get results
MATCH (o:Brand)
WITH o
ORDER BY o.name SKIP 5 LIMIT 5
WITH collect({uuid:o.uuid, name:o.name}) AS brands
// then query again to get count
MATCH (x:Brand)
WITH brands, count(*) as total
RETURN {total:total, brands:brands}
But it's still two queries and isn't a valid answer to the original question.
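The behaviour the answers above are after (a total that survives even when the offset is out of range) can be stated compactly in Python. This is only a sketch of the intended result shape, not Cypher; the function name paginate and the brand names are made up for illustration.

```python
from typing import Dict, List, Union


def paginate(names: List[str], skip: int, limit: int) -> Dict[str, Union[int, List[str]]]:
    """Order the names, count them once, then cut out the requested page."""
    ordered = sorted(names)
    # The total is taken before slicing, like counting before SKIP/LIMIT.
    return {"total": len(ordered), "brands": ordered[skip:skip + limit]}


brands = ["Fiat", "Audi", "Ford", "BMW", "Kia", "Seat"]

page = paginate(brands, 5, 5)
out_of_range = paginate(brands, 10, 5)
```

With an offset beyond the end, the page is empty but the total is still reported, which is the case the first two-step query above fails to handle.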
