Does neo4j have a way to LIMIT collected data? - node.js

I have 2 types of nodes in my neo4j db: Users and Posts. Users relate to Posts as -[:OWNER]->.
My aim is to query users by ids together with their posts, and the posts should be limited (for example, LIMIT 10). Is it possible to limit them using COLLECT and to order them by some parameter?
MATCH (c:Challenge)<-[:OWNER]-(u:User)
WHERE u.id IN ["c5db0d7b-55c2-4d6d-ade2-2265adee7327", "87e15e39-10c6-4c8d-934a-01bc4a1b0d06"]
RETURN u, COLLECT(c) as challenges

You can use slice notation to indicate you want to take only the first 10 elements of a collection:
MATCH (c:Challenge)<-[:OWNER]-(u:User)
WHERE u.id IN ["c5db0d7b-55c2-4d6d-ade2-2265adee7327", "87e15e39-10c6-4c8d-934a-01bc4a1b0d06"]
RETURN u, COLLECT(c)[..10] as challenges
Alternatively, you can use APOC's aggregation functions:
MATCH (c:Challenge)<-[:OWNER]-(u:User)
WHERE u.id IN ["c5db0d7b-55c2-4d6d-ade2-2265adee7327", "87e15e39-10c6-4c8d-934a-01bc4a1b0d06"]
RETURN u, apoc.agg.slice(c, 0, 10) as challenges
The APOC approach is supposed to be more efficient, but try out both first and see which works best for you.
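If you are calling this from Node.js, here is a minimal sketch using a recent version of the official neo4j-driver package; the connection URL, credentials, and session handling are placeholders to adapt to your environment:

const neo4j = require('neo4j-driver');

// Placeholder connection details -- adjust to your environment.
const driver = neo4j.driver('bolt://localhost:7687', neo4j.auth.basic('neo4j', 'password'));

async function getUsersWithChallenges(ids) {
  const session = driver.session();
  try {
    // Pass the ids as a parameter instead of inlining them into the query string.
    const result = await session.run(
      `MATCH (c:Challenge)<-[:OWNER]-(u:User)
       WHERE u.id IN $ids
       RETURN u, COLLECT(c)[..10] AS challenges`,
      { ids }
    );
    return result.records.map(record => ({
      user: record.get('u').properties,
      challenges: record.get('challenges').map(node => node.properties),
    }));
  } finally {
    await session.close();
  }
}

Passing the ids as a parameter also keeps the query plan cacheable on the server side.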
EDIT
As far as sorting goes, that must happen prior to aggregation, so use a WITH clause on what you need, ORDER BY the property you want, and then perform your aggregation afterwards.
If you don't see good results, we may need to make use of LIMIT, but since we want that per u instead of across all rows, you'd need to use it within an apoc.cypher.run() subquery (this would be an independent query executed per u, so we are allowed to use LIMIT that way):
MATCH (u:User)
WHERE u.id IN ["c5db0d7b-55c2-4d6d-ade2-2265adee7327", "87e15e39-10c6-4c8d-934a-01bc4a1b0d06"]
CALL apoc.cypher.run("MATCH (c:Challenge)<-[:OWNER]-(u) WITH c ORDER BY c.name ASC LIMIT 10 RETURN collect(c) as challenges", {u:u}) YIELD value
RETURN u, value.challenges as challenges

Related

Performance drops dramatically when levels get deeper in graph traversal

I've been working on a config management system using ArangoDB. It collects config data for some common software and streams it to a program that generates the relationships among those pieces of software based on some pre-defined rules, then saves the relations back into ArangoDB. After the relations are established, I provide APIs to query the data. One important query generates the topology of this software. I use graph traversal to generate the topology with the following AQL:
for n in nginx
    for v, e, p in 0..4 outbound n forward, dispatch, route, INBOUND deployto, referto, monitoron
    filter @domain in p.edges[0].server_name
    return {id: v._id, type: v.ci_type}
which can generate the following topology:
[image: software relation topology]
Which looks fine. However, it takes around 10 seconds to finish the query, which is not acceptable because the data volume is not very large. I checked all the collections, and the largest one, the "forward" edge collection, only has around 28,000 documents. So I did some tests:
I changed the depth from 0..4 to 0..2 and it only takes 0.3 seconds to finish the query.
I changed the depth from 0..4 to 0..3 and it takes around 3 seconds.
For 0..4, it takes around 10 seconds.
Since there is a server_name property on the "forward" edges, I added a hash index (server_name[*]), but from the explain execution plan it seems ArangoDB doesn't use the index.
Any tips on how I can optimize the query? And why can't the index be used in this case?
Hope someone can help me out with this. Thanks in advance.
First of all, I have tried your query and I could see that for some reason the:
filter @domain in p.edges[0].server_name
is not optimized correctly. This seems to be an internal issue with the optimizer rule not being good enough; I will take a detailed look into this and try to make sure that it works as expected.
For this reason it is not yet able to use a different index for this case, and it will not short-circuit to abort the search at depth 1 correctly.
I am very sorry for the inconvenience, as the way you did it should be the correct one.
As a quick workaround for now, you could split the first part of the query into a separate step:
This is the fast version of my modified query (which will not include the nginx nodes themselves; see the slower version below):
FOR n IN nginx
    FOR forwarded, e IN 1 OUTBOUND n forward
        FILTER @domain IN e.server_name
        /* At this point we only have the relevant first-depth vertices */
        FOR v IN 0..3 OUTBOUND forwarded forward, dispatch, route, INBOUND deployto, referto, monitoron
            RETURN {id: v._id, type: v.ci_type}
This is a slightly slower version of my modified query (preserving your output format; I think it will still be faster than the one you are working with):
FOR tmp IN (
    FOR n IN nginx
        FOR forwarded, e IN 1 OUTBOUND n forward
            FILTER @domain IN e.server_name
            /* At this point we only have the relevant first-depth vertices */
            RETURN APPEND(
                [{id: n._id, type: n.ci_type}],
                (
                    FOR v IN 0..3 OUTBOUND forwarded forward, dispatch, route, INBOUND deployto, referto, monitoron
                        RETURN {id: v._id, type: v.ci_type}
                )
            )
    )[**]
RETURN tmp
Beyond that, I can give some general advice:
(This will work after we fix the optimizer.) Usage of the index: ArangoDB uses statistics/assumptions about index selectivity (how well an index narrows down the data) to decide which index is better. In your case it may assume that the edge index is better than your hash index. You could try to create a combined hash index on ["_from", "server_name[*]"], which is more likely to have a better estimate than the edge index and could then be used.
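If you want to create that index from Node.js, a possible sketch with the arangojs driver (the collection name "forward" is taken from the question; the connection details and database name are placeholders):

const { Database } = require('arangojs');

// Placeholder connection details -- adjust to your environment.
const db = new Database({ url: 'http://localhost:8529', databaseName: 'config_mgmt' });

async function createCombinedIndex() {
  // Combined index over _from and the array values of server_name on the "forward" edges.
  // Newer ArangoDB versions implement this as a persistent index; the effect is the same.
  return db.collection('forward').ensureIndex({
    type: 'hash',
    fields: ['_from', 'server_name[*]'],
  });
}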
In the example you have given, I can see that there is a "large" right part starting at the apppkg node. In the query this right part can be reached in two ways:
a) nginx -> tomcat <- apppkg
b) nginx -> varnish -> lvs -> tomcat <- apppkg
This means the query could walk through the subtree starting at apppkg multiple times (once for every path leading there). With a query depth of 4 and only this topology that does not happen, but if there are shorter paths this may also become an issue. If I am not mistaken, you are only interested in the distinct vertices in the graph and the path itself is not important, right? If so, you can add OPTIONS to the query to make sure that no vertex (and its dependent subtree) is analysed twice. The modified query would look like this:
for n in nginx
for v,e,p in 0..4 outbound n forward, dispatch, route, INBOUND deployto, referto, monitoron
OPTIONS {bfs: true, uniqueVertices: "global"}
filter @domain in p.edges[0].server_name
return {id: v._id, type: v.ci_type}
The change I made is that I added OPTIONS to the traversal:
bfs: true => Means we do a breadth-first search instead of a depth-first search; we only need this to make the result deterministic and to make sure that all vertices within a path of depth 4 will be reached correctly.
uniqueVertices: "global" => Means whenever a vertex is found in one traversal (so in your case for every nginx separately) it is flagged and will not be looked at again.
If you need the list of all distinct edges as well, you should use uniqueEdges: "global" instead of uniqueVertices: "global", which will apply the uniqueness check at the edge level.
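If you are issuing the final query from Node.js, a minimal sketch with the arangojs driver follows; the connection settings, database name, and the value bound to @domain are assumptions for illustration:

const { Database } = require('arangojs');

// Placeholder connection details -- adjust to your environment.
const db = new Database({ url: 'http://localhost:8529', databaseName: 'config_mgmt' });

const query = `
  FOR n IN nginx
    FOR v, e, p IN 0..4 OUTBOUND n forward, dispatch, route, INBOUND deployto, referto, monitoron
      OPTIONS { bfs: true, uniqueVertices: "global" }
      FILTER @domain IN p.edges[0].server_name
      RETURN { id: v._id, type: v.ci_type }
`;

async function topologyFor(domain) {
  // @domain is supplied as a bind variable instead of being interpolated into the string.
  const cursor = await db.query(query, { domain });
  return cursor.all();
}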

Traversing the optimum path between nodes

In a graph where there are multiple paths from node (:A) to (:B) through node (:C), I'd like to extract the paths from (:A) to (:B) through nodes of type (c:C) where c.Value is maximum. For instance, connect all movies with only their oldest common actors.
match p=(m1:Movie) <-[:ACTED_IN]- (a:Actor) -[:ACTED_IN]-> (m2:Movie)
return m1.Name, m2.Name, a.Name, max(a.Age)
The above query returns the proper age for the oldest actor, but not always the correct name.
Conversely, I noticed that the following query returns both the correct age and name.
match p=(m1:Movie) <-[:ACTED_IN]- (a:Actor) -[:ACTED_IN]-> (m2:Movie)
with m1, m2, a order by a.age desc
return m1.name, m2.name, a.name, max(a.age), head(collect(a.name))
Would this always be true? I guess so.
Is there a better way to do the job without sorting, which may be costly?
You need to use ORDER BY ... LIMIT 1 for this:
match p=(m1:Movie) <-[:ACTED_IN]- (a:Actor) -[:ACTED_IN]-> (m2:Movie)
return m1.Name, m2.Name, a.Name, a.Age order by a.Age desc limit 1
Be aware that you basically want to do a weighted shortest path. Neo4j can do this more efficiently using Java code and the GraphAlgoFactory; see the chapter on this in the reference manual.
For those who want to do similar things, consider reading this post from @_nicolemargaret, which describes how to extract the n oldest actors acting in pairs of movies rather than just the first, as with head(collect()).

How to implement SUM with @QuerySqlFunction?

The examples I have seen so far that cover @QuerySqlFunction are trivial; I put one below. However, I'm looking for an example / solution / hint for providing a cross-row calculation, e.g. average, sum, ... Is this possible?
In the example, the function returns value 0 from an array, basically an implementation of ARRAY_GET(x, 0). All other examples I've seen are similar: one row, get a value, do something with it. But I need to be able to calculate the sum of a grouped result, or possibly a lot more business logic. If somebody could provide me with a QuerySqlFunction for SUM, I assume that would allow me to do much more than just SUM.
Step 1: Write a function
public class MyIgniteFunctions {
    @QuerySqlFunction
    public static double value1(double[] values) {
        return values[0];
    }
}
Step 2: Register the function
CacheConfiguration<Long, MyFact> factResultCacheCfg = ...
factResultCacheCfg.setSqlFunctionClasses(new Class[] { MyIgniteFunctions.class });
Step 3: Use it in a query
SELECT
MyDimension.groupBy1,
MyDimension.groupBy2,
SUM(VALUE1(MyFact.values))
FROM
"dimensionCacheName".DimDimension,
"factCacheName".FactResult
WHERE
MyDimension.uid=MyFact.dimensionUid
GROUP BY
MyDimension.groupBy1,
MyDimension.groupBy2
I don't believe Ignite currently has clean API support for a custom user-defined QuerySqlFunction that spans multiple rows.
If you need something like this, I would suggest that you make use of IgniteCompute APIs and distribute your computations, lambdas, or closures to the participating Ignite nodes. Then from inside of your closure, you can either execute local SQL queries, or perform any other cache operations, including predicate-based scans over locally cached data.
This approach will be executed across multiple Ignite nodes in parallel and should perform well.

cypher pagination total result count

I have a monstrosity of a Cypher query and I need to paginate its results. What I am trying to do is get the total number of results before the LIMIT is applied.
Here is my test graph: http://console.neo4j.org/?id=6hq9tj
I tried to use count(o) in all parts of the query, but I always get the same result: 'total_count: 1'. Like in here: http://console.neo4j.org/?id=konr7. The result I am trying to get should be: 'total_count: 6'.
I could always run another query just to count the results, but it makes no sense to execute two queries.
Can anyone please help me with this? Thanks!
Something like this should work:
MATCH (o:Brand)
WITH o
ORDER BY o.name
WITH collect({uuid:o.uuid, name:o.name}) AS brands, COUNT(distinct o.uuid) AS total
UNWIND brands AS brand_row
WITH total, brand_row
SKIP 5
LIMIT 5
RETURN COLLECT(brand_row) AS brands, total;
Note: this is untested; something similar worked for me. Also, I'm not sure how performant it is.
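If you are driving this from Node.js, the hard-coded SKIP 5 LIMIT 5 above can become parameters; a rough sketch with the official neo4j-driver package (connection details and page size are placeholder assumptions):

const neo4j = require('neo4j-driver');

// Placeholder connection details -- adjust to your environment.
const driver = neo4j.driver('bolt://localhost:7687', neo4j.auth.basic('neo4j', 'password'));

async function getBrandsPage(page, pageSize = 5) {
  const session = driver.session();
  try {
    const result = await session.run(
      `MATCH (o:Brand)
       WITH o
       ORDER BY o.name
       WITH collect({uuid: o.uuid, name: o.name}) AS brands, COUNT(DISTINCT o.uuid) AS total
       UNWIND brands AS brand_row
       WITH total, brand_row
       SKIP $skip
       LIMIT $limit
       RETURN COLLECT(brand_row) AS brands, total`,
      // SKIP/LIMIT expect integers, so wrap the values with neo4j.int().
      { skip: neo4j.int(page * pageSize), limit: neo4j.int(pageSize) }
    );
    const record = result.records[0];
    return record
      ? { total: record.get('total').toNumber(), brands: record.get('brands') }
      : { total: 0, brands: [] };
  } finally {
    await session.close();
  }
}

Note that this inherits the out-of-range caveat: if the requested page is past the end, no row comes back at all.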
The only way I've gotten this to work is by defining the query twice. I'm not sure what the impact on performance is; I would guess (or hope) the first part is cached. Be warned: as my comment on the question above states, this is not a real solution, because if you use an offset that is out of range, nothing is returned!
// first query only to get count
MATCH (x:Brand)
WITH count(*) as total
// query again to get results :(
MATCH (o:Brand)
WITH total, o
ORDER BY o.name SKIP 5 LIMIT 5
WITH total, collect({uuid:o.uuid, name:o.name}) AS brands
RETURN {total:total, brands:brands}
If anyone comes up with a better solution, I would love to see it as well; I've spent enough time trying to get this to work properly.
Here is a slightly better solution that can handle an offset out of range:
// first query to get results
MATCH (o:Brand)
WITH o
ORDER BY o.name SKIP 5 LIMIT 5
WITH collect({uuid:o.uuid, name:o.name}) AS brands
// then query again to get count
MATCH (x:Brand)
WITH brands, count(*) as total
RETURN {total:total, brands:brands}
But it's still two queries, and it isn't a valid answer to the original question.

Batch processing/updating MongoDB documents in Node.js

I would like to process/update every document in a MongoDB collection periodically (every 5 minutes or so) and save the results back to the DB. The update function requires actual code to execute on each document (as far as I know) because it needs to perform computations such as taking the difference between timestamps and taking exponents with Math.pow, which the standard MongoDB update operators do not cover.
What is the best way to do this in NodeJS?
Full context: I am trying to implement the Hacker News ranking algorithm, which is time-dependent. The discussion I've seen around this involves using a separate thread/process to periodically update the scores on documents.
Without wasting time on back-and-forth investigation, it seems you have fields that I will call points, a creation time created_date, and then the ycombinator result of (p - 1) / (t + 2)^1.5.
The easiest approach is to write a very short mongo shell script:
db.ycombinator.find().forEach(function(doc) {
    var diffMs = ISODate() - doc.created_date;                            // date subtraction in the mongo shell yields milliseconds
    var hours = diffMs / (1000 * 60 * 60);                                // plain JavaScript: convert milliseconds to hours
    var result = (doc.points - 1) / Math.pow(hours + 2, 1.5);             // perform the yc algorithm
    db.ycombinator.update({"_id": doc._id}, {$set: {"result": result}});  // write back into the same collection, field "result"
})
That goes into a file ycombinator_update.js, which is then run from a crontab entry every 5 minutes:
*/5 * * * * mongo ycombinator_update.js
The performance of your reads will be noticeably slower during the write operation, depending on the number of records in that collection.
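Since the question asks about Node.js specifically, here is a rough equivalent of the shell script above using the official mongodb driver and batched bulk writes; the connection string, database name, and batch size of 1000 are assumptions:

const { MongoClient } = require('mongodb');

async function updateScores() {
  // Placeholder connection string and database name -- adjust to your environment.
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();
  try {
    const coll = client.db('mydb').collection('ycombinator');
    let ops = [];
    const cursor = coll.find({}, { projection: { points: 1, created_date: 1 } });
    for await (const doc of cursor) {
      const hours = (Date.now() - doc.created_date.getTime()) / (1000 * 60 * 60);
      const result = (doc.points - 1) / Math.pow(hours + 2, 1.5);
      ops.push({ updateOne: { filter: { _id: doc._id }, update: { $set: { result } } } });
      if (ops.length === 1000) { // flush in batches instead of one round trip per document
        await coll.bulkWrite(ops);
        ops = [];
      }
    }
    if (ops.length > 0) await coll.bulkWrite(ops);
  } finally {
    await client.close();
  }
}

// Run every 5 minutes from within the app, as an alternative to cron.
setInterval(() => updateScores().catch(console.error), 5 * 60 * 1000);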
You could assign scores based on the document timestamp at lookup time, and only keep the raw timestamps in the database. Since the score is a function of the timestamp anyway, the scoring algorithm can incorporate the exponential decay logic on the unmodified data. Scores can be converted to timestamps if you need to search by score.
Another option that isn't represented here is MongoDB's MapReduce or Aggregation framework.
Both these frameworks provide a way to iterate over all elements in a collection and output some results into a different collection. The aggregation API does not directly include the primitives we need to calculate the 1.5 exponent in the HN algorithm (no $sqrt or $pow), but there is a workaround.
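For what it's worth, newer MongoDB servers close that gap: $pow was added in 3.2, and $set, $$NOW, and $merge in 4.2, so on a recent deployment the whole score computation can stay server-side. A hedged sketch, reusing the field and collection names from the shell script above:

// `coll` is a collection handle obtained as in the driver sketch earlier.
async function updateScoresServerSide(coll) {
  await coll.aggregate([
    {
      $set: {
        result: {
          $divide: [
            { $subtract: ['$points', 1] },
            {
              $pow: [
                // document age in hours, plus the constant 2 from the formula
                { $add: [{ $divide: [{ $subtract: ['$$NOW', '$created_date'] }, 1000 * 60 * 60] }, 2] },
                1.5,
              ],
            },
          ],
        },
      },
    },
    { $merge: { into: 'ycombinator', whenMatched: 'merge', whenNotMatched: 'discard' } },
  ]).toArray(); // draining the cursor triggers the $merge write
}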
I'm not certain at this point which approach is the most performant for this use case (and how it compares to the MongoDB shell script suggested by Gabe Rainbow).
I believe the next step is to run the update operations in a separate process, either scheduled with something like cron or kicked off by the node app itself using fork, with the following logic:
On request for front page:
# when did we last update the scores for the front page?
if last_update was within last X minutes:
return list sorted by score right away
else
fork a process to sort the front page
last_update := Date.Now
return list sorted by score (either right away [stale], or after the update completes [takes a while])
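A minimal sketch of that gating logic using Node's child_process.fork; the worker file name update_scores.js and the five-minute freshness window are hypothetical:

const { fork } = require('child_process');

const REFRESH_INTERVAL_MS = 5 * 60 * 1000; // the "X minutes" from the pseudocode above
let lastUpdate = 0;
let updating = false;

function maybeRefreshScores() {
  if (updating || Date.now() - lastUpdate < REFRESH_INTERVAL_MS) {
    return; // scores are fresh enough (or a refresh is already running); serve the current list
  }
  updating = true;
  lastUpdate = Date.now();
  const worker = fork('./update_scores.js'); // hypothetical worker that recomputes the scores
  worker.on('exit', () => { updating = false; });
}

// Call maybeRefreshScores() on each front-page request, then return the list
// sorted by score, possibly slightly stale, exactly as in the pseudocode above.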
