I'm just getting started with arangodb and have gotten to my first real problem:
Is it possible to search for nodes connected to all of multiple others? This seems like a basic operation for a graph db but I just can't think of a solution.
For reference, if we take the 'knows' example graph I want to know which persons know Charlie AND Dave (which should be only Bob)
knows example graph (not allowed to embed images yet)
For now my best guess is to start a traversal for all of the "targets" and reduce and filter the response myself, is this really the only way?
EDIT:
OK, to further specify I have added another connection, eve knows dave too, but should NOT be returned since she does not know charlie
EDIT2:
So far I've come up with this query
FOR start IN ['persons/charlie', 'persons/dave']
LET knownBy = (FOR v,e,p IN 1 INBOUND start knows
RETURN v)
FOR p IN knownBy
COLLECT person = p
LET knows = (FOR v IN 1 OUTBOUND person._id knows
RETURN v._id)
FILTER knows ALL IN ['persons/charlie', 'persons/dave']
RETURN person
However, this feels a bit unnatural, getting the persons known by 'X' to get the persons that know 'X'... Also, the profiler shows that about a third of the time is used for optimizing the plan, there has to be a better solution, right?
If we take the knows example and only search for two connected vertices you can start a traversal on one target with a depth of 2 and filter that the third vertex on the path has to be the second target, then you can just return the second vertex on the path.
FOR v, e, p IN 2 ANY 'persons/charlie' knows
FILTER p.vertices[2]._id == 'persons/dave'
RETURN p.vertices[1]
If you search for more than two vertices, the following query should work well. It starts a traversal with a depth of 1 and collect every _id of the person neighbors and checks of your targets are all containing in the neighbors.
LET targets = ['persons/charlie','persons/dave']
FOR person IN persons
FILTER targets ALL IN FLATTEN((
FOR v, e, p IN 1 ANY person._id knows
RETURN v._id
))
RETURN person
Related
I am new to Arango and I am trying to understand the 'right' way to write some queries. I read (https://www.arangodb.com/docs/stable/graphs-traversals-using-traversal-objects.html) and (http://jsteemann.github.io/blog/2015/01/28/using-custom-visitors-in-aql-graph-traversals/), since they always popped up when searching for what I am trying to do. In particular, I have a graph where a given node has a single path (via a certain 'type' of edge) from that node to a leaf. Something like x -a-> y -a-> z. Where a is the edge type, and x,y,z are the nodes. These paths can be of arbitrary length. I would like to write an AQL query that returns the single 'longest' path from the starting node to the leaf node. I find that I always get every sub-path and would then have to do some post-processing. The traversal objects looked like they supplied a solution to this issue, but it seems they are now deprecated. Is there a correct way to do this in AQL? Is there a document that shows how to do what steemann does in his article, but only using AQL? Is there some great AQL documentation on graph queries other than what is on the arangodb site (all of which I have already read, including the graph presentation and the udemy course)? If not, I would be happy to write something to share with the community, but I am not sure yet how to do this myself, so I'd need some pointers to material that can get me started. Long, short, I'd just like to know how to run my query to find the path from node to leaf. However, I'd be happy to contribute once I see how things should be done without traversal-objects. Thank you for your help
Taking a traversal in OUTBOUND direction as example, you can do a second traversal with depth = 1 to check if you reached a leaf node (no more incoming edges).
Based on this information, the “short” paths can be filtered out.
Note that a second condition is required:
it is possible that the last node in a traversal is not a leaf node if the maximum traversal depth is exceeded.
Thus, you need to also let paths through, which contain as many edges as hops you do in the traversal (here: 5).
LET maxDepth = 5
FOR v, e, p IN 1..maxDepth OUTBOUND "verts/s" edges
LET next = (
FOR vv, ee IN OUTBOUND v edges
//FILTER ee != e // needed if traversing edges in ANY direction
LIMIT 1 // optimization, no need to check for more than a single neighbor
RETURN true // using a constant here, which is not actually used
)
FILTER LENGTH(next) == 0 || LENGTH(p.edges) == maxDepth
RETURN CONCAT_SEPARATOR(" -> ", p.vertices[*]._key)
Cross-post from https://groups.google.com/g/arangodb/c/VAo_i_1UHbo/m/ByIUTqIfBAAJ
First I have Hands = Set[Tuple[str,str]] to represents the suits and ranks of the card respectively( Hands = {("Diamonds", "4"),("Clubs","J"),...}). then I have to check if Hands contain straight flush combination(All 5 cards have the same suit in sequence.) I tried using for loop to check if all the cards have same suits but the problem is that I can't slice the element inside set. After that I am stumped. Is there a way to return a boolean that indicate whether variable Hands is straight flush?
Here is my code I have been working on
Hands = Set[Tuple[str,str]]
h = {("Diamonds", "Q"),("Diamonds","J"),("Diamonds","K"),("Diamonds","A"),("Diamonds","2")}
def is_sflush(h:Hands) -> bool:
for i in h:
if h[i][0] == h[i+1][0]: #This is where I am wrong and need help here
This sounds like a H/W problem, so not to give away the farm...
you have 2 checks to figure out: same suit and sequential. Do them separately.
For the "same suit", I recommend making a set of the suits from the cards (not the ranks), which you can do from a set comprehension. What will the size of that set tell you?
The sequential part is a bit more work. :) You might need an extra data structure that has the correct sequencing or position of the cards as something to compare against. Several strategies could work.
I have a graph representing levels in a game. Some examples of types of nodes I have are "Level", "Room", "Chair". The types are represented by VertexCollections where one collection is called "Level" etc. They are connected by edges from separate EdgeCollections, for example "Level" -roomEdge-> "Room" -chairEdge-> "Chair". Rooms can also contain rooms etc. so the depth is unknown.
I want to start at an arbitrary "Level"-vertex and traverse the whole subtree and find all chairs that belong to the level.
I'm trying to see if ArangoDB would work better than OrientDB for me, in OrientDB I use the query:
SELECT FROM (TRAVERSE out() FROM startNode) WHERE #class = 'Chair'
I have tried the AQL query:
FOR v IN 1..6 OUTBOUND #startVertex GRAPH 'testDb' FILTER IS_SAME_COLLECTION('Chair', v) == true RETURN v;
It does however seem to be executing much slower compared to the OrientDB query(~1 second vs ~0.1 second).
The code im using for the query is the following:
String statement = "FOR v IN 1..6 OUTBOUND #startVertex GRAPH 'testDb' FILTER IS_SAME_COLLECTION('Chair', v) == true RETURN v";
timer.start();
ArangoCursor<BaseDocument> cursor = db.query(statement, new MapBuilder().put("startVertex", "Level/"+startNode.getKey()).get(), BaseDocument.class);
timer.saveTime();
Both solutions are running on the same hardware without any optimization done, both databases are used "out of the box". Both cases use the same data (~1 million vertices) and return the same result.
So my question is if I'm doing things correctly in my AQL query or is there a better way to do it? Have I misunderstood the concept of VertexCollections and how they're used?
Is there a reason you have multiple collections for each entity type, e.g. one collection for Rooms, one for Levels, one for Chairs?
One option is to have a single collection that contains your entities, and you identify the type of entity it is with a type: "Chair" or type: "Level" key on the document.
You then have a single relationship collection, that holds edges both _to and _from the entity collection.
You can then start at a given node (for example a Level) and find all entities of type Chair that it is connected to with a query like:
FOR v, e, p IN 1..6 OUTBOUND Level_ID Relationship_Collection
FILTER p.vertices[-1].Type == 'Chair'
RETURN v
You could return v (final vertex) or e (final edge) or p (all paths).
I'm not sure you need to use a graph object, rather use a relationships collection that adds relationships to your entity collection.
Graphs are good if you need them, but not necessary for traversal queries. Read the documentation at ArangoDB to see if you need them, usually I don't use them as using a graph can slow performance down a little bit.
Remember to look at indexes, and use the 'Explain' feature in the UI to see how your indexes are being used. Maybe add a hash index to the 'Type' key.
I made relation graph two relationship, like if A knows B then B knows A, Every node has unique Id and Name along with other properties.. So my graph looks like
if I trigger a simple query
MATCH (p1:SearchableNode {name: "Ishaan"}), (p2:SearchableNode {name: "Garima"}),path = (p1)-[:NAVIGATE_TO*]-(p2) RETURN path
it did not give any response and consumes 100% CPU and RAM of the machine.
UPDATED
As I read though posts and from comments on this post I simplified the model and relationship. Now it ends up to
Each relationship has different weights, to simplify consider horizontal connections weight 1, vertical weights 1 and diagonal relations have weights 1.5
In my database there are more than 85000 nodes and 0.3 Million relationships
Query with shortest path is not ends up to some result. It stuck in the processing and CPU goes to 100%
im afraid you wont be able to do much here. your graph is very specific, having a relation only to closest nodes. thats too bad cause neo4j is ok to play around the starting point +- few relations away, not over whole graph with each query
it means, once, you are 2 nodes away, the computational complexity raises up to:
8 relationships per node
distance 2
8 + 8^2
in general, the top complexity for a distance n is
O(8 + 8^n) //in case all affected nodes have 8 connections
you say, you got like ~80 000 of nodes.this means (correct me if im wrong), the longest distance of ~280 (from √80000). lets suppose your nodes
(p1:SearchableNode {name: "Ishaan"}),
(p2:SearchableNode {name: "Garima"}),
to be only 140 hopes away. this will create a complexity of 8^140 = 10e126, im not sure if any computer in the world can handle this.
sure, not all nodes have 8 connections, only those "in the middle", in our example graph it will have ~500 000 relationships. you got like ~300 000, which is maybe 2 times less so lets supose the overal complexity for an average distance of 70 (out of 140 - a very relaxed bottom estimation) for nodes having 4 relationships in average (down from 8, 80 000 *4 = 320 000) to be
O(4 + 4^70) = ~10e42
one 1GHz CPU should be able to calculate this by:
-1000 000 per second
10e42 == 10e36 * 1 000 000 -> 10e36 seconds
lets supose we got a cluster of 100 10Ghz cpu serves, 1000 GHz in total.
thats still 10e33 * 1 000 000 000 -> 10e33seconds
i would suggest to just keep away from AllshortestPaths, and look only for the first path available. using gremlin instead of cypher it is possible to implement own algorithms with some heuristics so actually you can cut down the time to maybe seconds or less.
exmaple: using one direction only = down to 10e16 seconds.
an example heuristic: check the id of the node, the higher the difference (subtraction value) between node2.id - node1.id, the higher the actual distance (considering the node creation order - nodes with similar ids to be close together). in that case you can either skip the query or just jump few relations away with something like MATCH n1-[:RELATED..5]->q-[:RELATED..*]->n2 (i forgot the syntax of defining exact relation count) which will (should) actually jump (instantly skip to) 5 distances away nodes which are closer to the n2 node = complexity down from 4^70 to 4^65. so if you can exactly calculate the distance from the nodes id, you can even match ... [:RELATED..65] ... which will cut the complexity to 4^5 and thats just matter of miliseconds for cpu.
its possible im completely wrong here. it has been already some time im our of school and would be nice to ask a mathematician (graph theory) to confirm this.
Let's consider what your query is doing:
MATCH (p1:SearchableNode {name: "Ishaan"}),
(p2:SearchableNode {name: "Garima"}),
path = (p1)-[:NAVIGATE_TO*]-(p2)
RETURN path
If you run this query in the console with EXPLAIN in front of it, the DB will give you its plan for how it will answer. When I did this, the query compiler warned me:
If a part of a query contains multiple disconnected patterns, this
will build a cartesian product between all those parts. This may
produce a large amount of data and slow down query processing. While
occasionally intended, it may often be possible to reformulate the
query that avoids the use of this cross product, perhaps by adding a
relationship between the different parts or by using OPTIONAL MATCH
You have two issues going on with your query - first, you're assigning p1 and p2 independent of one another, possibly creating this cartesian product. The second issue is that because all of your links in your graph go both ways and you're asking for an undirected connection you're making the DB work twice as hard, because it could actually traverse what you're asking for either way. To make matters worse, because all of the links go both ways, you have many cycles in your graph, so as cypher explores the paths that it can take, many paths it will try will loop back around to where it started. This means that the query engine will spend a lot of time chasing its own tail.
You can probably immediately improve the query by doing this:
MATCH p=shortestPath((p1:SearchableNode {name:"Ishaan"})-[:NAVIGATE_TO*]->(p2:SearchableNode {name:"Garima"}))
RETURN p;
Two modifications here - p1 and p2 are bound to each other immediately, you don't separately match them. Second, notice the [:NAVIGATE_TO*]-> part, with that last arrow ->; we're matching the relationship ONE WAY ONLY. Since you have so many reflexive links in your graph, either way would work fine, but either way you choose you cut the work the DB has to do in half. :)
This may still perform not so great, because traversing that graph is still going to have a lot of cycles, which will send the DB chasing its tail trying to find the best path. In your modeling choice here, you usually shouldn't have relationships going both ways unless you need separate properties on each relationship. A relationship can be traversed in both directions, so it doesn't make sense to have two (one in each direction) unless the information that relationship is capturing is semantically different.
Often you'll find with query performance that you can do better by reformulating the query and thinking about it, but there's major interplay between graph modeling and overall performance. With the graph set up with so many bi-directional links, there will only be so much you can do to optimize path-finding.
MATCH (p1:SearchableNode {name: "Ishaan"}), (p2:SearchableNode {name: "Garima"}),path = (p1)-[:NAVIGATE_TO*]->(p2) RETURN path
Or:
MATCH (p1:SearchableNode {name: "Ishaan"}), (p2:SearchableNode {name: "Garima"}), (p1)-[path:NAVIGATE_TO*]->(p2) RETURN path
I have a device and for each device I wish to generate a string of the following format: XXXXXXXX. Each X is either B, G, or R. An example is GRBRRBRB. This gives me roughly 7000 keys to work with which is enough as I doubt I'll have more devices.
I was thinking I could generate them all before hand, and dump them in a file or something, and just get the next key available from that, but I wonder if there is a better way to do this.
I know there are better ways to do it if I don't need guaranteed uniqueness but I definitely need that so I'm not sure what the best way to do it is.
Treat it as a ternary representation of a number, where R=0, B=1, G=2. So when you're writing the nth ID, the first digit is R if n % 3 == 0, B if n % 3 == 1, G otherwise. The second digit is the same, except you're looking at (n / 3) % 3; then for the third digit look at (n / 3^2) % 3; etc.
If your devices have any unique sequential ID available, I would implement a deterministic algorithm for that 'unique string' retrieval.
I mean something like
GetDeviceRgbString(deviceid) { // deterministic algorithm returning appropriate value using device id to choose it }
Otherwise consider storing it somewhere (depends on your environment, you gave little data about it, but that may be file, database, ... ) and marking them as used, preferably keep the data about assigned device, you may need it once.
I'm going to assume you want them unique to identify them for some type of network (internet) activity. That being the case, I would have a web service that can take care of making sure each device is unique by handing out IDs. The software on each device would see if it has an ID (stored locally), and if not, request one from the web service.