I am new to ArangoDB. After reading the official documentation, I understand that ArangoDB's graph feature uses edge collections to define the relations between vertices through the _from and _to attributes, which refer to the start and end vertex.
An index is created on these two attributes automatically for fast access.
With this structure, the performance of graph traversal depends heavily on the efficiency of the index defined on the _from and _to attributes, but it looks like the index alone is not enough to support efficient graph traversal?
I had thought that, given a vertex, only a small subset of the vertices would be involved in a traversal (like traversing a linked list from a given node). With the index and edge collection structure, however, the whole edge collection is involved in the query (although the index helps avoid scanning the whole table from the first document to the last).
Also, when solving a practical problem with a graph, there may be many vertices to visit, and even with the help of the index this would inevitably cause bad performance.
So I would like to ask: with the edge collection structure, how does ArangoDB implement efficient graph traversal?
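To make the concern concrete, here is a toy Python model of an index on _from. It is only a sketch under my own assumptions (the edge data, `from_index` and `traverse` are made up for illustration) and says nothing about how ArangoDB is implemented internally. It just shows that with a hash-style lookup per vertex, a traversal touches the edges adjacent to the visited vertices rather than every edge in the collection.

```python
from collections import defaultdict, deque

# Hypothetical edge documents, shaped like ArangoDB edges (_from / _to).
edges = [
    {"_from": "v/a", "_to": "v/b"},
    {"_from": "v/a", "_to": "v/c"},
    {"_from": "v/b", "_to": "v/d"},
    {"_from": "v/x", "_to": "v/y"},  # unrelated part of the graph
]

# Toy stand-in for an index on _from: vertex id -> outgoing edges.
from_index = defaultdict(list)
for edge in edges:
    from_index[edge["_from"]].append(edge)


def traverse(start, max_depth):
    """Breadth-first outbound traversal.

    Returns the visited vertices in visit order and the number of
    edge documents that were actually looked at.
    """
    visited = [start]
    seen = {start}
    touched = 0
    queue = deque([(start, 0)])
    while queue:
        vertex, depth = queue.popleft()
        if depth == max_depth:
            continue
        # One index lookup returns only the edges leaving this vertex.
        for edge in from_index.get(vertex, []):
            touched += 1
            if edge["_to"] not in seen:
                seen.add(edge["_to"])
                visited.append(edge["_to"])
                queue.append((edge["_to"], depth + 1))
    return visited, touched


reachable, edges_touched = traverse("v/a", 2)
```

In this toy model the edge between v/x and v/y is never looked at, so the cost follows the size of the visited neighbourhood rather than the size of the edge collection.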
Is the difference only logical / housekeeping related?
I did read this question, but the answer there only deals with one edge definition versus multiple edge definitions within a graph, which is now covered in the documentation. So I'm curious.
I have used Arango for 6 years and don't use Graph objects. All my queries are plain AQL queries, so you don't need a Graph to get the benefits of a graph database or to perform traversals.
The way I think of a 'Graph' in Arango is that it's a limited / curated view of your collections that is query-able, but also is helpful if you want it to manage some level of integrity on deletes.
Overall, using a Graph slows down a traversal, so I find it better to avoid them. A key driver for my decision is that I don't need views, and I don't need the system to handle the deletion of edges when I delete a connected vertex, but that's just my use case.
Update 2:
The original question is too long; here is a simpler version:
In the City Graph, how do I query the cities that can be reached directly from Berlin by germanHighway? I don't want the internationalHighway edges.
Original Question:
I now use ArangoDB to store a graph. I have one question about the data model design.
Take the knows_graph or social_graph examples, for instance.
My original idea was to design two kinds of collections: a document collection, person, and edge collections such as marriedWith or friendWith.
But when I want to query the persons who are marriedWith someone, I can't filter out the unwanted friendWith edges. (I'm not very familiar with AQL, so maybe this is not true.)
In contrast, the examples in the AQL documentation define a more general edge collection, for example relation in social_graph, and put the more specific type in an attribute, for example "type":"married" as an attribute of a relation.
Thus, in AQL, I can use FILTER p.edges[0].type == 'married' to filter out the unwanted relations.
My question is:
Which method of data model design is better, or any suggestions for this?
Now I think storing married as a type attribute may be more flexible and easier to extend to student, neighbour... with a single relation edge collection.
Otherwise, many edge collections (isStudent, neighbourWith...) would have to be created.
Can AQL filter nodes by edge type rather than by attributes? Maybe something like:
FILTER 'isStudent' edge
Update:
I just tried it: one relation can only be used for two node types.
For example, if an isFriend edge is used for person and dog nodes, then you can't use the isFriend edge for dog and cat!
So many edge collections are needed.
For the original question:
If you have a finite, well-defined number of edge types, then using multiple edge collections is fine, especially if you expect a large number of edges of each type. If, on the other hand, you foresee a large number of relationship types (friend, best friend, wife, etc.) and the number of relationships of each type is not huge, then a single edge collection with a type indicator is fine and may simplify things.
The ways I can think of to filter edges in a traversal are:
The IS_SAME_COLLECTION function. This tells you whether a document belongs to a particular collection. Keep an eye on performance if you use it on a big dataset, though.
Adding a type attribute in each edge collection that indicates what type of collection it is. Yes, it is basically a static field and a bit of a waste of space, but it works and space is cheap nowadays.
Using anonymous graph traversals, where you can define explicitly which edge collections to use.
Having said that, Arango is a multi-model DB, so you could also ignore the traversal syntax and simply join the tables that you need, which would work just fine as well. That is the great thing about multi-model DBs: you can use them in whatever way you need.
In terms of your last update, you could check the edge collection by doing something like:
FILTER IS_SAME_COLLECTION('internationalHighway', e._id) == false
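To illustrate what that filter does, here is a small Python sketch. It relies on the fact that a document's _id has the form collection/key; the `is_same_collection` helper is a simplified imitation of the AQL function written for this example, and the highway edges are made-up sample data in the spirit of the City Graph.

```python
def is_same_collection(collection, doc_id):
    """Simplified imitation of AQL's IS_SAME_COLLECTION for an _id string."""
    return doc_id.split("/", 1)[0] == collection


# Hypothetical edges; the _id prefix is the edge collection name.
edges = [
    {"_id": "germanHighway/1", "_from": "germanCity/Berlin", "_to": "germanCity/Hamburg"},
    {"_id": "germanHighway/2", "_from": "germanCity/Berlin", "_to": "germanCity/Cologne"},
    {"_id": "internationalHighway/1", "_from": "germanCity/Berlin", "_to": "frenchCity/Paris"},
    {"_id": "germanHighway/3", "_from": "germanCity/Hamburg", "_to": "germanCity/Cologne"},
]

# Cities directly reachable from Berlin, skipping internationalHighway edges.
direct_from_berlin = [
    edge["_to"]
    for edge in edges
    if edge["_from"] == "germanCity/Berlin"
    and not is_same_collection("internationalHighway", edge["_id"])
]
```

The filter only inspects the collection part of each edge's _id, so no extra type attribute is needed on the edges.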
I think the way to design the data model depends on your business. If your model is more or less stable and does not have many edge types, you can choose the many-edge-collections approach, since the edge types form a finite set.
But I don't know how to filter by edge names :-)
Otherwise, I think fewer edge collections and more attributes will be good.
We are building an app that is partly a social network, on top of ArangoDB. We are at the point where we need to decide how to construct our graphs, and we have some questions but we could not find something relevant in the docs.
We will be creating some relationships between the users. There will be
Friend request edges
Friend edges
Close friend edges
Block edges
Mute edges etc
As a first option, we have considered using the SmartGraph functionality, however we will not know in advance the users’ locations, and even if we did the user might relocate and since their location will be part of the shard key, it will be immutable (to our current understanding).
The second option is to create a separate named graph for each edge: friend request graph, friend graph, etc
The third option is to create a bigger named graph containing all the relationships (edges) and if we need a particular subset of this graph, to use anonymous graphs. However we cannot find any performance data comparing small graphs with large graphs.
Given that we cannot create multiple graphs with the same edges, we have to decide a priori which solution is the most performant and stick to it, since a possible change will result in changing all AQL queries (something we want to avoid when we are near release).
Which option would be the recommended one?
I'm modeling out my ArangoDB database and the list of edge collections I've created is growing and growing. I could just combine all of the edges into a single edge collection called relations with a type parameter.
It would certainly clean up my list of tables but would it have any effect on my traversal queries? Would it have any positive or negative effects?
You should add a vertex-centric index for the edge collection. This allows you to use a single edge collection without a big performance impact.
Essentially, you add indexes on the "_from" or "_to" field combined with your type attribute.
If your traversal queries need both directions, you need two indexes: one on "_to" + "type" and one on "_from" + "type".
The example in the documentation just suggests a skiplist index, but you should probably use a hash index because the type field contains discrete values.
https://docs.arangodb.com/3.2/Manual/Indexing/IndexBasics.html#vertex-centric-indexes
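The idea behind such a combined index can be sketched in plain Python. This is only a toy model with made-up data and helper names (`out_index`, `in_index`, `neighbours`), not ArangoDB's index implementation: edges are keyed by the pair (vertex, type), so a lookup for one relationship type does not have to scan a vertex's edges of other types.

```python
from collections import defaultdict

# Hypothetical single edge collection with a "type" attribute.
relations = [
    {"_from": "person/alice", "_to": "person/bob", "type": "friend"},
    {"_from": "person/alice", "_to": "person/carol", "type": "married"},
    {"_from": "person/dave", "_to": "person/alice", "type": "friend"},
    {"_from": "person/alice", "_to": "person/erin", "type": "friend"},
]

# Toy equivalents of the two combined indexes, one per direction.
out_index = defaultdict(list)  # key: (_from, type)
in_index = defaultdict(list)   # key: (_to, type)
for edge in relations:
    out_index[(edge["_from"], edge["type"])].append(edge)
    in_index[(edge["_to"], edge["type"])].append(edge)


def neighbours(vertex, edge_type, direction="outbound"):
    """Return neighbour ids reached via edges of one type in one direction."""
    if direction == "outbound":
        return [e["_to"] for e in out_index.get((vertex, edge_type), [])]
    return [e["_from"] for e in in_index.get((vertex, edge_type), [])]


alice_friends_out = neighbours("person/alice", "friend")
alice_friends_in = neighbours("person/alice", "friend", direction="inbound")
alice_spouse = neighbours("person/alice", "married")
```

Each direction needs its own lookup structure, which mirrors the need for two separate indexes when traversals go both ways.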
Can someone tell me how to represent spatial data (coming from postgis) in Cassandra?
This presentation was pretty interesting, on the topic of spatial data in Cassandra, and may help:
http://www.readwriteweb.com/cloud/2011/02/video-simplegeo-cassandra.php
Please provide a bit more detail on what you are trying to achieve.
This is particularly important for Cassandra (as opposed to a relational database), because you need to model the data to support the specific queries you need, rather than modelling the domain in a fairly generic way and using SQL to define queries afterwards.
Are you just trying to look up lat/longs for entities with unique identifiers, or do you have more complex shapes associated with your entities - or what?
Responding to Mr. Roland (and hopefully the OP):
You'd need to come up with your own indexing scheme, and store the indexes in Cassandra.
For example, you could subdivide the space into squares (perhaps using a hierarchical structure such as a quadtree) and store each square in a Cassandra row, with the columns storing the objects that fall within the square. Your client code would need to determine the correct square for each lat,long, then look up the objects in that square (or squares) that cover the radius you desire, then do a final client-side filter to remove any objects that are just outside the radius due to them being stored in squares.
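A minimal sketch of the square-based scheme, in Python, with an in-memory dict standing in for Cassandra rows. Everything here is an assumption made for illustration: a flat grid instead of a quadtree, a fixed cell size of one degree, plain planar distance measured in degrees, and the helper names `cell_of`, `insert` and `query`.

```python
import math
from collections import defaultdict

CELL = 1.0  # cell size in degrees (arbitrary choice for this sketch)

# Stand-in for Cassandra: row key = grid cell, columns = objects in that cell.
rows = defaultdict(dict)


def cell_of(lat, lon):
    """Map a coordinate to the grid square that contains it."""
    return (math.floor(lat / CELL), math.floor(lon / CELL))


def insert(obj_id, lat, lon):
    rows[cell_of(lat, lon)][obj_id] = (lat, lon)


def query(lat, lon, radius):
    """Return ids within `radius` (in degrees, planar distance) of a point."""
    low = cell_of(lat - radius, lon - radius)
    high = cell_of(lat + radius, lon + radius)
    hits = []
    # Read every square that overlaps the bounding box of the search circle.
    for cell_lat in range(low[0], high[0] + 1):
        for cell_lon in range(low[1], high[1] + 1):
            for obj_id, (olat, olon) in rows.get((cell_lat, cell_lon), {}).items():
                # Final client-side filter: drop objects that share a square
                # with the circle but lie outside the radius.
                if math.hypot(olat - lat, olon - lon) <= radius:
                    hits.append(obj_id)
    return sorted(hits)


insert("a", 52.5, 13.4)
insert("b", 52.9, 13.9)
insert("c", 53.6, 10.0)
insert("e", 53.4, 14.3)  # in a covered square, but outside the radius
nearby = query(52.5, 13.4, 1.0)
```

For real distances you would replace the planar check with a great-circle formula, and a hierarchical structure such as a quadtree would let dense areas use smaller squares.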