We are building an app that is partly a social network, on top of ArangoDB. We are at the point where we need to decide how to construct our graphs. We have some questions but could not find anything relevant in the docs.
We will be creating some relationships between the users. There will be
Friend request edges
Friend edges
Close friend edges
Block edges
Mute edges etc
As a first option, we considered the SmartGraph functionality. However, we will not know the users' locations in advance, and even if we did, a user might relocate; since the location would be part of the shard key, it would be immutable (to our current understanding).
The second option is to create a separate named graph for each edge type: a friend request graph, a friend graph, etc.
The third option is to create a bigger named graph containing all the relationships (edges) and if we need a particular subset of this graph, to use anonymous graphs. However we cannot find any performance data comparing small graphs with large graphs.
Given that we cannot create multiple graphs with the same edges, we have to decide a priori which solution is the most performant and stick to it, since a possible change will result in changing all AQL queries (something we want to avoid when we are near release).
Which option would be the recommended one?
Related
Is the difference only logical / housekeeping related?
I did read this question but the answer there only deals with 1 edge definition vs multiple edge definitions within a graph which is now already covered in the documentation. So I'm curious.
I have used Arango for 6 years and don't use Graph objects; all my queries are plain AQL queries. You don't need a Graph to get the benefits of a graph database or to perform traversals.
The way I think of a 'Graph' in Arango is that it's a limited / curated view of your collections that is queryable, and that is also helpful if you want it to manage some level of integrity on deletes.
In my experience, using a Graph slows a traversal down, so I find it better to avoid them. A key driver for my decision is that I don't need views, and I don't need the system to handle the deletion of edges if I delete a connected vertex, but that's just my use case.
Update 2:
The original question is too long, so here is a simpler version:
In the City Graph, how do I query the cities that can be reached directly from Berlin via germanHighway? I don't want the internationalHighway edges.
Original Question:
I now use ArangoDB to store a graph. I have a question about the data model design.
Take the example graphs knows_graph or social_graph.
My original idea was to design two kinds of collections: a document collection person, and edge collections marriedWith and friendWith.
But when I want to query the persons who are marriedWith someone, I can't filter out the unwanted friendWith edges. (I'm not very familiar with AQL, so maybe this is not true.)
In contrast, the examples in the AQL documentation define a more generic edge collection, for example relation in social_graph, and put the more specific type in an attribute, for example "type":"married" as an attribute of a relation.
Thus in AQL I can use FILTER p.edges[0].type == 'married' to filter out the unwanted relations.
My question is:
Which method of data model design is better, or any suggestions for this?
Now I think that putting married in as a type attribute may be more flexible and easy to extend to student, neighbour... with a single relation edge collection.
Otherwise, many edge collections (isStudent, neighbourWith, ...) would have to be created.
Can AQL filter by edge type rather than by attributes? Maybe something like:
FILTER 'isStudent' edge
Update:
I just tried it: one relation can only be used between two node types.
For example, if an isFriend edge is used for person and dog nodes, then you can't use the isFriend edge for dog and cat!
So many edge collections are needed.
For the original question:
If you have a finite, well-defined number of edge types, then using multiple edge collections is fine, especially if you expect to have a large number of edges of each type. If, on the other hand, you foresee having a large number of relationship types (friend, best friend, wife, etc.) and the number of relationships of each type is not huge, then a single edge collection with a type indicator is fine and may simplify things.
The ways I can think of to filter edges in a traversal are:
The IS_SAME_COLLECTION function. This tells you whether a document belongs to a particular collection. Keep an eye on performance if you use this on a big dataset, though.
Adding a type attribute in each edge collection that indicates what type of collection this is. Yes, it is basically a static field and a bit of a waste of space, but it works, and space is cheap nowadays.
Using anonymous graph traversals, where you can explicitly define which edge collections to use.
Having said that, Arango is a multi-model DB, and as such you could just ignore the traversal syntax, and just join the tables that you need, which would work just fine as well. It is the great thing about multi-model DBs, you use them in any way you need them.
In terms of your last update, you could check the edge collection by doing something like:
FILTER IS_SAME_COLLECTION('internationalHighway', e._id) == false
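To make the difference between these filters concrete, here is a small plain-JavaScript model of a depth-1 traversal. It is only an illustration and does not use any ArangoDB API: the sample edges, `collectionOf` and `directNeighbours` are made up for this sketch. The one ArangoDB fact it relies on is that a document `_id` has the form `collection/key`, which is what a collection check like IS_SAME_COLLECTION looks at.

```javascript
// ArangoDB document ids look like "<collection>/<key>", so the collection
// an edge lives in is the part of _id before the slash.
function collectionOf(id) {
  return id.split('/')[0];
}

// Hypothetical sample edges, loosely modelled on the City Graph example.
var edges = [
  { _id: 'germanHighway/1', _from: 'germanCity/Berlin', _to: 'germanCity/Cologne' },
  { _id: 'germanHighway/2', _from: 'germanCity/Berlin', _to: 'germanCity/Hamburg' },
  { _id: 'germanHighway/3', _from: 'germanCity/Hamburg', _to: 'germanCity/Cologne' },
  { _id: 'internationalHighway/4', _from: 'germanCity/Berlin', _to: 'frenchCity/Paris' },
  { _id: 'internationalHighway/5', _from: 'germanCity/Berlin', _to: 'frenchCity/Lyon' }
];

// Hypothetical single "relation" edge collection with a type attribute.
var relations = [
  { _id: 'relation/1', _from: 'person/alice', _to: 'person/bob', type: 'married' },
  { _id: 'relation/2', _from: 'person/alice', _to: 'person/carol', type: 'friend' }
];

// Depth-1 outbound neighbours of `start`, keeping only the edges for which
// `keepEdge` returns true.
function directNeighbours(start, allEdges, keepEdge) {
  return allEdges
    .filter(function (e) { return e._from === start && keepEdge(e); })
    .map(function (e) { return e._to; });
}

// Many edge collections: filter on the collection the edge belongs to.
var viaGermanHighway = directNeighbours('germanCity/Berlin', edges, function (e) {
  return collectionOf(e._id) === 'germanHighway';
});

// One edge collection: filter on the type attribute instead.
var spouses = directNeighbours('person/alice', relations, function (e) {
  return e.type === 'married';
});

console.log(viaGermanHighway); // Cologne and Hamburg, no French cities
console.log(spouses);          // only bob
```

With an anonymous graph traversal the first filter is not needed at all, because only the listed edge collections are followed in the first place.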
I think the way to design the data model depends on your business. If your model is more or less stable and does not have many edge types, you can go with many edge collections, since the edge types form a finite set.
But I don't know how to filter by edge names :-)
Otherwise, I think fewer edge collections and more attributes will work well.
I am trying to learn about Blazegraph. At the moment I am puzzled how I can optimise simple lookups.
Suppose all my vertices have a property id, which is unique. This property is set by the user. Is there any way to speed up finding a vertex of a particular id while still sticking to the Tinkerpop APIs?
Is the search API defined here the only way?
My previous experience is in TitanDB and in Titan's case it's possible to define an index which the Tinkerpop APIs integrate with flawlessly. Is there any way to achieve the same results in Blazegraph without using the Search API?
Whether a mid-traversal V() uses an index or not depends on a)
whether a suitable index exists and b) whether the particular graph system
provider implemented this functionality.
Gremlin (TinkerPop) does not specify how to set indexes, although the documentation presents things like the following:
graph.createIndex("username",Vertex.class)
But that may be specific to the TinkerGraph implementation; as a matter of fact, the documentation says:
Each graph system will have different mechanism by which indices and
schemas are defined. TinkerPop3 does not require any conformance in
this area. In TinkerGraph, the only definitions are around indices.
With other graph systems, property value types, indices, edge labels,
etc. may be required to be defined a priori to adding data to the
graph.
There is an example for Neo4j:
TinkerPop3 does not provide method interfaces for defining
schemas/indices for the underlying graph system. Thus, in order to
create indices, it is important to call the Neo4j API directly.
But the code is very specific to that plugin:
graph.cypher("CREATE INDEX ON :person(name)")
Note that for Blazegraph the search uses a built-in full-text index.
In ArangoDB, there seem to be two sets of functions for working with graphs. On one side you have EDGES, NEIGHBORS, TRAVERSAL, SHORTEST_PATH and more (https://docs.arangodb.com/Aql/GraphFunctions.html).
OTOH there are the graph operations (https://docs.arangodb.com/Aql/GraphOperations.html) that seem to have the same functions prefixed by GRAPH and with some different parameters, such as GRAPH_EDGES, GRAPH_NEIGHBORS, GRAPH_TRAVERSAL, GRAPH_SHORTEST_PATH.
What is the difference between these? Are they used in different scenarios? Are there performance differences, etc.?
There is no general recommendation which to choose over the other - it depends on your requirements.
The EDGES functions can work on collections that are not managed by the graph module and thus may not be visible in the graph viewer (but you may also use them on collections that are managed). They have less overhead, however, because they do no graph management.
The GRAPH_EDGES family is the more recent implementation. It only works on managed graphs, which you can also browse in the graph viewer. As you already noted, the latter have many more options, e.g. to filter the graphs by examples.
With ArangoDB 3 the GRAPH_* family of functions was removed. We explain in this cookbook how their functionality can be achieved with AQL in ArangoDB 3.
I have several "Location documents" in my CouchDB with longitude and latitude fields. How can I find all location documents in the database whose distance to a provided latitude and longitude is less than a provided distance?
There is a way to achieve this using vanilla CouchDB, but it's a bit tricky.
You can use the fact that you can apply two map functions during one request. The second map function can be created using list mechanics.
Lists are not very efficient computationally, and unlike views they can't cache results. But they have one unique feature: you can pass several arguments into a list. Moreover, one of your arguments can be, for example, JS code that is eval-ed inside the list function (risky!).
So the entire scheme looks like this:
Make a view that performs a coarse search
Make a list that receives custom params and refines the data set
Make a client-side API to ease querying this chain.
I can't provide exact code for your particular case since many details are not clear, but it seems that the coarse search must group results into linearly enumerated squares, and the list then performs the more precise calculations.
Please note that this scheme might be inefficient for large datasets since it's computationally hungry.
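The coarse-then-refine scheme above can be sketched in plain JavaScript as follows. This is not CouchDB view or list code, just the two computations such a chain would need, under several assumptions: documents carry numeric `latitude` and `longitude` fields, the grid uses a made-up cell size of 1 degree, and the names `cellKey`, `cellsAround`, `approxDistanceKm` and `refine` are hypothetical. The 3x3 block of cells only covers the search circle when the radius is no larger than one cell, and the sketch ignores wrap-around at longitude 180.

```javascript
// Hypothetical cell size in degrees; choose it so that one cell is at least
// as wide as the largest search radius you expect.
var CELL_SIZE_DEG = 1;

// Coarse step (what the view would emit as its key): the grid cell
// containing a coordinate.
function cellKey(lat, lon) {
  return [Math.floor(lat / CELL_SIZE_DEG), Math.floor(lon / CELL_SIZE_DEG)];
}

// The 3x3 block of cells around a point; these are the keys to request.
function cellsAround(lat, lon) {
  var c = cellKey(lat, lon);
  var keys = [];
  for (var dLat = -1; dLat <= 1; dLat++) {
    for (var dLon = -1; dLon <= 1; dLon++) {
      keys.push([c[0] + dLat, c[1] + dLon]);
    }
  }
  return keys;
}

// Refine step (what the list would compute): distance in km using the
// equirectangular approximation, which is adequate for short distances.
function approxDistanceKm(a, b) {
  var rad = Math.PI / 180;
  var x = (b.lon - a.lon) * rad * Math.cos(((a.lat + b.lat) / 2) * rad);
  var y = (b.lat - a.lat) * rad;
  return 6371 * Math.sqrt(x * x + y * y); // 6371 km: mean Earth radius
}

// Keep only the candidate documents that are really within maxKm.
function refine(docs, center, maxKm) {
  return docs.filter(function (doc) {
    var point = { lat: doc.latitude, lon: doc.longitude };
    return approxDistanceKm(center, point) <= maxKm;
  });
}

// Usage with made-up documents: search within 50 km of central Berlin.
var candidates = [
  { name: 'Potsdam', latitude: 52.39, longitude: 13.06 },
  { name: 'Hamburg', latitude: 53.55, longitude: 9.99 }
];
var nearby = refine(candidates, { lat: 52.52, lon: 13.405 }, 50);
console.log(cellsAround(52.52, 13.405).length, nearby.map(function (d) { return d.name; }));
```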
Vanilla CouchDB isn't really built for geospatial queries.
Your best bet is to either use GeoCouch, CouchDB-Lucene or something similar.
Failing that, you could emit a Geohash from your map function, and do range queries over those.
Caveats apply. Queries around Geohash "fault lines" (equator, poles, longitude 180, etc.) can give too many or too few results.
There are multiple JavaScript libraries that can help convert to/from Geohash, as well as help with some of those caveats.
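For illustration, here is a minimal sketch of the standard Geohash encoding, written by hand rather than taken from one of those libraries. It assumes documents carry numeric `latitude` and `longitude` fields as in the question; `geohashEncode` and `mapLocation` are names made up for this example. Inside CouchDB the helper would have to be defined inside the map function, and `emit` is provided by the view server instead of being passed in.

```javascript
// Standard Geohash base-32 alphabet (digits plus letters without a, i, l, o).
var BASE32 = '0123456789bcdefghjkmnpqrstuvwxyz';

// Encode a coordinate as a Geohash string of `precision` characters.
function geohashEncode(lat, lon, precision) {
  var latRange = [-90, 90];
  var lonRange = [-180, 180];
  var hash = '';
  var bits = 0;       // bits collected for the current character
  var value = 0;      // 5-bit value of the current character
  var useLon = true;  // bits alternate, starting with longitude

  while (hash.length < precision) {
    var range = useLon ? lonRange : latRange;
    var coord = useLon ? lon : lat;
    var mid = (range[0] + range[1]) / 2;
    if (coord >= mid) {
      value = (value << 1) | 1; // upper half: append a 1 bit
      range[0] = mid;
    } else {
      value = value << 1;       // lower half: append a 0 bit
      range[1] = mid;
    }
    useLon = !useLon;
    bits += 1;
    if (bits === 5) {           // every 5 bits become one character
      hash += BASE32.charAt(value);
      bits = 0;
      value = 0;
    }
  }
  return hash;
}

// Shape of a map function that emits the hash as the view key.
function mapLocation(doc, emit) {
  if (typeof doc.latitude === 'number' && typeof doc.longitude === 'number') {
    emit(geohashEncode(doc.latitude, doc.longitude, 9), null);
  }
}

console.log(geohashEncode(57.64911, 10.40744, 5)); // "u4pru"
```

Because nearby points usually share a prefix, a range query such as startkey="u4pr"&endkey="u4pr\ufff0" returns the documents in that cell; to handle the fault lines mentioned above you would also query the neighbouring cells and then check the exact distance.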
CouchDB is not built for dynamic queries, so there is no good/fast way of implementing this in vanilla CouchDB.
If you know beforehand which locations you want to calculate the distance from, you could create a view for each location and call it with the parameters ?startkey=0&endkey=max_distance:
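function(doc) {
  // Placeholder: supply your own distance calculation here.
  function distance(a, b) { /* your function for calculating distance */ }
  // Reference point, roughly New York (western longitudes are negative).
  var NY = { lat: 40.7, lon: -74.0 };
  // The distance becomes the view key, so startkey/endkey select a distance range.
  emit(distance(NY, doc), doc._id);
}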
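One possible body for the distance placeholder is the haversine great-circle formula. The sketch below is a swapped-in suggestion, not necessarily what the answer had in mind; it assumes both arguments are objects with numeric `lat` and `lon` properties in degrees, and `haversineKm` and `toRadians` are names chosen for this example. If your documents use `latitude` and `longitude` fields, adapt the property names accordingly.

```javascript
var EARTH_RADIUS_KM = 6371; // mean Earth radius

function toRadians(degrees) {
  return degrees * Math.PI / 180;
}

// Great-circle distance in kilometres between two {lat, lon} points.
function haversineKm(a, b) {
  var dLat = toRadians(b.lat - a.lat);
  var dLon = toRadians(b.lon - a.lon);
  var h = Math.pow(Math.sin(dLat / 2), 2) +
          Math.cos(toRadians(a.lat)) * Math.cos(toRadians(b.lat)) *
          Math.pow(Math.sin(dLon / 2), 2);
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(h));
}

// One degree of latitude is roughly 111 km.
console.log(haversineKm({ lat: 0, lon: 0 }, { lat: 1, lon: 0 }));
```

The code sticks to ES5 syntax on purpose, since older CouchDB view servers may not support newer JavaScript features.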
If you do not know the locations beforehand you could solve it by using a temporary view, but I would strongly advise against it since it's slow and should only be used for testing.