I'm modeling my ArangoDB database, and my list of edge collections keeps growing. I could combine all of the edges into a single edge collection called relations with a type attribute.
It would certainly clean up my list of collections, but would it have any effect on my traversal queries, positive or negative?
You should add a vertex-centric index for the edge collection. This allows you to use a single edge collection without a big performance impact.
You essentially add an index on the "_from" or "_to" field combined with your "type" attribute.
If your traversal queries need both directions, you need two indexes: one on "_from" + "type" and one on "_to" + "type".
The example in the documentation suggests a skiplist index, but a hash index is probably the better choice here, because the "type" field contains discrete values.
https://docs.arangodb.com/3.2/Manual/Indexing/IndexBasics.html#vertex-centric-indexes
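In arangosh, such indexes could be created like this (a sketch, assuming your single edge collection is named relations and the discriminating attribute is named type):

```js
// hash index for outbound traversals filtered by type
db.relations.ensureIndex({ type: "hash", fields: [ "_from", "type" ] });

// a second one for inbound traversals, if you traverse both directions
db.relations.ensureIndex({ type: "hash", fields: [ "_to", "type" ] });
```

With these in place, a traversal that filters on the type attribute can use the vertex-centric index instead of scanning all edges of a vertex.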
I am new to ArangoDB. After reading the official documentation, I know that ArangoDB's graph feature uses an edge collection to define the relations between vertices, with the _from and _to attributes referring to the start and end vertex.
An index is created on these two attributes automatically for fast access.
With this structure, the performance of graph traversal will depend heavily on the efficiency of the index defined on the _from and _to attributes, but it seems that the index alone is not enough to support efficient graph traversal?
I had thought that, given a vertex, only a small subset of the vertices would be involved in a traversal (like traversing a linked list from a given node), but with the index and edge collection structure, the whole edge collection is involved in the query (although the index helps avoid scanning the whole collection from the first document to the last).
Also, when solving a practical problem with a graph, there may be many vertices to visit; even with the help of the index, this would inevitably cause bad performance.
So I would like to ask: with the edge collection structure, how does ArangoDB implement efficient graph traversal?
Update 2:
The original question is long; here is a short version:
In the City Graph, how do I query the cities that can be reached directly from Berlin via germanHighway edges? I don't want the internationalHighway edges.
Original Question:
I now use ArangoDB to store a graph, and I have a question about the data model design.
Take the knows_graph, for example, or the social graph.
My original idea was to design the collections like this: the document collection is person, and the edge collections are marriedWith and friendWith.
But when I want to query the people someone is marriedWith, I can't filter out the unwanted friendWith edges. (I'm not very familiar with AQL, so maybe this is not true.)
In contrast, the examples in the AQL documentation tend to define a more generic edge collection, for example relation in the social graph, and put the more specific type in an attribute, for example "type": "married" as an attribute of a relation edge.
Thus, in AQL I can use FILTER p.edges[0].type == 'married' to filter out the unwanted relations.
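A full traversal along those lines might look like this (a sketch, assuming the social example graph with its female/male vertex collections and its relation edge collection):

```aql
FOR v, e, p IN 1..1 ANY 'female/alice' GRAPH 'social'
    FILTER p.edges[0].type == 'married'
    RETURN v
```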
My question is:
Which method of data model design is better, or any suggestions for this?
Now I think that storing married as a type value may be more flexible, and easy to extend to student, neighbour, ... with a single relation edge collection.
Otherwise, many edge collections (isStudent, neighbourWith, ...) would have to be created.
Can AQL filter nodes by edge collection rather than by attribute? Maybe something like:
FILTER 'isStudent' edge
Update:
I just tried it: one edge definition can only be used between two vertex types.
For example, if an isFriend edge definition is used between person and dog vertices, then you can't use isFriend between dog and cat!
So many edge collections are needed.
For the original question:
If you have a finite, well-defined number of edge types, then using multiple edge collections is fine, especially if you expect to have a large number of edges of each type. If, on the other hand, you foresee having a large number of relationship types (friend, best friend, wife, etc.) and the number of relationships of each type is not huge, then a single edge collection with a type indicator is fine and may simplify things.
The only two ways I can think of filtering edges from a traversal are:
The IS_SAME_COLLECTION function. This tells you whether a document belongs to a particular collection. Keep an eye on performance if you use this on a big dataset, though.
Adding a type attribute to each edge collection that indicates what type of collection it is. Yes, it is basically a static field and a bit of a waste of space, but it works, and space is cheap nowadays.
Using anonymous graph traversals, where you define explicitly which edge collections to use.
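An anonymous traversal names the edge collections directly instead of a named graph, which makes the edge-type restriction implicit (a sketch, assuming a person vertex collection and a marriedWith edge collection as in the question):

```aql
FOR v IN 1..1 ANY 'person/alice' marriedWith
    RETURN v
```

Only edges from marriedWith are followed; friendWith edges never enter the traversal at all.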
Having said that, ArangoDB is a multi-model DB, and as such you could just ignore the traversal syntax and join the collections you need, which would work just fine as well. That is the great thing about multi-model DBs: you use them in whatever way you need.
In terms of your last update, you could check the edge collection by doing something like:
FILTER IS_SAME_COLLECTION('internationalHighway', e._id) == false
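Applied to the City Graph update, a query for the cities reachable directly from Berlin via germanHighway only could look like this (a sketch, assuming the routeplanner example graph with a germanCity/Berlin vertex):

```aql
FOR v, e IN 1..1 OUTBOUND 'germanCity/Berlin' GRAPH 'routeplanner'
    FILTER IS_SAME_COLLECTION('germanHighway', e._id)
    RETURN v
```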
I think the way to design the data model depends on your business. If your model is more or less stable and does not have many edge types, you can choose the many-edge-collections way; the edge collections then form a finite set.
But I don't know how to filter by edge collection names :-)
Otherwise, I think fewer edge collections and more attributes will be good.
I have ~10 different document types which share 10-15 common fields. But each document type has additional fields, 3 of them up to 30-40 additional fields.
I was considering to use a different mapping type for each document type. But if I correctly understand how mappings work, ElasticSearch will internally use one mapping with 150-200 fields. Because no document has a value for each field, I will end up with a lot of sparse data.
According to this article (Index vs. Type), ElasticSearch is (was?) not very good at dealing with sparse data, so that would be an argument for having a separate index for each document type. But some document types only have very few documents, so a separate index for them would be overkill.
My question: How bad are sparse documents? Or am I better off with a separate index for each type even though some indexes will only contain a few documents?
ElasticSearch will internally use one mapping with 150-200 fields. Because no document has a value for each field, I will end up with a lot of sparse data.
Yes, different types within an index share the same mapping structure. Each type just adds a "_type" field to every document, which is automatically used for filtering when searching on a specific type.
How bad are sparse documents?
Citing from Index Vs Type
Fields that exist in one type will also consume resources for documents of types where this field does not exist. This is a general issue with Lucene indices: they don’t like sparsity.
am I better off with a separate index for each type even though some indexes will only contain a few documents?
As you may be aware, each separate index has its own overhead, and types don't gel well with sparse documents.
I would suggest
Document types with a small number of documents (and a large number of sparse fields) should go to a separate index, obviously with the number of shards reduced to the lowest possible value, i.e. 1. Each index has 5 shards by default; if your number of docs is not that large, it doesn't make sense to use 5 shards, and reducing them lowers the load per search query.
Document Types having significant fields in common should go to the same index with different types. Depending on the total number of docs, you may like to increase the number of shards setting.
If some document types have a huge number of documents, you may like to create separate indices for them.
Keep in mind that you should keep a reasonable number of shards in your cluster, which can be achieved by reducing the number of shards for indices that don’t require a high write throughput and/or will store low numbers of documents.
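For instance, the per-index shard count from point 1 is set at index creation time (a sketch; the index name is just a placeholder):

```json
PUT /small_doc_type_index
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}
```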
There are various implications to choosing between an index and a type. It depends on the computing power of your nodes, how many documents each type will store, and so on.
If you say each index will contain only a few documents, then I would recommend going with types, because each index ends up creating separate shards, which would be overkill for such a small set of documents.
You could refer to this SO Answer as well.
Friends,
I am modeling a table in Cassandra which contains a map column. This map will contain dynamic values and will be updated very frequently for a given row (I will update by primary key).
Is this an anti-pattern, and which other options should I consider?
What you're trying to do is possibly what I described here.
The first big limitations that come to mind are the ones given by the specification:
64KB is the max size of an item in a collection
65536 is the max number of queryable elements inside a collection
Moreover, there are the problems described in other posts:
you can not retrieve part of a collection: even if internally each entry of a map is stored as a column, you can only retrieve the whole collection (this can lead to very poor performance)
you have to choose whether to create an index on keys or on values; both simultaneously are not supported
since maps are typed, you can't put mixed values inside: you have to represent everything as strings or bytes and then transform your data client side
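As a concrete illustration of those limitations (a sketch, with made-up table and column names):

```sql
CREATE TABLE users (
    id uuid PRIMARY KEY,
    attributes map<text, text>   -- typed: every key and every value must be text
);

-- updating a single map entry by primary key is cheap ...
UPDATE users SET attributes['email'] = 'a@example.com'
  WHERE id = 123e4567-e89b-12d3-a456-426655440000;

-- ... but a read always pulls back the whole map, never a single entry
SELECT attributes FROM users
  WHERE id = 123e4567-e89b-12d3-a456-426655440000;
```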
I personally consider this approach an anti-pattern for all these reasons: it provides a schema-less solution but reduces performance and introduces lots of limitations, like the ones on secondary indexes and typing.
HTH, Carlo
I would like to know what the best practices are for storing heterogeneous data in CouchDB. In MongoDB you have collections which help with the modelling of the data (i.e. typical usage is one document type per collection). What is the best way to handle this kind of requirement in CouchDB? Tagging documents with a _type field? Or is there some other method that I am not aware of?
The main benefit of Mongo's collections is that indexes are defined and calculated per collection. In the case of Couch you have even more freedom and flexibility to do that: each index is defined by a view, in map/reduce style. You limit the data that goes into the index by filtering it in the map function. Because of this flexibility, it is up to you how to distinguish which documents belong to which view.
If you really like the fixed Mongo-like style of dividing documents into a set of distinct partitions with separate indexes, just create a collection field and never mix two different collections in a single view. In my opinion, though, rejecting one of the only benefits of Couch over Mongo (where Mongo is in general the more powerful and flexible system) does not seem like a good idea.
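For example, a view that indexes only the documents of one "collection" could use a map function like this (a sketch; emit() is provided by CouchDB's view server, and the collection field name is an assumption):

```js
function (doc) {
  // only documents tagged as "person" contribute to this index
  if (doc.collection === "person") {
    emit(doc.name, null);
  }
}
```

Querying this view then behaves much like querying an indexed Mongo collection of person documents.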