ArangoDB document arrays vs key/value collection

Are there limits to how many array values can be in a document, other than the document size? ArangoDB can index into arrays since version 2.8, so that's not a reason to move to a key/value collection format.
E.g.
group document with member array:
{ "_key": "group1", "members": [1, 2, 3, ...] }
Is there a limit to how large the members array can grow? Is it better to break this out into a key/value collection, e.g. { "group": "group1", "member": 1 }, for performance reasons?

There is no artificial limit in place for the number of array values or object keys in ArangoDB.
However, there are a few practical limits that you may want to consider:
the more array/object members you use in a document, the bigger the document grows byte-wise. The performance of reading and writing individual documents obviously depends on the document size, so the bigger the documents are, the slower these operations get and the more memory each individual document consumes during querying. This especially hurts with the RocksDB storage engine: due to the level design of RocksDB, each document revision may need to be moved through the various levels of the LSM tree and thus be copied/written several times.
looking up specific object keys inside a document normally uses a binary search, so its cost grows logarithmically with the number of object keys. However, the cost of a full iteration over all object keys or all array values grows linearly with the number of members.
when using huge documents from ArangoDB's JavaScript functionality, e.g. in ArangoDB's Foxx microservice framework, the documents need to be converted to plain JavaScript objects and arrays. The V8 JavaScript implementation used by ArangoDB behaves well for small and medium-sized objects/arrays, but it has problems with huge values. Apart from that, it may also limit the number of object keys/array members internally.
peeking into the middle of an array from an AQL query will normally not use any index. The same is true when querying for arbitrary object keys. For object keys it is possible to create an index on dedicated keys, but obviously the keys need to be known in advance.
All that said, you may still want to make sure that objects/arrays do not get excessively big, because otherwise performance and memory usage may degrade.
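For illustration, here is a minimal sketch of the array-index approach using the python-arango driver; the connection details and the "groups" collection are assumptions made for this example, not part of the original question or answer:

# Minimal sketch, assuming a local ArangoDB instance and a "groups" collection
# shaped like the example above ({"_key": "group1", "members": [1, 2, 3, ...]}).
from arango import ArangoClient

client = ArangoClient(hosts="http://localhost:8529")
db = client.db("_system", username="root", password="")  # adjust credentials

groups = db.collection("groups")

# Array index on the individual values of "members" (array indexing is
# available since ArangoDB 2.8).
groups.add_hash_index(fields=["members[*]"])

# Membership queries of this form can use the array index; peeking at a fixed
# position in the middle of the array would not.
cursor = db.aql.execute(
    "FOR g IN groups FILTER @member IN g.members RETURN g._key",
    bind_vars={"member": 2},
)
print(list(cursor))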

Related

mongodb, Impact of collection data structure on performance

When defining a collection's data structure, how do you judge which structure is a good design or decision? The choice will affect the performance of subsequent access to the database.
For example:
when a single record looks like this:
{
_id: 'a',
index: 1, // index 1~n
name: 'john'
}
When n is large, the amount of data becomes large and is written frequently.
Should the collection store one flat (one-dimensional) document per entry:
{
_id: 'a',
index: 1,
name: 'john'
}
.
.
.
{
_id: 'a',
index: 99,
name: 'jule'
}
Or a composite two-dimensional object:
{
_id: 'a',
info: [
{index: 1, name: 'john'}, ..., {index: 99, name: 'jule'}
]
}
The composite two-dimensional object effectively reduces the number of documents; however, the queries are less convenient to write, and it is unclear whether it actually helps or hurts the performance of searching or writing to the database.
Or is the number of documents the key factor affecting database performance?
"Better" means different things to different use cases. What works in your case might not necessarily work in other use cases.
Generally, it is better to avoid large arrays, due to:
MongoDB's document size limitation (16MB).
Indexing a large array is typically not very performant.
However, this is just a general observation and not a specific rule of thumb. If your data lends itself to an array-based representation and you're certain you'll never hit the 16MB document size, then that design may be the way to go (again, specific to your use case).
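As a rough illustration of the trade-off (the database, collection, and field names below are made up for this sketch, not taken from the question), the flat design keeps each document small and serves lookups from a compound index, while the embedded design keeps growing a single document toward the 16MB cap:

# Illustrative sketch with pymongo; all names are assumptions.
from pymongo import ASCENDING, MongoClient

db = MongoClient("mongodb://localhost:27017")["example"]

# Flat design: one small document per entry, compound index for lookups.
db.entries.create_index([("group", ASCENDING), ("index", ASCENDING)], unique=True)
db.entries.insert_one({"group": "a", "index": 1, "name": "john"})
db.entries.find_one({"group": "a", "index": 1})

# Embedded design: one document per group whose array grows with every $push.
db.groups.update_one(
    {"_id": "a"},
    {"$push": {"info": {"index": 1, "name": "john"}}},
    upsert=True,
)
# Reading a single element requires projecting into the array.
db.groups.find_one({"_id": "a"}, {"info": {"$elemMatch": {"index": 1}}})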
You may find these links useful to get started in schema design:
6 Rules of Thumb for MongoDB Schema Design: Part 1
Part 2
Part 3
Data Models
Use Cases
Query Optimization
Explain Results

Multiple indexes or multiple mapping types for sparse documents?

I have ~10 different document types which share 10-15 common fields. But each document type has additional fields, 3 of them with up to 30-40 additional fields.
I was considering using a different mapping type for each document type. But if I understand correctly how mappings work, ElasticSearch will internally use one mapping with 150-200 fields. Because no document has a value for each field, I will end up with a lot of sparse data.
According to this article (Index vs. Type), ElasticSearch is (was?) not very good at dealing with sparse data, so that would be an argument for having a separate index for each document type. But some document types only have very few documents, so it would be overkill to have a separate index for them.
My question: How bad are sparse documents? Or am I better off with a separate index for each type even though some indexes will only contain a few documents?
ElasticSearch will internally use one mapping with 150-200 fields. Because no document has a value for each field, I will end up with a lot of sparse data.
Yes, different types within an index share the same mapping structure. Each type just adds a “_type” field to every document, which is automatically used for filtering when searching on a specific type.
How bad are sparse documents?
Citing from Index Vs Type
Fields that exist in one type will also consume resources for documents of types where this field does not exist. This is a general issue with Lucene indices: they don’t like sparsity.
am I better off with a separate index for each type even though some indexes will only contain a few documents?
As you may be aware, each separate index has its own overhead, and types don't gel well with sparse documents.
I would suggest:
Document types with a small number of documents (and a large number of sparse fields) should go into a separate index, with the number of shards reduced to the minimum, i.e. 1. Each index has 5 shards by default; if your document count is not that large, it doesn't make sense to use 5 shards, and fewer shards reduce the load per search query.
Document types that have significant fields in common should go into the same index with different types. Depending on the total number of documents, you may want to increase the number of shards.
If some document types have a huge number of documents, you may want to create separate indices for them.
Keep in mind that you should keep a reasonable number of shards in your cluster, which can be achieved by reducing the number of shards for indices that don’t require a high write throughput and/or will store low numbers of documents.
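As a small sketch of the single-shard suggestion above (the index name and client setup are illustrative assumptions, using the 7.x-style Python Elasticsearch client API):

# Illustrative only: a dedicated single-shard index for a document type that has
# few documents but many sparse fields. The index name is an assumption.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="rare-doc-type",
    body={
        "settings": {
            "number_of_shards": 1,       # few documents, so one shard is enough
            "number_of_replicas": 1,
        }
    },
)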
There are various implications to choosing an index versus a type. It depends on the computing power of your nodes, how many documents each type will store, and so on.
If each index will contain only a few documents, then I would recommend going with types, because each index ends up creating separate shards, which would be overkill for a small set of documents.
You could refer to this SO Answer as well.

Document update resulting in collection greater than 10GB

How does DocumentDB handle the case when a document update results in exceeding the collection size (10 GB)? Say I have 50K documents in one of my collections and then I update all of the documents to include an additional JSON section that could exceed the collection size.
What are the best practices to handle this case, and is there built-in support for this scenario (e.g. moving a document to another collection)?
There's no specific best practice, but you have specific things built into DocumentDB to help you make proper decisions:
x-ms-resource-usage is a header returned on your queries. Among other things, collectionSize will report total consumption within your collection, including overhead from indexes, etc. You can compare that to collectionSize in the x-ms-resource-quota header returned (which should equate to 10GB), to know how much overhead you have remaining. There's a bit more detail in this answer.
The various language-level drivers provide partitioning support. When you realize you need to span multiple partitions, you can implement a partition resolver, to allow content to be written across multiple partitions. There are several answers covering partitioning thoughts, such as this one posted by Larry Maccherone. And the DocumentDB team published an article on partitioning, here.
You're probably aware already, but: you can check for HTTP 403, which is returned when you try to insert documents after exceeding the collection size. All error codes are documented here.
Regarding your question about moving documents to different collections: That's ultimately going to be your call whether to do this within your code or by taking advantage of partition resolvers.
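As a rough sketch of the quota check described in the first point above (the header strings below are placeholders, assuming the usual semicolon-separated key=value format of these headers; real values come from the response of an actual query):

# Illustrative parser for the x-ms-resource-usage / x-ms-resource-quota headers.
def parse_resource_header(header: str) -> dict:
    """Turn 'collectionSize=...;documentsCount=...' into a dict of ints."""
    pairs = (item.split("=", 1) for item in header.split(";") if item)
    return {key: int(value) for key, value in pairs}

# Placeholder header values, for illustration only.
usage = parse_resource_header("collectionSize=7340032;documentsCount=50000")
quota = parse_resource_header("collectionSize=10485760;documentsCount=-1")

remaining = quota["collectionSize"] - usage["collectionSize"]
print(f"Remaining collection capacity: {remaining} (same unit as the headers)")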

mongodb performance when updating/inserting subdocuments

I have a mongo database used to represent spreadsheets with three collections representing respectively cell values (row, col, value), cell formatting (row, col, object representing the format) and cell sizes (whether it's a row or column size, its index and the size).
Every document in all the collections also has a field to identify the table it refers to (containing the table's name) and I'm using upserts (mongoose's findOneAndReplace method with upsert:true) for all insertions/updates.
I was thinking of "pulling the schema inside out", by keeping a single collection representing the table and having the documents previously contained in the three collections as subdocuments inside it, as I thought it would make it more organized.
However, reading up on the subject of subdocuments, it looks like two queries would be needed for every insertion/update anyway (e.g., see this question). Therefore, I was wondering if the change I have in mind would lead to a performance hit (I guess upserts still need to do a search and then either update or insert, so that would still be two queries behind the scenes, but there might be some optimization I'm not aware of), and whether, in trying to simplify the schema, I would not only complicate the insertion/update procedures but also end up with lower performance. Thanks!
Yes, there is a performance hit. MongoDB has collection-level update locks. By keeping everything in a single collection you are ultimately limiting the number of concurrent update operations your application can perform, hence leading to decreased performance. The caveat to this is that it is totally dependent on how your application is doing the writes.
On the flip side, you could potentially save on read operations, as you'd need to query a single collection rather than three. However, scaling reads is easy compared to writes, and writes are typically the bottleneck, so it's hard to say whether that's worth it.
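For what it's worth, a minimal sketch of the flat-collection upsert path in Python (the collection and field names are assumptions that mirror the description above, not the actual schema):

# Illustrative upsert into a flat cell-values collection; names are assumptions.
from pymongo import ASCENDING, MongoClient, ReturnDocument

db = MongoClient("mongodb://localhost:27017")["spreadsheets"]

# Unique compound index so each (table, row, col) cell exists at most once.
db.cell_values.create_index(
    [("table", ASCENDING), ("row", ASCENDING), ("col", ASCENDING)],
    unique=True,
)

# The pymongo equivalent of mongoose's findOneAndReplace with upsert: a single
# round-trip that inserts the cell if missing or replaces it otherwise.
db.cell_values.find_one_and_replace(
    {"table": "budget", "row": 3, "col": 7},
    {"table": "budget", "row": 3, "col": 7, "value": 42},
    upsert=True,
    return_document=ReturnDocument.AFTER,
)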

Using intensive update in Map type column in Cassandra is anti-pattern?

Friends,
I am modeling a table in Cassandra which contains a Map column. This Map holds dynamic values and will be updated very frequently for a given row (I will update by primary key).
Is this an anti-pattern, and which other options should I consider?
What you're trying to do is possibly what I described here.
The first big limitations that come to mind are the ones given by the specification:
64KB is the max size of an item in a collection
65536 is the max number of queryable elements inside a collection
Beyond that, there are the problems described in another post:
you cannot retrieve part of a collection: even if internally each entry of a map is stored as a column, you can only retrieve the whole collection (this can lead to very slow performance)
you have to choose whether to create an index on keys or on values; both simultaneously are not supported.
Since maps are typed you can't put mixed values inside: you have to represent everything as a string or bytes and then transform your data client-side.
I personally consider this approach an anti-pattern for all these reasons: it provides a schema-less solution but reduces performance and introduces lots of limitations, like the ones around secondary indexes and typing.
HTH, Carlo
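One common alternative (a sketch of my own under assumptions, not from the answer above; the keyspace, table, and column names are made up) is to break the map out into a clustering column, so each entry becomes its own row and can be written or read individually:

# Illustrative sketch with the DataStax Python driver; all names are assumptions.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo_ks
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# Each former map entry becomes its own row, keyed by (id, map_key).
session.execute("""
    CREATE TABLE IF NOT EXISTS demo_ks.entries (
        id        text,
        map_key   text,
        map_value text,
        PRIMARY KEY (id, map_key)
    )
""")

# Update or insert a single entry without rewriting or rereading the whole "map".
session.execute(
    "INSERT INTO demo_ks.entries (id, map_key, map_value) VALUES (%s, %s, %s)",
    ("row-1", "color", "blue"),
)

# Retrieve one entry (or a contiguous slice of entries) for a given id.
row = session.execute(
    "SELECT map_value FROM demo_ks.entries WHERE id = %s AND map_key = %s",
    ("row-1", "color"),
).one()
print(row.map_value if row else None)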
