MongoDB: Impact of collection data structure on performance - Node.js

When defining a collection's data structure, how do you judge which structure is a good design decision? The choice will affect the performance of subsequent database access.
For example, when a single record looks like this:
{
  _id: 'a',
  index: 1, // index 1~n
  name: 'john'
}
When n is large, there will be a large amount of data, written frequently.
The collection's data structure could be a set of flat, one-dimensional documents:
{
  _id: 'a',
  index: 1,
  name: 'john'
}
.
.
.
{
  _id: 'b',
  index: 99,
  name: 'jule'
}
Or a single composite, two-dimensional document:
{
  _id: 'a',
  info: [
    {index: 1, name: 'john'}, ..., {index: 99, name: 'jule'}
  ]
}
The composite two-dimensional document effectively reduces the number of documents, but its queries are less convenient to write, and it is unclear whether it will actually reduce the efficiency of searching or writing to the database.
Or is the number of documents the key factor affecting database performance?

"Better" means different things to different use cases. What works in your case might not necessarily work in other use cases.
Generally, it is better to avoid large arrays, due to:
MongoDB's document size limitation (16MB).
Indexing a large array is typically not very performant.
However, this is just a general observation and not a specific rule of thumb. If your data lends itself to an array-based representation and you're certain you'll never hit the 16MB document size, then that design may be the way to go (again, specific to your use case).
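As a rough illustration only (the collection names 'people' and 'groups' are assumptions, not part of the question), the two designs are queried quite differently, and explain() can show how each behaves against your real data and indexes:

// Flat design: one document per entry; an index on { index: 1 } supports this directly.
db.people.find({ index: 99 });

// Embedded-array design: one document per group; project the matching array
// element, otherwise the whole (possibly large) document is returned.
db.groups.find(
  { 'info.index': 99 },
  { 'info.$': 1 }
);

// Compare the winning plans and the number of documents examined for your own data.
db.people.find({ index: 99 }).explain('executionStats');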
You may find these links useful to get started in schema design:
6 Rules of Thumb for MongoDB Schema Design: Part 1
Part 2
Part 3
Data Models
Use Cases
Query Optimization
Explain Results

Related

Single Node.js/Mongoose stream from multiple unrelated MongoDB collections

I'm collecting a large amount of data from a market data websocket stream. I'm collecting two different types of events from this single stream; they are stored with the event date/time and have no parent/child database relation. They're being stored in their own respective MongoDB collections due to the difference in their data structures.
Once a certain amount of data has been stored (100k+ events), I will be running analysis on the events, but I'd like to do so in a way that simulates the original single stream of events ordered by time (not by processing both collection streams individually).
What I'd like to be able to do is make a single query from Mongoose, if possible, that joins both collections, sorts by date, and outputs a stream for memory-saving purposes. So performance is important in this case, due to the number of events.
All the answers I've seen when searching for a solution involve a parent/child aggregation of some sort, but since this isn't a user/userData-related segment of an application, I'm having trouble finding an answer.
Also, storing the data in two separate collections seems necessary, since their fields are all different except for time. But... would there be more pros than cons to keeping these events in a single collection, if it eliminates the need for this type of solution?
The data structure reasoning is slightly inverted. MongoDB is schemaless, and it's natural to have documents with different structures in the same collection.
That makes it easy to collect and analyse data, but it causes problems at the application level, since developers cannot rely on a fixed data structure and have to validate it on every data retrieval.
Mongoose aims to solve this problem by introducing data structures at the application level and taking over all the routine validation tasks. Sometimes a single collection stores multiple models, with a discriminator field used to resolve which model to unmarshal each document into.
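As a minimal sketch of that pattern (the model and field names are assumptions, not from the question), Mongoose discriminators keep several event types in one collection and use a built-in discriminator key to decide how each document is unmarshalled:

const mongoose = require('mongoose');

// Base schema holds the shared 'time' field; 'kind' is stored on every document.
const eventSchema = new mongoose.Schema({ time: Date }, { discriminatorKey: 'kind' });
const Event = mongoose.model('Event', eventSchema);

// Two event types with different fields, backed by the same collection.
const Trade = Event.discriminator('Trade', new mongoose.Schema({ price: Number, qty: Number }));
const Quote = Event.discriminator('Quote', new mongoose.Schema({ bid: Number, ask: Number }));

// Event.find().sort({ time: 1 }) then hydrates each document into the right model.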
Having a single stream from multiple collections is the simplest part of the question; $unionWith does exactly that:
db.collection1.aggregate([
  { $unionWith: "collection2" },
  { $sort: { time: 1 } }
])
Unmarshalling the documents into Mongoose models will be a little more complex - you will need to do it manually, since the documents will have different structures.
Sorting might be a problem, though. https://docs.mongodb.com/manual/core/aggregation-pipeline/#pipeline-operators-and-indexes says the query can benefit from an index on the "time" field as long as there are no $project, $unwind, or $group stages, but I would double-check that the index can still be used after the $unionWith stage.
It would be much simpler to store the whole websocket stream in a single collection and use it straight from there.
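If you do keep two collections, here is a sketch of the streaming side (the Event1/Event2 model names, the 'events2' collection name and the discriminating 'bid' field are assumptions; it also assumes a Mongoose version whose aggregation .cursor() is async-iterable):

async function analyse() {
  const cursor = Event1.aggregate([
    { $unionWith: 'events2' },   // merge in the second collection
    { $sort: { time: 1 } }       // reproduce the original stream order
  ]).cursor();                   // stream results instead of buffering them all

  for await (const doc of cursor) {
    // Documents arrive as plain objects; unmarshal manually, for example by
    // checking a field that only one of the two event types carries.
    const Model = doc.bid !== undefined ? Event2 : Event1;
    const event = Model.hydrate(doc);
    // ...run the analysis on 'event'...
  }
}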

ArangoDB document arrays vs key/value collection

Are there limits to how many array values can be in a document, other than the document size? ArangoDB has been able to index into arrays since version 2.8, so that's not a reason to switch to a key/value collection format.
E.g.
group document with member array:
{'_key': 'group1', 'members': [1, 2, 3, ...]}
Is there a limit to how large the members array can be? Is it better to break this out into a key/value {group: 'group1', member: 1} collection for performance reasons?
There is no artificial limit in place for the number of array values or object keys in ArangoDB.
However, there are a few practical limits that you may want to consider:
the more array/object members you use in a document, the bigger the document will grow byte-wise. The performance of reading and writing individual documents obviously depends on the document size, so the bigger the documents are, the slower this will get and the more memory each individual document will consume during querying. This will especially hurt with the RocksDB storage engine, as due to the level design of RocksDB each document revision may need to be shoved through the various levels of the LSM tree and thus needs to be copied/written several times.
searching for specific object keys inside documents normally uses a binary search, so its performance degrades logarithmically with the number of object keys. However, a full iteration over all object keys or all array values will take time that grows linearly with the number of members.
when using huge documents from ArangoDB's JavaScript functionality, e.g. when using ArangoDB's Foxx microservice framework, the documents need to be converted to plain JavaScript objects & arrays. The V8 JavaScript implementation that is used by ArangoDB should behave well for small and medium-sized objects/arrays, but it has its problems with huge values. Apart from that it may also limit the number of object keys/array members internally.
peeking into the middle of an array from an AQL query will normally not use any index. The same is true when querying for arbitrary object keys. For object keys there is the possibility to create an index on dedicated keys, but obviously the keys need to be known in advance.
All that said, you may still want to make sure that objects/arrays do not get excessively big, because otherwise performance and memory usage may degrade.
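For the key/value alternative from the question, a minimal sketch (the arangojs driver, the 'memberships' collection and its {group, member} layout are assumptions; a persistent index on those two fields is implied):

const { Database, aql } = require('arangojs');

async function membersOf(groupKey) {
  const db = new Database({ url: 'http://localhost:8529' });
  // Each membership is its own small document, so no single document grows huge
  // and individual members can be added or read without touching the rest.
  const cursor = await db.query(aql`
    FOR m IN memberships
      FILTER m.group == ${groupKey}
      RETURN m.member
  `);
  return cursor.all();
}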

Cosmos DB: How to reference a document in a separate collection using DocumentDB API

I am new to Azure Cosmos DB using the DocumentDB API. I plan to model my data so that one document references another document. This is pretty straightforward, as described in Modeling document data. However, I also would like to separate the related documents into different collections (this decision is related to how the data are partitioned).
Edit 7/24/2017: In response to a comment wondering why I chose to use separate collections: The reasoning for separate collections mainly comes down to partition keys and read/write priorities. Since a given partition key is required to be present in ALL documents in a collection, it makes sense to separate out the documents to which the chosen partition key doesn't apply. After much weighing of options, the partition key that I settled on was one that would optimize write speeds and evenly distribute my data across shards - but unfortunately it did not logically belong in my "Metadata" documents. Since there is a one-to-gazillion relationship between metadata and measurements, I chose to use a reference to the metadata in the measurements instead of embedding. And because metadata would rarely (or never) be appended to each measurement, I considered the expense of an additional round trip to the DB a very low concern.
Since the reference is a "weak link" that is not verified by the database, is it possible and wise to store additional information, such as the collection name? That is, instead of having just a string id, could we use a kind of path?
Metadata document in collection "Metadata":
{
"id": "metadata1",
...
}
Measurement document in collection "Measurements":
{
"id": "measurement1",
"metadata-id" : "../Metadata/metadata1",
...
}
Then, when I parse the data in my application/script I know what collection and document to query.
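For illustration only (a hypothetical helper, assuming the '../Collection/id' convention above):

// Split the reference into the collection and document id to query.
function parseReference(ref) {
  const [, collection, id] = ref.split('/'); // ['..', 'Metadata', 'metadata1']
  return { collection, id };
}

parseReference('../Metadata/metadata1'); // { collection: 'Metadata', id: 'metadata1' }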
Finally, I assume there are other/better ways to go about this and I welcome your suggestions (e.g. underscores, not slashes; use a symbol to represent a collection, like $Metadata; etc.). Or is my use of relations spanning collections a code smell?
Thank you!
Edit: To the downvoter, can you please explain your reasoning? Is my question uninformed, unclear, or not useful? Why?
You're thinking about this the wrong way and incurring significantly more cost for an "optimization" that isn't necessary, since you're billed at the per-collection level. What you should be doing is picking a more generic partition key, something like key or partitionKey. The trade-off is that you'll need to ensure in your client application that you populate this property on all of your documents (it may lead to a duplicated value, but ultimately that's okay). Then you can continue to use whatever value you chose originally for your Measurements documents and set something different for your Metadata documents.
I've written about this extensively in some other answers here, and I believe it's one of the biggest misunderstandings about using Cosmos effectively and at scale. It doesn't help that many Cosmos examples talk about picking a partition key like deviceId or postal code, which implies that you're dealing with homogeneous documents.
Please refer to this question that I answered regarding homogeneous vs heterogeneous documents in DocumentDB. The biggest argument for this pattern is the recent addition of the Graph API in Cosmos, which necessitates having many different types of entities in a single collection and supports exactly the use case you're describing, minus the extra collections. Obviously, when dealing with heterogeneous types there isn't going to be a single property present on all documents that is appropriate for a partition key, which is why you need to go generic.
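A sketch of what that looks like at the document level (the 'type' discriminator and the specific partitionKey values are assumptions for illustration):

// Both document types live in one collection and share a generic 'partitionKey' property.
const metadataDoc = {
  id: 'metadata1',
  type: 'metadata',          // lets the application tell the document types apart
  partitionKey: 'metadata1'  // whatever value distributes metadata sensibly
};

const measurementDoc = {
  id: 'measurement1',
  type: 'measurement',
  partitionKey: 'device42-2017-07', // the write-optimized value chosen originally
  metadataId: 'metadata1'           // weak reference, resolved by the application
};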
What you're trying to do is feasible. The convention you use is not particularly important, as long as you can figure out the reference. Keep in mind, though, that using this type of "relation" will be rather slow, because you need to fetch all documents from one collection and then fetch the related documents in a separate query. It can have a serious impact on your application.
Another possibility is to optimise your data for reading: you can embed the metadata document inside the other document. Your data will be duplicated, so if you update those documents, you will have to update them in both collections, but you'll probably write less often than you read (if that's not the case, this setup would be worse).
Your documents would look like this:
Metadata document in collection "Metadata":
{
"id": "metadata1",
...
}
Measurement document in collection "Measurements":
{
"id": "measurement1",
"metadata" : {
"id": "metadata1",
...
},
...
}

mongodb performance when updating/inserting subdocuments

I have a Mongo database used to represent spreadsheets, with three collections representing, respectively, cell values (row, col, value), cell formatting (row, col, an object representing the format) and cell sizes (whether it's a row or column size, its index and the size).
Every document in all the collections also has a field to identify the table it refers to (containing the table's name) and I'm using upserts (mongoose's findOneAndReplace method with upsert:true) for all insertions/updates.
I was thinking of "pulling the schema inside out", by keeping a single collection representing the table and having the documents previously contained in the three collections as subdocuments inside it, as I thought it would make it more organized.
However, reading up on the subject of subdocuments, it looks like two queries would be needed for every insertion/update in any case (e.g., see this question). Therefore, I was wondering whether the changes I had in mind would lead to a performance hit (I guess upserts still need to do a search and then either update or insert, so that would still be two queries behind the scenes, but there might be some optimization I'm not aware of), and whether in trying to simplify the schema I would not only complicate the insertion/update procedures but also get lower performance. Thanks!
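For concreteness, a sketch of the two write paths (the Cell and Sheet model names and field names are assumptions, not from the question):

// Current flat design: one document per cell, upserted in a single call.
await Cell.findOneAndReplace(
  { table: 'sheet1', row: 2, col: 3 },
  { table: 'sheet1', row: 2, col: 3, value: 42 },
  { upsert: true }
);

// Single-collection design: cells live in an array inside the table document.
// Updating an existing cell needs an arrayFilters match; inserting a cell that
// doesn't exist yet still requires a separate $push, hence the extra query.
await Sheet.updateOne(
  { name: 'sheet1' },
  { $set: { 'cells.$[c].value': 42 } },
  { arrayFilters: [{ 'c.row': 2, 'c.col': 3 }] }
);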
Yes, there is a performance hit. MongoDB has collection-level update locks. By keeping everything in a single collection you are ultimately limiting the number of concurrent update operations your application can perform, hence leading to decreased performance. The caveat is that this totally depends on how your application is doing the writes.
On the flip side, you could potentially save on read operations, as you'd need to query a single collection rather than three. However, scaling reads is easy compared to writes, and writes are typically the bottleneck, so it's hard to say whether that's worth it.

Is intensive updating of a Map-type column in Cassandra an anti-pattern?

Friends,
I am modeling a table in Cassandra which contains a Map column. This Map will hold dynamic values and will be updated very frequently for a given row (I will update by primary key).
Is this an anti-pattern, and which other options should I consider?
What you're trying to do is possibly what I described here.
The first big limitations that come to mind are the ones given by the specification:
64KB is the max size of an item in a collection
65536 is the max number of queryable elements inside a collection
Beyond that, there are the problems described in another post:
you cannot retrieve part of a collection: even if internally each entry of a map is stored as a column, you can only retrieve the whole collection (this can lead to very poor performance)
you have to choose whether to create an index on keys or on values; both at the same time are not supported
since maps are typed, you can't put mixed values inside: you have to represent everything as a string or as bytes and then transform your data client-side
I personally consider this approach an anti-pattern for all these reasons - it provides a schemaless solution but reduces performance and introduces lots of limitations, such as the ones around secondary indexes and typing.
HTH, Carlo
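For comparison, a rough sketch of the usual alternative to a heavily updated map - one row per entry, using a clustering column - with the Node.js cassandra-driver (the table, keyspace and field names are assumptions):

const cassandra = require('cassandra-driver');
const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],
  localDataCenter: 'datacenter1',
  keyspace: 'demo'
});

// Schema, created once (e.g. in cqlsh): one row per entry, so a single entry
// can be written or read without touching the rest of the "map".
//   CREATE TABLE settings (id text, name text, value text,
//                          PRIMARY KEY (id, name));

async function setEntry(id, name, value) {
  await client.execute(
    'INSERT INTO settings (id, name, value) VALUES (?, ?, ?)',
    [id, name, value],
    { prepare: true }
  );
}

async function getEntry(id, name) {
  const rs = await client.execute(
    'SELECT value FROM settings WHERE id = ? AND name = ?',
    [id, name],
    { prepare: true }
  );
  return rs.first() ? rs.first().value : undefined;
}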
