I'm collecting a large amount of data from a market data websocket stream. From this single stream I'm collecting two different types of events, each stored with its event date/time; they have no parent/child database relation. Because their data structures differ, they're being stored in their own respective MongoDB collections.
Once a certain amount of data has been stored (100k+ events), I'll run analysis on the events, but I'd like to do so in a fashion that simulates the original single stream of events ordered by time (rather than processing the two collection streams individually).
What I'd like to be able to do is make a single query from Mongoose, if possible, that joins both collections, sorts by date, and outputs a stream for memory-saving purposes. Performance is important here due to the number of events.
All the answers I've found while searching concern a parent/child aggregation of some sort, but since this isn't a user/userData-related segment of an application, I'm having trouble finding an answer.
Also, storing the data in two separate collections seems necessary since their fields are all different except for time. But would there be more pros than cons to keeping these events in a single collection, if it eliminates the need for this type of solution?
The data structure reasoning is slightly inverted. MongoDB is schemaless, and it's natural to have documents with different structures in the same collection.
That makes it easy to collect and analyse data, but it causes problems at the application level, since developers cannot rely on the data structure and have to validate it on every read.
Mongoose aims to solve this problem by introducing data structures at the application level and taking over the routine validation tasks. Sometimes a single collection stores multiple models, with a discriminator field used to decide which model to unmarshal each document into.
Having a single stream from multiple collections is the simplest part of the question; $unionWith does exactly that:
db.collection1.aggregate( [
  { $unionWith: "collection2" },
  { $sort: { time: 1 } }
] )
Unmarshalling the documents into Mongoose models will be a little more complex - you will need to do it manually, since the documents will have different structures.
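For instance, a rough sketch of how the merged stream could be consumed from Mongoose (this assumes Mongoose 6+, where .cursor() returns an async-iterable AggregationCursor, runs inside an async function, and uses made-up model names Event1/Event2 plus a made-up bid field to tell the two shapes apart):

// Sketch only: model names, the second collection name, and the discrimination
// field are assumptions, not taken from the question.
const cursor = Event1.aggregate([
  { $unionWith: 'collection2' },  // append every document from the second collection
  { $sort: { time: 1 } },         // order the combined stream by event time
])
  .allowDiskUse(true)             // the sort may exceed the 100MB in-memory limit
  .cursor();

for await (const doc of cursor) {
  // Aggregation cursors yield plain objects, so hydrate into the right model manually.
  const event = doc.bid !== undefined ? Event2.hydrate(doc) : Event1.hydrate(doc);
  // ...feed `event` into the analysis
}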
Sorting might be a problem, though. https://docs.mongodb.com/manual/core/aggregation-pipeline/#pipeline-operators-and-indexes says the query can benefit from an index on the "time" field as long as there are no $project, $unwind, or $group stages, but I would double-check whether that index can still be used after a $unionWith stage.
It would be much simpler to store the whole websocket stream in a single collection and use it straight from there.
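If you do go the single-collection route, Mongoose discriminators (the discriminator-field approach mentioned above) take care of the unmarshalling for you. A minimal sketch, with made-up event shapes standing in for the two websocket event types:

const mongoose = require('mongoose');
const { Schema } = mongoose;

// Base model: every event shares the time field and a `kind` discriminator.
const EventSchema = new Schema(
  { time: { type: Date, required: true, index: true } },
  { discriminatorKey: 'kind', collection: 'events' }
);
const Event = mongoose.model('Event', EventSchema);

// Two illustrative event shapes; use whatever fields your stream actually has.
const TradeEvent = Event.discriminator('Trade', new Schema({ price: Number, size: Number }));
const QuoteEvent = Event.discriminator('Quote', new Schema({ bid: Number, ask: Number }));

// Reading the combined stream back in time order yields fully-typed documents.
const stream = Event.find().sort({ time: 1 }).cursor();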
How can I use indexes in aggregate?
I saw this in the documentation: https://docs.mongodb.com/manual/core/aggregation-pipeline/#pipeline-operators-and-indexes
The $match and $sort pipeline operators can take advantage of an index when they occur at the beginning of the pipeline.
Is there any way to use an index when stages like $sort, $match, or $group are not at the beginning of the pipeline?
Please help me.
An index works by keeping a record of certain pieces of data that point to a given record in your collection. Think of it like having a novel, and then having a sheet of paper that lists the names of various people or locations in that novel with the page numbers where they're mentioned.
Aggregation is like taking that novel and transforming the different pages into an entirely different stream of information. You don't know where the new information is located until the transformation actually happens, so you can't possibly have an index on that transformed information.
In other words, it's impossible to use an index in any aggregation pipeline stage that is not at the very beginning because that data will have been transformed and MongoDB has no way of knowing if it's even possible to efficiently make use of the newly transformed data.
If your aggregation pipeline is too large to handle efficiently, then you need to limit the size of your pipeline in some way such that you can handle it more efficiently. Ideally this would mean having a $match stage that sufficiently limits the documents to a reasonably-sized subset. This isn't always possible, however, so additional effort may be required.
One possibility is generating "summary" documents that are the result of aggregating all new data together, then performing your primary aggregation pipeline using only these summary documents. For example, if you have a log of transactions in your system that you wish to aggregate, then you could generate a daily summary of the quantities and types of the different transactions that have been logged for the day, along with any other additional data you would need. You would then limit your aggregation pipeline to only these daily summary documents and avoid using the normal transaction documents.
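As a hedged illustration of the summary-document idea (the transactions collection and its createdAt, type, and amount fields are invented for the example), a scheduled job could roll each day up into a daily_summaries collection, and the main aggregation would then read only those summaries:

db.transactions.aggregate([
  // The leading $match can use an index on createdAt to pick out one day's data.
  { $match: { createdAt: { $gte: ISODate("2024-01-01"), $lt: ISODate("2024-01-02") } } },
  // Roll the day up by transaction type.
  { $group: {
      _id: { day: { $dateToString: { format: "%Y-%m-%d", date: "$createdAt" } }, type: "$type" },
      count: { $sum: 1 },
      total: { $sum: "$amount" }
  } },
  // Write or refresh the summary documents (requires MongoDB 4.2+ for $merge).
  { $merge: { into: "daily_summaries", whenMatched: "replace", whenNotMatched: "insert" } }
])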
An actual solution is beyond the scope of this question, however. Just be aware that this index-usage limitation is one you cannot avoid.
When defining a collection's data structure, how do I judge which structure is a good design? The decision will affect the performance of subsequent database access.
For example, when a single record looks like this:
{
  _id: 'a',
  index: 1,   // index 1~n
  name: 'john'
}
When n is large, the data volume will be large and writes will be frequent.
Should the collection store the data as flat, one-dimensional documents:
{
  _id: 'a',
  index: 1,
  name: 'john'
}
...
{
  _id: 'a',
  index: 99,
  name: 'jule'
}
Or a single composite, two-dimensional document with an embedded array:
{
  _id: 'a',
  info: [
    { index: 1, name: 'john' }, ..., { index: 99, name: 'jule' }
  ]
}
The composite two-dimensional document effectively reduces the number of documents, but its queries are less convenient to write, and I'm not sure whether it actually improves the performance of reading from or writing to the database - or whether the number of documents is the key factor affecting database performance.
"Better" means different things to different use cases. What works in your case might not necessarily work in other use cases.
Generally, it is better to avoid large arrays, due to:
MongoDB's document size limitation (16MB).
Indexing a large array is typically not very performant.
However, this is just a general observation and not a specific rule of thumb. If your data lends itself to an array-based representation and you're certain you'll never hit the 16MB document size, then that design may be the way to go (again, specific to your use case).
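To make the tradeoff concrete, here is a rough sketch of how each design is queried (collection names are invented; the field names come from the question). Both can be indexed, but the array version needs a multikey index, and pulling out a single entry requires a positional projection:

// Flat design: one document per entry.
db.items.createIndex({ index: 1 })
db.items.find({ index: 99 })

// Embedded-array design: one parent document holding all entries.
db.groups.createIndex({ "info.index": 1 })             // multikey index over the array
db.groups.find({ "info.index": 99 }, { "info.$": 1 })  // project only the matching element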
You may find these links useful to get started in schema design:
6 Rules of Thumb for MongoDB Schema Design: Part 1
Part 2
Part 3
Data Models
Use Cases
Query Optimization
Explain Results
I am new to Azure Cosmos DB using the DocumentDB API. I plan to model my data so that one document references another document. This is pretty straightforward, as described in Modeling document data. However, I would also like to separate the related documents into different collections (this decision is related to how the data are partitioned).
Edit 7/24/2017: In response to a comment wondering why I chose to use separate collections: The reasoning for separate collections mainly comes down to partition keys and read/write priorities. Since a certain partition key is required to be present in ALL documents in a collection, it makes sense to separate out the documents that the chosen partition key doesn't belong to. After much weighing of options, the partition key that I settled on was one that would optimize write speeds and evenly distribute my data across shards - but unfortunately it did not logically belong in my "Metadata" documents. Since there is a one-to-gazillion relationship between metadata and measurements, I chose to use a reference to the metadata in the measurements instead of embedding. And because metadata would rarely (or never) be appended to each measurement, I considered the expense of an additional round-trip to the DB a very low concern.
Since the reference is a "weak link" that is not verified by the database, is it possible and wise to store additional information, such as the collection name? That is, instead of having just a string id, we may use a kind of path?
Metadata document in collection "Metadata":
{
  "id": "metadata1",
  ...
}
Measurement document in collection "Measurements":
{
  "id": "measurement1",
  "metadata-id": "../Metadata/metadata1",
  ...
}
Then, when I parse the data in my application/script I know what collection and document to query.
Finally, I assume there are other/better ways to go about this and I welcome your suggestions (e.g. underscores instead of slashes; using a symbol to represent a collection, like $Metadata; etc.). Or is my use of relations spanning collections a code smell?
Thank you!
You're thinking about this the wrong way and incurring significantly more cost for an "optimization" that isn't necessary, since you're billed at the per-collection level. What you should be doing is picking a more generic partition key, something like key or partitionKey. The tradeoff here is that you'll need to ensure in your client application that you populate this property on all of your documents (it may lead to a duplicated value, but ultimately that's okay). Then you can continue to use whatever value you chose originally for your Measurements documents and set something different for your Metadata documents.
I've written about this extensively in some other answers here and I believe it's one of the biggest misunderstandings about using Cosmos effectively and at scale. It doesn't help that in many Cosmos examples they talk about picking a partitionKey like deviceId or postal code which implies that you're dealing with homogeneous documents.
Please refer to this question that I answered regarding homogeneous vs heterogeneous in documentdb. The biggest argument for this pattern is the new addition of Graph APIs in Cosmos which necessitate having many different types of entities in a single collection and supports exactly the use case you're describing minus the extra collections. Obviously when dealing with heterogeneous types there isn't going to be a single property present on all documents that is appropriate for a partition key which is why you need to go generic.
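As a sketch of what that looks like in practice (the type and partitionKey property names are illustrative, not prescribed by Cosmos DB), both document types live in the same collection, carry the generic partition key, and are told apart by a discriminator:

// Metadata document
{
  "id": "metadata1",
  "type": "Metadata",
  "partitionKey": "metadata1"
}

// Measurement document
{
  "id": "measurement1",
  "type": "Measurement",
  "partitionKey": "device42",      // the write-optimized value originally chosen for measurements
  "metadataId": "metadata1"        // plain reference; no cross-collection path needed
}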
What you're trying to do is feasible. The convention you use is not particularly important, as long as you can figure out the reference. Keep in mind though, that using this type of "relations" will be rather slow, because you need to fetch all documents from one collection and then fetch the related documents in a separate query. It can have a serious impact on your application.
Another possibility is to optimise your data for reading: you can embed the metadata document inside the other document. Your data will be duplicated, so if you update those documents, you will have to update them in both collections, but you'll probably write less often than you read (if that's not the case, this setup would be worse).
Your documents would look like this:
Metadata document in collection "Metadata":
{
  "id": "metadata1",
  ...
}
Measurement document in collection "Measurements":
{
  "id": "measurement1",
  "metadata": {
    "id": "metadata1",
    ...
  },
  ...
}
I'm starting a new website project and I would like to use DocumentDB as the database instead of a traditional RDBMS.
I will need to store two kinds of documents:
User documents, which will hold all the user data.
Survey documents, which will hold all data about surveys.
May I put both kinds in a single collection, or should I create one collection for each?
How you do this is totally up to you - it's a fairly broad question, and there are good reasons for combining, and good reasons for separating. But objectively, you'll have some specific things to consider:
Each collection has its own cost footprint (starting around $24 per collection).
Each collection has its own performance (RU capacity) and storage limit.
Documents within a collection do not have to be homogeneous - each document can have whatever properties you want. You'll likely want some type of identification property that you can query on, to differentiate document types, should you store them all in a single collection.
Transactions are collection-scoped. So, for example, if you're building server-side stored procedures and need to modify content across your User and Survey documents, you need to keep this in mind.
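For example, a minimal sketch of the identification-property idea using the current @azure/cosmos Node SDK, run inside an async function (the type property, database/container names, and document fields are all invented for the illustration):

const { CosmosClient } = require("@azure/cosmos");

const client = new CosmosClient(process.env.COSMOS_CONNECTION_STRING);
const container = client.database("app").container("documents");

// Both document kinds share one collection, tagged with a `type` property.
await container.items.create({ id: "u1", type: "user", name: "Alice", email: "alice@example.com" });
await container.items.create({ id: "s1", type: "survey", title: "Q3 feedback", ownerId: "u1" });

// Query only the survey documents.
const { resources: surveys } = await container.items
  .query("SELECT * FROM c WHERE c.type = 'survey'")
  .fetchAll();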
I'm planning to implement this schema in MongoDB. I have been doing some reading about schema design, and the notion was that whenever you structure your data like a relational database, you must be doing something wrong.
My questions:
What should I do when a collection's size grows larger than the 16MB limit?
The app_log data in the server_log collection might in some cases grow larger than 16MB, depending on how busy the server is.
I'm aware of the capped collection feature that I could use, but the requirement is to store all logs for 90 days.
Do you see any potential issues with my design?
Is it a good practice to have the application check collection size and create new collection by day / hour ..etc to accommodate log size growth?
Thanks
Your collection size is not restricted to 16MB; as one of the comments pointed out, you can check in the MongoDB manual that 16MB is the maximum document size. So there is no need to split the same class of data across different collections - in fact, it would be a major headache for you to do so :) One collection for users, one for your servers, and one for your server_logs. You can then create references from one collection to the next by using the id field.
Whether this is a good design or not will depend on your queries. In general, you want to avoid using joins in Mongo (they're still possible, but if you're doing a bunch of joins, you're using it wrong, and really should use a relational DB :-)
For example, if most of your queries are on the server_log collection and only use the fields in that collection, then you'll be fine. OTOH, if your server_log queries always need to pull in data from the server collection as well (say for example the name and userId fields), then it might be worth selectively denormalizing that data. That's a fancy way of saying, you may wish to copy the name and userId fields into your server_log documents, so that your queries can avoid having to join with the server collection. Of course, every time you denormalize, you add complexity to your application which must now ensure that the data is consistent across multiple collections (e.g., when you change the server name, you have to make sure you change it in the server_logs, too).
You may wish to make a list of the queries you expect to perform, and see if they can be done with a minimum of joins with your current schema. If not, see if a little denormalization will help. If you're getting to the point where either you need to do a bunch of joins or a lot of manual management of denormalized data in order to satisfy your queries, then you may need to rethink your schema or even your choice of DB.
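A small sketch of that selective denormalization in mongosh (the values are invented; serverName and userId are the fields copied over from the server collection, as described above):

// Stand-in for an existing server's _id.
const serverId = new ObjectId();

// Denormalized log entry: the fields most queries need travel with the log itself,
// so reads never have to reach back into the server collection.
db.server_log.insertOne({
  serverId,
  serverName: "web-01",          // duplicated from the server collection
  userId: "u42",                 // duplicated from the server collection
  level: "error",
  msg: "disk almost full",
  ts: new Date()
});

// The cost of duplication: renaming the server means updating the copies too.
db.server.updateOne({ _id: serverId }, { $set: { name: "web-01-eu" } });
db.server_log.updateMany({ serverId }, { $set: { serverName: "web-01-eu" } });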
what should I do when collection size gets larger than 16MB limit
In MongoDB there is no limit on collection size. The limit exists for each document: a document must not exceed 16 MB.
Do you see any potential issues with my design?
No, I don't see any issues with the above design.