mongodb performance when updating/inserting subdocuments - node.js

I have a mongo database used to represent spreadsheets with three collections representing respectively cell values (row, col, value), cell formatting (row, col, object representing the format) and cell sizes (whether it's a row or column size, its index and the size).
Every document in all the collections also has a field to identify the table it refers to (containing the table's name) and I'm using upserts (mongoose's findOneAndReplace method with upsert:true) for all insertions/updates.
I was thinking of "pulling the schema inside out", by keeping a single collection representing the table and having the documents previously contained in the three collections as subdocuments inside it, as I thought it would make it more organized.
However, reading up on the subject of subdocuments, it looks like in any case two queries would be needed for every insertion/update (e.g., see this question). Therefore, I was wondering whether the changes I had in mind would lead to a performance hit (I guess upserts still need to do a search and then either update or insert, so that would still be two queries behind the scenes, but there might be some optimization I'm not aware of), so that in trying to simplify the schema I would not only complicate the insertion/update procedures but also get lower performance. Thanks!

Yes, there is a performance hit. MongoDB has collection-level update locks (at least with the MMAPv1 storage engine; WiredTiger locks at the document level, but then every write to a table's subdocuments still contends for that one document). By keeping everything in a single collection you are ultimately limiting the number of concurrent update operations your application can perform, hence decreasing performance. The caveat is that this depends entirely on how your application does its writes.
On the flip side, you could potentially save on read operations, as you'd need to query a single collection rather than three. However, scaling reads is easy compared to scaling writes, and writes are typically the bottleneck, so it's hard to say whether that's worth it.
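To make the trade-off concrete, here is a minimal sketch of what the write path might look like in each layout. It only builds the filter and update documents and does not talk to a database. The field names `table`, `name`, `row`, `col`, `value` and `cells` are assumptions for illustration, not the asker's actual schema.

```javascript
// Layout A: one document per cell in its own collection.
// A single upsert (e.g. findOneAndReplace with { upsert: true }) is enough.
function cellUpsert(table, row, col, value) {
  return {
    filter: { table, row, col },
    replacement: { table, row, col, value },
    options: { upsert: true },
  };
}

// Layout B: cells embedded as an array of subdocuments in one table document.
// Updating an existing cell and adding a missing one are two different
// operations, which is where the "two queries" come from.
function embeddedCellWrites(table, row, col, value) {
  return {
    // Step 1: update the matching array element via the positional operator.
    updateExisting: {
      filter: { name: table, cells: { $elemMatch: { row, col } } },
      update: { $set: { 'cells.$.value': value } },
    },
    // Step 2: only if step 1 matched nothing, append the cell.
    pushIfMissing: {
      filter: { name: table, cells: { $not: { $elemMatch: { row, col } } } },
      update: { $push: { cells: { row, col, value } } },
    },
  };
}
```

In layout B both operations target the same table document, so concurrent edits to different cells of one spreadsheet end up serialized on that document.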

Related

MongoDB schema design

I'm planning to implement this schema in MongoDB, I have been doing some readings about schema design, and the notion was whenever you structure your data like a relational database you must be doing something wrong.
My questions:
what should I do when collection size gets larger than 16MB limit?
app_log in server_log collections might in some cases grow larger than 16MB, depending on how busy the server is.
I'm aware of the cap feature that I could use, but the requirement is store all logs for 90 days.
Do you see any potential issues with my design?
Is it a good practice to have the application check collection size and create new collection by day / hour ..etc to accommodate log size growth?
Thanks
Your collection size is not restricted to 16MB. As one of the comments pointed out, you can check in the MongoDB manual that 16MB is the maximum document size. So there is no need to split the same class of data across different collections; in fact, it would be a major headache for you to do so :) Use one collection for users, one for your servers and one for your server_logs. You can then create references from one collection to the next by using the id field.
Whether this is a good design or not will depend on your queries. In general, you want to avoid using joins in Mongo (they're still possible, but if you're doing a bunch of joins, you're using it wrong, and really should use a relational DB :-)
For example, if most of your queries are on the server_log collection and only use the fields in that collection, then you'll be fine. OTOH, if your server_log queries always need to pull in data from the server collection as well (say for example the name and userId fields), then it might be worth selectively denormalizing that data. That's a fancy way of saying, you may wish to copy the name and userId fields into your server_log documents, so that your queries can avoid having to join with the server collection. Of course, every time you denormalize, you add complexity to your application which must now ensure that the data is consistent across multiple collections (e.g., when you change the server name, you have to make sure you change it in the server_logs, too).
You may wish to make a list of the queries you expect to perform, and see if they can be done with a minimum of joins with your current schema. If not, see if a little denormalization will help. If you're getting to the point where either you need to do a bunch of joins or a lot of manual management of denormalized data in order to satisfy your queries, then you may need to rethink your schema or even your choice of DB.
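The selective denormalization described above can be sketched as follows. This is only an illustration: plain in-memory arrays stand in for the `server` and `server_log` collections, and the helper names `addLog` and `renameServer` as well as the copied field name `serverName` are hypothetical.

```javascript
// In-memory stand-ins for the `server` and `server_log` collections.
const servers = [{ _id: 1, name: 'web-01', userId: 42 }];
const serverLogs = [];

// Copy the frequently needed server fields into each log entry at write time,
// so that log queries never need to look at the server collection.
function addLog(serverId, message) {
  const server = servers.find((s) => s._id === serverId);
  if (!server) throw new Error('unknown server ' + serverId);
  const entry = { serverId, serverName: server.name, userId: server.userId, message };
  serverLogs.push(entry);
  return entry;
}

// The cost of denormalizing: a rename must also touch every copied value.
function renameServer(serverId, newName) {
  const server = servers.find((s) => s._id === serverId);
  if (!server) throw new Error('unknown server ' + serverId);
  server.name = newName;
  let touched = 0;
  for (const log of serverLogs) {
    if (log.serverId === serverId) {
      log.serverName = newName;
      touched++;
    }
  }
  return touched;
}
```

Reads become a single-collection query, while every rename turns into a multi-document update that the application has to get right.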
what should I do when collection size gets larger than 16MB limit
In MongoDB there is no limit on collection size. The limit applies to each document: a single document must not exceed 16 MB.
Do you see any potential issues with my design?
No issue with above design

Mongo DB relations between documents in different collections

I'm not yet ready to let this go, which is why I re-thought the problem and edited the Q (original below).
I am using mongoDB for a weekend project and it requires some relations in the DB, which is what the misery is all about:
I have three collections:
Users
Lists
Texts
A user can have texts and lists - lists 'contain' texts. Texts can be in multiple lists.
I decided to go with separate collections (not embeds) because child documents don't always appear in context of their parent (eg. all texts, without being in a list).
So what needs to be done is reference the texts that belong into certain lists with exactly those lists. There can be unlimited lists and texts, though lists will be less in comparison.
In contrast to what I first thought of, I could also put the reference in every single text-document and not all text-ids in the list-documents. It would actually make a difference, because I could get away with one query to find every snippet in a list. Could even index that reference.
var TextSchema = new Schema({
_id: Number,
name: String,
inListID: { type : Array , "default" : [] },
[...]
It is also rather seldom the case that texts will be in MANY lists, so the array would not really explode. The question kind of remains though, is there a chance this scales or actually a better way of implementing it with mongoDB? Would it help to limit the amount of lists a text can be in (probably)? Is there a recipe for few:many relations?
It would even be awesome to get references to projects where this has been done and how it was implemented (few:many relations). I can't believe everybody shies away from mongo DB as soon as some relations are needed.
Original Question
I'll break it down in two problems I see so far:
1) Let's assume a list consists of 5 texts. How do I reference the texts contained in a list? Just open an array and store the text's _ids in there? Seems like those arrays might grow to the moon and back, slowing the app down? On the other hand texts need to be available without a list, so embedding is not really an option. What if I want to get all texts of a list that contains 100 texts.. sounds like two queries and an array with 100 fields :-/. So is this way of referencing the proper way to do it?
var ListSchema = new Schema({
_id: Number,
name: String,
textids: { type : Array , "default" : [] },
[...]
2) The second problem I see with this approach is cleaning up the references if a text is deleted. Its reference will still be in every list that contained the text and I wouldn't want to iterate through all the lists to clean out those dead references. Or would I? Is there a smart way to solve this? Just making the texts hold the reference (in which list they are) just moves the problem around, so that's not an option.
I guess I'm not the first with this sort of problem but I was also unable to find a definitive answer on how to do it 'right'.
I'm also interested in general thoughts on best-practice for this sort of referencing (many-to-many?) and especially scalability/performance.
Relations are usually not a big problem, though certain operations involving relations might be. That depends largely on the problem you're trying to solve, and very strongly on the cardinality of the result set and the selectivity of the keys.
I have written a simple testbed that generates data following a typical long-tail distribution to play with. It turns out that MongoDB is usually better at relations than people believe.
After all, there are only three differences to relational databases:
Foreign key constraints: You have to manage these yourself, so there's some risk for dead links
Transaction isolation: Since there are no multi-document transactions, there's some likelihood for creating invalid foreign key constraints even if the code is correct (in the sense that it never tries to create a dead link), but merely interrupted at runtime. Also, it is hard to check for dead links because you could be observing a race condition
Joins: MongoDB doesn't support joins, though a manual subquery with $in does scale well up to several thousand items in the $in-clause, provided the reference values are indexed, of course
Iff you need to perform large joins, i.e. if your queries are truly relational and you need large amount of the data joined accordingly, MongoDB is probably not a good fit. However, many joins required in relational databases aren't truly relational, they are required because you had to split up your object to multiple tables, for instance because it contains a list.
An example of a 'truly' relational query could be "Find me all customers who bought products that got >4 star reviews by customers that ranked high in turnover in June". Unless you have a very specialized schema that essentially was built to support this query, you'll most likely need to find all the orders, group them by customer ids, take the top n results, use these to query ratings using $in and use another $in to find the actual customers. Still, if you can limit yourself to the top, say 10k customers of June, this is three round-trips and some fast $in queries.
That will probably be in the range of 10-30ms on typical cloud hardware as long as your queries are supported by indexes in RAM and the network isn't completely congested. In this example, things get messy if the data is too sparse, i.e. the top 10k users hardly wrote >4 star reviews, which would force you to write program logic that is smart enough to keep iterating the first step which is both complicated and slow, but if that is such an important scenario, there is probably a better suited data structure anyway.
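The "manual subquery with $in" mentioned above boils down to two round trips. Here is a minimal sketch, assuming in-memory arrays in place of real collections and a `Map` in place of the `_id` index; the sample data and the helper names `findIn` and `textsOfList` are made up for illustration. With mongoose, the second step would be roughly `Text.find({ _id: { $in: list.textids } })`.

```javascript
// In-memory stand-ins for two collections; `byId` plays the role of the _id index.
const lists = [{ _id: 1, name: 'favourites', textids: [10, 30] }];
const texts = [
  { _id: 10, name: 'a' },
  { _id: 20, name: 'b' },
  { _id: 30, name: 'c' },
];
const byId = new Map(texts.map((t) => [t._id, t]));

// Emulates find({ _id: { $in: ids } }): one indexed lookup per requested id.
function findIn(index, ids) {
  return ids.map((id) => index.get(id)).filter((doc) => doc !== undefined);
}

// Round trip 1: fetch the list. Round trip 2: fetch all referenced texts at once.
function textsOfList(listId) {
  const list = lists.find((l) => l._id === listId);
  return list ? findIn(byId, list.textids) : [];
}
```

Dead references simply drop out of the result here, which mirrors how an `$in` query silently ignores ids that no longer exist.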
Using MongoDB with references is a gateway to performance issues. This is a perfect example of what not to use it for. This is an m:n kind of relation where m and n can scale to millions. MongoDB works well where we have 1:n(few), 1:n(many), m(few):n(many), but not in situations where you have m(many):n(many). It will obviously result in 2 queries and a lot of housekeeping.
I am not sure whether this question is still relevant, but I have had a similar experience.
First of all, I want to quote what the official MongoDB documentation says:
Use embedded data models when you have a one-to-one or one-to-many model.
For a many-to-many model, use relationships with document references.
I think that is the answer, but it brings a lot of problems, because:
As was mentioned, Mongo doesn't provide transactions at all.
You don't have foreign key constraints.
Even if you have references (DBRefs) between documents, you are faced with the considerable problem of how to dereference those documents.
Each of these items is a huge piece of responsibility, even if you are working on a weekend project. It might mean that you have to write a lot of code to provide simple behaviour in your system (for example, you can see how to implement transactions in Mongo here).
I have no idea how to implement foreign key constraints, and I haven't seen anything in this direction in the Mongo documentation, which is why I think it is a serious challenge (and a risk for the project).
And lastly, a Mongo reference is not a MySQL join: you don't receive all the data from the parent collection together with the data from the child collection (like all fields from a table and all fields from the joined table in MySQL). You receive just a REFERENCE to another document in another collection, and you need to do something with this reference (dereference it).
This is easily done in Node with a callback, but only when you need just one text from one list. If you need all texts in one list, it's terrible, and if you need all texts in more than one list, it becomes a nightmare...
Perhaps my experience is not the best... but I think you should think about it...
Using array in MongoDB is generally not preferable, and generally not advised by experts.
Here is a solution that came to my mind :
Each document of Users is always unique. There can be Lists and Texts for individual document in Users. So therefore, Lists and Texts have a Field for USER ID, which will be the _id of Users.
Lists always have an owner in Users so they are stored as they are.
Owner of Texts can be either Users or List, so you should keep a Field of LIST ID also in it, which will be _id of Lists.
Now mind that Texts cannot have both USER ID and LIST ID, so you will have to keep a condition that there should be only ONE out of both, the other should be null so that we can easily know who is the primary owner of the Texts.
Writing an answer as I want to explain how I will proceed from here.
Taking into consideration the answers here and my own research on the topic, it might actually be fine storing those references (not really relations) in an array, trying to keep it relatively small: fewer than 1000 fields is very likely in my case.
Especially because I can get away with one query (which I first thought I couldn't) that doesn't even require using $in so far, I'm confident that the approach will scale. After all it's 'just a weekend-project', so if it doesn't and I end up re-writing - that's fine.
With a text-schema like this:
var textSchema = new Schema({
_id: {type: Number, required: true, index: { unique: true }},
...
inList: { type : [Number] , "default" : [], index: true }
});
I can simply get all texts in a list with this query, where inList is an indexed array containing the _ids of the lists the text is in.
Text.find({inList: listID}, function(err, text) {
...
});
I will still have to deal with foreign key constraints and write my own "clean-up" functions that take care of removing references if a list is removed - remove reference in every text that was in the list.
Luckily this will happen very rarely, so I'm okay with going through every text once in a while.
On the other hand I don't have to care about deleting references in a list-document if a text is removed, because I only store the reference on one side of the relation (in the text-document). Quite an important point in my opinion!
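The clean-up step for a removed list could be sketched like this. With mongoose it would presumably be a single `Text.updateMany({ inList: listID }, { $pull: { inList: listID } })`; the version below emulates that behaviour on an in-memory array so it can be run without a database. The sample documents and the helper name `removeListReferences` are assumptions.

```javascript
// In-memory stand-in for the texts collection.
const texts = [
  { _id: 1, inList: [7, 8] },
  { _id: 2, inList: [8] },
  { _id: 3, inList: [] },
];

// Emulates updateMany({ inList: listID }, { $pull: { inList: listID } }):
// every text that references the removed list loses that reference.
function removeListReferences(docs, listID) {
  let modified = 0;
  for (const doc of docs) {
    if (doc.inList.includes(listID)) {
      doc.inList = doc.inList.filter((id) => id !== listID);
      modified++;
    }
  }
  return modified;
}
```

Because the reference lives only on the text side, deleting a text needs no counterpart to this function.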
@mnemosyn: thanks for the link and for pointing out that this is indeed not a large join or, in other words, just a very simple relation. Also, some numbers on how long those complex operations take (hardware dependent, of course) are a big help.
PS: Grüße aus Bielefeld.
What I found most helpful during my own research was this vid, where Alvin Richards also talks about many-to-many relations at around min. 17. This is where I got the idea of making the relation one-sided to save myself some work cleaning up the dead references.
Thanks for the help guys

Using intensive update in Map type column in Cassandra is anti-pattern?

Friends,
I am modeling a table in Cassandra which contains a Map column. This Map should contain dynamic values and will be updated very frequently for a given row (I will update by primary key).
Is this an anti-pattern? Which other options should I consider?
What you're trying to do is possibly what I described here.
The first big limitations that come to mind are the ones given by the specification:
64KB is the max size of an item in a collection
65536 is the max number of queryable elements inside a collection
In addition, there are the problems described in the other post:
you cannot retrieve part of a collection: even if internally each entry of a map is stored as a column, you can only retrieve the whole collection (this can lead to very slow performance)
you have to choose whether to create an index on keys or on values; both simultaneously are not supported
since maps are typed, you can't put mixed values inside: you have to represent everything as a string or bytes and then transform your data client-side
I personally consider this approach an anti-pattern for all these reasons: it provides a schema-less solution but reduces performance and introduces lots of limitations, like the ones on secondary indexes and typing.
HTH, Carlo
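To illustrate the typing limitation listed above, here is a minimal sketch of the client-side transformation needed to keep mixed values in a map whose values are all text. It uses JSON as the string encoding, which is an arbitrary choice; the helper names and the size guard (based on the 64KB item limit quoted in the answer) are assumptions.

```javascript
// Size guard taken from the 64KB-per-item limit quoted above (assumed to be in bytes).
const MAX_ITEM_BYTES = 64 * 1024;

// Encode every value as a JSON string so the whole map has a single value type.
function encodeMap(values) {
  const encoded = {};
  for (const [key, value] of Object.entries(values)) {
    const text = JSON.stringify(value);
    if (Buffer.byteLength(text, 'utf8') > MAX_ITEM_BYTES) {
      throw new Error('map item too large: ' + key);
    }
    encoded[key] = text;
  }
  return encoded;
}

// Reverse the encoding after reading the whole map back from the row.
function decodeMap(encoded) {
  const values = {};
  for (const [key, text] of Object.entries(encoded)) {
    values[key] = JSON.parse(text);
  }
  return values;
}
```

Every read and write now pays for this round trip, and the database itself can no longer compare or index the values by their real type.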

data modelling in cassandra to optimize search results

I was just wondering if I could get some clue/pointers to our kind of simple data modelling problem.
It would be great if somebody can help me in the right direction.
So we have kind of a flat table ex. document
which has all kinds of meta data attached to a document like
UUID documentId,
String organizationId,
Integer totalPageCount,
String docType,
String acountNumber,
String branchNumber,
Double amount,
etc etc...
which we are storing in cassandra .
UUID is the rowkey and we have certain secondary indexes like organization Id.
This table is actually supposed to hold millions of records.
Placing proper indices helps with a lot of queries but with the generic queries I am stuck.
The problem is even with something like 100k records if I throw in a query like
select * from document where orgId='something' and amount > 5 and amount < 50 ... I am beginning to see read timeout problems.
The query still works (although quite slowly) if I limit the number of records to, let's say, 2000.
The above can probably be solved by placing certain params properly, but there are dozens of those columns based on which we need to search.
I am still trying to scale it horizontally so as to place multiple records in a single row.
Hoping for a sense of direction.
This is a broad problem, and general solutions are hard to give. However, here's my 2 pennies:
You WANT queries to hit single partitions for quick querying. If you don't hit a rowkey in your query, it's a cluster wide operation. So select * from docs where orgId='something' and amount > 5 and amount < 50 means you will have issues. Hitting a partition key AND an index is way way better than hitting the index without the partition key.
Again, you don't want all docs in a single partition...that's an obvious hotspot, not to mention it can cause size issues - keeping a row around the 100mb mark is a good idea. Several thousand or even several hundred thousand metadata entries per row should be fine - though much of this depends on your specific data.
So we want to hit partition keys, but also want to take advantage of distribution, while preserving efficiency. Hmmm.....
You can create artificial buckets. Decide how many buckets you want, based on expected data volumes. Assuming a few hundred thousand per partition, n buckets gives you n * hundreds of thousands. Make the bucket id the row key. When querying, use something like:
select * from documents where bucketid in (...) and orgId='something' and amount > 5;
[Note: for this, you may want to make the docid the last clustering key, so you don't have to specify it when doing the range query.]
That will result in n fast queries hitting n partitions, where n is the number of buckets.
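The bucketing idea above can be sketched as follows. This is only an illustration under stated assumptions: the hash (FNV-1a over the characters of the document id) is an arbitrary choice, the helper names are hypothetical, and the query assumes a table keyed roughly as PRIMARY KEY ((bucketid), orgId, amount, docid), in line with the note about making the docid the last clustering key.

```javascript
// FNV-1a over the string's UTF-16 code units: a simple deterministic hash,
// chosen here only for illustration.
function fnv1a(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// Write path: pick the bucket (partition) a document is stored in.
function bucketFor(docId, bucketCount) {
  return fnv1a(docId) % bucketCount;
}

// Read path: list every bucket so each partition is addressed by its key.
function bucketQuery(bucketCount, orgId, minAmount) {
  const buckets = Array.from({ length: bucketCount }, (_, i) => i);
  const placeholders = buckets.map(() => '?').join(', ');
  return {
    cql:
      'SELECT * FROM documents WHERE bucketid IN (' + placeholders + ') ' +
      'AND orgId = ? AND amount > ?',
    params: [...buckets, orgId, minAmount],
  };
}
```

The bucket count is fixed at design time in this sketch; changing it later would mean rewriting existing rows into their new buckets.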
Also, consider limiting your results. Do you really need 2000 records at a time?
For some information, it may make sense to have separate tables (i.e. some information with one particular clustering scheme in one table, and another in another). Duplication of some information is often ok - but again, this depends on particular scenarios.
Again, it's hard to give a general answer. But does that help?
The problem is not in Cassandra, but in your data model. You need to shift from relational thinking to NoSQL/Cassandra thinking. In Cassandra, you write your queries first if you want to get decent O(1) speed. Using secondary indexes in Cassandra is frankly a poor choice. This is due to the fact that your indexes are distributed.
If you don't know your queries upfront, use other technology but not Cassandra. Relational servers are really good, if you can fit all data on 1 server, otherwise have a look at ElasticSearch.
Another option is to use the DataStax edition, which includes Solr for full-text search.
Lastly, you can have several tables that duplicate information. This will allow you to query for a specific property . This process is called de-normalisation and the idea is that you take a property of your object, make it a primary key and insert it into its own table. The outcome is that you can query that particular table, for that particular property value in O(1) time. The downside is that you now have to duplicate data.
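The table-per-query idea can be sketched like this, with each `Map` standing in for one Cassandra table whose primary key is the queried property. The function name `buildQueryTables` and the sample property names used with it are assumptions for illustration.

```javascript
// Build one lookup "table" per property we want to query by.
// In Cassandra each of these would be a separate table holding a full copy
// of the row; here the same object is simply referenced from every Map.
function buildQueryTables(docs, properties) {
  const tables = {};
  for (const prop of properties) {
    const table = new Map();
    for (const doc of docs) {
      const key = doc[prop];
      if (!table.has(key)) table.set(key, []);
      table.get(key).push(doc);
    }
    tables[prop] = table;
  }
  return tables;
}
```

A lookup such as `tables.docType.get('invoice')` is then a direct key access, at the price of writing every document once per table.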

What is the best way to store and search through object transactions?

We have a decent sized object-oriented application. Whenever an object in the app is changed, the object changes are saved back to the DB. However, this has become less than ideal.
Currently, transactions are stored as a transaction and a set of transactionLI's.
The transaction table has fields for who, what, when, why, foreignKey, and foreignTable. The first four are self-explanatory. ForeignKey and foreignTable are used to determine which object changed.
TransactionLI has timestamp, key, val, oldVal, and a transactionID. This is basically a key/value/oldValue storage system.
The problem is that these two tables are used for every object in the application, so they're pretty big tables now. Using them for anything is slow. Indexes only help so much.
So we're thinking about other ways to do something like this. Things we've considered so far:
- Sharding these tables by something like the timestamp.
- Denormalizing the two tables and merging them into one.
- A combination of the two above.
- Doing something along the lines of serializing each object after a change and storing it in subversion.
- Probably something else, but I can't think of it right now.
The whole problem is that we'd like to have some mechanism for properly storing and searching through transactional data. Yeah you can force feed that into a relational database, but really, it's transactional data and should be stored accordingly.
What is everyone else doing?
We have taken the following approach:-
All objects are serialised (using the standard XmlSerializer), but we have decorated our classes with serialisation attributes so that the resultant XML is much smaller (storing elements as attributes and dropping vowels on field names, for example). This could be taken a stage further by compressing the XML if necessary.
The object repository is accessed via a SQL view. The view fronts a number of tables that are identical in structure but the table name appended with a GUID. A new table is generated when the previous table has reached critical mass (a pre-determined number of rows)
We run a nightly archiving routine that generates the new tables and modifies the views accordingly so that calling applications do not see any differences.
Finally, as part of the overnight routine we archive any old object instances that are no longer required to disk (and then tape).
I've never found a great end-all solution for this type of problem. One thing you can try, if your DB supports partitioning (and even if it doesn't, you can implement the same concept yourself), is to partition this log table by object type; you can then further partition by date/time or by object ID (if your ID is numeric this works nicely; I'm not sure how a GUID would partition).
This will help limit the size of each table and keep all transactions related to a single instance of an object together.
One idea you could explore is instead of storing each field in a name value pair table, you could store the data as a blob (either text or binary). For example serialize the object to Xml and store it in a field.
The downside of this is that as your object changes you have to consider how this affects all historical data. If you're using XML, there are easy ways to update the historical XML structures; if you're using binary there are ways too, but you have to be more conscious of the effort.
I've had awesome success storing a rather complex object model that has tons of interrelations as a blob (the XML serializer in .NET didn't handle the relationships between the objects). I could very easily see myself storing the binary data. A huge downside of storing it as binary data is that to access it you have to take it out of the database; with XML, if you're using a modern database like MSSQL, you can access the data in the database itself.
One last approach is to split the two patterns: you could define a difference schema (and I assume more than one property changes at a time), so for example imagine storing this XML:
<objectDiff>
<field name="firstName" newValue="Josh" oldValue="joshua"/>
<field name="lastName" newValue="Box" oldValue="boxer"/>
</objectDiff>
This will help reduce the number of rows, and if you're using MSSQL you can define an XML Schema and get some of the rich querying ability around the object. You can still partition the table.
Josh
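Producing a difference document of the shape shown above could look roughly like this. It is a minimal JavaScript sketch that assumes flat objects with primitive field values; the helper names are hypothetical, and only the element and attribute names are taken from the example.

```javascript
// Escape the characters that are unsafe inside a double-quoted XML attribute.
function escapeAttr(value) {
  const text = value === undefined ? '' : String(value);
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Collect the fields whose value differs between two flat objects.
function diffFields(oldObj, newObj) {
  const names = new Set([...Object.keys(oldObj), ...Object.keys(newObj)]);
  const changes = [];
  for (const name of names) {
    if (oldObj[name] !== newObj[name]) {
      changes.push({ name, newValue: newObj[name], oldValue: oldObj[name] });
    }
  }
  return changes;
}

// Render the changes as one <objectDiff> document: one row per save
// instead of one row per changed field.
function toObjectDiffXml(changes) {
  const fields = changes.map(
    (c) =>
      '<field name="' + escapeAttr(c.name) +
      '" newValue="' + escapeAttr(c.newValue) +
      '" oldValue="' + escapeAttr(c.oldValue) + '"/>'
  );
  return ['<objectDiff>', ...fields, '</objectDiff>'].join('\n');
}
```

Unchanged fields are left out entirely, which is what keeps the stored document small compared with a full serialized snapshot.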
Depending on the characteristics of your specific application an alternative approach is to keep revisions of the entities themselves in their respective tables, together with the who, what, why and when per revision. The who, what and when can still be foreign keys.
Although I would be very careful to use this approach, since this is only viable for applications with a relatively small amount of changes per entity/entity type.
If querying the data is important I would use true Partitioning in SQL Server 2005 and above if you have enterprise edition of SQL Server. We have millions of rows partitioned by year down to day for the current month - you can be as granular as your application demands with a maximum number of 1000 partitions.
Alternatively, if you are using SQL 2008 you could look into filtered indexes.
These are solutions that will enable you to retain the simplified structure you have whilst providing the performance you need to query that data.
Splitting/Archiving older changes obviously should be considered.
