I am taking a look at Cassandra's CQL collections (list, set, map) and I can't find a reliable source on their concurrency guarantees.
I want to know whether having multiple writers add different elements to the same set is supported.
From what I read of the implementation (http://www.opensourceconnections.com/blog/2013/07/24/understanding-how-cql3-maps-to-cassandras-internal-data-structure-sets-lists-and-maps/) it seems that sets are implemented using columns, so I should be safe.
But on the other hand, I've read here and there that operations on collections always trigger a full read of the collection (even for writes). This suggests that I could get into trouble if multiple writers were using the same collection.
So can I (and should I) use a collection from multiple writers? (Also, the documentation mentions that collections should be used for a "small amount of data"; how much is that? Tens, hundreds, thousands?)
Updates are atomic, and that includes any collections in the row, so it should be fine to have multiple writers.
"In an UPDATE statement, all updates within the same partition key are applied atomically and in isolation."
http://cassandra.apache.org/doc/cql3/CQL.html#updateStmt
"Values of items in collections are limited to 64K"
http://www.datastax.com/documentation/cql/3.0/cql/ddl/ddlWhenCollections.html
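As a sketch (table and column names are made up), adding different elements to the same set from two writers needs no read-before-write, because each element becomes its own internal column:

```sql
-- Hypothetical table with a set column.
CREATE TABLE users (
    id   uuid PRIMARY KEY,
    tags set<text>
);

-- Two concurrent writers adding different elements. Each addition is a
-- blind write to a distinct internal column, so neither overwrites the
-- other; both elements end up in the set.
UPDATE users SET tags = tags + {'red'}  WHERE id = 62c36092-82a1-3a00-93d1-46196ee77204;
UPDATE users SET tags = tags + {'blue'} WHERE id = 62c36092-82a1-3a00-93d1-46196ee77204;
```

Additions to sets and maps are write-only; it is operations such as setting a list element by index that require an internal read-before-write.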
I am trying to prevent duplicate insertion of an item into the collection caused by multiple parallel requests arriving at the same time.
My business logic is: if I don't have a unique item XYZ in the collection, I insert it; otherwise I just return the document.
These items cannot be duplicated in the database.
With multiple concurrent requests in Node.js, we are getting duplicate items in the database: all the requests read from the database at the same time, find that the item is not present, and then insert it, leading to duplication.
I know we can prevent this using unique indexes, but I don't want to use them, because the collection is very large and holds different kinds of data, and we are already using heavy indexing on other collections.
Could you suggest some other way to handle this? I could use indexes, but I need another solution that avoids excessive RAM usage.
Are you using insert? If so, I'd suggest using update with the option upsert=true. This, however, is only atomic when there is a unique index, according to this.
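A sketch with the Node.js driver (collection and field names are made up): $setOnInsert writes the fields only when no matching document exists, so the losers of a concurrent race turn into no-op updates. Note the caveat above still applies: without a unique index on "item", two simultaneous upserts can still both insert.

```javascript
// Assumes `collection` is a connected MongoDB collection object.
async function insertOrReturn(collection, value) {
  const result = await collection.updateOne(
    { item: value },
    { $setOnInsert: { item: value, createdAt: new Date() } },
    { upsert: true }
  );
  if (result.upsertedCount === 0) {
    // Document already existed; just return it.
    return collection.findOne({ item: value });
  }
  return { item: value };
}
```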
Other than that, I don't know if you're using any sort of mutex for your concurrency; if not, you should look into it. Here is a good place to start.
Without either atomic operations or mutex locks, you're not guaranteed any data-race safety between parallel requests.
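As a minimal sketch of the mutex approach (in-process only; the store is a stand-in Set, not a real collection), the check-then-insert is serialized so two concurrent requests can never both see "item missing":

```javascript
// Minimal promise-chain mutex: tasks passed to runExclusive execute
// one at a time, in the order they were submitted.
class Mutex {
  constructor() { this.queue = Promise.resolve(); }
  runExclusive(task) {
    const result = this.queue.then(() => task());
    // Keep the chain alive even if a task rejects.
    this.queue = result.catch(() => {});
    return result;
  }
}

// Hypothetical in-memory stand-in for the MongoDB collection.
const store = new Set();
const mutex = new Mutex();

async function insertIfAbsent(item) {
  return mutex.runExclusive(async () => {
    if (store.has(item)) return false; // already present, just return it
    store.add(item);                   // safe: no other request runs here
    return true;
  });
}

// Fire several concurrent requests for the same item.
Promise.all([insertIfAbsent("XYZ"), insertIfAbsent("XYZ"), insertIfAbsent("XYZ")])
  .then(results => {
    console.log(results.filter(Boolean).length); // → 1 (one insert wins)
    console.log(store.size);                     // → 1 (no duplicates)
  });
```

This only protects a single Node.js process; with several processes or servers you are back to needing a unique index or an atomic upsert on the database side.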
My question comes from https://hazelcast.org/mastering-hazelcast/#controlled-partitioning, which says:
Hazelcast has two types of distributed objects.
One type is the truly partitioned data structure, like the IMap, where
each partition will store a section of the Map.
The other type is a non-partitioned data structure, like the
IAtomicLong or the ISemaphore, where only a single partition is
responsible for storing the main instance.
Let's say I have put 500 records in an IMap; what I understand is that each record may go to a different partition.
Now, if I put 500 records in an ISemaphore, does the paragraph quoted above mean that all 500 records will go to a single partition?
Please help me understand IAtomicLong and ISemaphore, where only a single partition is responsible for storing the main instance.
I would also like to understand how ISemaphore and IMap differ when it comes to data distribution across partitions in Hazelcast.
With an IMap, it makes sense to partition the data structure because it will usually hold lots of items (500 in your example) and concurrent access to items anywhere in the map is frequently needed.
But data structures like ISemaphore and IAtomicLong are simple objects, not collections of objects - you can't add 500 records to an ISemaphore. The semaphore state consists of just a few fields (count, current owner, name, maybe a few others) and it doesn't make sense to break those apart and store them in separate partitions.
A queue is more interesting because it does hold multiple items, but is not a partitioned data structure. You could add 500 items to a queue, but access is always going to be to the front of the queue (for reading) or back of the queue (for writing), so distributing the data structure across partitions doesn't really offer improved concurrency as it does with Map, Set, List, and similar collections that are accessed randomly.
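As an illustration (a sketch against the Hazelcast 3.x API; the object names are made up), the map's 500 entries are hashed by key across partitions, while the atomic long's entire state lives in the single partition that owns the hash of its name:

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IAtomicLong;
import com.hazelcast.core.IMap;

public class PartitioningDemo {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();

        // Partitioned: each of the 500 entries is hashed by key and may
        // land in a different partition (and thus a different member).
        IMap<Integer, String> map = hz.getMap("data");
        for (int i = 0; i < 500; i++) {
            map.put(i, "record-" + i);
        }

        // Non-partitioned: the counter is a single object, and its whole
        // state lives in the one partition owned by the hash of "counter".
        IAtomicLong counter = hz.getAtomicLong("counter");
        counter.incrementAndGet();

        hz.shutdown();
    }
}
```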
When we run a Mongo find() query without any sort order specified, what does the database internally use to sort the results?
According to the documentation on the mongo website:
When executing a find() with no parameters, the database returns
objects in forward natural order.
For standard tables, natural order is not particularly useful because,
although the order is often close to insertion order, it is not
guaranteed to be. However, for Capped Collections, natural order is
guaranteed to be the insertion order. This can be very useful.
However for standard collections (non capped collections), what field is used to sort the results?
Is it the _id field or something else?
Edit:
Basically, I guess what I am trying to get at is that if I execute the following search query:
db.collection.find({"x":y}).skip(10000).limit(1000);
At two different points in time, t1 and t2, will I get different result sets:
When there have been no additional writes between t1 and t2?
When there have been new writes between t1 and t2?
When new indexes have been added between t1 and t2?
I have run some tests on a temporary database, and the result sets came back identical in all three cases, but I wanted to be sure, and I am certain that my test cases weren't very thorough.
What is the default sort order when none is specified?
The default internal sort order (or natural order) is an undefined implementation detail. Maintaining order is extra overhead for storage engines and MongoDB's API does not mandate predictability outside of an explicit sort() or the special case of fixed-sized capped collections which have associated usage restrictions. For typical workloads it is desirable for the storage engine to try to reuse available preallocated space and make decisions about how to most efficiently store data on disk and in memory.
Without any query criteria, results will be returned by the storage engine in natural order (aka in the order they are found). Result order may coincide with insertion order but this behaviour is not guaranteed and cannot be relied on (aside from capped collections).
Some examples that may affect storage (natural) order:
WiredTiger uses a different representation of documents on disk versus the in-memory cache, so natural ordering may change based on internal data structures.
The original MMAPv1 storage engine (removed in MongoDB 4.2) allocates record space for documents based on padding rules. If a document outgrows the currently allocated record space, the document location (and natural ordering) will be affected. New documents can also be inserted in storage marked available for reuse due to deleted or moved documents.
Replication uses an idempotent oplog format to apply write operations consistently across replica set members. Each replica set member maintains local data files that can vary in natural order, but will have the same data outcome when oplog updates are applied.
What if an index is used?
If an index is used, documents will be returned in the order they are found (which does not necessarily match insertion order or I/O order). If more than one index is used, then the order depends internally on which index first identified the document during the de-duplication process.
If you want a predictable sort order you must include an explicit sort() with your query and have unique values for your sort key.
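For example, the paging query from the question becomes deterministic once a sort on the unique _id field is added:

```javascript
// Unstable: relies on natural order, which may change between runs.
db.collection.find({ "x": y }).skip(10000).limit(1000)

// Stable: _id values are unique, so the sort (and therefore each page)
// is deterministic for a given data set.
db.collection.find({ "x": y }).sort({ _id: 1 }).skip(10000).limit(1000)
```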
How do capped collections maintain insertion order?
The implementation exception noted for natural order in capped collections is enforced by their special usage restrictions: documents are stored in insertion order but existing document size cannot be increased and documents cannot be explicitly deleted. Ordering is part of the capped collection design that ensures the oldest documents "age out" first.
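For illustration (mongo shell, hypothetical collection name), a capped collection is created with a fixed size and then serves reads in insertion order:

```javascript
// Create a capped collection: fixed size in bytes, optional document cap.
db.createCollection("log", { capped: true, size: 1048576, max: 5000 })

db.log.insertOne({ event: "start" })
db.log.insertOne({ event: "stop" })

// Natural order matches insertion order for capped collections.
db.log.find()
// Reverse insertion order via the $natural sort specifier.
db.log.find().sort({ $natural: -1 })
```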
Results are returned in storage order (the order in the data files), which is not guaranteed to match insertion order. They are not sorted by the _id field. Results may sometimes appear to be in insertion order, but that can change on another request; it is not reliable.
I am going to do a project using Node.js and MongoDB. We are designing the database schema and are not sure whether to use different collections or a single collection to store the data, because each approach has its own pros and cons.
If we use a single collection, the whole collection will be loaded into memory whenever the database is queried, which eats into RAM capacity. If we use different collections, then we need to write different queries to retrieve the data. With one collection, retrieval is easy; with different collections, the application becomes faster. We are confused about whether to use a single collection or multiple collections. Please guide me on which one is better.
Usually you use different collections for different things. For example, when you have users and articles in the system, you create a "users" collection for users and an "articles" collection for articles. You could create one collection called "objects" or something like that and put everything there, but then you would have to add some type field and use it for searches and storage. You can use a single collection in the database, but it makes usage more complicated. It would let you load the entire collection at once, but whether that matters for the performance of your application is something that would have to be profiled and tested for your particular use case.
Usually, developers create different collections for different things. For post management, for example, people create a 'post' collection and save posts there, and do the same for users and so on.
Using a different collection for each purpose is good practice.
MongoDB is great at scaling horizontally. It can shard a collection across a dynamic cluster to produce a fast, queryable collection of your data.
So having a smaller collection size is not really a pro, and I am not sure where this theory comes from; it isn't true in SQL and it isn't true in MongoDB. The performance of sharding, if done well, should be comparable to the performance of querying a single small collection of data (with a small overhead). If it isn't, then you have set up your sharding wrong.
MongoDB is not great at scaling vertically; as @Sushant quoted, the .ns size of MongoDB would be a serious limitation here. One thing that quote does not mention is that index size and count also affect the .ns size, hence why it describes that:
By default MongoDB has a limit of approximately 24,000 namespaces per
database. Each namespace is 628 bytes, the .ns file is 16MB by
default.
Each collection counts as a namespace, as does each index. Thus if
every collection had one index, we can create up to 12,000
collections. The --nssize parameter allows you to increase this limit
(see below).
Be aware that there is a certain minimum overhead per collection -- a
few KB. Further, any index will require at least 8KB of data space as
the b-tree page size is 8KB. Certain operations can get slow if there
are a lot of collections and the meta data gets paged out.
So you won't be able to handle it gracefully if your users exceed the namespace limit, and performance won't keep up as your user base grows.
UPDATE
For MongoDB 3.0 and above with the WiredTiger storage engine, this limit no longer applies.
Yes, personally I think having multiple collections in a DB keeps it nice and clean. The only thing I would worry about is the size of the collections. A lot of developers use collections to cut their DB up into, for example, posts, comments, and users.
Posting here as I could not find any forums for the LMDB key-value store.
Is there a limit on the number of sub-databases? What is a reasonable number of sub-databases to have open concurrently?
I would like to have ~200 databases, which seems like a lot and clearly indicates my model is wrong.
I suppose I could remodel, embed the id of each database in the key itself, and keep only one database, but then I have longer keys and I also cannot drop a database when needed.
I'm interested, though, whether LMDB already uses some sort of internal prefix for keys.
Any suggestions on how to address this problem are appreciated.
Instead of calling mdb_dbi_open each time, keep your own map from database names to the database handles returned by mdb_dbi_open, and reuse these handles for the lifetime of your program. This lets you have multiple databases within an environment while avoiding the overhead of repeated mdb_dbi_open calls.
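A minimal sketch of such a handle cache against the LMDB C API (fixed-size linear-scan table for brevity; real code needs locking if called from multiple threads):

```c
#include <string.h>
#include <lmdb.h>

/* Open each named sub-database once and reuse its MDB_dbi for the
 * lifetime of the environment. */
#define MAX_CACHED 256

static struct { char name[64]; MDB_dbi dbi; } cache[MAX_CACHED];
static int cached = 0;

int get_dbi(MDB_env *env, const char *name, MDB_dbi *out)
{
    /* Cache hit: return the handle we already opened. */
    for (int i = 0; i < cached; i++) {
        if (strcmp(cache[i].name, name) == 0) {
            *out = cache[i].dbi;
            return 0;
        }
    }

    /* Cache miss: open the sub-database inside a write transaction. */
    MDB_txn *txn;
    int rc = mdb_txn_begin(env, NULL, 0, &txn);
    if (rc) return rc;
    rc = mdb_dbi_open(txn, name, MDB_CREATE, out);
    if (rc) { mdb_txn_abort(txn); return rc; }
    rc = mdb_txn_commit(txn);
    if (rc == 0 && cached < MAX_CACHED) {
        strncpy(cache[cached].name, name, sizeof cache[cached].name - 1);
        cache[cached].dbi = *out;
        cached++;
    }
    return rc;
}
```

Remember that the environment must be configured up front with mdb_env_set_maxdbs to at least the number of named sub-databases you intend to open.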
From the documentation for mdb_env_set_maxdbs:
Currently a moderate number of slots are cheap but a huge number gets expensive: 7-120 words per transaction, and every mdb_dbi_open() does a linear search of the opened slots.
http://www.lmdb.tech/doc/group__mdb.html#gaa2fc2f1f37cb1115e733b62cab2fcdbc
The best way to know is to benchmark mdb_dbi_open and see whether its performance is acceptable for your workload.