I see info here on collection size limits in Cassandra, but it includes this note: "The limits specified for collections are for non-frozen collections." I can't find limits on frozen collections defined anywhere.
Frozen collections are treated as blobs, so there is no limit imposed on them (other than the overall sizes you would want to keep for partitions, etc.).
Frozen collections are useful if you want to use them in the primary key. A frozen collection can only be replaced as a whole; you cannot, for example, add or remove individual elements of a frozen collection.
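For illustration, here is a minimal sketch using the Python cassandra-driver (the keyspace, table, and column names are hypothetical) showing a frozen map column being replaced as a whole:

from uuid import uuid4
from cassandra.cluster import Cluster

# Assumes a locally reachable cluster and an existing keyspace named "demo".
session = Cluster(["127.0.0.1"]).connect("demo")

# A frozen collection is stored as a single value; it could also be used in the
# PRIMARY KEY, which non-frozen collections cannot.
session.execute("""
    CREATE TABLE IF NOT EXISTS users (
        id    uuid PRIMARY KEY,
        prefs frozen<map<text, text>>
    )
""")

uid = uuid4()
session.execute("INSERT INTO users (id, prefs) VALUES (%s, %s)",
                (uid, {"theme": "dark"}))

# Allowed: replacing the frozen map as a whole.
session.execute("UPDATE users SET prefs = %s WHERE id = %s",
                ({"theme": "light"}, uid))

# Not allowed: element-level updates such as
#   UPDATE users SET prefs['theme'] = 'light' WHERE id = ...
# are rejected because the collection is frozen.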
Related
I have a Cassandra UDT column that has about 10 attributes, and now we are planning to add 3 more attributes to it. I am wondering if it would behave well if I alter the UDT type in the higher environments, which have a very large volume of data.
Altering a UDT is similar to altering a table, except that you cannot remove an existing UDT unless you drop all the tables that depend on it. Also, you can't change the type of an existing field. Below is the query for adding a new field to a UDT:
ALTER TYPE commentmetadata ADD columnname <type>;
It should be safe.
Just a few precautions:
Don't run it in a mixed-version Cassandra cluster.
Don't try to make the same schema change concurrently from multiple clients (drivers).
Additional pointers when you alter a table that uses UDT or collection data types:
Never insert more than 2 billion items in a collection, as only that number can be queried.
The maximum number of keys for a map collection is 65,535.
The maximum size of an item in a list or a map collection is 2GB.
The maximum size of an item in a set collection is 65,535 bytes.
Keep collections small to prevent delays during querying.
Collections cannot be "sliced"; Cassandra reads a collection in its entirety, impacting performance.
Refer to the documentation on CQL collection limits for more information.
[Alex Ott] MAP & LIST limits are version dependent.
65,535 bytes are supported by v3.0+, while lower versions are limited to 64,000 bytes (see the ticket for the fix version).
When we run a Mongo find() query without any sort order specified, what does the database internally use to sort the results?
According to the documentation on the mongo website:
When executing a find() with no parameters, the database returns
objects in forward natural order.
For standard tables, natural order is not particularly useful because,
although the order is often close to insertion order, it is not
guaranteed to be. However, for Capped Collections, natural order is
guaranteed to be the insertion order. This can be very useful.
However for standard collections (non capped collections), what field is used to sort the results?
Is it the _id field or something else?
Edit:
Basically, I guess what I am trying to get at is that if I execute the following search query:
db.collection.find({"x":y}).skip(10000).limit(1000);
At two different points in time: t1 and t2, will I get different result sets:
When there have been no additional writes between t1 & t2?
When there have been new writes between t1 & t2?
When new indexes have been added between t1 & t2?
I have run some tests on a temp database, and the result sets I got were the same for all 3 cases - but I wanted to be sure, and I am certain that my test cases weren't very thorough.
What is the default sort order when none is specified?
The default internal sort order (or natural order) is an undefined implementation detail. Maintaining order is extra overhead for storage engines and MongoDB's API does not mandate predictability outside of an explicit sort() or the special case of fixed-sized capped collections which have associated usage restrictions. For typical workloads it is desirable for the storage engine to try to reuse available preallocated space and make decisions about how to most efficiently store data on disk and in memory.
Without any query criteria, results will be returned by the storage engine in natural order (aka in the order they are found). Result order may coincide with insertion order but this behaviour is not guaranteed and cannot be relied on (aside from capped collections).
Some examples that may affect storage (natural) order:
WiredTiger uses a different representation of documents on disk versus the in-memory cache, so natural ordering may change based on internal data structures.
The original MMAPv1 storage engine (removed in MongoDB 4.2) allocates record space for documents based on padding rules. If a document outgrows the currently allocated record space, the document location (and natural ordering) will be affected. New documents can also be inserted in storage marked available for reuse due to deleted or moved documents.
Replication uses an idempotent oplog format to apply write operations consistently across replica set members. Each replica set member maintains local data files that can vary in natural order, but will have the same data outcome when oplog updates are applied.
What if an index is used?
If an index is used, documents will be returned in the order they are found (which does not necessarily match insertion order or I/O order). If more than one index is used, then the order depends internally on which index first identified the document during the de-duplication process.
If you want a predictable sort order you must include an explicit sort() with your query and have unique values for your sort key.
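As a small sketch of that point (using pymongo; the database/collection names and the {"x": "y"} filter are just placeholders taken from the question):

from pymongo import ASCENDING, MongoClient

coll = MongoClient("mongodb://localhost:27017")["test"]["collection"]

# Without sort(): page contents depend on whatever order the storage engine returns.
unstable_page = list(coll.find({"x": "y"}).skip(10000).limit(1000))

# With an explicit sort on a unique key (_id), page boundaries are predictable
# across repeated runs, even if unrelated writes happen in between.
stable_page = list(
    coll.find({"x": "y"}).sort("_id", ASCENDING).skip(10000).limit(1000)
)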
How do capped collections maintain insertion order?
The implementation exception noted for natural order in capped collections is enforced by their special usage restrictions: documents are stored in insertion order but existing document size cannot be increased and documents cannot be explicitly deleted. Ordering is part of the capped collection design that ensures the oldest documents "age out" first.
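A quick way to see this behaviour (a sketch with pymongo; the collection name and sizes are arbitrary):

from pymongo import DESCENDING, MongoClient

db = MongoClient("mongodb://localhost:27017")["test"]

# Capped collections must be created explicitly with a fixed size in bytes.
log = db.create_collection("events_log", capped=True, size=1024 * 1024, max=1000)
log.insert_many([{"seq": i} for i in range(5)])

# Forward natural order is guaranteed to be insertion order for capped collections.
print([doc["seq"] for doc in log.find()])                               # [0, 1, 2, 3, 4]

# Reverse natural order walks back from the most recent insert.
print([doc["seq"] for doc in log.find().sort("$natural", DESCENDING)])  # [4, 3, 2, 1, 0]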
Results are returned in the stored order (the order in the data files), but that is not guaranteed to be the insertion order. They are not sorted by the _id field. Sometimes it can look as if they are sorted by insertion order, but this can change in another request. It is not reliable.
I'm setting up our first Azure Cosmos DB - I will be importing into the first collection, the data from a table in one of our SQL Server databases. In setting up the collection, I'm having trouble understanding the meaning and the requirements around the partition key, which I specifically have to name while setting up this initial collection.
I've read the documentation here: (https://learn.microsoft.com/en-us/azure/cosmos-db/documentdb-partition-data) and still am unsure how to proceed with the naming convention of this partition key.
Can someone help me understand how I should be thinking in naming this partition key? See the screenshot below for the field I'm trying to fill in.
In case it helps, the table I'm importing consists of 7 columns, including a unique primary key, a column of unstructured text, a column of URLs, and several other secondary identifiers for that record's URL. Not sure if any of that information has any bearing on how I should name my Partition Key.
EDIT: I've added a screenshot of several records from the table from which I'm importing, per request from #Porschiey.
Honestly the video here* was a MAJOR help to understanding partitioning in CosmosDb.
But, in a nutshell:
The PartitionKey is a property that will exist on every single object and is best used to group similar objects together.
Good examples include Location (like City), Customer Id, Team, and more. Naturally, it wildly depends on your solution; so perhaps if you were to post what your object looks like we could recommend a good partition key.
EDIT: It should be noted that a PartitionKey isn't required for collections under 10GB (thanks, David Makogon).
* The video used to live on this MS docs page entitled, "Partitioning and horizontal scaling in Azure Cosmos DB", but has since been removed. A direct link has been provided, above.
Partition key acts as a logical partition.
Now, what is a logical partition, you may ask? A logical partition may vary based on your requirements; suppose you have data that can be categorized on the basis of your customers - then the customer "Id" will act as the logical partition, and the users' info will be placed according to their customer Id.
What effect does this have on the query?
While querying, you would pass your partition key in the feed options and won't include it in your filter.
E.g., if your query was:
SELECT * FROM T WHERE T.CustomerId= 'CustomerId';
It will now be:
var options = new FeedOptions { PartitionKey = new PartitionKey(CustomerId) };
var query = _client.CreateDocumentQuery(CollectionUri, $"SELECT * FROM T", options).AsDocumentQuery();
I've put together a detailed article here Azure Cosmos DB. Partitioning.
What's a logical partition?
Cosmos DB is designed to scale horizontally based on the distribution of data between physical partitions (PP) (think of a PP as a separately deployable, self-sufficient underlying node) and logical partitions (LP) - a bucket of documents with the same characteristic (partition key) that is supposed to be stored entirely on the same PP. So an LP can't have part of its data on PP1 and another part on PP2.
There are two main limitations on physical partitions:
Max throughput: 10k RUs
Max data size (sum of sizes of all LPs stored in this PP): 50GB
A logical partition has one limit - 20GB in size.
NOTE: Since the initial releases of Cosmos DB the size limits have grown, and I won't be surprised if they increase again soon.
How to select right partition key for my container?
Based on the Microsoft recommendation, for maintainable data growth you should select a partition key with the highest cardinality (like the Id of the document or a composite field). The main reason:
Spread request unit (RU) consumption and data storage evenly across all logical partitions. This ensures even RU consumption and storage distribution across your physical partitions.
It is critical to analyze the application's data consumption pattern when choosing the right partition key. In very rare scenarios larger partitions might work, but at the same time such solutions should implement data archiving to keep the DB size in check from the get-go (see the example below explaining why). Otherwise you should be ready for increasing operational costs just to maintain the same DB performance, along with potential PP data skew, unexpected "splits" and "hot" partitions.
Having a very granular and small partitioning strategy will lead to an RU overhead (definitely not a multiplication of RUs, but rather a couple of additional RUs per request) when consuming data distributed between a number of physical partitions (PPs), but it will be negligible compared to the issues that occur once data starts growing beyond 50, 100, 150GB.
Why large partitions are a terrible choice in most cases even though documentation says "select whatever works best for you"
Main reason is that Cosmos DB is designed to scale horizontally and provisioned throughput per PP is limited to the [total provisioned per container (or DB)] / [number of PP].
Once a PP split occurs due to exceeding the 50GB size, your max throughput for the existing PPs as well as the two newly created PPs will be lower than it was before the split.
So imagine the following scenario (consider days as the measure of time between actions):
You've created a container with 10k RUs provisioned and CustomerId as the partition key (which will generate one underlying PP1). Maximum throughput per PP is 10k/1 = 10k RUs.
Gradually adding data to the container, you end up with 3 big customers: C1 [10GB], C2 [20GB] and C3 [10GB] of invoices.
When another customer is onboarded to the system with C4 [15GB] of data, Cosmos DB has to split PP1's data into two newly created partitions, PP2 (30GB) and PP3 (25GB). Maximum throughput per PP is 10k/2 = 5k RUs.
Two more customers, C5 [10GB] and C6 [15GB], are added to the system and both end up in PP2, which leads to another split -> PP4 (20GB) and PP5 (35GB). Maximum throughput per PP is now 10k/3 = 3.333k RUs.
IMPORTANT: As a result, on [Day 2] C1's data could be queried with up to 10k RUs,
but on [Day 4] with at most 3.333k RUs, which directly impacts the execution time of your queries.
This is the main thing to remember when designing partition keys in the current version of Cosmos DB (as of 12.03.21).
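To make the arithmetic above explicit, here is a tiny sketch (the only formula involved is [provisioned RUs] / [number of PPs]; the numbers mirror the hypothetical scenario):

PROVISIONED_RU = 10_000

def max_throughput_per_pp(provisioned_ru: int, physical_partitions: int) -> float:
    # Provisioned throughput is spread evenly across physical partitions.
    return provisioned_ru / physical_partitions

# Before any split, after the first split, after the second split:
for pp_count in (1, 2, 3):
    print(pp_count, round(max_throughput_per_pp(PROVISIONED_RU, pp_count)))
    # -> 1 10000, 2 5000, 3 3333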
Cosmos DB can be used to store any amount of data. The way it does this in the back end is by using the partition key. Is it the same as the primary key? - NO.
Primary Key: Uniquely identifies the data
Partition key: Helps in sharding the data (for example, one partition for the city New York when city is the partition key).
Partitions have a limit of 10GB, and the better we spread the data across partitions, the more of it we can use. Though it will eventually need more connections to get data from all partitions. Example: getting data from the same partition in a query will always be faster than getting data from multiple partitions.
Partition Key is used for sharding, it acts as a logical partition for your data, and provides Cosmos DB with a natural boundary for distributing data across partitions.
You can read more about it here: https://learn.microsoft.com/en-us/azure/cosmos-db/partition-data
Each partition on a table can store up to 10GB (and a single table can store as many document schema types as you like). You have to choose your partition key, though, such that all the documents that get stored against that key (and so fall into that partition) stay under that 10GB limit.
I'm thinking about this too right now - so should the partition key be a date range of some type? In that case, it would really depend on how much data is getting stored in a period of time.
You are defining a logical partition.
Underneath, physically the data is split into physical partitions by Azure.
Ideally a partitionKey should be a primary key, or a field with high cardinality, to ensure proper distribution, with the self-generated id field within that partition also set to the primary key; that will make fetching a document by id much faster.
You cannot change a partitionKey once container is created.
Looking at the dataset, captureId is a good candidate for the partitionKey, with id set manually to this field rather than an auto-generated Cosmos one.
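A minimal sketch of that setup with the azure-cosmos Python SDK (the account endpoint, key, database/container names, and sample document are all hypothetical):

from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<key>")
database = client.create_database_if_not_exists("captures-db")

# The partition key path is fixed at container creation time and cannot be changed later.
container = database.create_container_if_not_exists(
    id="captures",
    partition_key=PartitionKey(path="/captureId"),
)

# Setting id to the same value as captureId keeps point reads (fetch-by-id) cheap.
container.upsert_item({"id": "abc-123", "captureId": "abc-123",
                       "url": "https://example.com/page"})
item = container.read_item(item="abc-123", partition_key="abc-123")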
There is documentation available from Microsoft about partition keys. In my view, you need to check the queries or operations that you plan to perform with Cosmos DB. Are they read-heavy or write-heavy? If read-heavy, it is ideal to choose a partition key that appears in the WHERE clause of your queries; if it is a write-heavy operation, then look for a key with high cardinality.
Point reads/writes are always better, since they consume far fewer RUs than running other queries.
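For example, a sketch of the three access patterns with the azure-cosmos Python SDK (endpoint, names, and values are hypothetical; RU costs are indicative only):

from azure.cosmos import CosmosClient

container = (
    CosmosClient("https://myaccount.documents.azure.com:443/", credential="<key>")
    .get_database_client("captures-db")
    .get_container_client("captures")
)

# Point read: id and partition key known up front - the cheapest operation (roughly 1 RU for a small doc).
doc = container.read_item(item="abc-123", partition_key="abc-123")

# Query scoped to a single partition: cheaper than fanning out.
in_partition = list(container.query_items(
    query="SELECT * FROM c WHERE c.captureId = @pk",
    parameters=[{"name": "@pk", "value": "abc-123"}],
    partition_key="abc-123",
))

# Cross-partition query: fans out to every physical partition and costs more RUs.
fan_out = list(container.query_items(
    query="SELECT * FROM c WHERE c.url = 'https://example.com/page'",
    enable_cross_partition_query=True,
))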
I am going to do a project using Node.js and MongoDB. We are designing the database schema, and we are not sure whether we need to use different collections or the same collection to store the data, because each has its own pros and cons.
If we use a single collection, then whenever the database is invoked the whole collection will be loaded into memory, which reduces the available RAM. If we use different collections, then to retrieve data we need to write different queries. By using one collection, retrieval will be easy, and by using different collections the application will become faster. We are confused about whether to use a single collection or multiple collections. Please guide me on which one is better.
Usually you use different collections for different things. For example, when you have users and articles in the system, you usually create a "users" collection for users and an "articles" collection for articles. You could create one collection called "objects" or something like that and put everything there, but it would mean you would have to add some type fields and use them for searches and storage of data. You can use a single collection in the database, but it would make the usage more complicated. Of course it would let you load the entire collection at once, but whether or not that is relevant for the performance of your application is something that would have to be profiled and tested to measure the performance impact for your particular use case.
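As a rough sketch of the two options (using pymongo; collection and field names are made up):

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["app"]

# Option 1: separate collections per entity type.
db.users.insert_one({"name": "alice"})
db.articles.insert_one({"title": "hello", "author": "alice"})
by_author = list(db.articles.find({"author": "alice"}))

# Option 2: one shared collection with an explicit type discriminator field,
# which every query then has to include.
db.objects.insert_one({"type": "user", "name": "alice"})
db.objects.insert_one({"type": "article", "title": "hello", "author": "alice"})
by_author = list(db.objects.find({"type": "article", "author": "alice"}))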
Usually, developers create different collections for different things. For example, for post management, people create a 'posts' collection and save the posts there, and the same goes for users and so on.
Using different collections for different purposes is good practice.
MongoDB is great at scaling horizontally. It can shard a collection across a dynamic cluster to produce a fast, queryable collection of your data.
So having a smaller collection size is not really a pro, and I am not sure where the theory that it is comes from; it isn't true in SQL and it isn't true in MongoDB. The performance of sharding, if done well, should be comparable to the performance of querying a single small collection of data (with a small overhead). If it isn't, then you have set up your sharding wrong.
MongoDB is not great at scaling vertically; as #Sushant quoted, the ns size of MongoDB would be a serious limitation here. One thing that quote does not mention is that index size and count also affect the ns size, hence why it describes that:
By default MongoDB has a limit of approximately 24,000 namespaces per
database. Each namespace is 628 bytes, the .ns file is 16MB by
default.
Each collection counts as a namespace, as does each index. Thus if
every collection had one index, we can create up to 12,000
collections. The --nssize parameter allows you to increase this limit
(see below).
Be aware that there is a certain minimum overhead per collection -- a
few KB. Further, any index will require at least 8KB of data space as
the b-tree page size is 8KB. Certain operations can get slow if there
are a lot of collections and the meta data gets paged out.
So you won't be able to gracefully handle it if your users exceed the namespace limit. Also, it won't perform well as your userbase grows.
UPDATE
For MongoDB 3.0 or above using the WiredTiger storage engine, this is no longer a limit.
Yes, personally I think having multiple collections in a DB keeps it nice and clean. The only thing I would worry about is the size of the collections. Collections are used by a lot of developers to cut up their DB into, for example, posts, comments, users.
Sorry about my grammar and lack of explanation; I'm on my phone.
We are using DocumentDB on Azure. We have a single database with 7 collections, each having at most 15 records. It does not require much storage.
Only a few developers are using this DB instance, so traffic is also below average.
Still, this server is using 67,600 RUs per day. There must be some problem with the DocumentDB settings, so I'm looking for direction on how to analyse exactly how these RUs are charged and how to reduce them.
There's no problem with DocumentDB settings. You provisioned 7 collections. By default, via the portal, each collection is assigned 1000 RU (which you have at your disposal, regardless whether you use 0 RU or all 1000 RU). The minimum RU setting for a non-partitioned collection is 400.
EDIT - I misread - if you're at 67,000 RU, then you have likely provisioned several partitioned collections (which start at 10,100 RU). For initial dev/test, with only 15 documents, you've grossly over-allocated capacity.
Since you provisioned seven collections (which are likely partitioned, based on your RU sizing), you have a ~70,000 RU deployment, regardless of what you actually consume (you're essentially reserving capacity).
I have no idea what your app needs are, or whether you need 7 collections for some specific reason. But... objectively speaking, there is no rule that says you need to separate different document types into different collections. You can easily store heterogeneous data within a single collection. How you query for specific types is really up to you, but it's trivial to add something like a type property to each document.
Note, since I now believe you're using partitioned collections: You cannot convert these to non-partitioned collections; you'll need to create new non-partitioned collections and move your data from your partitioned collections. (given that you have 15 total documents, this should be trivial).
Note that a single non-partitioned collection may be scaled down to 400 RU. If you then combine your 7 collections into 1 collection, you should be able to reduce your consumption from ~70,000 => 400. (at least during dev/test).
EDIT As of February 2017, the minimum RU for partitioned collections dropped to 2,500 (from the original 10,100 minimum). In December 2017, it dropped again, to 1,000.
It's common for people new to DocumentDB to think of a collection as similar to a table in SQL or even what MongoDB calls a "collection". However, DocumentDB is designed differently. It's best to use a single partitioned collection to store all document types and partition on something like geography, tenant, or user. You'll distinguish document types with a type = <MyType> field; I actually prefer a myType = true approach so I can model inheritance and mixins (see the sketch below).
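For example (the field names below are purely illustrative), the two styles look like this, with the boolean flags letting a document belong to several "types" at once:

# "type = <MyType>" style: one discriminator value per document.
invoice = {"id": "1", "type": "invoice", "customerId": "c-42", "total": 10}

# "myType = true" style: boolean flags support inheritance and mixins,
# since a document can carry several flags at the same time.
archived_invoice = {"id": "2", "isInvoice": True, "isArchived": True,
                    "customerId": "c-42", "total": 10}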
This means, you'll only need to pay for a single partitioned collection. A single partitioned collection may still end up costing you more than table storage, but if you want DocumentDB's near infinite scalability later on, then I highly recommend you start out the way I'm describing.
One more note about David's suggestion to go with non-partitioned collections. That was the only option when DocumentDB first launched, but it's now recommended to use partitioned collections. I suspect the non-partitioned collection option may be phased out at some point. You interact with them slightly differently, and as David pointed out, there is currently no conversion assistance (especially if you use multiple non-partitioned collections), so transitioning later from non-partitioned collections to a partitioned collection is not hard, but it's not as simple as changing your partition type and will cost you development effort. It'll cost you a little more to have a single partitioned collection than a single non-partitioned collection, but it's worth it to save transition costs later, IMHO, and it'll cost you less to have a single partitioned collection than it costs to have seven non-partitioned ones.