How do the following two Cassandra limitations interplay with one another?
Cells in a partition: ~2 billion (2^31); single column value size: 2 GB (1 MB is recommended) [1]
Collection values may not be larger than 64KB. [2]
Are collections laid out inside a single column, and hence should one limit the size of the entire collection to 1 MB?
[1] https://docs.datastax.com/en/cql/3.3/cql/cql_reference/refLimits.html
[2] https://wiki.apache.org/cassandra/CassandraLimitations
A collection is a single column value, with:
each single value inside limited to 64 KB in size (the max value of an unsigned short)
the number of items in the collection limited to 64K (the max value of an unsigned short)
The 1 MB is a recommendation, not a hard limit; you can go higher if you need to, but as always, do your testing before production. And since you can have 2^16 items of 2^16 bytes each (4 GB in total), a maximally-filled collection would break the 2 GB limit per cell.
But collections should be kept small for performance reasons anyway as they are always read entirely. And updates to collections are not very fast either.
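To make that concrete, here is a minimal CQL sketch (keyspace, table and column names are made up): the collection column is read in its entirety on every query, and if it is expected to grow large, a common remodel is one clustering row per item.

    -- Hypothetical table with a collection: the whole set is read entirely
    -- on every SELECT, and each element is limited to 64 KB.
    CREATE TABLE ks.user_tags (
        user_id uuid PRIMARY KEY,
        tags set<text>
    );

    -- If the collection could grow large, one row per item avoids the
    -- collection limits altogether (hypothetical remodel):
    CREATE TABLE ks.user_tags_by_row (
        user_id uuid,
        tag text,
        PRIMARY KEY (user_id, tag)
    );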
Related
What is the max size for Text column value in cassandra 3.x?
I found this page with all the max limits of Cassandra, but the Text type is not in the list:
https://docs.datastax.com/en/cql-oss/3.x/cql/cql_reference/refLimits.html
It's covered as
single column value size: 2 GB (1 MB is recommended)
But it's not really recommended to keep large texts (and blobs) in Cassandra.
I see the following limits bandied about:
2 million cells per partition - is this per key, or is it the sum of all the cells for all the rows in that partition?
100 MB max partition size - is this the total space occupied by all the rows with the same partition key?
Is there a recommended maximum number of cells in a partition, and a limit on the amount of space occupied by one?
In my experience with Cassandra so far, I have used up to 8 columns in a partition key and everything has worked smoothly.
For the maximum limits, you can check the link below.
http://docs.datastax.com/en/cql/3.3/cql/cql_reference/refLimits.html
You didn't mention which version of CQL you are using; please add that, and you may get a more accurate answer.
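For illustration only (table and column names are hypothetical), a partition key made of several columns is declared by grouping them in the inner parentheses of PRIMARY KEY:

    -- Hypothetical composite partition key: region, site, device and day
    -- together form the partition key; ts is a clustering column.
    CREATE TABLE ks.readings (
        region text,
        site text,
        device text,
        day date,
        ts timestamp,
        value double,
        PRIMARY KEY ((region, site, device, day), ts)
    );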
Cassandra has published its technical limitations, but does not mention the maximum number of columns allowed. Is there a maximum number of columns? I need to store 400+ fields. Is this possible in Cassandra?
The maximum number of columns per row (or per set of rows, which is called a "partition" in Cassandra's CQL) is 2 billion (but the partition must also fit on a physical node, see the docs).
400+ fields is not a problem.
As per the Cassandra technical limitations page, the total number of cells (rows x columns) in a partition cannot exceed 2 billion.
You could have a table with 1 row x 2 billion columns, and no more rows would be allowed in that partition; so the limit is not 2 billion columns per row, but 2 billion cells in total per partition.
https://wiki.apache.org/cassandra/CassandraLimitations
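A rough sketch of how those cells add up (hypothetical table, illustrative arithmetic only): cells per partition are roughly the number of rows in the partition times the number of non-primary-key columns per row.

    -- Hypothetical: 4 regular columns per row, so a single device_id
    -- partition would reach ~2 billion cells at roughly 500 million rows
    -- (500M x 4).
    CREATE TABLE ks.events (
        device_id uuid,
        ts timestamp,
        a int,
        b int,
        c int,
        d int,
        PRIMARY KEY (device_id, ts)
    );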
Rajmohan's answer is technically correct. On the other hand, if you have 400 CQL columns, you most likely aren't optimizing your data model. You want to generate Cassandra wide rows using partition keys and clustering columns in CQL.
Moreover, you don't want rows that are too wide from a practical (performance) perspective. A conservative rule of thumb is to keep your partitions under hundreds of megabytes or hundreds of thousands of cells.
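As a hedged sketch of the wide-row idea (all names hypothetical): rather than declaring 400+ static columns, key each field by a clustering column so an entity's fields become rows inside its partition.

    -- Hypothetical: one CQL row per field; the 400+ fields of an entity
    -- become 400+ clustering rows inside that entity's partition.
    CREATE TABLE ks.entity_fields (
        entity_id uuid,
        field_name text,
        field_value text,
        PRIMARY KEY (entity_id, field_name)
    );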
Take a look at these two links to help wrap your head around this.
http://www.datastax.com/dev/blog/does-cql-support-dynamic-columns-wide-rows
http://www.sestevez.com/sestevez/CASTableSizer/
Is this 2 billion cells per partition limit still valid?
http://wiki.apache.org/cassandra/CassandraLimitations
Let's say you store 16 bytes on average per cell. Then you can "just" persist 16 * 2e9 bytes = 32 GB of data (plus column names) on one machine!?
Or, if you imagine a square table, you would be able to store 44,721 rows with 44,721 columns each!?
Doesn't really sound like Big Data.
Is this correct?
Thanks!
Malte
The 2 billion cell limit is still valid, and you most likely want to remodel your data if you start seeing that many cells per partition.
The maximum number of cells (rows x columns) in a single partition is 2 billion.
A partition is defined by the partition key in CQL and determines where a particular piece of data will live. For example, if I had two nodes with fictional ranges of 0-100 and 100-200, partition keys which hashed to between 0 and 100 would reside on the first node, and those with a hashed value between 100 and 200 would reside on the second node. In reality, Cassandra uses the Murmur3 algorithm to hash partition keys, generating values between -2^63 and 2^63-1.
The real limitation tends to be based on how many unique values you have for your partition key. If you don't have a good deal of uniqueness within a single column, many users combine columns to generate more uniqueness (a composite primary key).
http://www.datastax.com/documentation/cql/3.0/cql/cql_reference/create_table_r.html
More info on hashing and how C* holds data.
http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePartitionerAbout_c.html
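To see the hashing described above in action, a small sketch (keyspace and table names are assumptions): CQL's token() function returns the Murmur3 hash of the partition key, which is the value that places the row within a node's range.

    CREATE TABLE ks.users (
        user_id uuid PRIMARY KEY,
        name text
    );

    -- token(user_id) is the Murmur3 hash of the partition key,
    -- a 64-bit value in the range -2^63 .. 2^63 - 1.
    SELECT token(user_id), name FROM ks.users;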
I have about 28GB of Data-In for a little over 13.5 million rows stored in Windows Azure Table Storage.
6 Columns, all ints except 1 decimal and 1 datetime.
Partition Key is about 10 characters long.
RowKey is a guid.
This is for my sanity check--does this seem about right?
The Sql Database I migrated the data from has WAY more data and is only 4.9GB.
Is there a way to condense the size? I don't suspect renaming properties will put a huge dent in this.
*Note: this was only a sampling of data to estimate costs for the long haul.
Well... something doesn't seem to add up right.
Each property is a key/value pair, so include property names in your calculations.
The data itself is probably around 75-100 bytes including property names averaging 10 characters apiece. The 4 ints equate to 16 bytes, the decimal (double?) 8 bytes, and the timestamp 8 bytes. So let's just round up to 100 bytes per entity.
At 13.5 million entities you'd have 100 * 13.5 million bytes, or about 1.35 GB.
Your numbers are approx. an order of magnitude larger (about 2,000 bytes per entity). Even accounting for bulk from serialization, I don't see how you're getting such a large size. Just curious: how did you compute the current table size? And... have you done multiple tests, resulting in more data from previous runs? Are you measuring just the table size, or the total storage used in the storage account? If the latter, there may be other tables (such as diagnostics) also consuming space.
Renaming properties in the entities that are persisted should have some impact on the size. Unfortunately, that'll only apply to data saved in the future; existing data does not change just because you've renamed the properties.