How to increase the Cassandra row size above 64KB

I read that the maximum supported row size in Apache Cassandra is 64KB, but I need to save a record with a size of 560 KB. Is that possible?

Yes, it is possible if you store the data in a column value instead of a column key.
In Cassandra, the 64KB limitation is only for column keys, which determine the ordering of data in a partition. For column values, the size limitation is 2GB.
This page describes the difference between clustering columns (aka column keys) and regular columns (aka column values).
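For illustration, a minimal CQL sketch of that layout (the table and column names here are invented, not from the question):

```sql
-- Hypothetical table: the large payload lives in a regular column (a column value),
-- so the 64KB limit on key components does not apply to it.
CREATE TABLE records (
    record_id  uuid,
    created_at timestamp,
    payload    blob,       -- regular column: values may be up to 2GB, though far smaller is advisable
    PRIMARY KEY (record_id, created_at)
);
-- record_id (partition key) and created_at (clustering column, i.e. column key)
-- are the only parts subject to the 64KB limit.
```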

Related

cassandra: `sstabledump` output questions

I'm inspecting the output of sstabledump to gain a better understanding of the Cassandra data model, and I have some questions.
From the output of sstabledump it seems that:
a table is a list of partitions (split by partition key)
a partition is a list of rows (split according to clustering key)
a row is a map of key-value pairs, where the keys belong in a predefined list
Question 1: For each partition, as well as for each row inside a partition, there is a position key. What does this value correspond to? Physical storage details? And how exactly?
Question 2: Each row inside each partition has a type: row key-value pair. Could this type be anything else? If yes, what? If not:
why have a value that is always the same?
why is Cassandra classified as wide-column (and similar terms)? It looks more like a two-level row storage.
The partition's token is the Murmur3 hash of whatever you assigned as the partition key. Consistent hashing on that token determines which node in the cluster the partition, and its replicas, belongs to. Within each partition, data is sorted by clustering key, and then by cell name within the row. The structure is used so that redundant things, such as timestamps for cells inserted into a row at the same time, are written only once, as a vint delta sequence, to save space.
On disk the partitions are sorted in order of this hashed key. The position key in the output just refers to where in the sstable's data file the partition or row is located (as a decompressed byte offset). type can also identify that spot as a static block, which sits at the beginning of a partition and holds any static cells, or as a range tombstone marker (start or end bound). Note that sstabledump sometimes repeats values in the JSON for readability even when they are not physically written on disk (e.g. repeated timestamps).
You can have many of these rows inside a partition; a common data model for time series, for example, uses a timestamp as the clustering key, which produces very wide partitions with millions of rows. Pre-3.0, the data storage was closer to Bigtable's design: it was essentially a Map<byte[], SortedMap<byte[], Cell>>, where the Comparator of the sorted map was changed based on the schema. It did not differentiate rows and columns within a partition, which led to massive amounts of redundant data, so it was redesigned to fit the query language better.
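As a hedged illustration of that time-series pattern (an invented schema, not from the question):

```sql
-- Hypothetical time-series table: one partition per sensor, one row per reading.
CREATE TABLE sensor_readings (
    sensor_id    uuid,       -- partition key: its Murmur3 hash places the partition on a node
    reading_time timestamp,  -- clustering key: rows are kept sorted by this within the partition
    value        double,
    PRIMARY KEY (sensor_id, reading_time)
) WITH CLUSTERING ORDER BY (reading_time DESC);
```

A busy sensor can accumulate millions of such rows in a single, very wide partition.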
Some more references:
DataStax's explanation of the motivation for the 3.0 change
Blog post by TLP has a good detailed explanation of the new disk format
CASSANDRA-8099

What is the recommended max size for space occupied by cells of a partition key?

I see the following limits bandied about:
2 million cells per partition: is this per key, or is it the sum of all the cells for all the rows in that partition?
100MB maximum partition size: is this the total space occupied by all the rows with the same partition key?
Is there a recommended maximum number of cells in a partition key, and a limit on the amount of space occupied by one?
In my experience with Cassandra so far, I have used 8 columns in a partition key and everything has worked smoothly.
For the maximum limit conditions, you can check the link below.
http://docs.datastax.com/en/cql/3.3/cql/cql_reference/refLimits.html
You didn't mention which version of CQL you are using; please add that, and you may get more accurate information.
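For illustration only, a hedged sketch of what a multi-column (composite) partition key can look like in CQL (the table and columns are invented, not the poster's schema):

```sql
-- Hypothetical table whose partition key is composed of eight columns.
-- Everything inside the inner parentheses forms the partition key;
-- event_time is the clustering column.
CREATE TABLE events_by_day (
    tenant text, region text, app text, env text,
    year int, month int, day int, bucket int,
    event_time timestamp,
    payload text,
    PRIMARY KEY ((tenant, region, app, env, year, month, day, bucket), event_time)
);
```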

How to get blob size in Cassandra

Is there a way to get the size of the blob value in cassandra with CQL? Better yet, is there a way to get an average size of the blob column, especially with a condition?
Thanks!
I don't know of a way to do that in CQL. I assume you are interested in the original uncompressed size of the blob rather than the compressed size within Cassandra. I'd suggest adding an integer field to the table and store the size of the blob in it when you originally save the blob.
If you use that integer field as a clustering column, then you could do a range query on it to get the rows that have blobs in a certain size range. To get the average size of the blobs in a range, you could use CQL to retrieve the size column and then use Java/Python/etc. to calculate the average of the returned values.
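A minimal CQL sketch of that approach, with invented table and column names:

```sql
-- Hypothetical table: the application writes blob_size when it stores the blob,
-- and blob_size is a clustering column so it can be range-queried.
CREATE TABLE documents (
    bucket    int,
    blob_size int,
    doc_id    uuid,
    content   blob,
    PRIMARY KEY (bucket, blob_size, doc_id)
);

-- Rows whose blobs fall in a given size band (within one partition);
-- averaging the returned blob_size values is then done client-side.
SELECT blob_size FROM documents
 WHERE bucket = 1 AND blob_size >= 10000 AND blob_size < 100000;
```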

What are the maximum number of columns allowed in Cassandra

Cassandra published its technical limitations but did not mention the max number of columns allowed. Is there a maximum number of columns? I have a need to store 400+ fields. Is this possible in Cassandra?
The maximum number of columns per row (or a set of rows, which is called "partition" in Cassandra's CQL) is 2 billion (but the partition must also fit on a physical node, see docs).
400+ fields is not a problem.
As per the Cassandra technical limitations page, the total number of cells cannot exceed 2 billion per partition (rows × columns).
You could have a partition with 1 row × 2 billion columns, and no more rows would be allowed in that partition; so the limit is not 2 billion columns per row, the limit is on the total number of cells in a partition.
https://wiki.apache.org/cassandra/CassandraLimitations
Rajmohan's answer is technically correct. On the other hand, if you have 400 CQL columns, you most likely aren't optimizing your data model. You want to generate Cassandra wide rows using partition keys and clustering columns in CQL.
Moreover, you don't want rows that are too wide from a practical (performance) perspective. A conservative rule of thumb is to keep your partitions under hundreds of megabytes or hundreds of thousands of cells (a sketch of the wide-row idea follows the links below).
Take a look at these two links to help wrap your head around this.
http://www.datastax.com/dev/blog/does-cql-support-dynamic-columns-wide-rows
http://www.sestevez.com/sestevez/CASTableSizer/
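As a hedged illustration of the wide-row idea (invented names, not the poster's schema), the 400+ fields could be pivoted into clustering columns instead of 400+ static CQL columns:

```sql
-- Hypothetical "vertical" layout: one row per (entity, field) pair
-- instead of one row with 400+ columns.
CREATE TABLE entity_fields (
    entity_id   uuid,
    field_name  text,   -- clustering column: each field becomes its own row
    field_value text,
    PRIMARY KEY (entity_id, field_name)
);

-- All fields of one entity come back as a single wide partition.
SELECT field_name, field_value
  FROM entity_fields
 WHERE entity_id = 123e4567-e89b-12d3-a456-426614174000;
```

Whether this layout fits depends on how the 400+ fields are typed and queried; it is only a sketch of the pattern the links above describe.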

Cassandra Cell Number Limitation

is this 2 billion cells per partition limit still valid?
http://wiki.apache.org/cassandra/CassandraLimitations
Let's say you save 16 bytes on average per cell. Then you can "just" persist 16 × 2e9 bytes = 32 GB of data (plus column names) on one machine!?
Or, if you imagine a square table, you would be able to store 44721 rows with 44721 columns each!?
Doesn't really sound like Big Data.
Is this correct?
Thanks!
Malte
The 2 billion cell limit is still valid, and you most likely want to remodel your data if you start seeing that many cells per partition.
The maximum number of cells (rows x columns) in a single partition is 2 billion.
A partition is defined by the partition key in CQL and determines where a particular piece of data will live. For example, suppose I had two nodes with fictional token ranges of 0-100 and 100-200: partition keys which hashed to between 0 and 100 would reside on the first node, and those with a hashed value between 100 and 200 would reside on the second node. In reality, Cassandra uses the Murmur3 algorithm to hash partition keys, generating values between -2^63 and 2^63-1.
The real limitation tends to be based on how many unique values you have for your partition key. If you don't have a good deal of uniqueness within a single column, many users combine columns to generate more uniqueness (composite primary key).
http://www.datastax.com/documentation/cql/3.0/cql/cql_reference/create_table_r.html
More info on hashing and how C* holds data.
http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePartitionerAbout_c.html
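As a hedged sketch (table and values invented), CQL's token() function shows the Murmur3 value that places a partition on the ring, and a composite partition key combines columns for more uniqueness:

```sql
-- Hypothetical table with a composite partition key (country, city).
CREATE TABLE visits_by_city (
    country    text,
    city       text,
    visit_time timestamp,
    visitor    uuid,
    PRIMARY KEY ((country, city), visit_time)
);

-- token() returns the Murmur3 hash (between -2^63 and 2^63-1)
-- used to decide which node owns each partition.
SELECT token(country, city), country, city
  FROM visits_by_city
 LIMIT 5;
```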

Resources