Cassandra published its technical limitations but did not mention the max number of columns allowed. Is there a maximum number of columns? I have a need to store 400+ fields. Is this possible in Cassandra?
The maximum number of columns per row (or a set of rows, which is called "partition" in Cassandra's CQL) is 2 billion (but the partition must also fit on a physical node, see docs).
400+ fields is not a problem.
As per Cassandra technical limitation page, total no. of cells together cannot exceed 2 billion cells (rows X columns).
You can have a table with (1 row X 2 billion columns) and no more rows will be allowed in that table, so the limit is not 2 billion columns per row but limit is on total no. of cells in a partition.
https://wiki.apache.org/cassandra/CassandraLimitations
Rajmohan's answer is technically correct. On the other hand, if you have 400 CQL columns, you most likely aren't optimizing your data model. You want to generate cassandra wide rows using partition keys and clustering columns in CQL.
Moreover, you don't want to have rows that are too wide from a practical (performance) perspective. A conservative rule of thumb is keep your partitions under the 100's of megs or 100,000's of cells.
Take a look at these two links to help wrap your head around this.
http://www.datastax.com/dev/blog/does-cql-support-dynamic-columns-wide-rows
http://www.sestevez.com/sestevez/CASTableSizer/
Related
I have a table with around 4M of partitions and each partition contains 4 rows. So, the total data in table would be having 16M rows (wide columns). Since our table is a time series database, we only need the latest row or version of the partition_key. I can achieve my desired results through below query. However this will impact load on clusters and time consuming. Would like to see if we have any other best way to achieve this or this is the only way.
SELECT some_value FROM some_table PER PARTITION LIMIT 1;
Using PER PARTITION LIMIT won't have an impact on performance. In fact, it's efficient for achieving what you need from each partition since only the first row will be returned and it doesn't to iterate over the other rows in the partition. Cheers!
I am basically substituting for another programmer.
Problem Description:
There are 11 hive tables each has 8 to 11 columns. All these tables have around 5 columns whose names are similar but hold different values.
For example Table A has mobile_no, date, duration columns so has Table B. But values are not same. other columns have different names table wise.
In all tables, Data types are string, integer, double I.e. simple data types. String data has a maximum 100 characters.
Each Table contains around 50 millions of data. I have requirement to merge these 11 table taking their columns as it is and make one big table.
Our spark cluster has 20 physical server, each has 36 cores (if count virtualization then 72), RAM 512 GB each. Spark version 2.2.x
I have to merge those with both memory & speed wise efficiently.
Can you guys, help me regarding this problem?
N.B: please let me know if you have questions
We are evaluating if we can migrate from SQL SERVER to cassandra for OLAP. As per the internal storage structure we can have wide rows. We almost need to access data by the date. We often need to access data within date range as we have financial data. If we use date as Partition key to support filter by date,we end up having less row with huge number of columns.
Will it hamper performance if we have millions of columns for a single row key in future as we process millions of transactions every day.
Do we need to have some changes in the access pattern to have more rows with less number of columns per row.
Need some performance insight to proceed in either direction
Using wide rows is typically fine with Cassandra, there are however a few things to consider:
Ensure that you don't reach the 2 billion column limit in any case
The whole wide row is stored on the same node: it needs to fit on the disk. Also, if you have some dates that are accessed more frequently then other dates (e.g. today) then you can create hotspots on the node that stores the data for that day.
Very wide rows can affect performance however: Aaron Morton from The Last Pickle has an interesting article about this: http://thelastpickle.com/blog/2011/07/04/Cassandra-Query-Plans.html
It is somewhat old, but I believe that the concepts are still valid.
For a good table design decision one needs to know all typical filter conditions. If you have any other fields you typically filter for as an exact match, you could add them to the partition key as well.
is this 2 billion cells per partition limit still valid?
http://wiki.apache.org/cassandra/CassandraLimitations
Let's say you save 16 bytes on average per cell. Then you "just" can persist 16*2e9 bytes = 32 GB of data (plus column name) on one machine!?
Or if you imagine a quadratic table you will be able to store 44721 rows with 44721 columns each!?
Doesn't really sound like Big Data.
Is this correct?
Thanks!
Malte
The 2 billion cell limit is still valid and you most likly want to remodel your data if you start seeing that many cells per partition.
The maximum number of cells (rows x columns) in a single partition is
2 billion.
A partition is defined by they partition key in CQL and will define where a particular piece of data will live. For example if I had two nodes with a fictional range of 0-100 and 100-200. Partition keys which hashed to between 0 and 100 would reside on the first node and those with hashed value of between 100 and 200 would reside on the second node. In reality Cassandra uses the Murmur3 algorithm to hash primary keys generating values between -2^63 and 2^63-1.
The real limitation tends to be based on how many unique values you have for your partition key. If you don't have a good deal of uniqueness within a single column many users combine columns to generate more uniqueness(composite primary key).
http://www.datastax.com/documentation/cql/3.0/cql/cql_reference/create_table_r.html
More info on hashing and how C* holds data.
http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePartitionerAbout_c.html
Is there any limit on the number of columns in cassandra? I am thinking of using a unix timestamp (converted to TimeUUID) as the column key. In the worst case, I will end up having 86400 columns per row. Is this a good idea?
Having 86.400 columns per row is piece of cake for cassandra as long your columns are not too big and you don't retrieve all of them.
The maximum of column per row is 2 billion.
See http://wiki.apache.org/cassandra/CassandraLimitations
A suggestion: For column name you should use Integer data serialization, which would take just 4 bytes for 1 second precision instead of using UUID (16 bytes); as long as your timestamps are all unique and 1s precision is enough.
Column names are sorted and you can use unix time as Integer. With this you can have fast lookups on columns.
There is also timestamp associated with each column, which can be useful to set in some cases. You cannot query on it, but may provide you additional information if needed.
Assuming you're doing that for a good reason, it's totally fine.