I have a table with around 4M of partitions and each partition contains 4 rows. So, the total data in table would be having 16M rows (wide columns). Since our table is a time series database, we only need the latest row or version of the partition_key. I can achieve my desired results through below query. However this will impact load on clusters and time consuming. Would like to see if we have any other best way to achieve this or this is the only way.
SELECT some_value FROM some_table PER PARTITION LIMIT 1;
Using PER PARTITION LIMIT won't have an impact on performance. In fact, it's efficient for achieving what you need from each partition since only the first row will be returned and it doesn't to iterate over the other rows in the partition. Cheers!
Related
I am working on the keyspace and tables for a Cassandra environment. I understand the size limitations of Cassandra and dealing with Partition keys to keep it optimized. However, I am having a disagreement with a developer regarding how to handle the keys. Is there any downside in having a key that would include a large number of data rather than a small amount of data. For example,
I have 100k records. I can create a key that will partition this into 10k; I could also create a key that will partition this into 10 records (by day). So either I store 10k and 10 partitions or 10 records and 10,000 partitions.
Keep in mind that having more columns in the key requires you to specify those columns in your select statements, which sometimes isn't desired. The more partitions the better - whether by picking a better single column or having multiple columns.
Cassandra reads data via the partition key, and can get help with performance if clustering columns are used. If you have a large partition, the entire partition must be read (memory and disk) and then merged for the output. If you have large partitions, this will definitely slow you down.
I totally understand the count(*) from table where partitionId = 'test' will return the count of the rows. I could see that it takes the same time as select * from table where partitionId = 'test.
Is there any other alternative in Cassandra to retrieve the count of the rows in an efficient way?
You can compare results of select * & select count(*) if you run cqlsh, and enable tracing there with tracing on command - it will print time that is required for execution of corresponding command. The difference between both queries is only in what amount of data should be returned back.
But anyway, to find number of rows Cassandra needs to hit SSTable(s), and scan entries - performance could be different if you have partition spread between multiple SSTables - this may depend on your compaction strategy for tables, that is selected based on your reading/writing patterns.
As Alex Ott mentioned, the COUNT(*) needs to go through the entire partition to know that total.
The fact is that Cassandra wants to avoid locks and as a result they do not maintain a number of row in their sstables and each time you do an INSERT, UPDATE, or DELETE, you may actually overwrite another entry which is just marked as a tombstone (i.e. it's not an in place overwrite, instead it saves the new data at the end of the sstable and marks the old data as dead).
The COUNT(*) will go through the sstables and count all the entries not marked as a tombstone. That's very costly. We're used to SQL having the total number of rows in a table or an index so COUNT(*) on those is instantaneous... not here.
One solution I've used is to have Elasticsearch installed on your Cassandra cluster. One of the parameters Elasticsearch saves in their stats is the number of rows in a table. I don't remember the exact query, but more or less you can just a count request and you get a result in like 100ms, always, whatever the number is. Even in the 10s of millions of rows. Just like with a SELECT COUNT(*) ... the result will always be an approximation if you have many writes happening in parallel. It will stabilize if the writes stop for long enough (possibly about 1 or 2 seconds).
I have a log table in cassandra, and now I want to search the rows count of the table.
First, I use the select count(*) from log,but it's very, very slow.
Then I want to use the counter type, and then the problem is coming. My table is a TTL table, all rows keep an hour, use the counter type become very difficult.
Cassandra isn't efficient for doing table scan operations. It is good at ingesting high volumes of data and then accessing small slices of that data rather than the whole table.
So if you want to count keys without using a counter, you need to break the table into chunks of data that are small enough to be processed quickly. For example if you want to use count(*), you should only use it on a single partition, and keep the partition size below about 100,000 rows.
In your case you might want to partition your data by hour (or something small like 5 minute intervals if you insert a lot of log lines per second).
Be careful with using a TTL of an hour if you are inserting a lot of data continuously since it could cause a lot of tombstones. To avoid building up tombstones you should delete each hour partition after the hour has passed.
I am new to cassandra, from the documentation it is found that there is limit of 2 billion cells per partition. Can we configure this limit in cassandra. what could be impact on performance if we increase this limit.
Can any please help me on this.
As far as I know this limit cannot be configured and is likely an addressing limitation.
You wouldn't typically want to use that many cells in a single partition, but would instead want to spread your data load out across many partitions so that multiple Cassandra nodes would share the load.
For example, if you are collecting time series data, you might want to add a date field to the partition key so that each day would use a different partition.
I am new to cassandra, As per my understanding depending on the configured partitioner(murmur3partitioner or randomaccess partitioner) there is a partitions limit per table. if we configure keyspace with murmur3partitioner which would enforce the partitions limit of 2^63 partitions per table. while inserting the row, if the new insertion tries to create new partition beyond the limit, the insertion would fail(means if I get unique combinations of row keys more than 2^63 per table).
Can anyone please clarify, Is my understanding about partitions limit on column family is correct ?
And also as per my understanding there is no way to increase the partitions limit even by adding nodes into the cluster, please correct me if I am wrong.
The range of values for the murmur3 partitioner is actually -2^63 to +2^63-1 That's a massive number. You aren't going to run out of values in any practical sense. No worries.