I need to check which tables are empty in Cassandra over a few keyspaces and there are over 20 tables. I could do a count on every single table but that's a little troublesome...
Is there a way to see the counts for every single table across the different keyspaces without typing all the 20+ queries? I have the tables in a comma-delimited list if that helps.
Edit: I used python to help with this but am interested in a Cassandra solution.
You could also try it at the command line level with nodetool tablestats:
» bin/nodetool tablestats stackoverflow | grep "Table\:\|partitions"
Table: cart_product
Number of partitions (estimate): 1
Table: keyvalue
Number of partitions (estimate): 0
Table: last_message_by_group
Number of partitions (estimate): 2
Table: mytable
Number of partitions (estimate): 5
Table: temps_by_item
Number of partitions (estimate): 2
Table: users
Number of partitions (estimate): 1
Granted, this reflects only the node that the command is run on. But you should be able to tell whether a table is empty from this, or from several of the other statistics available in the tablestats output.
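Since the question mentions Python, here is a minimal sketch of how the filtered output above could be scanned for tables whose partition estimate is zero. The function names are made up for illustration, and it assumes the text has exactly the "Table: ..." / "Number of partitions (estimate): ..." line format shown above. Like the command itself, it only describes the node that produced the output, and the figure is an estimate.

```python
import re

def partition_estimates(tablestats_text):
    """Map each table name to its 'Number of partitions (estimate)' value."""
    estimates = {}
    current_table = None
    for raw_line in tablestats_text.splitlines():
        line = raw_line.strip()
        table_match = re.match(r"Table: (\S+)", line)
        if table_match:
            current_table = table_match.group(1)
            continue
        count_match = re.match(r"Number of partitions \(estimate\): (\d+)", line)
        if count_match and current_table is not None:
            estimates[current_table] = int(count_match.group(1))
    return estimates

def possibly_empty_tables(tablestats_text):
    """Return the tables whose partition estimate is 0 on this node."""
    estimates = partition_estimates(tablestats_text)
    return sorted(name for name, count in estimates.items() if count == 0)

# The filtered output from the answer above, pasted in as sample input.
sample = """\
Table: cart_product
Number of partitions (estimate): 1
Table: keyvalue
Number of partitions (estimate): 0
Table: last_message_by_group
Number of partitions (estimate): 2
Table: mytable
Number of partitions (estimate): 5
Table: temps_by_item
Number of partitions (estimate): 2
Table: users
Number of partitions (estimate): 1
"""

print(possibly_empty_tables(sample))  # ['keyvalue']
```

To cover several keyspaces you would run the command once per keyspace and feed each output through the same function, since table names may repeat across keyspaces.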
Related
How can I know over how many partitions in a DolphinDB database that a table is distributed? For example, if I created a database with 100 partitions and a table in the database only has data in 4 partitions, how do I get the number of 4?
This will do:
sqlDS(<select * from t>).size()
I have a table with around 4M partitions, and each partition contains 4 rows, so the table holds about 16M rows in total (wide columns). Since our table is a time series database, we only need the latest row (version) for each partition key. I can get the desired result with the query below. However, this will add load on the cluster and be time consuming. Is there a better way to achieve this, or is this the only way?
SELECT some_value FROM some_table PER PARTITION LIMIT 1;
Using PER PARTITION LIMIT won't have an impact on performance. In fact, it's an efficient way to get what you need from each partition, since only the first row is returned and Cassandra doesn't need to iterate over the other rows in the partition. Cheers!
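To make the semantics concrete, here is a small pure-Python simulation of what PER PARTITION LIMIT returns. It does not talk to Cassandra; the rows, column layout and function name are invented for illustration. It assumes rows arrive grouped by partition key and already sorted in clustering order, so "first row" means "latest row" only if the table's clustering order puts the newest row first (for example a descending timestamp).

```python
from itertools import groupby, islice
from operator import itemgetter

# Hypothetical rows: (partition_key, version, some_value), grouped by
# partition key and sorted newest-first within each partition.
rows = [
    ("sensor_a", 3, 21.5),
    ("sensor_a", 2, 21.0),
    ("sensor_a", 1, 20.4),
    ("sensor_b", 5, 18.2),
    ("sensor_b", 4, 18.0),
]

def per_partition_limit(sorted_rows, limit):
    """Keep at most `limit` leading rows from each partition."""
    result = []
    for _, partition_rows in groupby(sorted_rows, key=itemgetter(0)):
        # islice stops after `limit` rows of this partition.
        result.extend(islice(partition_rows, limit))
    return result

for row in per_partition_limit(rows, 1):
    print(row)
```

With a limit of 1 this prints one row per partition, which is the first row in clustering order for each key.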
I am new to Cassandra. As I understand it, depending on the configured partitioner (Murmur3Partitioner or RandomPartitioner) there is a limit on the number of partitions per table. If we configure the cluster with Murmur3Partitioner, that would enforce a limit of 2^63 partitions per table. When inserting a row, if the insertion tries to create a new partition beyond that limit, it would fail (that is, if I have more than 2^63 unique row key combinations per table).
Can anyone please clarify whether my understanding of the partition limit on a column family is correct?
Also, as I understand it, there is no way to increase the partition limit, even by adding nodes to the cluster. Please correct me if I am wrong.
The range of values for the Murmur3 partitioner is actually -2^63 to +2^63-1. That's a massive number. You aren't going to run out of values in any practical sense. No worries.
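To put that range in perspective, here is a quick back-of-the-envelope calculation. The insert rate of one million new partition keys per second is a made-up figure chosen only for illustration.

```python
# Token range of the Murmur3 partitioner, as stated above.
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1

# Number of distinct token values in the inclusive range.
token_count = MAX_TOKEN - MIN_TOKEN + 1
print(token_count == 2**64)  # True

# Hypothetical workload: one million brand-new partition keys per second.
new_partitions_per_second = 1_000_000
seconds_per_year = 365 * 24 * 60 * 60

years_to_cover_range = token_count / (new_partitions_per_second * seconds_per_year)
print(f"{years_to_cover_range:,.0f} years")  # about 585,000 years
```

Even at that rate it would take several hundred thousand years to produce as many partition keys as there are token values.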
Cassandra published its technical limitations but did not mention the max number of columns allowed. Is there a maximum number of columns? I have a need to store 400+ fields. Is this possible in Cassandra?
The maximum number of columns per row (or a set of rows, which is called "partition" in Cassandra's CQL) is 2 billion (but the partition must also fit on a physical node, see docs).
400+ fields is not a problem.
As per the Cassandra technical limitations page, the total number of cells in a partition (rows x columns) cannot exceed 2 billion.
You can have a partition with 1 row x 2 billion columns, and then no more rows will be allowed in that partition. So the limit is not 2 billion columns per row; it is a limit on the total number of cells in a partition.
https://wiki.apache.org/cassandra/CassandraLimitations
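As a rough illustration of the rows x columns arithmetic, the sketch below works out how many rows fit in one partition for a given column count under the 2 billion cell limit. It treats every row as having exactly one cell per column, which is a simplification, and the function name is made up.

```python
# Cell limit per partition, from the limitations page cited above.
MAX_CELLS_PER_PARTITION = 2_000_000_000

def max_rows_per_partition(columns_per_row):
    """Upper bound on rows in one partition if every row fills every column."""
    if columns_per_row <= 0:
        raise ValueError("columns_per_row must be positive")
    return MAX_CELLS_PER_PARTITION // columns_per_row

print(max_rows_per_partition(400))            # 5000000
print(max_rows_per_partition(2_000_000_000))  # 1
```

So with 400 columns, the hard limit allows on the order of five million rows in a single partition.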
Rajmohan's answer is technically correct. On the other hand, if you have 400 CQL columns, you most likely aren't optimizing your data model. You want to generate cassandra wide rows using partition keys and clustering columns in CQL.
Moreover, you don't want rows that are too wide from a practical (performance) perspective. A conservative rule of thumb is to keep your partitions under a few hundred megabytes or a few hundred thousand cells.
Take a look at these two links to help wrap your head around this.
http://www.datastax.com/dev/blog/does-cql-support-dynamic-columns-wide-rows
http://www.sestevez.com/sestevez/CASTableSizer/
I am running 'nodetool cfstats' and it returns a list of keyspaces with the cfstats for each column family on that node/machine. The cfstats results include an SSTable count for each column family. My question is: will the SSTable count for a column family be the same across nodes, especially for those CFs whose SSTable count is 0? The reason I ask is that if the SSTable count for a column family is 0, then it should be safe to drop that column family.
The cfstats output is per node, so is only valid for the node that nodetool connected to. To get the total SSTable count, you will need to sum them across all nodes. A column family with no SSTables on any node is empty.
Because the counts are per node, they will be inflated by replication. For example, the sum of all key counts across nodes will be roughly the actual key count multiplied by the replication factor.
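A minimal sketch of that aggregation, assuming you have already collected the per-node numbers (for example by running cfstats against each node). The node names, figures and function names are hypothetical. Note that the SSTable count only covers data flushed to disk, so this also assumes nothing is still sitting in memtables.

```python
def total_sstable_count(sstables_per_node):
    """Sum the SSTable counts reported by every node."""
    return sum(sstables_per_node.values())

def is_empty_on_all_nodes(sstables_per_node):
    """True only if every node reports zero SSTables for the column family."""
    return all(count == 0 for count in sstables_per_node.values())

def approximate_key_count(keys_per_node, replication_factor):
    """Undo the replication inflation in the summed per-node key estimates."""
    return sum(keys_per_node.values()) / replication_factor

# Hypothetical per-node figures for one column family.
sstables = {"node1": 0, "node2": 0, "node3": 0}
keys = {"node1": 300, "node2": 310, "node3": 290}

print(total_sstable_count(sstables))    # 0
print(is_empty_on_all_nodes(sstables))  # True
print(approximate_key_count(keys, replication_factor=3))  # 300.0
```

A zero on one node alone proves nothing; only the check across all nodes says anything about the column family as a whole.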