Select count(*) unstable on wide rows - Cassandra 2.1.2

I'm running a 4-node Cassandra 2.1.2 cluster (6 cores and 32 GB RAM per machine).
I have two similar tables with about 650K rows each. The rows are pretty wide: 150K columns.
On the first table, running select count(*) from cqlsh returns the same result every time (the actual number of rows), but on the second table I get completely different values from run to run.
The only difference between the two tables is that the second table has a column containing a collection (a list of 3 doubles), whereas in the first table that column holds a single double.
There is no data being inserted into the tables, and there are no compactions going on.
The row cache is disabled.
Any ideas on how to fix this?
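
Not a fix, but one way to cross-check what cqlsh reports is to page through the table with a small fetch size and count rows client-side. A minimal sketch using the DataStax Python driver; the contact point, keyspace, table, and column names are placeholders:

from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])          # placeholder contact point
session = cluster.connect("my_keyspace")  # placeholder keyspace

# Stream the table in small pages and count rows client-side; the
# driver fetches the next page transparently while iterating.
stmt = SimpleStatement("SELECT key FROM my_table", fetch_size=1000)
print(sum(1 for _ in session.execute(stmt)))
cluster.shutdown()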

Related

Cassandra (DSE) - Need suggestion on using PER PARTITION LIMIT on huge data

I have a table with around 4M partitions, and each partition contains 4 rows, so the table holds about 16M rows in total (wide columns). Since our table is a time-series database, we only need the latest row (version) for each partition_key. I can get the result I want with the query below, but it puts load on the cluster and is time-consuming. I'd like to know whether there is a better way to achieve this, or whether this is the only way.
SELECT some_value FROM some_table PER PARTITION LIMIT 1;
Using PER PARTITION LIMIT won't hurt performance. In fact, it's efficient for getting what you need from each partition, since only the first row is returned and Cassandra doesn't need to iterate over the other rows in the partition. Cheers!
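For what it's worth, PER PARTITION LIMIT 1 returns the first row of each partition in clustering order, so it only yields the latest version if the clustering order puts the newest row first. A minimal sketch with placeholder keyspace and names, using the DataStax Python driver (PER PARTITION LIMIT requires Cassandra 3.6+ or a recent DSE):

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])          # placeholder contact point
session = cluster.connect("my_keyspace")  # placeholder keyspace

# Clustering on version_ts DESC stores the newest row first in each
# partition, so PER PARTITION LIMIT 1 returns the latest version.
session.execute("""
    CREATE TABLE IF NOT EXISTS some_table (
        partition_key text,
        version_ts    timestamp,
        some_value    text,
        PRIMARY KEY ((partition_key), version_ts)
    ) WITH CLUSTERING ORDER BY (version_ts DESC)
""")

rows = session.execute(
    "SELECT some_value FROM some_table PER PARTITION LIMIT 1")
cluster.shutdown()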

Cassandra: Does SELECT COUNT(*) differ between versions 2.x and 3.x?

I'm migrating data from a Cassandra cluster on version 2.2.4 to one on 3.11.3 by exporting the table as a CSV file and using it to create a new table in the new cluster. I'm using SELECT COUNT(*) to verify that the data has been copied over correctly, but am seeing a discrepancy in the number of rows. Could this be because of the difference in versions? Is there anything else that would explain it? Thanks!
Here are the steps I'm running through:
SELECT COUNT(*) FROM table_cass2
count
-------
7951
(1 rows)
COPY table_cass2 TO '/tmp/table.csv'
COPY table_cass3 FROM '/tmp/table.csv'
Using 15 child processes
Starting copy of <table> with columns [..].
Processed: 7951 rows; Rate: 3741 rows/s; Avg. rate: 6045 rows/s
7951 rows imported from 1 files in 1.315 seconds (0 skipped).
SELECT COUNT(*) FROM table_cass3
count
-------
7919
(1 rows)
To answer my own question, someone else on my team confirmed that it is normal for there to be a small but consistent difference in results for SELECT COUNT(*) queries between different instances of Cassandra.
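If a discrepancy like this needs to be pinned down rather than accepted, one option is to diff the primary keys of the two tables instead of trusting COUNT(*). A minimal sketch, assuming the DataStax Python driver; contact points, keyspace, table, and key-column names are placeholders:

from cassandra.cluster import Cluster

def fetch_keys(contact_point, keyspace, table, key_column):
    # Collect every primary-key value of the table into a set.
    cluster = Cluster([contact_point])
    session = cluster.connect(keyspace)
    keys = {row[0] for row in session.execute(
        "SELECT {} FROM {}".format(key_column, table))}
    cluster.shutdown()
    return keys

old = fetch_keys("10.0.0.1", "ks", "table_cass2", "id")  # placeholders
new = fetch_keys("10.0.0.2", "ks", "table_cass3", "id")
print("rows missing after import:", old - new)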

Merge very large hive Tables (11 to be precise) using Spark

I am basically substituting for another programmer.
Problem Description:
There are 11 Hive tables, each with 8 to 11 columns. All of these tables have around 5 columns whose names are similar but whose values differ.
For example, Table A has mobile_no, date, and duration columns, and so does Table B, but the values are not the same. The other columns have different names from table to table.
In all tables the data types are string, integer, and double, i.e. simple data types. String data has a maximum of 100 characters.
Each table contains around 50 million rows. My requirement is to merge these 11 tables, taking their columns as they are, into one big table.
Our Spark cluster has 20 physical servers, each with 36 cores (72 if you count virtual cores) and 512 GB RAM. The Spark version is 2.2.x.
I have to merge the tables in a way that is efficient in terms of both memory and speed.
Can you help me with this problem?
N.B.: please let me know if you have any questions.
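For what it's worth, here is a minimal PySpark sketch of one way to do such a merge: pad each table with NULLs for the columns it lacks, align the column order, and union everything. The table names are placeholders; note that union() in Spark 2.2 matches columns by position, which is why the explicit select is needed:

from functools import reduce
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

tables = ["db.table_a", "db.table_b"]  # placeholders for the 11 table names

dfs = [spark.table(t) for t in tables]

# Union of all column names across the tables, in first-seen order.
all_cols = []
for df in dfs:
    for c in df.columns:
        if c not in all_cols:
            all_cols.append(c)

# Pad missing columns with NULLs (NullType widens to the other side's
# type during the union) and align the column order.
def align(df):
    return df.select([col(c) if c in df.columns
                      else lit(None).alias(c) for c in all_cols])

merged = reduce(lambda a, b: a.union(b), [align(df) for df in dfs])
merged.write.mode("overwrite").saveAsTable("db.merged_all")  # placeholder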

Total row count in Cassandra

I understand that count(*) from table where partitionId = 'test' will return the row count. I can see that it takes about the same time as select * from table where partitionId = 'test'.
Is there any alternative in Cassandra to retrieve the count of the rows more efficiently?
You can compare the results of select * and select count(*) if you run cqlsh and enable tracing with the tracing on command; it will print the time required to execute each command. The only difference between the two queries is the amount of data that has to be returned.
But either way, to find the number of rows Cassandra needs to hit the SSTable(s) and scan the entries. Performance may differ if a partition is spread across multiple SSTables; this depends on the table's compaction strategy, which is chosen based on your read/write patterns.
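The same timing information is available from the Python driver, for reference; a minimal sketch with placeholder contact point, keyspace, and table names:

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])          # placeholder contact point
session = cluster.connect("my_keyspace")  # placeholder keyspace

# trace=True asks the coordinator to record a query trace, the same
# data cqlsh shows after TRACING ON.
result = session.execute("SELECT count(*) FROM my_table", trace=True)
trace = result.get_query_trace()
print("coordinator:", trace.coordinator)
print("duration:", trace.duration)  # a datetime.timedelta
cluster.shutdown()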
As Alex Ott mentioned, the COUNT(*) needs to go through the entire partition to know that total.
The fact is that Cassandra wants to avoid locks, so it does not maintain a row count in its SSTables. Each time you do an INSERT, UPDATE, or DELETE, you may actually shadow another entry, which is simply marked as dead (i.e., it's not an in-place overwrite; the new data is written separately and the old data is flagged as obsolete).
COUNT(*) therefore has to go through the SSTables and count all the entries not marked as tombstones. That's very costly. We're used to SQL databases keeping the total number of rows of a table or index at hand, so COUNT(*) on those is instantaneous... not here.
One solution I've used is to have Elasticsearch installed alongside your Cassandra cluster. One of the statistics Elasticsearch keeps is the number of rows in a table. I don't remember the exact query, but more or less you can just issue a count request and get a result in around 100 ms, always, whatever the number is, even in the tens of millions of rows. Just as with SELECT COUNT(*), the result will always be an approximation if you have many writes happening in parallel. It will stabilize if the writes stop for long enough (possibly a second or two).

What is the maximum number of columns allowed in Cassandra?

Cassandra published its technical limitations but did not mention the maximum number of columns allowed. Is there a maximum number of columns? I need to store 400+ fields. Is this possible in Cassandra?
The maximum number of columns per row (or a set of rows, which is called "partition" in Cassandra's CQL) is 2 billion (but the partition must also fit on a physical node, see docs).
400+ fields is not a problem.
As per the Cassandra technical limitations page, the total number of cells (rows × columns) in a partition cannot exceed 2 billion.
You could have a single row with 2 billion columns, and no more rows would be allowed in that partition; so the limit is not 2 billion columns per row, but rather 2 billion cells per partition in total.
https://wiki.apache.org/cassandra/CassandraLimitations
Rajmohan's answer is technically correct. On the other hand, if you have 400 CQL columns, you most likely aren't optimizing your data model. You want to generate Cassandra wide rows using partition keys and clustering columns in CQL.
Moreover, you don't want rows that are too wide from a practical (performance) perspective. A conservative rule of thumb is to keep partitions under hundreds of megabytes, or hundreds of thousands of cells.
Take a look at these two links to help wrap your head around this.
http://www.datastax.com/dev/blog/does-cql-support-dynamic-columns-wide-rows
http://www.sestevez.com/sestevez/CASTableSizer/
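
As a back-of-the-envelope illustration of that rule of thumb (the rows-per-partition figure below is an assumption for the example, not from the question):

# Cells per partition = rows per partition x columns per row.
columns_per_row = 400
rows_per_partition = 1000   # illustrative assumption

cells = columns_per_row * rows_per_partition
print(cells)  # 400000: hundreds of thousands of cells, at the practical
              # ceiling even though far below the 2-billion hard limit.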

Resources