Cassandra: Does SELECT COUNT(*) differ between versions 2.x and 3.x?


I'm migrating data from a Cassandra cluster on version 2.2.4 to one on 3.11.3 by exporting the table as a CSV file and using it to populate a new table in the new cluster. I'm using SELECT COUNT(*) to verify that the data has been copied over correctly, but I'm seeing a discrepancy in the number of rows. Could this be because of the difference in versions? Is there anything else that would explain it? Thanks!
Here are the steps I'm running through:
SELECT COUNT(*) FROM table_cass2
count
-------
7951
(1 rows)
COPY table_cass2 TO '/tmp/table.csv'
COPY table_cass3 FROM '/tmp/table.csv'
Using 15 child processes
Starting copy of <table> with columns [..].
Processed: 7951 rows; Rate: 3741 rows/s; Avg. rate: 6045 rows/s
7951 rows imported from 1 files in 1.315 seconds (0 skipped).
SELECT COUNT(*) FROM table_cass3
count
-------
7919
(1 rows)

To answer my own question, someone else on my team confirmed that it is normal for there to be a small but consistent difference in results for SELECT COUNT(*) queries between different instances of Cassandra.
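One check that does not rely on COUNT(*) on either cluster is to count the distinct primary keys in the exported file, since rows that share a primary key overwrite each other on import. The following is a minimal sketch, assuming the CSV has no header row (the cqlsh COPY TO default) and that the positions of the primary-key columns are known. The sample data and the key_columns values are hypothetical, and this check alone does not establish the cause of the discrepancy above.

```python
import csv
import io
from collections import Counter


def primary_key_counts(csv_file, key_columns):
    """Count how often each primary-key tuple occurs in a headerless CSV export."""
    counts = Counter()
    for row in csv.reader(csv_file):
        if row:  # skip blank lines
            counts[tuple(row[i] for i in key_columns)] += 1
    return counts


# Hypothetical sample: columns 0 and 1 form the primary key.
sample = io.StringIO("1,a,x\n2,b,y\n1,a,z\n")
counts = primary_key_counts(sample, key_columns=[0, 1])

distinct_keys = len(counts)
repeated = {key: n for key, n in counts.items() if n > 1}
print(distinct_keys, repeated)
```

For a real export, open('/tmp/table.csv', newline='') would replace the in-memory sample. If distinct_keys is lower than the number of lines in the file, the import cannot produce as many rows as were exported.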

Related

Automatically Updating a Hive View Daily

I have a requirement I want to meet. I need to Sqoop data over from a DB to Hive. I run Sqoop on a daily basis, since this data is updated daily.
This data will be used as lookup data by a Spark consumer for enrichment. We want to keep a history of all the data we have received, but we don't need all the data for lookup, only the latest data (same day). I was thinking of creating a Hive view on the historical table that only shows records that were inserted that day. Is there a way to automate the view on a daily basis so that the view query will always have the latest data?

Q: Is there a way to automate the view on a daily basis so that the
view query will always have the latest data?
There is no need to update or automate anything if you use a table partitioned by date.
Q: We want to keep a history of all the data we have received but we
don't need all the data for lookup only the latest data (same day).
NOTE: Whether you use a Hive view or a Hive table, you should always avoid a full table scan just to get the data of the latest partition.
Option 1: Hive approach to query data
If you want to use the Hive approach, you need a partitioned table in Hive with a partition column, for example partition_date.
select * from yourpartitionedTable where partition_date in
(select max(partition_date) from yourpartitionedTable)
or
select * from (select *,dense_rank() over (order by partition_date desc) dt_rnk from db.yourpartitionedTable ) myview
where myview.dt_rnk=1
Either query always returns the data of the latest partition: if today's date exists as a partition, you get today's data; otherwise you get the data for the maximum partition_date.
Option 2: Plain Spark approach to query data
With Spark's show partitions command, i.e. spark.sql(s"show Partitions $yourpartitionedtablename"), collect the result into an array and sort it to get the latest partition date. Using that, your Spark component can query only the latest partition as lookup data.
See my answer as an idea for getting the latest partition date.
I prefer option 2, since it needs no Hive query and no full table query, because we are using the show partitions command. There are no performance bottlenecks, so it will be fast.
One more idea is querying with HiveMetastoreClient, or combining that with option 2... see this and my answer and the other
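The partition-picking step of option 2 can be sketched in plain Python. This is only an illustration, assuming the partition specs arrive as strings of the form partition_date=2019-08-06 (the shape SHOW PARTITIONS returns) and that the dates are in yyyy-MM-dd format. The function name latest_partition and the sample values are hypothetical.

```python
def latest_partition(partition_specs, column="partition_date"):
    """Return the greatest value of `column` among specs like 'partition_date=2019-08-06'."""
    values = []
    for spec in partition_specs:
        # A multi-level spec looks like 'partition_date=2019-08-06/region=eu'.
        for part in spec.split("/"):
            name, _, value = part.partition("=")
            if name == column:
                values.append(value)
    if not values:
        raise ValueError(f"no partition values found for column {column!r}")
    # ISO dates (yyyy-MM-dd) sort correctly as plain strings.
    return max(values)


# In PySpark the specs could be collected with something like:
#   specs = [row[0] for row in spark.sql("SHOW PARTITIONS db.tbl").collect()]
# (db.tbl is a placeholder). Here a hard-coded sample is used instead.
specs = [
    "partition_date=2019-08-04",
    "partition_date=2019-08-06",
    "partition_date=2019-08-05",
]
latest = latest_partition(specs)
print(latest)
```

The resulting value can then be used in a filter on partition_date so that only that partition is read.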

I am assuming that you are loading daily transaction records into your history table with some last modified date. Every time you insert or update a record in your history table, its last_modified_date column gets updated. It could be a date or a timestamp.
You can create a view in Hive to fetch the latest data using an analytical function.
Here's some sample data:
CREATE TABLE IF NOT EXISTS db.test_data
(
user_id int
,country string
,last_modified_date date
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS orc
;
I am inserting a few sample records. Note that the same id has multiple records for different dates.
INSERT INTO TABLE db.test_data VALUES
(1,'India','2019-08-06'),
(2,'Ukraine','2019-08-06'),
(1,'India','2019-08-05'),
(2,'Ukraine','2019-08-05'),
(1,'India','2019-08-04'),
(2,'Ukraine','2019-08-04');
Creating a view in Hive:
CREATE VIEW db.test_view AS
select user_id, country, last_modified_date
from ( select user_id, country, last_modified_date,
max(last_modified_date) over (partition by user_id) as max_modified
from db.test_data ) as sub
where last_modified_date = max_modified
;
hive> select * from db.test_view;
1 India 2019-08-06
2 Ukraine 2019-08-06
Time taken: 5.297 seconds, Fetched: 2 row(s)
It shows only the records with the max date for each user_id.
If you then insert another record with a later last modified date:
hive> INSERT INTO TABLE db.test_data VALUES
> (1,'India','2019-08-07');
hive> select * from db.test_view;
1 India 2019-08-07
2 Ukraine 2019-08-06
For reference: Hive View manual
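To make the logic of the view above explicit, here is a rough Python equivalent of max(last_modified_date) over (partition by user_id), using the same sample rows. It is only a sketch of what the view computes, not of how Hive executes it; the function name latest_per_user is hypothetical.

```python
from datetime import date

rows = [
    (1, "India", date(2019, 8, 6)),
    (2, "Ukraine", date(2019, 8, 6)),
    (1, "India", date(2019, 8, 5)),
    (2, "Ukraine", date(2019, 8, 5)),
    (1, "India", date(2019, 8, 4)),
    (2, "Ukraine", date(2019, 8, 4)),
]


def latest_per_user(rows):
    """Keep, for each user_id, only the rows carrying that user's max last_modified_date."""
    max_modified = {}
    for user_id, _country, modified in rows:
        if user_id not in max_modified or modified > max_modified[user_id]:
            max_modified[user_id] = modified
    return sorted(r for r in rows if r[2] == max_modified[r[0]])


before = latest_per_user(rows)
# Same follow-up insert as in the Hive session above.
after = latest_per_user(rows + [(1, "India", date(2019, 8, 7))])
print(before)
print(after)
```

As in the Hive output, user 1 moves to 2019-08-07 after the extra insert while user 2 stays at 2019-08-06, because the maximum is taken per user_id rather than over the whole table.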

Merge very large hive Tables (11 to be precise) using Spark

I am basically substituting for another programmer.
Problem Description:
There are 11 Hive tables, each with 8 to 11 columns. All these tables have around 5 columns with similar names that hold different values.
For example, Table A has mobile_no, date, and duration columns, and so does Table B, but the values are not the same. The other columns have different names from table to table.
In all tables, the data types are string, integer, and double, i.e. simple data types. String data has a maximum of 100 characters.
Each table contains around 50 million rows. I have a requirement to merge these 11 tables, taking their columns as they are, into one big table.
Our Spark cluster has 20 physical servers, each with 36 cores (72 if you count virtualization) and 512 GB of RAM. The Spark version is 2.2.x.
I have to merge them efficiently in terms of both memory and speed.
Can you help me with this problem?
N.B.: Please let me know if you have questions.

Total row count in Cassandra

I understand that select count(*) from table where partitionId = 'test' returns the number of rows. However, I can see that it takes the same time as select * from table where partitionId = 'test'.
Is there any alternative in Cassandra to retrieve the row count in an efficient way?

You can compare select * and select count(*) by running them in cqlsh with tracing enabled (the tracing on command), which prints the time required to execute each command. The only difference between the two queries is the amount of data that has to be returned.
Either way, to find the number of rows Cassandra needs to hit the SSTable(s) and scan the entries. Performance can differ if your partition is spread across multiple SSTables, which may depend on the compaction strategy of the table, itself chosen based on your read/write patterns.

As Alex Ott mentioned, COUNT(*) needs to go through the entire partition to know the total.
The fact is that Cassandra wants to avoid locks, and as a result it does not maintain a row count in its sstables. Each time you do an INSERT, UPDATE, or DELETE, you may actually supersede another entry or mark it with a tombstone (i.e. it's not an in-place overwrite; the new data is written separately and the old data is treated as dead).
COUNT(*) will go through the sstables and count all the entries not marked as a tombstone. That's very costly. We're used to SQL databases having the total number of rows in a table or an index readily available, so COUNT(*) on those is nearly instantaneous... not here.
One solution I've used is to have Elasticsearch installed on your Cassandra cluster. One of the parameters Elasticsearch saves in its stats is the number of rows in a table. I don't remember the exact query, but more or less you can just send a count request and you get a result in about 100 ms, always, whatever the number is, even with tens of millions of rows. Just like with SELECT COUNT(*), the result will always be an approximation if you have many writes happening in parallel. It will stabilize if the writes stop for long enough (possibly about 1 or 2 seconds).
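To illustrate why such a count has to read every entry, here is a deliberately simplified toy model in Python. It assumes each sstable is just a list of (key, timestamp, value) entries and that a delete is recorded as a tombstone marker; the names count_live_rows and TOMBSTONE are hypothetical, and this is not Cassandra's actual storage format or read path.

```python
TOMBSTONE = object()  # marker standing in for a deleted entry


def count_live_rows(sstables):
    """Count keys whose newest version across all sstables is not a tombstone."""
    newest = {}  # key -> (timestamp, value)
    for sstable in sstables:
        for key, timestamp, value in sstable:
            # Every entry must be visited: any sstable may hold a newer version.
            if key not in newest or timestamp > newest[key][0]:
                newest[key] = (timestamp, value)
    return sum(1 for _, value in newest.values() if value is not TOMBSTONE)


sstables = [
    [("a", 1, "v1"), ("b", 1, "v1")],  # initial writes
    [("a", 2, "v2"), ("c", 2, "v1")],  # update of "a", insert of "c"
    [("b", 3, TOMBSTONE)],             # delete of "b"
]
live = count_live_rows(sstables)
print(live)  # 2: "a" and "c" are live, "b" is deleted
```

Since no per-table counter exists in this model, the cost of the count grows with the number of entries stored, including superseded and deleted ones.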

Pyspark giving error with IN query in cassandra

I have a large Cassandra keyspace (around 20 GB) on an AWS Cassandra server, with a master server that has 16 GB of RAM. I am trying to run an IN query:
"select colA colB colC where colA in {}".format( variable );
colA is a clustering key.
variable is a Python collection with around 500K entries. Currently I am facing two problems: first, the above query does not work at all; second, for a variable with around 20K entries it takes around 20 minutes. Is there any optimization that can be done?
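One commonly suggested mitigation for very large IN lists is to split the values into small batches and issue one query per batch. The following is a minimal sketch that only builds the statements and does not execute them. It assumes a hypothetical table name my_table (the question does not show one) and the %s placeholder style used by the DataStax Python driver for non-prepared statements. Whether an IN restriction on colA is valid at all depends on the table's primary key, which the question does not show.

```python
def chunked(values, size):
    """Yield consecutive slices of `values` with at most `size` items each."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


def build_in_queries(table, values, batch_size=100):
    """Build one parameterized (cql, params) pair per batch of IN values."""
    queries = []
    for batch in chunked(list(values), batch_size):
        placeholders = ", ".join(["%s"] * len(batch))
        cql = f"SELECT colA, colB, colC FROM {table} WHERE colA IN ({placeholders})"
        queries.append((cql, tuple(batch)))
    return queries


# Hypothetical input: 250 values split into batches of 100.
queries = build_in_queries("my_table", range(250), batch_size=100)
print(len(queries))
```

Each pair could then be passed to a driver call such as session.execute(cql, params). Passing the values as parameters also avoids formatting a Python collection directly into the query string.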

Select count(*) unstable on wide rows - Cassandra 2.1.2

I'm running a 4-node Cassandra 2.1.2 cluster (6 cores per machine, 32 GB RAM).
I have 2 similar tables with about 650K rows each. The rows are pretty wide: 150K columns.
On the first table, running select count(*) from cqlsh consistently gives the same result (the actual number of rows), but on the second table I get completely different values from run to run.
The only difference between the two tables is that the second table has a column that contains a collection (list) of 3 Doubles, whereas the first table contains a single Double in that column.
There is no data being inserted into the tables, and there are no compactions going on.
The row cache is disabled.
Any ideas on how to fix this?
