Retrieve Cassandra row counts faster

We have set up our Cassandra cluster as 3 nodes on AWS EC2 instances. Each instance is of type t2.large.
We need to get the row count from a Cassandra table.
We loaded a table in Cassandra with 900k records. The table has around 91 columns, most of them of the text datatype.
All 900k records belong to a single partition key.
When we tried a select count(*) query with that partition key, the query timed out.
However, we were able to retrieve the count through multiple calls by fetching only 100k records in each call. The only disadvantage here is the time taken, which is around 1 minute and 3 seconds.
Is there any other approach to get the row count faster in Cassandra? Do we need to change the data modelling approach to achieve this?

Hades Architect is correct. You will definitely want to rethink your data model.
With Cassandra, more partitions help with better data distribution. On the other hand, large partitions can dramatically slow down reads from disk, and as a partition grows it will eventually become unusable.
Is there any other approach to get the row count faster in Cassandra?
Yes. The DSBulk tool has built-in mechanisms that work with the partition ranges of a cluster and can read/count all rows.
dsbulk count \
-k keyspacename \
-t tablename \
-u username \
-p password \
-h 10.0.0.2
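
If Spark is already available next to the cluster, the Spark Cassandra Connector can count in a similarly token-range-parallel fashion. A rough sketch only; the keyspace, table, contact point, and partition key column name are placeholders, not from the original question:

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical contact point; point it at one of your nodes.
val conf = new SparkConf()
  .setAppName("cassandra-count")
  .set("spark.cassandra.connection.host", "10.0.0.2")
val sc = new SparkContext(conf)

// cassandraCount() pushes the counting down to Cassandra, one token range per Spark task.
val total = sc.cassandraTable("keyspacename", "tablename").cassandraCount()

// Counting a single partition is also possible, but a huge partition is still read by one task.
val partitionTotal = sc.cassandraTable("keyspacename", "tablename")
  .where("partition_key_col = ?", "some_value")
  .cassandraCount()

println(s"total=$total, partition=$partitionTotal")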

Related

Alter table to add partition taking long time on Hive external table

I'm trying to execute a Spark job through an EMR cluster with 6 nodes (8 cores and 56 GB memory on each node). The Spark job does an incremental load on partitions of a Hive table, and at the end it does a refresh table in order to update the metadata.
The refresh command takes as long as 3-6 hours to complete, which is too long.
Nature of data in Hive:
27 GB of data located on S3.
Stored in Parquet.
Partitioned on 2 columns (e.g. s3a://bucket-name/table/partCol1=1/partCol2=2020-10-12).
Note: it's a date-wise partitioning and cannot be changed.
Spark config used:
Num-executors = 15
Executor-memory = 16 GB
Executor-cores = 2
Driver-memory = 49 GB
Spark-shuffle-partitions = 48
hive.exec.dynamic.partition.mode = nonstrict
spark.sql.sources.partitionOverwriteMode = dynamic
Things tried:
Tuning the Spark cores/memory/executors, but no luck.
Refresh table command.
Alter table add partition command.
Hive CLI taking 3-4 hours to complete MSCK REPAIR TABLE tablename.
All the above had no effect on reducing the time to refresh the partitions in Hive.
Some questions:
Am I missing any tuning parameter, given that the data is stored in Amazon S3?
Currently the number of partitions on the table is close to 10k; is this an issue?
Any help will be much appreciated.
If possible, reduce the partitioning to 1 column. Performance suffers when we have multi-level (multi-column) partitions.
Use R-type instances. They provide more memory compared to M-type instances at the same price.
Use coalesce to merge the files in the source if there are many small files (see the sketch after this answer).
Check the number of mapper tasks. The more tasks, the lower the performance.
Use EMRFS rather than S3 to keep the metadata info.
Use the configuration below:
{
  "Classification": "spark",
  "Properties": {
    "maximizeResourceAllocation": "true"
  }
}
Follow some of the instructions from the link below.
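
To illustrate the coalesce suggestion above (this sketch is separate from the linked instructions): rewriting a partition's many small Parquet files as a handful of larger ones cuts down file listing and task overhead. The paths and the target file count are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("compact-small-files").getOrCreate()

// Read one partition's worth of small files...
val part = spark.read.parquet("s3a://bucket-name/table/partCol1=1/partCol2=2020-10-12")

// ...and rewrite it as a few larger files into a staging location.
part.coalesce(4)
  .write
  .mode("overwrite")
  .parquet("s3a://bucket-name/table_compacted/partCol1=1/partCol2=2020-10-12")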

Looking up about 40k records out of 150 million records in Cassandra in every job run?

I am building a near real-time/microbatch data application with Cassandra as the lookup store. Each incremental run has ~40K records, while the Cassandra table has about 150 million records. In each run, I need to look up the id field and get some attributes from Cassandra. These lookups can be random (no time/region/country dependency), so there is no clear partitioning scheme.
How should I partition the Cassandra table to ensure decent/good performance (for microbatches running every 15-30 mins)?
Apart from partitioning, any other tips?
The joinWithCassandraTable and leftJoinWithCassandraTable functions were specifically designed for efficient data lookup in Cassandra from Spark jobs. They fetch data by primary or partition key, and because they are executed by multiple executors in parallel, they can be fast (although ~40K lookups could still take time, depending on the size of your Cassandra and Spark clusters). See the SCC documentation for details on how to use them, but remember that these functions are available only in the RDD API. DataStax's version of the connector also supports so-called "DirectJoin": efficient joins with Cassandra in the DataFrame API.
Regarding partitioning: it depends on how you perform the lookup. Do you have one record in Cassandra matching one record in Spark? If yes, then just use this ID as the primary key (it's equal to the partition key in this case).
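A minimal sketch of the RDD-API lookup described above, assuming a Cassandra table ks.lookup whose partition key is a single id column; all names here are placeholders:

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// The field name must match the table's partition key column.
case class IdKey(id: String)

val conf = new SparkConf().set("spark.cassandra.connection.host", "10.0.0.2")
val sc = new SparkContext(conf)

// In a real run this RDD would hold the ~40K ids of the incremental batch.
val incomingIds = sc.parallelize(Seq(IdKey("a1"), IdKey("b2")))

// Each executor fetches only the partitions for its ids instead of scanning the table.
val looked = incomingIds.joinWithCassandraTable("ks", "lookup")

looked.collect().foreach { case (key, row) =>
  println(s"${key.id} -> ${row.getString("some_attribute")}")
}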

MySQL or Spark processing of 400 GB of data

If I use Spark in my case, based on blocks and cores, will it be useful?
I have 400 GB of data in a single MySQL table, user_events, with multiple columns. This table stores all user events from the application. Indexes exist on the required columns. I have a user interface where the user can try different permutations and combinations of fields in user_events.
Currently I am facing performance issues where a query either takes 15-20 seconds or longer, or times out.
I have gone through a couple of Spark tutorials, but I am not sure if it can help here. Per my understanding of Spark:
First, Spark has to bring all the data into memory. Bringing 100M records over the network will be a costly operation, and I will need a lot of memory for that. Isn't it?
Once the data is in memory, Spark can distribute it among partitions based on cores and input data size, and then filter the data on each partition in parallel. Here Spark can be beneficial, as it can do the operation in parallel while MySQL will be sequential. Is that correct?
Is my understanding correct?
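
As a rough illustration of the second point (not from the original thread), Spark's JDBC source can split the read into parallel range queries instead of pulling the whole table through a single connection, and simple filters are pushed down to MySQL. The column names, bounds, and connection details below are made-up placeholders:

import org.apache.spark.sql.SparkSession

// Requires the MySQL JDBC driver on the classpath.
val spark = SparkSession.builder().appName("user-events").getOrCreate()

val events = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://mysql-host:3306/appdb")
  .option("dbtable", "user_events")
  .option("user", "report_user")
  .option("password", "secret")
  .option("partitionColumn", "id")       // numeric column used to split the read
  .option("lowerBound", "1")
  .option("upperBound", "100000000")
  .option("numPartitions", "48")         // 48 parallel range queries against MySQL
  .load()

// The filter is pushed down to MySQL where possible, so each task pulls only matching rows.
val clicks = events.filter("event_type = 'click'").count()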

Size of Cassandra partitions

What is the best tool to find the number of rows in each Cassandra partition? I have a big partition and I want to know how many records are in that partition.
nodetool tablehistograms <keyspace> <table> will give you the distribution of cell counts and partition sizes for the table, but it does not give you that specific partition for sure. To get the specific one, you must use count(*) in a select query that specifies the partition key in the where clause. With a very large partition, that can fail, though.
sstablemetadata in 4.0 and later is based on the describe command in sstable-tools. It will give you the partitions largest in size, largest in number of rows, and the partitions with the most tombstones, if you provide -s to scan the sstable. It can be used against 3.0 and 3.11 sstables; I think 2.1 sstables cannot be processed, though.
...
Partitions: 22515
Rows: 13579337
Tombstones: 0
Cells: 13579337
Widest Partitions:
[12345] 999999
[99049] 62664
[99007] 60437
[99017] 59728
[99010] 59555
Largest Partitions:
[12345] 189888705
[99049] 2965017
[99007] 2860391
[99017] 2826094
[99010] 2818038
...
The above example has an int partition key; with a text partition key it will print out keys like:
Widest Partitions:
[frodo] 1
Largest Partitions:
[frodo] 104
You can find the total number of partitions available for a table with the nodetool command ./nodetool cfstats <keyspace>.<table>.
If you know the partition key, you can fire a select count(*) for the partition to get the number of records in that partition. Count queries on big partitions can time out, so set the cqlsh request timeout before executing the query.
To understand how to calculate the physical partition size, go through the DataStax DS220: Data Modeling material on partition size.
Instaclustr has a tool to find the partition size. However, this does not show the number of records in each partition:
https://github.com/instaclustr/cassandra-sstable-tools
As mentioned above, you can also use the built-in nodetool, which can be found in the extracted Cassandra folder, and run it in a terminal:
nodetool toppartitions
Additionally, you can use an online tool such as https://www.cqlguru.io/, but this needs some prior information, such as the average number of rows per partition and the average length of text in varchar columns. This tool is good for a rough estimation, though.

Partition multiple tables based on primary key using Apache Spark or any big data tool

I have data for 75 million e-commerce customer accounts in a CSV file.
Also, I have transaction records in another file. Here, the account number is the primary key. Every account has around 500 transactions on average.
Now I want to process this data and make some decisions about giving promotional offers. As the amount of data is huge, I decided to go with Spark SQL.
But the problem is that when I join these two tables, there will be a lot of shuffling between cluster nodes. I want to avoid this shuffling.
For that, I would like to ensure that an account's data is on the same partition as its transaction data. How can I do that?
A temporary solution is to divide the 75 million accounts into 75 files of 1 million accounts each, get their transactions in a similar fashion, and then spin up 75 Spark instances to process them all. Is there any other way to do this?
Transaction and account details are different dataframes and can't be in the same partition.
However, you can use Hive bucketing to reduce shuffling. You can save both files with bucketBy on the accountId (and maybe apply sorting as well). That way, when you do a join, Spark won't shuffle.
To better understand Hive bucketing with Spark 2.0, please check this.
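
A minimal sketch of the bucketing idea, assuming both datasets carry an account_id column; the table names, paths, and bucket count are placeholders (note that bucketBy requires writing with saveAsTable):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("bucketed-join")
  .enableHiveSupport()
  .getOrCreate()

val accounts     = spark.read.option("header", "true").csv("s3a://bucket/accounts.csv")
val transactions = spark.read.option("header", "true").csv("s3a://bucket/transactions.csv")

// Persist both sides bucketed (and sorted) by the join key.
accounts.write.bucketBy(64, "account_id").sortBy("account_id")
  .mode("overwrite").saveAsTable("accounts_bucketed")
transactions.write.bucketBy(64, "account_id").sortBy("account_id")
  .mode("overwrite").saveAsTable("transactions_bucketed")

// With matching bucket counts on the join key, Spark can join bucket to bucket
// without shuffling either table.
val joined = spark.table("accounts_bucketed")
  .join(spark.table("transactions_bucketed"), "account_id")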
