How does Cassandra handle a SELECT query?

I'm working on designing a Cassandra column family.
I ran into high GC activity while running SELECT queries after loading higher-density data, that is, after the amount of data per partition increased. With low-density data it works fine.
I want to know how Cassandra executes a SELECT query (with both the partition key and the clustering key specified).
Is the whole set of data in a partition loaded into memory when we execute a SELECT?
Will a large number of partition keys affect performance?

Cassandra does not load the entire partition into memory, but it does load IndexInfo objects, which help Cassandra find the relevant CQL rows within the partition. These are short-lived Java objects that can create quite a bit of heap pressure (GC pauses). This is a design issue that will be addressed in CASSANDRA-9754 (known as Birch, a B-tree implementation of the index data structure).
Until cassandra-4.0 is released, you should target 100MB for your max partition size, and break larger partitions into smaller pieces.
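One common way to break a large partition into smaller pieces is to add a bucket component to the partition key. The sketch below is only an illustration in plain Python, not Cassandra code: the function name `bucketed_key`, the example key values, and the default of 16 buckets are all assumptions made for the example.

```python
import zlib


def bucketed_key(partition_key: str, clustering_value: str, num_buckets: int = 16) -> tuple:
    """Return a composite (partition_key, bucket) pair (hypothetical helper).

    The bucket is derived from the clustering value with a stable hash, so rows
    that used to share one partition are spread over num_buckets smaller ones.
    A reader that knows the clustering value can recompute the same bucket.
    """
    bucket = zlib.crc32(clustering_value.encode("utf-8")) % num_buckets
    return (partition_key, bucket)
```

Because the bucket is computed from the clustering value, a SELECT that specifies both the partition key and the clustering key can still address exactly one (smaller) partition.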

Related

Size of a partition or RDD

How can we calculate the size of a partition in an RDD? Is calculating the partition size not recommended? I want to dynamically set the number of shuffle partitions before I call any action, so I need to calculate the partition size and, depending on the number of executors, set the shuffle partition count.
"I want to dynamically set the number of shuffle partitions before I call any action"
Unfortunately, that's challenging to do in Spark without diving deep into the low-level code. In fact, this is something that adaptive query execution in Spark 3.0 brings to the table: it over-partitions the dataset and then dynamically combines small partitions to reach a certain threshold.
https://databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html
You can get the number of partitions of an RDD (not their size in bytes) using the command below:
someRDD.partitions.size
You can choose the partitioning in different ways, for example:
based on the columns
based on the (dataset size)/(block size)
based on the cores available
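The size-based and core-based options above come down to simple arithmetic. Here is a minimal Python sketch of both; the function names, the 128 MiB default block size, and the factor of 2 tasks per core are assumptions chosen for illustration, not values taken from the answer.

```python
def partitions_by_size(dataset_bytes: int, block_bytes: int = 128 * 1024 * 1024) -> int:
    """Partition count as ceil(dataset size / block size), at least 1.

    The 128 MiB default block size is an assumption for this sketch.
    """
    return max(1, -(-dataset_bytes // block_bytes))  # integer ceiling division


def partitions_by_cores(total_cores: int, tasks_per_core: int = 2) -> int:
    """Partition count as a small multiple of the available cores.

    tasks_per_core=2 is an assumed rule of thumb, not a fixed constant.
    """
    return max(1, total_cores * tasks_per_core)
```

For example, a 100 GiB dataset with 128 MiB blocks gives `partitions_by_size(100 * 1024**3)`, which is 800 partitions.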

Maximum number of partitions in a Cassandra table, and how does it depend on disk space?

I am using Cassandra and I want to know the maximum number of partitions that can be created.
Will a larger number of partitions affect the performance of Cassandra?
How does the number of partitions depend on disk space?
There is no maximum number of partitions in Cassandra (that I know of). The number of partitions does have an impact on performance, and the more the better (this implies a good distribution of data which is exactly what you want for scalability). Disk space has nothing to do with it. You'll use whatever you're going to store.

Spark dataset exceeds total ram size

I have recently started working with Spark and came across a few questions that I still couldn't resolve.
Let's say I have a dataset of 100 GB and the total RAM of my cluster is 16 GB.
Now, I know that simply reading the file and saving it to HDFS will work, as Spark will do it partition by partition. What will happen when I perform a sorting or aggregation transformation on 100 GB of data? How will it process 100 GB in memory, since sorting needs the entire dataset?
I have gone through the link below, but it only explains what Spark does when persisting; what I am looking for is how Spark handles aggregation or sorting on a dataset larger than the RAM size.
Spark RDD - is partition(s) always in RAM?
Any help is appreciated.
There are 2 things you might want to know.
Once Spark reaches the memory limit, it will start spilling data to disk. Please check this Spark FAQ; there are also several questions on SO about the same topic, for example, this one.
There is an algorithm called external sort that allows you to sort datasets that do not fit in memory. Essentially, you divide the large dataset into chunks that do fit in memory, sort each chunk, and write each chunk to disk. Finally, you merge the sorted chunks to get the whole dataset sorted. Spark supports external sorting, as you can see here, and here is the implementation.
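To make the chunk, sort, and merge steps concrete, here is a minimal external sort sketch in plain Python. It is not Spark's implementation; the function name `external_sort`, the one-integer-per-line run files, and the `chunk_size` parameter are assumptions made for the example.

```python
import heapq
import itertools
import os
import tempfile


def external_sort(values, chunk_size):
    """Yield the ints from `values` in ascending order.

    Only `chunk_size` items are sorted in memory at a time; each sorted
    chunk (a "run") is written to a temporary file, then all runs are merged.
    """
    paths = []
    it = iter(values)
    try:
        # Phase 1: sort one chunk at a time and write each run to disk.
        while True:
            chunk = sorted(itertools.islice(it, chunk_size))
            if not chunk:
                break
            with tempfile.NamedTemporaryFile("w", delete=False, suffix=".run") as out:
                out.writelines(f"{v}\n" for v in chunk)
                paths.append(out.name)
        # Phase 2: lazily merge the sorted runs, reading one line per run at a time.
        files = [open(p) for p in paths]
        try:
            runs = [(int(line) for line in fh) for fh in files]
            yield from heapq.merge(*runs)
        finally:
            for fh in files:
                fh.close()
    finally:
        # Remove the temporary run files once the merge is done or abandoned.
        for p in paths:
            os.remove(p)
```

Calling `list(external_sort([5, 3, 9, 1, 7, 2, 8], chunk_size=3))` sorts three runs of at most three items each and merges them into one ordered sequence.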
To answer your question: your data does not need to fit in memory in order to be sorted, as explained above. Now, I would encourage you to think about an algorithm for data aggregation that divides the data into chunks, just like external sort does.
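One possible shape for such a chunked aggregation is sketched below in plain Python: each chunk is reduced to a small partial result, and the partial results are then merged. The function name `chunked_sum_by_key` and the choice of a per-key sum are assumptions for illustration, and the sketch assumes the number of distinct keys is small enough for the merged result to fit in memory.

```python
from collections import Counter
from itertools import islice


def chunked_sum_by_key(pairs, chunk_size):
    """Sum values per key while holding only one chunk of input at a time."""
    total = Counter()
    it = iter(pairs)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            break
        # Reduce the chunk to one partial sum per key.
        partial = Counter()
        for key, value in chunk:
            partial[key] += value
        # Merge the partial sums into the running totals.
        total.update(partial)
    return dict(total)
```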
There are multiple things you need to consider. Because you have 16 GB of RAM and a 100 GB dataset, it is a good idea to persist to disk. Aggregation may be difficult if the dataset has high cardinality. If the cardinality is low, you will be better off aggregating within each RDD before merging into the whole dataset. Also make sure that each partition in the RDD is smaller than the available memory (default value 0.4*container_size).

Does Spark keep all elements of an RDD[K,V] for a particular key in a single partition after "groupByKey" even if the data for a key is very large?

Consider that I have a PairedRDD of, say, 10 partitions. But the keys are not evenly distributed, i.e. the first 9 partitions all hold data belonging to a single key, say a, and the rest of the keys, say b and c, are in the last partition only. This is represented by the figure below:
Now if I do a groupByKey on this RDD, from my understanding all data for a given key will end up in the same partition, i.e. data for one key will not be spread across multiple partitions. Please correct me if I am wrong.
If that is the case, there is a chance that the partition for key a can be of a size that may not fit in a worker's RAM. In that case, what will Spark do? My assumption is that it will spill the data to the worker's disk.
Is that correct?
Or how does Spark handle such situations?
Does spark keep all elements (...) for a particular key in a single partition after groupByKey
Yes, it does. This is the whole point of the shuffle.
the partition for key a can be of size that may not fit in a worker's RAM. In that case what spark will do
Size of a particular partition is not the biggest issue here. Partitions are represented using lazy Iterators and can easily store data which exceeds amount of available memory. The main problem is non-lazy local data structure generated in the process of grouping.
All values for the particular key are stored in memory as a CompactBuffer so a single large group can result in OOM. Even if each record separately fits in memory you may still encounter serious GC issues.
In general:
It is safe, although not optimal performance wise, to repartition data where amount of data assigned to a partition exceeds amount of available memory.
It is not safe to use PairRDDFunctions.groupByKey in the same situation.
Note: You shouldn't extrapolate this to different implementations of groupByKey though. In particular both Spark Dataset and PySpark RDD.groupByKey use more sophisticated mechanisms.
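The difference between a lazily iterated partition and a non-lazy per-key buffer can be illustrated with a plain Python sketch. This is not Spark code: `skewed_records`, `group_by_key`, and `count_by_key` are hypothetical names, and the Python list merely stands in for the in-memory buffer described above.

```python
from collections import defaultdict


def skewed_records(n):
    """Lazy source: yields records one at a time, like a partition iterator."""
    for i in range(n):
        yield ("a", i)
    yield ("b", 0)


def group_by_key(records):
    """Non-lazy grouping: every value for a key is held in one list at once."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return dict(groups)


def count_by_key(records):
    """Streaming alternative: keeps one small accumulator per key."""
    counts = defaultdict(int)
    for key, _ in records:
        counts[key] += 1
    return dict(counts)
```

With a heavily skewed key such as a, `group_by_key` has to hold all of its values at the same time, while `count_by_key` consumes the same lazy stream with memory proportional only to the number of keys.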

Cassandra running out of memory for cql queries

We have a 32-node Cassandra cluster with around 100 GB per node, using the Murmur3 partitioner. It holds time-series data, and we have built secondary indexes on two columns to perform range queries. Currently, the cluster is stable, with all the data bulk loaded and all the secondary indexes rebuilt. The issue occurs when we perform range queries using a CQL client or Hector: just the query for the row count takes a huge amount of time, and in most cases it causes nodes to fail due to memory issues. The nodes have 8 GB of memory, and the Cassandra max heap is set to 4 GB. Has anyone else faced such an issue? Is there a better way to do count queries?
I've had similar issues, and most often this can be solved by redesigning the schema with the queries that you plan to execute against the data in Cassandra in mind. For time-series data it is better to have wide tables with a granularity that depends on your queries. If your query requires data at a granularity of 1 hour, then it is best to have a wide table with all timestamped data points for each hour stored within a single row, so you can get all the required data for 1 hour by reading just 1 row.
Since you say the data is bulk loaded, I am assuming that you may have put all the data into a single table, which is why the get_count query is taking an enormous amount of time. We have a cluster with 8 GB of RAM but have set the heap size to 3 GB, because at 4 GB the RAM utilization is almost always at 8 GB (full utilization).
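The one-row-per-hour layout described above amounts to deriving the row key from the timestamp truncated to the hour. A minimal Python sketch, assuming a hypothetical `hourly_row_key` helper, a `series_id:hour` key format, and timezone-aware timestamps normalized to UTC:

```python
from datetime import datetime, timezone


def hourly_row_key(series_id: str, ts: datetime) -> str:
    """Build a row key that groups all points of one series within the same hour.

    `ts` is expected to be timezone-aware; it is converted to UTC before
    being truncated to the hour.
    """
    hour = ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H")
    return f"{series_id}:{hour}"
```

All data points of a series that fall within the same hour map to the same key, so one hour of data can be read with a single row lookup.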
