We have a 32-node Cassandra cluster with around 100 GB per node, using the Murmur3 partitioner. It holds time series data, and we have built secondary indexes on two columns to perform range queries. Currently the cluster is stable, with all the data bulk loaded and all the secondary indexes rebuilt. The issue occurs when we perform range queries using the CQL client or Hector: just the query for the count of rows takes a huge amount of time, and in most cases it causes nodes to fail due to memory issues. The nodes have 8 GB of memory, and the Cassandra max heap is set to 4 GB. Has anyone else faced such an issue? Is there a better way to do count queries?
I've had similar issues, and most often they can be solved by redesigning the schema with the queries you plan to run against the data in mind. For time series data it is better to have wide tables, with a granularity that depends on your queries. If your queries need data at a granularity of one hour, it is best to have a wide table in which all timestamped data points for an hour are stored in a single row, so you can get everything you need for that hour by reading just one row.
Since you say the data is bulk loaded, I am assuming you may have put all the data into a single table, which is why the get_count query is taking an enormous amount of time. We have a cluster with 8 GB of RAM but have set the heap size to 3 GB, because at 4 GB the RAM utilization is almost always at the full 8 GB.
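As a rough sketch of that idea (the table and column names below are made up, not taken from your schema), an hour-bucketed table keeps each hour of data in one partition, and a counter table maintained on every write turns the count query into a single-row read:

CREATE TABLE events_by_hour (
    sensor_id  text,
    hour       timestamp,    -- hour bucket, e.g. '2013-06-18 07:00:00'
    event_time timestamp,
    value      double,
    PRIMARY KEY ((sensor_id, hour), event_time)
);

-- Hypothetical counter table: increment it alongside every insert above,
-- so counts never require scanning the data rows
CREATE TABLE event_counts_by_hour (
    sensor_id   text,
    hour        timestamp,
    event_count counter,
    PRIMARY KEY ((sensor_id, hour))
);

UPDATE event_counts_by_hour SET event_count = event_count + 1
WHERE sensor_id = 'sensor-1' AND hour = '2013-06-18 07:00:00';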
I have a Delta table that is partitioned by multiple keys, one of which is a date truncated to the hour, with no minute detail (example: Fri, 15 Jul 2022 07).
Now, with data continuously being ingested via batch and streaming ingestion workflows, what would be the best strategy for working out the number of executors needed to read all the data from the Delta table?
One very naive approach would be to just let Spark autoscale, but we may still need to play with shuffle partitions etc. Looking for hints or best practices around this. Thanks!
If you want to "read all the data from delta table" it does not really matter whether this table is partitioned or not since the query reads all the data and hence loads the whole table.
This is the worst possible query - the dreaded full scan. If it's inevitable, just know that this is the kind of query where Spark SQL shines so brightly, utilising the full power of a Spark cluster. You've been warned :)
Executors are simply JVM processes with CPU cores and memory. You're probably more interested in the number of CPU cores available for all the tasks that load the delta table.
I'd start this calculation with the number of files for a given version of the delta table. Files are of different sizes and (I might be wrong here) they are usually chunked (I don't want to use the overloaded term partitioned here, but that's what springs to mind) into 512 MB splits.
The number of splits (512MB blocks) for all the files of a given version of the delta table would be the number of tasks. That would give you the number of CPU cores and hence their "containers", i.e. Spark executors (to evenly saturate available physical resources for the best performance).
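For what it's worth, here is a rough PySpark sketch of that calculation; the table path, split size and cores-per-executor values are assumptions (Spark's default for spark.sql.files.maxPartitionBytes is actually 128 MB), so substitute whatever your cluster uses:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("estimate-read-parallelism").getOrCreate()

table_path = "/mnt/delta/events"         # hypothetical location of the delta table
split_bytes = 128 * 1024 * 1024          # assumed split size (spark.sql.files.maxPartitionBytes)
cores_per_executor = 4                   # assumption - match your executor spec

df = spark.read.format("delta").load(table_path)

# inputFiles() lists the data files backing the current version of the table
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
Path = spark.sparkContext._jvm.org.apache.hadoop.fs.Path

tasks = 0
for f in df.inputFiles():
    p = Path(f)
    size = p.getFileSystem(hadoop_conf).getFileStatus(p).getLen()
    tasks += max(1, -(-size // split_bytes))        # ceil(size / split_bytes)

executors = -(-tasks // cores_per_executor)         # one wave of tasks, fully parallel
print(f"~{tasks} read tasks -> ~{executors} executors with {cores_per_executor} cores each")

In practice Spark also coalesces small files into shared splits (see spark.sql.files.openCostInBytes), so treat the result as a ballpark rather than an exact task count.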
I have a Spark dataframe with just 2 columns, like {Key | Value}, and it has 10 million records. I am inserting it into an HBase table (with 10 pre-split regions) using the bulk load approach from Spark. This works fine and loads the data successfully. When I checked the size of the table it was about 151 GB (453 GB with 3x Hadoop replication). I ran a major compaction on that table, and the table size got reduced to 35 GB (105 GB with 3x replication).
I am trying to run the same code with the same data on a different cluster, but there I have a quota limit of 2 TB on my namespace. My process fails while loading the HFiles into HBase, saying the quota limit is exceeded.
I would like to know whether Spark creates many more data files than the required 151 GB during the bulk load. If so, how do I avoid that? Or is there a better approach to loading the same data?
The question is: if the actual data is around 151 GB (before major_compact), why is 2 TB not enough?
If I use Spark in my case, based on blocks and cores, will it be useful?
I have 400 GB of data in a single MySQL table, User_events, with multiple columns. This table stores all user events from the application, and indexes exist on the required columns. I have a user interface where users can try different permutations and combinations of fields in user_events.
Currently I am facing performance issues where a query either takes 15-20 seconds, or even longer, or times out.
I have gone through a couple of Spark tutorials, but I am not sure whether it can help here. My understanding of Spark is:
First, Spark has to bring all the data into memory. Bringing 100 M records over the network will be a costly operation, and I will need a lot of memory for it. Isn't that so?
Once the data is in memory, Spark can distribute it among partitions based on the number of cores and the input data size, and then filter the data in each partition in parallel. Here Spark can be beneficial, as it can do the work in parallel while MySQL will be sequential. Is that correct?
Is my understanding correct ?
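Not a full answer, but a minimal PySpark sketch of how such a read is usually parallelized; the JDBC URL, credentials, column names and bounds below are placeholders, not details from the question. Spark issues one bounded query per partition rather than pulling the whole table through a single connection, and simple filters are pushed down to MySQL:

from pyspark.sql import SparkSession

# Requires the MySQL JDBC driver on the Spark classpath
spark = SparkSession.builder.appName("user-events-read").getOrCreate()

df = (spark.read.format("jdbc")
      .option("url", "jdbc:mysql://db-host:3306/appdb")   # placeholder URL
      .option("dbtable", "User_events")
      .option("user", "report_user")                      # placeholder credentials
      .option("password", "secret")
      .option("partitionColumn", "id")                    # assumed numeric, indexed column
      .option("lowerBound", "1")
      .option("upperBound", "100000000")                  # assumed ~100 M rows
      .option("numPartitions", "64")                      # 64 concurrent bounded queries
      .load())

# Simple predicates are pushed down to MySQL, so a selective filter does not
# drag all 400 GB across the network; the remaining work runs in parallel.
result = df.filter("event_type = 'login'").groupBy("country").count()
result.show()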
I'm working on designing a Cassandra column family.
I ran into a situation of higher GC during SELECTs after loading a higher density of data, that is, after the amount of data per partition increased. With low-density data it works fine.
I want to know how Cassandra executes a SELECT query (with both the partition key and a clustering key specified).
Is the whole set of data in a partition loaded into memory when we execute a SELECT?
Will a large number of partition keys affect performance?
Cassandra does not load the entire partition into memory, but it does load IndexInfo objects, which help Cassandra find the relevant CQL rows within the partition. These are short-lived Java objects which can create quite a bit of heap pressure (GC pauses) - a design issue that will be addressed in CASSANDRA-9754 (known as Birch, a b-tree implementation of the index data structure).
Until Cassandra 4.0 is released, you should target 100 MB as your maximum partition size, and break larger partitions into smaller pieces, as sketched below.
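A common way to do that split (illustrative only; the table and column names are made up) is to fold a coarse time bucket, or a hash bucket, into the partition key so that no single partition grows without bound:

-- Hypothetical example: one partition per sensor per day rather than one
-- ever-growing partition per sensor
CREATE TABLE readings_by_day (
    sensor_id    text,
    day          date,        -- bucket column added to the partition key
    reading_time timestamp,
    value        double,
    PRIMARY KEY ((sensor_id, day), reading_time)
);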
DB used: DataStax Cassandra community 3.0.9
Cluster: 3 x (8-core, 64 GB AWS instances) with 300 GB io1 volumes at 3000 IOPS
Java heap memory allocated = 8 GB
Write consistency: QUORUM, read consistency: ONE, replication factor: 3
Problem: I loaded our servers with 50,000 users, each with 1000 records initially; after some time, 20 more records were added to each user. I wanted to fetch the 20 additional records that were added later (query: select * from table where userID='xyz' and timestamp > 123).
CREATE TABLE tbl (
userID text,
timestamp timestamp,
....
PRIMARY KEY (userID, timestamp)
);
I have added an additional 200 GB of data into tbl, apart from the original data for the 50,000 users.
Heap memory usage stays in the range of 2-4 GB, but almost all of the remaining off-heap memory (56 GB) is eaten up by Cassandra.
From this point onwards, if more data is added to the table, a sharp decline in read throughput is observed due to unavailable memory.
Though it meets the read-throughput SLA, it does not seem like a scalable solution: (3 x 64 GB) of RAM for 200 GB of data.
Note:
In the load test experiment, only records for the initial 50,000 users are being fetched.
Row cache is disabled.
It's a read-intensive application - 2000 reads/sec.
What could be the possible reason for the high off-heap memory usage?
If your partitions are very large, then reads take longer. Since your data is partitioned by userID, all data associated with each userID is stored in a single partition on disk (within the partition, your data is ordered by timestamp, your clustering key). When reading, Cassandra must traverse the whole partition to find the data you are requesting. If each userID has a lot of data associated with it, you could have quite large partitions on disk, which will take longer to read. The usual mitigation is sketched below.
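For illustration only (the monthly bucket granularity and the literal values are assumptions, not something stated in the question), spreading each user's history across bounded partitions looks like this, and the "new records since timestamp X" read then targets only the current bucket:

CREATE TABLE tbl_by_month (
    userID    text,
    month     text,          -- e.g. '2017-06', bounds the partition size
    timestamp timestamp,
    ....                     -- remaining columns as in the original tbl
    PRIMARY KEY ((userID, month), timestamp)
);

SELECT * FROM tbl_by_month
WHERE userID = 'xyz' AND month = '2017-06' AND timestamp > 123;

The trade-off is that a read spanning several months has to query multiple partitions, which is usually still far cheaper than scanning one multi-gigabyte partition.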