Heap Size of a Node in Cassandra Cluster

Heap Size of a Node in Cassandra Cluster - cassandra

I have a cassandra setup with 6 Node Ring single DC with RF:6 and Read:CL:1. Now at times if a particular node gets lot of requests which in turn would be passed on to all the nodes (cause of RF) it gets on to Compaction with CMS and then finally with ParNew where the whole ring gets in a hanged situation which in turn making it pretty unusable. I found that we could only solve it with increasing Heap size or tweaking in the cassandra code for only reading from the local node(as RF:6 would guarantee every node having the same data although repair etc have to be dealt separately).
How to calculate Heap Size for Cassandra node(I have two Keyspace's with total 14 CF's apart from System CF's).As per cassandra wiki this should be the heap size atleast : memtable_throughput_in_mb * 3 * number of hot CFs + 1G + internal caches where memtable_throughput_in_mb=128mb for my setup. The max row size for a particular CF should matter here. I am not using any row or key cache. Can someone suggest me the same.

Related

Cassandra CPU imbalance in Azure

We have a 30+ node Cassandra cluster (3.11.2) in 4 data centers. One of the centers consists of 8 nodes in Azure running on Standard DS12 v2 (4cpu, 28gb) nodes with a 500GB premium SSD drive. All in the same data center (central US).
We are seeing a dramatic CPU imbalance in the node activity when pushed to the max. We have a keyspace with about 200 million records, and we're running a process to check and refresh the records if necessary from another data stream.
What's happening, is we have 4 nodes that are running at 70-90% CPU compared to 15-25% of the other 4. The measurement of the CPU is being done in the nodes themselves, because Azure's own metrics is broken and never represents what is actually happening.
Digging into a pair of nodes (one low CPU and one high) the difference is the iowait% of the two. The data in the keyspace is balanced (within reason - they are all within 5% of another in record count and size). It looks like the number of reads is balanced, and even the read latency as reported by Cassandra is similar.
When I do an iostat compare of the nodes, the high CPU node is reporting a much higher (by 50 to 100%) rKB/s numbers... which is likely leading to the difference in iowait% time.
These nodes are 100% configured the same, running the same version of everything (OS, libraries, everything) that I can think to look. I cannot figure out why some nodes are deciding to do more disk reads that the others, resulting in the cluster as a whole slowing down.
Anybody have any suggestions on where I can look for differences?
The only thing that is a pattern, is the nodes that are slower are the 4 nodes that were added later in our expansion. We started with 4 nodes for a while and added 4 more when we needed space. All the appropriate repairs and other tasks required with node additions were done - the fact that the records and physical size of the data files on disk being equal should attest to that.
When we shut down our refresh process, the all the nodes settle down to a even 5% or less CPU across the board. No compaction or any other maintenance is happening that would indicate something different.
plz help... :)

Our final solution for this - to fix ONLY the unbalanced problem was to cleanup, full repair and compact. At that point the nodes are relatively equally used. We suspect expanding the cluster (adding nodes) may have left elements of data on the older nodes that were not compacted out based on regular compaction events.
We are still working to try to solve the load issue; but now at least all the nodes are feeling the same CPU crunch.

Cassandra cfstats: differences between Live and Total used space values

For about 1 month I'm seeing the following values of used space for the 3 nodes ( I have replication factor = 3) in my Cassandra cluster in nodetool cfstats output:
Pending Tasks: 0
Column Family: BinaryData
SSTable count: 8145
Space used (live): 787858513883
Space used (total): 1060488819870
For other nodes I see good values, something like:
Space used (live): 780599901299
Space used (total): 780599901299
You can note a 25% difference (~254Gb) between Live and Total space. It seems I have a lot garbage on these 3 nodes which cannot be compacted for some reason.
The column family I'm talking about has a LeveledCompaction strategy configured with SSTable size of 100Mb:
create column family BinaryData with key_validation_class=UTF8Type
and compaction_strategy=LeveledCompactionStrategy
and compaction_strategy_options={sstable_size_in_mb: 100};
Note, that total value staying for month on all of the three nodes. I relied Cassandra normalize data automatically.
What I tried to decrease space (without result):
nodetool cleanup
nodetool repair -pr
nodetool compact [KEYSPACE] BinaryData (nothing happens: major compaction is ignored for LeveledCompaction strategy)
Are there any other things I should try to cleanup a garbage and free space?

Ok, I have a solution. It looks like Cassandra issue.
First, I went deep into the Cassandra 1.1.9 sources and noted that Cassandra perform some re-analysing of SStables during node starting. It removes the SStables marked as compacted, performs recalculation of used space, and do some other staff.
So, what I did is restarted the 3 problem nodes. The Total and Live values have become equals immediately after restart was completed and then Compaction process has been started and used space is reducing now.

Leveled compaction creates sstables of a fixed, relatively small size,
in your case it is 100Mb that are grouped into “levels”. Within each
level, sstables are guaranteed to be non-overlapping. Each level is
ten times as large as the previous.
So basically from this statement provided in cassandra doc, we can conclude that may be in your case ten time large level background is not formed yet, resulting to no compaction.
Coming to second question, since you have kept the replication factor as 3, so data has 3 duplicate copies, for which you have this anomaly.
And finally 25% difference between Live and Total space, as you know its due over deletion operation.

For LeveledCompactionStrategy you want to set the sstable size to a max of around 15 MB. 100MB is going to cause you a lot of needless disk IO and it will cause it to take a long time for data to propagate to higher levels, making deleted data stick around for a long time.
With a lot of deletes, you are most likely hitting some of the issues with minor compactions not doing a great job cleaning up deleted data in Cassandra 1.1. There are a bunch of fixes for tombstone cleanup during minor compaction in Cassandra 1.2. Especially when combined with LCS. I would take a look at testing Cassandra 1.2 in your Dev/QA environment. 1.2 does still have some kinks being ironed out, so you will want to make sure to keep up to date with installing new versions, or even running off of the 1.2 branch in git, but for your data size and usage pattern, I think it will give you some definite improvements.

Cassandra 1.1 - Setup for 24GB RAM and Row Cache

I would like to tune Cassandra for heavy read scenario with skinny rows (5-50 columns). The idea is to use row cache, and enable key cache just in case - when data is to large for row cache.
I have dual Intel Xeon server with 24GB RAM (3 in ring, two data centers - gives 6 machines in total)
Those are changes that I've made to default configuration:
cassandra-env.sh
#JVM_OPTS="$JVM_OPTS -ea"
MAX_HEAP_SIZE="6G"
HEAP_NEWSIZE="500M"
cassandra.yaml
# do not persist caches to disk
key_cache_save_period: 0
row_cache_save_period: 0
key_cache_size_in_mb: 512
row_cache_size_in_mb: 14336
row_cache_provider: SerializingCacheProvider
The idea it to dedicate 6GB to Cassandra JVM, 0.5GB to key cache (out of 6GB heap), and 14GB to row cache as off-heap.
OS has still 4GB which should be enough, since there is running only one JVM process and it should have overhead of max 2GB.
Is this setup optimal? Any hints?
Thanks,
Maciej

I'm using 1.1.6 version.
SerializingCacheProvider will save cache data at Native Heap area.
That area is not for GC inspect. so It will not be occurred GC.
Your row_cache_size_in_mb setting is for SerializingCache's reference object.
That reference is saved using FreeableMemory(It is in 1.1.x. but after 1.2, it changed).
In other words, Your real cache value is not calculated when calculating row_cache_size_in_mb.
At the result If you want to calculate row_cache_size_in_mb, try to set from minimal size.
In my case, when I set 500mb, each node was using 2G old gen.(in according to deal which data set)

Run the heapspace_calculator and use the suggested value as an initial heap configuration. Monitor your heap usage with "nodetool info".
Try to use short column names and merge columns when possible.

This setup works just fine - I've tested it.

cassandra read performance for large number of keys

Here is situation
I am trying to fetch around 10k keys from CF.
Size of cluster : 10 nodes
Data on node : 250 GB
Heap allotted : 12 GB
Snitch used : property snitch with 2 racks in same Data center.
no. of sstables for cf per node : around 8 to 10
I am supercolumn approach.Each row contains around 300 supercolumn which in terms contain 5-10 columns.I am firing multiget with 10k row keys and 1 supercolumn.
When fire the call 1st time it take around 30 to 50 secs to return the result.After that cassandra serves the data from key cache.Then it return the result in 2-4 secs.
So cassandra read performance is hampering our project.I am using phpcassa.Is there any way I can tweak cassandra servers so that I can get result faster?
Is super column approach affects the read performance?

Use of super columns is best suited for use cases where the number of sub-columns is a relatively small number. Read more here:
http://www.datastax.com/docs/0.8/ddl/column_family

Just in case you haven't done this already, since you're using phpcassa library, make sure that you've compiled the Thrift library. Per the "INSTALLING" text file in the phpcassa library folder:
Using the C Extension
The C extension is crucial for phpcassa's performance.
You need to configure and make to be able to use the C extension.
cd thrift/ext/thrift_protocol
phpize
./configure
make
sudo make install
Add the following line to your php.ini file:
extension=thrift_protocol.so

After doing much of RND about this stuff we figured there is no way you can get this working optimally.
When cassandra is fetching these 10k rows 1st time it is going to take time and there is no way to optimize this.
1) However in practical, probability of people accessing same records are more.So we take maximum advantage of key cache.Default setting for key cache is 2 MB. So we can afford to increase it to 128 MB with no problems of memory.
After data loading run the expected queries to warm up the key cache.
2) JVM works optimally at 8-10 GB (Dont have numbers to prove it.Just observation).
3) Most important if you are using physical machines (not cloud OR virtual machine) then do check out the disk scheduler you are using.Set it NOOP which is good for cassandra as it reads all keys from one section reducing disk header movement.
Above changes helped to bring down time required for querying within acceptable limits.
Along with above changes if you have CFs which are small in size but frequently accessed enable row caching for it.
Hope above info is useful.

Cassandra Compaction takes all the resources and leads to node failure

I met very strange problem during testing cassandra. I have a very simple column family that stores video data (keys point to time period and there is only one column containing ~2MB video for this period).
Use Case
I start to load data using Hector API (round-robin) to 6 empty nodes (8GB RAM for Cassandra)- load is run in 4 threads adding 4 rows in second for each thread.
After a while (running load for hour or so) near 100-200 GB are added to the node (depending on the replication factor) and then one or several nodes become unreachable. (no pinging just reboot helps)
Why Compaction
I do use tiered-level compaction and monitoring the system(Debian) i can see that it actually not writes but compaction that takes almost all resources (disk, memory) and cause server to refuse writes and than fail.
After like 30-40 minutes of test compaction tasks just cannot be handled and get queued. Interesting thing is that there are no deletes and updates - so compaction just reads/writes data again and again without bringing actual value to me (like it can be compacted once in the evening).
When i slow down the pace - i.e running 2 threads with 1 second delay things go better but whether it still be working when i have 20TB not 100 GB on a node.
Is Cassandra optimized for such type of workload? How the resources are normally distributed between compaction and reads/writes?
Update
Update of network driver solved problem with unreachable cluster
Thanks,
Sergey.

Cassandra will use up to in_memory_compaction_limit_in_mb memory for a compaction. It is routine to have compaction running while reads and writes are served simultaneously. It is also normal that compaction can fall behind if you continue to throw writes at it as fast as possible; if your read workload requires that compaction be up to date or close to it at all times, then you'll need a larger cluster to spread the load around more machines.
Recommended amount of disk per node for online queries is up to 500GB, maybe 1TB if you're pushing it. Remember that this amount of data will have to be rebuilt if a node fails. Typical Cassandra workloads are CPU-bound or iops-bound, not disk-space bound, so you won't be able to make good use of that space anyway.
(It's also possible to do batch analytics against Cassandra, which we do with the Cassandra Filesystem, in which case higher disk:cpu ratios are desirable, but we use a custom compaction strategy for that as well.)
It's not clear from your report why a server would become unreachable. This is really an OS-level problem. (Are you swapping? Disabling swap would be a good first step.)

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string