How the Cassandra In-Memory feature works

I have a few questions regarding the In-Memory feature in Cassandra.
1.) I have a 4-node datacenter, and in OpsCenter, under memory usage, it shows 100GB of in-memory capacity available. Does that mean each of the 4 nodes has 100GB available, or is 100GB the total in-memory capacity for my datacenter?
2.) If 100GB really is available for in-memory use in the datacenter, is it advisable to use the full capacity? Do I need to factor in the replication factor as well? Say I have 15GB of data that I want to store in-memory; if the replication factor is 2, will that amount to 30GB of in-memory data for the datacenter?
3.) In the dse.yaml file there is a property, max_memory_to_lock_fraction, expressed as a fraction of system memory, which defaults to 20%. Per the DataStax guidelines, we need to ensure that in-memory usage does not exceed 45% of the total available system memory on each node. Is max_memory_to_lock_fraction the parameter that needs to be set to 45%?
4.) The DataStax documentation says compression needs to be removed for an in-memory table. If compression is left enabled, will it affect read/write performance?
5.) The output of dsetool inmemorystatus has a parameter called "Current Total memory not able to lock". Does the value of this parameter denote the available memory? For example, if the value is 1024MB, does it mean 1GB of in-memory capacity is still available for use?
I am using DSE 4.8.11. Please help me understand this feature so that I can leverage it best.
Thanks in advance.

1) It depends on how you configure the view; OpsCenter can show in-memory usage per cluster (all of the available memory across the nodes) or you can view graphs for individual nodes.
2) Yes, the replication factor multiplies the total amount of data stored, so you will have to factor it in at the cluster level. A very nice tool to help you get started: https://www.ecyrd.com/cassandracalculator/
3) Yes, max_memory_to_lock_fraction is the parameter you are looking for.
4) It will increase processing time, and since writes in Cassandra are actually CPU-bound, this may not be the best idea performance-wise (a sketch of creating an in-memory table with compression disabled follows below).
5) Yes, this means there is still memory of the specified amount, but due to the settings Cassandra is unable to lock it.
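To make the compression point concrete, here is a minimal sketch of creating a DSE in-memory table from the DataStax Python driver, assuming a reachable cluster. The contact point, keyspace, table name, and size limit are hypothetical, and the exact CQL options should be double-checked against the DSE 4.8 documentation.

```python
# Hypothetical sketch: creating a DSE in-memory table via the Python driver.
from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.1"])           # replace with your contact points
session = cluster.connect("my_keyspace")  # hypothetical keyspace

# DSE in-memory tables use the MemoryOnlyStrategy compaction class and are
# expected to have compression disabled (DSE 4.8 / Cassandra 2.1 era syntax).
session.execute("""
    CREATE TABLE IF NOT EXISTS lookup_by_id (
        id text PRIMARY KEY,
        payload text
    )
    WITH compaction = { 'class': 'MemoryOnlyStrategy', 'size_limit_in_mb': 256 }
     AND compression = { 'sstable_compression': '' }
""")

cluster.shutdown()
```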

Related

Evaluation of Cassandra ring

I am not sure if this is the right place to post this, but let me know if not. I have a Cassandra ring with 16 nodes which contains ~1.1 billion records. I would like to know how I can evaluate this cluster. What kind of metrics are important to collect, and how (e.g. memory consumption during the inserts)? For example, write/read speed? Compression ratio? Compacted partitions? Should I also use JConsole somehow?
Feel free to post any documentation or links.
Important metrics to watch for on an Apache Cassandra cluster:
Java heap usage
reads/sec
writes/sec
current compactions
disk space
disk latency
disk IOPS
nodes down
CPU (although, I wouldn't alert on it)
That should be a good enough list to help you get started. Check out the Monitoring page from the official docs for more information and additional metrics.
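If you want to script the collection of some of these metrics instead of relying only on OpsCenter or JConsole, here is a minimal sketch that shells out to nodetool from Python. It assumes nodetool is on the PATH of the node you run it on, and the keyspace name is hypothetical (on 2.x versions the tablestats subcommand is called cfstats).

```python
# Hedged sketch: pulling a few of the metrics listed above with nodetool.
import subprocess

def nodetool(*args):
    """Run a nodetool subcommand and return its raw text output."""
    return subprocess.run(["nodetool", *args], capture_output=True,
                          text=True, check=True).stdout

# Thread pool stats: pending/blocked read and write stages
print(nodetool("tpstats"))

# Per-table latencies, partition sizes, and compression ratio
print(nodetool("tablestats", "my_keyspace"))   # hypothetical keyspace

# Active/pending compactions and node liveness
print(nodetool("compactionstats"))
print(nodetool("status"))
```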

Cassandra read latency increases while writing

I have a Cassandra cluster whose read latency increases during writes. The writes mostly happen via Spark jobs during the night, in huge bursts. The writes use LOCAL_QUORUM and the reads use LOCAL_ONE. Is there a way to reduce read latency while writes are happening?
Cassandra Cluster Configs
10 Node cassandra cluster (5 in DC1, 5 in DC2)
CPU: 8 Core
Memory: 32GB
Grafana metrics (dashboard screenshots not included)
I can give some advice:
Use the LCS (LeveledCompactionStrategy) compaction strategy.
Prefer a round-robin load balancing policy for reads.
Choose the partition key wisely so that requests are not concentrated on a single partition.
Partition size also plays an important role. Cassandra recommends keeping partitions small. However, I have tested with partitions of 10,000 rows each, with each row about 800 bytes in size, and that worked better than 3,000 rows (or even 1 row) per partition. Very tiny partitions tend to increase CPU usage when the stored data is large in terms of row count. However, very large partitions should be avoided as well.
The replication factor should be chosen strategically. The write consistency level should be decided considering the replication settings of all keyspaces. A driver-level sketch of the load-balancing and consistency settings is shown below.
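As a concrete illustration of the load-balancing and consistency advice above, here is a hedged sketch using the DataStax Python driver; the contact point, local DC name, keyspace, table, and statements are hypothetical.

```python
# Sketch: token-aware, DC-aware round-robin load balancing with
# LOCAL_QUORUM writes and LOCAL_ONE reads, as described in the question.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy
from cassandra.query import SimpleStatement

profile = ExecutionProfile(
    load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy(local_dc="DC1")),
    consistency_level=ConsistencyLevel.LOCAL_ONE,   # default, used by reads
)
cluster = Cluster(["10.0.0.1"], execution_profiles={EXEC_PROFILE_DEFAULT: profile})
session = cluster.connect("my_keyspace")

# Bulk writes (e.g. from the nightly job) explicitly request LOCAL_QUORUM.
insert = SimpleStatement(
    "INSERT INTO events (id, payload) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM)
session.execute(insert, ("k1", "v1"))

# Reads fall back to the profile default (LOCAL_ONE).
rows = session.execute("SELECT payload FROM events WHERE id = %s", ("k1",))
cluster.shutdown()
```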

Cassandra compaction: does replication factor have any influence?

Let's assume that the total disk usage of all keyspaces is 100GB before replication, and the replication factor is 3, making the total physical disk usage 100GB x 3 = 300GB.
We use the default compaction strategy (size-tiered), and let's assume the worst case where Cassandra needs as much free space as the data itself to complete the compaction. Does Cassandra need 100GB (before replication) or 300GB (100GB x 3 with replication)?
In other words, when Cassandra needs free disk space for performing compaction, does the replication factor have any influence?
Compaction in Cassandra is local to a node.
Now let's say you have a 3-node cluster, the replication factor is also 3, and the original data size is 100GB. This means that each node holds 100GB worth of data.
Hence, on each node you will need 100GB of free space to compact the data present on that node.
TL;DR: the free space required for compaction is equal to the total data present on that node.
Because data is replicated between the nodes, every node will need up to 100GB of free space - so it is 300GB in total, but not on any one node.
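Putting the numbers from the question into a small worked example (values taken from the question, using the worst-case size-tiered assumption stated above):

```python
# Worked example of the free-space reasoning above.
total_data_gb = 100        # logical data before replication
replication_factor = 3
nodes = 3

# With RF 3 on a 3-node cluster, every node stores a full copy of the data.
data_per_node_gb = total_data_gb * replication_factor / nodes        # 100 GB

# Worst-case size-tiered compaction may need as much free space as the data
# being compacted, and compaction runs locally on each node.
free_space_per_node_gb = data_per_node_gb                            # 100 GB
cluster_wide_headroom_gb = free_space_per_node_gb * nodes            # 300 GB

print(data_per_node_gb, free_space_per_node_gb, cluster_wide_headroom_gb)
```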

How dataproc works with google cloud storage?

I am trying to understand how Google Dataproc works with GCS. I am using PySpark on Dataproc, and data is read from and written to GCS, but I am unable to figure out the best machine types for my use case. Questions:
1) Does Spark on Dataproc copy data to local disk? E.g., if I am processing 2 TB of data, is it OK to use 4 worker nodes with 200GB HDD each, or should I at least provide disks that can hold the input data?
2) If the local disk is not used at all, is it OK to use high-memory, low-disk instances?
3) If the local disk is used, which instance type is good for processing 2 TB of data with the minimum possible number of nodes? I mean, is it good to use SSDs?
Thanks
Manish
Spark will read data directly into memory and/or onto disk, depending on whether you use RDDs or DataFrames. You should have at least enough disk to hold all of the data. If you are performing joins, the amount of disk necessary grows to handle shuffle spill.
This equation changes if you discard a significant amount of data through filtering.
Whether you use pd-standard, pd-ssd, or local-ssd comes down to cost and whether your application is CPU- or IO-bound.
Disk IOPS is proportional to disk size, so very small disks are inadvisable. Keep in mind that disk (relative to CPU) is cheap.
Same advice goes for network IO: more CPUs = more bandwidth.
Finally, the default Dataproc settings are a reasonable place to start experimenting before tweaking further.
Source: https://cloud.google.com/compute/docs/disks/performance
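For context, the read-from/write-to-GCS pattern being discussed looks roughly like the PySpark sketch below; the bucket names and column are hypothetical, and it relies on the GCS connector that Dataproc preinstalls so gs:// paths work out of the box.

```python
# Hedged PySpark sketch of the GCS read/process/write pattern on Dataproc.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gcs-example").getOrCreate()

# DataFrame reads stream from GCS into executor memory; shuffles spill to the
# workers' local disks, which is why the answer sizes disks to the data volume.
df = spark.read.parquet("gs://my-input-bucket/events/")    # hypothetical bucket
filtered = df.filter(df["status"] == "ok")                 # filtering reduces shuffle spill

filtered.write.mode("overwrite").parquet("gs://my-output-bucket/events_ok/")
spark.stop()
```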

Cassandra: the available RAM is used to the maximum

I want to make a simple query over approximately 10 million rows.
I have 32GB of RAM (20GB of it free), and Cassandra uses so much memory that the available RAM is exhausted and the process is killed.
How can I optimize Cassandra? I have read about "Tuning Java resources" and changing the Java heap sizing, but I still have no solution.
Cassandra will use as much memory as is available to it on the system. It is a greedy process and will use any available memory for caching, similar to the way the kernel page cache works. Don't worry if Cassandra is using all of your host's memory; it will just be held as cache and released to other processes if necessary.
If your query is suffering from timeouts, this is probably from reading too much data from a single partition, so the query doesn't return within read_request_timeout_in_ms. If that is the case, you should look at making your partition sizes smaller.
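If the timeouts come from pulling too many rows in a single request, driver-side paging keeps each round trip small; below is a minimal sketch with the DataStax Python driver, where the contact point, keyspace, and table name are hypothetical.

```python
# Hedged sketch: page through a large table with a small fetch_size so each
# request stays well under read_request_timeout_in_ms.
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["10.0.0.1"])
session = cluster.connect("my_keyspace")

# Fetch 1,000 rows per page instead of pulling ~10M rows in one round trip;
# iterating the result set transparently requests the next page.
stmt = SimpleStatement("SELECT * FROM big_table", fetch_size=1000)
for row in session.execute(stmt):
    pass  # process each row here

cluster.shutdown()
```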
