Cassandra OutOfMemory - cassandra

We have a 5-node cluster running Cassandra 3.0.9 and are seeing OutOfMemory exceptions in Cassandra. It uses the Thrift library API 0.9.2. These OutOfMemory exceptions occur every 2-3 days on random nodes of the cluster.
The max heap size for each Cassandra process is 8GB and the RAM is 32GB.
We analyzed the heap dump and it shows that each Thrift thread object is 128MB and there are around 55 such threads. These Thrift objects are consuming a lot of memory, around 7GB in total.
Heap Dump
We are not sure whether there is a memory leak in the Thrift API.
Any help would be appreciated.

Related

How to start Cassandra and Spark services assigning half of the RAM memory on each program?

I have a 4-node cluster with Spark and Cassandra installed on each node.
Spark version 3.1.2 and Cassandra v3.11.
Each node has 4GB of RAM, and I want to run my "spark+cassandra" program across the whole cluster.
How can I assign 2GB of RAM to Cassandra and 2GB to Spark?
I have noticed the following:
If my Cassandra cluster is up and I run the start-worker.sh command on a worker node to bring my Spark cluster up, the Cassandra service suddenly stops but Spark keeps working. Basically, Spark steals RAM from Cassandra. How can I avoid this as well?
In the Cassandra logs of the crashed node I see the message:
There is insufficient memory for the Java Runtime Environment to continue.
In fact, typing top -c and then Shift+M, I can see the Spark service at the top of the memory column.
Thanks for any suggestions.
By default, Spark workers take up the total RAM less 1GB. On a 4GB machine, the worker JVM consumes 3GB of memory. This is the reason the machine runs out of memory.
You'll need to configure the SPARK_WORKER_MEMORY to 1GB to leave enough memory for the operating system. For details, see Starting a Spark cluster manually.
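A minimal sketch of that setting, assuming a standalone Spark cluster where each worker reads conf/spark-env.sh (the 1g value and the optional core cap are only illustrative for these 4GB nodes):

# conf/spark-env.sh on every worker node
export SPARK_WORKER_MEMORY=1g   # cap the worker JVM so Cassandra and the OS keep the rest
export SPARK_WORKER_CORES=1     # optional: also limit cores, since the node is shared with Cassandra

Restart the worker (start-worker.sh) after the change so it picks up the new limits.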
It's very important to note, as Alex Ott already pointed out, that a machine with only 4GB of RAM is not going to be able to do much, so expect to run into performance issues. Cheers!

Spark driver pod getting killed with 'OOMKilled' status

We are running a Spark Streaming application on a Kubernetes cluster using Spark 2.4.5.
The application receives massive amounts of data through a Kafka topic (one message every 3 ms). 4 executors and 4 Kafka partitions are being used.
While running, the memory of the driver pod keeps increasing until it gets killed by K8s with an 'OOMKilled' status. The memory of the executors is not facing any issues.
When checking the driver pod resources using this command:
kubectl top pod podName
We can see that the memory increases until it reaches 1.4GB, and then the pod gets killed.
However, when checking the storage memory of the driver in the Spark UI, we can see that the storage memory is not fully used (50.3 KB / 434 MB). Is there any difference between the storage memory of the driver and the memory of the pod containing the driver?
Has anyone had experience with a similar issue before?
Any help would be appreciated.
Here are a few more details about the app:
Kubernetes version : 1.18
Spark version : 2.4.5
Batch interval of spark streaming context : 5 sec
Rate of input data: 1 Kafka message every 3 ms
Scala language
In brief, the Spark memory consists of three parts:
Reserved memory (300MB)
User memory ((all - 300MB)*0.4), used for data processing logic.
Spark memory ((all - 300MB) * 0.6, where 0.6 is spark.memory.fraction), used for caching and shuffle in Spark.
Besides this, there is also max(executor memory * 0.1, 384MB) of extra memory (0.1 is spark.kubernetes.memoryOverheadFactor) used for non-JVM memory in K8s.
Increasing the K8s memory limit by this memory overhead, on top of the driver/executor memory, should fix the OOM; a rough calculation is sketched below.
You can also decrease spark.memory.fraction to allocate more RAM to user memory.
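As a rough illustration of those numbers, assuming the driver pod was given about 1g of JVM heap (the question does not state the exact setting, so the figures are only approximate):

# Reserved memory                      = 300 MB
# User memory   = (1024 - 300) * 0.4  ~= 290 MB
# Spark memory  = (1024 - 300) * 0.6  ~= 434 MB  (in the same ballpark as the 434 MB shown in the Spark UI)
# Pod memory    = 1024 + max(1024 * 0.1, 384) = 1408 MB, roughly the 1.4GB at which the pod gets killed
#
# One way to give the driver more non-JVM headroom (values are illustrative, not a recommendation):
spark-submit \
  --conf spark.driver.memory=1g \
  --conf spark.driver.memoryOverhead=1g \
  --conf spark.memory.fraction=0.6 \
  ...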

Spark driver memory exceeded the storage memory

I cannot understand why, in the Spark UI, the driver's used storage memory (2.1GB) exceeds the total available memory (1.5GB).
When I run the same application with Spark 2.1.1 I don't see this behavior; the Spark driver memory stays at a few MB. Also, the application becomes slower and slower with the same data.
My questions:
Is the used storage memory an accumulation rather than the currently used memory?
Is it a UI bug?
What are these 2.1GB of data?

Cassandra keeps using 100% of CPU and not utilizing memory?

We have set up a single-node Cassandra 3.11 cluster with JDK 1.8 on EC2, instance type t2.large, which has 2 CPUs and 7 GB of RAM.
We are facing an issue where Cassandra keeps reaching 100% CPU even though we do not have that much load.
We have 7GB of RAM but Cassandra is not utilizing that memory; it only uses 1.7-1.8 GB of RAM.
What configuration needs to change to keep CPU utilization from reaching 100%?
What is the best configuration to get better performance out of Cassandra?
Right now we are only able to get about 100-120 read and 50-100 write operations per second.
Please help us understand the issue and what we can do to improve the performance configuration.

Java OutOfMemoryError in Windows Azure Virtual Machine

When I run my Java application on a Windows Azure Ubuntu 12.04 VM,
with 4 x 1.6GHz cores and 7GB of RAM, I get the following out-of-memory error after a few minutes.
java.lang.OutOfMemoryError: GC overhead limit exceeded
I have a swap size of 15GB, and the max heap size is set to 2GB. I am using Oracle Java 1.6. Increasing the max heap size only delays the out-of-memory error.
It seems the JVM is not doing garbage collection.
However, when I run the same Java application on my local Windows 8 PC (Core i7) with the same JVM parameters, it runs fine. The heap size never exceeds 1GB.
Is there any extra setting on a Windows Azure Linux VM for running Java apps?
On the Azure VM, I used the following JVM parameter
-XX:+HeapDumpOnOutOfMemoryError
to get a heap dump. The heap dump shows that an actor mailbox and Camel messages are taking up all of the 2GB.
In my Akka application, I have used Akka Camel Redis to publish processed messages to a Redis channel.
The out-of-memory error goes away when I stub out the above Camel actor. It looks as though the Akka Camel Redis actor
is not performant on the VM, which has a slower CPU clock speed than my Xeon CPU.
Shing
The JVM throws this exception when too much time is spent in garbage collection while recovering very little memory. I believe the default thresholds are 98% of CPU time being spent on GC with less than 2% of the heap being recovered.
This is to prevent applications from running for an extended period of time while making no progress because the heap is too small.
You can turn this off with the command-line option -XX:-UseGCOverheadLimit.
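A minimal sketch of launching with that flag, keeping the heap-dump option mentioned above so any remaining OOM can still be analyzed (the jar name and dump path are placeholders):

# Disable the GC overhead limit check and still capture a heap dump on a real OOM.
java -Xmx2g \
  -XX:-UseGCOverheadLimit \
  -XX:+HeapDumpOnOutOfMemoryError \
  -XX:HeapDumpPath=/tmp/heapdump.hprof \
  -jar my-app.jar

Note that disabling the limit only removes the early failure; if the heap really is too small for the workload, the application will still eventually hit a plain java.lang.OutOfMemoryError.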
