How to optimally tune the JVM settings of my DSE Spark nodes? - apache-spark

I have a 6-node cluster; each node has a 32-core CPU and 64 GB of RAM.
As of now, all nodes are running with the default Cassandra (v2.1.5) JVM settings. With these settings, each node uses 40GB of RAM and 20% CPU. It is a read-heavy cluster with a constant flow of data and deletes.
Do I need to tune Cassandra's JVM settings to utilize more memory? What else should I be looking at to choose appropriate settings?
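For context, in a stock Cassandra 2.1 / DSE install the JVM heap is typically controlled through cassandra-env.sh. A minimal sketch of the relevant knobs, with purely illustrative values rather than recommendations:

# conf/cassandra-env.sh (the path may differ in a DSE package)
# If you override the auto-calculated heap, set both values together.
MAX_HEAP_SIZE="16G"   # total JVM heap; illustrative value only
HEAP_NEWSIZE="1600M"  # young-generation size; illustrative value only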

Related

How to start Cassandra and Spark services assigning half of the RAM memory on each program?

I have a 4-node cluster with Spark and Cassandra installed on each node.
Spark version 3.1.2 and Cassandra v3.11.
Let's say each node has 4GB of RAM and I want to run my "spark+cassandra" program across the whole cluster.
How can I assign 2GB of RAM to Cassandra and 2GB to Spark?
I have noticed that if my Cassandra cluster is up and I run the start-worker.sh command on a worker node to bring my Spark cluster up, the Cassandra service suddenly stops while Spark keeps working. Basically, Spark steals RAM from Cassandra. How can I avoid this as well?
In the Cassandra logs of the crashed node I see the message:
There is insufficient memory for the Java Runtime Environment to continue.
In fact, running top -c and then Shift+M, I can see the Spark service at the top of the memory column.
Thanks for any suggestions.
By default, a Spark worker takes the machine's total RAM minus 1GB, so on a 4GB machine the worker JVM consumes 3GB of memory. This is why the machine runs out of memory.
You'll need to set SPARK_WORKER_MEMORY to 1GB to leave enough memory for the operating system. For details, see Starting a Spark cluster manually.
It's very important to note, as Alex Ott already pointed out, that a machine with only 4GB of RAM is not going to be able to do much, so expect to run into performance issues. Cheers!
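A minimal sketch of that change, assuming the standalone worker is configured through conf/spark-env.sh on each node (the 1g value mirrors the suggestion above):

# conf/spark-env.sh on every worker node
# Cap the memory the standalone worker offers to executors,
# leaving the rest of the 4GB machine for Cassandra and the OS.
SPARK_WORKER_MEMORY=1g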

Cassandra - Out of memory (large server)

We started using a Cassandra cluster with a very high workload (4 hosts). The machines have 128GB of RAM and 64 CPUs each. My max_heap is set to 48 GB and my new_heap to 2 GB. What should be considered in this case? Anyone?
There is no script or anything that analyzes this type of configuration.
The older servers run with 60GB and 16 processors, and that's fine.
Why does this problem occur?

Hadoop YARN Cluster / Spark and RAM Disks

Because my computational tasks require fast disk I/O, I am interested in mounting large RAM disks on each worker node in a YARN cluster that runs Spark, and am thus wondering how the YARN cluster manager handles the memory occupied by such a RAM disk.
If I were to allocate 32GB to a RAM disk on each 128GB RAM machine, for example, would the YARN cluster manager know how to allocate RAM so as to avoid over-allocating memory when performing tasks (in this case, would YARN allocate the full 128GB of RAM to the requisitioned tasks, or at most only 96GB)?
If so, is there any way to indicate to the YARN cluster manager that a RAM disk is present and that a specific portion of the RAM is therefore off limits to YARN? Will Spark know about these constraints as well?
In the Spark configuration you can set driver and executor options such as core counts and memory allocation. Moreover, when you use YARN as the resource manager, there are some extra configs supported by it that can help you manage the cluster's resources better, for example "spark.driver.memoryOverhead" or "spark.yarn.am.memoryOverhead", which is the amount of off-heap space, with a default value of
AM memory * 0.10, with a minimum of 384 (MB).
For further information, here is the link.
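As a rough sketch, those overhead settings can be passed straight to spark-submit; the sizes and the application jar below are placeholders, not recommendations:

# Give the YARN application master and the driver explicit off-heap headroom (values in MB).
spark-submit \
  --master yarn \
  --conf spark.yarn.am.memoryOverhead=1024 \
  --conf spark.driver.memoryOverhead=1024 \
  --executor-memory 8g \
  your-app.jar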

Why does my Spark only use two computers in the cluster?

I'm using Spark 1.3.1 in standalone mode on my cluster, which has 7 machines. Two of the machines are powerful, with 64 cores and 1024 GB of memory, while the others have 40 cores and 256 GB of memory. One of the powerful machines is set to be the master, and the others are set to be slaves. Each slave machine runs 4 workers.
When I run my driver program on one of the powerful machines, I see that it takes cores only from the two powerful machines. Below is a part of the web UI of my Spark master.
My configuration of this Spark driver program is as follows:
spark.scheduling.mode=FAIR
spark.default.parallelism=32
spark.cores.max=512
spark.executor.memory=256g
spark.logConf=true
Why does Spark do this? Is this a good thing or a bad thing? Thanks!
Consider lowering your executor memory from the 256GB that you have defined; a request of spark.executor.memory=256g can most likely only be satisfied by workers on the two 1024GB machines, not by the 4 workers sharing each 256GB slave.
Going forward, consider assigning around 75% of the available memory.
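A sketch of what that could look like among the driver settings listed above; the 48g figure is only an example chosen so that the 256GB slaves (4 workers, roughly 64GB each) could also host executors:

# Lower the per-executor memory so every worker can satisfy the request.
spark.executor.memory=48g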

Apache Spark does not see all the ram of my machines

I have created a Spark cluster of 8 machines. Each machine has 104 GB of RAM and 16 virtual cores.
It seems that Spark only sees 42 GB of RAM per machine, which is not correct. Do you know why Spark does not see all the RAM of the machines?
PS : I am using Apache Spark 1.2
Seems like a common misconception. What is displayed is the memory governed by spark.storage.memoryFraction:
https://stackoverflow.com/a/28363743/4278362
Spark makes no attempt at guessing the available memory. Executors use as much memory as you specify with the spark.executor.memory setting. Looks like it's set to 42 GB.
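A minimal sketch of sizing executors explicitly at submit time; the 80g figure and the jar name are placeholders, and you should leave headroom for the OS and overhead:

# Spark will not guess: request the executor size you actually want.
spark-submit \
  --executor-memory 80g \
  your-app.jar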
