I have a question about memory config. I am running a 3-node cluster with Spark, Cassandra, Hadoop, Thrift and YARN. I want to store my files in HDFS, so I loaded Hadoop. I am finding that I run out of memory when running my queries. I was able to figure out how to restrict Cassandra to run in less than 4 GB. Is there such a setting for Hadoop? How about YARN? As I only use Hadoop to load my flat files, I think setting it to 1 or 2 GB should be fine. My boxes have 32 GB of RAM and 16 cores each.
It is hard to say without the error message you are seeing, but if you want to control how much memory is allocated on your workers, you can set these two properties in your yarn-site.xml:
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>40960</value> <!-- total RAM YARN may use on this node (40 GB) -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>8192</value> <!-- max RAM per container -->
</property>
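Since the question also asks about capping Hadoop itself: the heap of the Hadoop/YARN daemons is set separately from container memory. A minimal sketch, assuming stock hadoop-env.sh and yarn-env.sh locations (values are in MB and are only illustrative, adjust for your boxes):

# etc/hadoop/hadoop-env.sh -- heap for the HDFS daemons (NameNode/DataNode)
export HADOOP_HEAPSIZE=1024

# etc/hadoop/yarn-env.sh -- heap for the YARN daemons (ResourceManager/NodeManager)
export YARN_HEAPSIZE=1024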
Related
I have a 4-node cluster with Spark and Cassandra installed on each node.
Spark version 3.1.2 and Cassandra v3.11
Let's say each node has 4 GB of RAM and I want to run my "Spark + Cassandra" program across the whole cluster.
How can I assign 2 GB of RAM for Cassandra and 2 GB for Spark?
I have also noticed the following: if my Cassandra cluster is up and I run the start-worker.sh command on a worker node to bring my Spark cluster up, the Cassandra service suddenly stops while Spark keeps working. Basically, Spark steals RAM from Cassandra. How can I avoid this as well?
In the Cassandra logs of the crashed node I read the message:
There is insufficient memory for the Java Runtime Environment to continue.
In fact, running top -c and then pressing Shift+M, I can see the Spark service at the top of the memory column.
Thanks for any suggestions.
By default, Spark workers take the machine's total RAM less 1 GB, so on a 4 GB machine the worker JVM claims 3 GB of memory. This is the reason the machine runs out of memory.
You'll need to set SPARK_WORKER_MEMORY to 1 GB to leave enough memory for the operating system. For details, see Starting a Spark cluster manually.
It's very important to note, as Alex Ott already pointed out, that a machine with only 4 GB of RAM is not going to be able to do much, so expect to run into performance issues. Cheers!
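As a minimal sketch of where those limits live, assuming the standard spark-env.sh and cassandra-env.sh locations and the 4 GB machines described above (the sizes are only an illustrative split, not a recommendation):

# conf/spark-env.sh on each worker -- hypothetical split for a 4 GB node
SPARK_WORKER_MEMORY=1g    # total memory the worker may hand out to executors

# conf/cassandra-env.sh -- pin the heap instead of letting Cassandra auto-size it
MAX_HEAP_SIZE="2G"
HEAP_NEWSIZE="400M"       # roughly 100 MB per CPU core is the usual guideline

That leaves about 1 GB for the operating system and everything else on each node.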
I created a Spark cluster (for learning, so I did not create a high-memory/high-CPU cluster) with 1 master node and 2 core nodes to run executors, using the config below:
Master: 1 x m4.large (2 cores, 8 GB)
Core: 2 x c4.large (2 cores, 3.75 GB)
Hive 2.1.1, Pig 0.16.0, Hue 3.11.0, Spark 2.1.0, Sqoop 1.4.6, HBase 1.3.0
When I run pyspark, I get the error below:
Required executor memory (1024+384 MB) is above the max threshold (896 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'.
Before trying to increase the yarn-site.xml config, I am curious to understand why EMR takes just 896 MB as the limit when the master has 8 GB and each worker node has 3.75 GB.
Also, the Resource Manager URL (for the master: http://master-public-dns-name:8088/) shows 1.75 GB, whereas the memory of the VM is 8 GB. Is HBase or other software taking up too much memory?
If anyone has encountered a similar issue, please share your insight into why EMR sets such low defaults. Thanks!
Before trying to increase the yarn-site.xml config, I am curious to understand why EMR takes just 896 MB as the limit when the master has 8 GB and each worker node has 3.75 GB.
If you run Spark jobs in YARN cluster mode (which you probably were), the executors run on the core nodes and the master's memory is not used.
Now, although your core EC2 instance (c4.large) has 3.75 GB to use, EMR configures YARN not to use all of this memory for running YARN containers or Spark executors. This is because you have to leave enough memory for the other permanent daemons (HDFS's DataNode, YARN's NodeManager, EMR's own daemons, etc., depending on the applications you provision).
EMR publishes the default YARN configuration it sets for each instance type on this page: http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hadoop-task-config.html
c4.large
Configuration Option Default Value
mapreduce.map.java.opts -Xmx717m
mapreduce.map.memory.mb 896
yarn.scheduler.maximum-allocation-mb 1792
yarn.nodemanager.resource.memory-mb 1792
So yarn.nodemanager.resource.memory-mb = 1792, which means 1792 MB is the physical memory that will be allocated to YARN containers on a core node that has 3.75 GB of actual memory. Also check spark-defaults.conf, where EMR sets some defaults for Spark executor memory. These are defaults, and of course you can change them before starting the cluster using EMR's configurations API. But keep in mind that if you over-provision memory for YARN containers, you might starve some other processes.
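As a hedged sketch with hypothetical sizes, assuming the c4.large defaults above, one quick way to get pyspark running without touching yarn-site.xml is to shrink the executor request so that executor memory plus YARN overhead stays within the container cap reported in the error:

# executor memory + YARN overhead must fit in the reported 896 MB cap: 512 + 384 = 896
pyspark --master yarn \
  --conf spark.executor.memory=512m \
  --conf spark.yarn.executor.memoryOverhead=384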
Given that, it is important to understand the YARN configs and how Spark interacts with YARN:
https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
http://spark.apache.org/docs/latest/running-on-yarn.html
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
It's not really a property of EMR but rather of YARN, which is the resource manager running on EMR.
My personal take on YARN is that it is really built for managing long-running clusters that continuously take in a variety of jobs that they have to run simultaneously. In those cases it makes sense for YARN to assign only a small part of the available memory to each job.
Unfortunately, when it comes to special-purpose clusters (as in: "I will just spin up a cluster, run my job and terminate the cluster again"), these YARN defaults are simply annoying, and you have to configure a bunch of things to make YARN utilise your resources optimally. But running on EMR that's what we are stuck with these days, so one has to live with it...
I have a question that I have not been able to find an answer to anywhere.
I am using the following lines to load data within a PySpark application:
loadFile = self.tableName + ".csv"
dfInput = (self.sqlContext.read.format("com.databricks.spark.csv")
           .option("header", "true").load(loadFile))
My cluster configuration is as follows:
I am using a Spark cluster with 3 nodes: 1 node is used to start the master, the other 2 nodes are running 1 worker each.
I submit the application from outside the cluster on a login node, with a script.
The script submits the Spark application in cluster deploy mode, which I think means that in this case the driver runs on any of the 3 nodes I am using.
The input CSV files are stored in a globally visible temporary file system (Lustre).
In Apache Spark standalone mode, how does loading partitions into RAM work?
Does each executor access the driver node's RAM and load partitions into its own RAM from there? (storage --> driver's RAM --> executor's RAM)
Or does each executor access storage directly and load the data into its own RAM? (storage --> executor's RAM)
Or is it neither of these and am I missing something here? How can I observe this process myself (a monitoring tool, a Unix command, somewhere in Spark)?
Any comment or resource that would let me dig deeper into this would be very helpful. Thanks in advance.
The second scenario is correct:
each executor accesses storage directly and loads the data into its own RAM (storage --> executor's RAM)
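If you want to see this for yourself, here is a small hedged sketch, assuming a SQLContext named sqlContext (as in your code) and a hypothetical path on the shared Lustre file system that every node can see. It records which worker host reads each partition; the driver's host should not show up:

import socket

# hypothetical path on the shared (Lustre) file system
df = (sqlContext.read.format("com.databricks.spark.csv")
      .option("header", "true").load("/lustre/path/to/table.csv"))

# (partition index, hostname of the executor that read it, row count)
parts = df.rdd.mapPartitionsWithIndex(
    lambda i, rows: [(i, socket.gethostname(), sum(1 for _ in rows))]).collect()
print(parts)

The application web UI on the driver (port 4040) also shows per-executor input and storage numbers while this runs.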
I'm trying to benchmark a program on an Azure cluster using Spark. We previously ran this on EC2 and know that 150 GB of RAM is sufficient. I have tried multiple setups for the executors and given them 160-180 GB of RAM, but regardless of what I do, the program dies because the executors request more memory.
What can I do? Are there more launch options I should consider? I have tried every conceivable executor setup and nothing seems to work. I'm at a total loss.
In your command you specified 7 executors, each with 40 GB of memory. That's 280 GB of memory in total, but you said your cluster has only 160-180 GB of memory. If only 150 GB of memory is needed, why is spark-submit configured that way?
What is your HDI cluster node type and how many nodes did you create?
Were you using YARN on EC2 previously as well? If so, is the configuration the same?
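As a hedged sketch with hypothetical sizes (assuming YARN's per-container caps on the HDI cluster allow them, and with your_benchmark.py standing in for the actual job), a submit line sized to roughly the 150 GB the job actually needs, rather than 280 GB, might look like:

# 7 executors x (20g + ~2g YARN overhead) ~= 155 GB total, close to what sufficed on EC2
spark-submit --master yarn --deploy-mode cluster \
  --num-executors 7 \
  --executor-memory 20g \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  your_benchmark.py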
I have created a Spark cluster of 8 machines. Each machine has 104 GB of RAM and 16 virtual cores.
It seems that Spark only sees 42 GB of RAM per machine, which is not correct. Do you know why Spark does not see all of the machines' RAM?
PS : I am using Apache Spark 1.2
This seems like a common misconception. What is displayed is the spark.storage.memoryFraction:
https://stackoverflow.com/a/28363743/4278362
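As a hedged back-of-the-envelope check, assuming Spark 1.x defaults (spark.storage.memoryFraction = 0.6, spark.storage.safetyFraction = 0.9) and a hypothetical executor size, the storage memory shown in the UI relates to the configured executor memory roughly like this:

# hypothetical figures: how Spark 1.x computes the "memory" shown per executor
executor_memory_gb = 78              # assumed spark.executor.memory
memory_fraction = 0.6                # spark.storage.memoryFraction default
safety_fraction = 0.9                # spark.storage.safetyFraction default
print(executor_memory_gb * memory_fraction * safety_fraction)  # ~42 GB shown in the UI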
Spark makes no attempt at guessing the available memory. Executors use as much memory as you specify with the spark.executor.memory setting. Looks like it's set to 42 GB.