What is the minimum hardware infrastructure required for Spark to run in standalone cluster mode? - apache-spark

I am running a Spark standalone cluster on my local computer. This is the hardware information for my machine:
Intel Core i5
Number of Processors: 1
Total Number of Cores: 2
Memory: 4 GB.
I am trying to run a Spark program from Eclipse against the standalone cluster. This is part of my code:
String logFile = "/Users/BigDinosaur/Downloads/spark-2.0.1-bin-hadoop2.7 2/README.md";
SparkConf conf = new SparkConf().setAppName("Simple Application").setMaster("spark://BigDinosaur.local:7077");
After running the program in Eclipse I get the following warning message:
Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resource
Here is a screenshot of the web UI.
After going through other people's answers to similar problems, it seems that a hardware resource mismatch is the root cause.
I want to get more information on the following:
What is the minimum hardware infrastructure required for a Spark standalone cluster to run an application on it?

It started running after I ran the following command:
./start-slave.sh spark://localhost:7077 --cores 1 --memory 1g
I gave the worker 1 core and 1 GB of memory.

As far as I know, Spark allocates memory from whatever memory is available when the Spark job starts.
You may want to try explicitly providing the cores and executor memory when starting the job, for example as sketched below.
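A minimal sketch of passing those resource settings via spark-submit; the class name and jar path here are placeholders, not taken from the question:
spark-submit --master spark://BigDinosaur.local:7077 --executor-memory 1g --total-executor-cores 1 --class SimpleApp /path/to/simple-app.jar
The same values can also be set on the SparkConf (spark.executor.memory, spark.cores.max) before the context is created.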

Related

EMR Spark master node runs out of memory in YARN cluster mode

I am new to EMR and I am running an EMR cluster with 1 master (32 GB) and 5 core nodes (16 GB each). I launch 11 apps. The apps have to be kept separate in case one of them fails (all of them are streaming apps). I should mention that I also have Elasticsearch running on the cluster.
After some time the master node runs out of memory, stops responding, and some apps start to fail. In the process overview I found many smaller Hadoop processes that each occupy 1-1.3 GB of RAM. I guess these are the driver processes from each app. I tried to reduce the driver memory via "spark.driver.memory" to 512 MB, but it's still at 1.3 GB after relaunching the apps. Is this because of YARN?
Elasticsearch alone allocates about 6.5 GB of RAM on the master node.
I had to specify the driver memory in the spark-submit command, like this:
spark-submit --driver-memory 500M
because specifying it inside the Python file is too late when you run the driver in client mode: the driver JVM has already allocated its memory by then.
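For the cluster-mode streaming apps described in the question, the same flag can go on each launch command; a sketch, with a hypothetical script name:
spark-submit --master yarn --deploy-mode cluster --driver-memory 512m streaming_app.py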

What is the benefit of using more than 1 driver core in Spark YARN cluster mode?

What is the difference between using 1 vs. 2 driver cores in Spark YARN cluster mode? If I use 2 driver cores in YARN cluster mode, will the Spark driver be relaunched in case of failure? If so, how many retries would it do before failing?
I would appreciate it if anyone could share an article on this.
When you launch an application in YARN cluster mode, YARN creates a container for your driver.
This container, depending on your application, might need multiple cores and multiple gigabytes of memory. It all depends on how many sessions connect to your Spark application at the same time and on the complexity of your queries.
If your queries seem to compile slowly or your Spark Web UI/app hangs, it might be worth increasing the driver core count.
From YARN's point of view, there is still only one driver container.
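For example, a sketch of requesting extra driver resources at submit time; the class and jar names are placeholders, and --driver-cores only takes effect in cluster mode:
spark-submit --master yarn --deploy-mode cluster --driver-cores 2 --driver-memory 4g --class com.example.MyApp my-app.jar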

Spark client memory configuration

I'm trying to run multiple Spark clients on Airflow (an ETL scheduler).
I'm running in cluster mode on YARN, so the ApplicationMaster, executors, and driver all run inside YARN containers.
However, my Spark client, which polls the job and monitors its state, runs on the Airflow worker.
The problem is that the Spark client takes a lot of memory, ~500 MB per job. That may not sound like much compared to executors or drivers, but for the role of a Spark client it sounds crazy.
My question is: how can I configure/manipulate the Spark client's memory/CPU requirements? Can I limit its polling interval? Can I limit its memory with flags?
In the Spark code there is a distinction between running in standalone mode and cluster mode. For standalone it sets a default of -Xmx1g; in cluster mode there is no default, but it tries to read Java options from the environment variable SPARK_SUBMIT_OPTS.
So if you want to set Java options for the client Java process only, use SPARK_SUBMIT_OPTS.
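A sketch of capping the client-side JVM heap before submitting from the Airflow worker; the heap size and submit arguments here are illustrative only:
export SPARK_SUBMIT_OPTS="-Xmx256m"
spark-submit --master yarn --deploy-mode cluster my_job.py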

Spark Standalone cluster, memory per executor issue

Hi, I am launching my Spark application with the spark-submit script as follows:
spark-submit --master spark://Maatari-xxxxxxx.local:7077 --class EstimatorApp /Users/sul.maatari/IdeaProjects/Workshit/target/scala-2.11/Workshit-assembly-1.0.jar --deploy-mode cluster --executor-memory 15G num-executors 2
I have a Spark standalone cluster deployed on two nodes (my 2 laptops). The cluster is running fine. By default it sets 15G for the workers and 8 cores for the executors. Now I am experiencing the following strange behavior: although I am explicitly setting the memory, and this can also be seen in the environment variables of the SparkConf UI, the cluster UI says that my application is limited to 1024 MB of executor memory. This makes me think of the default 1G parameter. I wonder why that is.
My application indeed fails because of the memory issue. I know that I need a lot of memory for this application.
One last point of confusion is the driver program. Given that I am in cluster mode, why does spark-submit not return immediately? I thought that since the driver is executed on the cluster, the client (i.e. the submitting application) should return immediately. This further suggests to me that something is not right with my configuration and how things are being executed.
Can anyone help diagnose this?
Two possibilities:
Given that your command line has --num-executors mis-specified (the leading dashes are missing), it may be that Spark "gives up" on the other settings as well.
How much memory does your laptop have? Most of us use Macs, and in my experience you would not be able to run it with more than about 8 GB.
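For reference, a corrected sketch of the submit command, with the options placed before the application jar (spark-submit treats anything after the jar as application arguments) and the core count expressed via --total-executor-cores, which is the standalone-mode setting (--num-executors applies under YARN); the value 16 assumes 2 executors with the default 8 cores each:
spark-submit --master spark://Maatari-xxxxxxx.local:7077 --deploy-mode cluster --executor-memory 15G --total-executor-cores 16 --class EstimatorApp /Users/sul.maatari/IdeaProjects/Workshit/target/scala-2.11/Workshit-assembly-1.0.jar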

Is it possible to run multiple Spark applications on a mesos cluster?

I have a Mesos cluster with 1 master and 3 slaves (2 cores and 4 GB RAM each) that already has a Spark application up and running. I wanted to run another application on the same cluster, since the CPU and memory utilization isn't high. However, when I try to run the new application, I get the error:
16/02/25 13:40:18 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
I guess the new process is not getting any CPUs, as the old one occupies all 6.
I have tried enabling dynamic allocation, making the Spark app fine-grained, and assigning numerous combinations of executor cores and number of executors. What am I missing here? Is it possible to run multiple Spark frameworks on a Mesos cluster at all?
You can try setting spark.cores.max to limit the number of CPU cores each Spark application claims, which will free up resources for the other one.
Docs: https://spark.apache.org/docs/latest/configuration.html#scheduling
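For example, a sketch of capping the first application so the second one still receives offers; the master host, the cap of 3 (out of the cluster's 6 cores), and the class/jar names are placeholders:
spark-submit --master mesos://<mesos-master>:5050 --conf spark.cores.max=3 --class com.example.FirstApp first-app.jar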
