Spark Submit ExecutorAllocationManager Warning - apache-spark

I'm running a Spark job on an EMR cluster, but I keep getting an ExecutorAllocationManager warning in the logs.
Is this an important warning, and how can I fix it? I think it has to do with cluster scaling.
Also, after looking in the Spark job history I found that executors were removed before the job finished.
I run the job with:
spark-submit --master yarn --deploy-mode client --executor-cores 4 --num-executors 7 myJob.py
Also, the job takes over 1 hour. Is that normal?
The job reads a CSV file (~1 GB), fills some empty fields, and writes out a new CSV file.
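For context, a job like that might look roughly as follows in PySpark (the paths and column names here are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("FillEmptyFields").getOrCreate()

# Read the ~1 GB input CSV.
df = spark.read.csv("s3://my-bucket/input.csv", header=True, inferSchema=True)

# Fill empty fields with per-column defaults (hypothetical column names).
filled = df.fillna({"name": "unknown", "amount": 0})

# Write the result out as a new CSV.
filled.write.csv("s3://my-bucket/output.csv", header=True, mode="overwrite")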

Related

Why is only one core used instead of 32 in my spark-submit command?

Hi and thank you for your help,
I know there are a lot of topics on this issue; I've read many of them and tried a lot of solutions, but nothing helps: my spark-submit job still uses ONLY one core out of my 32 available cores.
With my spark-submit command I launch a PySpark script. This script runs a spark.sql command over a lot of Parquet files (around 6000 files of about 6 MB each, for a total of 600 million database tuples).
I use an AWS instance with 32 CPUs, 128 GB of RAM, and a 2 TB EBS disk on which my Parquet files are stored (it's not an HDFS file system).
I don't launch Spark with a master; I just use it standalone on my single EC2 instance.
Everything works fine, but the process takes 2 hours using only one of my 32 cores, so I expect to reduce the processing time by using all available cores!
I launch my PySpark script like this:
spark-submit --driver-memory 96G --executor-cores 24 ./my_pyspark.py input.txt output.txt
I tried to add the master parameter with local, like this:
spark-submit --master local[24] --driver-memory 96G ./my_pyspark.py input.txt output.txt
I also tried to start Spark as a server and give its URL to the master parameter:
spark-class org.apache.spark.deploy.master.Master
spark-submit --master spark://10.0.1.20:7077 --driver-memory 96G --executor-cores 24 ./my_pyspark.py input.txt output.txt
But none of these solutions works. While watching the process with htop, I see that ONLY one core is used. What did I miss???
Thanks
Your spark-submit command is wrong.
You shouldn't allocate 96G to the driver, and you should specify the number of executors and the number of cores for each executor.
For example, you can try:
spark-submit --driver-memory 8G --num-executors 15 --executor-memory 7G --executor-cores 2 ./my_pyspark.py input.txt output.txt
You should also probably use YARN as the resource manager: --master yarn
Also, defining master("local") in the SparkContext overrides your spark-submit command, so you should remove it from your code.
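For example, a minimal PySpark entry point that leaves the master choice to spark-submit could look like this (a sketch; the app name is arbitrary):

from pyspark.sql import SparkSession

# No .master(...) call here: whatever --master is passed to spark-submit wins.
spark = SparkSession.builder.appName("my_pyspark").getOrCreate()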

Spark Standalone: application gets 0 cores

I seem to be unable to assign cores to an application. This leads to the following (apparently common) error message:
Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
I have one master and two slaves in a Spark cluster. All are 8-core i7s with 16GB of RAM.
I have left spark-env.sh virtually untouched on all three, just specifying the master's IP address.
My spark-submit is the following:
nohup ./bin/spark-submit \
--jars ./ikoda/extrajars/ikoda_assembled_ml_nlp.jar,./ikoda/extrajars/stanford-corenlp-3.8.0.jar,./ikoda/extrajars/stanford-parser-3.8.0.jar \
--packages datastax:spark-cassandra-connector:2.0.1-s_2.11 \
--class ikoda.mlserver.Application \
--conf spark.cassandra.connection.host=192.168.0.33 \
--conf spark.cores.max=4 \
--driver-memory 4g --num-executors 2 --executor-memory 2g --executor-cores 2 \
--master spark://192.168.0.141:7077 ./ikoda/ikodaanalysis-mlserver-0.1.0.jar 1000 > ./logs/nohup.out &
I suspect I am conflating the SparkConf initialization in my code with the spark-submit flags. I need this because the app involves Spark Streaming, which can require reinitializing the SparkContext.
The SparkConf setup is as follows:
val conf = new SparkConf().setMaster(s"spark://$sparkmaster:7077").setAppName("MLPCURLModelGenerationDataStream")
conf.set("spark.streaming.stopGracefullyOnShutdown", "true")
conf.set("spark.cassandra.connection.host", sparkcassandraconnectionhost)
conf.set("spark.driver.maxResultSize", sparkdrivermaxResultSize)
conf.set("spark.network.timeout", sparknetworktimeout)
conf.set("spark.jars.packages", "datastax:spark-cassandra-connector:"+datastaxpackageversion)
conf.set("spark.cores.max", sparkcoresmax)
The Spark UI shows the following:
OK, this is definitely a case of programmer error.
But maybe others will make a similar mistake. The master had previously been used as a local Spark installation, and I had put some executor settings in spark-defaults.conf and then, months later, forgotten about them.
There is a cascading hierarchy: SparkConf settings take precedence, then spark-submit flags, then spark-defaults.conf; spark-defaults.conf in turn overrides the defaults built into Apache Spark.
Once I removed the settings from spark-defaults.conf, all was fixed.
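As an illustration, hypothetical leftovers like these in conf/spark-defaults.conf would silently constrain every application submitted through that installation until removed:

# spark-defaults.conf -- hypothetical stale entries from an earlier local setup
spark.executor.memory   16g
spark.cores.max         2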
This happens when you exceed your available physical memory.
Your Spark memory shown in the Spark UI is 14.6 GB, so you must request less than 14.6 GB per executor. To do this you can add a setting to your Spark conf, something like:
conf.set("spark.executor.memory", "10gb")
If you request more than your physical memory, Spark doesn't allocate CPU cores to your job: it displays 0 under Cores in the Spark UI and runs NOTHING.
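Equivalently, the same limit can be passed on the command line rather than set in code, e.g. (a sketch with a placeholder application jar):

spark-submit --executor-memory 10g your_app.jar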

How to submit a spark job in a 4 node CDH cluster

I have a cluster with following configurations.
Distribution : CDH5,
Number nodes : 4,
RAM : 126GB,
Number of cores : 24 per node,
Harddisk : 5TB
My input file size is 10 GB. It takes a lot of time (around 20 minutes) when I submit with the following command.
spark-submit --jars xxxx --files xxx,yyy --master yarn /home/me/python/ParseMain.py
In my Python code I am setting the following:
sparkConf = SparkConf().setAppName("myapp")
sc = SparkContext(conf = sparkConf)
hContext = HiveContext(sc)
How can I change the spark submit arguments so that I can achieve better performance?
Some spark-submit options that you could try:
--driver-cores 4
--num-executors 4
--executor-cores 20
--executor-memory 5G
CDH has to be configured with enough vCores and vMemory. Otherwise the submitted job would remain in the ACCEPTED state and never RUN.
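Put together with the original command, the full invocation might look like this (a sketch reusing the paths from the question; tune the numbers to what YARN actually has free):

spark-submit --jars xxxx --files xxx,yyy --master yarn \
  --driver-cores 4 --num-executors 4 --executor-cores 20 --executor-memory 5G \
  /home/me/python/ParseMain.py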

How to prevent Spark Executors from getting Lost when using YARN client mode?

I have one Spark job which runs fine locally with less data, but when I schedule it on YARN I keep getting the following error; slowly all executors get removed from the UI, and my job fails:
15/07/30 10:18:13 ERROR cluster.YarnScheduler: Lost executor 8 on myhost1.com: remote Rpc client disassociated
15/07/30 10:18:13 ERROR cluster.YarnScheduler: Lost executor 6 on myhost2.com: remote Rpc client disassociated
I use the following command to schedule the Spark job in yarn-client mode:
./spark-submit --class com.xyz.MySpark --conf "spark.executor.extraJavaOptions=-XX:MaxPermSize=512M" --driver-java-options -XX:MaxPermSize=512m --driver-memory 3g --master yarn-client --executor-memory 2G --executor-cores 8 --num-executors 12 /home/myuser/myspark-1.0.jar
What is the problem here? I am new to Spark.
I had a very similar problem: many executors were being lost no matter how much memory we allocated to them.
The solution, if you're using YARN, was to set --conf spark.yarn.executor.memoryOverhead=600; alternatively, if your cluster uses Mesos, you can try --conf spark.mesos.executor.memoryOverhead=600 instead.
In Spark 2.3.1+ the configuration option was renamed, so it is now --conf spark.executor.memoryOverhead=600.
It seems we were not leaving sufficient memory for YARN itself, and containers were being killed because of it. After setting the overhead we've had different out-of-memory errors, but not the same lost-executor problem.
You can follow this AWS post to calculate the memory overhead (and other Spark configs to tune): best-practices-for-successfully-managing-memory-for-apache-spark-applications-on-amazon-emr
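As a rough worked example, using Spark's documented default overhead of max(384 MB, 10% of executor memory): with --executor-memory 2G the default overhead is max(384 MB, ~205 MB) = 384 MB, so each YARN container asks for about 2048 + 384 = 2432 MB; raising the overhead to 600 MB raises the container request to about 2648 MB.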
When I had the same issue, deleting logs and freeing up more HDFS space worked.

Two Spark applications at the same time

I am using Spark Streaming and saving the processed output in a data.csv file:
SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount");
JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(1000));
JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
At the same time I would like to read the output of NetworkWordCount (data.csv) along with another new file and process it again simultaneously.
My questions here are:
Is it possible to run two Spark applications at the same time?
Is it possible to submit a Spark application through the code itself?
I am using a Mac. Currently I submit the Spark application from the Spark folder with the following command:
bin/spark-submit --class "com.abc.test.SparkStreamingTest" --master spark://xyz:7077 --executor-memory 20G --total-executor-cores 100 ../workspace/Test/target/Test-0.0.1-SNAPSHOT-jar-with-dependencies.jar 1000
or just without spark://ip:port and the executor memory / total executor cores:
bin/spark-submit --class "com.abc.test.SparkStreamingTest" --master local[4] ../workspace/Test/target/Test-0.0.1-SNAPSHOT-jar-with-dependencies.jar
and the other application, which reads a text file for batch processing, as follows:
bin/spark-submit --class "com.abc.test.BatchTest" --master local[4] ../workspace/Batch/target/BatchTesting-0.0.1-SNAPSHOT-jar-with-dependencies.jar
When I run both applications, SparkStreamingTest and BatchTest, separately, both work fine, but when I try to run them simultaneously I get the following error.
Currently I am using Spark standalone mode.
WARN AbstractLifeCycle: FAILED SelectChannelConnector#0.0.0.0:4040: java.net.BindException: Address already in use
java.net.BindException: Address already in use
Any help is much appreciated... I am totally at a loss.
From http://spark.apache.org/docs/1.1.0/monitoring.html:
If multiple SparkContexts are running on the same host, they will bind to successive ports beginning with 4040 (4041, 4042, etc).
Your apps should be able to run; this is just a warning about a port conflict, raised because you ran the two Spark apps at the same time. Don't worry about it: Spark will try 4041, 4042, and so on until it finds an available port. So in your case you will find two web UIs, ip:4040 and ip:4041, one for each app.
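If you'd rather pin each UI to a known port instead of relying on the automatic retry, you can also set spark.ui.port per application, e.g.:

bin/spark-submit --conf spark.ui.port=4050 --class "com.abc.test.BatchTest" --master local[4] ../workspace/Batch/target/BatchTesting-0.0.1-SNAPSHOT-jar-with-dependencies.jar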
