Spark-submit creates only 1 executor when the pyspark interactive shell creates 4 (both using yarn-client)

I'm using the quickstart cloudera VM (CDH 5.10.1) with Pyspark (1.6.0) and Yarn (MR2 included) to aggregate numerical data per hour. I've got 1 CPU with 4 cores and 32 GB of RAM.
I've got a file named aggregate.py, but until today I had never submitted the job with spark-submit; I used the pyspark interactive shell and copy/pasted the code to test it.
When starting the pyspark interactive shell I used:
pyspark --master yarn-client
I followed the processing in the web UI accessible at quickstart.cloudera:8088/cluster and could see that Yarn created 3 executors and 1 driver with one core each (not a good configuration, but the main purpose is a proof of concept until we move to a real cluster).
When submitting the same code with spark-submit:
spark-submit --verbose \
--master yarn \
--deploy-mode client \
--num-executors 2 \
--driver-memory 3G \
--executor-memory 6G \
--executor-cores 2 \
aggregate.py
I only have the driver, which also executes the tasks. Note that spark.dynamicAllocation.enabled is set to true in the environment tab, and spark.dynamicAllocation.minExecutors is set to 2.
I tried using spark-submit aggregate.py alone, and I still got only the driver acting as executor. I can't manage to get more than 1 executor with spark-submit, yet it works in the pyspark interactive shell!
My Yarn configuration is as follows:
yarn.nodemanager.resource.memory-mb = 17 GiB
yarn.nodemanager.resource.cpu-vcores = 4
yarn.scheduler.minimum-allocation-mb = 3 GiB
yarn.scheduler.maximum-allocation-mb = 16 GiB
yarn.scheduler.minimum-allocation-vcores = 1
yarn.scheduler.maximum-allocation-vcores = 2
If someone can explain to me what I'm doing wrong, it would be a great help!

You have to set the driver memory and executor memory in spark-defaults.conf.
It's located at
$SPARK_HOME/conf/spark-defaults.conf
and if there is a file like
spark-defaults.conf.template
then you have to rename the file to
spark-defaults.conf
and then set the number of executors, the executor memory, and the number of executor cores. You can get an example from the template file or check this link:
https://spark.apache.org/docs/latest/configuration.html.
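As a rough illustration (values taken from the spark-submit command in the question; the property names are the standard Spark configuration keys), the relevant entries in spark-defaults.conf could look like:
spark.executor.instances 2
spark.executor.cores 2
spark.executor.memory 6g
spark.driver.memory 3g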
or
When you used pyspark, it used the default executor memory, but here in spark-submit you set executor-memory = 6G. I think you have to reduce the memory or remove this flag so it can use the default memory.
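A sketch of that suggestion: simply the original command with the --executor-memory flag dropped so Spark falls back to its default.
spark-submit --verbose \
--master yarn \
--deploy-mode client \
--num-executors 2 \
--driver-memory 3G \
--executor-cores 2 \
aggregate.py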

Just a guess: as you said earlier, "Yarn created 3 executors and 1 driver with one core each", so you have 4 cores in total.
Now as per your spark-submit statement,
cores = num-executors (2) * executor-cores (2) + 1 for the driver = 5
But in total you have only 4 cores, so YARN is unable to give you the executors (after the driver, only 3 cores are left).
Check if this is the issue.
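One way to test this guess is to shrink the request so it fits within the 4 cores (the values below are only illustrative, not a recommended sizing):
spark-submit --master yarn --deploy-mode client --num-executors 2 --executor-cores 1 --executor-memory 2G --driver-memory 2G aggregate.py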

Related

Why is only one core used instead of 32 in my spark-submit command?

Hi, and thank you for your help.
I know there are a lot of topics about this issue; I read a lot of them and tried a lot of solutions, but nothing worked: my spark-submit job still ONLY uses one core out of my 32 available cores.
With my spark-submit command, I launch a Pyspark script. This Pyspark script runs a spark.sql command over a lot of parquet files (around 6,000 files of around 6 MB each, for a total of 600 million database tuples).
I use an AWS instance with 32 CPUs, 128 GB of RAM, and a 2 TB EBS disk on which my parquet files are stored (it's not an HDFS file system).
I don't launch Spark as a master; I just use it standalone on my single EC2 instance.
Everything works fine, but the process takes 2 hours using only one of my 32 cores, so I expect to reduce the processing time by using all available cores!
I launch my pyspark script like this:
spark-submit --driver-memory 96G --executor-cores 24 ./my_pyspark.py input.txt output.txt
I tried to add the master parameter with local, like this:
spark-submit --master local[24] --driver-memory 96G ./my_pyspark.py input.txt output.txt
I tried to start Spark as a master server and give its URL to the master parameter:
spark-class org.apache.spark.deploy.master.Master
spark-submit --master spark://10.0.1.20:7077 --driver-memory 96G --executor-cores 24 ./my_pyspark.py input.txt output.txt
But none of these solutions work. While looking at the process with htop, I see that ONLY one core is used. What did I miss?
Thanks
Your spark-submit command is wrong.
You shouldn't allocate 96G for the driver, and you should specify the number of executors and the number of cores for each executor.
For example, you can try:
spark-submit --driver-memory 8G --num-executors 15 --executor-memory 7G --executor-cores 2 ./my_pyspark.py input.txt output.txt
And you should probably use YARN as the resource manager: --master yarn
Also, a master("local") defined in the SparkContext overrides your spark-submit command; you should remove it from your code.
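A minimal sketch of what that means inside the script (assuming my_pyspark.py currently builds its own SparkConf; the app name is illustrative):
# Build the SparkConf without a hard-coded master, so the --master
# value passed to spark-submit is the one that takes effect.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("my_pyspark")  # no .setMaster("local") here
sc = SparkContext(conf=conf)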

Spark: Entire dataset concentrated in one executor

I am running a Spark job with 3 files, each 100 MB in size; for some reason my Spark UI shows the entire dataset concentrated in 2 executors. This is making the job run for 19 hours, and it is still running.
Below is my Spark configuration. Spark 2.3 is the version used.
spark2-submit --class org.mySparkDriver \
--master yarn-cluster \
--deploy-mode cluster \
--driver-memory 8g \
--num-executors 100 \
--conf spark.default.parallelism=40 \
--conf spark.yarn.executor.memoryOverhead=6000mb \
--conf spark.dynamicAllocation.executorIdleTimeout=6000s \
--conf spark.executor.cores=3 \
--conf spark.executor.memory=8G
I tried repartitioning inside the code, which works, as this makes the file go into 20 partitions (I used rdd.repartition(20)). But why should I repartition? I believe specifying spark.default.parallelism=40 in the script should let Spark divide the input file into 40 partitions and process it on 40 executors.
Can anyone help?
Thanks,
Neethu
I am assuming you're running your jobs on YARN. If yes, you can check the following properties.
yarn.scheduler.maximum-allocation-mb
yarn.nodemanager.resource.memory-mb
yarn.scheduler.maximum-allocation-vcores
yarn.nodemanager.resource.cpu-vcores
In YARN, these properties affect the number of containers that can be instantiated on a NodeManager, based on the spark.executor.cores and spark.executor.memory property values (along with the executor memory overhead).
For example, for a cluster with 10 nodes (RAM: 16 GB, cores: 6) set with the following YARN properties:
yarn.scheduler.maximum-allocation-mb=10GB
yarn.nodemanager.resource.memory-mb=10GB
yarn.scheduler.maximum-allocation-vcores=4
yarn.nodemanager.resource.cpu-vcores=4
Then with the Spark properties spark.executor.cores=2 and spark.executor.memory=4GB, you can expect 2 executors per node, so in total you'll get 19 executors + 1 container for the driver.
If the Spark properties are spark.executor.cores=3 and spark.executor.memory=8GB, then you will get 9 executors (only 1 executor per node) + 1 container for the driver.
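Roughly where those numbers come from (assuming the default executor memory overhead of max(384 MB, 10% of executor memory)): a 4 GB executor needs about 4.4 GB, so a 10 GB NodeManager fits 2 such containers by memory and, with 4 vcores at 2 cores per executor, 2 by cores as well, i.e. 2 executors per node; across 10 nodes that is 20 containers, one of which is taken by the driver, leaving 19 executors. An 8 GB executor needs about 8.8 GB, so only 1 fits per node, giving 9 executors plus the driver container.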
You can refer to the link for more details.
Hope this helps.

Spark thrift server uses only 2 cores

Google Dataproc one-node cluster, VCores Total = 8. I've tried, as the spark user:
/usr/lib/spark/sbin/start-thriftserver.sh --num-executors 2 --executor-cores 4
I tried to change /usr/lib/spark/conf/spark-defaults.conf.
I tried to execute
export SPARK_WORKER_INSTANCES=6
export SPARK_WORKER_CORES=8
before start-thriftserver.sh
No success. In the YARN UI I can see that the thrift app uses only 2 cores, with 6 cores still available.
UPDATE 1:
Environment tab in the Spark UI:
spark.submit.deployMode client
spark.master yarn
spark.dynamicAllocation.minExecutors 6
spark.dynamicAllocation.maxExecutors 10000
spark.executor.cores 4
spark.executor.instances 1
It depends on which YARN mode the app is in.
It can be yarn-client: 1 core goes to the Application Master (the app will be running on the machine where you ran the start-thriftserver.sh command).
In the case of yarn-cluster, the driver will be inside the AM container, so you can tweak its cores with spark.driver.cores. The other cores will be used by executors (1 executor = 1 core by default).
Beware that --num-executors 2 --executor-cores 4 wouldn't work, as you have 8 cores max and 1 more will be needed for the AM container (a total of 9).
You can check core usage from the Spark UI: http://sparkhistoryserverip:18080/history/application_1534847473069_0001/executors/
The options below are only for Spark standalone mode:
export SPARK_WORKER_INSTANCES=6
export SPARK_WORKER_CORES=8
Please review all configs here - Spark Configuration (latest)
In your case you can edit spark-defaults.conf and add:
spark.executor.cores 3
spark.executor.instances 2
Or use local[8] mode, as you have only one node anyway.
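Since start-thriftserver.sh accepts the same command-line options as spark-submit, a sketch of passing the same properties without editing spark-defaults.conf:
/usr/lib/spark/sbin/start-thriftserver.sh --conf spark.executor.cores=3 --conf spark.executor.instances=2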
If you want YARN to show you the proper number of cores allocated to executors, change the value in capacity-scheduler.xml for:
yarn.scheduler.capacity.resource-calculator
from:
org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator
to:
org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
Otherwise, no matter how many cores you ask for your executors, YARN will show you only one core per container.
This config actually changes the resource allocation behavior. More details: https://hortonworks.com/blog/managing-cpu-resources-in-your-hadoop-yarn-clusters/
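For reference, the change in capacity-scheduler.xml would look roughly like this (standard Hadoop property syntax):
<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>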

Run Spark application with custom number of cores and memory size

I'm totally new at this, so I don't really understand how this works.
I need to run Spark on my machine (logging in with ssh) and set it up with 60g of memory and 6 cores for execution.
This is what I've tried.
spark-submit --master yarn --deploy-mode cluster --executor-memory 60g --executor-cores 6
And this is what I got:
SPARK_MAJOR_VERSION is set to 2, using Spark2
Exception in thread "main" java.lang.IllegalArgumentException: Missing application resource.
at org.apache.spark.launcher.CommandBuilderUtils.checkArgument(CommandBuilderUtils.java:253)
at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildSparkSubmitArgs(SparkSubmitCommandBuilder.java:160)
at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildSparkSubmitCommand(SparkSubmitCommandBuilder.java:276)
at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildCommand(SparkSubmitCommandBuilder.java:151)
at org.apache.spark.launcher.Main.main(Main.java:87)
So I guess there is something to add to this command line to make it run, and I have no idea what.
Here:
spark-submit --master yarn --deploy-mode cluster --executor-memory 60g --executor-cores 6
you don't specify the entry point or your application!
Check the spark-submit documentation, which states:
Some of the commonly used options are:
--class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
--master: The master URL for the cluster (e.g. spark://23.195.26.187:7077)
--deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client)
--conf: Arbitrary Spark configuration property in key=value format. For values that contain spaces wrap “key=value” in quotes (as shown).
application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes.
application-arguments: Arguments passed to the main method of your main class, if any.
For Python applications, simply pass a .py file in the place of <application-jar> instead of a JAR, and add Python .zip, .egg or .py files to the search path with --py-files.
Here is an example that takes some JARs and a python file (I didn't include your additional parameters for simplicity):
./spark-submit --jars myjar1.jar,myjar2.jar path/to/my/main.py arg1 arg2
I hope I can enter the spark shell (with that much memory and cores) and type code in there
Then you need pyspark, not spark-submit! What is the difference between spark-submit and pyspark?
So what you really want to do is this:
pyspark --master yarn --executor-memory 60g --executor-cores 6
If I understand your question correctly, your total number of cores is 6 and your total memory is 60 GB. The parameters
--executor-memory
--executor-cores
are actually per executor inside Spark. You should probably try
--executor-memory 8G
--executor-cores 1
This will create about 6 executors of 8 GB each (6 * 8 = 48 GB in total). The remaining 12 GB is left for operating system processes and metadata.
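Putting that sizing into a full command might look like this (a sketch; your_app.py is a hypothetical placeholder for the application resource that was missing from the original command):
spark-submit --master yarn --deploy-mode client --num-executors 6 --executor-memory 8G --executor-cores 1 your_app.py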

How to submit a spark job in a 4 node CDH cluster

I have a cluster with the following configuration.
Distribution: CDH5,
Number of nodes: 4,
RAM: 126 GB,
Number of cores: 24 per node,
Hard disk: 5 TB
My input file size is 10 GB. It takes a lot of time (around 20 minutes) when I submit with the following command.
spark-submit --jars xxxx --files xxx,yyy --master yarn /home/me/python/ParseMain.py
In my python code I am setting the following:
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext
sparkConf = SparkConf().setAppName("myapp")
sc = SparkContext(conf=sparkConf)
hContext = HiveContext(sc)
How can I change the spark submit arguments so that I can achieve better performance?
Some spark-submit options that you could try:
--driver-cores 4
--num-executors 4
--executor-cores 20
--executor-memory 5G
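Combined with the original command from the question, that could look like this (a sketch; the xxxx and xxx,yyy placeholders are kept as-is):
spark-submit --jars xxxx --files xxx,yyy --master yarn --driver-cores 4 --num-executors 4 --executor-cores 20 --executor-memory 5G /home/me/python/ParseMain.py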
CDH has to be configured to have enough vCores and memory. Otherwise the submitted job will remain in the ACCEPTED state and won't RUN.
