Submitting Spark Job On Scheduler Pool - apache-spark

I am running a Spark Streaming job in cluster mode, and I have created a pool with 200 GB of memory (CDH). I want to run my Spark Streaming job in that pool. I tried setting
sc.setLocalProperty("spark.scheduler.pool", "pool")
in code, but it is not working. I also tried the approach in
"spark.scheduler.pool seems not working in spark streaming", but whenever I run the job it goes into the default pool. What could be the issue? Is there any configuration I can add while submitting the job?

On YARN, we can add
--conf spark.yarn.queue="que_name"
to the spark-submit command. The job will then use that particular queue and only its resources.

I ran into this same issue with Spark 2.4. In my case, the problem was resolved by removing the default "spark.scheduler.pool" option in my Spark config.
I traced the issue to a bug in Spark - https://issues.apache.org/jira/browse/SPARK-26988. The problem is that if you set the config property "spark.scheduler.pool" in the base configuration, you can't then override it using setLocalProperty. Removing it from the base configuration made it work correctly. See the bug description for more detail.
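As a rough illustration of the workaround above, the pool can be set on the thread that triggers the jobs and cleared afterwards. This is a minimal sketch, assuming a PySpark SparkContext is passed in as sc and that "spark.scheduler.pool" is not also set in the base configuration (the situation described in SPARK-26988). The helper name run_in_pool is hypothetical, and scheduler pools are only expected to matter when spark.scheduler.mode is FAIR.

```python
def run_in_pool(sc, pool_name, action):
    """Run `action` with this thread's jobs assigned to `pool_name`.

    `sc` is assumed to be a pyspark SparkContext (any object exposing
    setLocalProperty works); `run_in_pool` is a hypothetical helper.
    """
    # Local properties are per thread, so this must be called from the
    # thread that actually triggers the Spark jobs.
    sc.setLocalProperty("spark.scheduler.pool", pool_name)
    try:
        return action()
    finally:
        # Assumption: passing None clears the property, mirroring the
        # null idiom used with the Scala API.
        sc.setLocalProperty("spark.scheduler.pool", None)
```

A call might look like run_in_pool(sc, "pool", lambda: rdd.count()), where rdd is whatever dataset the job works on.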

Related

How to pass Spark job properties to DataProcSparkOperator in Airflow?

I am trying to execute a Spark jar on Dataproc using Airflow's DataProcSparkOperator. The jar is located on GCS; I create the Dataproc cluster on the fly and then execute the jar on the newly created cluster.
I am able to execute this with Airflow's DataProcSparkOperator using the default settings, but I am not able to configure Spark job properties (e.g. --master, --deploy-mode, --driver-memory, etc.).
The Airflow documentation did not help, and the many things I tried did not work.
Help is appreciated.
To configure a Spark job through DataProcSparkOperator, you need to use the dataproc_spark_properties parameter.
For example, you can set the deploy mode like this:
DataProcSparkOperator(
dataproc_spark_properties={ 'spark.submit.deployMode': 'cluster' })
In this answer you can find more details.
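The command-line flags mentioned in the question have Spark property equivalents, which is what dataproc_spark_properties expects. The sketch below builds such a dictionary with plain Python; build_spark_properties is a hypothetical helper, and only the property names themselves come from Spark.

```python
def build_spark_properties(deploy_mode="cluster", driver_memory=None,
                           executor_memory=None):
    """Translate common spark-submit flags into Spark property names."""
    props = {"spark.submit.deployMode": deploy_mode}       # --deploy-mode
    if driver_memory is not None:
        props["spark.driver.memory"] = driver_memory       # --driver-memory
    if executor_memory is not None:
        props["spark.executor.memory"] = executor_memory   # --executor-memory
    return props


properties = build_spark_properties(driver_memory="4g")
print(properties)
# The dict would then be handed to the operator, for example:
# DataProcSparkOperator(..., dataproc_spark_properties=properties)
```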

Set Cloudera application tags for Spark application

I have set spark.yarn.tags in my Spark application, and it is visible in my config when printed.
But Cloudera Manager is unable to detect it in the application_tags field of the YARN application.
Does application_tags map to spark.yarn.tags for Spark applications?
I think I found the solution.
When spark.yarn.tags is set in the spark-submit call, Cloudera Manager detects it. So I believe the tag is needed before the Spark context is created, which means it has to be passed as a conf option at submit time.
This is how it can be passed to spark-submit:
--conf spark.yarn.tags=tag-name
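When the submit command is assembled by a script, the same option can be generated from a dictionary so that every property ends up as its own --conf pair. This is only a sketch: spark_submit_command is a hypothetical helper and app.py is a placeholder application.

```python
import shlex


def spark_submit_command(app, conf, master=None):
    """Build a spark-submit argument list with one --conf per property."""
    args = ["spark-submit"]
    if master is not None:
        args += ["--master", master]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app)  # the application file goes after all options
    return args


cmd = spark_submit_command("app.py", {"spark.yarn.tags": "tag-name"}, master="yarn")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run as is, which avoids shell quoting problems.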

How to make sure Spark master node is using the worker nodes? (Google cluster)

I just created a Google Cloud cluster (1 master and 6 workers), and Spark is configured by default.
I have pure Python code that uses NLTK to build the dependency tree for each line of a text file. When I run this code on the master with spark-submit run.py, I get the same execution time as when I run it on my own machine.
How can I make sure that the master is using the workers in order to reduce the execution time?
You can check the Spark UI. If it's running on top of YARN, open the YARN UI and click on your application ID, which will open the Spark UI. The Executors tab also lists the IP address of each node.
Could you please share your spark-submit configuration?
Your command 'spark-submit run.py' doesn't seem to send your job to YARN. To do so, you need to add the --master parameter. For example, a valid command to execute a job on YARN is:
./bin/spark-submit --master yarn python/pi.py 1000
If you execute your job from the master, this is straightforward. In any case, check this link for the other parameters that spark-submit accepts.
For a Dataproc cluster (Google's managed Hadoop cluster), you have two options for checking the job history, including jobs that are still running:
By command line from the master: yarn application -list. This option sometimes needs additional configuration. If you have trouble, this link will be useful.
By UI: Dataproc lets you access the Spark Web UI, which makes monitoring easier. Check this link to learn how to access the Spark UI and other Dataproc UIs. In summary, you have to create a tunnel and configure your browser to use a SOCKS proxy.
Hope the information above helps you.
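Besides the UIs, the job itself can report which machines ran its tasks. The sketch below assumes a PySpark SparkContext is passed in as sc; hosts_running_tasks is a hypothetical helper. Note that pure Python code which never creates an RDD or DataFrame runs only on the driver, whatever --master is set to, so the per-line work has to go through something like sc.parallelize or sc.textFile before the workers can share it.

```python
import socket


def hosts_running_tasks(sc, num_tasks=100):
    """Return the sorted hostnames that executed at least one task."""
    def host_of(_):
        # Runs on whichever executor picks up the task.
        return socket.gethostname()

    # One partition per element so the tasks can spread across executors.
    rdd = sc.parallelize(range(num_tasks), num_tasks)
    return sorted(rdd.map(host_of).distinct().collect())
```

If the returned list contains only the master's hostname, the job is probably not reaching the workers.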

Configuring master node in spark cluster

Apologies in advance as I am new to spark. I have created a spark cluster in standalone mode with 4 workers, and after successfully being able to configure worker properties, I wanted to know how to configure the master properties.
I am writing an application and connecting it to the cluster using SparkSession.builder (I do not want to submit it using spark-submit.)
I know that the workers can be configured in the conf/spark-env.sh file, which has parameters such as 'SPARK_WORKER_MEMORY' and 'SPARK_WORKER_CORES'.
My question is: How do I configure the properties for the master? Because there is no 'SPARK_MASTER_CORES' or 'SPARK_MASTER_MEMORY' in this file.
I thought about setting this in the spark-defaults.conf file, however it seems that this is only used for spark-submit.
I thought about setting it in the application using SparkConf().set("spark.driver.cores", "XX") however this only specifies the number of cores for this application to use.
Any help would be greatly appreciated.
Thanks.
There are three ways to set the configuration of the Spark master node (driver) and the Spark worker nodes. I will show examples that set the memory of the master node. Other settings can be found here.
1- Programmatically, through the SparkConf class.
Example:
new SparkConf().set("spark.driver.memory","8g")
2- Using spark-submit: make sure not to set the same configuration both in your code (programmatically, as in 1) and in spark-submit. If a setting is already configured programmatically, any overlapping job configuration passed to spark-submit is ignored.
Example:
spark-submit --driver-memory 8g
3- Through spark-defaults.conf:
If none of the above is set, these settings are used as the defaults.
Example:
spark.driver.memory 8g
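The precedence described above (values set in code win over spark-submit flags, which win over spark-defaults.conf) can be modelled as a simple dictionary merge. This is only an illustrative model, not Spark's implementation; effective_conf and the sample values are hypothetical.

```python
def effective_conf(spark_defaults, submit_flags, programmatic):
    """Merge the three sources; later updates override earlier ones."""
    merged = dict(spark_defaults)   # 3: spark-defaults.conf (lowest priority)
    merged.update(submit_flags)     # 2: flags passed to spark-submit
    merged.update(programmatic)     # 1: SparkConf set in code (highest)
    return merged


conf = effective_conf(
    {"spark.driver.memory": "2g", "spark.executor.memory": "4g"},
    {"spark.driver.memory": "8g"},
    {},
)
print(conf["spark.driver.memory"])  # the spark-submit value replaces the default
```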

Spark-submit executors are not getting the properties

I am trying to deploy a Spark application to a 4-node DSE Spark cluster. I have created a fat jar with all dependent jars, and I have created a property file under src/main/resources that holds properties such as batch interval, master URL, etc.
I have copied this fat jar to the master and I am submitting the application with "spark-submit". Below is my submit command.
dse spark-submit --class com.Processor.utils.jobLauncher --supervise application-1.0.0-develop-SNAPSHOT.jar qa
Everything works properly when I run on a single-node cluster, but when I run on the DSE Spark standalone cluster, the properties mentioned above, such as batch interval, become unavailable to the executors. I googled and found that this is a common issue that many people have solved. I followed one of the solutions, created a fat jar, and tried to run it, but my properties are still unavailable to the executors.
Can someone please give any pointers on how to solve the issue?
I am using DSE 4.8.5 and Spark 1.4.2,
and this is how I am loading the properties:
System.setProperty("env",args(0))
val conf = com.typesafe.config.ConfigFactory.load(System.getProperty("env") + "_application")
Figured out the solution:
I was taking the property file name from a system property (which I set in the main method from the command-line parameter). When the code is shipped to and executed on a worker node, that system property is not available (obviously!), so instead of using Typesafe ConfigFactory to load the property file, I am now using simple Scala file reading.
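The idea behind the fix is to resolve the file name on the driver from the command-line argument and read the values into a plain map, so that nothing depends on a system property the workers never receive. The sketch below shows that pattern in Python rather than the application's Scala; the helper names, the key=value file format, the .properties extension, and the "qa" fallback are all assumptions made for illustration.

```python
def load_properties(path):
    """Parse a simple key=value file into a dict, skipping blanks and # comments."""
    props = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            props[key.strip()] = value.strip()
    return props


def properties_path(argv):
    """Derive the file name from the command-line argument on the driver."""
    # Hypothetical fallback to "qa" when no environment argument is given.
    env = argv[1] if len(argv) > 1 else "qa"
    return f"{env}_application.properties"
```

The dict returned by load_properties(properties_path(sys.argv)) is ordinary data, so it can be captured by the functions sent to executors or broadcast explicitly.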
