Pyspark job queue config precedence - spark-submit vs SparkSession.builder - apache-spark

I have a shell script which runs a spark-submit command. I want to specify the resource queue name onto which the job runs.
When I use:
spark-submit --queue myQueue job.py (here the job is properly submitted on 'myQueue')
But when I use: spark-submit job.py and inside job.py I create a spark session like:
spark=SparkSession.builder.appName(appName).config("spark.yarn.queue", "myQueue") - In this case the job runs on default queue. Also on checking the configs of this running job on the spark UI, it shows me that queue name is "myQueue" but still the job runs on default queue only.
Can someone explain how can I pass the queue name in sparkSession.builder configs so that it takes into effect.
Using pyspark version 2.3

Related

Why spark executor are not dying

Here is my setup:
Kubernetes cluster running airflow, which submits the spark job to Kubernetes cluster, job runs fine but the container are suppose to die once the job is done but they are still hanging there.
Airflow Setup comes up on K8S cluster.
Dag is baked in the airflow docker image because somehow I am not able to sync the dags from s3. For some reason the cron wont run.
Submits the spark job to K8S Cluster and job runs fine.
But now instead of dying post execution and completion of job it still hangs around.
Here is my SparkSubmitOperator function
spark_submit_task = SparkSubmitOperator(
task_id='spark_submit_job_from_airflow',
conn_id='k8s_spark',
java_class='com.dom.rom.mainclass',
application='s3a://some-bucket/jars/demo-jar-with-dependencies.jar',
application_args=['300000'],
total_executor_cores='8',
executor_memory='20g',
num_executors='9',
name='mainclass',
verbose=True,
driver_memory='10g',
conf={
'spark.hadoop.fs.s3a.aws.credentials.provider': 'com.amazonaws.auth.InstanceProfileCredentialsProvider',
'spark.rpc.message.maxSize': '1024',
'spark.hadoop.fs.s3a.impl': 'org.apache.hadoop.fs.s3a.S3AFileSystem',
'spark.kubernetes.container.image': 'dockerhub/spark-image:v0.1',
'spark.kubernetes.namespace' : 'random',
'spark.kubernetes.container.image.pullPolicy': 'IfNotPresent',
'spark.kubernetes.authenticate.driver.serviceAccountName': 'airflow-spark'
},
dag=dag,
)
Figured the problem it was my mistake I wasn't closing the spark session, added the following
session.stop();

How to print out Spark connection of Spark session ?

Suppose I run pyspark command and got global variable spark of type SparkSession. As I understand, this spark holds a connection to the Spark master. Can I print out the details of this connection including the hostname of this Spark master ?
For basic information you can use master property:
spark.sparkContext.master
To get details on YARN you might have to dig through hadoopConfiguration:
hadoopConfiguration = spark.sparkContext._jsc.hadoopConfiguration()
hadoopConfiguration.get("yarn.resourcemanager.hostname")
or
hadoopConfiguration.get("yarn.resourcemanager.address")
When submitted to YARN Spark uses Hadoop configuration to determine the resource manger so these values should match ones present in configuration placed in HADOOP_CONF_DIR or YARN_CONF_DIR.

Kerberos ticket renewal on Spark streaming job that communicates to Kafka

I have a long running Spark streaming job that runs on a kerberized Hadoop cluster. It fails every few days with the following error:
Diagnostics: token (token for XXXXXXX: HDFS_DELEGATION_TOKEN owner=XXXXXXXXX#XX.COM, renewer=yarn, realUser=, issueDate=XXXXXXXXXXXXXXX, maxDate=XXXXXXXXXX, sequenceNumber=XXXXXXXX, masterKeyId=XXX) can't be found in cache
I tried adding in --keytab and --principal options to spark-submit. But we already have the following options that do the same thing:
For the second option, we already pass in the keytab and principal with the following:
'spark.driver.extraJavaOptions=-Djava.security.auth.login.config=kafka_client_jaas.conf -Djava.security.krb5.conf=krb5.conf -XX:+UseCompressedOops -XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=12' \
Same for spark.executor.extraJavaOptions. If we add the options --principal and --keytab it results in attempt to add file (keytab) multiple times to distributed cache
There are 2 ways that you can do it.
Have a shell script that does the keytab/ticket generation on a regular interval.
[RECOMMENDED] Pass your keytab to Spark with strict access only to spark user and it can automatically regenerate the tickets for you. Visit this Cloudera community page for more details. It's just a simple bunch of steps and you can get going!
Hope that helps!

Spark 2.0 Standalone mode Dynamic Resource Allocation Worker Launch Error

I'm running Spark 2.0 on Standalone mode, successfully configured it to launch on a server and also was able to configure Ipython Kernel PySpark as option into Jupyter Notebook. Everything works fine but I'm facing the problem that for each Notebook that I launch, all of my 4 workers are assigned to that application. So if another person from my team try to launch another Notebook with PySpark kernel, it simply does not work until I stop the first notebook and release all the workers.
To solve this problem I'm trying to follow the instructions from Spark 2.0 Documentation.
So, on my $SPARK_HOME/conf/spark-defaults.conf I have the following lines:
spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true
spark.dynamicAllocation.executorIdleTimeout 10
Also, on $SPARK_HOME/conf/spark-env.sh I have:
export SPARK_WORKER_MEMORY=1g
export SPARK_EXECUTOR_MEMORY=512m
export SPARK_WORKER_INSTANCES=4
export SPARK_WORKER_CORES=1
But when I try to launch the workers, using $SPARK_HOME/sbin/start-slaves.sh, only the first worker is successfully launched. The log from the first worker end up like this:
16/11/24 13:32:06 INFO Worker: Successfully registered with master
spark://cerberus:7077
But the log from workers 2-4 show me this error:
INFO ExternalShuffleService: Starting shuffle service on port 7337
with useSasl = false 16/11/24 13:32:08 ERROR Inbox: Ignoring error
java.net.BindException: Address already in use
It seems (to me) that the first worker successfully starts the shuffle-service at port 7337, but the workers 2-4 "does not know" about this and try to launch another shuffle-service on the same port.
The problem occurs also for all workers (1-4) if I first launch a shuffle-service (using $SPARK_HOME/sbin/start-shuffle-service.sh) and then try to launch all the workers ($SPARK_HOME/sbin/start-slaves.sh).
Is any option to get around this? To be able to all workers verfy if there is a shuffle service running and connect to it instead of try to create a new service?
I had the same issue and seemed to get it working by removing the spark.shuffle.service.enabled item from the config file (in fact I don't have any dynamicAllocation-related items in there) and instead put this in the SparkConf when I request a SparkContext:
sconf = pyspark.SparkConf() \
.setAppName("sc1") \
.set("spark.dynamicAllocation.enabled", "true") \
.set("spark.shuffle.service.enabled", "true")
sc1 = pyspark.SparkContext(conf=sconf)
I start the master & slaves as normal:
$SPARK_HOME/sbin/start-all.sh
And I have to start one instance of the shuffler-service:
$SPARK_HOME/sbin/start-shuffle-service.sh
Then I started two notebooks with this context and got them both to do a small job. The first notebook's application does the job and is in the RUNNING state, the second notebook's application is in the WAITING state. After a minute (default idle timeout), the resources get reallocated and the second context gets to do its job (and both are in RUNNING state).
Hope this helps,
John

Run spark-submit in Scala code

Is it possible to execute below spark-submit script within code and then get application ID that'll assign by YARN?
bin/spark-submit
--class com.my.application.XApp
--master yarn-cluster --executor-memory 100m
--num-executors 50 hdfs://name.node.server:8020/user/root/x-service-1.0.0-201512141101-assembly.jar
1000
This is to enable user to start and stop the job via REST API.
I found,
https://spark.apache.org/docs/latest/api/java/org/apache/spark/launcher/SparkLauncher.html
import org.apache.spark.launcher.SparkLauncher;
public class MyLauncher {
public static void main(String[] args) throws Exception {
Process spark = new SparkLauncher()
.setAppResource("/my/app.jar")
.setMainClass("my.spark.app.Main")
.setMaster("local")
.setConf(SparkLauncher.DRIVER_MEMORY, "2g")
.launch();
spark.waitFor();
}
}
But I couldn't find a method to get application ID , also seems like app.jar has to be pre built before executing above code ?
Yes your application jar does need to be prebuilt in those cases. It seems like something like the Spark Job Server or IBM Spark Kernel may be closer to what you want (although they reuse a Spark Context).
SparkLauncher will only submit your built application. To get the application ID, you need to access the SparkContext within your application jar.
In your example, you could access the application ID in "/my/app.jar" (perhaps in "my.spark.app.Main") with:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
...
val sc = new SparkContext(new SparkConf())
sc.applicationId
This application ID will be the YARN application ID when the application is built and submitted in yarn-cluster mode.
See the Spark Scala API docs.
Support for accessing launched applications seems to be coming in Spark 1.6 (SPARK-8673). A Scala example derived from this test suite is below.
val handle = new SparkLauncher()
... // application configuration
.setMaster("yarn-client")
.startApplication()
try {
handle.getAppId() should startWith ("application_")
handle.stop()
} finally {
handle.kill()
}
Handlers may be added to launched applications, but a listener API is exposed and is the recommended way for monitoring launched applications. See this pull request for details.
Scala has SparkContext.applicationId, which is a unique identifier for the Spark application. Its format depends on the scheduler implementation. (i.e. in case of local spark app something like 'local-1433865536131' in case of YARN something like 'application_1433865536131_34483' )
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext

Resources