How to get the driver ID from a Spark submission - apache-spark

Spark cluster information:
Spark version: 2.2.0
The cluster contains a master node and 2 worker nodes
Cluster Manager Type: standalone
I submit a jar to the Spark cluster from one of the workers, and I want to receive the driver ID from the submission so that I can use it later to check the application status. The problem is that I am not getting any output in the console. I use port 6066 for submission and set the deploy mode to cluster.
By running
spark-submit --deploy-mode cluster --supervise --class "path/to/class" --master "spark://spark-master-headless:6066" path/to/app.jar
in the Spark log file I am able to see the JSON response of the submission below, which is exactly what I want:
[INFO] 2018-07-18 12:48:40,030 org.apache.spark.deploy.rest.RestSubmissionClient logInfo - Submitting a request to launch an application in spark://spark-master-headless:6066.
[INFO] 2018-07-18 12:48:41,074 org.apache.spark.deploy.rest.RestSubmissionClient logInfo - Submission successfully created as driver-20180718124840-0023. Polling submission state...
[INFO] 2018-07-18 12:48:41,077 org.apache.spark.deploy.rest.RestSubmissionClient logInfo - Submitting a request for the status of submission driver-20180718124840-0023 in spark://spark-master-headless:6066.
[INFO] 2018-07-18 12:48:41,092 org.apache.spark.deploy.rest.RestSubmissionClient logInfo - State of driver driver-20180718124840-0023 is now RUNNING.
[INFO] 2018-07-18 12:48:41,093 org.apache.spark.deploy.rest.RestSubmissionClient logInfo - Driver is running on worker worker-20180707104934-<some-ip-was-here>-7078 at <some-ip-was-here>:7078.
[INFO] 2018-07-18 12:48:41,114 org.apache.spark.deploy.rest.RestSubmissionClient logInfo - Server responded with CreateSubmissionResponse:
{
"action" : "CreateSubmissionResponse",
"message" : "Driver successfully submitted as driver-20180718124840-0023",
"serverSparkVersion" : "2.2.0",
"submissionId" : "driver-20180718124840-0023",
"success" : true
}
[INFO] 2018-07-18 12:48:46,572 org.apache.spark.executor.CoarseGrainedExecutorBackend initDaemon - Started daemon with process name: 31605@spark-worker-662224983-4qpfw
[INFO] 2018-07-18 12:48:46,580 org.apache.spark.util.SignalUtils logInfo - Registered signal handler for TERM
[INFO] 2018-07-18 12:48:46,583 org.apache.spark.util.SignalUtils logInfo - Registered signal handler for HUP
[INFO] 2018-07-18 12:48:46,583 org.apache.spark.util.SignalUtils logInfo - Registered signal handler for INT
[WARN] 2018-07-18 12:48:47,293 org.apache.hadoop.util.NativeCodeLoader <clinit> - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[INFO] 2018-07-18 12:48:47,607 org.apache.spark.SecurityManager logInfo - Changing view acls to: root
[INFO] 2018-07-18 12:48:47,608 org.apache.spark.SecurityManager logInfo - Changing modify acls to: root
...
But I want to have this information in the console so that I can redirect it to a file separate from the Spark logs. I assume that some messages should get printed when the above command is run. I even used --verbose mode in the command in the hope that it would help, but the console output is still empty.
The only thing that gets printed to the console is
Running Spark using the REST application submission protocol. Meanwhile, in the question section of this page, the user is able to see more than this.
I even tried changing the logger level in my application code, but that didn't help either. (based on some ideas from here)
So the question is: why am I not getting any output in the console, and what can I do to get the information I want printed there?
P.S. I have developed and tweaked the cluster and the jar file quite a bit, so I may have something somewhere causing the output not to be printed. What are the possible places I can check to fix this?
Update:
I found out that the default log4j.properties of Spark has been edited. Here is its content:
# Set everything to be logged to the console
log4j.rootCategory=INFO, RollingAppender
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
log4j.appender.RollingAppender=org.apache.log4j.DailyRollingFileAppender
log4j.appender.RollingAppender.File=/var/log/spark.log
log4j.appender.RollingAppender.DatePattern='.'yyyy-MM-dd
log4j.appender.RollingAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.RollingAppender.layout.ConversionPattern=[%p] %d %c %M - %m%n
# Set the default spark-shell log level to WARN. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=INFO
# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark_project.jetty=INFO
log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR
# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
I assume this is what is preventing --verbose from producing any output. How can I change this config to get some console output with --verbose?
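My best guess at a fix, assuming the console appender definitions above are kept as they are: attach the console appender back to the root category so messages go both to the console and to the rolling file:
log4j.rootCategory=INFO, console, RollingAppender
Is that enough, or is there more to it?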

As you are running the job in cluster mode, the driver can be on any node in the cluster, so whatever you print or redirect to the console may not come back to the client/edge/worker node where the console is open.
Try submitting the application in client mode.
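Alternatively, since the submission already goes through the master's REST port (6066), you can capture the submissionId from the CreateSubmissionResponse and poll it yourself over the same REST API. A minimal sketch, reusing the host and driver ID from the question (the endpoint path follows Spark's standalone REST submission protocol):
# Poll the state of a submission by its submissionId:
curl http://spark-master-headless:6066/v1/submissions/status/driver-20180718124840-0023
The response is a JSON SubmissionStatusResponse whose driverState field (e.g. RUNNING, FINISHED) can be redirected to a separate file, independently of the Spark logs.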

Related

Dataproc: limit log size for long-running / streaming Spark jobs

I have a Spark Structured Streaming job on GCP Dataproc which picks up data from Kafka, processes it, and pushes it back into Kafka topics.
A couple of questions:
Does Spark put all the logs (incl. INFO, WARN, etc.) into stderr?
What I notice is that stdout is empty, while all the logging goes into stderr.
Is there a way for me to expire the data in stderr (i.e. expire the older logs)?
Since I have a long-running streaming job, stderr fills up over time and the nodes/VMs become unavailable.
Please advise.
Here is the output of the yarn logs command:
root@versa-structured-stream-v1-w-1:/home/karanalang# yarn logs -applicationId application_1663623368960_0008 -log_files stderr -size -500
2022-09-19 23:25:34,876 INFO client.RMProxy: Connecting to ResourceManager at versa-structured-stream-v1-m/10.142.0.62:8032
2022-09-19 23:25:35,144 INFO client.AHSProxy: Connecting to Application History server at versa-structured-stream-v1-m/10.142.0.62:10200
Can not find any log file matching the pattern: [stderr] for the container: container_e01_1663623368960_0008_01_000003 within the application: application_1663623368960_0008
Container: container_e01_1663623368960_0008_01_000002 on versa-structured-stream-v1-w-2.c.versa-sml-googl.internal:8026
LogAggregationType: LOCAL
=======================================================================================================================
LogType:stderr
LogLastModifiedTime:Mon Sep 19 23:25:35 +0000 2022
LogLength:43251469683
LogContents:
applianceName=usa-isn0784-rt01, tenantName=NOV, mstatsTimeBlock=1663507200, tenantId=2, vsnId=0, mstatsTotSentOctets=11596, mstatsTotRecvdOctets=24481, mstatsTotSessDuration=300000, mstatsTotSessCount=1, mstatsType=sdwan-acc-ckt-app-stats, appId=https, site=usa-isn0784-rt01, accCkt=WAN-DIA, siteId=442, accCktId=1, user=10.126.117.196, risk=3, productivity=3, family=general-internet, subFamily=web, bzTag=Unknown,topic=syslog.ueba-us4.v1.versa.demo3,customer=versa type(row) is -> <class 'str'>
End of LogType:stderr.This log file belongs to a running container (container_e01_1663623368960_0008_01_000002) and so may not be complete.
***********************************************************************
Container: container_e01_1663623368960_0008_01_000001 on versa-structured-stream-v1-w-1.c.versa-sml-googl.internal:8026
LogAggregationType: LOCAL
=======================================================================================================================
LogType:stderr
LogLastModifiedTime:Mon Sep 19 22:54:55 +0000 2022
LogLength:17367929
LogContents:
on syslog.ueba-us4.v1.versa.demo3-2
22/09/19 22:52:52 INFO org.apache.kafka.clients.consumer.internals.SubscriptionState: [Consumer clientId=consumer-spark-kafka-source-0f984ad9-f663-4ce1-9ef1-349419f3e6ec-1714963016-executor-1, groupId=spark-kafka-source-0f984ad9-f663-4ce1-9ef1-349419f3e6ec-1714963016-executor] Resetting offset for partition syslog.ueba-us4.v1.versa.demo3-2 to offset 449568676.
22/09/19 22:54:55 ERROR org.apache.spark.executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
End of LogType:stderr.
***********************************************************************
root@versa-structured-stream-v1-w-1:/home/karanalang# yarn logs -applicationId application_1663623368960_0008 -log_files stderr -size -500
2022-09-19 23:26:01,439 INFO client.RMProxy: Connecting to ResourceManager at versa-structured-stream-v1-m/10.142.0.62:8032
2022-09-19 23:26:01,696 INFO client.AHSProxy: Connecting to Application History server at versa-structured-stream-v1-m/10.142.0.62:10200
Can not find any log file matching the pattern: [stderr] for the container: container_e01_1663623368960_0008_01_000003 within the application: application_1663623368960_0008
Container: container_e01_1663623368960_0008_01_000002 on versa-structured-stream-v1-w-2.c.versa-sml-googl.internal:8026
LogAggregationType: LOCAL
=======================================================================================================================
LogType:stderr
LogLastModifiedTime:Mon Sep 19 23:26:02 +0000 2022
LogLength:44309782124
LogContents:
, tenantId=3, vsnId=0, mstatsTotSentOctets=48210, mstatsTotRecvdOctets=242351, mstatsTotSessDuration=300000, mstatsTotSessCount=34, mstatsType=dest-stats, destIp=165.225.216.24, mstatsAttribs=,topic=syslog.ueba-us4.v1.versa.demo3,customer=versa type(row) is -> <class 'str'>
22/09/19 23:26:02 WARN org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer: KafkaDataConsumer is not running in UninterruptibleThread. It may hang when KafkaDataConsumer's methods are interrupted because of KAFKA-1894
End of LogType:stderr.This log file belongs to a running container (container_e01_1663623368960_0008_01_000002) and so may not be complete.
***********************************************************************
Container: container_e01_1663623368960_0008_01_000001 on versa-structured-stream-v1-w-1.c.versa-sml-googl.internal:8026
LogAggregationType: LOCAL
=======================================================================================================================
LogType:stderr
LogLastModifiedTime:Mon Sep 19 22:54:55 +0000 2022
LogLength:17367929
LogContents:
on syslog.ueba-us4.v1.versa.demo3-2
22/09/19 22:52:52 INFO org.apache.kafka.clients.consumer.internals.SubscriptionState: [Consumer clientId=consumer-spark-kafka-source-0f984ad9-f663-4ce1-9ef1-349419f3e6ec-1714963016-executor-1, groupId=spark-kafka-source-0f984ad9-f663-4ce1-9ef1-349419f3e6ec-1714963016-executor] Resetting offset for partition syslog.ueba-us4.v1.versa.demo3-2 to offset 449568676.
22/09/19 22:54:55 ERROR org.apache.spark.executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
End of LogType:stderr.
Update :
Based on @Dagang's note, I'm using the RollingFileAppender in log4j.properties, and the new log file is getting created. However, some data is still going into stderr.
Here is the updated code:
spark-submit
gcloud dataproc jobs submit pyspark process-appstat.py \
--cluster $CLUSTER \
--properties ^#^spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,org.mongodb.spark:mongo-spark-connector_2.12:3.0.2#spark.dynamicAllocation.enabled=true#spark.dynamicAllocation.executorIdleTimeout=120s#spark.shuffle.service.enabled=true#spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j-executor.properties#spark.driver.extraJavaOptions=-Dlog4j.configuration=file:log4j-driver.properties \
--jars=gs://dataproc-spark-jars/spark-avro_2.12-3.1.3.jar,gs://dataproc-spark-jars/isolation-forest_2.4.3_2.12-2.0.8.jar,gs://dataproc-spark-jars/spark-bigquery-with-dependencies_2.12-0.23.2.jar,gs://dataproc-spark-jars/mongo-spark-connector_2.12-3.0.2.jar,gs://dataproc-spark-jars/bson-4.0.5.jar,gs://dataproc-spark-jars/mongodb-driver-sync-4.0.5.jar,gs://dataproc-spark-jars/mongodb-driver-core-4.0.5.jar \
--files=gs://kafka-certs/versa-kafka-gke-ca.p12,gs://kafka-certs/syslog-vani-noacl.p12,gs://kafka-certs/alarm-compression-user.p12,gs://kafka-certs/alarm-compression-user-test.p12,gs://kafka-certs/appstats-user.p12,gs://kafka-certs/appstats-user-test.p12,gs://kafka-certs/insights-user.p12,gs://kafka-certs/insights-user-test.p12,gs://kafka-certs/intfutil-user.p12,gs://kafka-certs/intfutil-user-test.p12,gs://dataproc-spark-configs/metrics.properties,gs://dataproc-spark-configs/params.cfg,gs://kafka-certs/appstat-anomaly-user.p12,gs://kafka-certs/appstat-anomaly-user-test.p12,gs://kafka-certs/appstat-agg-user.p12,gs://kafka-certs/appstat-agg-user-test.p12,gs://kafka-certs/alarmblock-user.p12,gs://kafka-certs/alarmblock-user-test.p12,gs://kafka-certs/versa-alarmblock-test-user.p12,gs://kafka-certs/versa-bandwidth-test-user.p12,gs://kafka-certs/versa-appstat-test-user.p12,gs://kafka-certs/versa-alarmblock-user.p12,gs://kafka-certs/versa-bandwidth-user.p12,gs://kafka-certs/versa-appstat-user.p12,gs://dataproc-spark-configs/log4j-executor.properties,gs://dataproc-spark-configs/log4j-driver.properties \
--region $REGION \
--py-files streams.zip,utils.zip \
-- isdebug=$isdebug
log4j-executor.properties:
--------------------------
# Set everything to be logged to the console
# log4j.rootCategory=INFO, console
# log4j.appender.console=org.apache.log4j.ConsoleAppender
# log4j.appender.console.target=System.err
# log4j.appender.console.layout=org.apache.log4j.PatternLayout
# log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n
# logging to rolling_file, using RollingFileAppender
log4j.rootLogger=INFO, rolling_file
log4j.appender.rolling_file=org.apache.log4j.RollingFileAppender
log4j.appender.rolling_file.File=${spark.yarn.app.container.log.dir}/versa-ss-executor.log
log4j.appender.rolling_file.MaxFileSize=100MB
log4j.appender.rolling_file.MaxBackupIndex=10
log4j.appender.rolling_file.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling_file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n
# Settings to quiet third party logs that are too verbose
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.eclipse.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.eclipse.jetty=WARN
# Allow INFO logging from Spark Env for EFM
log4j.logger.org.apache.spark.SparkEnv=INFO
# Spark 3.x
log4j.logger.org.sparkproject.jetty.server.handler.ContextHandler=WARN
# Spark 2.x
log4j.logger.org.spark_project.jetty.server.handler.ContextHandler=WARN
# Reduce verbosity for other spammy core classes
log4j.logger.org.apache.hadoop.conf.Configuration.deprecation=WARN
log4j.logger.org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter=WARN
log4j.logger.org.apache.spark.ExecutorAllocationManager=ERROR
log4j.logger.org.apache.spark=WARN
log4j-driver.properties:
-------------------------
log4j.rootLogger=INFO, rolling_file
log4j.appender.rolling_file=org.apache.log4j.RollingFileAppender
log4j.appender.rolling_file.File=${spark.yarn.app.container.log.dir}/versa-ss-driver.log
log4j.appender.rolling_file.MaxFileSize=100MB
log4j.appender.rolling_file.MaxBackupIndex=10
log4j.appender.rolling_file.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling_file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n
# Settings to quiet third party logs that are too verbose
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.eclipse.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.eclipse.jetty=WARN
# Allow INFO logging from Spark Env for EFM
log4j.logger.org.apache.spark.SparkEnv=INFO
# Spark 3.x
log4j.logger.org.sparkproject.jetty.server.handler.ContextHandler=WARN
# Spark 2.x
log4j.logger.org.spark_project.jetty.server.handler.ContextHandler=WARN
# Reduce verbosity for other spammy core classes
log4j.logger.org.apache.hadoop.conf.Configuration.deprecation=WARN
log4j.logger.org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter=WARN
log4j.logger.org.apache.spark.ExecutorAllocationManager=ERROR
log4j.logger.org.apache.spark=WARN
Any ideas on what needs to be done for this?
Question on ${spark.yarn.app.container.log.dir}:
What location does this get translated to?
When I log on to a worker node and check this, I get the following:
karanalang@versa-structured-stream-v1-w-0:~$ echo $spark.yarn.app.container.log.dir
.yarn.app.container.log.dir
In yarn-site.xml:
Here are the relevant configs:
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/hadoop/yarn/nm-local-dir</value>
  <description>
    Directories on the local machine in which to store application temp files.
  </description>
</property>
<property>
  <name>yarn.nodemanager.remote-app-log-dir</name>
  <value>gs://dataproc-temp-us-east1-939354532596-4ln8c3y1/fe57047f-13d9-4b9b-8bce-baa4a911aa65/yarn-logs</value>
  <description>
    The remote path, on the default FS, to store logs.
  </description>
</property>
However, the logs are in the location below:
root@versa-structured-stream-v1-w-0:/# find . -name versa-ss-executor.log
./var/log/hadoop-yarn/userlogs/application_1664926662510_0002/container_1664926662510_0002_01_000001/versa-ss-executor.log
./var/log/hadoop-yarn/userlogs/application_1664926662510_0003/container_1664926662510_0003_01_000179/versa-ss-executor.log
./var/log/hadoop-yarn/userlogs/application_1664926662510_0003/container_1664926662510_0003_01_000250/versa-ss-executor.log
./var/log/hadoop-yarn/userlogs/application_1664926662510_0003/container_1664926662510_0003_01_000299/versa-ss-executor.log
Where is the location ./var/log/hadoop-yarn/userlogs taken from (it is not in yarn-site.xml)?
Short answer:
You can use a custom log4j config with RollingFileAppender to limit the log size for long-running jobs.
Long answer:
The default log4j config for Spark on Dataproc is at /etc/spark/conf/log4j.properties. It configures the root logger to write to stderr at INFO level. But at runtime, driver logs (in client mode) are directed by the Dataproc agent to GCS and streamed back to the client, while executor logs (and driver logs in cluster mode) are redirected by YARN to the stderr file in the container's YARN log dir. Consider using /etc/spark/conf/log4j.properties as the template for your custom config.
In your custom config, you can configure logs to be written to a RollingFileAppender, e.g.,
log4j.rootLogger=INFO, rolling_file
log4j.appender.rolling_file=org.apache.log4j.RollingFileAppender
log4j.appender.rolling_file.File=${spark.yarn.app.container.log.dir}/my_app.log
log4j.appender.rolling_file.MaxFileSize=100MB
log4j.appender.rolling_file.MaxBackupIndex=10
...
Note that for executors (and drivers in cluster mode), the value of log4j.appender.rolling_file.File needs to be a path under ${spark.yarn.app.container.log.dir}, see this question and this doc.
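For context: ${spark.yarn.app.container.log.dir} is substituted by log4j from a JVM system property that Spark on YARN sets on each container's launch command, not from a shell environment variable, which is why echo $spark.yarn.app.container.log.dir on a worker prints only .yarn.app.container.log.dir. A sketch of how to inspect the actual value on a worker node:
# Look for the -Dspark.yarn.app.container.log.dir=... flag on a running container JVM:
ps aux | grep 'spark.yarn.app.container.log.dir'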
Upload your log4j config(s) to a GCS bucket; the driver and executors may or may not share the same config. In your case, you might want to update only the executor log4j config and just use the default for the driver.
Then submit the job with the custom log4j config in one of the following ways:
The file name must be log4j.properties; the driver and executors will share the same config:
gcloud dataproc jobs submit spark ... \
--files gs://my-bucket/log4j.properties
The file name doesn't have to be log4j.properties; the driver and executors can have different configs:
gcloud dataproc jobs submit spark ... \
--files gs://my-bucket/my-log4j.properties \
--properties 'spark.executor.extraJavaOptions=-Dlog4j.configuration=file:my-log4j.properties'
The expectation is that there will be rolling logs for the Spark executors under the YARN container log dirs (configurable through yarn.nodemanager.log-dirs, with default value /var/log/hadoop-yarn/userlogs on Dataproc); they will be automatically aggregated and stored in GCS and Cloud Logging.
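The container log root itself comes from yarn.nodemanager.log-dirs in yarn-site.xml; when the property is not set explicitly, YARN falls back to its default, which on Dataproc is /var/log/hadoop-yarn/userlogs, and that is why the directory does not appear in your yarn-site.xml. A quick sketch for checking this on a worker (assuming the usual Dataproc config path /etc/hadoop/conf/yarn-site.xml):
# Empty output means the property is unset and the default log dir is in use:
grep -A1 'yarn.nodemanager.log-dirs' /etc/hadoop/conf/yarn-site.xml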

pyspark job execution in yarn cluster

I am trying to understand how a Spark job works in a YARN cluster.
I am using the command below to submit the job:
spark-submit --master yarn --deploy-mode cluster sparksessionexample.py
After submitting the job, the console shows the log below:
2020-05-29 20:52:48,668 INFO yarn.Client: Uploading resource file:/tmp/spark-bcd415f0-a22e-46b2-951c-5b6e4385a0c6/__spark_libs__2908230569257238890.zip -> hdfs://localhost:9000/user/hadoop/.sparkStaging/application_1590759398715_0003/__spark_libs__2908230569257238890.zip
2020-05-29 20:53:14,164 INFO yarn.Client: Uploading resource file:/home/hadoop/pythonprojects/Python/src/spark_jobs/sparksessionexample.py -> hdfs://localhost:9000/user/hadoop/.sparkStaging/application_1590759398715_0003/sparksessionexample.py
2020-05-29 20:53:14,610 INFO yarn.Client: Uploading resource file:/home/hadoop/clouderaapp/apache-spark/python/lib/pyspark.zip -> hdfs://localhost:9000/user/hadoop/.sparkStaging/application_1590759398715_0003/pyspark.zip
2020-05-29 20:53:15,984 INFO yarn.Client: Uploading resource file:/home/hadoop/clouderaapp/apache-spark/python/lib/py4j-0.10.7-src.zip -> hdfs://localhost:9000/user/hadoop/.sparkStaging/application_1590759398715_0003/py4j-0.10.7-src.zip
2020-05-29 20:53:18,362 INFO yarn.Client: Uploading resource file:/tmp/spark-bcd415f0-a22e-46b2-951c-5b6e4385a0c6/__spark_conf__7123551182035223076.zip -> hdfs://localhost:9000/user/hadoop/.sparkStaging/application_1590759398715_0003/__spark_conf__.zip
I just want to understand how YARN executes the sparksessionexample.py file. I mean, does it create a Python virtual env on the node? The log above only shows the uploading of libs and confs, but what about the Python client that executes sparksessionexample.py?
Can anyone help me understand this?
The "Spark client" is used to bootstrap the Spark job execution.
In your case it is the only thing that runs on your local machine, because you requested cluster execution mode:
the "client" contacts the cluster manager (here YARN Resource Manager, could be Kubernetes Master, etc.) to start the Spark driver inside an AppMaster container
then the driver contacts again the cluster manager to request some containers for the executors
then the driver runs your Python code and distributes the work to the executors
finally the driver de-allocates its executors and itself
at this point the "client" notices that the YARN job has reached success or failure status, and can terminate
In short, the "client" never gets any kind of useful information from the driver running inside the cluster. You must inspect the YARN logs for the container running the driver (it's the AppMaster, typically container number 000001).
If you want to see some feedback from the driver, then run your job in client execution mode: the driver will run in the same JVM as the "client", on your local machine, and spit its logs to your console.
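For example, a sketch using the application ID from the log above (the driver/AppMaster logs are the ones for the container typically numbered 000001):
yarn logs -applicationId application_1590759398715_0003
This dumps the aggregated stdout/stderr of every container for the application, including the driver's output, once YARN has the logs available.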

Why doesn't the pyspark driver download jar files to local storage?

I am using spark-on-k8s-operator to deploy Spark 2.4.4 on Kubernetes. However, I'm pretty sure this question is about Spark itself, not about a Kubernetes deployment of it.
I include several files when I deploy a job to the Kubernetes cluster, including jars, pyfiles and a main. In spark-on-k8s, this is done via a config file:
spec:
  mainApplicationFile: "s3a://project-folder/jobs/test/db_read_k8.py"
  deps:
    jars:
      - "s3a://project-folder/jars/mysql-connector-java-8.0.17.jar"
    pyFiles:
      - "s3a://project-folder/pyfiles/pyspark_jdbc.zip"
This would be equivalent to
spark-submit \
--jars s3a://project-folder/jars/mysql-connector-java-8.0.17.jar \
--py-files s3a://project-folder/pyfiles/pyspark_jdbc.zip \
s3a://project-folder/jobs/test/db_read_k8.py
In spark-on-k8s, there is a sparkapplication Kubernetes pod that manages the submitted Spark jobs, and that pod spark-submits to a driver pod (which then interacts with the worker pods). My issue occurs on the driver pod. Once the driver receives the spark-submit command, it goes about its business and pulls the required files from AWS S3, as expected. Except that it does not pull the jar file:
spark-kubernetes-driver 19/11/05 17:01:19 INFO SparkContext: Added JAR s3a://project-folder/jars/mysql-connector-java-8.0.17.jar at s3a://sezzle-spark/jars/mysql-connector-java-8.0.17.jar with timestamp 1572973279830
spark-kubernetes-driver 19/11/05 17:01:19 INFO SparkContext: Added file s3a://project-folder/jobs/test/db_read_k8.py at s3a://sezzle-spark/jobs/test/db_read_k8.py with timestamp 1572973279872
spark-kubernetes-driver 19/11/05 17:01:19 INFO Utils: Fetching s3a://project-folder/jobs/test/db_read_k8.py to /var/data/spark-f54f76a6-8f2b-4bd5-9644-c406aecac2dd/spark-42e3cd23-55c5-4099-a6af-455efb5dc4f2/userFiles-ae47c908-d0f0-4ff5-aee6-4dadc5c9b95f/fetchFileTemp1013256051456720708.tmp
spark-kubernetes-driver 19/11/05 17:01:19 INFO SparkContext: Added file s3a://project-folder/pyfiles/pyspark_jdbc.zip at s3a://sezzle-spark/pyfiles/pyspark_jdbc.zip with timestamp 1572973279962
spark-kubernetes-driver 19/11/05 17:01:20 INFO Utils: Fetching s3a://project-folder/pyfiles/pyspark_jdbc.zip to /var/data/spark-f54f76a6-8f2b-4bd5-9644-c406aecac2dd/spark-42e3cd23-55c5-4099-a6af-455efb5dc4f2/userFiles-ae47c908-d0f0-4ff5-aee6-4dadc5c9b95f/fetchFileTemp6740168219531159007.tmp
All three required files are "added" but only the main and pyfiles are "fetched." Looking through the driver pod, I can't find the jar file anywhere; it just doesn't get downloaded locally. This, of course, crashes my application, because the mysql driver isn't in the classpath.
Why doesn't spark download jar files to the driver's local filesystem the way it does for the pyfiles and python main?
PySpark's dependency management is somewhat unclear and not well documented.
If your problem is only with adding a .jar, I would recommend using --packages ... instead (spark-operator should have an analogous option).
Hope it'll work for you.
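A sketch of that suggestion: instead of pointing --jars at the S3 object, let Spark resolve the driver from Maven. The coordinates below match the connector version in the question; the sparkConf block is an assumption about the spark-on-k8s-operator CRD, so check your operator version:
spec:
  sparkConf:
    "spark.jars.packages": "mysql:mysql-connector-java:8.0.17"
or, with plain spark-submit:
spark-submit \
  --packages mysql:mysql-connector-java:8.0.17 \
  --py-files s3a://project-folder/pyfiles/pyspark_jdbc.zip \
  s3a://project-folder/jobs/test/db_read_k8.py
With --packages, the jar is fetched into the driver's local Ivy cache and placed on both the driver and executor classpaths, avoiding the fetch step that is being skipped for --jars here.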

Zeppelin --> Shiro --> Livy Integration Error: Cannot start spark | livy is not allowed to impersonate user1

I am facing an issue with Zeppelin --> Shiro --> Livy integration. It would be great if someone could help me with this.
My current environment is set up as follows:
• 1 master node and 2 slave nodes running
• Zeppelin installed on the master node, up and running
• Shiro authentication enabled using the shiro.ini file; Zeppelin with Shiro works fine as well (no LDAP authentication yet)
• Livy server installed on the master node, up and running
core-site.xml under etc/hadoop has been configured as follows:
<property>
  <name>hadoop.proxyuser.livy.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.livy.groups</name>
  <value>*</value>
</property>
*************************************************************
Also, livy.conf under /livy/conf has been configured as follows:
*************************************************************
# What port to start the server on.
livy.server.port 8998
# What spark master Livy sessions should use.
livy.spark.master yarn
livy.impersonation.enabled true
***************************************************************************
From the Zeppelin UI, I have configured the %livy interpreter with the values below:
***************************************************************************
livy.spark.master : local[*]
zeppelin.livy.url : http://localhost:8998
My Testing:
I logged into Zeppelin as "user1" successfully. To test the connectivity of the Zeppelin --> Shiro --> Livy integration, I am running the simple code snippets below.
Code 1:
%livy.spark
val accum = sc.accumulator(0)
sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
accum.value
Output Error: Cannot start spark.
Code 2:
%livy.spark
sc.version
Output error on the Zeppelin UI: Cannot start spark.
Code 3:
%livy.pyspark
print("1")
Output error: Cannot start spark.
Issue:
The notebook takes a while to run and then throws the error message "Cannot start spark." on the Zeppelin UI against the notebook output.
Further, while investigating the log file livy-livy-server.out under the log file path /var/log/livy/,
the error below is visible in the log file.
18/04/09 12:15:25 INFO WebServer: Starting server on http://IReplcaedmyHostNameFromHere:8999
18/04/09 12:17:56 INFO InteractiveSession$: Creating Interactive session 0: [owner: null, request: [kind: spark, proxyUser: Some(user1), conf: spark.master -> local[*], heartbeatTimeoutInSecond: 0]]
18/04/09 12:17:56 INFO RpcServer: Connected to the port 10001
18/04/09 12:17:56 WARN RSCConf: Your hostname, <My Host>, resolves to a loopback address, but we couldn't find any external IP address!
18/04/09 12:17:56 WARN RSCConf: Set livy.rsc.rpc.server.address if you need to bind to another address.
18/04/09 12:17:56 INFO InteractiveSessionManager: Registering new session 0
18/04/09 12:17:59 INFO LineBufferedStream: stdout: 18/04/09 12:17:59 INFO RSCDriver: Connecting to: IReplcaedmyHostNameFromHere.internal:10001
.
.
.
.
18/04/09 12:18:07 INFO LineBufferedStream: stdout: ERROR: org.apache.hadoop.security.authorize.AuthorizationException: User: livy is not allowed to impersonate user1

Spark streaming application on a single virtual machine, standalone mode

I have created a Spark Streaming application, which worked fine when the deploy mode was client.
On my virtual machine I have a master and only one worker.
When I tried to change the mode to "cluster", it failed. In the web UI, I see that the driver is running, but the application failed.
EDITED
In the log, I see the following content:
16/03/23 09:06:25 INFO Master: Driver submitted org.apache.spark.deploy.worker.DriverWrapper
16/03/23 09:06:25 INFO Master: Launching driver driver-20160323090625-0001 on worker worker-20160323085541-10.0.2.15-36648
16/03/23 09:06:32 INFO Master: metering.dev.enerbyte.com:37168 got disassociated, removing it.
16/03/23 09:06:32 INFO Master: 10.0.2.15:59942 got disassociated, removing it.
16/03/23 09:06:32 INFO Master: metering.dev.enerbyte.com:37166 got disassociated, removing it.
16/03/23 09:06:46 INFO Master: Registering app wibeee-pipeline
16/03/23 09:06:46 INFO Master: Registered app wibeee-pipeline with ID app-20160323090646-0007
16/03/23 09:06:46 INFO Master: Launching executor app-20160323090646-0007/0 on worker worker-20160323085541-10.0.2.15-36648
16/03/23 09:06:50 INFO Master: Received unregister request from application app-20160323090646-0007
16/03/23 09:06:50 INFO Master: Removing app app-20160323090646-0007
16/03/23 09:06:50 WARN Master: Got status update for unknown executor app-20160323090646-0007/0
16/03/23 09:06:50 INFO Master: metering.dev.enerbyte.com:37172 got disassociated, removing it.
16/03/23 09:06:50 INFO Master: 10.0.2.15:45079 got disassociated, removing it.
16/03/23 09:06:51 INFO Master: Removing driver: driver-20160323090625-0001
So what happens is that the master launches the driver on the worker, the application gets registered, and then the master tries to launch an executor on the same worker, which fails (although I have only one worker!).
EDIT
Could the issue be related to the fact that I use checkpointing, since I have an "updateStateByKey" transformation in my code? The checkpoint directory is set to "/tmp", but I always get a warning that when running in cluster mode, "/tmp" needs to change. How should I set it?
Could that be the reason for my problem?
Thank you
According to the log you have provided, it may not be because of the properties file, but check this:
spark-submit only copies the jar file to the driver when running in cluster mode, so if your application tries to read a properties file kept on the machine from which you are running spark-submit, the driver cannot find it when running in cluster mode.
Reading from a properties file works in client mode because the driver starts on the same machine where you are executing spark-submit.
You can copy the properties file to the same directory on all nodes, or keep the properties file in the Cassandra file system and read it from there.
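The same reasoning applies to the checkpoint directory from the question's edit: in cluster mode, /tmp is local to whichever node runs the driver, so the checkpoint path should point at storage every node can reach (HDFS, NFS, etc.). A minimal PySpark sketch, assuming an HDFS path; the names here are illustrative:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

CHECKPOINT_DIR = "hdfs:///user/spark/checkpoints/my-streaming-app"  # hypothetical path

def create_context():
    sc = SparkContext(appName="my-streaming-app")
    ssc = StreamingContext(sc, 10)  # 10-second batches
    ssc.checkpoint(CHECKPOINT_DIR)  # required for updateStateByKey
    # ... build the DStream pipeline here ...
    return ssc

# Recover from an existing checkpoint, or build a fresh context if none exists.
ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, create_context)
ssc.start()
ssc.awaitTermination()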
