How to use custom jars in spark-submit --packages - apache-spark

I have a PySpark project that does Spark Structured Streaming. To get the query metrics, I created a Java project that listens to the micro-batch events and writes the data to log files. The logging works when I pass the jar as a file and use --jars to read it in spark-submit.
But this process involves manual work, because the jar has to be uploaded by hand. To solve that, I uploaded the jar to a JFrog repository. Now, when running the spark-submit command, I have added --repositories and also added the coordinate to --packages, which already includes a few packages such as kafka, avro, etc. All the other packages download from JFrog, but when it reaches my jar it throws the error below, even though the repo URL from the log does download the jar and the pom when I open it in a browser.
:: problems summary ::
:::: WARNINGS
module not found: <myjar>;<version>
==== central: tried
https://<repo>/myjar.pom
-- artifact <myjar>.jar:
https://<repo>/myjar.jar
==== repo-1: tried
https://<repo>/myjar.pom
-- artifact <myjar>.jar:
https://<repo>/myjar.jar
::::::::::::::::::::::::::::::::::::::::::::::
:: UNRESOLVED DEPENDENCIES ::
::::::::::::::::::::::::::::::::::::::::::::::
:: com.spark.extension#<myjar>;<version>: not found
::::::::::::::::::::::::::::::::::::::::::::::
:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: com.spark.extension#<myjar>;<version>: not found]
at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1428)
at org.apache.spark.deploy.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:54)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:308)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:902)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1038)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1047)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Command exiting with ret '1'
EDIT-
Taken from EMR (some of the url/names are omitted)-
spark-submit \
--name "A Adapter" \
--deploy-mode cluster \
--master yarn \
--repositories https://<jfrog repo>/artifactory/all/ \
--packages com.spark.extension:spark-listeners:0.3.8,org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1,org.postgresql:postgresql:42.2.22,software.amazon.cloudwatchlogs:aws-embedded-metrics:2.0.0-beta-1 \
--driver-cores 2 \
--driver-memory 12g \
--executor-memory 12g \
--num-executors 1 \
--executor-cores 2 \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.shuffle.service.enabled=true \
--conf spark.dynamicAllocation.minExecutors=1 \
--conf spark.dynamicAllocation.maxExecutors=6 \
--files s3://<url>/log4j.properties,s3://<url>/logging.json \
--conf spark.yarn.dist.archives=s3://<url>/libs.zip#app-site-packages \
--conf spark.yarn.appMasterEnv.PYTHONPATH=app-site-packages \
--conf "spark.yarn.appMasterEnv.SPARK_APP_NAME=A Adapter" \
--conf spark.yarn.appMasterEnv.CLUSTER_STUB=dev-b1 \
--conf "spark.yarn.appMasterEnv.AWS_EMF_SERVICE_NAME=A Adapter" \
--conf spark.yarn.appMasterEnv.AWS_EMF_SERVICE_TYPE=dev-b1-emr \
--conf spark.yarn.appMasterEnv.AWS_EMF_LOG_GROUP_NAME=dev-b1-spark-structured-stream-logs \
--conf spark.yarn.appMasterEnv.AWS_EMF_LOG_STREAM_NAME=dev-b1-spark-structured-stream-logs \
--conf spark.yarn.appMasterEnv.AWS_EMF_AGENT_ENDPOINT=udp://127.0.0.1:25888 \
--conf spark.driver.extraJavaOptions= \
--conf spark.executor.extraJavaOptions= \
--conf spark.executorEnv.PYTHONPATH=app-site-packages \
--py-files s3://<url>/libs.zip,s3://<url>/jobs.zip,s3://<url>/.env \
s3://<url>/main.py --job acc
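When comparing the URLs in the Ivy log with what the browser downloads, it can help to spell out the path that a standard Maven repository layout implies for a --packages coordinate. The following is a minimal Python sketch under that assumption (dots in the groupId become slashes, and files are named artifactId-version). The function name and the host repo.example.invalid are hypothetical placeholders; the real repository host is elided in the question.

```python
def maven_paths(coordinate, repo_root):
    """Return the (pom_url, jar_url) implied by the standard Maven layout
    for a 'groupId:artifactId:version' coordinate."""
    group_id, artifact_id, version = coordinate.split(":")
    base = "/".join([
        repo_root.rstrip("/"),
        group_id.replace(".", "/"),  # com.spark.extension -> com/spark/extension
        artifact_id,
        version,
    ])
    stem = f"{artifact_id}-{version}"
    return f"{base}/{stem}.pom", f"{base}/{stem}.jar"


# Hypothetical repository root; substitute the value passed to --repositories.
pom_url, jar_url = maven_paths(
    "com.spark.extension:spark-listeners:0.3.8",
    "https://repo.example.invalid/artifactory/all/",
)
print(pom_url)
print(jar_url)
```

If the URLs printed here differ from the ones listed under "tried" in the log, the path the jar was uploaded to and the coordinate given to --packages probably disagree.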

Related

Spark metrics sink doesn't expose executor's metrics

I'm using Spark on YARN with
Ambari 2.7.4
HDP Standalone 3.1.4
Spark 2.3.2
Hadoop 3.1.1
Graphite on Docker latest
I was trying to get Spark metrics with Graphite sink following this tutorial.
Advanced spark2-metrics-properties in Ambari are:
driver.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
executor.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
worker.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
master.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=ap-test-m.c.gcp-ps.internal
*.sink.graphite.port=2003
*.sink.graphite.protocol=tcp
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
*.sink.graphite.prefix=app-test
*.source.jvm.class=org.apache.spark.metrics.source.JvmSource
Spark submit:
export HADOOP_CONF_DIR=/usr/hdp/3.1.4.0-315/hadoop/conf/; spark-submit --class com.Main --master yarn --deploy-mode client --driver-memory 1g --executor-memory 10g --num-executors 2 --executor-cores 2 spark-app.jar /data
As a result I'm only getting driver metrics.
Also, I was trying to add metrics.properties to spark-submit command together with global spark metrics props, but that didn't help.
And finally, I tried conf in spark-submit and in java SparkConf:
--conf "spark.metrics.conf.driver.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink"
--conf "spark.metrics.conf.executor.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink"
--conf "worker.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink"
--conf "master.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink"
--conf "spark.metrics.conf.*.sink.graphite.host"="host"
--conf "spark.metrics.conf.*.sink.graphite.port"=2003
--conf "spark.metrics.conf.*.sink.graphite.period"=10
--conf "spark.metrics.conf.*.sink.graphite.unit"=seconds
--conf "spark.metrics.conf.*.sink.graphite.prefix"="app-test"
--conf "spark.metrics.conf.*.source.jvm.class"="org.apache.spark.metrics.source.JvmSource"
But that didn't help either.
CSVSink also gives only driver metrics.
UPD
When I submit the job in cluster mode, I get the same metrics as in the Spark History Server, but the JVM metrics are still absent.
Posting to a dated question, but maybe it will help.
It seems the executors do not have the metrics.properties file on their filesystems.
One way to confirm this would be to look at the executor logs:
2020-01-16 10:00:10 ERROR MetricsConfig:91 - Error loading configuration file metrics.properties
java.io.FileNotFoundException: metrics.properties (No such file or directory)
at org.apache.spark.metrics.MetricsConfig.loadPropertiesFromFile(MetricsConfig.scala:132)
at org.apache.spark.metrics.MetricsConfig.initialize(MetricsConfig.scala:55)
at org.apache.spark.metrics.MetricsSystem.<init>(MetricsSystem.scala:95)
at org.apache.spark.metrics.MetricsSystem$.createMetricsSystem(MetricsSystem.scala:233)
To fix this on YARN, pass two parameters to spark-submit:
$ spark-submit \
--files metrics.properties \
--conf spark.metrics.conf=metrics.properties
The --files option ships the listed files to the working directory of each executor.
The spark.metrics.conf option specifies a custom location for the metrics configuration file.
Another way to fix the issue is to place the file at $SPARK_HOME/conf/metrics.properties on the driver host and on every executor host before starting the job.
More on metrics here: https://spark.apache.org/docs/latest/monitoring.html
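The two steps in this answer (write a metrics.properties, then ship it with --files and point spark.metrics.conf at it) can be sketched in Python. This is an illustration only: the function names and the host graphite.example.invalid are hypothetical, the property keys are the ones from the question, and it assumes the file sits in the directory spark-submit is launched from, as in the answer's command.

```python
def graphite_metrics_properties(host, port, prefix, period_seconds=10):
    """Render metrics.properties text with a Graphite sink for the driver and executors."""
    sink = "org.apache.spark.metrics.sink.GraphiteSink"
    lines = [
        f"driver.sink.graphite.class={sink}",
        f"executor.sink.graphite.class={sink}",
        f"*.sink.graphite.host={host}",
        f"*.sink.graphite.port={port}",
        "*.sink.graphite.protocol=tcp",
        f"*.sink.graphite.period={period_seconds}",
        "*.sink.graphite.unit=seconds",
        f"*.sink.graphite.prefix={prefix}",
        "*.source.jvm.class=org.apache.spark.metrics.source.JvmSource",
    ]
    return "\n".join(lines) + "\n"


def metrics_submit_args(file_name):
    """Build the two spark-submit options from the answer.

    Assumes file_name is a bare name in the launch directory, so the same
    relative name resolves both locally and in each executor's working directory.
    """
    return ["--files", file_name, "--conf", f"spark.metrics.conf={file_name}"]


text = graphite_metrics_properties("graphite.example.invalid", 2003, "app-test")
args = metrics_submit_args("metrics.properties")
print(text)
print(" ".join(args))
```

The printed arguments would be appended to the rest of the spark-submit command; the properties text would be saved as metrics.properties beforehand.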

spark-submit fails giving FileNotFoundException

I am using spark 1.6. Below is the spark-submit command I am executing. Variables prefixed with '$' are dynamic variables.
spark-submit \
--queue "queue1" \
--conf spark.driver.extraJavaOptions="-Dlog4j.configuration=file:./conf/log4j.properties -Dspark.yarn.keytab=$KEYTAB -Dspark.yarn.principal=$PRINCIPAL -Dkeytab=$KEYTAB -Dprincipal=$PRINCIPAL" \
--conf spark.yarn.am.cores=$am_cores \
--conf spark.yarn.am.memory="$am_memory" \
--conf spark.yarn.executor.memoryOverhead=$executor_memoryOverhead \
--conf spark.yarn.principal=$PRINCIPAL \
--conf spark.yarn.keytab=$KEYTAB \
--conf spark.sql.hive.convertMetastoreParquet=false \
--conf spark.ui.enabled=true \
--conf spark.driver.extraClassPath=/opt/cloudera/parcels/CDH/lib/hbase/conf/ \
--conf spark.executor.extraClassPath=/opt/cloudera/parcels/CDH/lib/hbase/conf/ \
--conf spark.sql.shuffle.partitions=$shuffle_partitions \
--name $APPNAME \
--master yarn \
--num-executors $num_executors \
--executor-memory $executor_memory \
--executor-cores $executor_cores \
--driver-memory $driver_memory \
--deploy-mode cluster \
--principal $PRINCIPAL \
--keytab $KEYTAB \
--files ./conf/log4j.properties \
--jars $APPSCALALIBJARFILES \
--class $APPCLASS \
$APPLIB
I am running the spark-submit job on a YARN cluster, and it uploads the dependent jars to the default HDFS staging directory, /user/<user id>/.sparkStaging/<yarn applicationId>/*.jar.
While the job is being submitted I can see the jar getting uploaded, but spark-submit fails with the error below. The file's owner and group match the ID used to run spark-submit. I also tried the configuration parameter spark.yarn.StagingDir, but even that didn't help.
Your professional inputs will help in addressing this issue.
Error stack trace -
Diagnostics: File does not exist: hdfs://user/<user id>/.sparkStaging/<yarn application_id>/chill-java-0.5.0.jar
java.io.FileNotFoundException: File does not exist:
hdfs://user/<user id>/.sparkStaging/<yarn application_id>/chill-java-0.5.0.jar
at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1257)
at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1249)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1249)
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:251)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:61)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:357)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:356)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Exception in thread "main" org.apache.spark.SparkException: Must specify the driver container image

I am trying to run spark-submit on minikube (Kubernetes) from my local machine's CLI with the command
spark-submit --master k8s://https://127.0.0.1:8001 --name cfe2
--deploy-mode cluster --class com.yyy.Test --conf spark.executor.instances=2 --conf spark.kubernetes.container.image docker.io/anantpukale/spark_app:1.1 local://spark-0.0.1-SNAPSHOT.jar
I have a simple Spark job jar built against version 2.3.0. I have also containerized it in Docker, and minikube is up and running on VirtualBox.
Below is exception stack:
Exception in thread "main" org.apache.spark.SparkException: Must specify the driver container image
at org.apache.spark.deploy.k8s.submit.steps.BasicDriverConfigurationStep$$anonfun$3.apply(BasicDriverConfigurationStep.scala:51)
at org.apache.spark.deploy.k8s.submit.steps.BasicDriverConfigurationStep$$anonfun$3.apply(BasicDriverConfigurationStep.scala:51)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.deploy.k8s.submit.steps.BasicDriverConfigurationStep.<init>(BasicDriverConfigurationStep.scala:51)
at org.apache.spark.deploy.k8s.submit.DriverConfigOrchestrator.getAllConfigurationSteps(DriverConfigOrchestrator.scala:82)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$5.apply(KubernetesClientApplication.scala:229)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$5.apply(KubernetesClientApplication.scala:227)
at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2585)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:227)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:192)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
2018-04-06 13:33:52 INFO ShutdownHookManager:54 - Shutdown hook called
2018-04-06 13:33:52 INFO ShutdownHookManager:54 - Deleting directory C:\Users\anant\AppData\Local\Temp\spark-6da93408-88cb-4fc7-a2de-18ed166c3c66
This looks like a bug in the default value of spark.kubernetes.driver.container.image, which should fall back to spark.kubernetes.container.image. Try specifying the driver and executor container images directly:
spark.kubernetes.driver.container.image
spark.kubernetes.executor.container.image
From the source code, the only available conf options are:
spark.kubernetes.container.image
spark.kubernetes.driver.container.image
spark.kubernetes.executor.container.image
I also noticed that the Kubernetes implementation changed a lot in Spark 2.3.0 compared to 2.2.0. For example, instead of specifying the driver and executor images separately, the official getting-started guide uses a single image given to spark.kubernetes.container.image.
See if this works:
spark-submit \
--master k8s://http://127.0.0.1:8001 \
--name cfe2 \
--deploy-mode cluster \
--class com.oracle.Test \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.container.image=docker/anantpukale/spark_app:1.1 \
--conf spark.kubernetes.authenticate.submission.oauthToken=YOUR_TOKEN \
--conf spark.kubernetes.authenticate.submission.caCertFile=PATH_TO_YOUR_CERT \
local://spark-0.0.1-SNAPSHOT.jar
The token and cert can be found on the k8s dashboard. Follow the instructions to build Docker images compatible with Spark 2.3.0.
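Separately from the answers above, one detail worth checking is that the command as posted in the question puts the image after a space rather than after =, whereas the suggested command uses --conf key=value. Whether that is a transcription slip or the cause of the error is not established in this thread. A minimal Python sketch of such a check follows; the helper name is hypothetical and the argument list is copied from the question.

```python
def malformed_conf_values(argv):
    """Return every value that follows --conf but is not of the form KEY=VALUE."""
    bad = []
    for flag, value in zip(argv, argv[1:]):
        if flag == "--conf" and "=" not in value:
            bad.append(value)
    return bad


# The command from the question, split into arguments as a shell would.
posted = [
    "spark-submit",
    "--master", "k8s://https://127.0.0.1:8001",
    "--name", "cfe2",
    "--deploy-mode", "cluster",
    "--class", "com.yyy.Test",
    "--conf", "spark.executor.instances=2",
    "--conf", "spark.kubernetes.container.image",
    "docker.io/anantpukale/spark_app:1.1",
    "local://spark-0.0.1-SNAPSHOT.jar",
]
print(malformed_conf_values(posted))
```

An empty list would mean every --conf value has the expected shape; here the image key is flagged because its value was passed as a separate argument.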

Dynamic Resource allocation in Spark-Yarn Cluster Mode

When I use the setting below to start the Spark application (the default is yarn-client mode), it works fine:
spark_memory_setting="--master yarn --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true --conf spark.yarn.queue=ciqhigh --conf spark.dynamicAllocation.initialExecutors=50 --conf spark.dynamicAllocation.maxExecutors=50 --executor-memory 2G --driver-memory 4G"
ISSUE
Whereas when I change the deploy mode to cluster, the application does not start up. It does not even throw an error that I could investigate.
spark_memory_setting="--master yarn-cluster --deploy-mode=cluster --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true --conf spark.yarn.queue=ciqhigh --conf spark.dynamicAllocation.initialExecutors=50 --conf spark.dynamicAllocation.maxExecutors=50 --executor-memory 2G --driver-memory 4G"
LOG
18/01/08 01:21:00 WARN Client: spark.yarn.am.extraJavaOptions will not take effect in cluster mode
This is the last line from the logger.
Any suggestions are most welcome.
One important thing to highlight here: the Spark application I am trying to deploy starts the Apache Thrift server. After some searching, I think Thrift is the reason I am not able to run on YARN in cluster mode. Any help with running in cluster mode would be appreciated.
The option --master yarn-cluster is wrong. It is not a valid master URL; it should be just "yarn" instead of "yarn-cluster". Please cross-check.
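The rewrite this answer suggests (keep the master as "yarn" and express the mode through --deploy-mode) can be sketched as a small Python helper. The function name is hypothetical, and the mapping assumes the legacy values yarn-client and yarn-cluster mean master "yarn" plus the matching deploy mode.

```python
def normalize_master(master, deploy_mode=None):
    """Rewrite legacy 'yarn-client' / 'yarn-cluster' masters into
    '--master yarn --deploy-mode <mode>' arguments."""
    legacy = {"yarn-client": "client", "yarn-cluster": "cluster"}
    if master in legacy:
        implied = legacy[master]
        # Refuse contradictory input such as yarn-cluster with --deploy-mode client.
        if deploy_mode not in (None, implied):
            raise ValueError(f"{master} conflicts with --deploy-mode {deploy_mode}")
        return ["--master", "yarn", "--deploy-mode", implied]
    args = ["--master", master]
    if deploy_mode:
        args += ["--deploy-mode", deploy_mode]
    return args


print(" ".join(normalize_master("yarn-cluster", "cluster")))
```

The remaining options from spark_memory_setting (dynamic allocation, queue, memory) would follow unchanged.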

java.lang.ClassNotFoundException: org.apache.spark.Logging

I just upgraded to Spark 2.1.0 and decided to test out my data with beeline, but for some reason it gives me:
Error: org.spark_project.guava.util.concurrent.UncheckedExecutionException: java.lang.ClassNotFoundException: org.apache.spark.Logging was removed in Spark 2.0. Please check if your library is compatible with Spark 2.0 (state=,code=0)
I renamed the old directory so all files would be new. I'm not running my own code, but beeline that comes with Spark.
Here are steps that I followed:
cd /usr/local/spark
./sbin/start-thriftserver.sh --master spark://REMOVED:7077 --num-executors 2 --driver-memory 6G --executor-cores 6 --executor-memory 14G --hiveconf hive.server2.thrift.port=10015 --packages datastax:spark-cassandra-connector:1.6.4-s_2.11 --conf spark.cassandra.connection.host=REMOVED --conf spark.cassandra.auth.username=REMOVED --conf spark.cassandra.auth.password=REMOVED
./bin/beeline -u jdbc:hive2://REMOVED:10015
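So I'm not sure what to do now. Any suggestions, please?
You also need to update datastax:spark-cassandra-connector. The 1.6.4 connector targets Spark 1.6 and still references org.apache.spark.Logging, which was removed in Spark 2.0. Please try: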
--packages datastax:spark-cassandra-connector:2.0.0-M3-s_2.11
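A quick sanity check on a connector coordinate before passing it to --packages can be sketched in Python. Comparing only the major version is an assumption made for illustration, not the connector's official compatibility table, and the function name is hypothetical.

```python
def connector_matches_spark(coordinate, spark_version):
    """Heuristic: the connector's major version should equal Spark's major version.

    Expects coordinates like 'datastax:spark-cassandra-connector:1.6.4-s_2.11'.
    """
    _, _, version = coordinate.split(":")
    connector_release = version.split("-")[0]  # '1.6.4' from '1.6.4-s_2.11'
    connector_major = int(connector_release.split(".")[0])
    spark_major = int(spark_version.split(".")[0])
    return connector_major == spark_major


old = connector_matches_spark("datastax:spark-cassandra-connector:1.6.4-s_2.11", "2.1.0")
new = connector_matches_spark("datastax:spark-cassandra-connector:2.0.0-M3-s_2.11", "2.1.0")
print(old, new)
```

Under this heuristic the coordinate from the question is flagged as mismatched with Spark 2.1.0, while the one suggested in the answer passes.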
