Spark 2.1.1 with typesafeconfig - apache-spark

I'm trying to support an external configuration file for my Spark application using Typesafe Config.
I'm loading the application.conf file in my application code like this (driver):
import com.typesafe.config.ConfigFactory

val config = ConfigFactory.load()
val myProp = config.getString("app.property")
val df = spark.read.avro(myProp)
application.conf looks like this:
app.property="some value"
spark-submit execution looks like this:
spark-submit \
--class com.myapp.Main \
--conf spark.shuffle.service.enabled=true \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.minExecutors=56 \
--conf spark.dynamicAllocation.maxExecutors=1000 \
--driver-class-path $HOME/conf/*.conf \
--files $HOME/conf/application.conf \
my-app-0.0.1-SNAPSHOT.jar
It doesn't seem to work, and I'm getting:
Exception in thread "main" com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'app'
at com.typesafe.config.impl.SimpleConfig.findKey(SimpleConfig.java:124)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:147)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:159)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:164)
at com.typesafe.config.impl.SimpleConfig.getString(SimpleConfig.java:206)
at com.paypal.cfs.fpti.Main$.main(Main.scala:42)
at com.paypal.cfs.fpti.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:750)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Looking at the logs, I do see that --files works; it seems like a classpath issue...
18/03/13 01:08:30 INFO SparkContext: Added file file:/home/user/conf/application.conf at file:/home/user/conf/application.conf with timestamp 1520928510820
18/03/13 01:08:30 INFO Utils: Copying /home/user/conf/application.conf to /tmp/spark-2938fde1-fa4a-47af-8dc6-1c54b5e89d48/userFiles-c2cec57f-18c8-491d-8679-df7e7da45e05/application.conf

Turns out I was pretty close to the answer to begin with... here is how it worked for me:
spark-submit \
--class com.myapp.Main \
--conf spark.shuffle.service.enabled=true \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.minExecutors=56 \
--conf spark.dynamicAllocation.maxExecutors=1000 \
--driver-class-path $APP_HOME/conf \
--files $APP_HOME/conf/application.conf \
$APP_HOME/my-app-0.0.1-SNAPSHOT.jar
Then $APP_HOME will contain the following:
conf/application.conf
my-app-0.0.1-SNAPSHOT.jar
The trick is to make sure application.conf sits inside a folder and to point --driver-class-path at that folder (not at the .conf files themselves), so Typesafe Config can find it on the driver's classpath.
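For reference, a hedged sketch of the driver side that works whether application.conf ends up on the driver classpath (via --driver-class-path $APP_HOME/conf) or only in the working directory as the copy shipped with --files; the file name and key are taken from the question:
import java.io.File
import com.typesafe.config.ConfigFactory

// prefer the application.conf in the working directory (the --files copy, if present),
// then fall back to whatever Typesafe Config finds on the classpath
val fileConf = ConfigFactory.parseFile(new File("application.conf"))
val config = fileConf.withFallback(ConfigFactory.load()).resolve()
val myProp = config.getString("app.property")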

In order to specify the config file path, you may pass it as an application argument, and then read it from the args variable of the main class.
This is how you would execute the spark-submit command. Note that I've specified the config file after the application jar.
spark-submit \
--class com.myapp.Main \
--conf spark.shuffle.service.enabled=true \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.minExecutors=56 \
--conf spark.dynamicAllocation.maxExecutors=1000 \
my-app-0.0.1-SNAPSHOT.jar $HOME/conf/application.conf
And then, load the config file from the path specified in args(0):
import java.io.File
import com.typesafe.config.ConfigFactory
[...]
val config = ConfigFactory.parseFile(new File(args(0)))
Now you have access to the properties of your application.conf file.
val myProp = config.getString("app.property")
Hope it helps.
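Putting it together, a minimal sketch of such a driver (object name and config key assumed from the question) could look like this:
import java.io.File
import com.typesafe.config.ConfigFactory
import org.apache.spark.sql.SparkSession

object Main {
  def main(args: Array[String]): Unit = {
    // args(0) is the config path passed after the application jar in spark-submit
    val config = ConfigFactory.parseFile(new File(args(0))).resolve()
    val myProp = config.getString("app.property")

    val spark = SparkSession.builder().appName("my-app").getOrCreate()
    // myProp can now be used, e.g. as an input path for spark.read
    spark.stop()
  }
}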

Related

Spark-submit with properties file for python

I am trying to load spark configuration through the properties file using the following command in Spark 2.4.0.
spark-submit --properties-file props.conf sample.py
It gives the following error
org.apache.spark.SparkException: Dynamic allocation of executors requires the external shuffle service. You may enable this through spark.shuffle.service.enabled.
The props.conf file has this
spark.master yarn
spark.submit.deployMode client
spark.authenticate true
spark.sql.crossJoin.enabled true
spark.dynamicAllocation.enabled true
spark.driver.memory 4g
spark.driver.memoryOverhead 2048
spark.executor.memory 2g
spark.executor.memoryOverhead 2048
Now, when I run the same job by passing all the settings on the command line itself, it works fine.
spark2-submit \
--conf spark.master=yarn \
--conf spark.submit.deployMode=client \
--conf spark.authenticate=true \
--conf spark.sql.crossJoin.enabled=true \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.driver.memory=4g \
--conf spark.driver.memoryOverhead=2048 \
--conf spark.executor.memory=2g \
--conf spark.executor.memoryOverhead=2048 \
sample.py
This works as expected.
Note that --properties-file replaces the default $SPARK_HOME/conf/spark-defaults.conf rather than being merged with it, so cluster defaults such as spark.shuffle.service.enabled=true are no longer applied, which is why dynamic allocation now complains. One workaround is to make the change in $SPARK_HOME/conf/spark-defaults.conf itself, which Spark loads automatically.
See https://spark.apache.org/docs/latest/submitting-applications.html#loading-configuration-from-a-file
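If you prefer to keep --properties-file, the error message itself names the missing setting; assuming the YARN external shuffle service is actually running on the cluster, adding this line to props.conf should be enough:
spark.shuffle.service.enabled true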

How to submit PySpark job on Kubernetes (minikube) using spark-submit

I have a PySpark job locally on my laptop. If I want to submit it to my minikube cluster using spark-submit, any idea how to pass the Python file?
I'm using the following command, but it isn't working:
./spark-submit \
--master k8s://https://192.168.64.6:8443 \
--deploy-mode cluster \
--name amazon-data-review \
--conf spark.kubernetes.namespace=jupyter \
--conf spark.executor.instances=1 \
--conf spark.kubernetes.driver.limit.cores=1 \
--conf spark.executor.cores=1 \
--conf spark.executor.memory=500m \
--conf spark.kubernetes.container.image=prateek/spark-ubuntu-2.4.5 \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.container.image.pullPolicy=Always \
--conf spark.kubernetes.container.image.pullSecrets=dockerlogin \
--conf spark.eventLog.enabled=true \
--conf spark.eventLog.dir=s3a://prateek/spark-hs/ \
--conf spark.hadoop.fs.s3a.access.key=xxxxx \
--conf spark.hadoop.fs.s3a.secret.key=xxxxx \
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
--conf spark.hadoop.fs.s3a.fast.upload=true \
/Users/prateek/apache-spark/amazon_data_review.py
I'm getting the following error:
python3: can't open file '/Users/prateek/apache-spark/amazon_data_review.py': [Errno 2] No such file or directory
Is it required to keep the file within the Docker image itself? Can't we run it locally by keeping it on the laptop?
Spark on Kubernetes doesn't support submitting locally stored files with spark-submit in cluster mode.
What you could do to make it work in cluster mode is to build a Spark Docker image based on prateek/spark-ubuntu-2.4.5 with amazon_data_review.py put inside of it (e.g. using a Docker COPY /Users/prateek/apache-spark/amazon_data_review.py /amazon_data_review.py statement).
Then just refer to it in the spark-submit command using the local:// scheme, e.g.:
spark-submit \
--master ... \
--conf ... \
...
local:///amazon_data_review.py
The alternative is to store that file on http(s):// or hdfs://-like accessible location.
It's solved. Running it in client mode worked:
--deploy-mode client

java.lang.ClassNotFoundException: org.apache.spark.deploy.kubernetes.submit.Client

I am running a sample Spark job in a Kubernetes cluster with the following command:
bin/spark-submit \
--deploy-mode cluster \
--class org.apache.spark.examples.SparkPi \
--master k8s://https://XXXXX \
--kubernetes-namespace sidartha-spark-cluster \
--conf spark.executor.instances=2 \
--conf spark.app.name=spark-pi \
--conf spark.kubernetes.driver.docker.image=kubespark/spark-driver:v2.1.0-kubernetes-0.1.0-rc1 \
--conf spark.kubernetes.executor.docker.image=kubespark/spark-executor:v2.1.0-kubernetes-0.1.0-rc1 \
examples/jars/spark-examples_2.11-2.1.0-k8s-0.1.0-SNAPSHOT.jar 1000
I am building Spark from apache-spark-on-k8s.
I am not able to find the jar containing the org.apache.spark.deploy.kubernetes.submit.Client class.
This issue is resolved. We need to build the resource-managers/kubernetes module from source.

spark throws java.lang.NoClassDefFoundError: kafka/common/TopicAndPartition

When I use the spark-submit command in a Cloudera YARN environment, I get this kind of exception:
java.lang.NoClassDefFoundError: kafka/common/TopicAndPartition
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
at java.lang.Class.getDeclaredMethods(Class.java:1975)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.com$fasterxml$jackson$module$scala$introspect$BeanIntrospector$$listMethods$1(BeanIntrospector.scala:93)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.findMethod$1(BeanIntrospector.scala:99)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.com$fasterxml$jackson$module$scala$introspect$BeanIntrospector$$findGetter$1(BeanIntrospector.scala:124)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$$anonfun$3$$anonfun$apply$5.apply(BeanIntrospector.scala:177)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$$anonfun$3$$anonfun$apply$5.apply(BeanIntrospector.scala:173)
at scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:722)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:721)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$$anonfun$3.apply(BeanIntrospector.scala:173)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$$anonfun$3.apply(BeanIntrospector.scala:172)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.immutable.List.foreach(List.scala:318)
...
The spark-submit command is like:
spark-submit --master yarn-cluster \
--num-executors $2 \
--executor-cores $3 \
--class "APP" \
--deploy-mode cluster \
--properties-file $1 \
--files $HDFS_PATH/log4j.properties,$HDFS_PATH/metrics.properties \
--conf spark.metrics.conf=metrics.properties \
APP.jar
Note that TopicAndPartition.class is in the shaded APP.jar.
Please try adding the Kafka jar using the --jars option as shown in the example below:
spark-submit --master yarn-cluster \
--num-executors $2 \
--executor-cores $3 \
--class "APP" \
--deploy-mode cluster \
--properties-file $1 \
--jars /path/to/kafka.jar \
--files $HDFS_PATH/log4j.properties,$HDFS_PATH/metrics.properties \
--conf spark.metrics.conf=metrics.properties \
APP.jar
After trying several approaches, it turns out that the issue is caused by a version incompatibility. As #user1050619 said, make sure the versions of Kafka, Spark, ZooKeeper and Scala are compatible with each other.
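For example, a hedged build.sbt sketch (artifact and version numbers below are assumptions; match them to your cluster) that keeps the Spark streaming Kafka integration and its Kafka client in step:
// build.sbt -- versions are assumptions, align them with your cluster
scalaVersion := "2.11.8"

val sparkVersion = "2.1.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
  // brings in a kafka_2.11 client that still contains kafka.common.TopicAndPartition
  "org.apache.spark" %% "spark-streaming-kafka-0-8" % sparkVersion
)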

custom log using spark

I'm trying to configure custom logging using spark-submit; this is my configuration:
driver:
-DlogsPath=/var/opt/log\
-DlogsFile=spark-submit-driver.log\
-Dlog4j.configuration=jar:file:../bin/myapp.jar!/log4j.properties\
spark.driver.extraJavaOptions -> -DlogsPath=/var/opt/log -DlogsFile=spark-submit-driver.log -Dlog4j.configuration=jar:file:../bin/myapp.jar!/log4j.properties
executor:
-DlogsPath=/var/opt/log\
-DlogsFile=spark-submit-executor.log\
-Dlog4j.configuration=jar:file:../bin/myapp.jar!/log4j.properties\
spark.executor.extraJavaOptions -> -DlogsPath=/var/opt/log -DlogsFile=spark-submit-executor.log -Dlog4j.configuration=jar:file:../bin/myapp.jar!/log4j.properties
The spark-submit-driver.log is created and filled fine, but spark-submit-executor.log is not created.
Any idea?
Please try passing the log4j configuration explicitly when running your job through spark-submit.
Example:
spark-submit --class com.something.Driver \
--master yarn \
--driver-memory 1g \
--executor-memory 1g \
--driver-java-options '-Dlog4j.configuration=file:/absolute path to log4j property file/log4j.properties' \
--conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/absolute path to log4j property file/log4j.properties" \
jarfilename.jar
Note: you have to define both properties, --driver-java-options and --conf spark.executor.extraJavaOptions; you can also use the default log4j.properties.
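For completeness, a hedged sketch of how application code picks up that configuration through log4j (the object name here is made up); anything logged this way on the driver or executors goes through the log4j.properties passed above:
import org.apache.log4j.Logger

object Driver {
  private val log = Logger.getLogger(getClass.getName)

  def main(args: Array[String]): Unit = {
    // this message is written by the appenders defined in the custom log4j.properties
    log.info("driver started")
  }
}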
Please try to use
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/Users/feng/SparkLog4j/SparkLog4jTest/target/log4j2.properties"
or
--files /Users/feng/SparkLog4j/SparkLog4jTest/target/log4j2.properties
The spark-submit below works for me:
bin/spark-submit \
--class com.viaplay.log4jtest.log4jtest \
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/Users/feng/SparkLog4j/SparkLog4jTest/target/log4j2.properties" \
--master local[*] \
/Users/feng/SparkLog4j/SparkLog4jTest/target/SparkLog4jTest-1.0-jar-with-dependencies.jar
