Spark-submit with properties file for python - apache-spark

I am trying to load the Spark configuration from a properties file, using the following command in Spark 2.4.0:
spark-submit --properties-file props.conf sample.py
It gives the following error:
org.apache.spark.SparkException: Dynamic allocation of executors requires the external shuffle service. You may enable this through spark.shuffle.service.enabled.
The props.conf file contains:
spark.master yarn
spark.submit.deployMode client
spark.authenticate true
spark.sql.crossJoin.enabled true
spark.dynamicAllocation.enabled true
spark.driver.memory 4g
spark.driver.memoryOverhead 2048
spark.executor.memory 2g
spark.executor.memoryOverhead 2048
Now, when I run the same job by passing all the settings as command-line arguments instead, like this:
spark2-submit \
--conf spark.master=yarn \
--conf spark.submit.deployMode=client \
--conf spark.authenticate=true \
--conf spark.sql.crossJoin.enabled=true \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.driver.memory=4g \
--conf spark.driver.memoryOverhead=2048 \
--conf spark.executor.memory=2g \
--conf spark.executor.memoryOverhead=2048 \
sample.py
This works as expected.

I don't think Spark supports --properties-file; one workaround is to make the changes in $SPARK_HOME/conf/spark-defaults.conf, which Spark loads automatically.
You can refer to https://spark.apache.org/docs/latest/submitting-applications.html#loading-configuration-from-a-file
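Whichever file the settings end up in, the error itself names the missing piece: dynamic allocation requires the external shuffle service to be enabled alongside it. A sketch of the relevant lines, in the same format as props.conf above:
spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true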

Related

Can we have multiple executors in Spark with master local[*] and deploy mode client?

I have a 1-node Hadoop cluster, and I am submitting a Spark job like this:
spark-submit \
--class com.compq.scriptRunning \
--master local[*] \
--deploy-mode client \
--num-executors 3 \
--executor-cores 4 \
--executor-memory 21g \
--driver-cores 2 \
--driver-memory 5g \
--conf "spark.local.dir=/data/spark_tmp" \
--conf "spark.sql.shuffle.partitions=2000" \
--conf "spark.sql.inMemoryColumnarStorage.compressed=true" \
--conf "spark.sql.autoBroadcastJoinThreshold=200000" \
--conf "spark.speculation=false" \
--conf "spark.hadoop.mapreduce.map.speculative=false" \
--conf "spark.hadoop.mapreduce.reduce.speculative=false" \
--conf "spark.ui.port=8099" \
.....
Though I ask for 3 executors, I see only 1 executor in the Spark UI running the whole time. Can we have multiple executors running in parallel with
--master local[*] \
--deploy-mode client \
It's an on-prem, plain open-source Hadoop flavor installed on the cluster.
I tried changing the master from local to local[*] and playing around with deploy modes; still, I could see only 1 executor running in the Spark UI.
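For context: with --master local[*], Spark runs the driver and a single executor inside one JVM, so --num-executors is ignored and parallelism comes from threads (one per logical core with local[*]). Getting multiple executor processes requires a real cluster manager; a minimal sketch against the cluster's own YARN (assuming YARN is running on the node, and keeping the resource sizes from the question, which may need trimming to fit one node):
spark-submit \
--class com.compq.scriptRunning \
--master yarn \
--deploy-mode client \
--num-executors 3 \
--executor-cores 4 \
--executor-memory 21g \
.....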

How to submit PySpark job on Kubernetes (minikube) using spark-submit

I have a PySpark job present locally on my laptop. If I want to submit it to my minikube cluster using spark-submit, any idea how to pass the Python file?
I'm using the following command, but it isn't working:
./spark-submit \
--master k8s://https://192.168.64.6:8443 \
--deploy-mode cluster \
--name amazon-data-review \
--conf spark.kubernetes.namespace=jupyter \
--conf spark.executor.instances=1 \
--conf spark.kubernetes.driver.limit.cores=1 \
--conf spark.executor.cores=1 \
--conf spark.executor.memory=500m \
--conf spark.kubernetes.container.image=prateek/spark-ubuntu-2.4.5 \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.container.image.pullPolicy=Always \
--conf spark.kubernetes.container.image.pullSecrets=dockerlogin \
--conf spark.eventLog.enabled=true \
--conf spark.eventLog.dir=s3a://prateek/spark-hs/ \
--conf spark.hadoop.fs.s3a.access.key=xxxxx \
--conf spark.hadoop.fs.s3a.secret.key=xxxxx \
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
--conf spark.hadoop.fs.s3a.fast.upload=true \
/Users/prateek/apache-spark/amazon_data_review.py
Getting the following error:
python3: can't open file '/Users/prateek/apache-spark/amazon_data_review.py': [Errno 2] No such file or directory
Is it required to keep the file within the Docker image itself? Can't we run it by keeping it locally on the laptop?
Spark on Kubernetes doesn't support submitting locally stored files with spark-submit.
What you could do to make it work in cluster mode is to build a Spark Docker image based on prateek/spark-ubuntu-2.4.5 with amazon_data_review.py put inside of it, e.g. using a Docker COPY statement, as sketched below.
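A minimal Dockerfile sketch (note that COPY sources resolve relative to the Docker build context, so build from the directory containing the script):
FROM prateek/spark-ubuntu-2.4.5
# build context is assumed to be /Users/prateek/apache-spark, where the script lives
COPY amazon_data_review.py /amazon_data_review.py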
Then just refer to it in the spark-submit command using the local:// scheme, e.g.:
spark-submit \
--master ... \
--conf ... \
...
local:///amazon_data_review.py
The alternative is to store the file in an http(s):// or hdfs://-accessible location.
It's solved. Running it in client mode worked, since the driver then runs on the laptop itself and can read the local path:
--deploy-mode client

Kubernetes spark-submit in cluster mode: --packages not working as expected

I am trying to submit a Spark job to a Kubernetes cluster in cluster mode, from a client in the cluster, with the --packages attribute so that dependencies are downloaded by the driver and executors, but it is not working: it refers to paths on the submitting client. (kubectl proxy is on.)
Here are the submit options:
/usr/local/bin/spark-submit \
--verbose \
--master=k8s://http://127.0.0.1:8001 \
--deploy-mode cluster \
--class org.apache.spark.examples.SparkPi \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.namespace=spark \
--conf spark.kubernetes.container.image= <...> \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.pyspark.pythonVersion=3 \
--conf spark.kubernetes.driver.secretKeyRef.AWS_ACCESS_KEY_ID=datazone-s3-secret:AWS_ACCESS_KEY_ID \
--conf spark.kubernetes.driver.secretKeyRef.AWS_SECRET_ACCESS_KEY=datazone-s3-secret:AWS_SECRET_ACCESS_KEY \
--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 \
s3.py 10
In the logs I can see that the packages refer to my local file system.
Spark config:
(spark.kubernetes.namespace,spark)
(spark.jars,file:///Users/<my username>/.ivy2/jars/com.amazonaws_aws-java-sdk-1.7.4.jar,file:///Users/<my username>/.ivy2/jars/org.apache.hadoop_hadoop-aws-2.7.3.jar,file:///Users/<my username>/.ivy2/jars/joda-time_joda-time-2.10.5.jar, ....
Has anyone faced this problem?
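One workaround sketch: --packages resolves dependencies through Ivy on the submitting machine, and in cluster mode the driver pod cannot see those local jars, so pass the dependencies as remotely accessible jars instead (Maven Central URLs assumed correct for these artifacts), or bake them into the container image:
/usr/local/bin/spark-submit \
--master=k8s://http://127.0.0.1:8001 \
--deploy-mode cluster \
--conf ... \
--jars https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar,https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.3/hadoop-aws-2.7.3.jar \
s3.py 10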

Spark Thrift server queuing up queries

When parallel queries hit the Spark Thrift Server, the Spark UI (JDBC/ODBC Server tab) shows all the queries as started, but they get executed sequentially.
Here's the Thrift Server startup script:
start_thriftserver (){
sudo /usr/lib/spark/sbin/start-thriftserver.sh \
--master yarn \
--deploy-mode client \
--executor-memory 3200m \
--executor-cores 2 \
--driver-memory 4g \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.shuffle.service.enabled=true \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.dynamicAllocation.schedulerBacklogTimeout=1s \
--conf spark.dynamicAllocation.minExecutors=50 \
--conf spark.executor.memoryOverhead=684
}
This is indeed a confusing topic.
spark.sql.hive.thriftServer.singleSession=false
Try this.
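That is, one more flag appended to the start-thriftserver.sh call above:
--conf spark.sql.hive.thriftServer.singleSession=false \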
That said, I am a little sceptical about all this.

Custom logging using spark-submit

I'm trying to configure custom logging using spark-submit; this is my configuration:
driver:
-DlogsPath=/var/opt/log\
-DlogsFile=spark-submit-driver.log\
-Dlog4j.configuration=jar:file:../bin/myapp.jar!/log4j.properties\
spark.driver.extraJavaOptions -> -DlogsPath=/var/opt/log -DlogsFile=spark-submit-driver.log -Dlog4j.configuration=jar:file:../bin/myapp.jar!/log4j.properties
executor:
-DlogsPath=/var/opt/log\
-DlogsFile=spark-submit-executor.log\
-Dlog4j.configuration=jar:file:../bin/myapp.jar!/log4j.properties\
spark.executor.extraJavaOptions -> -DlogsPath=/var/opt/log -DlogsFile=spark-submit-executor.log -Dlog4j.configuration=jar:file:../bin/myapp.jar!/log4j.properties
The spark-submit-driver.log is created and filled fine, but spark-submit-executor.log is not created.
Any idea?
Please try passing the log4j configuration when running your job through spark-submit.
Example:
spark-submit --class com.something.Driver \
--master yarn \
--driver-memory 1g \
--executor-memory 1g \
--driver-java-options '-Dlog4j.configuration=file:/absolute path to log4j property file/log4j.properties' \
--conf spark.executor.extraJavaOptions='-Dlog4j.configuration=file:/absolute path to log4j property file/log4j.properties' \
jarfilename.jar
Note: you have to define both properties, one via --driver-java-options and one via --conf spark.executor.extraJavaOptions; you can also use the default log4j.properties.
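For reference, a minimal log4j.properties sketch that picks up the -DlogsPath and -DlogsFile system properties from the question (log4j 1.x substitutes ${...} from system properties):
# write all INFO+ logs to the file named by -DlogsPath and -DlogsFile
log4j.rootLogger=INFO, file
log4j.appender.file=org.apache.log4j.FileAppender
log4j.appender.file.File=${logsPath}/${logsFile}
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1} - %m%n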
Please try to use
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/Users/feng/SparkLog4j/SparkLog4jTest/target/log4j2.properties"
or
--files /Users/feng/SparkLog4j/SparkLog4jTest/target/log4j2.properties
The submit below works for me:
bin/spark-submit \
--class com.viaplay.log4jtest.log4jtest \
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/Users/feng/SparkLog4j/SparkLog4jTest/target/log4j2.properties" \
--master local[*] \
/Users/feng/SparkLog4j/SparkLog4jTest/target/SparkLog4jTest-1.0-jar-with-dependencies.jar
