Spark 3.3.1 is throwing error "could not find file" - apache-spark

I am getting the below error even though I am running as administrator:
C:\Spark\spark-3.3.1-bin-hadoop3\bin>
C:\Spark\spark-3.3.1-bin-hadoop3\bin>spark-shell --packages io.delta:delta-core_2.12:1.2.1,org.apache.hadoop:hadoop-aws:3.3.1 --conf spark.hadoop.fs.s3a.access.key=<my key> --conf spark.hadoop.fs.s3a.secret.key=<my secret> --conf "spark.hadoop.fs.s3a.endpoint=<my endpoint> --conf "spark.databricks.delta.retentionDurationCheck.enabled=false" --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
"C:\Program Files\Eclipse Adoptium\jdk-11.0.13.8-hotspot\bin\java" -cp "C:\Spark\spark-3.2.3-bin-hadoop3.2\bin\..\conf\;C:\Spark\spark-3.2.3-bin-hadoop3.2\jars\*" "-Dscala.usejavacp=true" -Xmx1g org.apache.spark.deploy.SparkSubmit --conf "spark.hadoop.fs.s3a.endpoint=<my endpoint> --conf spark.databricks.delta.retentionDurationCheck.enabled=false --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog > C:\Users\shari\AppData\Local\Temp\spark-class-launcher-output-8183.txt" --conf "spark.hadoop.fs.s3a.secret.key=<my secret>" --conf "spark.hadoop.fs.s3a.access.key=<my key>" --class org.apache.spark.repl.Main --name "Spark shell" --packages "io.delta:delta-core_2.12:1.2.1,org.apache.hadoop:hadoop-aws:3.3.1" spark-shell
The system cannot find the file C:\Users\shari\AppData\Local\Temp\spark-class-launcher-output-8183.txt.
Could Not Find C:\Users\shari\AppData\Local\Temp\spark-class-launcher-output-8183.txt
However, if I execute just spark-shell it works.
Can anyone please help me with it?
OS: Windows 11
Spark: Apache Spark 3.3.1
Java: openjdk version "11.0.13" 2021-10-19
Thanks
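Reading the launcher line that cmd echoes back, the redirect target C:\Users\shari\AppData\Local\Temp\spark-class-launcher-output-8183.txt has been swallowed into the quoted spark.hadoop.fs.s3a.endpoint value, which suggests the opening double quote before that --conf is never closed. As a comparison only, a sketch of the same command with that quote closed (placeholders kept exactly as in the question):
spark-shell --packages io.delta:delta-core_2.12:1.2.1,org.apache.hadoop:hadoop-aws:3.3.1 --conf spark.hadoop.fs.s3a.access.key=<my key> --conf spark.hadoop.fs.s3a.secret.key=<my secret> --conf "spark.hadoop.fs.s3a.endpoint=<my endpoint>" --conf "spark.databricks.delta.retentionDurationCheck.enabled=false" --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"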

Related

SparkSubmit launched by Kubernetes Spark Operator stuck with no error details

In K8S Spark Operator, submitted jobs are getting stuck at a Java thread, at the following command, with no error details:
/opt/tools/Linux/jdk/openjdk1.8.0.332_8.62.0.20_x64/bin/java -cp /opt/spark/conf/:/opt/spark/jars/ org.apache.spark.deploy.SparkSubmit* --master k8s://https://x.y.z.a:443 --deploy-mode cluster --conf spark.kubernetes.container.image.pullPolicy=IfNotPresent --conf spark.executor.memory=512m --conf spark.driver.memory=512m --conf spark.network.crypto.enabled=true --conf spark.driver.cores=0.100000 --conf spark.io.encryption.enabled=true --conf spark.kubernetes.driver.limit.cores=200m --conf spark.kubernetes.driver.label.version=3.0.1 --conf spark.app.name=sparkimpersonationx42aa8bff --conf spark.kubernetes.submission.waitAppCompletion=false --conf spark.executor.cores=1 --conf spark.authenticate=true --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/launched-by-spark-operator=true --conf spark.kubernetes.namespace=abc --conf spark.kubernetes.container.image=placeholder:94 --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/launched-by-spark-operator=true --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/submission-id=b651fb42-90fd-4675-8e2f-9b4b6e380010 --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/app-name=sparkimpersonationx42aa8bff --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/submission-id=b651fb42-90fd-4675-8e2f-9b4b6e380010 --conf spark.kubernetes.driver.pod.name=sparkimpersonationx42aa8bff-driver --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-driver-abc --conf spark.executor.instances=1 --conf spark.kubernetes.executor.label.version=3.0.1 --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/app-name=sparkimpersonationx42aa8bff --class org.apache.spark.examples.SparkPi --jars local:///sample-apps/sample-basic-spark-operator/extra-jars/* local:///sample-apps/sample-basic-spark-operator/sample-basic-spark-operator.jar
From the available information, the causes for this can be:
The workload pods cannot get scheduled on your k8s nodes. You can check this with kubectl get pods (see the commands sketched below); are the pods Running?
The resource limits have been reached and the pods are unresponsive.
The spark-operator itself might not be running. You should check the operator's own logs.
That's all I can say from what is available.
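A minimal set of commands for those checks (a sketch; the namespace and driver pod name come from the submit above, while the operator's deployment name and namespace are assumptions):
kubectl get pods -n abc                                          # are the driver/executor pods Running or Pending?
kubectl describe pod sparkimpersonationx42aa8bff-driver -n abc   # scheduling events and resource-limit messages
kubectl logs deployment/spark-operator -n spark-operator         # the operator's own logs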

Spark-submit with properties file for python

I am trying to load Spark configuration through a properties file, using the following command in Spark 2.4.0.
spark-submit --properties-file props.conf sample.py
It gives the following error
org.apache.spark.SparkException: Dynamic allocation of executors requires the external shuffle service. You may enable this through spark.shuffle.service.enabled.
The props.conf file has this
spark.master yarn
spark.submit.deployMode client
spark.authenticate true
spark.sql.crossJoin.enabled true
spark.dynamicAllocation.enabled true
spark.driver.memory 4g
spark.driver.memoryOverhead 2048
spark.executor.memory 2g
spark.executor.memoryOverhead 2048
Now, when I try to run the same by adding all arguments to the command itself, it works fine.
spark2-submit \
--conf spark.master=yarn \
--conf spark.submit.deployMode=client \
--conf spark.authenticate=true \
--conf spark.sql.crossJoin.enabled=true \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.driver.memory=4g \
--conf spark.driver.memoryOverhead=2048 \
--conf spark.executor.memory=2g \
--conf spark.executor.memoryOverhead=2048 \
sample.py
This works as expected.
spark-submit does support --properties-file (see the link below), but the file replaces $SPARK_HOME/conf/spark-defaults.conf rather than being merged with it, so every setting the job needs has to be present in it. One workaround is making the change in $SPARK_HOME/conf/spark-defaults.conf, which Spark will auto-load; a properties-file sketch that addresses the error follows below.
You can refer to https://spark.apache.org/docs/latest/submitting-applications.html#loading-configuration-from-a-file
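For the properties-file route, the error message itself names the missing setting: dynamic allocation requires the external shuffle service. A sketch of props.conf with that one line added (assuming the external shuffle service is actually running on the YARN NodeManagers):
spark.master yarn
spark.submit.deployMode client
spark.authenticate true
spark.sql.crossJoin.enabled true
spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true
spark.driver.memory 4g
spark.driver.memoryOverhead 2048
spark.executor.memory 2g
spark.executor.memoryOverhead 2048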

can't run example, error '-Xmx1g org.apache.spark.deploy.SparkSubmit'

I'm trying to replicate the wordcount example from the Cloudera website:
spark-submit --master yarn --deploy-mode client --conf "spark.dynamicAllocation.enabled=false" --jars SPARK_HOME/examples/jars/spark-examples_2.11-2.4.6.jar cloudera_analyze.py zookeeper_server:2181 transaction
But it gives me this error:
C:\openjdk\jdk8025209hotspot\bin\java -cp "C:\spark246hadoop27/conf\;C:\spark246hadoop27\jars\*" -Xmx1g org.apache.spark.deploy.SparkSubmit --master yarn --deploy-mode client --conf "spark.dynamicAllocation.enabled=false" --jars SPARK_HOME/examples/jars/spark-examples_2.11-2.4.6.jar cloudera_analyze.py zookeeper_server:2181 transaction
What could be the reason?
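One detail visible in the echoed command: the --jars argument still contains the literal text SPARK_HOME, so the environment variable was never expanded. A sketch of the same command with the variable referenced explicitly (the echoed paths look like Windows cmd, where the syntax would be %SPARK_HOME%\examples\jars\...; in a POSIX shell it is $SPARK_HOME):
spark-submit --master yarn --deploy-mode client --conf "spark.dynamicAllocation.enabled=false" --jars $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.6.jar cloudera_analyze.py zookeeper_server:2181 transaction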

Kubernetes spark-submit in cluster mode --packages not working as expected

I am trying to submit a Spark job to a Kubernetes cluster in cluster mode, from a client inside the cluster, with the --packages attribute so that dependencies are downloaded by the driver and executors, but it is not working. It refers to paths on the submitting client. (kubectl proxy is on.)
Here are the submit options:
/usr/local/bin/spark-submit \
--verbose \
--master=k8s://http://127.0.0.1:8001 \
--deploy-mode cluster \
--class org.apache.spark.examples.SparkPi \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.namespace=spark \
--conf spark.kubernetes.container.image= <...> \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.pyspark.pythonVersion=3 \
--conf spark.kubernetes.driver.secretKeyRef.AWS_ACCESS_KEY_ID=datazone-s3-secret:AWS_ACCESS_KEY_ID \
--conf spark.kubernetes.driver.secretKeyRef.AWS_SECRET_ACCESS_KEY=datazone-s3-secret:AWS_SECRET_ACCESS_KEY \
--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 \
s3.py 10
In the logs I can see that the packages are referring to my local file system.
Spark config:
(spark.kubernetes.namespace,spark)
(spark.jars,file:///Users/<my username>/.ivy2/jars/com.amazonaws_aws-java-sdk-1.7.4.jar,file:///Users/<my username>/.ivy2/jars/org.apache.hadoop_hadoop-aws-2.7.3.jar,file:///Users/<my username>/.ivy2/jars/joda-time_joda-time-2.10.5.jar, ....
Has anyone faced this problem?
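One commonly suggested workaround (a sketch, not a verified fix for this setup) is to drop --packages, bake the needed jars into the container image, and reference them with local:// URIs, which Spark on Kubernetes resolves inside the container rather than on the submitting client. The jar locations below are assumptions about where they were copied in the image; the remaining --conf flags stay as in the original command:
/usr/local/bin/spark-submit \
--master=k8s://http://127.0.0.1:8001 \
--deploy-mode cluster \
--conf spark.kubernetes.namespace=spark \
--conf spark.kubernetes.container.image=<image with the jars baked in> \
--jars local:///opt/spark/extra-jars/aws-java-sdk-1.7.4.jar,local:///opt/spark/extra-jars/hadoop-aws-2.7.3.jar \
s3.py 10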

Why are resources being shared among exclusive node labels?

I have 5 different node labels; all of them are exclusive and each belongs to the same queue. When I submit Spark jobs on different node labels, resources are being shared between them, but that should not be the case when using exclusive node labels.
What could be the possible reason?
HDP Version - HDP-3.1.0.0
My spark-submit command:
bash $SPARK_HOME/bin/spark-submit --packages net.java.dev.jets3t:jets3t:0.9.0,com.google.guava:guava:16.0.1,com.amazonaws:aws-java-sdk:1.7.4,com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0 --master yarn --queue prodQueue --conf "spark.executor.extraJavaOptions= -XX:SurvivorRatio=16 -XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintReferenceGC -XX:+PrintAdaptiveSizePolicy -XX:MaxDirectMemorySize=4g -XX:NewRatio=1" --conf spark.hadoop.yarn.timeline-service.enabled=false --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.yarn.executor.memoryOverhead=2048 --conf spark.task.maxFailures=40 --conf spark.kryoserializer.buffer.max=8m --conf spark.driver.memory=3g --conf spark.shuffle.sort.bypassMergeThreshold=5000 --conf spark.executor.heartbeatInterval=60s --conf spark.memory.storageFraction=0.20 --conf spark.ui.port=7070 --conf spark.reducer.maxReqsInFlight=10 --conf spark.scheduler.mode=FAIR --conf spark.port.maxRetries=100 --conf spark.yarn.max.executor.failures=280 --conf spark.shuffle.service.enabled=true --conf spark.cleaner.ttl=600 --executor-cores 2 --executor-memory 6g --num-executors 8 --conf spark.yarn.am.nodeLabelExpression=amNodeLabel --conf spark.yarn.executor.nodeLabelExpression=myNodeLabel my-application.jar
Thanks for helping.
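For what it's worth, label exclusivity and the queue/label mapping can be checked from the YARN CLI before digging into the scheduler config (a sketch; the queue name is taken from the command above):
yarn cluster --list-node-labels     # each label should report exclusivity=true
yarn queue -status prodQueue        # accessible node labels and configured capacities for the queue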
