Can't run example, error '-Xmx1g org.apache.spark.deploy.SparkSubmit' - apache-spark

I'm trying to replicate the WordCount example from the Cloudera website:
spark-submit --master yarn --deploy-mode client --conf
"spark.dynamicAllocation.enabled=false" --jars
SPARK_HOME/examples/jars/spark-examples_2.11-2.4.6.jar
cloudera_analyze.py zookeeper_server:2181 transaction
But it gives me this error:
C:\openjdk\jdk8025209hotspot\bin\java -cp "C:\spark246hadoop27/conf\;C:\spark246hadoop27\jars\*"
-Xmx1g org.apache.spark.deploy.SparkSubmit --master yarn --deploy-mode client --conf
"spark.dynamicAllocation.enabled=false" --jars
SPARK_HOME/examples/jars/spark-examples_2.11-2.4.6.jar cloudera_analyze.py zookeeper_server:2181 transaction
What could be the reason?
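For reference, the output shown is not a stack trace; it is the java launcher command that spark-submit on Windows builds from the given arguments. One thing that stands out, assuming SPARK_HOME was meant to be an environment variable, is that it is passed literally and never expanded, so the --jars path does not exist. A hedged sketch of the command with the variable referenced explicitly on Windows (the installation layout is an assumption):
spark-submit --master yarn --deploy-mode client --conf "spark.dynamicAllocation.enabled=false" --jars %SPARK_HOME%\examples\jars\spark-examples_2.11-2.4.6.jar cloudera_analyze.py zookeeper_server:2181 transaction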

Related

Spark 3.3.1 is throwing error 'could not find file'

I am getting the below error even though I am running as administrator:
C:\Spark\spark-3.3.1-bin-hadoop3\bin>
C:\Spark\spark-3.3.1-bin-hadoop3\bin>spark-shell --packages io.delta:delta-core_2.12:1.2.1,org.apache.hadoop:hadoop-aws:3.3.1 --conf spark.hadoop.fs.s3a.access.key=<my key> --conf spark.hadoop.fs.s3a.secret.key=<my secret> --conf "spark.hadoop.fs.s3a.endpoint=<my endpoint> --conf "spark.databricks.delta.retentionDurationCheck.enabled=false" --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
"C:\Program Files\Eclipse Adoptium\jdk-11.0.13.8-hotspot\bin\java" -cp "C:\Spark\spark-3.2.3-bin-hadoop3.2\bin\..\conf\;C:\Spark\spark-3.2.3-bin-hadoop3.2\jars\*" "-Dscala.usejavacp=true" -Xmx1g org.apache.spark.deploy.SparkSubmit --conf "spark.hadoop.fs.s3a.endpoint=<my endpoint> --conf spark.databricks.delta.retentionDurationCheck.enabled=false --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog > C:\Users\shari\AppData\Local\Temp\spark-class-launcher-output-8183.txt" --conf "spark.hadoop.fs.s3a.secret.key=<my secret>" --conf "spark.hadoop.fs.s3a.access.key=<my key>" --class org.apache.spark.repl.Main --name "Spark shell" --packages "io.delta:delta-core_2.12:1.2.1,org.apache.hadoop:hadoop-aws:3.3.1" spark-shell
The system cannot find the file C:\Users\shari\AppData\Local\Temp\spark-class-launcher-output-8183.txt.
Could Not Find C:\Users\shari\AppData\Local\Temp\spark-class-launcher-output-8183.txt
However, if I execute just spark-shell it works.
Can anyone please help me with it?
OS: Windows 11
Spark: Apache Spark 3.3.1
Java: openjdk version "11.0.13" 2021-10-19
Thanks.
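For what it is worth, the --conf "spark.hadoop.fs.s3a.endpoint=<my endpoint> option in the command above appears to be missing its closing double quote. Because the quote is never closed, the launcher's output redirect (the > ...spark-class-launcher-output-8183.txt part visible in the expanded java command) gets swallowed into that single conf value, the temp file is never written, and the script then reports that it cannot find it. A hedged sketch of the command with the quote closed (placeholders kept as in the original):
spark-shell --packages io.delta:delta-core_2.12:1.2.1,org.apache.hadoop:hadoop-aws:3.3.1 ^
  --conf spark.hadoop.fs.s3a.access.key=<my key> ^
  --conf spark.hadoop.fs.s3a.secret.key=<my secret> ^
  --conf "spark.hadoop.fs.s3a.endpoint=<my endpoint>" ^
  --conf "spark.databricks.delta.retentionDurationCheck.enabled=false" ^
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" ^
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"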

Load properties file in Spark classpath during spark-submit execution

I'm installing the Spark Atlas Connector via a spark-submit script (https://github.com/hortonworks-spark/spark-atlas-connector).
Due to security restrictions, I can't put the atlas-application.properties file in the spark/conf directory.
I used these two options in the spark-submit:
--driver-class-path "spark.driver.extraClassPath=hdfs:///directory_to_properties_files" \
--conf "spark.executor.extraClassPath=hdfs:///directory_to_properties_files" \
When I launch the spark-submit, I encounter this issue:
20/07/20 11:32:50 INFO ApplicationProperties: Looking for atlas-application.properties in classpath
20/07/20 11:32:50 INFO ApplicationProperties: Looking for /atlas-application.properties in classpath
20/07/20 11:32:50 INFO ApplicationProperties: Loading atlas-application.properties from null
Please refer to the CDP Atlas configuration article:
https://community.cloudera.com/t5/Community-Articles/How-to-pass-atlas-application-properties-configuration-file/ta-p/322158
Client Mode:
spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client --driver-java-options="-Datlas.conf=/tmp/" /opt/cloudera/parcels/CDH/jars/spark-examples*.jar 10
Cluster Mode:
sudo -u spark spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --files /tmp/atlas-application.properties --conf spark.driver.extraJavaOptions="-Datlas.conf=./" /opt/cloudera/parcels/CDH/jars/spark-examples*.jar 10
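Applied to the spark-atlas-connector job from the question, a hedged sketch for cluster mode (the application file and properties path are placeholders): ship the properties file with --files so it lands in each container's working directory, and point atlas.conf at that directory:
spark-submit --master yarn --deploy-mode cluster \
  --files /path/to/atlas-application.properties \
  --conf spark.driver.extraJavaOptions="-Datlas.conf=./" \
  your_application.jar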

Pass arguments from a file to multiple spark jobs

Is it possible to have one master file that stores a list of arguments that can be referenced from a spark-submit command?
Example of the properties file, configurations.txt (does not have to be .txt):
school_library = "central"
school_canteen = "Nothernwall"
Expected requirement:
Calling it in one spark-submit:
spark-submit --master yarn \
--deploy-mode cluster \
--jars sample.jar \
/home/user/helloworld.py configurations.school_library
Calling it in another spark-submit:
spark-submit --master yarn \
--deploy-mode cluster \
--jars sample.jar \
/home/user/helloworld.py configurations.school_canteen
Calling both in another spark-submit:
spark-submit --master yarn \
--deploy-mode cluster \
--jars sample.jar \
/home/user/helloworld.py configurations.school_library configurations.school_canteen
Yes.
You can do that with the --files option.
For example, you are submitting a Spark job with a config file /data/config.conf:
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
--executor-memory 20G \
--num-executors 50 \
--files /data/config.conf \
/path/to/examples.jar
This file will be uploaded and placed in the working directory of the driver, so you have to access it by its name.
Ex:
new FileInputStream("config.conf")
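For a Python application such as the helloworld.py from the question, a minimal sketch of the same idea, assuming configurations.txt holds key = "value" lines as shown above and is shipped with --files configurations.txt (the parsing logic is an illustration, not part of Spark):
# helloworld.py (sketch): read keys from the configurations file shipped via --files
import sys

def read_config(path):
    # Parse simple key = "value" lines into a dict
    conf = {}
    with open(path) as f:
        for line in f:
            if "=" not in line:
                continue
            key, value = line.split("=", 1)
            conf[key.strip()] = value.strip().strip('"')
    return conf

if __name__ == "__main__":
    conf = read_config("configurations.txt")   # available by name in the working directory
    for arg in sys.argv[1:]:                   # e.g. configurations.school_library
        key = arg.split(".", 1)[-1]            # drop the "configurations." prefix
        print(key, "=", conf.get(key))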
The spark-submit parameter "--properties-file" can also be used.
Property names have to start with the "spark." prefix, for example:
spark.mykey=myvalue
Values in this case are extracted from the configuration (SparkConf).
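A minimal sketch of that approach, assuming a hypothetical properties file myjob.properties containing the line spark.mykey=myvalue and a hypothetical PySpark app my_app.py:
# my_app.py, submitted with:
#   spark-submit --properties-file myjob.properties my_app.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# entries from --properties-file are loaded into SparkConf before the app starts
print(spark.sparkContext.getConf().get("spark.mykey"))   # prints "myvalue"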

spark-submit yarn java.lang.ClassNotFoundException:com.microsoft.sqlserver.jdbc.SQLServerDriver

spark-submit --master yarn --deploy-mode cluster sqlserver.py --jars sqljdbc42.jar
I get this error:
java.lang.ClassNotFoundException:com.microsoft.sqlserver.jdbc.SQLServerDriver
Everything works when I use --deploy-mode client and copy the jar file to /usr/hdp/current/spark2-client/jars/sqljdbc42.jar.
Should I copy the sqljdbc42.jar to /usr/hdp/current/hadoop-yarn-client/lib/ on all data nodes?
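One thing worth checking, offered as an assumption about the cause rather than a confirmed fix: spark-submit treats everything after the application file as arguments to that application, so in the command above --jars sqljdbc42.jar is handed to sqlserver.py instead of to spark-submit. With the option placed before the script, YARN should distribute the jar itself, without copying it to every node:
spark-submit --master yarn --deploy-mode cluster --jars sqljdbc42.jar sqlserver.py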

Dynamic Resource Allocation in Spark-YARN Cluster Mode

When I use the below setting to start the Spark application (the default is yarn-client mode), it works fine:
spark_memory_setting="--master yarn --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true --conf spark.yarn.queue=ciqhigh --conf spark.dynamicAllocation.initialExecutors=50 --conf spark.dynamicAllocation.maxExecutors=50 --executor-memory 2G --driver-memory 4G"
ISSUE
Whereas when I change the deploy mode to cluster, the application does not start up. It does not even throw any error to move on.
spark_memory_setting="--master yarn-cluster --deploy-mode=cluster --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true --conf spark.yarn.queue=ciqhigh --conf spark.dynamicAllocation.initialExecutors=50 --conf spark.dynamicAllocation.maxExecutors=50 --executor-memory 2G --driver-memory 4G"
LOG
18/01/08 01:21:00 WARN Client: spark.yarn.am.extraJavaOptions will not
take effect in cluster mode
This is the last line from the logger.
Any suggestions are most welcome.
One important thing to highlight here: the Spark application I am trying to deploy starts the Apache Thrift server. After my searching, I think it is because of Thrift that I am not able to run YARN in cluster mode. Any help to run it in cluster mode would be appreciated.
The option --master yarn-cluster is wrong; it is not a valid master URL. It should be just "yarn" instead of "yarn-cluster". Please cross-check.
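A hedged sketch of the corrected setting, keeping the remaining options from the question:
spark_memory_setting="--master yarn --deploy-mode cluster --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true --conf spark.yarn.queue=ciqhigh --conf spark.dynamicAllocation.initialExecutors=50 --conf spark.dynamicAllocation.maxExecutors=50 --executor-memory 2G --driver-memory 4G"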
