Why does spark-shell --master yarn-client fail with "UnknownHostException: Invalid host name"? - apache-spark

This is Spark 1.6.1.
When I do below at spark/bin
$ ./spark-shell --master yarn-client
I get the following error.
I checked hostname at /etc/hosts and also in Hadoop but they are assigned as same hostname. Any idea?

Related

FileNotFound error when running spark-submit

I am trying to run the spark-submit command on my Hadoop cluster
Here is a summary of my Hadoop Cluster:
The cluster is built using 5 VirtualBox VM's connected on an internal network
There is 1 namenode and 4 datanodes created.
All the VM's were built from the Bitnami Hadoop Stack VirtualBox image
When I run the following command:
spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.3.jar 10
I receive the following error:
java.io.FileNotFoundException: File file:/home/bitnami/sparkStaging/bitnami/.sparkStaging/application_1658417340986_0002/__spark_conf__.zip does not exist
I also get a similar error when trying to create a sparkSession using PySpark:
spark = SparkSession.builder.appName('appName').getOrCreate()
I have tried/verified the following
environment variables: HADOOP_HOME, SPARK_HOME AND HADOOP_CONF_DIR have been set in my .bashrc file
SPARK_DIST_CLASSPATH and HADOOP_CONF_DIR have been defined in spark-env.sh
Added spark.master yarn, spark.yarn.stagingDir file:///home/bitnami/sparkStaging and spark.yarn.jars file:///opt/bitnami/hadoop/spark/jars/ in spark-defaults.conf
I believe spark.yarn.stagingDir needs to be an HDFS path.
More specifically, the "YARN Staging directory" needs to be available on all Spark executors, not just a local file path from where you run spark-submit
The path that isn't found is being reported from the YARN cluster, where /home/bitnami might not exist, or the Unix user running the Spark executor containers does not have access to that path.
Similarly, spark.yarn.jars (or spark.yarn.archive) should be HDFS paths because these will get downloaded, in parallel, across all executors.
Since the spark job is supposed to be submitted to the Hadoop cluster managed by YARN, master and deploy-mode has to be set. From the spark 3.3.0 docs:
# Run on a YARN cluster in cluster deploy mode
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
--executor-memory 20G \
--num-executors 50 \
/path/to/examples.jar \
1000
Or programatically:
spark = SparkSession.builder().appName('appName').master("yarn").config("spark.submit.deployMode","cluster").getOrCreate()

Load properties file in Spark classpath during spark-submit execution

I'm installing the Spark Atlas Connector in a spark submit script (https://github.com/hortonworks-spark/spark-atlas-connector)
Due to security restrictions, I can't put the atlas-application.properties in the spark/conf repository.
I used the two options in the spark-submit :
--driver-class-path "spark.driver.extraClassPath=hdfs:///directory_to_properties_files" \
--conf "spark.executor.extraClassPath=hdfs:///directory_to_properties_files" \
When I launch the spark-submit, I encounter this issue :
20/07/20 11:32:50 INFO ApplicationProperties: Looking for atlas-application.properties in classpath
20/07/20 11:32:50 INFO ApplicationProperties: Looking for /atlas-application.properties in classpath
20/07/20 11:32:50 INFO ApplicationProperties: Loading atlas-application.properties from null
Please find CDP Atals Configuration article.
https://community.cloudera.com/t5/Community-Articles/How-to-pass-atlas-application-properties-configuration-file/ta-p/322158
Client Mode:
spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client --driver-java-options="-Datlas.conf=/tmp/" /opt/cloudera/parcels/CDH/jars/spark-examples*.jar 10
Cluster Mode:
sudo -u spark spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --files /tmp/atlas-application.properties --conf spark.driver.extraJavaOptions="-Datlas.conf=./" /opt/cloudera/parcels/CDH/jars/spark-examples*.jar 10

Unable to access Hive table when used yarn-cluster mode

I have kerbores enabled Cloudera cluster. Spark can access Hive table when I use client deployment mode.
I executed kinit command and then executed spark2-submit. Spark can access the Hive table when I use client mode.
spark2-submit --master yarn --deploy-mode client --keytab XXXXXXXXXX.keytab --principal XXXXXXXXXXX#USER.COM --conf "spark.driver.extraJavaOptions=-XX:+UseG1GC -XX:MaxPermSize=1024M -Djava.security.krb5.conf=/etc/krb5.conf" test.jar
But when I use cluster mode, spark gives table not found error.
spark2-submit --master yarn --deploy-mode cluster --keytab XXXXXXXXXX.keytab --principal XXXXXXXXXXX#USER.COM --conf "spark.driver.extraJavaOptions=-XX:+UseG1GC -XX:MaxPermSize=1024M -Djava.security.krb5.conf=/etc/krb5.conf" test.jar

spark-submit on kubernetes cluster does not recognise k8s --master property

I have successfully installed a Kubernetes cluster and can verify this by:
C:\windows\system32>kubectl cluster-info
Kubernetes master is running at https://<ip>:<port>
KubeDNS is running at https://<ip>:<port>/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
Then I am trying to run the SparkPi with the Spark I downloaded from https://spark.apache.org/downloads.html .
spark-submit --master k8s://https://192.168.99.100:8443 --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=2 --conf spark.kubernetes.container.image=gettyimages/spark c:\users\<username>\Desktop\spark-2.4.0-bin-hadoop2.7\examples\jars\spark-examples_2.11-2.4.0.jar
I am getting this error:
Error: Master must either be yarn or start with spark, mesos, local
Run with --help for usage help or --verbose for debug output
I tried versions 2.4.0 and 2.3.3. I also tried
spark-submit --help
to see what I can get regarding the --master property. This is what I get:
--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.
According to the documentation [https://spark.apache.org/docs/latest/running-on-kubernetes.html] on running Spark workloads in Kubernetes, spark-submit does not even seem to recognise the k8s value for master. [ included in possible Spark masters: https://spark.apache.org/docs/latest/submitting-applications.html#master-urls ]
Any ideas? What would I be missing here?
Thanks
Issue was my CMD was recognising a previous spark-submit version I had installed(2.2) even though i was running the command from the bin directory of spark installation.

Erro spark-assembly-1.4.1-hadoop2.6.0.jar does not exist

I'm trying to submit a Spark app from local machine Terminal to my Cluster. I'm using --master yarn-cluster. I need to run the driver program on my Cluster too, not on the machine I do submit the application i.e my local machine
I'm using
bin/spark-submit
--class com.my.application.XApp
--master yarn-cluster --executor-memory 100m
--num-executors 50 hdfs://name.node.server:8020/user/root/x-service-1.0.0-201512141101-assembly.jar
1000
and getting error
Diagnostics: java.io.FileNotFoundException: File
file:/Users/nish1013/Dev/spark-1.4.1-bin-hadoop2.6/lib/spark-assembly-1.4.1-hadoop2.6.0.jar
does not exist
I can see in my service list ,
YARN + MapReduce2 2.7.1.2.3 Apache Hadoop NextGen MapReduce (YARN)
Spark 1.4.1.2.3 Apache Spark is a fast and general engine for
large-scale data processing.
already installed.
My spark-env.sh in local machine
export HADOOP_CONF_DIR=/Users/nish1013/Dev/hadoop-2.7.1/etc/hadoop
Has anyone encountered similar before ?
I think the right command to call is like following:
bin/spark-submit
--class com.my.application.XApp
--master yarn-cluster --executor-memory 100m
--num-executors 50 --conf spark.yarn.jars=hdfs://name.node.server:8020/user/root/x-service-1.0.0-201512141101-assembly.jar
1000
or you can add
spark.yarn.jars hdfs://name.node.server:8020/user/root/x-service-1.0.0-201512141101-assembly.jar
in your spark.default.conf file

Resources