Spark on YARN, but the dataset is on a different HDFS cluster

I want to run Spark on one HDFS cluster (with YARN as the master) but read a dataset from another HDFS cluster.
Both HDFS clusters are Kerberized, and the same ID has access to both.
Steps:
Set up the environment for the first HDFS cluster
spark-shell --master yarn-client
sc.textFile("hdfs://[secondshdfscluster][dataset there]")
res0.count() gives
......
Caused by: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN,KERBEROS]
.....
Is what I am trying even possible? If so, any suggestions on how to fix it?
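A possible direction, sketched under assumptions rather than confirmed for this setup: on YARN, Spark obtains Kerberos delegation tokens only for the filesystems it knows about, so a second secure HDFS cluster usually has to be listed explicitly. The property for that is spark.yarn.access.namenodes up to Spark 2.1, spark.yarn.access.hadoopFileSystems in Spark 2.2 to 2.4, and spark.kerberos.access.hadoopFileSystems from Spark 3.0. The helper names and the nn2.example.com:8020 address below are hypothetical, and the code only assembles command-line arguments; it does not start Spark.

```python
from typing import List


def token_property(spark_version: str) -> str:
    """Pick the property that lists extra secure filesystems for a Spark version."""
    major, minor = (int(part) for part in spark_version.split(".")[:2])
    if major >= 3:
        return "spark.kerberos.access.hadoopFileSystems"
    if (major, minor) >= (2, 2):
        return "spark.yarn.access.hadoopFileSystems"
    return "spark.yarn.access.namenodes"


def extra_token_args(spark_version: str, filesystems: List[str]) -> List[str]:
    """Build the --conf pair that asks Spark to fetch tokens for more filesystems."""
    return ["--conf", f"{token_property(spark_version)}={','.join(filesystems)}"]


# Hypothetical namenode address of the second cluster; replace with the real one.
print(" ".join(extra_token_args("1.6.0", ["hdfs://nn2.example.com:8020"])))
```

The printed arguments would be appended to the spark-shell command from the steps above. Whether this is sufficient also depends on the Kerberos setup of the two clusters (for example cross-realm trust), which the question does not describe.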

Related

FileNotFound error when running spark-submit

I am trying to run the spark-submit command on my Hadoop cluster.
Here is a summary of my Hadoop cluster:
The cluster is built from 5 VirtualBox VMs connected on an internal network.
There is 1 namenode and 4 datanodes.
All the VMs were built from the Bitnami Hadoop Stack VirtualBox image.
When I run the following command:
spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.3.jar 10
I receive the following error:
java.io.FileNotFoundException: File file:/home/bitnami/sparkStaging/bitnami/.sparkStaging/application_1658417340986_0002/__spark_conf__.zip does not exist
I also get a similar error when trying to create a sparkSession using PySpark:
spark = SparkSession.builder.appName('appName').getOrCreate()
I have tried/verified the following:
The environment variables HADOOP_HOME, SPARK_HOME and HADOOP_CONF_DIR are set in my .bashrc file.
SPARK_DIST_CLASSPATH and HADOOP_CONF_DIR are defined in spark-env.sh.
Added spark.master yarn, spark.yarn.stagingDir file:///home/bitnami/sparkStaging and spark.yarn.jars file:///opt/bitnami/hadoop/spark/jars/ to spark-defaults.conf.
I believe spark.yarn.stagingDir needs to be an HDFS path.
More specifically, the "YARN staging directory" needs to be reachable from every node that runs a YARN container (the application master and the Spark executors), not just a local file path on the machine where you run spark-submit.
The path that isn't found is being reported from the YARN cluster, where /home/bitnami might not exist, or where the Unix user running the Spark executor containers does not have access to that path.
Similarly, spark.yarn.jars (or spark.yarn.archive) should be HDFS paths because these will get downloaded, in parallel, across all executors.
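To make the point about local paths concrete, here is a minimal sketch of a checker that reads spark-defaults.conf-style text and reports which of the three properties mentioned above still use a file: URI. The function names are made up for illustration, and the parser assumes whitespace-separated "key value" lines only.

```python
from urllib.parse import urlparse

# Properties whose values the answer above says should live on HDFS.
SHARED_PATH_KEYS = ("spark.yarn.stagingDir", "spark.yarn.jars", "spark.yarn.archive")


def parse_defaults(text):
    """Parse "key value" lines into a dict, skipping blanks and # comments."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            conf[parts[0]] = parts[1].strip()
    return conf


def local_only_settings(conf):
    """Return the shared-path keys that point at a file: URI."""
    flagged = []
    for key in SHARED_PATH_KEYS:
        value = conf.get(key)
        if value is None:
            continue
        # spark.yarn.jars may hold a comma-separated list of locations.
        if any(urlparse(item.strip()).scheme == "file" for item in value.split(",")):
            flagged.append(key)
    return flagged


sample = """
spark.master yarn
spark.yarn.stagingDir file:///home/bitnami/sparkStaging
spark.yarn.jars file:///opt/bitnami/hadoop/spark/jars/
"""
print(local_only_settings(parse_defaults(sample)))
```

With the settings from the question, both spark.yarn.stagingDir and spark.yarn.jars are flagged; pointing them at an hdfs:// location that the submitting user can write to would clear the list.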
Since the Spark job is supposed to be submitted to the Hadoop cluster managed by YARN, the master and the deploy mode have to be set. From the Spark 3.3.0 docs:
# Run on a YARN cluster in cluster deploy mode
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
--executor-memory 20G \
--num-executors 50 \
/path/to/examples.jar \
1000
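Or programmatically (in PySpark, builder is an attribute, not a method call):
spark = SparkSession.builder.appName('appName').master("yarn").config("spark.submit.deployMode", "cluster").getOrCreate()
Note that cluster deploy mode is only honored when the application is launched through spark-submit; a session created directly from a plain Python process has to use client mode.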

Spark Structured Streaming job failing in cluster mode

I am using spark-sql 2.4.1 in my application.
While writing data to an HDFS folder, I am facing this issue in my Spark streaming application.
Error:
yarn.Client: Deleted staging directory hdfs://dev/user/xyz/.sparkStaging/application_1575699597805_47
20/02/24 14:02:15 ERROR yarn.Client: Application diagnostics message: User class threw exception: org.apache.hadoop.security.AccessControlException: Permission denied: user= xyz, access=WRITE, inode="/tmp/hadoop-admin":admin:supergroup:drwxr-xr-x
.
.
.
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=xyz, access=WRITE, inode="/tmp/hadoop-admin":admin:supergroup:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:350)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:251)
I face this issue when I run in yarn-cluster mode, i.e.
--master yarn \
--deploy-mode cluster \
But when I run in yarn-client mode it runs fine, i.e.
--master yarn \
--deploy-mode client \
What is the root cause of this problem?
The fundamental question here: why is it trying to write to "/tmp/hadoop-admin/" instead of the respective user directory, i.e. hdfs://qa2/user/xyz/?
I have come across this fix:
https://issues.apache.org/jira/browse/SPARK-26825
How can I implement it in my spark-sql application?
The only difference between the working --deploy-mode client case and the failing --deploy-mode cluster case is the location of the driver. In client deploy mode, the driver runs on the machine where you execute spark-submit (usually an edge node that is configured to use the YARN cluster but is not part of it), while in cluster deploy mode the driver runs inside the YARN cluster (on one of the nodes under YARN's control).
It looks like you've got a misconfigured edge node.
I would not be surprised if this had nothing to do with the streaming query (Spark Structured Streaming): a regular Spark SQL-only application, or any other Spark application, would likely fail the same way.

Submitting a Python job to a remote Spark cluster

I have a Python script with PySpark code on my local system. I am trying to submit a PySpark job from my local machine to a remote Spark cluster.
Please let me know how this can be done.
Do I need Spark installed locally to submit a Spark job?
You need to set the Spark master URL in the Spark conf, like below:
SparkSession spark = SparkSession.builder().appName("CDX JSON Merge Job").master("spark://ip-address:7077")
.getOrCreate();
You have to install the Spark client on your local machine and then submit the jar using spark-submit:
spark-submit --num-executors 50 --executor-memory 4G --executor-cores 4 --master spark://ip-address:7077 --deploy-mode client --class fully-qualified-class-name artifact.jar
You can also set the master to yarn if you are running Spark on YARN, with the deploy mode set to cluster.
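Since the question is about a Python script rather than a jar, here is a rough sketch of how the spark-submit call from the answer could be assembled for a .py file. The helper name and job.py are placeholders, and the sketch assumes a local Spark client installation that provides spark-submit. It defaults to client deploy mode because standalone clusters do not support cluster deploy mode for Python applications.

```python
import shlex
from typing import List, Optional, Sequence


def spark_submit_command(
    app: str,
    master: str,
    deploy_mode: str = "client",
    main_class: Optional[str] = None,
    app_args: Sequence[str] = (),
) -> List[str]:
    """Assemble a spark-submit argument list without running it."""
    cmd = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    if main_class:
        # Only meaningful for JVM applications packaged as a jar.
        cmd += ["--class", main_class]
    cmd.append(app)
    cmd += list(app_args)
    return cmd


# job.py is a placeholder for the local PySpark script.
print(shlex.join(spark_submit_command("job.py", "spark://ip-address:7077")))
```

The resulting list can be passed to subprocess.run on a machine that can reach the master on port 7077 and that the cluster nodes can connect back to, since in client mode the driver runs locally.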

Running spark application in local mode

I'm trying to start my Spark application in local mode using spark-submit. I am using Spark 2.0.2, Hadoop 2.6 & Scala 2.11.8 on Windows. The application runs fine from within my IDE (IntelliJ), and I can also start it on a cluster with actual, physical executors.
The command I'm running is
spark-submit --class [MyClassName] --master local[*] target/[MyApp]-jar-with-dependencies.jar [Params]
Spark starts up as usual, but then terminates with
java.io.IOException: Failed to connect to /192.168.88.1:56370
What am I missing here?
Check which master URL and port you are using. If you are running on a cluster, log in to the master node and include:
--master spark://XXXX:7077
You can always find the URL in the Spark master web UI, on port 8080.
Also check whether your SparkSession builder config already sets the master, as it takes priority over the --master flag when launching, e.g.:
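import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder
  .appName("myapp")
  .master("local[*]") // a master set in code overrides the --master flag of spark-submit
  .getOrCreate()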
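The priority rule mentioned above can be illustrated with a small sketch. Spark documents the order as: properties set in code on the SparkConf first, then flags passed to spark-submit, then spark-defaults.conf. The function below is only a model of that order using plain dictionaries, not Spark's own resolution code, and its name is made up.

```python
def effective_conf(defaults_file, submit_flags, set_in_code):
    """Merge three config sources; later updates win, mirroring Spark's documented order."""
    merged = dict(defaults_file)   # lowest priority: spark-defaults.conf
    merged.update(submit_flags)    # overrides defaults: spark-submit flags
    merged.update(set_in_code)     # highest priority: values set in the application
    return merged


conf = effective_conf(
    {"spark.master": "yarn"},
    {"spark.master": "spark://XXXX:7077"},
    {"spark.master": "local[*]"},
)
print(conf["spark.master"])
```

In this model the master hard-coded in the application wins, which is why a .master(...) call left in the builder silently overrides whatever is passed on the command line.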

Spark + Mesos cluster mode, who uploads the jar?

I'm trying to run Spark applications with Mesos cluster mode. (I've got client mode working but still would like to try cluster mode)
I have launched spark-mesos-dispatcher on the Mesos master node.
I submit the assembly at the local path /tmp/assembly.jar using the following command:
bin/spark-submit --master mesos://dispatcher:7077 --deploy-mode cluster --class com.example.Example /tmp/assembly.jar
It fails because the file /tmp/assembly.jar does not exist on the Mesos slave nodes:
I1129 10:47:43.839771 5884 fetcher.cpp:414] Fetcher Info: {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/9d725348-931a-48fb-96f7-d29a4b09f3e8-S9\/deploy","items":[{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"\/tmp\/assembly.jar"}}],"sandbox_directory":"\/var\/lib\/mesos\/slaves\/9d725348-931a-48fb-96f7-d29a4b09f3e8-S9\/frameworks\/9d725348-931a-48fb-96f7-d29a4b09f3e8-0291\/executors\/driver-20151129104742-0008\/runs\/31bf5840-226e-4b87-ae76-d14bd2f17950","user":"user"}
I1129 10:47:43.840710 5884 fetcher.cpp:369] Fetching URI '/tmp/assembly.jar'
I1129 10:47:43.840721 5884 fetcher.cpp:243] Fetching directly into the sandbox directory
I1129 10:47:43.840731 5884 fetcher.cpp:180] Fetching URI '/tmp/assembly.jar'
I1129 10:47:43.840737 5884 fetcher.cpp:160] Copying resource with command:cp '/tmp/assembly.jar' '/var/lib/mesos/slaves/9d725348-931a-48fb-96f7-d29a4b09f3e8-S9/frameworks/9d725348-931a-48fb-96f7-d29a4b09f3e8-0291/executors/driver-20151129104742-0008/runs/31bf5840-226e-4b87-ae76-d14bd2f17950/assembly.jar'
cp: cannot stat `/tmp/assembly.jar': No such file or directory
Failed to fetch '/tmp/assembly.jar': Failed to copy with command 'cp '/tmp/assembly.jar' '/var/lib/mesos/slaves/9d725348-931a-48fb-96f7-d29a4b09f3e8-S9/frameworks/9d725348-931a-48fb-96f7-d29a4b09f3e8-0291/executors/driver-20151129104742-0008/runs/31bf5840-226e-4b87-ae76-d14bd2f17950/assembly.jar'', exit status: 256
Failed to synchronize with slave (it's probably exited)
In YARN cluster mode, Spark's YARN client implementation uploads the application jar to HDFS so that the driver and all executors have access to it, but I could not find such code in RestSubmissionClient, which is used by Mesos and Standalone cluster mode.
Who does the uploading in this case? Or do I need to manually put the application assembly somewhere accessible via an HTTP URI?
From my understanding, you could use the SparkContext addJar() method to add a JAR file path that is local to the driver application; it will then be distributed to the executor nodes (in client mode).
Since you state that you want to use cluster mode, I'd suggest having a look at the Spark Jobserver project, which should make running Spark applications on Mesos easier than with the built-in tools.
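For what it's worth, the Spark documentation for Mesos cluster mode notes that jars passed to spark-submit should be URIs reachable by the Mesos agents, because local jars are not uploaded automatically. A tiny sketch of a pre-submit check along those lines follows; the set of schemes, the function name, and the repo.example.com URL are assumptions for illustration.

```python
from urllib.parse import urlparse

# Assumption: schemes a Mesos agent could fetch from in this illustration.
FETCHABLE_SCHEMES = {"http", "https", "hdfs"}


def needs_manual_upload(jar_uri: str) -> bool:
    """True when the jar URI points at the submitting machine's local disk."""
    return urlparse(jar_uri).scheme not in FETCHABLE_SCHEMES


print(needs_manual_upload("/tmp/assembly.jar"))
print(needs_manual_upload("http://repo.example.com/jars/assembly.jar"))
```

A path such as /tmp/assembly.jar is flagged, which matches the failed cp in the fetcher log above: the agent looks for the file on its own disk.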
