Spark Application connecting to kerberized Hadoop cluster - apache-spark

I have a Spark machine running in standalone mode. It has a Spark job that writes to kerberized HDFS.
Based on the Cloudera documentation, standalone Spark can't connect to kerberized HDFS. Is this true?
https://www.cloudera.com/documentation/enterprise/5-5-x/topics/sg_spark_auth.html
My Spark node is not kerberized. Do I need to switch to YARN mode to write to kerberized HDFS? Does my Spark cluster also need to be kerberized to connect to HDFS?
I posted about this earlier, but none of the suggestions there worked for me:
Kinit with Spark when connecting to Hive
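For what it's worth, the workaround usually suggested for standalone mode is an explicit keytab login on the driver through Hadoop's UserGroupInformation before any HDFS access. Below is a minimal Scala sketch; the principal, keytab path, and namenode address are placeholders, and it only helps as far as the executors can also authenticate, which is exactly the limitation the Cloudera page describes.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation
import org.apache.spark.sql.SparkSession

object KerberizedHdfsWrite {
  def main(args: Array[String]): Unit = {
    // Placeholder principal and keytab path -- replace with your own.
    val principal = "sparkuser@EXAMPLE.COM"
    val keytab    = "/etc/security/keytabs/sparkuser.keytab"

    // The Hadoop conf must describe the kerberized cluster (core-site.xml /
    // hdfs-site.xml on the classpath, or set the property explicitly).
    val hadoopConf = new Configuration()
    hadoopConf.set("hadoop.security.authentication", "kerberos")

    // Obtain a Kerberos ticket on the driver before any HDFS access.
    UserGroupInformation.setConfiguration(hadoopConf)
    UserGroupInformation.loginUserFromKeytab(principal, keytab)

    val spark = SparkSession.builder()
      .appName("standalone-to-kerberized-hdfs")
      .getOrCreate()

    // Standalone mode (in these Spark versions) does not distribute delegation
    // tokens to executors, so writes can still fail there even when the driver
    // login succeeds.
    spark.range(100).write.mode("overwrite")
      .parquet("hdfs://namenode.example.com:8020/tmp/kerberos-test")

    spark.stop()
  }
}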

Related

How to get spark executor logs from spark history server? (spark on mesos client mode)

We are running Spark on Mesos in client mode.
We also have a Spark history server.
Spark log events can be seen fine in the history server.
But how can we get the Spark executor logs from the Spark UI or the Spark history server?
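As far as I know, the history server only reads Spark event logs; on Mesos, the executors' stdout/stderr end up in the Mesos agent sandbox (browsable from the Mesos UI), not in the history server. For contrast, a sketch of the event-log wiring in spark-defaults.conf, assuming a placeholder HDFS log directory:

# spark-defaults.conf (sketch; the HDFS path is a placeholder)
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs:///spark-history
spark.history.fs.logDirectory    hdfs:///spark-history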

Persisting to Kerberized HDFS from Spark cluster

My current set-up:
Spark 2.3.1 (cluster running on Windows), using a Spark secret (basic).
HDFS (cluster running on Linux), kerberized.
Not ideal, but there's a good reason why I can't use the same set of machines for both clusters.
I am able to read/write to HDFS from a standalone Spark application, but when I try to run similar code on the Spark cluster I get an authentication error.
java.io.IOException: Failed on local exception: java.io.IOException:
org.apache.hadoop.security.AccessControlException: Client cannot
authenticate via: [TOKEN, KERBEROS]; Host Details....
Where is your other cluster node? Which user is running Spark in cluster mode? Does that user have permission to access the keytab? I think it could be a permissions issue or a typo.
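If it is a keytab permissions problem, a few quick checks from the node that runs the executors usually narrow it down; the paths, user, and principal below are placeholders:

# Is the keytab readable by the user that runs the Spark daemons/executors?
ls -l /etc/security/keytabs/sparkuser.keytab
# Which principals does the keytab actually contain?
sudo -u spark klist -kt /etc/security/keytabs/sparkuser.keytab
# Can that user get a ticket from it at all?
sudo -u spark kinit -kt /etc/security/keytabs/sparkuser.keytab sparkuser@EXAMPLE.COM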

Apache Spark and Livy cluster

Scenario:
I have a Spark cluster and I also want to use Livy.
I am new to Livy.
Problem:
I built my Spark cluster using Docker Swarm and I will also create a service for Livy.
Can Livy communicate with an external Spark master and send jobs to it? If so, which configuration needs to be done? Or does Livy have to be installed on the Spark master node?
I think this is a little late, but I hope it helps you.
You can use Docker to send jobs via Livy, and you can also send jobs through the Livy REST API.
The Livy server can be outside of the Spark cluster; you only need to give Livy a conf file that points to your Spark cluster.
It looks like you are running Spark standalone. The easiest way to configure Livy is to have it live on the Spark master node. If you already have YARN on your cluster machines, you can install Livy on any node and run Spark applications in yarn-cluster or yarn-client mode.
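To make that concrete, here is a hedged sketch of the two pieces involved; the hostnames, ports, jar path, and class name are placeholders. In livy.conf, point Livy at the external standalone master, then submit batches over the REST API (Livy listens on 8998 by default):

# livy.conf (sketch)
livy.spark.master = spark://spark-master:7077

# Submit a batch job through the Livy REST API
curl -X POST -H "Content-Type: application/json" \
     -d '{"file": "hdfs:///jobs/my-app.jar", "className": "com.example.MyApp"}' \
     http://livy-server:8998/batches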

What's difference between HDInsight Hadoop cluster & HDInsight Spark cluster?

What's the difference between an HDInsight Hadoop cluster and an HDInsight Spark cluster? I have seen that pyspark is available even in a Hadoop cluster. Is the difference with respect to the cluster type, i.e. does a Hadoop cluster imply YARN as the cluster-management layer while a Spark cluster implies Spark Standalone (or Mesos?) as the cluster-management layer?
If that is the case, I believe we can still run Spark in a Hadoop cluster, with Spark running on top of YARN.
HDInsight Spark uses YARN as the cluster-management layer, just as Hadoop does. The binaries on the cluster are the same.
The differences between HDInsight Spark and Hadoop clusters are the following:
1) Optimal configurations:
The Spark cluster is tuned and configured for Spark workloads. For example, we pre-configure Spark clusters to use SSDs and adjust executor memory size based on machine resources, so customers have a better out-of-the-box experience than the Spark default configuration.
2) Service setups:
The Spark cluster also runs Spark-related services, including Livy, Jupyter, and Spark Thrift Server.
3) Workload quality: We test Spark workloads on Spark clusters prior to every release to ensure quality of service.
The bits are the same, as you noticed. The difference is the set of services and Ambari components that run by default (on a Spark cluster you additionally get Spark Thrift Server, Livy, and Jupyter) and the set of configurations for those services. So while you technically can run Spark jobs on YARN on a Hadoop cluster, it's not recommended; some configs may not be set to optimal values. The other way around would be more reliable: create a Spark cluster and run Hadoop jobs on it.
Maxim (HDInsight Spark PM)
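For completeness, running a Spark job on the Hadoop cluster type would just be a normal YARN submission; a sketch follows, with a placeholder class, jar, and resource sizes (the sizes are what the Spark cluster type would otherwise pre-tune for you):

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  --num-executors 4 \
  --executor-memory 4g \
  /path/to/my-app.jar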

Apache Spark and Mesos running on a single node

I am interested in testing Spark running on Mesos. I created a single-node Hadoop 2.6.0 cluster in VirtualBox and installed Spark on it. I can successfully process files in HDFS using Spark.
Then I installed Mesos Master and Slave on the same node. I tried to run Spark as a framework in Mesos using these instructions. I get the following error with Spark:
WARN TaskSchedulerImpl: Initial job has not accepted any resources;
check your cluster UI to ensure that workers are registered and have
sufficient resources
The Spark shell is successfully registered as a framework in Mesos. Is there anything wrong with using a single-node setup? Or do I need to add more Spark worker nodes?
I am very new to Spark and my aim is just to test Spark, HDFS, and Mesos.
If you have allocated enough resources for the Spark slaves, the cause might be a firewall blocking the communication. Take a look at my other answer:
Apache Spark on Mesos: Initial job has not accepted any resources
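If it turns out to be resources rather than a firewall, capping what the shell asks for so the single agent's offer can satisfy it is worth a try; a sketch, with a placeholder Mesos master address and deliberately small sizes:

spark-shell \
  --master mesos://192.168.56.101:5050 \
  --conf spark.executor.memory=512m \
  --conf spark.cores.max=1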
