Run a Spark cluster using an independent YARN (without using Hadoop's YARN) - apache-spark

I want to deploy a Spark cluster with YARN as the cluster manager.
This Spark cluster needs to read data from an external HDFS filesystem belonging to an existing Hadoop ecosystem that also has its own YARN (however, I am not allowed to use that Hadoop cluster's YARN).
My questions are:
Is it possible to run a Spark cluster using an independent YARN, while still reading data from an outside HDFS filesystem?
If yes, is there any downside or performance penalty to this approach?
If no, can I run Spark as a standalone cluster, and will there be any performance issues?
Assume both the Spark cluster and the Hadoop cluster are running in the same data center.

using an independent YARN, while still reading data from an outside HDFS filesystem
Yes. Configure yarn-site.xml to point at the necessary cluster, and use a fully-qualified URI to refer to external file locations, such as hdfs://namenode-external:8020/file/path
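For illustration, a minimal PySpark sketch of that setup (the namenode-external hostname comes from the example above; it assumes HADOOP_CONF_DIR/YARN_CONF_DIR point at your own YARN's configs):

```python
from pyspark.sql import SparkSession

# Assumes HADOOP_CONF_DIR/YARN_CONF_DIR point at the configs of *your*
# independent YARN cluster, not the external Hadoop cluster's.
spark = (SparkSession.builder
         .appName("read-external-hdfs")
         .master("yarn")
         .getOrCreate())

# A fully-qualified URI bypasses the local cluster's fs.defaultFS and
# reads directly from the external namenode (hostname from the example).
df = spark.read.text("hdfs://namenode-external:8020/file/path")
df.show()
```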
any downside or performance penalty to this approach
Yes. All reads will be remote rather than cluster-local. The performance degradation is effectively similar to reading from S3 or other remote locations, however.
can I run Spark as a standalone cluster
You could, or you could use Kubernetes if that's available, but both are pointless IMO if there's already a YARN cluster (with enough resources) available.

Related

YARN as resource manager in Spark for a Linux cluster - inside Kubernetes and outside Kubernetes

If I am using a Kubernetes cluster to run Spark, then I am using the Kubernetes resource manager in Spark.
If I am using a Hadoop cluster to run Spark, then I am using the YARN resource manager in Spark.
But my question is: if I am spawning multiple Linux nodes in Kubernetes, and use one of the nodes as the Spark master and three others as workers, what resource manager should I use? Can I use YARN here?
Second question: in the case of any 4-node Linux Spark cluster (not in Kubernetes and not Hadoop, just simple connected Linux machines), even if I do not have HDFS, can I use YARN here as the resource manager? If not, then what resource manager should be used for Spark?
Thanks.
if I am spawning multiple Linux nodes in Kubernetes,
Then you'd obviously use Kubernetes, since it's available.
in the case of any 4-node Linux Spark cluster (not in Kubernetes and not Hadoop, just simple connected Linux machines), even if I do not have HDFS, can I use YARN here
You can, or you can use the Spark Standalone scheduler instead. However, Spark requires a shared filesystem for reading and writing data, so while you could attempt to use NFS or S3/GCS for this, HDFS is faster.
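A sketch of what that looks like with the Standalone scheduler plus object storage as the shared filesystem (the master URL and bucket name are hypothetical, and the hadoop-aws version must match your Hadoop build):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("standalone-no-hdfs")
         .master("spark://spark-master:7077")  # hypothetical Standalone master
         # S3 access needs the hadoop-aws connector on the classpath.
         .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4")
         .getOrCreate())

# With no HDFS, every worker must see the same storage; an object store
# (or an NFS mount visible at the same path on all nodes) fills that role.
df = spark.read.json("s3a://my-bucket/events/")  # hypothetical bucket
df.write.parquet("s3a://my-bucket/output/")
```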

Does Spark Streaming need HDFS with Kafka

I have to design a setup to read incoming data from Twitter (streaming). I have decided to use Apache Kafka with Spark Streaming for real-time processing. The analytics must be shown in a dashboard.
Now, being a newbie in this domain, my assumed data rate will be 10 Mb/sec maximum. I have decided to use one machine with 12 cores and 16 GB of memory for Kafka; ZooKeeper will also be on the same machine. Now, I am confused about Spark: it will have to perform the streaming analysis job only. Later, the analyzed output is pushed to a DB and the dashboard.
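(For concreteness, a minimal Structured Streaming read of such a Kafka topic might look like the sketch below; the broker address and topic name are assumptions, and the spark-sql-kafka connector must be on the classpath.)

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tweet-stream").getOrCreate()

# Broker address and topic name are assumptions for this sketch.
tweets = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "kafka-host:9092")
          .option("subscribe", "tweets")
          .load())

# Kafka delivers raw bytes; cast the payload to a string before analysis.
parsed = tweets.selectExpr("CAST(value AS STRING) AS tweet")

# Console sink as a stand-in for the real DB/dashboard sink.
query = (parsed.writeStream
         .format("console")
         .outputMode("append")
         .start())
query.awaitTermination()
```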
My points of confusion:
Should I run Spark on a Hadoop cluster or a local file system?
Can Spark's standalone mode fulfill my requirements?
Is my approach appropriate, or what would be best in this case?
An attempt at an answer:
Should I run Spark on a Hadoop cluster or a local file system?
I recommend using HDFS; it can store more data and ensures high availability.
Can Spark's standalone mode fulfill my requirements?
Standalone mode is the easiest to set up and will provide almost all the same features as the other cluster managers if you are only running Spark.
YARN allows you to dynamically share and centrally configure the same pool of cluster resources between all frameworks that run on YARN.
YARN doesn’t need to run a separate ZooKeeper Failover Controller.
YARN will likely be preinstalled in many Hadoop distributions, such as CDH.
So I recommend YARN.
Useful links:
Spark on YARN docs
Spark Standalone docs
another wonderful answer
Is my approach appropriate, or what would be best in this case?
If your data is not more than about 10 million records, I think you can use a local cluster to do it.
Local mode avoids multi-node shuffles; shuffles between processes are faster than shuffles between nodes.
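To make the local-mode suggestion concrete, a minimal sketch:

```python
from pyspark.sql import SparkSession

# local[*] runs the whole application in a single JVM with one worker
# thread per core, so shuffle data never crosses a machine boundary.
spark = (SparkSession.builder
         .appName("small-data-analysis")
         .master("local[*]")
         .getOrCreate())
```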
Otherwise, I recommend using at least 3 nodes, i.e. a real Hadoop cluster.
As a Spark beginner, this is my understanding; I hope an expert will correct me.

Is it possible to run ANY application or program with Hadoop YARN?

I've been studying distributed computing recently and found out that Hadoop YARN is one such system.
So I thought that if I just set up a Hadoop YARN cluster, then every application would run distributed.
But now someone told me that Hadoop YARN cannot do anything by itself and needs other things like MapReduce, Spark, and HBase.
If this is correct, does that mean only a limited set of tasks can be run with YARN?
Or can I apply YARN's distributed computing to all the applications I want?
Hadoop is the name which refers to the entire system.
HDFS is the actual storage system. Think of it as S3 or a distributed Linux filesystem.
YARN is a framework for scheduling jobs and allocating resources. It handles these things for you, but you don't interact very much with it.
Spark and MapReduce are managed by YARN. With these two, you can actually write your code/applications and give work to the cluster.
HBase uses HDFS storage (which is file-based) and provides NoSQL storage.
Theoretically you can run more than just Spark and MapReduce on YARN, and you can use something other than YARN (Kubernetes is supported now). You can even write your own processing tool, queue/resource management system, storage... Hadoop has many pieces which you may use or not, depending on your case. But the majority of Hadoop systems use YARN and Spark.
If you want to deploy Docker containers, for example, just a Kubernetes cluster would be a better choice. If you need batch/real-time processing with Spark, use Hadoop.
YARN itself is a resource manager. You will need to write code that can be deployed onto those resources, and then it could do anything, given that the nodes running the tasks are themselves capable of running the job. For example, you cannot distribute a Python script without first installing its library dependencies on the nodes. Mesos is a bit more generalized/accessible than YARN, if you want more flexibility for the same effect.
YARN mostly supports running JAR files and shell scripts (at least, via Oozie), and Docker containers can be deployed to it as well (refer to the Apache docs).
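For example, a trivial PySpark application that YARN could schedule might look like this sketch (the HDFS paths are hypothetical; it would be launched with spark-submit --master yarn):

```python
from pyspark.sql import SparkSession

# YARN only allocates the containers; Spark decides what runs inside them.
spark = SparkSession.builder.appName("yarn-word-count").getOrCreate()

lines = spark.read.text("hdfs:///data/input.txt")  # hypothetical path
counts = (lines.rdd
          .flatMap(lambda row: row.value.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))
counts.toDF(["word", "count"]).write.csv("hdfs:///data/word_counts")
```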
You may also refer to the Apache Slider or Twill projects for more information.

Spark Standalone vs YARN

What features of YARN make it better than Spark Standalone mode for a multi-tenant cluster running only Spark applications? Maybe besides authentication.
There are a lot of answers on Google; most of them sound wrong to me, so I'm not sure where the truth is.
For example:
DZone, Deep Dive Into Spark Cluster Management
Standalone is good for small Spark clusters, but it is not good for bigger clusters (there is an overhead of running Spark daemons, master + slave, on cluster nodes)
But other cluster managers also require running agents on the cluster nodes. E.g., YARN's slaves are called NodeManagers, and they may consume even more memory than Spark's slaves (the Spark default is 1 GB).
This answer
The Spark standalone mode requires each application to run an executor on every node in the cluster; whereas with YARN, you choose the number of executors to use
This goes against Spark Standalone's executor/cores controls, which show how you can specify the amount of resources consumed in Standalone mode.
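For reference, a sketch of those Standalone knobs (values and master URL are arbitrary assumptions):

```python
from pyspark.sql import SparkSession

# In Standalone mode an application is NOT forced onto every node:
# cap the app's total cores and size each executor explicitly.
spark = (SparkSession.builder
         .master("spark://spark-master:7077")    # hypothetical master
         .config("spark.cores.max", "8")         # app-wide core cap
         .config("spark.executor.cores", "2")    # cores per executor
         .config("spark.executor.memory", "4g")  # memory per executor
         .getOrCreate())
```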
Spark Standalone Mode documentation
The standalone cluster mode currently only supports a simple FIFO scheduler across applications.
This is contradicted by the fact that Standalone mode can use dynamic allocation: you can specify spark.dynamicAllocation.minExecutors and spark.dynamicAllocation.maxExecutors. Also, I haven't found any note saying Standalone doesn't support the FairScheduler.
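A sketch of dynamic allocation under Standalone (shuffle tracking, available since Spark 3.0, removes the need for the external shuffle service; the master URL is an assumption):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("spark://spark-master:7077")  # hypothetical master
         .config("spark.dynamicAllocation.enabled", "true")
         # Shuffle tracking (Spark 3.0+) lets executors be released
         # without an external shuffle service.
         .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
         .config("spark.dynamicAllocation.minExecutors", "1")
         .config("spark.dynamicAllocation.maxExecutors", "10")
         .getOrCreate())
```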
This answer
YARN directly handles rack and machine locality
How would YARN know anything about data locality in my job? Suppose I'm storing file locations in AWS Glue (used by EMR as the Hive metastore). Inside the Spark job I'm querying some-db.some-table. How would YARN know which executor is better suited for the task assignment?
UPD: found another mention of YARN and data locality: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-data-locality.html. It still doesn't matter in the case of S3, for example.

Is it worth deploying Spark on YARN if I have no other cluster software?

I have a Spark cluster running in standalone mode. I am currently executing code from a Jupyter notebook using pyspark. Is there a benefit to using YARN as the cluster manager, assuming that the machines are not doing anything else?
Would I get better performance using YARN? If so, why?
Many thanks,
John
I'd say YES, considering these points.
Why Run on YARN?
Using YARN as Spark’s cluster manager confers a few benefits over Spark standalone:
1. You can take advantage of all the features of YARN schedulers for categorizing, isolating, and prioritizing workloads.
2. Spark standalone mode requires a worker on each node for slave activity, and these workers cannot run non-Spark applications; with YARN, work is isolated in containers, so adopting another compute framework becomes a code change instead of infra + code. The cluster can therefore be shared among different frameworks.
3. YARN is the only cluster manager for Spark that supports security. With YARN, Spark can run against Kerberized Hadoop clusters and uses secure authentication between its processes.
4. YARN allows you to dynamically share and centrally configure the same pool of cluster resources between all frameworks that run on YARN. You can throw your entire cluster at a MapReduce job, then use some of it for an Impala query and the rest for a Spark application, without any changes in configuration.
I would say points 1, 2, and 3 apply to the mentioned scenario, but not point 4, as we assumed no other frameworks are going to use the cluster.
Source
