Zeppelin isolated per user mode - apache-spark

My question is about the isolated per-user mode in Zeppelin.
Since Zeppelin doesn't yet support remote deploy mode for the Spark driver (more info in ZEPPELIN-2040),
how do you deal with multiple users, and therefore multiple Spark drivers, each taking 4 GB of memory?
I'm looking for a horizontally scalable solution that doesn't increase the Spark master instance's memory but instead uses the slaves' memory for the drivers.
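For reference, a minimal sketch of capping each per-user driver in conf/zeppelin-env.sh via SPARK_SUBMIT_OPTIONS; the 1g value is an example, not a recommendation:

    # Cap the memory of every Spark driver that Zeppelin launches
    # (1g is a placeholder value).
    export SPARK_SUBMIT_OPTIONS="--driver-memory 1g"

This bounds each driver's footprint but doesn't move the drivers off the Zeppelin host, which is the limitation tracked in ZEPPELIN-2040.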

Related

What is the benefit of using more than 1 driver core in Spark YARN cluster mode?

What is the difference between using 1 vs. 2 driver cores in Spark YARN cluster mode? If I use 2 driver cores in YARN cluster mode, will the Spark driver be relaunched in case of failure? If so, how many retries will it do before failing?
I'd appreciate it if anyone could share an article on this.
When you launch an application in YARN cluster mode, YARN creates a container for your driver.
This container, depending on your application, might need multiple cores and multiple gigs of memory. It all depends on how many sessions connect to your Spark application at the same time and on the complexity of your queries.
If your queries seem to compile slowly, or your Spark Web UI/app hangs, it might be worth increasing the core count.
From YARN's point of view, there is still only one driver container.
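A hedged sketch of requesting extra resources for that driver container at submit time; the class and jar names are placeholders:

    # In cluster mode, --driver-cores and --driver-memory size the
    # YARN container that hosts the driver.
    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --driver-cores 2 \
      --driver-memory 4g \
      --class com.example.MyApp \
      myapp.jar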

Does Spark Streaming need HDFS with Kafka

I have to design a setup to read incoming data from Twitter (streaming). I have decided to use Apache Kafka with Spark Streaming for the real-time processing. The analytics have to be shown in a dashboard.
Now, being a newbie in this domain, I assume a maximum data rate of 10 Mb/sec. I have decided to use 1 machine for Kafka, with 12 cores and 16 GB of memory. ZooKeeper will also be on the same machine. Now, I am confused about Spark: it will only have to perform the streaming analysis. Later, the analyzed output is pushed to a DB and the dashboard.
My points of confusion:
Should I run Spark on a Hadoop cluster or the local file system?
Can Spark's standalone mode fulfill my requirements?
Is my approach appropriate, or what would be best in this case?
My attempt at an answer:
Should I run Spark on a Hadoop cluster or the local file system?
I recommend using HDFS: it can store more data and ensures high availability.
Can Spark's standalone mode fulfill my requirements?
Standalone mode is the easiest to set up and provides almost all the same features as the other cluster managers if you are only running Spark.
YARN allows you to dynamically share and centrally configure the same pool of cluster resources between all frameworks that run on YARN.
YARN doesn't need to run a separate ZooKeeper Failover Controller.
YARN will likely be preinstalled in many Hadoop distributions, such as CDH.
So I recommend YARN.
Useful links:
spark yarn doc
spark standalone doc
other wonderful answer
Is my approach appropriate, or what would be best in this case?
If your data is no more than 10 million records, I think you can use a local cluster for it.
Local mode avoids shuffling across many nodes, and shuffles between processes are faster than shuffles between nodes.
Otherwise, I recommend 3 or more nodes, that is, a real Hadoop cluster.
As a Spark beginner, this is my understanding; I hope an expert corrects me.
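For concreteness, hedged sketches of submitting against the two cluster managers discussed above; host, class, and jar names are placeholders:

    # Standalone cluster manager:
    spark-submit --master spark://master-host:7077 --class com.example.App app.jar

    # YARN (the cluster address is read from HADOOP_CONF_DIR/YARN_CONF_DIR):
    spark-submit --master yarn --deploy-mode cluster --class com.example.App app.jar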

In a Spark cluster, is there one driver process per machine or one driver process per cluster?

I am reading Spark: The Definitive Guide and am learning a great deal.
However, one thing I am confused about while reading is how many driver processes there are per Spark job. When the text first introduces driver and executor processes, it implies that there is a driver per machine, but in the discussion about broadcast variables, it sounds as though there is one driver per cluster.
This is because the text talks about the driver process sending the broadcast variable to every node in the cluster so that it can be cached there. That makes it sound as though there is only one driver process in the whole cluster that is responsible for this.
Which one is it: one driver process per cluster, or one per machine? Or can it be both? I think I am missing something here.
In the Spark architecture, there is only one driver for your Spark application.
The Spark driver, as part of the Spark application, is responsible for instantiating a Spark session. The driver has multiple responsibilities:
It communicates with the cluster manager (CM).
It requests resources from the CM for Spark's executor JVMs.
It transforms all Spark operations into DAG computations, schedules them, and distributes their execution as tasks across all Spark executors.
Its interaction with the CM is merely to get Spark executor resources.
So, the workflow of running spark applications on a cluster can be seen as:
The user submits an application using spark-submit.
spark-submit launches the driver program and invokes the main method specified by the user.
The driver program contacts the cluster manager to ask for resources to launch executors.
The cluster manager launches executors on behalf of the driver program.
The driver process runs through the user application. Based on the RDD or dataset operations in the program, the driver sends work to executors in the form of tasks.
The tasks run in executor processes to compute and save results.
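A hedged illustration of step 1: the deploy mode only changes where the single driver runs, never how many there are. Class and jar names are placeholders:

    # client mode: the one driver runs on the machine calling spark-submit
    spark-submit --master yarn --deploy-mode client --class com.example.App app.jar

    # cluster mode: the one driver runs inside a container on the cluster
    spark-submit --master yarn --deploy-mode cluster --class com.example.App app.jar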

Spark client memory configuration

I'm trying to run multiple Spark clients on Airflow (an ETL scheduler).
I'm running in cluster mode on YARN, therefore the ApplicationMaster, executors, and driver all run on the cluster inside YARN containers.
However, my Spark client, which polls the process and monitors its state, runs on the Airflow worker.
The problem is that the Spark client takes a lot of memory, ~500 MB per job. That may not sound like much in terms of executors or drivers, but for the role of a Spark client it sounds crazy.
My question is: how can I configure the Spark client's memory/CPU requirements? Can I limit its polling intervals? Can I limit its memory with flags?
In the Spark code there is a distinction between standalone mode and cluster mode. For standalone it sets a default of -Xmx1g, while in cluster mode there is no default and it tries to read Java options from an environment variable called SPARK_SUBMIT_OPTS.
So if you want to set any Java options for the client Java process only, use SPARK_SUBMIT_OPTS.
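A minimal sketch, assuming the job is launched from a shell (e.g. by Airflow); the 512m heap cap is an example value:

    # Cap only the spark-submit client JVM; in cluster mode the driver
    # and executors on the cluster are unaffected.
    export SPARK_SUBMIT_OPTS="-Xmx512m"
    spark-submit --master yarn --deploy-mode cluster --class com.example.App app.jar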

Spark Standalone vs YARN

What features of YARN make it better than Spark standalone mode for a multi-tenant cluster running only Spark applications? Maybe besides authentication.
There are a lot of answers on Google, but most of them sound wrong to me, so I'm not sure where the truth is.
For example:
DZone, Deep Dive Into Spark Cluster Management
Standalone is good for small Spark clusters, but it is not good for
bigger clusters (there is an overhead of running Spark daemons —
master + slave — in cluster nodes)
But other cluster managers also require running agents on cluster nodes, e.g. YARN's slaves are called NodeManagers. They may consume even more memory than Spark's slaves (Spark's default is 1 GB).
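For reference, a sketch of where that 1 GB standalone daemon heap is configured, assuming it refers to the master/worker daemon memory set in conf/spark-env.sh:

    # Heap for the standalone master/worker daemons themselves
    # (not for executors); 1g mirrors the default.
    export SPARK_DAEMON_MEMORY=1g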
This answer
The Spark standalone mode requires each application to run an executor
on every node in the cluster; whereas with YARN, you choose the number
of executors to use
This goes against Spark Standalone # executor/cores control, which shows how you can specify the amount of resources consumed in standalone mode.
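A hedged example of that resource control on a standalone cluster; the values and names are placeholders:

    # Cap the total cores and per-executor memory the application
    # consumes on a standalone cluster.
    spark-submit \
      --master spark://master-host:7077 \
      --total-executor-cores 8 \
      --executor-memory 2g \
      --class com.example.App app.jar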
Spark Standalone Mode documentation
The standalone cluster mode currently only supports a simple FIFO
scheduler across applications.
This goes against the fact that standalone mode can use dynamic allocation, where you can specify spark.dynamicAllocation.minExecutors & spark.dynamicAllocation.maxExecutors. Also, I haven't found any note saying standalone mode doesn't support the FairScheduler.
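For reference, a sketch of those dynamic-allocation settings passed as --conf flags (they could equally live in spark-defaults.conf); the bounds are placeholders:

    # Dynamic allocation also needs the external shuffle service
    # running on the standalone workers.
    spark-submit \
      --master spark://master-host:7077 \
      --conf spark.dynamicAllocation.enabled=true \
      --conf spark.shuffle.service.enabled=true \
      --conf spark.dynamicAllocation.minExecutors=1 \
      --conf spark.dynamicAllocation.maxExecutors=10 \
      --class com.example.App app.jar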
This answer
YARN directly handles rack and machine locality
How would YARN know anything about data locality in my job? Suppose I'm storing file locations in AWS Glue (used by EMR as the Hive metastore). Inside the Spark job I'm querying some-db.some-table. How would YARN know which executor is better for the job assignment?
UPD: found another mention of YARN and data locality: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-data-locality.html. Still, it doesn't matter in the case of S3, for example.
