Spark Streaming and High Availability - apache-spark

I'm building an Apache Spark application that acts on multiple streams.
I did read the Performance Tuning section of the documentation:
http://spark.apache.org/docs/latest/streaming-programming-guide.html#performance-tuning
What I didn't get is:
1) Are the streaming receivers located on multiple worker nodes or on the driver machine?
2) What happens if one of the nodes that receives the data fails (power off/restart)?

Are the streaming receivers located on multiple worker nodes or on the
driver machine?
Receivers run on worker nodes, which are responsible for consuming the data from the source that holds it.
What happens if one of the nodes that receives the data fails (power
off/restart)?
The receiver runs on a worker node. The worker node gets its tasks from the driver. In client mode the driver runs on the machine that submits the application (which may be a dedicated master server); in cluster mode it runs on one of the workers. If a node that does not run the driver fails, the driver reassigns the partitions held on the failed node to a different node, which can then re-read the data from the source and do the additional processing needed to recover from the failure.
This is why a replayable source such as Kafka or AWS Kinesis is needed.
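As a rough illustration of the recovery setup described above, the sketch below enables checkpointing and the receiver write-ahead log so that a restarted driver or a replacement receiver can pick up where the failed one left off. The socket source, host, port, and checkpoint path are placeholders chosen for the example (a socket is not replayable; a real deployment would use a source such as Kafka or Kinesis).

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ReceiverRecoverySketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("ReceiverRecoverySketch")
      // Write received blocks to a write-ahead log in the checkpoint directory,
      // so data buffered on a failed worker is not lost.
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")

    // Hypothetical path; it should live on fault-tolerant storage.
    val checkpointDir = "hdfs:///tmp/streaming-checkpoint"

    // Builds a fresh context; only called when no checkpoint exists yet.
    def createContext(): StreamingContext = {
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir)
      // Placeholder source for the sketch.
      val lines = ssc.socketTextStream("localhost", 9999)
      lines.count().print()
      ssc
    }

    // Recover from the checkpoint if present, otherwise create a new context.
    val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext())
    ssc.start()
    ssc.awaitTermination()
  }
}
```

The master URL is deliberately left out of the code so that it can be supplied by spark-submit.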

Related

HDFS vs HDFS with YARN, and if I use spark can I put new resource management?

What is the use case for HDFS alone, without YARN, and what is the use case with it? Should I use MapReduce, or can I use only Spark? Also, if I use Spark, can I plug in a different resource manager for Spark instead of YARN in the same system? Is that the optimal solution, and how do I decide between these options based on the use case?
Sorry, I don't have a specific use case!
Hadoop Distributed File System
HDFS is a Java-based file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers. HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4500 servers, supporting close to a billion files and blocks. When that quantity and quality of enterprise data is available in HDFS, and YARN enables multiple data access applications to process it, Hadoop users can confidently answer questions that eluded previous data platforms.
HDFS is a scalable, fault-tolerant, distributed storage system that works closely with a wide variety of concurrent data access applications, coordinated by YARN. HDFS will “just work” under a variety of physical and systemic circumstances. By distributing storage and computation across many servers, the combined storage resource can grow linearly with demand while remaining economical at every amount of storage.
Take Away
HDFS is based on a master/slave architecture, with the Name Node (NN) as the master and the Data Nodes (DN) as the slaves.
The Name Node stores only the meta information about the files; the actual data is stored on the Data Nodes.
Both the Name Node and the Data Node are processes, not special hardware.
The Data Node uses the underlying OS file system to save the data.
You need to use an HDFS client to interact with HDFS. The HDFS client always talks to the Name Node for meta information and subsequently talks to the Data Nodes to read/write data. No data IO happens through the Name Node.
HDFS clients never send data to the Name Node, hence the Name Node never becomes a bottleneck for data IO in the cluster.
The HDFS client supports a "short-circuit" read feature: when it is enabled and the client runs on a node that hosts a Data Node, it can read the file blocks directly from that node, making the read entirely local.
To make it even simpler, imagine the HDFS client is a web client and HDFS as a whole is a web service with predefined operations such as GET, PUT, COPYFROMLOCAL, etc.
How is a 400 MB file saved on HDFS with an HDFS block size of 100 MB?
YARN (Yet Another Resource Negotiator )
"does it ring a bell 'Yet Another Hierarchically Organized Oracle' YAHOO"
YARN is essentially a system for managing distributed applications. It consists of a central Resource manager (RM), which arbitrates all available cluster resources, and a per-node Node Manager (NM), which takes direction from the Resource manager. The Node manager is responsible for managing available resources on a single node.
Take Away
YARN is based on a master/slave architecture, with the Resource Manager as the master and the Node Managers as the slaves.
The Resource Manager keeps the meta information about which jobs are running on which Node Manager and how much memory and CPU they consume, and hence has a holistic view of the total CPU and RAM consumption of the whole cluster.
Jobs run on the Node Managers and never execute on the Resource Manager. Hence the RM never becomes a bottleneck for job execution. Both RM and NM are processes, not special hardware.
A container is a logical abstraction for CPU and RAM.
YARN (Yet Another Resource Negotiator) schedules containers (CPU and RAM) over the whole cluster. Hence an end user who needs CPU and RAM in the cluster has to interact with YARN.
When requesting CPU and RAM, you can specify the host on which you need them.
To interact with YARN you need to use yarn-client which
How HDFS and YARN work in tandem
The Name Node and Resource Manager processes are hosted on two different hosts, as they hold key meta information.
The Data Node and Node Manager processes are co-located on the same host.
A file is saved onto HDFS (Data Nodes). To access the file in a distributed way, one can write a YARN application (MR2, Spark, Distributed Shell, Slider application) using the YARN client, and use the HDFS client to read the data.
The distributed application can fetch the file locations (meta information from the Name Node) and ask the Resource Manager (YARN) to provide containers on the hosts which hold the file blocks.
Remember the short-circuit optimization provided by HDFS: if the distributed job gets a container on a host which holds the file block and tries to read it, the read will be local and not over the network.
The same 400 MB file, which would take 4 seconds to read sequentially (at 100 MB/sec), can be read in 1 second, because the distributed process runs in parallel in four YARN containers (on different Node Managers), each reading its 100 MB block at 100 MB/sec.
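The arithmetic behind the 400 MB example can be written out as a small, idealized model. The object and function names are made up for this sketch, and it ignores network, scheduling, and replication overheads.

```scala
object HdfsBlockMath {
  // Blocks needed to store a file; the last block may be only partly filled.
  def blockCount(fileMb: Long, blockMb: Long): Long =
    (fileMb + blockMb - 1) / blockMb

  // Time for a single reader to stream the whole file.
  def sequentialSeconds(fileMb: Long, mbPerSec: Long): Double =
    fileMb.toDouble / mbPerSec

  // Upper-bound time when `readers` containers read whole blocks in parallel:
  // each reader handles at most ceil(blocks / readers) blocks.
  def parallelSeconds(fileMb: Long, blockMb: Long, mbPerSec: Long, readers: Int): Double = {
    val blocksPerReader = (blockCount(fileMb, blockMb) + readers - 1) / readers
    blocksPerReader * blockMb.toDouble / mbPerSec
  }

  def main(args: Array[String]): Unit = {
    println(blockCount(400, 100))              // 4
    println(sequentialSeconds(400, 100))       // 4.0
    println(parallelSeconds(400, 100, 100, 4)) // 1.0
  }
}
```

With fewer containers than blocks the model degrades gracefully: two readers would each handle two blocks and finish in about 2 seconds.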
HDFS can be used as a fault-tolerant filesystem without YARN, yes. But so can Ceph, MinIO, GlusterFS, etc., each of which can work with Spark.
That addresses storage. For processing, you can only configure one scheduler per Spark job, but the same code should be able to run in any environment, whether that be YARN, Spark Standalone, Mesos, or Kubernetes. Ideally, you would not install these together.
Therefore, neither HDFS nor YARN is required for Spark.
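One way to picture "the same code runs in any environment" is a job that hard-codes neither the cluster manager nor the storage system. The sketch below assumes the master is passed by spark-submit and the input path arrives as the first argument; the object name and the example URLs in the comments are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

object PortableJob {
  def main(args: Array[String]): Unit = {
    // No .master(...) here: the cluster manager is chosen at submit time, e.g.
    //   --master yarn
    //   --master spark://host:7077      (Spark Standalone)
    //   --master k8s://https://host:443 (Kubernetes)
    //   --master local[*]               (single machine)
    val spark = SparkSession.builder().appName("PortableJob").getOrCreate()

    // The storage system is selected by the URI scheme of the path,
    // for example hdfs://..., s3a://..., or file:///...
    val inputPath = args(0)
    val lineCount = spark.read.textFile(inputPath).count()
    println(s"lines: $lineCount")

    spark.stop()
  }
}
```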

How many connections will be built between Spark and HDFS when sc.textFile("hdfs://.....") is called

How many connections will be built between Spark and HDFS when sc.textFile("hdfs://.....") is called? The file on HDFS is very large (100G).
The main idea behind distributed systems, as designed and implemented in Hadoop and Spark, is to send the processing to the data. In other words, imagine that some data is located on HDFS data nodes in our cluster and we have a job which uses that data on the same workers. Each machine runs a data node and a Spark worker at the same time, and may run other processes such as an HBase region server too. When an executor runs one of its scheduled tasks, it retrieves the data it needs from the underlying data node. Each individual task retrieves its own data, so you may describe this as one connection to HDFS per task, made to its local data node.
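To get a feel for how many tasks (and therefore reads against data nodes) a 100G file implies, here is a rough estimate that assumes one input split per HDFS block and a 128 MB block size. Both are assumptions for the sketch; the real number depends on the cluster's block size and the minPartitions argument, and can be checked on a live RDD with getNumPartitions.

```scala
object InputSplitEstimate {
  // Rough number of input splits (and therefore read tasks) for a file,
  // assuming one split per HDFS block.
  def estimatedSplits(fileBytes: Long, blockBytes: Long): Long =
    (fileBytes + blockBytes - 1) / blockBytes

  def main(args: Array[String]): Unit = {
    val mib = 1024L * 1024L
    val gib = 1024L * mib
    // 100 GiB file with an assumed 128 MiB block size.
    println(estimatedSplits(100L * gib, 128L * mib)) // 800
  }
}
```

These tasks do not all run at once: at any moment, the number of concurrent reads is bounded by the number of executor cores available to the application.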

spark streaming print() method

According to Spark's documentation on output transformations:
print(): Prints the first ten elements of every batch of data in a DStream on the driver node running the streaming application. This
is useful for development and debugging.
according to the cluster overview documentation:
Spark applications run as independent sets of processes on a cluster,
coordinated by the SparkContext object in your main program (called
the driver program).
My question is: is the driver == master?
I.e., does print print at the driver?
my question is, is the driver == master?
No. The driver is the process where you initialize your SparkContext. It will live where you tell it to. For example, if you run your job using local[*], which works locally, the driver is initialized locally on your machine. If you run Spark in "client mode" on the default Standalone resource manager, it will start the driver from the location submitting the job. If you use "cluster mode", the driver will be dispatched to one of the Worker nodes in the cluster.
A master is a standalone process which is responsible for managing the cluster. It knows which workers it's managing, and it is its job to give you sufficient resources to run your driver such that you can utilize the cluster.
When you use DStream.print, the data will be sent to whichever location is running your driver. If you started your driver from a machine that also happens to be the machine running your master process, then that is the machine which will receive the data and print the output.
The master is a resource manager. It doesn't participate directly in data processing and it is not a part of the application.
print is executed on the driver which is the entry point of your application.
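The difference between code that runs on the driver and code that runs on executors can be sketched as follows. The socket source, host, and port are placeholders for the example; with a local master everything ends up in the same console, so the distinction only becomes visible on a real cluster.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object PrintOnDriverSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("PrintOnDriverSketch")
    val ssc = new StreamingContext(conf, Seconds(5))
    // Placeholder source for the sketch.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Prints the first ten elements of each batch in the driver's stdout.
    lines.print()

    lines.foreachRDD { rdd =>
      // The body of foreachRDD runs on the driver; take(10) pulls elements back to it.
      rdd.take(10).foreach(line => println(s"driver: $line"))

      // The closure passed to rdd.foreach runs on the executors,
      // so this output lands in the executor logs instead.
      rdd.foreach(line => println(s"executor: $line"))
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```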

Data locality in Spark Streaming

Recently I've been doing performance tests on Spark Streaming. I ran a receiver on one of the 6 slaves and submitted a simple Word Count application to the cluster (I know this configuration is not proper in practice; it is just a simple test). I analyzed the scheduling log and found that nearly 88% of the tasks are scheduled on the node where the receiver runs, the locality is always PROCESS_LOCAL, and the CPU utilization is very high. Why doesn't Spark Streaming distribute the data across the cluster and make full use of it? I've read the official guide and it does not explain this in detail, especially for Spark Streaming. Will it copy stream data to another node with free CPU and start a new task there when a task is on a node with a busy CPU? If so, how can we explain the former case?
When you run the stream receiver just on one of the 6 nodes, all the received data are processed on this node (that is the data locality).
Data are not distributed across other nodes by default. If you need the input stream to be repartitioned (balanced across cluster) before further processing, you can use
inputStream.repartition(<number of partitions>)
This distributes the received batches of data across the specified number of machines in the cluster before further processing.
You can read more about the level of parallelism in the Spark documentation:
https://spark.apache.org/docs/latest/streaming-programming-guide.html#performance-tuning
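A minimal sketch of where the repartition call fits in a word count like the one in the question. The socket source, host, port, and the partition count of 12 are assumptions for the example; a reasonable starting point for the count is the total number of executor cores.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RepartitionSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RepartitionSketch")
    val ssc = new StreamingContext(conf, Seconds(5))

    // A single receiver: all blocks initially live on the node that hosts it.
    val inputStream = ssc.socketTextStream("localhost", 9999)

    // Shuffle each batch into 12 partitions so later stages
    // can be scheduled on other nodes as well.
    val balanced = inputStream.repartition(12)

    val counts = balanced
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```

The repartition step costs a shuffle per batch, so it pays off mainly when the processing after it is heavier than moving the data.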

Spark Driver in Apache spark

I already have a cluster of 3 machines (ubuntu1, ubuntu2, ubuntu3, running as VirtualBox VMs) with Hadoop 1.0.0. I installed Spark on each of these machines. ub1 is my master node and the other nodes work as slaves. My question is: what exactly is a Spark driver? Should we set an IP and port for the Spark driver via spark.driver.host, and where will it be executed and located (master or slave)?
The spark driver is the program that declares the transformations and actions on RDDs of data and submits such requests to the master.
In practical terms, the driver is the program that creates the SparkContext, connecting to a given Spark Master. In the case of a local cluster, as in your case, the master_url=spark://<host>:<port>
Its location is independent of the master/slaves. You could co-locate it with the master or run it from another node. The only requirement is that it must be on a network addressable from the Spark Workers.
This is what the configuration of your driver looks like:
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
  .setMaster("master_url") // this is where the master is specified
  .setAppName("SparkExamplesMinimal")
  .set("spark.local.ip","xx.xx.xx.xx") // helps when multiple network interfaces are present. The driver must be in the same network as the master and slaves
  .set("spark.driver.host","xx.xx.xx.xx") // same as above. This duality might disappear in a future version
val sc = new SparkContext(conf)
// etc...
To explain a bit more on the different roles:
The driver prepares the context and declares the operations on the data using RDD transformations and actions.
The driver asks the master for resources, turns the RDD graph into stages and tasks, and submits the tasks to the executors on the workers. It coordinates the different job stages.
The workers are where the tasks are actually executed. They should have the resources and network connectivity required to execute the operations requested on the RDDs.
Your question is related to deploying Spark on YARN; see "Running Spark on YARN": http://spark.apache.org/docs/latest/running-on-yarn.html
Assume you start from a spark-submit --master yarn command:
The command will request the YARN Resource Manager (RM) to start an ApplicationMaster (AM) process on one of your cluster machines (those that have a YARN Node Manager installed).
Once the AM has started, it will call your driver program's main method. So the driver is actually where you define your Spark context, your RDDs, and your jobs. The driver contains the entry main method which starts the Spark computation.
The Spark context will prepare an RPC endpoint for the executors to talk back to, and a lot of other things (memory store, disk block manager, Jetty server...).
The AM will request containers from the RM to run your Spark executors, with the driver RPC URL (something like spark://CoarseGrainedScheduler@ip:37444) specified in the executor's start command.
The yellow box "Spark context" is the driver.
A Spark driver is the process that creates and owns an instance of SparkContext. It is your
Spark application that launches the main method in which the instance of SparkContext is
created. It is the cockpit of job and task execution (using DAGScheduler and
TaskScheduler). It hosts the Web UI for the environment.
It splits a Spark application into tasks and schedules them to run on executors.
A driver is where the task scheduler lives and spawns tasks across workers.
A driver coordinates workers and overall execution of tasks.
In simple terms, the Spark driver is a program which contains the main method (the main method is the starting point of your program). So, in Java, the driver will be the class which contains public static void main(String args[]).
In a cluster, you can run this program in one of two ways:
1) On a remote host machine. Here you'll have to provide the remote host machine details while submitting the driver program to the remote host. The driver runs in a JVM process created on the remote machine and only comes back with the final result.
2) Locally from your client machine (your laptop). Here the driver program runs in a JVM process created locally on your machine. From there it sends the tasks to the remote hosts and waits for the result of each task.
If you submit with --deploy-mode cluster (config "spark.submit.deployMode = cluster"), then your driver will be launched on one of your worker hosts (ubuntu2 or ubuntu3).
If the deploy mode is client, which is the default value, then the driver will run on the machine from which you submit your application.
Finally, you can see your application in the web UI at http://driverhost:driver_ui_port, where driver_ui_port defaults to 4040; you can change the port by setting the config "spark.ui.port".
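The choice of deploy mode can also be made programmatically. The following is only a sketch using Spark's launcher API: the jar path and main class are hypothetical, the master URL assumes the standalone master runs on ubuntu1 with the default port 7077, and SPARK_HOME is assumed to be set in the environment.

```scala
import org.apache.spark.launcher.SparkLauncher

object DeployModeSketch {
  def main(args: Array[String]): Unit = {
    val handle = new SparkLauncher()
      .setAppResource("/path/to/app.jar")   // hypothetical application jar
      .setMainClass("com.example.MyApp")    // hypothetical driver class
      .setMaster("spark://ubuntu1:7077")    // assumed standalone master URL
      .setDeployMode("cluster")             // "client" would keep the driver on this machine
      .setConf("spark.ui.port", "4050")     // move the driver web UI off the default 4040
      .startApplication()

    // The handle reports the application's state as it progresses.
    println(handle.getState)
  }
}
```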
The Spark driver is the node that allows the application to create a SparkContext; the SparkContext is the connection to the compute resources.
The driver can run on the box from which it is submitted, or it can run on one of the nodes of the cluster when using a resource manager like YARN.
Both the client and cluster options have some tradeoffs, such as:
Access to the CPU/memory of one of the nodes in the cluster; sometimes this is good because a cluster node will be big in terms of memory.
Driver logs are on the cluster node vs. the local box from which the job was submitted.
You should have a history server for cluster mode, otherwise driver-side logs are lost.
Sometimes it is hard to install a dependency (e.g. some native dependency) on the executors, and running the Spark application in client mode comes to the rescue.
If you want to read more on Spark job anatomy, the post at http://ashkrit.blogspot.com/2018/09/anatomy-of-apache-spark-job.html could be useful.
Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program).
Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (either Spark’s own standalone cluster manager, Mesos or YARN), which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run.
Spark cluster components
There are several useful things to note about this architecture:
Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. This has the benefit of isolating applications from each other, on both the scheduling side (each driver schedules its own tasks) and executor side (tasks from different applications run in different JVMs). However, it also means that data cannot be shared across different Spark applications (instances of SparkContext) without writing it to an external storage system.
Spark is agnostic to the underlying cluster manager. As long as it can acquire executor processes, and these communicate with each other, it is relatively easy to run it even on a cluster manager that also supports other applications (e.g. Mesos/YARN).
The driver program must listen for and accept incoming connections from its executors throughout its lifetime (e.g., see spark.driver.port in the network config section). As such, the driver program must be network addressable from the worker nodes.
Because the driver schedules tasks on the cluster, it should be run close to the worker nodes, preferably on the same local area network. If you’d like to send requests to the cluster remotely, it’s better to open an RPC to the driver and have it submit operations from nearby than to run a driver far away from the worker nodes.
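The network requirement on the driver can be expressed in configuration. This is a sketch only: the address and port values are made up, and in practice they are often passed with --conf on spark-submit rather than set in code.

```scala
import org.apache.spark.SparkConf

object DriverNetworkSketch {
  def main(args: Array[String]): Unit = {
    // `false` skips loading spark.* system properties, so only the values below are present.
    val conf = new SparkConf(false)
      .setAppName("DriverNetworkSketch")
      // Address that executors use to reach the driver (hypothetical value).
      .set("spark.driver.host", "10.0.0.5")
      // Fixed port instead of a random one, which simplifies firewall rules.
      .set("spark.driver.port", "40000")
      // Local interface the driver binds to; useful behind NAT or in containers.
      .set("spark.driver.bindAddress", "0.0.0.0")

    // Show the resulting settings.
    conf.getAll.sorted.foreach { case (key, value) => println(s"$key=$value") }
  }
}
```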

Resources