How Spark internally works when reading HDFS files - apache-spark

Say I have a file of 256 KB is stored on HDFS file system of one node (as two blocks of 128 KB each). This file internally contains two blocks
of 128 KB each. Assume I have two nodes cluster of each 1 core only. My understanding is that spark during transformation will read complete file
on one node in memory and then transfer one file block memory data to other node so that both nodes/cores can parallely execute it ? Is that correct ?
What if both nodes had two core each instead of one core ? In that case two cores on single node could do the computation ? Is that right ?
val text = sc.textFile("mytextfile.txt")
val counts = text.flatMap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_+_)
counts.collect

You question is a little hypothetical as it is unlikely you would have an Hadoop Cluster with HDFS existing with only one Data Node and 2 Worker Nodes - one being both Worker and Data Node. That is to say, the whole idea of Spark (and MR) with HDFS is to bring the processing to the data. The Worker Nodes are in fact the Data Nodes in the standard Hadoop set up. This is the original intent.
Some variations to answer your question:
Assuming the case as per above described, each Worker Node would process one partition and subsequent transformations on the newer generated RDDs until finished. You may of course repartition the data and what happens depends on the number of partitions and number of Executors per Worker Node.
In a nutshell: if you have N blocks / partitions initially and less than N Executors allocated - E - on a Hadoop Cluster with HDFS, then you will get some transfer of blocks (not a shuffle as is talked about elsewhere) to the Workers assigned, from Workers where no Executor was allocated to you Spark App, otherwise the block is assigned to be processed to that Data / Worker Node, obviously. Each block / partition is processed in some way, shuffled and the next set of Partitions or Partition read in and processed, depending on speed of processing for your transformation(s).
In the case of AWS S3 and Mircosoft's and gooogle's equivalent Cloud Storage which leave aside the principle of data locality as in the above case - i.e. compute power is divorced from storage, with the assumption that the network is not the bottleneck - which was exactly the Hadoop classic reason to bring the processing to the data, then it works similarly to the aforementioned, i.e. transfer of S3 data to Workers.
All of this assume an Action has been invoked.
I leave aside the principles of Rack Awareness, etc. as it becomes all quite complicated, but the Resource Managers understand these things and decide accordingly.

In the first case, Spark will usually load 1 partition on the first node and then if it cannot find an empty core, it will load the 2nd partition on the 2nd node after waiting for spark/locality.wait (default 3 seconds).
In the 2nd case both partitions will be loaded on the same node unless it does not have both cores free.
Many circumstances can cause this to change if you play with the default configurations.

Related

Is the smallest unit in Apache Spark is the block?

If it is block then, what is the desired block size? What is recommended? Is there some standard size? If not how do I know what block size should I keep?
Lets say sc=streaming context.
Is sc.awaitTermination() used in production.
Is await for termination is the only way?
Lets say a block got corrupted. Since there is a fault tolerance? Is it so that the block will get recovered by taking it from another replicated block from some other executor.
Can different executors have different memory size?
If so, say there are 3 executors
ex1 = 10gb
ex2 = 10gb
ex3 = 5gb
Assume replication factor is configured of 2.
How will replication work in this case. If an rdd of size lets say 8gb needs to be replicated. Then wont it fail? ex1 having say 8gb rdd size cannot be replicated into executor ex3 due to low memory? Then how fault tolerance is achieved? Does spark knows where to replicate what?Is it according to the size that it checks if it can be replicated into that particular node, if it can then replicate else dont? Then if node ex1 itself fails, then there is no fault tolerance and everything is lost?How is it handle in this scenario ?
Block size is the concern of storage system like HDFS, GFS, S3, Azure Blob Storage etc. Spark is a processing engine and it just accommodated to the bloc size of the storage system to create partitions. It would create 1 partition per block so that large files are can be processed in parallel. All the storage systems have a default block size so you needn't worry about setting it though you can certainly override it.
Similarly data duplication is also a concern of storage layer. It copies a block of data into say 3 (replication factor) blocks and in case of failure it can recover data from other blocks.
Spark fits into this equation by being the processing engine operating on this distributed file system. It is responsible for distributing the compute workload where data is located, it can recover from failure on a given node which is similar to data recovery but a different beast.

SparkDataframe.load(),when I execute a load command where actually my data get stored?

If I am loading one table from cassandra using spark dataframe.load().Where will my data gets loaded.Is it in spark memory.Or in datanode blocks ,if I am using yarn resource manager.
It will try to store in memory per number of partitions on the Worker Nodes / which in this context is a slightly better term than Data Nodes.
It will spill to disk if not enough memory on the Worker Nodes.
Per number of Cores / Executors, processing will occur. E.g. if you have, say, 20 Executors with 1 Core each, your concurrency of processing is 20 and spilling will occur via eviction. If you run out of disk, an error will result.
Worker Nodes is a better term here compared to Data Nodes, unless you have HDFS and processing locally, then Worker Node is equal to Data Node. Although you could argue what's in a name?
Of course, an Action will need to have been initiated.
And repartition and join or union latterly in the data pipeline affect things, but that goes without saying.

Will spark load data into in-memory if data is 10 gb and RAM is 1gb

If i have cluster of 5 nodes, each node having 1gb ram, now if my data file is 10gb distributed in all 5 nodes, let say 2gb in each node, now if i trigger
val rdd = sc.textFile("filepath")
rdd.collect
will spark load data into the ram and how spark will deal with this scenario
will it straight away deny or will it process it.
Lets understand the question first #intellect_dp you are asking, you have a cluster of 5 nodes (here the term "node" I am assuming machine which generally includes hard disk,RAM, 4 core cpu etc.), now each node having 1 GB of RAM and you have 10 GB of data file which is distributed in a manner, that 2GB of data is residing in the hard disk of each node. Here lets assume that you are using HDFS and now your block size at each node is 2GB.
now lets break this :
each block size = 2GB
RAM size of each node = 1GB
Due to lazy evaluation in spark, only when "Action API" get triggered, then only it will load your data into the RAM and execute it further.
here you are saying that you are using "collect" as an action api. Now the problem here is that RAM size is less than your block size, and if you process it with all default configuration (1 block = 1 partition ) of spark and considering that no further node will going to add up, then in that case it will give you out of memory exception.
now the question - is there any way spark can handle this kind of large data with the given kind of hardware provisioning?
Ans - yes, first you need to set default minimum partition :
val rdd = sc.textFile("filepath",n)
here n will be my default minimum partition of block, now as we have only 1gb of RAM, so we need to keep it less than 1gb, so let say we take n = 4,
now as your block size is 2gb and minimum partition of block is 4 :
each partition size will be = 2GB/4 = 500mb;
now spark will process this 500mb first and will convert it into RDD, when next chunk of 500mb will come, the first rdd will get spill to hard disk (given that you have set the storage level "MEMORY_AND_DISK_ONLY").
In this way it will process your whole 10 GB of data file with the given cluster hardware configuration.
Now I personally will not recommend the given hardware provisioning for such case,
as it will definitely process the data, but there are few disadvantages :
firstly it will involve multiple I/O operation making whole process very slow.
secondly if any lag occurs in reading or writing to the hard disk, your whole job will get discarded, you will go frustrated with such hardware configuration. In addition to it you will never be sure that will spark process your data and will be able to give result when data will increase.
So try to keep very less I/O operation, and
Utilize in memory computation power of spark with an adition of few more resources for faster performance.
When you use collect all the data send is collected as array only in driver node.
From this point distribution spark and other nodes does't play part. You can think of it as a pure java application on a single machine.
You can determine driver's memory with spark.driver.memory and ask for 10G.
From this moment if you will not have enough memory for the array you will probably get OutOfMemory exception.
In the otherhand if we do so, Performance will be impacted, we will not get the speed we want.
Also Spark store only results in RDD, so I can say result would not be complete data, any worst case if we are doing select * from tablename, it will give data in chunks , what it can affroad....

Spark pulling data into RDD or dataframe or dataset

I'm trying to put into simple terms when spark pulls data through the driver, and then when spark doesn't need to pull data through the driver.
I have 3 questions -
Let's day you have a 20 TB flat file file stored in HDFS and from a driver program you pull it into a data frame or an RDD, using one of the respective libraries' out of the box functions (sc.textfile(path) or sc.textfile(path).toDF, etc). Will it cause the driver program to have OOM if the driver is run with only 32 gb memory? Or at least have swaps on the driver Jim? Or will spark and hadoop be smart enough to distribute the data from HDFS into a spark executor to make a dataframe/RDD without going through the driver?
The exact same question as 1 except from an external RDBMS?
The exact same question as 1 except from a specific nodes file system (just Unix file system, a 20 TB file but not HDFS)?
Regarding 1
Spark operates with distributed data structure like RDD and Dataset (and Dataframe before 2.0). Here are the facts that you should know about this data structures to get the answer to your question:
All the transformation operations like (map, filter, etc.) are lazy.
This means that no reading will be performed unless you require a
concrete result of your operations (like reduce, fold or save the
result to some file).
When processing a file on HDFS Spark operates
with file partitions. Partition is a minimal logical batch of data
the can be processed. Normally one partition equals to one HDFS
block and the total number of partitions can never be less then
number of blocks in a file. The common (and default one) HDFS block size is 128Mb
All actual computations (including reading from the HDFS) in RDD and
Dataset are performed inside of executors and never on driver. Driver
creates a DAG and logical plan of execution and assigns tasks to
executors for further processing.
Each executor runs the previously
assigned task against a particular partition of data. So normally if you allocate only one core to your executor it would process no more than 128Mb (default HDFS block size) of data at the same time.
So basically when you invoke sc.textFile no actual reading happens. All mentioned facts explain why OOM doesn't occur while processing even 20 Tb of data.
There are some special cases like i.e. join operations. But even in this case all executors flush their intermediate results to local disk for further processing.
Regarding 2
In case of JDBC you can decide how many partitions will you have for your table. And choose the appropriate partition key in your table that will split the data into partitions properly. It's up to you how many data will be loaded into a memory at the same time.
Regarding 3
The block size of the local file is controlled by the fs.local.block.size property (I guess 32Mb by default). So it is basically the same as 1 (HDFS file) except the fact that you will read all data from one machine and one physical disk drive (which is extremely inefficient in case of 20TB file).

how does hashpartitioner work in spark?

Say I have lots data in a couple of s3 files, about 5 GB each, which I read in using sc.textFile
I need to join the data from the two files, therefore, I opt to use the HashPartitioner technique, and I set a partition count of 20. The submitted job to 8 worker nodes fails without any meaningful messages. Now I am thinking maybe I need to pick a proper number of partitions.
Obviously, the idea for spark to partition up all the data based on a chosen key. In order to load them up into 20 partitions, I imagine spark will have to read thru every line of data, compute its hash, and load into the memory of the matching partition, which resides in one of the 8 worker nodes. If there is enough collective memory in the worker nodes, I assume this goes smoothly. At the end of the read, all the data is in the proper partition, in the right node's memory. Am I right so far?
However, if the total memory can not fit all the data, I imagine Spark will work on certain partitions first. And after processing these first partitions, it flushes the original partitions and reads from the source files again, loading remaining data into new partitions. This would mean reading the same file as many time as necessary to process all partitions using available memory. Is this also correct?
Should I should calculate the number of partitions so that at least one full partition would fit into a single node's memory. Are there other guidelines to follow?

Resources