Data distribution in Apache Spark

I'm new to Spark and have a general question. As far as I know, the whole file must be available on all worker nodes to be processed. If so, how do they know which partition each of them should read? The driver controls the partitions, but how does the driver tell the workers which partition to read?

Each RDD is divided into multiple partitions. To compute each partition, Spark generates a task and assigns it to a worker node. When the driver sends a task to a worker, it also specifies the PartitionID for that task.
The worker then executes the task by chaining the RDD's iterators all the way back to the InputRDD, passing along the PartitionID. The InputRDD determines which part of the input corresponds to the specified partition ID and returns that data.
rddIter.next -> parentRDDIter.next -> grandParentRDDIter.next -> ... -> InputRDDIter.next
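As a rough sketch of how this looks from user code (not the internal code path itself), mapPartitionsWithIndex exposes the index of the partition each task is computing, which plays the role of the PartitionID described above. This assumes a spark-shell session where sc is available; the numbers are arbitrary:
// 8 elements split into 4 partitions -> Spark will schedule 4 tasks.
val rdd = sc.parallelize(1 to 8, numSlices = 4)

// Each task receives the index of the partition it is computing.
val tagged = rdd.mapPartitionsWithIndex { (partitionId, iter) =>
  iter.map(x => s"partition $partitionId -> element $x")
}
tagged.collect().foreach(println)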

Spark tries to read data into an RDD from the nodes that are close to it (data locality). Since Spark usually accesses distributed, partitioned data, it creates partitions to hold the data chunks, which helps optimize transformation operations.
https://github.com/jaceklaskowski/mastering-apache-spark-book

Related

Is every operation you write in a Spark job performed on the Spark cluster?

Let's say I have a simple operation, for example
val a = 12 + 4
Will it still be distributed by the driver across the cluster?
Let's say I have a map, say a Map[String,String] with, hypothetically, 1,000,000 key-value pairs.
Now when I call get("something"), will the lookup be distributed across the cluster to fetch that value?
If not, then what is the use of Spark if it doesn't distribute even simple computations?
Also, how does Spark determine the number of tasks, and the number of jobs?
If there is a stream and some action is performed for each batch, is a new job created for each batch?
Answers:
No, this is still computed on the driver side.
If you create the map in the driver program, it stays on the driver. If you access a key, Spark simply looks it up in the map held in driver memory and returns the value.
If you create an RDD out of the collection (e.g. with sc.parallelize) and then run a transformation on it, that will run on the Spark cluster (see the sketch below).
The number of tasks usually corresponds to the number of partitions. You can explicitly specify how many partitions you want when you parallelize a collection (like the map in your case).
Yes, a job is created for the action performed on each batch.
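A minimal sketch of the difference, assuming a spark-shell session where sc is available (the keys, values and partition count are made up):
val lookup: Map[String, String] = Map("a" -> "1", "b" -> "2")   // plain Scala map, lives in driver memory
val v = lookup.get("a")                                         // ordinary JVM lookup: no job, no tasks

// Only once the collection becomes an RDD do transformations run on the cluster.
val rdd = sc.parallelize(lookup.toSeq, numSlices = 4)           // distributed as 4 partitions
val hits = rdd.filter { case (k, _) => k == "a" }               // transformation: lazy, nothing runs yet
hits.collect()                                                  // action: triggers a job with 4 tasks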

Does Spark distribute DataFrames across nodes internally?

I am trying to use Spark to process a CSV file on a cluster. I want to understand whether I need to explicitly read the file on each of the worker nodes to process it in parallel, or whether the driver node reads the file and distributes the data across the cluster internally. (I am working with Spark 2.3.2 and Python.)
I know RDDs can be parallelized using SparkContext.parallelize(), but what about Spark DataFrames?
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName('myApp').getOrCreate()
    df = spark.read.csv('dataFile.csv', header=True)
    df = df.filter("date > '2010-12-01' AND date <= '2010-12-02' AND town == 'Madrid'")
So if I run the above code on a cluster, will the entire operation be done by the driver node, or will the DataFrame be distributed across the cluster with each worker processing its own partition?
To be strict, if you run the above code it will not read or process any data. DataFrames are basically an abstraction implemented on top of RDDs. As with RDDs, you have to distinguish transformations from actions. As your code only consists of a single filter(...) transformation, nothing will happen in terms of reading or processing data. Spark will only create the DataFrame, which is an execution plan. You have to perform an action like count() or write.csv(...) to actually trigger processing of the CSV file.
If you do so, the data will be read and processed by 1..n worker nodes. It is never read or processed by the driver node. How many of your worker nodes are actually involved depends, for your code, on the number of partitions of your source file. Each partition of the source file can be processed in parallel by one worker node. In your example it is probably a single CSV file, so when you call df.rdd.getNumPartitions() after reading the file, it should return 1. Hence, only one worker node will read the data. The same is true if you check the number of partitions after your filter(...) operation.
Here are two ways in which the processing of your single CSV file can be parallelized:
You can manually repartition your source DataFrame by calling df.repartition(n), with n being the number of partitions you want. But, and this is a significant but, this means that all of the data is potentially sent over the network (a shuffle)!
You can perform aggregations or joins on the DataFrame. These operations have to trigger a shuffle. Spark then uses the number of partitions specified in spark.sql.shuffle.partitions (default: 200) to partition the resulting DataFrame.
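Here is a sketch of those checks (the question uses PySpark, but the DataFrame API is parallel, so this is written in Scala like the other examples in this post). The file path and column name come from the question, and spark is assumed to be the spark-shell session:
val df = spark.read.option("header", "true").csv("dataFile.csv")
df.rdd.getNumPartitions                 // typically 1 for a single small CSV file

val wider = df.repartition(8)           // explicit repartition: shuffles the data into 8 partitions
wider.rdd.getNumPartitions              // 8

// Aggregations and joins introduce a shuffle governed by spark.sql.shuffle.partitions.
val counts = df.groupBy("town").count()
counts.rdd.getNumPartitions             // 200 unless spark.sql.shuffle.partitions is changed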

In Spark Streaming, can I create an RDD on a worker?

I want to know how I can create an RDD on a worker, say one containing a Map. This Map/RDD will be small, and I want it to reside completely on one machine/executor (I guess repartition(1) can achieve this). Further, I want to be able to cache this Map/RDD on the local executor and use it in tasks running on that executor for lookups.
How can I do this?
No, you cannot create an RDD on a worker node. Only the driver can create RDDs.
A broadcast variable seems to be the solution in your situation. It sends the data to all workers, but if your map is small that isn't an issue.
You cannot control which node a partition of your RDD ends up on, so repartition(1) isn't enough: you don't know whether that single partition will land on the node you want. A broadcast variable, on the other hand, will be on every node, so lookups will be very fast (see the sketch below).
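A minimal broadcast sketch, assuming a spark-shell session where sc is available and a made-up lookup map:
val lookup = Map("id1" -> "Alice", "id2" -> "Bob")   // small map built on the driver
val bcLookup = sc.broadcast(lookup)                  // shipped once to each executor and cached there

val ids = sc.parallelize(Seq("id1", "id2", "id3"))
// Tasks read bcLookup.value from local executor memory: no shuffle, no round-trip to the driver.
val names = ids.map(id => bcLookup.value.getOrElse(id, "unknown"))
names.collect()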
You can create an RDD in your driver program using sc.parallelize(data). To store a Map, it can be split into keys and values and stored in an RDD/DataFrame as two separate columns.

Does the repartition in repartitionAndSortWithinPartitions happen on the driver or on a worker?

I am trying to understand repartitionAndSortWithinPartitions in Spark Streaming: does the repartition happen on the driver or on the workers? If it happens on the driver, does a worker wait for all the data to arrive before the sorting happens?
Like any other transformation, it is handled by the executors. Data is not passed through the driver. In other words, this is the standard shuffle mechanism and there is nothing streaming-specific here.
The destination of each record is determined by:
Its key.
Partitioner used for a given shuffle.
Number of partitions.
and data is passed directly between executor nodes.
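For example (a sketch in spark-shell with sc available; the keys and values are arbitrary), the executors both route each record by key and sort each resulting partition locally:
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3), ("a", 4)))
// Shuffle: each record's destination is determined by its key, the HashPartitioner,
// and the number of partitions (2); data moves executor-to-executor, not via the driver.
val partitioned = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(2))
// Each output partition is sorted by key locally on its executor.
val byPartition = partitioned.mapPartitionsWithIndex { (id, it) => it.map(kv => s"partition $id: $kv") }
byPartition.collect().foreach(println)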
From the comments it looks like you're more interested in the Spark Streaming architecture. If that's the case, you should take a look at Diving into Apache Spark Streaming’s Execution Model. To give you an overview, there are two different types of streams:
Receiver-based with a receiver node per stream.
Direct (without receiver) where only metadata is assigned to executors but data is fetched directly.

How to write a large file to disk on each worker

Newbie to Spark here. I have a large data file that needs to be written to disk on each worker before my application runs. RDD.mapPartitions seems like what I should be using, but I'm not sure whether each worker contains only a single partition, or how to create an RDD guaranteed to have a partition on each worker.
A few things:
RDD.saveAsTextFile will write to disk from each worker; all you need to do is make sure you have the right number of partitions (you probably want to set the number of partitions to the number of cores available to the workers in the cluster). For example:
val files = sc.textFile("file:...")
val prt = files.repartition(5)
prt.saveAsTextFile("file:...")
Also note that RDD.mapPartitions executes the map function over all the elements in a partition. Like map, it is a transformation, which means it is expected to transform the data and is evaluated lazily.
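If the goal is an eager, per-partition write to each worker's local disk, a sketch using foreachPartition (an action, so it runs immediately on the executors) could look like the following. The input path is left elided as in the example above, and the output path is hypothetical:
import java.io.{File, PrintWriter}
import org.apache.spark.TaskContext

val data = sc.textFile("file:...")      // elided input path, as above
val prt  = data.repartition(5)          // one partition per intended output file

prt.foreachPartition { iter =>
  val id  = TaskContext.getPartitionId()                                // which partition this task holds
  val out = new PrintWriter(new File(s"/tmp/worker-output-$id.txt"))    // local disk on the executor
  try iter.foreach(line => out.println(line)) finally out.close()
}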

Resources