Does Spark distribute a DataFrame across nodes internally? - apache-spark

I am trying to use Spark to process a CSV file on a cluster. I want to understand whether I need to explicitly read the file on each of the worker nodes to do the processing in parallel, or whether the driver node will read the file and distribute the data across the cluster for processing internally. (I am working with Spark 2.3.2 and Python.)
I know RDDs can be parallelized using SparkContext.parallelize(), but what about Spark DataFrames?
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName('myApp').getOrCreate()
    df = spark.read.csv('dataFile.csv', header=True)
    df = df.filter("date > '2010-12-01' AND date <= '2010-12-02' AND town == 'Madrid'")
So if I run the above code on a cluster, will the entire operation be done by the driver node, or will Spark distribute df across the cluster so that each worker processes its own data partition?

Strictly speaking, if you run the above code it will not read or process any data. DataFrames are basically an abstraction implemented on top of RDDs. As with RDDs, you have to distinguish between transformations and actions. As your code only consists of one filter(...) transformation, nothing will happen in terms of reading or processing data. Spark will only create the DataFrame, which is an execution plan. You have to perform an action like count() or write.csv(...) to actually trigger processing of the CSV file.
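Continuing from the question's code, a minimal sketch of how an action triggers the work (the output path is a hypothetical example):

# df is the filtered DataFrame from the question; nothing has been read yet
row_count = df.count()            # action: triggers the read and the filter on the workers
print(row_count)
df.write.csv('filteredOutput')    # hypothetical output path; also triggers a job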
If you do so, the data will then be read and processed by 1..n worker nodes. It is never read or processed by the driver node. How many of your worker nodes are actually involved depends -- in your case -- on the number of partitions of your source file. Each partition of the source file can be processed in parallel by one worker node. In your example it is probably a single CSV file, so when you call df.rdd.getNumPartitions() after reading the file, it should return 1. Hence, only one worker node will read the data. The same is true if you check the number of partitions after your filter(...) operation.
Here are two ways in which the processing of your single CSV file can be parallelized:
You can manually repartition your source DataFrame by calling df.repartition(n), with n being the number of partitions you want. But -- and this is a significant but -- this means that all data is potentially sent over the network (aka shuffled); see the sketch after this list.
You perform aggregations or joins on the DataFrame. These operations have to trigger a shuffle. Spark then uses the number of partitions specified in spark.sql.shuffle.partitions (default: 200) to partition the resulting DataFrame.
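A minimal sketch of both options, reusing the DataFrame from the question (the partition count of 8 and the "town" aggregation are illustrative assumptions):

# Option 1: explicit repartitioning (shuffles all rows across the cluster)
print(df.rdd.getNumPartitions())   # likely 1 for a single small CSV file
df = df.repartition(8)             # 8 is an arbitrary example value
print(df.rdd.getNumPartitions())   # now 8

# Option 2: an operation that has to shuffle, e.g. an aggregation
agg = df.groupBy('town').count()
print(agg.rdd.getNumPartitions())  # governed by spark.sql.shuffle.partitions (default 200)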

Related

From where does an RDD load data in Spark?

From where does Spark load data for an RDD? Is the data already present on the executor nodes, or does Spark shuffle data from the driver node first?
The name itself - RDD (Resilient Distributed Dataset) - indicates that the data resides across the executors whenever you create it.
Let's say you run parallelize() on 100 entries; Spark will distribute those 100 entries across your executors so that each executor has its own chunk of data for distributed processing.
Shuffling happens when you perform operations like repartition() (note that coalesce() avoids a full shuffle by default).
Also, if you run functions like collect(), Spark will pull all the data from the executors and bring it to the driver (and you lose the ability to do distributed processing).
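A minimal sketch of this behaviour, assuming a local SparkContext (the element and partition counts are illustrative):

from pyspark import SparkContext

sc = SparkContext(appName='parallelizeSketch')

# 100 entries are split across 4 partitions, which the executors process independently
rdd = sc.parallelize(range(100), numSlices=4)
print(rdd.getNumPartitions())   # 4

# collect() pulls everything back to the driver -- fine for 100 entries,
# but it gives up distributed processing for large datasets
print(rdd.collect()[:5])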
This reference has more details about the internals of Spark - Apache Spark architecture

How does MLlib code run on Spark?

I am new to distributed computing, and I'm trying to run k-means on EC2 using Spark's MLlib KMeans. While reading through the tutorial I found the following code snippet at
http://spark.apache.org/docs/latest/mllib-clustering.html#k-means
I am having trouble understanding how this code runs inside the cluster. Specifically, I'm having trouble understanding the following:
After submitting the code to the master node, how does Spark know how to parallelize the job? There seems to be no part of the code that deals with this.
Is the code copied to all nodes and executed on each node? Does the master node do computation?
How do nodes communicate the partial results of each iteration? Is this handled inside the KMeans.train code, or does Spark core take care of it automatically?
Spark divides data into many partitions. For example, if you read a file from HDFS, the partitions should match the partitioning of the data in HDFS. You can manually specify the number of partitions by calling repartition(numberOfPartitions). Each partition can be processed on a separate node, thread, etc. Sometimes data is partitioned by, e.g., a HashPartitioner, which looks at the hash of the data.
The number and size of partitions generally tell you whether the data is distributed/parallelized correctly. The creation of data partitions is hidden in the RDD.getPartitions methods.
Resource scheduling depends on the cluster manager. One could write a very long post about them ;) I think that for this question, partitioning is the most important part. If not, please let me know and I will edit the answer.
Spark serializes the closures that are given as arguments to transformations and actions. Spark creates a DAG, which is sent to all executors, and the executors execute this DAG on the data - launching the closures on each partition.
Currently, after each iteration, data is returned to the driver and then the next job is scheduled. In the Drizzle project, AMPLab/RISELab is adding the possibility of creating multiple jobs at one time, so data won't be sent back to the driver. It will create the DAG once and schedule, e.g., a job with 10 iterations; the shuffles between them will be limited or will not exist at all. Currently the DAG is created in each iteration and a job is scheduled to the executors.
There is a very helpful presentation about resource scheduling in Spark and Spark Drizzle.
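For context, a minimal sketch of the kind of MLlib k-means call the question refers to (the data and parameters are illustrative, not the tutorial's exact snippet):

from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName='kmeansSketch')

# Each element is a feature vector; the partitions are processed in parallel by the executors
points = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]], numSlices=2)

# KMeans.train launches Spark jobs internally; the driver only coordinates the iterations
model = KMeans.train(points, k=2, maxIterations=10)
print(model.clusterCenters)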

How to enable dynamic repartitioning in Spark Streaming for uneven data load

I have a use case where the input stream data is skewed; the volume of data can range from 0 to 50,000 events per batch. Each data entry is independent of the others. Therefore, to avoid the shuffle caused by repartitioning, I want to use some kind of dynamic repartitioning based on the batch size. I cannot get the size of the batch using the DStream count.
My use case is very simple: I have an unknown volume of data coming into the Spark Streaming process that I want to process in parallel and save to a text file. To run this in parallel I am using repartition, which has introduced a shuffle. I want to avoid the shuffle caused by repartition.
I want to know what the recommended approach is for handling data skew in a Spark Streaming application.
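The question is left open here; purely as an illustration of what "dynamic repartitioning based on the batch size" could look like, a hedged sketch using foreachRDD (the threshold, partition cap, and output path are arbitrary assumptions, not a recommended answer):

def save_batch(time, rdd):
    count = rdd.count()                          # the batch size is known inside foreachRDD
    if count == 0:
        return
    parts = max(1, min(count // 10000 + 1, 8))   # arbitrary heuristic
    # repartition still shuffles, but to a partition count chosen per batch
    rdd.repartition(parts).saveAsTextFile('output/batch-' + str(time))  # hypothetical path

# assuming an existing DStream called `events`:
# events.foreachRDD(save_batch)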

Data distribution in Apache Spark

I'm new to Spark and have a general question. As far as I know, the whole file must be available on all worker nodes to be processed. If so, how do they know which partition to read? The driver controls the partitions, but how does the driver tell them which partition to read?
Each RDD is divided into multiple partitions. To compute each partition, Spark generates a task and assigns it to a worker node. When the driver sends a task to the worker, it also specifies the partition ID of that task.
The worker then executes the task by chaining the RDD's iterators all the way back to the InputRDD, passing along the partition ID. The InputRDD determines which part of the input corresponds to the specified partition ID and returns the data.
rddIter.next -> parentRDDIter.next -> grandParentRDDIter.next -> ... -> InputRDDIter.next
Spark tries to read data into an RDD from the nodes that are close to it. Since Spark usually accesses distributed partitioned data, to optimize transformation operations it creates partitions to hold the data chunks.
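To see that each task knows which partition it is handling, a small sketch using mapPartitionsWithIndex (the input path is a hypothetical example):

from pyspark import SparkContext

sc = SparkContext(appName='partitionIdSketch')
lines = sc.textFile('hdfs:///data/input.txt')   # hypothetical path

# The index argument is the partition ID that the driver assigned to the task
counts = lines.mapPartitionsWithIndex(
    lambda idx, it: [(idx, sum(1 for _ in it))]
).collect()
print(counts)   # e.g. [(0, 1200), (1, 1187), ...]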
https://github.com/jaceklaskowski/mastering-apache-spark-book

Location of HadoopPartition

I have a dataset in a CSV file that occupies two blocks in HDFS and is replicated on two nodes, A and B. Each node has a copy of the dataset.
When Spark starts processing the data, I have seen two ways in which Spark loads the dataset as input. It either loads the entire dataset into memory on one node and performs most of the tasks there, or it loads the dataset onto two nodes and spreads the tasks across both nodes (based on what I observed on the history server). In both cases, there is sufficient capacity to keep the whole dataset in memory.
I repeated the same experiment multiple times, and Spark seemed to alternate between these two ways. Supposedly Spark inherits the input split locations as in a MapReduce job. From my understanding, MapReduce should be able to take advantage of two nodes. I don't understand why Spark or MapReduce would alternate between the two cases.
When only one node is used for processing, the performance is worse.
When you're loading the data in Spark, you can specify the minimum number of splits, and this will force Spark to load the data on multiple machines (with the textFile API you would add minPartitions=2 to your call).
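A minimal sketch of that call, assuming the dataset lives at a hypothetical HDFS path:

from pyspark import SparkContext

sc = SparkContext(appName='minPartitionsSketch')

# minPartitions asks Spark for at least two input splits,
# so both HDFS blocks can be processed on different nodes
lines = sc.textFile('hdfs:///data/dataset.csv', minPartitions=2)
print(lines.getNumPartitions())   # >= 2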

Resources