When I run QueryExecution.prepareExecutedPlan(sparkSession, logicalPlan) in Spark, I get a physical plan with broadcast exchange nodes. When I then call node.execute().getNumPartitions on every node of the physical plan (except the BroadcastExchange nodes), the result is always one partition. Why does Spark perform a broadcast when there is only one partition?
Thanks!
A partition is a logical division of the data stored on a node.
So I have a couple of doubts:
1) When is a partition created, given that the data becomes accessible only when Spark reads the file?
2) Is one of the tasks responsible for creating the partitions as well?
3) Is there a physical transfer of partitions from the driver to executor memory? If so, how is it performed, since partitions are logical divisions?
4) If there is a physical transfer of partitions, is the result of a task's computation over a partition logical as well?
And how are the logical results consolidated at the driver?
Could anyone help with these queries?
I'm trying to understand how Spark works. I know the cluster manager allocates the resources (workers) for the driver program.
I want to know how the cluster manager sends tasks (and which transformations) to the worker nodes, and how the worker nodes access the data (assume my data is in S3).
Do the worker nodes read only a part of the data, apply all transformations to it, and return the result of the action to the driver program? Or do the worker nodes read the entire file, apply only specific transformations, and return the result to the driver program?
Follow-up questions:
How, and by whom, is it decided how much data is sent to each worker node, given that we have established that only part of the data is present on each worker node? For example: I have two worker nodes with 4 cores each, and one 1 TB CSV file to read, with a few transformations and an action to perform. Assume the CSV is on S3 and on the master node's local storage.
It's going to be a long answer, but I will try to simplify it as best I can:
Typically a Spark cluster contains multiple nodes; each node has multiple CPUs, a bunch of memory, and storage. Each node holds some chunks of data, so the nodes are sometimes also referred to as data nodes.
When Spark applications are started, they create multiple workers or executors. Those workers/executors take resources (CPU, RAM) from the cluster's nodes above. In other words, the nodes in a Spark cluster play both roles: data storage and computation.
But as you might have guessed, the data on a node is (sometimes) incomplete, so workers have to "pull" data across the network to do a partial computation. The results are then sent back to the driver. The driver just does the "collection work" and combines them all to get the final result.
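The split between partial computation on the workers and "collection work" on the driver can be illustrated with a small Spark-free model. This is only a sketch in plain Scala, not Spark's implementation: the object name PartialAggregation and the helpers partition, runTask and driverCombine are hypothetical names made up for the illustration.

```scala
// Simplified, Spark-free model: each "task" reduces one partition,
// and the "driver" only merges the small partial results.
object PartialAggregation {
  // Split the data into roughly equal partitions (hypothetical helper).
  def partition[A](data: Seq[A], numPartitions: Int): Seq[Seq[A]] = {
    val size = math.max(1, math.ceil(data.size.toDouble / numPartitions).toInt)
    data.grouped(size).toSeq
  }

  // Work done by one task on an executor: a partial sum over its partition.
  def runTask(part: Seq[Long]): Long = part.sum

  // Work done on the driver: combine the partial results.
  def driverCombine(partials: Seq[Long]): Long = partials.sum

  def main(args: Array[String]): Unit = {
    val data = 1L to 100L
    val partials = partition(data, 4).map(runTask)
    println(s"partial results: $partials")
    println(s"final result: ${driverCombine(partials)}") // 5050
  }
}
```

The point of the model is that only the four partial sums travel to the driver, not the underlying records.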
Edit #1:
How and who decides how much amount of data needs to be sent to worker nodes
Each task would decide which data is "local" and which is not. This article explains pretty well how data locality works
I have two worker nodes with 4 cores each and I have one 1TB csv file to read and perform a few transformations and an action
This situation is different from the question above: you have only one file, and most likely your worker is the same machine as your data node. The executor(s) sitting on that worker would, however, read the file piece by piece (one piece per task), in parallel, in order to increase parallelism.
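To make "piece by piece" concrete, here is a rough sketch in plain Scala of how a single large file could be cut into byte ranges, one per read task. The 128 MB split size is an assumption for the example (it matches the default of spark.sql.files.maxPartitionBytes, but the real split logic takes more settings into account), and InputSplits and splits are hypothetical names.

```scala
// Simplified model of input splits: one (start, length) byte range per read task.
object InputSplits {
  def splits(fileSize: Long, splitSize: Long): Seq[(Long, Long)] =
    (0L until fileSize by splitSize).map { start =>
      (start, math.min(splitSize, fileSize - start))
    }

  def main(args: Array[String]): Unit = {
    val oneTb = 1024L * 1024 * 1024 * 1024
    val splitSize = 128L * 1024 * 1024 // assumed split size, not read from Spark
    val tasks = splits(oneTb, splitSize)
    println(s"number of read tasks: ${tasks.size}") // 8192
    println(s"first split: ${tasks.head}")
    println(s"last split: ${tasks.last}")
  }
}
```

Under these assumptions the 1 TB file becomes 8192 tasks, and two workers with 4 cores each would work through them 8 at a time.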
I need clarity about how Spark works under the hood when it fetches data from external databases.
What I understood from the Spark documentation is that if I do not specify options like "numPartitions", "lowerBound" and "upperBound", then the read via JDBC is not parallel. What happens in that case?
Is the data read by one particular executor that fetches all of it? How is parallelism achieved then?
Does that executor later share the data with the other executors? I believe executors cannot share data like this.
Please let me know if any of you has explored this.
Edit to my question -
Hi Amit, thanks for your response, but that is not what I am looking for. Let me elaborate:-
Refer to this: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
Refer to the code snippet below:
val MultiJoin_vw = db.getDataFromGreenplum_Parallel(ss, MultiJoin, bs,5,"bu_id",10,9000)
println(MultiJoin_vw.explain(true))
println("Number of executors")
ss.sparkContext.statusTracker.getExecutorInfos.foreach(x => println(x.host(),x.numRunningTasks()))
println("Number of partitons:" ,MultiJoin_vw.rdd.getNumPartitions)
println("Number of records in each partiton:")
MultiJoin_vw.groupBy(spark_partition_id).count().show(10)
Output :
Fetch Starts
== Physical Plan ==
*(1) Scan JDBCRelation((select * from mstrdata_rdl.cmmt_sku_bu_vw)as mytab) [numPartitions=5] [sku_nbr#0,bu_id#1,modfd_dts#2] PushedFilters: [], ReadSchema: struct<sku_nbr:string,bu_id:int,modfd_dts:timestamp>
()
Number of executors
(ddlhdcdev18,0)
(ddlhdcdev41,0)
(Number of partitons:,5)
Number of records in each partition:
+--------------------+------+
|SPARK_PARTITION_ID()| count|
+--------------------+------+
| 1|212267|
| 3| 56714|
| 4|124824|
| 2|232193|
| 0|627712|
+--------------------+------+
Here I read the table using the custom function db.getDataFromGreenplum_Parallel(ss, MultiJoin, bs,5,"bu_id",10,9000), which asks for 5 partitions on the field bu_id, with a lower bound of 10 and an upper bound of 9000.
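As a rough illustration of what these three numbers mean, the sketch below derives one WHERE clause per partition from the column, the bounds and the partition count. It is a simplified approximation in plain Scala, not Spark's own code: the exact boundaries Spark computes can differ between versions, and JdbcStride and predicates are hypothetical names. What the Spark JDBC documentation does state is that lowerBound and upperBound only decide the partition stride and do not filter rows.

```scala
// Approximate model of how a JDBC read is split into per-partition predicates.
object JdbcStride {
  def predicates(column: String, lower: Long, upper: Long, numPartitions: Int): Seq[String] = {
    require(numPartitions >= 2 && lower < upper, "sketch assumes >= 2 partitions and lower < upper")
    val stride = (upper - lower) / numPartitions
    // Interior boundaries between neighbouring partitions.
    val bounds = (1 until numPartitions).map(i => lower + i * stride)
    (0 until numPartitions).map { i =>
      if (i == 0) s"$column < ${bounds.head} OR $column IS NULL"  // also catches values below lower
      else if (i == numPartitions - 1) s"$column >= ${bounds.last}" // also catches values above upper
      else s"$column >= ${bounds(i - 1)} AND $column < ${bounds(i)}"
    }
  }

  def main(args: Array[String]): Unit =
    predicates("bu_id", 10L, 9000L, 5).zipWithIndex.foreach {
      case (p, i) => println(s"partition $i: WHERE $p")
    }
}
```

Because the first and last predicates are open-ended, every row is read regardless of the bounds, and an uneven distribution of bu_id values may show up as uneven partition sizes.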
See how Spark reads the data into 5 partitions with 5 parallel connections (as described in the Spark docs). Now let's read this table without specifying any of the parameters above:
I simply get the data using another function - val MultiJoin_vw = db.getDataFromGreenplum(ss, MultiJoin, bs)
Here I am only passing the Spark session (ss), the query for getting the data (MultiJoin), and another parameter for exception handling (bs).
The output is as below:
Fetch Starts
== Physical Plan ==
*(1) Scan JDBCRelation((select * from mstrdata_rdl.cmmt_sku_bu_vw)as mytab) [numPartitions=1] [sku_nbr#0,bu_id#1,modfd_dts#2] PushedFilters: [], ReadSchema: struct<sku_nbr:string,bu_id:int,modfd_dts:timestamp>
()
Number of executors
(ddlhdcdev31,0)
(ddlhdcdev27,0)
(Number of partitons:1)
Number of records in each partiton:
+--------------------+-------+
|SPARK_PARTITION_ID()| count|
+--------------------+-------+
| 0|1253710|
+--------------------+-------+
See how the data is read into one partition, which means only 1 connection is spawned.
The question remains: this partition will be on one machine only, and 1 task will be assigned to it.
So there is no parallelism here. How does the data get distributed to the other executors then?
By the way, this is the spark-submit command I used for both scenarios:
spark2-submit --master yarn --deploy-mode cluster --driver-memory 1g --num-executors 1 --executor-cores 1 --executor-memory 1g --class jobs.memConnTest $home_directory/target/mem_con_test_v1-jar-with-dependencies.jar
Re:"to fetch data external databases"
In your Spark application this is generally the part of the code that is executed on the executors. The number of executors can be controlled with the spark-submit option "--num-executors". If you have worked with Spark and RDDs/DataFrames, then one example of where you would connect to the database is inside transformation functions such as map, flatMap, filter, etc. When these functions are executed on the executors (as configured by --num-executors), they establish the database connection and use it.
One important thing to note here is that if you work with too many executors, your database server might get slower and slower and eventually become non-responsive. If you use too few executors, your Spark job might take longer to finish. Hence you have to find an optimum number based on your DB server's capacity.
Re:"How is parallelism achieved then? Does that executor share the data later to other executors?"
Parallelism, as mentioned above, is achieved by configuring the number of executors. That is just one way of increasing parallelism, not the only one. Consider a case where you have a small amount of data resulting in fewer partitions: you will then see less parallelism. So you need a good number of partitions (these correspond to tasks) and then an appropriate number of executors (the exact number depends on the use case) to execute those tasks in parallel. As long as you can process each record individually it scales; however, as soon as you have an action that causes a shuffle, you will see statistics regarding tasks and executors in action. Spark will try to distribute the data as well as it can so that it can work at an optimum level.
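The relationship between partitions (tasks) and executors can be written down as simple arithmetic. The sketch below is a plain-Scala model, assuming one core per task (the default for spark.task.cpus); Parallelism, concurrentTasks and waves are hypothetical names used only for this illustration.

```scala
// Back-of-the-envelope model: how many tasks can run at once, and in how many rounds.
object Parallelism {
  // Tasks that can actually run at the same time.
  def concurrentTasks(partitions: Int, executors: Int, coresPerExecutor: Int): Int =
    math.min(partitions, executors * coresPerExecutor)

  // Number of scheduling rounds needed to finish all tasks of a stage.
  def waves(partitions: Int, executors: Int, coresPerExecutor: Int): Int = {
    val slots = executors * coresPerExecutor
    (partitions + slots - 1) / slots
  }

  def main(args: Array[String]): Unit = {
    // 1 partition: extra executors do not help.
    println(concurrentTasks(partitions = 1, executors = 4, coresPerExecutor = 2)) // 1
    // 5 partitions on 1 executor with 1 core: tasks run one after another.
    println(waves(partitions = 5, executors = 1, coresPerExecutor = 1)) // 5
    // 5 partitions on 5 single-core executors: all tasks run together.
    println(waves(partitions = 5, executors = 5, coresPerExecutor = 1)) // 1
  }
}
```

In this model a single-partition read stays at one concurrent task however many executors are added, and five partitions on one single-core executor still run sequentially.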
Please refer to https://blog.cloudera.com/how-to-tune-your-apache-spark-jobs-part-1/ and subsequent parts to understand more about the internals.
Say I have a file of 256 KB stored on the HDFS file system of one node (as two blocks of 128 KB each). Assume I have a two-node cluster with 1 core per node. My understanding is that during a transformation Spark will read the complete file into memory on one node and then transfer one block's data to the other node, so that both nodes/cores can process it in parallel. Is that correct?
What if both nodes had two cores each instead of one? In that case, could the two cores on a single node do the computation? Is that right?
val text = sc.textFile("mytextfile.txt")
val counts = text.flatMap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_+_)
counts.collect
Your question is a little hypothetical, as it is unlikely you would have a Hadoop cluster with HDFS that has only one Data Node and 2 Worker Nodes, one being both Worker and Data Node. That is to say, the whole idea of Spark (and MR) with HDFS is to bring the processing to the data. The Worker Nodes are in fact the Data Nodes in the standard Hadoop setup. This is the original intent.
Some variations to answer your question:
Assuming the case described above, each Worker Node would process one partition, and then the subsequent transformations on the newly generated RDDs, until finished. You may of course repartition the data, and what happens then depends on the number of partitions and the number of Executors per Worker Node.
In a nutshell: if you have N blocks / partitions initially and fewer than N Executors (E) allocated on a Hadoop cluster with HDFS, then you will get some transfer of blocks (not a shuffle as is talked about elsewhere) to the Workers assigned, from Workers where no Executor was allocated to your Spark app; otherwise the block is assigned to be processed on that Data / Worker Node, obviously. Each block / partition is processed in some way and shuffled, and the next set of partitions (or the next partition) is read in and processed, depending on the speed of processing for your transformation(s).
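For the word count in the question, "processed in some way, shuffled" can be sketched without Spark as three steps: a map side per block, a shuffle that routes each word to a reducer by key hash, and a reduce side that merges counts. This is a simplified plain-Scala model, not Spark's implementation; WordCountModel and its functions are hypothetical names.

```scala
// Spark-free model of flatMap -> map -> reduceByKey over two blocks.
object WordCountModel {
  // Map side: what one task does with its block before the shuffle
  // (reduceByKey also pre-combines counts on the map side).
  def mapSide(block: Seq[String]): Map[String, Int] =
    block.flatMap(_.split(" ")).filter(_.nonEmpty)
      .groupBy(identity).map { case (w, ws) => (w, ws.size) }

  // Shuffle: route every (word, count) pair to a reducer chosen by key hash.
  def shuffle(partials: Seq[Map[String, Int]], numReducers: Int): Seq[Seq[(String, Int)]] = {
    val all = partials.flatMap(_.toSeq)
    (0 until numReducers).map { r =>
      all.filter { case (w, _) => java.lang.Math.floorMod(w.hashCode, numReducers) == r }
    }
  }

  // Reduce side: merge the counts for the keys this reducer owns.
  def reduceSide(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2).sum) }

  def wordCount(blocks: Seq[Seq[String]], numReducers: Int): Map[String, Int] =
    shuffle(blocks.map(mapSide), numReducers).map(reduceSide).reduce(_ ++ _)

  def main(args: Array[String]): Unit = {
    val blocks = Seq(Seq("a b a"), Seq("b c")) // two "HDFS blocks"
    println(wordCount(blocks, numReducers = 2))
  }
}
```

Each block is counted independently where it is read; only the per-word partial counts cross the shuffle boundary.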
In the case of AWS S3 and Microsoft's and Google's equivalent cloud storage, the principle of data locality described above is set aside: compute power is divorced from storage, on the assumption that the network is not the bottleneck (which was exactly the classic Hadoop reason for bringing the processing to the data). It then works similarly to the above, i.e. the S3 data is transferred to the Workers.
All of this assumes an Action has been invoked.
I leave aside the principles of Rack Awareness, etc., as it all becomes quite complicated, but the Resource Managers understand these things and decide accordingly.
In the first case, Spark will usually load 1 partition on the first node and then, if it cannot find a free core there, load the 2nd partition on the 2nd node after waiting for spark.locality.wait (default 3 seconds).
In the 2nd case both partitions will be loaded on the same node, unless it does not have both cores free.
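The two cases can be mimicked with a toy placement function. This is a heavily simplified sketch in plain Scala: Spark's real scheduler uses several locality levels and more state, and LocalityModel, Placement and place are hypothetical names. The 3000 ms default stands in for spark.locality.wait.

```scala
// Toy model of locality-aware placement for partitions stored on one node.
object LocalityModel {
  final case class Placement(partition: Int, node: String, waitedMs: Long)

  // Local cores on `dataNode` are used first. Once they are exhausted, the remaining
  // partitions go to free cores on other nodes after waiting `localityWaitMs`.
  // Partitions that find no free core at all are left out (they would queue).
  def place(numPartitions: Int, dataNode: String, freeCores: Map[String, Int],
            localityWaitMs: Long = 3000L): Seq[Placement] = {
    val localSlots = freeCores.getOrElse(dataNode, 0)
    val remoteSlots = freeCores.toSeq.filter(_._1 != dataNode).sortBy(_._1)
      .flatMap { case (node, cores) => Seq.fill(cores)(node) }
    (0 until numPartitions).flatMap { p =>
      if (p < localSlots) Some(Placement(p, dataNode, 0L))
      else remoteSlots.lift(p - localSlots).map(n => Placement(p, n, localityWaitMs))
    }
  }

  def main(args: Array[String]): Unit = {
    // Case 1: two nodes with one core each.
    println(place(2, "node1", Map("node1" -> 1, "node2" -> 1)))
    // Case 2: two nodes with two cores each.
    println(place(2, "node1", Map("node1" -> 2, "node2" -> 2)))
  }
}
```

In the first call the second partition ends up on node2 after the wait; in the second call both partitions stay on node1 with no wait.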
Many circumstances can cause this to change if you play with the default configurations.
I have three Spark Streaming jobs that use ConsumerStrategies.Assign[]() to seek the latest offset that was committed into a database.
Each of these jobs reads from one of three partitions in a topic (for example: partitions 0, 1 and 2). If one of the Spark Streaming jobs fails, is it possible to rebalance that partition to one of the other two jobs that are still running?
I know you can do that in plain Kafka using ConsumerRebalanceListener, with onPartitionsRevoked() and onPartitionsAssigned(). But how would you do that in Spark-Streaming-Kafka?