How spark reads from jdbc and distribute the data

How spark reads from jdbc and distribute the data - apache-spark

I need clarity about how spark works under the hood when it comes to fetch data external databases.
What I understood from spark documentation is that, if I do not mention attributes like "numPartitons","lowerBound" and "upperBound" then read via jdbc is not parallel.In that case what happens?
Is data read by 1 particular executor which fetches all the data ? How is parallelism achieved then?
Does that executor share the data later to other executors?But I believe executors cannot share data like this.
Please let me know if any one of you have explored this.
Edit to my question -
Hi Amit, thanks for your response, but that is not what I am looking for. Let me elaborate:-
Refer this - https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
Refer below code snippet –
val MultiJoin_vw = db.getDataFromGreenplum_Parallel(ss, MultiJoin, bs,5,"bu_id",10,9000)
println(MultiJoin_vw.explain(true))
println("Number of executors")
ss.sparkContext.statusTracker.getExecutorInfos.foreach(x => println(x.host(),x.numRunningTasks()))
println("Number of partitons:" ,MultiJoin_vw.rdd.getNumPartitions)
println("Number of records in each partiton:")
MultiJoin_vw.groupBy(spark_partition_id).count().show(10)
Output :
Fetch Starts
== Physical Plan ==
*(1) Scan JDBCRelation((select * from mstrdata_rdl.cmmt_sku_bu_vw)as mytab) [numPartitions=5] [sku_nbr#0,bu_id#1,modfd_dts#2] PushedFilters: [], ReadSchema: struct<sku_nbr:string,bu_id:int,modfd_dts:timestamp>
()
Number of executors
(ddlhdcdev18,0)
(ddlhdcdev41,0)
(Number of partitons:,5)
Number of records in each partition:
+--------------------+------+
|SPARK_PARTITION_ID()| count|
+--------------------+------+
| 1|212267|
| 3| 56714|
| 4|124824|
| 2|232193|
| 0|627712|
+--------------------+------+
Here I read the table using the custom function db.getDataFromGreenplum_Parallel(ss, MultiJoin, bs,5,"bu_id",10,9000) which specifies to create 5 partition based on field bu_id whose lower value is 10 and upper value is 9000.
See how spark read data in 5 partitions with 5 parallel connections (as mentioned by spark doc). Now lets read this table without mentioning any of the parameter above –
I simply get the data using another function - val MultiJoin_vw = db.getDataFromGreenplum(ss, MultiJoin, bs)
Here I am only passing the spark session(ss), query for getting the data(MultiJoin) and another parameter for exception handling(bs).
The o/p is like below –
Fetch Starts
== Physical Plan ==
*(1) Scan JDBCRelation((select * from mstrdata_rdl.cmmt_sku_bu_vw)as mytab) [numPartitions=1] [sku_nbr#0,bu_id#1,modfd_dts#2] PushedFilters: [], ReadSchema: struct<sku_nbr:string,bu_id:int,modfd_dts:timestamp>
()
Number of executors
(ddlhdcdev31,0)
(ddlhdcdev27,0)
(Number of partitons:1)
Number of records in each partiton:
+--------------------+-------+
|SPARK_PARTITION_ID()| count|
+--------------------+-------+
| 0|1253710|
See how data is read into one partition, means spawning only 1 connection.
Question remains this partition will be at one machine only and 1 task will be assigned to this.
So there is no parallelism here.How does the data gets distributed to other executors then?
By the way this is the spark-submit command I used for both scenarios –
spark2-submit --master yarn --deploy-mode cluster --driver-memory 1g --num-executors 1 --executor-cores 1 --executor-memory 1g --class jobs.memConnTest $home_directory/target/mem_con_test_v1-jar-with-dependencies.jar

Re:"to fetch data external databases"
In your spark application this is generally the part of the code that will be executed on executors. Number of executors can be controlled by passing a spark configuration "num-executors". If you have worked with Spark and RDD/Dataframe, then one of the example from where you would connect to the database is the transformation functions such as map,flatmap,filter etc. These functions when getting executed on executors ( configured by num-executors) will establish the database connection and use it.
One important thing to note here is that, if you work with too many executors then your database server might getting slower and slower and eventually non-responsive. If you give too less of executors then it might cause your spark job taking more time to finish. Hence you have to find an optimum number based on your DB server capacity.
Re:"How is parallelism achieved then? Does that executor share the data later to other executors?"
Parallelism as mentioned above is achieved by configuring number of executors. Configuring number of executors is just one way of increasing parallelism and it is not the only way. Consider a case where you have a smaller size data resulting in fewer partitions, then you will see lesser parallelism. So you need to have good number of partitions (those corresponds to tasks) and then appropriate(definite number depends on the use case) number of executors to execute those tasks in parallel. As long as you can process each record individually it scales, however as soon as you have an action that would cause a shuffle you would see statistics regarding tasks and executors in action. Spark will try to best distribute the data so that it can work at optimum level.
Please refer https://blog.cloudera.com/how-to-tune-your-apache-spark-jobs-part-1/ and subsequent parts to understand more about the internals.

Related

Why my shuffle partition is not 200(default) during group by operation? (Spark 2.4.5)

I am new to spark and trying to understand the internals of it. So,
I am reading a small 50MB parquet file from s3 and performing a group by and then saving back to s3.
When I observe the Spark UI, I can see 3 stages created for this,
stage 0 : load (1 tasks)
stage 1 : shufflequerystage for grouping (12 tasks)
stage 2: save (coalescedshufflereader) (26 tasks)
Code Sample:
df = spark.read.format("parquet").load(src_loc)
df_agg = df.groupby(grp_attribute)\
.agg(F.sum("no_of_launches").alias("no_of_launchesGroup")
df_agg.write.mode("overwrite").parquet(target_loc)
I am using EMR instance with 1 master, 3 core nodes(each with 4vcores). So, default parallelism is 12. I am not changing any config in runtime. But I am not able to understand why 26 tasks are created in the final stage? As I understand by default the shuffle partition should be 200. Screenshot of the UI attached.

I tried a similar logic on Databricks with Spark 2.4.5.
I observe that with spark.conf.set('spark.sql.adaptive.enabled', 'true'), the final number of my partitions is 2.
I observe that with spark.conf.set('spark.sql.adaptive.enabled', 'false') and spark.conf.set('spark.sql.shuffle.partitions', 75), the final number of my partitions is 75.
Using print(df_agg.rdd.getNumPartitions()) reveals this.
So, the job output on Spark UI does not reflect this. May be a repartition occurs at the end. Interesting, but not really an issue.

In Spark sql, number of shuffle partitions are set using spark.sql.shuffle.partitions which defaults to 200. In most of the cases, this number is too high for smaller data and too small for bigger data. Selecting right value becomes always tricky for the developer.
So we need an ability to coalesce the shuffle partitions by looking at the mapper output. If the mapping generates small number of partitions, we want to reduce the overall shuffle partitions so it will improve the performance.
In the lastet version , Spark3.0 with Adaptive Query Execution , this feature of reducing the tasks is automated.
http://blog.madhukaraphatak.com/spark-aqe-part-2/
Considering this in Spark2.4.5 also catalist opimiser or EMR might have enabled this feature to reduce the tasks insternally rather 200 tasks.

Spark SQL slow execution with resource idle

I have a Spark SQL that used to execute < 10 mins now running at 3 hours after a cluster migration and need to deep dive on what it's actually doing. I'm new to spark and please don't mind if I'm asking something unrelated.
Increased spark.executor.memory but no luck.
Env: Azure HDInsight Spark 2.4 on Azure Storage
SQL: Read and Join some data and finally write result to a Hive metastore.
The spark.sql script ends with below code:
.write.mode("overwrite").saveAsTable("default.mikemiketable")
Application Behavior:
Within the first 15 mins, it loads and complete most tasks (199/200); left only 1 executor process alive and continually to shuffle read / write data. Because now it only leave 1 executor, we need to wait 3 hours until this application finish.
Left only 1 executor alive
Not sure what's the executor doing:
From time to time, we can tell the shuffle read increased:
Therefore I increased the spark.executor.memory to 20g, but nothing changed. From Ambari and YARN I can tell the cluster has many resources left.
Release of almost all executor
Any guidance is greatly appreciated.

I would like to start with some observations for your case:
From the tasks list you can see that that Shuffle Spill (Disk) and Shuffle Spill (Memory) have both very high values. The max block size for each partition during the exchange of data should not exceed 2GB therefore you should be aware to keep the size of shuffled data as low as possible. As rule of thumb you need to remember that the size of each partition should be ~200-500MB. For instance if the total data is 100GB you need at least 250-500 partitions to keep the partition size within the mentioned limits.
The co-existence of two previous it also means that the executor memory was not sufficient and Spark was forced to spill data to the disk.
The duration of the tasks is too high. A normal task should lasts between 50-200ms.
Too many killed executors is another sign which shows that you are facing OOM problems.
Locality is RACK_LOCAL which is considered one of the lowest values you can achieve within a cluster. Briefly, that means that the task is being executed in a different node than the data is stored.
As solution I would try the next few things:
Increase the number of partitions by using repartition() or via Spark settings with spark.sql.shuffle.partitions to a number that meets the requirements above i.e 1000 or more.
Change the way you store the data and introduce partitioned data i.e day/month/year using partitionBy

Spark Sql Job optimization

I have a job which consist around 9 sql statement to pull data from hive and write back to hive db. It is currently running for 3hrs which seems too long considering spark abitlity to process data. The application launchs total 11 stages.
I did some analysis using Spark UI and found below grey areas which can be improved:
Stage 8 in Job 5 has shuffle output of 1.5 TB.
Time gap between job 4 and job 5 is 20 Mins. I read about this time gap and found spark perform IO out of spark job which reflects as gap between two jobs which can be seen in driver logs.
We have a cluster of 800 nodes with restricted resources for each queue and I am using below conf to submit job:
-- num-executor 200
-- executor-core 1
-- executor-memory 6G
-- deployment mode client
Attaching Image of UI as well.
Now my questions are:
Where can I find driver log for this job?
In image, I see a long list of Executor added which I sum is more than 200 but in Executor tab, number is exactly 200. Any explation for this?
Out of all the stages, only one stage has TASK around 35000 but rest of stages has 200 tasks only. Should I increase number of executor or should I go for dynamic allocation facility of spark?

Below are the thought processes that may guide you to some extent:
Is it necessary to have one core per executor? The executor need not be fat always. You can have more cores in one executor. it is a trade-off between creating a slim vs fat executors.
Configure shuffle partition parameter spark.sql.shuffle.partitions
Ensure while reading data from Hive, you are using Sparksession (basically HiveContext). This will pull the data into Spark memory from HDFS and schema information from Metastore of Hive.
Yes, Dynamic allocation of resources is a feature that helps in allocating the right set of resources. It is better than having fixed allocation.

Why input value on spark UI is less when we use more nodes and the shuffle value remains equal?

I run some queries on spark with just one node and then with 4 nodes. And in the spark:4040 UI I see something that I am not understanding. For example after executing a query with 4 nodes and check the results in the spark UI, in the "input" tab appears 2,8gb, so spark read 2,8gb from hadoop.
The same query on hadoop with just one node in local mode appears 7,3gb, the spark read 7,3GB from hadoop. But this value shouldnt be equal? For example the value of shuffle remains +- equal in one node vs 4. Why the input value doesn't stay equal? The same amount of data must be read from the hdfs, so I am not understanding. Do you know?
Single node:
Below the same query on multinode, as you can see input is less but the shuffle remains +- icual, do you know why?

Input is the size of data that your spark job is ingesting. For example, it can be data that each map task you may have defined is using.
Shuffling means the reallocation of data between multiple Spark stages. "Shuffle Write" is the sum of all written serialized data on all executors before transmitting (normally at the end of a stage) and "Shuffle Read" means the sum of read serialized data on all executors at the beginning of a stage.

I assume you are talking about input tab in jobs. It could be a cumulative storage. Please check input in executor tab as well. As in 4 node it will have more executor the data will be distributed.

what factors affect how many spark job concurrently

We recently have set up the Spark Job Server to which the spark jobs are submitted.But we found out that our 20 nodes(8 cores/128G Memory per node) spark cluster can only afford 10 spark jobs running concurrently.
Can someone share some detailed info about what factors would actually affect how many spark jobs can be run concurrently? How can we tune the conf so that we can take full advantage of the cluster?

Question is missing some context, but first - it seems like Spark Job Server limits the number of concurrent jobs (unlike Spark itself, which puts a limit on number of tasks, not jobs):
From application.conf
# Number of jobs that can be run simultaneously per context
# If not set, defaults to number of cores on machine where jobserver is running
max-jobs-per-context = 8
If that's not the issue (you set the limit higher, or are using more than one context), then the total number of cores in the cluster (8*20 = 160) is the maximum number of concurrent tasks. If each of your jobs creates 16 tasks, Spark would queue the next incoming job waiting for CPUs to be available.
Spark creates a task per partition of the input data, and the number of partitions is decided according to the partitioning of the input on disk, or by calling repartition or coalesce on the RDD/DataFrame to manually change the partitioning. Some other actions that operate on more than one RDD (e.g. union) may also change the number of partitions.

Some things that could limit the parallelism that you're seeing:
If your job consists of only map operations (or other shuffle-less operations), it will be limited to the number of partitions of data you have. So even if you have 20 executors, if you have 10 partitions of data, it will only spawn 10 task (unless the data is splittable, in something like parquet, LZO indexed text, etc).
If you're performing a take() operation (without a shuffle), it performs an exponential take, using only one task and then growing until it collects enough data to satisfy the take operation. (Another question similar to this)
Can you share more about your workflow? That would help us diagnose it.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string