What is the difference between spark's shuffle read and shuffle write? - apache-spark

I need to run a Spark program that processes a huge amount of data. I am trying to optimize it by working through the Spark UI and reducing the shuffle portion.
A couple of components are mentioned there: shuffle read and shuffle write. I can understand the difference from their names, but I would like to understand their exact meaning, and which of the two reduces performance.
I have searched the internet but could not find solid, in-depth details about them, so I wanted to see if anyone can explain them here.

From the UI tooltip
Shuffle Read
Total shuffle bytes and records read (includes both data read locally and data read from remote executors)
Shuffle Write
Bytes and records written to disk in order to be read by a shuffle in a future stage
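To make the two tooltip definitions concrete, here is a toy accounting model in plain Python (not Spark code). The executor names, the `simulate_shuffle` helper, and the modulo partitioner are all assumptions made for illustration. Map-side executors "write" one block per reduce partition, and each reduce task then "reads" its blocks, either locally or from a remote executor.

```python
from collections import defaultdict

def simulate_shuffle(map_outputs, num_reducers, reducer_executor):
    """Toy model of shuffle write/read accounting.

    map_outputs: {executor_id: [(int_key, value), ...]} produced by map tasks.
    reducer_executor: {reducer_id: executor_id} where each reduce task runs.
    """
    shuffle_write = defaultdict(int)  # records written per map-side executor
    blocks = defaultdict(list)        # (map_executor, reducer_id) -> records
    for executor, records in map_outputs.items():
        for key, value in records:
            reducer_id = key % num_reducers  # toy partitioner for int keys
            blocks[(executor, reducer_id)].append((key, value))
            shuffle_write[executor] += 1

    shuffle_read = defaultdict(lambda: {"local": 0, "remote": 0})
    for (map_executor, reducer_id), records in blocks.items():
        same_executor = reducer_executor[reducer_id] == map_executor
        kind = "local" if same_executor else "remote"
        shuffle_read[reducer_id][kind] += len(records)
    return dict(shuffle_write), {r: dict(c) for r, c in shuffle_read.items()}

map_outputs = {
    "exec-A": [(0, "a"), (1, "b"), (2, "c"), (3, "d")],
    "exec-B": [(0, "e"), (1, "f"), (1, "g")],
}
written, read = simulate_shuffle(
    map_outputs, num_reducers=2, reducer_executor={0: "exec-A", 1: "exec-B"}
)
print(written)  # records written per map-side executor
print(read)     # records read per reduce task, split into local and remote
```

In this simplified model the total written always equals the total read; the real UI reports bytes as well as records, per stage.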

I've recently begun working with Spark, and I have been looking for answers to the same sort of questions.
When the data from one stage is shuffled to the next stage over the network, the executor(s) that process the next stage pull the data from the first stage's processes over TCP. I noticed that the shuffle "write" and "read" metrics for each stage are displayed in the Spark UI for a particular job. A stage can also have an "input" size (e.g. input from HDFS or a Hive table scan).
I noticed that the shuffle write size of one stage that fed into another stage did not match that stage's shuffle read size. If I remember correctly, there are reducer-type operations that can be performed on the shuffle data before it is transferred to the next stage/executor, as an optimization. Maybe this contributes to the difference in size, and therefore to the relevance of reporting both values.
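The "reducer-type operations" mentioned here are usually called map-side combining (what `reduceByKey` does and `groupByKey` does not). The sketch below is a plain-Python toy, not Spark code; the `shuffle_records` helper and the word-count scenario are assumptions. It only shows how pre-aggregating inside a map task cuts the number of records that have to be shuffled. In Spark this pre-aggregation runs before the shuffle files are written, so it shrinks the shuffle write itself; whether it explains a write/read mismatch in a particular job is a separate question.

```python
from collections import Counter

def shuffle_records(map_task_inputs, map_side_combine):
    """Return how many (key, count) records each map task would write."""
    written = []
    for words in map_task_inputs:
        if map_side_combine:
            # Pre-aggregate inside the map task: one record per distinct key.
            written.append(len(Counter(words)))
        else:
            # No pre-aggregation: one record per input element.
            written.append(len(words))
    return written

tasks = [["a", "a", "a", "b"], ["b", "b", "c", "c", "c", "c"]]
without_combine = shuffle_records(tasks, map_side_combine=False)
with_combine = shuffle_records(tasks, map_side_combine=True)
print(without_combine, with_combine)
```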

Related

Spark SQL output multiple small files

We have multiple joins involving a large table (about 500 GB in size). The output of the joins is stored in many small files, each 800 KB to 1.5 MB in size. Because of this, the job is split into many tasks and takes a long time to complete.
We have tried Spark tuning configurations such as using a broadcast join, changing the partition size, and changing the max records per file, but these methods gave no performance improvement and did not fix the issue. Using coalesce makes the job get stuck at that stage, with no progress.
Please view this link for Spark UI metrics screenshot, https://i.stack.imgur.com/FfyYy.png
The spark UI confirms your report of too many small files. You will get a file for every spark partition, and you have 33,479 in your final stage where you're writing the output. 33k partitions was probably the right number of partitions for your join but not the right number for your write.
You need to add another stage to your job that comes after your join. That second stage needs to reduce the number of Spark partitions to a reasonable number (one that outputs 32MB - ~128MB files).
Something like a coalesce, or repartition. Maybe even a sort :(
You want to target ~350 partitions.
This diagram shows what you want to do manually or automatically (with spark on Databricks)
If you're using Databricks then it's easy: with Delta Lake you can turn on Auto Optimize.
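The ~350 target can be sanity-checked with a little arithmetic. This is a rough sketch: the 1.15 MB average file size (the midpoint of the 800 KB to 1.5 MB range in the question) and the `target_partitions` helper are assumptions, not measurements from the job.

```python
import math

MB = 1024 * 1024

def target_partitions(total_output_bytes, target_file_bytes=128 * MB):
    """Number of output partitions needed to hit a target file size."""
    return max(1, math.ceil(total_output_bytes / target_file_bytes))

# Assumption: 33,479 files averaging about 1.15 MB each.
estimated_total = int(33_479 * 1.15 * MB)

n_128 = target_partitions(estimated_total)           # ~128 MB files
n_32 = target_partitions(estimated_total, 32 * MB)   # ~32 MB files
print(n_128, n_32)
```

Under these assumptions the reasonable range is roughly 300 to 1,200 partitions, and ~350 corresponds to files of about 110 MB. The resulting number would then be passed to `repartition` or `coalesce` on the DataFrame before the write.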

Shuffle Stage Failing Due To Executor Loss

I get the following error when my Spark job fails: **"org.apache.spark.shuffle.FetchFailedException: The relative remote executor(Id: 21), which maintains the block data to fetch is dead."**
Overview of my Spark job:
Input size is ~35 GB.
I have broadcast joined all the smaller tables with the mother table into, say, dataframe1, and then I salted each big table and dataframe1 before joining it with dataframe1 (left table).
profile used:
#configure(profile=[
'EXECUTOR_MEMORY_LARGE',
'NUM_EXECUTORS_32',
'DRIVER_MEMORY_LARGE',
'SHUFFLE_PARTITIONS_LARGE'
])
Using the above approach and profiles I was able to get the runtime down by 50%, but I still get Shuffle Stage Failing Due To Executor Loss issues.
Is there a way I can fix this?
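For readers unfamiliar with the salting mentioned in the question, here is a plain-Python toy (not Spark code) of the idea. The helper names, the modulo partitioner, and the synthetic skewed keys are all assumptions. A random salt is appended to each key on the skewed side so that a hot key spreads over several partitions, and every row on the other side is replicated once per salt value so the join still matches.

```python
import random
from collections import Counter

def salted_partition_sizes(keys, num_partitions, num_salts, seed=0):
    """Count rows per partition after appending a random salt to each key."""
    rng = random.Random(seed)
    sizes = Counter()
    for key in keys:
        salt = rng.randrange(num_salts)  # num_salts=1 means "no salting"
        # Toy partitioner on the (key, salt) pair; Spark hashes the join columns.
        sizes[(key * num_salts + salt) % num_partitions] += 1
    return sizes

def replicate_for_salts(rows, num_salts):
    """The other join side needs one copy of each row per salt value."""
    return [(key, salt, value) for key, value in rows for salt in range(num_salts)]

# Skewed input: key 7 accounts for 900 of 1,000 rows.
keys = [7] * 900 + list(range(100))
unsalted = salted_partition_sizes(keys, num_partitions=8, num_salts=1)
salted = salted_partition_sizes(keys, num_partitions=8, num_salts=8)
small_side = replicate_for_salts([(7, "x")], num_salts=8)

print(max(unsalted.values()), max(salted.values()))
```

The largest partition shrinks substantially once the hot key is salted, at the cost of replicating the other side of the join.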
There are multiple things you can try:
Broadcast Joins: If you have used broadcast hints to join multiple smaller tables, then the resulting table (built from many smaller tables) might be too large to fit in each executor's memory. So you need to look at the total size of dataframe1.
35GB is really not huge. Also try the profile "EXECUTOR_CORES_MEDIUM", which increases the parallelism of the computation. Use dynamic allocation (16 executors should be fine for 35GB) rather than static allocation: if 32 executors are not available at once, the build doesn't start. "DRIVER_MEMORY_MEDIUM" should be enough.
Spark 3.0 handles skewed joins by itself with Adaptive Query Execution (AQE), so you need not use the salting technique. Foundry has a profile called "ADAPTIVE_ENABLED" that you can use. Other AQE settings have to be set manually through the "ctx" Spark context object that Foundry makes readily available.
Some references for AQE:
https://learn.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/aqe
https://spark.apache.org/docs/latest/sql-performance-tuning.html#adaptive-query-execution
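As a minimal sketch of setting AQE options manually: the three keys below are standard Spark 3.x SQL configuration names, while the `apply_settings` helper is a hypothetical convenience that accepts any object exposing `conf.set(key, value)`, as a `SparkSession` does. How that session is obtained from Foundry's "ctx" is not shown here and is left as an assumption.

```python
# Spark 3.x configuration keys for Adaptive Query Execution.
AQE_SETTINGS = {
    "spark.sql.adaptive.enabled": "true",
    # Let AQE split oversized partitions in sort-merge joins.
    "spark.sql.adaptive.skewJoin.enabled": "true",
    # Let AQE merge small shuffle partitions after a stage finishes.
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
}

def apply_settings(spark, settings=AQE_SETTINGS):
    """Apply each setting through spark.conf.set (hypothetical helper)."""
    for key, value in settings.items():
        spark.conf.set(key, value)
    return spark
```

With a real session this would be called as `apply_settings(spark)` before the joins are executed.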

Why there are so many partitions required before shuffling data in Apache Spark?

Background
I am a newbie in Spark and want to understand shuffling in Spark.
I have the following two questions about shuffling in Apache Spark.
1) Why does the number of partitions change before shuffling is performed? Spark does this by default, changing the partition count to the value given in spark.sql.shuffle.partitions.
2) Shuffling usually happens when there is a wide transformation. I have read in a book that the data is also saved on disk. Is my understanding correct?
Two questions actually.
Nowhere is it stated that you need to change this parameter. 200 is the default if it is not set, and it applies to JOINing and AGGregating. You may have a far bigger set of data that is better served by increasing the number of partitions for more processing capacity, if more Executors are available. If your data volume is huge, more parallelism, where possible, will in general speed up processing.
Assuming an Action has been called (stated here to avoid the obvious comment), and assuming we are not talking about a ResultStage or a broadcast join, then we are talking about a ShuffleMapStage. We look at an RDD initially:
DAG dependency involving a shuffle means creation of a separate Stage.
Map operations are followed by Reduce operations and a Map and so forth.
CURRENT STAGE
All the (fused) Map operations are performed intra-Stage.
The next Stage's requirement, a Reduce operation such as a reduceByKey, means the output is hashed or sorted by key (K) at the end of the Map operations of the current Stage.
This grouped data is written to disk on the Worker where the Executor is, or to storage tied to that Cloud version. (I would have thought in memory was possible if the data is small, but this is an architectural Spark approach, as stated in the docs.)
The ShuffleManager is notified that hashed, mapped data is available for consumption by the next Stage. Once all of the map-side work is done, the locations of the map outputs are tracked (in Spark's internals this bookkeeping is done by the MapOutputTracker).
NEXT STAGE
The next Stage, being a reduce, then gets the data from those locations by consulting the Shuffle Manager and using the Block Manager.
The Executor may be re-used, or be a new one on another Worker, or another Executor on the same Worker.
A shuffle between Stages means writing to disk, even if enough memory is present. Given the finite resources of a Worker, it makes sense that writing to disk occurs for this type of operation. The more important point is, of course, the 'Map Reduce' style of implementation.
Of course, fault tolerance is aided by this persistence, less re-computation work.
Similar aspects apply to DFs.
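The current-stage/next-stage flow described above can be sketched as a plain-Python toy (not Spark internals). The `map_stage` and `reduce_stage` helpers, the JSON file layout, and the byte-sum partitioner are assumptions for illustration; real Spark uses its own file format and a different on-disk layout. The map side buckets records by key and persists them to disk, a list of locations stands in for the location tracking, and the reduce side fetches its buckets by location and aggregates them.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path

def map_stage(task_id, records, num_reducers, shuffle_dir):
    """Bucket records by key and write one file per reduce partition."""
    buckets = defaultdict(list)
    for key, value in records:
        # Toy deterministic partitioner: byte sum of the key modulo partitions.
        buckets[sum(key.encode()) % num_reducers].append((key, value))
    locations = {}
    for reducer_id, bucket in buckets.items():
        path = Path(shuffle_dir) / f"map{task_id}_reduce{reducer_id}.json"
        path.write_text(json.dumps(bucket))  # the "shuffle write"
        locations[reducer_id] = path
    return locations

def reduce_stage(reducer_id, map_output_locations):
    """Fetch this reducer's buckets from every map output and sum by key."""
    totals = defaultdict(int)
    for locations in map_output_locations:
        path = locations.get(reducer_id)
        if path is None:
            continue  # this map task produced nothing for this reducer
        for key, value in json.loads(path.read_text()):  # the "shuffle read"
            totals[key] += value
    return dict(totals)

with tempfile.TemporaryDirectory() as shuffle_dir:
    inputs = [[("a", 1), ("b", 2), ("a", 3)], [("b", 4), ("c", 5)]]
    tracker = [map_stage(i, recs, 2, shuffle_dir) for i, recs in enumerate(inputs)]
    results = {r: reduce_stage(r, tracker) for r in range(2)}

print(results)
```

Because the map output sits on disk, a failed reduce task can re-read it instead of forcing the map side to be recomputed, which is the fault-tolerance point made above.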

spark shuffle read time

k.imgur.com/r8NIv.png
I am having a hard time interpreting this information from the Spark UI. The executor with the lowest shuffle read size/records takes the most time to read the shuffle blocks, as shown in the pictures. I cannot tell whether this is a code issue or a data node issue.
It may not be caused only by the shuffle read size; many factors affect shuffle time, such as the number of partitions. You can try modifying the shuffle-related configuration parameters.
shuffle-behavior

spark streaming failed batches

I see some failed batches in my Spark Streaming application because of memory-related issues like
Could not compute split, block input-0-1464774108087 not found
and I was wondering if there is a way to reprocess those batches on the side without disturbing the currently running application. This is a general question; it does not have to be this exact exception.
Thanks in advance
Pradeep
This may happen when your data ingestion rate into Spark is higher than what the allocated memory can hold. You can try changing the StorageLevel to MEMORY_AND_DISK_SER so that Spark can spill data to disk when it is low on memory. This should prevent your error.
Also, I don't think this error means that any data was lost during processing; rather, the input block that was added by your block manager timed out before processing started.
Check similar question on Spark User list.
Edit:
Data is not lost, it was just not present where the task was expecting it to be. As per Spark docs:
You can mark an RDD to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in memory on the nodes. Spark's cache is fault-tolerant – if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.
