Spark shuffle data to less executors - apache-spark

I am running a complex logic in Spark, in order to get some metrics, I add count in someplace. there are 2 stags
after reading data from S3, I got 22 tasks and data distributed on 22 executorsthese tasks distributed on 22 executors
I run count, the data shuffle to 4 executors with 200 tasks. data shuffle to 4 executors
code:
val newRdd = rdd.toDF()
.repartition(Constants.cappingSparkPartitionNum, $"internal_id")
.rdd // line 389
.map(convertRowToConfiguration)
Log.info(s"source, rdd.getNumPartitions: , ${rdd.getNumPartitions}, newRdd.getNumPartitions: ${newRdd.getNumPartitions}, count: ${newRdd.count()}") // line 392
Why did this happen?
How to control the data distributed on executors(40 executors) evenly?

Related

Number of Task in Apache spark while writing into HDFS

I am trying to read csv file and then adding some columns . After that trying to save in orc format.
I could not understand how spark decided number of tasks for different stages.
Why number of task for CSV stage is 1 and for ORC stage it is 39?
val c1c8 = spark.read.option("header",true).csv("/user/DEEPAK_TEST/C1C6_NEW/")
val c1c8new = { c1c8.withColumnRenamed("c1c6_F","c1c8").withColumnRenamed("Network_Out","c1c8_network").withColumnRenamed("Access NE Out","c1c8_access_ne")
.withColumn("c1c8_signalling",when (col("signalling_Out") === "SIP Cl4" , "SIP CL4").when (col("signalling_Out") === "SIP cl4" , "SIP CL4").when (col("signalling_Out") === "Other" , "other").otherwise(col("signalling_Out")))
.withColumnRenamed("access type Out","c1c8_access_type").withColumnRenamed("Type_of_traffic_C","c1c8_typeoftraffic")
.withColumnRenamed("BOS traffic type Out","c1c8_bos_trafc_typ").withColumnRenamed("Scope_Out","c1c8_scope")
.withColumnRenamed("Join with UP-DWN SIP cl5 T1T7 Out","c1c8_join_indicator")
.select("c1c8","c1c8_network", "c1c8_access_ne", "c1c8_signalling", "c1c8_access_type", "c1c8_typeoftraffic",
"c1c8_bos_trafc_typ", "c1c8_scope","c1c8_join_indicator")
}
c1c8new.write.orc("/user/DEEPAK_TEST/C1C8_MAPPING_NEWT/")
Below is my understanding from looking at Spark 2.x source code.
Stage 0 is a file scan that creates FileScanRDD which is an RDD that scans a list of file partitions. This stage can have more than one task when you are reading from multiple partitioned directories, such as a partitioned Hive table.
The number of tasks in Stage 1 will be equals to the number of RDD partitions. In your case c1c8new.rdd.getNumPartitions will be 39. This number is calculated using:
config value spark.files.maxPartitionBytes (128MB by default)
sparkContext.defaultParallelism returned by task scheduler (equal to number of cores when running in local mode)
totalBytes
DataSourceScanExec.scala#L423
val defaultMaxSplitBytes =
fsRelation.sparkSession.sessionState.conf.filesMaxPartitionBytes
val openCostInBytes = fsRelation.sparkSession.sessionState.conf.filesOpenCostInBytes
val defaultParallelism = fsRelation.sparkSession.sparkContext.defaultParallelism
val totalBytes = selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum
val bytesPerCore = totalBytes / defaultParallelism
val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
logInfo(s"Planning scan with bin packing, max size: $maxSplitBytes bytes, " +
s"open cost is considered as scanning $openCostInBytes bytes.")
You can see actual calculated values in the above log message if you set the log level to INFO - spark.sparkContext.setLogLevel("INFO")
In your case, I think the split size is 128 and so, number of tasks/partitions is roughly 4.6G/128MB
As a side note, you can change the number of partitions (and hence the number of tasks in the subsequent stage) by using repartition() or coalesce() on the dataframe. More importantly, the number of partitions after a shuffle is determined by spark.sql.shuffle.partitions (200 by default). If you have a shuffle, it is better to use this configuration to control the number of tasks because inserting repartition() or coalesce() between stages adds extra overhead.
For large spark SQL workloads, setting optimum values for spark.sql.shuffle.partitions in each stage was always a pain point. Spark 3.x has better support for this if Adaptive Query Execution is enabled, but I haven't tried it for any production workloads.

Why does Apache Spark partitions CSV read based on the file size and how do I change the partitions?

Here is my pyspark code:
csv_file = "/FileStore/tables/mnt/training/departuredelays02.csv"
schema = "`date` STRING, `delay` INT, `distance` INT, `origin` STRING, `destination` STRING"
df = (spark
.read
.format("csv")
.option("header","true")
.schema(schema)
.load(csv_file)
)
partitions = df.rdd.getNumPartitions()
print(partitions)
The csv file has 487178 rows.
After I print the partitions, the result I get is 3 partitions.
Please note that I have 2 workers with 4 cores. That means a total of 8 slots.
Now if I try to load the following file which is much larger with 1391578 rows:
csv_file = "/FileStore/tables/mnt/training/departuredelays.csv"
I get a partition of 8.
My question is how do I force the first CSV to partition same way as the larger file.
I am aware that repartition can be used but I was curious to know whether this can be done without any shuffle? And even if we repartition it, it seems to create a job with 3 tasks instead of 8 tasks.
Here is what I get after I run the following code snippet:
df = df.repartition(8)
print(df.count())
The first stage of the first task still has 3 tasks allocated.
Output:
(3) Spark Jobs
Job 93 View(Stages: 1/1)
Stage 123: 3/3
Job 94 View(Stages: 1/1, 1 skipped)
Stage 124: 0/3 skipped
Stage 125: 8/8
Job 95 View(Stages: 1/1, 2 skipped)
Stage 126: 0/3 skipped
Stage 127: 0/8 skipped
Stage 128: 1/1
You can try using coalesce which would do a sensible shuffle than repartition.
df = spark
.read
.format("csv")
.option("header","true")
.schema(schema)
.load(csv_file)
.coalesce(8)
Check this out for more info
Spark - repartition() vs coalesce()

Spark Partitionby doesn't scale as expected

INPUT:
The input data set contains 10 million transactions in multiple files stored as parquet. The size of the entire data set including all files ranges from 6 to 8GB.
PROBLEM STATEMENT:
Partition the transactions based on customer id's which would create one folder per customer id and each folder containing all the transactions done by that particular customer.
HDFS has a hard limit of 6.4 million on the number of sub directories within a root directory that can be created so using the last two digits of the customer id ranging from 00,01,02...to 99 to create top level directories and each top level directory would contain all the customer id's ending with that specific two digits.
Sample output directory structure:
00/cust_id=100900/part1.csv
00/cust_id=100800/part33.csv
01/cust_id=100801/part1.csv
03/cust_id=100803/part1.csv
CODE:
// Reading input file and storing in cache
val parquetReader = sparksession.read
.parquet("/inputs")
.persist(StorageLevel.MEMORY_ONLY) //No spill will occur has enough memory
// Logic to partition
var customerIdEndingPattern = 0
while (cardAccountEndingPattern < 100) {
var idEndPattern = customerIdEndingPattern + ""
if (customerIdEndingPattern < 10) {
idEndPattern = "0" + customerIdEndingPattern
}
parquetReader
.filter(col("customer_id").endsWith(idEndPattern))
.repartition(945, col("customer_id"))
.write
.partitionBy("customer_id")
.option("header", "true")
.mode("append")
.csv("/" + idEndPattern)
customerIdEndingPattern = customerIdEndingPattern + 1
}
Spark Configuration:
Amazon EMR 5.29.0 (Spark 2.4.4 & Hadoop 2.8.5)
1 master and 10 slaves and each of them has 96 vCores and 768GB RAM(Amazon AWS R5.24xlarge instance). Hard disks are EBS with bust of 3000 IOPS for 30 mins.
'spark.hadoop.dfs.replication': '3',
'spark.driver.cores':'5',
'spark.driver.memory':'32g',
'spark.executor.instances': '189',
'spark.executor.memory': '32g',
'spark.executor.cores': '5',
'spark.executor.memoryOverhead':'8192',
'spark.driver.memoryOverhead':'8192',
'spark.default.parallelism':'945',
'spark.sql.shuffle.partitions' :'945',
'spark.serializer':'org.apache.spark.serializer.KryoSerializer',
'spark.dynamicAllocation.enabled': 'false',
'spark.memory.fraction':'0.8',
'spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version':'2',
'spark.memory.storageFraction':'0.2',
'spark.task.maxFailures': '6',
'spark.driver.extraJavaOptions': '-XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=12 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError="kill -9 %p"
'spark.executor.extraJavaOptions': '-XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=12 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError="kill -9 %p"
SCALING ISSUES:
Experimented from 10 to all the way upto 40 slaves(adjusting the spark configs accordingly) but still the same results the job takes more than 2hrs to complete(as shown in the first pic each job takes more than a minute and the while loop runs 99 times). Also the reads from remote executors are almost non existent(which is good) most are process local.
Partition seems to work fine(refer second pic) got 5 RDD blocks per instance and 5 tasks running at all times(each instance has 5 cores and 19 instances per slave node). GC is optimized too.
Each partitionby task as written in the while loop takes a minute or more to complete.
METRICS:
Sample duration of a few jobs we have 99 jobs in total
Partition seems okay
Summary from 1 job basically one partitionby execution
Summary of a few instances after full job completion hence RDD blocks is zero and the first row is driver.
So the question is how to optimize it more and why it's not scaling up? Is there a better way to go about it? Have I reached the max performance already? Assuming I have access to more resources in terms of hardware is there anything I could do better? Any suggestions are welcome.
Touching every record 100 times is very inefficient, even if data can be cached in memory and not be evicted downstream. Not to mention persisting alone is expensive
Instead you could add a virtual column
import org.apache.spark.sql.functions.substring
val df = sparksession.read
.parquet("/inputs")
.withColumn("partition_id", substring($"customer_id", -2, 2))
and use it later for partitioning
df
.write
.partitionBy("partition_id", "customer_id")
.option("header", "true")
.mode("append")
.csv("/")
To avoid to many small files you can repartition first using longer suffix
val nParts: Int = ???
val suffixLength: Int = ??? // >= suffix length used for write partitions
df
.repartitionByRange(
nParts,
substring($"customer_id", -suffixLength, suffixLength)
.write
.partitionBy("partition_id", "customer_id")
.option("header", "true")
.mode("append")
.csv("/")
Such changes will allow you to process all data in a single pass without any explicit caching.

Optimization Spark job - Spark 2.1

my spark job currently runs in 59 mins. I want to optimize it so that I it takes less time. I have noticed that the last step of the job takes a lot of time (55 mins) (see the screenshots of the spark job in Spark UI below).
I need to join a big dataset with a smaller one, apply transformations on this joined dataset (creating a new column).
At the end, I should have a dataset repartitioned based on the column PSP (see snippet of the code below). I also perform a sort at the end (sort each partition based on 3 columns).
All the details (infrastructure, configuration, code) can be found below.
Snippet of my code :
spark.conf.set("spark.sql.shuffle.partitions", 4158)
val uh = uh_months
.withColumn("UHDIN", datediff(to_date(unix_timestamp(col("UHDIN_YYYYMMDD"), "yyyyMMdd").cast(TimestampType)),
to_date(unix_timestamp(col("january"), "yyyy-MM-dd").cast(TimestampType))))
"ddMMMyyyy")).cast(TimestampType)))
.withColumn("DVA_1", date_format(col("DVA"), "dd/MM/yyyy"))
.drop("UHDIN_YYYYMMDD")
.drop("january")
.drop("DVA")
.persist()
val uh_flag_comment = new TransactionType().transform(uh)
uh.unpersist()
val uh_joined = uh_flag_comment.join(broadcast(smallDF), "NO_NUM")
.select(
uh.col("*"),
smallDF.col("PSP"),
smallDF.col("minrel"),
smallDF.col("Label"),
smallDF.col("StartDate"))
.withColumnRenamed("DVA_1", "DVA")
smallDF.unpersist()
val uh_to_be_sorted = uh_joined.repartition(4158, col("PSP"))
val uh_final = uh_joined.sortWithinPartitions(col("NO_NUM"), col("UHDIN"), col("HOURMV"))
uh_final
EDITED - Repartition logic
val sqlContext = spark.sqlContext
sqlContext.udf.register("randomUDF", (partitionCount: Int) => {
val r = new scala.util.Random
r.nextInt(partitionCount)
// Also tried with r.nextInt(partitionCount) + col("PSP")
})
val uh_to_be_sorted = uh_joined
.withColumn("tmp", callUDF("RandomUDF", lit("4158"))
.repartition(4158, col("tmp"))
.drop(col("tmp"))
val uh_final = uh_to_be_sorted.sortWithinPartitions(col("NO_NUM"), col("UHDIN"), col("HOURMV"))
uh_final
smallDF is a small dataset (535MB) that I broadcast.
TransactionType is a class where I add a new column of string elements to my uh dataframe based on the value of 3 columns (MMED, DEBCRED, NMTGP), checking the values of those columns using regex.
I previously faced a lot of issues (job failing) because of shuffle blocks that were not found. I discovered that I was spilling to disk and had a lot of GC memory issues so I increased the "spark.sql.shuffle.partitions" to 4158.
WHY 4158 ?
Partition_count = (stage input data) / (target size of your partition)
so Shuffle partition_count = (shuffle stage input data) / 200 MB = 860000/200=4300
I have 16*24 - 6 =378 cores availaible. So if I want to run every tasks in one go, I should divide 4300 by 378 which is approximately 11. Then 11*378=4158
Spark Version: 2.1
Cluster configuration:
24 compute nodes (workers)
16 vcores each
90 GB RAM per node
6 cores are already being used by other processes/jobs
Current Spark configuration:
-master: yarn
-executor-memory: 26G
-executor-cores: 5
-driver memory: 70G
-num-executors: 70
-spark.kryoserializer.buffer.max=512
-spark.driver.cores=5
-spark.driver.maxResultSize=500m
-spark.memory.storageFraction=0.4
-spark.memory.fraction=0.9
-spark.hadoop.fs.permissions.umask-mode=007
How is the job executed:
We build an artifact (jar) with IntelliJ and then send it to a server. Then a bash script is executed. This script:
export some environment variables (SPARK_HOME, HADOOP_CONF_DIR, PATH and SPARK_LOCAL_DIRS)
launch the spark-submit command with all the parameters defined in the spark configuration above
retrieves the yarn logs of the application
Spark UI screenshots
DAG
#Ali
From the Summary Metrics we can say that your data is Skewed ( Max Duration : 49 min and Max Shuffle Read Size/Records : 2.5 GB/ 23,947,440 where as on an average it's taking about 4-5 mins and processing less than 200 MB/1.2 MM rows)
Now that we know the problem might be skew of data in few partition(s) , I think we can fix this by changing repartition logic val uh_to_be_sorted = uh_joined.repartition(4158, col("PSP")) by chosing something (like some other column or adding any other column to PSP)
few links to refer on data skew and fix
https://dzone.com/articles/optimize-spark-with-distribute-by-cluster-by
https://datarus.wordpress.com/2015/05/04/fighting-the-skew-in-spark/
Hope this helps

How to remove empty partition in a dataframe?

I need to remove the empty partitions from a Dataframe
We are having two Dataframes, both are created using sqlContext. And the dataframes are constructed and combined as below
import org.apache.spark.sql.{SQLContext}
val sqlContext = new SQLContext(sc)
// Loading Dataframe 1
val csv1 = "s3n://xxxxx:xxxxxx#xxxx/xxx.csv"
val csv1DF = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load(csv1)
// Loading Dataframe 2
val csv2 = "s3n://xxxxx:xxxxxx#xxxx/xxx.csv"
val csv2DF = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load(csv1)
// Combining dataframes
val combinedDF = csv1.
join(csv2 csv1("column_1") === csv2("column_2"))
Now the number of partition for combinedDF is 200.
From here it is found that the default number of partition is 200 when we use joins.
In some cases the dataframe/csv is not big and getting many empty partition which causes issues later part of the code.
So how can I remove these empty partition created?
The repartition method can be used to create an RDD without any empty partitions.
This thread discusses the optimal number of partitions for a given cluster. Here is good rule of thumb for estimating the optimal number of partitions.
number_of_partitions = number_of_cores * 4
If you have a cluster of 8 r3.xlarge AWS nodes, you should use 128 partitions (8 nodes * 4 CPUs per node * 4 partitions per CPU).

Resources