We are processing a roughly 500 MB file of data on EMR.
I am performing the following operations on the file.
Read the CSV:
val df = spark.read.format("csv").load(s3)
Aggregate by key and create the list:
val data = filteredDf.groupBy($"<key>")
.agg(collect_list(struct(cols.head, cols.tail: _*)) as "finalData")
.toJSON
Iterate through each partition, store the per-key aggregation to S3, and send the key to SQS:
data.foreachPartition(partition => {
  partition.foreach(json => ......)
})
The data is skewed, with one account having almost 10M records (~400 MB). I am experiencing an out-of-memory issue during foreachPartition for that account.
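One quick way to confirm the skew before tuning anything is to count records per key; a minimal sketch, assuming the same filteredDf and grouping column (the "<key>" placeholder) used in the aggregation above, and that spark.implicits._ is in scope:

// Records per key, largest first: the hot account should stand out immediately.
filteredDf
  .groupBy($"<key>")
  .count()
  .orderBy($"count".desc)
  .show(20, truncate = false)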
Configuration:
1 driver: m4.4xlarge, 16 CPU cores and 64 GB memory
1 executor: m4.2xlarge, 8 CPU cores and 32 GB memory
driver-memory: 20G
executor-memory: 10G
Partitions: default 200 (most of them don't do anything)
Any help is much appreciated! thanks a lot in advance :)
Related
I am trying to read a CSV file, add some columns, and then save it in ORC format.
I could not understand how Spark decides the number of tasks for different stages.
Why is the number of tasks 1 for the CSV stage and 39 for the ORC stage?
val c1c8 = spark.read.option("header",true).csv("/user/DEEPAK_TEST/C1C6_NEW/")
val c1c8new = { c1c8.withColumnRenamed("c1c6_F","c1c8").withColumnRenamed("Network_Out","c1c8_network").withColumnRenamed("Access NE Out","c1c8_access_ne")
.withColumn("c1c8_signalling",when (col("signalling_Out") === "SIP Cl4" , "SIP CL4").when (col("signalling_Out") === "SIP cl4" , "SIP CL4").when (col("signalling_Out") === "Other" , "other").otherwise(col("signalling_Out")))
.withColumnRenamed("access type Out","c1c8_access_type").withColumnRenamed("Type_of_traffic_C","c1c8_typeoftraffic")
.withColumnRenamed("BOS traffic type Out","c1c8_bos_trafc_typ").withColumnRenamed("Scope_Out","c1c8_scope")
.withColumnRenamed("Join with UP-DWN SIP cl5 T1T7 Out","c1c8_join_indicator")
.select("c1c8","c1c8_network", "c1c8_access_ne", "c1c8_signalling", "c1c8_access_type", "c1c8_typeoftraffic",
"c1c8_bos_trafc_typ", "c1c8_scope","c1c8_join_indicator")
}
c1c8new.write.orc("/user/DEEPAK_TEST/C1C8_MAPPING_NEWT/")
Below is my understanding from looking at the Spark 2.x source code.
Stage 0 is a file scan that creates a FileScanRDD, which is an RDD that scans a list of file partitions. This stage can have more than one task when you are reading from multiple partitioned directories, such as a partitioned Hive table.
The number of tasks in Stage 1 will be equal to the number of RDD partitions. In your case c1c8new.rdd.getNumPartitions will be 39. This number is calculated using:
the config value spark.sql.files.maxPartitionBytes (128 MB by default)
sparkContext.defaultParallelism, returned by the task scheduler (equal to the number of cores when running in local mode)
totalBytes
DataSourceScanExec.scala#L423
val defaultMaxSplitBytes =
fsRelation.sparkSession.sessionState.conf.filesMaxPartitionBytes
val openCostInBytes = fsRelation.sparkSession.sessionState.conf.filesOpenCostInBytes
val defaultParallelism = fsRelation.sparkSession.sparkContext.defaultParallelism
val totalBytes = selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum
val bytesPerCore = totalBytes / defaultParallelism
val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
logInfo(s"Planning scan with bin packing, max size: $maxSplitBytes bytes, " +
s"open cost is considered as scanning $openCostInBytes bytes.")
You can see the actual calculated values in the above log message if you set the log level to INFO: spark.sparkContext.setLogLevel("INFO")
In your case, I think the split size is 128 MB, so the number of tasks/partitions is roughly 4.6 GB / 128 MB ≈ 37, which is close to the 39 tasks you see for the ORC stage.
As a side note, you can change the number of partitions (and hence the number of tasks in the subsequent stage) by using repartition() or coalesce() on the dataframe. More importantly, the number of partitions after a shuffle is determined by spark.sql.shuffle.partitions (200 by default). If you have a shuffle, it is better to use this configuration to control the number of tasks because inserting repartition() or coalesce() between stages adds extra overhead.
For large spark SQL workloads, setting optimum values for spark.sql.shuffle.partitions in each stage was always a pain point. Spark 3.x has better support for this if Adaptive Query Execution is enabled, but I haven't tried it for any production workloads.
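As a rough sketch, assuming Spark 3.x, enabling AQE and its automatic partition coalescing looks like this (these are the standard Spark 3 config keys):

// Let AQE coalesce shuffle partitions at runtime instead of hand-tuning
// spark.sql.shuffle.partitions per stage.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
// Optional: the advisory target size AQE aims for when coalescing.
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128m")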
INPUT:
The input data set contains 10 million transactions in multiple files stored as parquet. The size of the entire data set including all files ranges from 6 to 8GB.
PROBLEM STATEMENT:
Partition the transactions by customer id, creating one folder per customer id, with each folder containing all the transactions done by that particular customer.
HDFS has a hard limit of 6.4 million on the number of subdirectories that can be created within a root directory, so we use the last two digits of the customer id (00, 01, 02, ... 99) to create the top-level directories; each top-level directory contains all the customer ids ending with those two digits.
Sample output directory structure:
00/cust_id=100900/part1.csv
00/cust_id=100800/part33.csv
01/cust_id=100801/part1.csv
03/cust_id=100803/part1.csv
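To make the bucketing rule concrete, a tiny illustration in plain Scala (values taken from the sample paths above):

// The top-level directory is just the last two digits of the customer id.
val customerId = "100803"
val topLevelDir = customerId.takeRight(2) // "03" -> 03/cust_id=100803/part1.csv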
CODE:
// Reading input file and storing in cache
val parquetReader = sparksession.read
  .parquet("/inputs")
  .persist(StorageLevel.MEMORY_ONLY) // No spill will occur, there is enough memory

// Logic to partition
var customerIdEndingPattern = 0
while (customerIdEndingPattern < 100) {
  var idEndPattern = customerIdEndingPattern + ""
  if (customerIdEndingPattern < 10) {
    idEndPattern = "0" + customerIdEndingPattern
  }
  parquetReader
    .filter(col("customer_id").endsWith(idEndPattern))
    .repartition(945, col("customer_id"))
    .write
    .partitionBy("customer_id")
    .option("header", "true")
    .mode("append")
    .csv("/" + idEndPattern)
  customerIdEndingPattern = customerIdEndingPattern + 1
}
Spark Configuration:
Amazon EMR 5.29.0 (Spark 2.4.4 & Hadoop 2.8.5)
1 master and 10 slaves, each with 96 vCores and 768 GB RAM (Amazon AWS r5.24xlarge instances). Hard disks are EBS with a burst of 3000 IOPS for 30 minutes.
'spark.hadoop.dfs.replication': '3',
'spark.driver.cores':'5',
'spark.driver.memory':'32g',
'spark.executor.instances': '189',
'spark.executor.memory': '32g',
'spark.executor.cores': '5',
'spark.executor.memoryOverhead':'8192',
'spark.driver.memoryOverhead':'8192',
'spark.default.parallelism':'945',
'spark.sql.shuffle.partitions' :'945',
'spark.serializer':'org.apache.spark.serializer.KryoSerializer',
'spark.dynamicAllocation.enabled': 'false',
'spark.memory.fraction':'0.8',
'spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version':'2',
'spark.memory.storageFraction':'0.2',
'spark.task.maxFailures': '6',
'spark.driver.extraJavaOptions': '-XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=12 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError="kill -9 %p"',
'spark.executor.extraJavaOptions': '-XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=12 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError="kill -9 %p"'
SCALING ISSUES:
I experimented with everything from 10 all the way up to 40 slaves (adjusting the Spark configs accordingly), but with the same results: the job takes more than 2 hours to complete (as shown in the first pic, each job takes more than a minute and the while loop runs 99 times). Also, reads from remote executors are almost non-existent (which is good); most are process-local.
Partitioning seems to work fine (refer to the second pic): I get 5 RDD blocks per instance and 5 tasks running at all times (each instance has 5 cores, and there are 19 instances per slave node). GC is optimized too.
Each partitionBy write in the while loop takes a minute or more to complete.
METRICS:
Sample duration of a few jobs (we have 99 jobs in total)
Partitioning seems okay
Summary from one job, i.e. a single partitionBy execution
Summary of a few instances after full job completion, hence RDD blocks are zero and the first row is the driver
So the question is: how can I optimize this further, and why isn't it scaling up? Is there a better way to go about it? Have I already reached maximum performance? Assuming I have access to more hardware, is there anything I could do better? Any suggestions are welcome.
Touching every record 100 times is very inefficient, even if the data can be cached in memory and is not evicted downstream. Not to mention that persisting alone is expensive.
Instead, you could add a virtual column:
import org.apache.spark.sql.functions.substring
val df = sparksession.read
.parquet("/inputs")
.withColumn("partition_id", substring($"customer_id", -2, 2))
and use it later for partitioning
df
.write
.partitionBy("partition_id", "customer_id")
.option("header", "true")
.mode("append")
.csv("/")
To avoid too many small files, you can repartition first using a longer suffix:
val nParts: Int = ???
val suffixLength: Int = ??? // >= suffix length used for write partitions
df
  .repartitionByRange(
    nParts,
    substring($"customer_id", -suffixLength, suffixLength))
  .write
  .partitionBy("partition_id", "customer_id")
  .option("header", "true")
  .mode("append")
  .csv("/")
Such changes will allow you to process all data in a single pass without any explicit caching.
My Spark job currently runs in 59 minutes. I want to optimize it so that it takes less time. I have noticed that the last step of the job takes a lot of time (55 minutes); see the screenshots of the Spark UI below.
I need to join a big dataset with a smaller one, apply transformations on this joined dataset (creating a new column).
At the end, I should have a dataset repartitioned based on the column PSP (see snippet of the code below). I also perform a sort at the end (sort each partition based on 3 columns).
All the details (infrastructure, configuration, code) can be found below.
Snippet of my code :
spark.conf.set("spark.sql.shuffle.partitions", 4158)
val uh = uh_months
  .withColumn("UHDIN", datediff(to_date(unix_timestamp(col("UHDIN_YYYYMMDD"), "yyyyMMdd").cast(TimestampType)),
    to_date(unix_timestamp(col("january"), "yyyy-MM-dd").cast(TimestampType))))
  .withColumn("DVA_1", date_format(col("DVA"), "dd/MM/yyyy"))
  .drop("UHDIN_YYYYMMDD")
  .drop("january")
  .drop("DVA")
  .persist()
val uh_flag_comment = new TransactionType().transform(uh)
uh.unpersist()
val uh_joined = uh_flag_comment.join(broadcast(smallDF), "NO_NUM")
.select(
uh.col("*"),
smallDF.col("PSP"),
smallDF.col("minrel"),
smallDF.col("Label"),
smallDF.col("StartDate"))
.withColumnRenamed("DVA_1", "DVA")
smallDF.unpersist()
val uh_to_be_sorted = uh_joined.repartition(4158, col("PSP"))
val uh_final = uh_to_be_sorted.sortWithinPartitions(col("NO_NUM"), col("UHDIN"), col("HOURMV"))
uh_final
EDITED - Repartition logic
val sqlContext = spark.sqlContext
sqlContext.udf.register("randomUDF", (partitionCount: Int) => {
  val r = new scala.util.Random
  r.nextInt(partitionCount)
  // Also tried with r.nextInt(partitionCount) + col("PSP")
})
val uh_to_be_sorted = uh_joined
  .withColumn("tmp", callUDF("randomUDF", lit(4158)))
  .repartition(4158, col("tmp"))
  .drop(col("tmp"))
val uh_final = uh_to_be_sorted.sortWithinPartitions(col("NO_NUM"), col("UHDIN"), col("HOURMV"))
uh_final
smallDF is a small dataset (535MB) that I broadcast.
TransactionType is a class in which I add a new string column to my uh dataframe based on the values of 3 columns (MMED, DEBCRED, NMTGP), checking those values with regexes.
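For context, here is a purely hypothetical sketch of what such a transform could look like; the actual regex patterns, output column name, and flag values used in TransactionType are not shown in the question:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, when}

// Hypothetical sketch only: "transaction_flag" and the regexes below are made up.
class TransactionType {
  def transform(df: DataFrame): DataFrame =
    df.withColumn(
      "transaction_flag",
      when(col("MMED").rlike("^CB") && col("DEBCRED") === "D", lit("CARD_DEBIT"))
        .when(col("NMTGP").rlike("^VIR"), lit("TRANSFER"))
        .otherwise(lit("OTHER")))
}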
I previously faced a lot of issues (job failing) because of shuffle blocks that were not found. I discovered that I was spilling to disk and had a lot of GC memory issues so I increased the "spark.sql.shuffle.partitions" to 4158.
WHY 4158 ?
partition_count = (stage input data) / (target size of your partition)
so shuffle partition_count = (shuffle stage input data) / 200 MB = 860000 / 200 = 4300
I have 16 * 24 - 6 = 378 cores available. So if I want to run all tasks in one go, I should divide 4300 by 378, which is approximately 11. Then 11 * 378 = 4158.
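The same sizing logic written out as code, with the numbers taken from the question:

// Shuffle partition sizing as described above.
val shuffleInputMb = 860000                        // ~860 GB of shuffle-stage input, in MB
val targetPartitionMb = 200                        // target partition size
val rawCount = shuffleInputMb / targetPartitionMb  // 4300
val availableCores = 16 * 24 - 6                   // 378
val waves = rawCount / availableCores              // 11 (integer division)
val shufflePartitions = waves * availableCores     // 4158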
Spark Version: 2.1
Cluster configuration:
24 compute nodes (workers)
16 vcores each
90 GB RAM per node
6 cores are already being used by other processes/jobs
Current Spark configuration:
-master: yarn
-executor-memory: 26G
-executor-cores: 5
-driver memory: 70G
-num-executors: 70
-spark.kryoserializer.buffer.max=512
-spark.driver.cores=5
-spark.driver.maxResultSize=500m
-spark.memory.storageFraction=0.4
-spark.memory.fraction=0.9
-spark.hadoop.fs.permissions.umask-mode=007
How the job is executed:
We build an artifact (jar) with IntelliJ and then send it to a server. Then a bash script is executed. This script:
exports some environment variables (SPARK_HOME, HADOOP_CONF_DIR, PATH and SPARK_LOCAL_DIRS)
launches the spark-submit command with all the parameters defined in the Spark configuration above
retrieves the YARN logs of the application
Spark UI screenshots
DAG
#Ali
From the summary metrics we can say that your data is skewed (max duration: 49 min and max shuffle read size/records: 2.5 GB / 23,947,440, whereas on average a task takes about 4-5 minutes and processes less than 200 MB / 1.2 MM rows).
Now that we know the problem is likely data skew in a few partitions, I think we can fix this by changing the repartition logic (val uh_to_be_sorted = uh_joined.repartition(4158, col("PSP"))) to partition on something else, either some other column or an additional column alongside PSP.
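A hedged sketch of that change, reusing a column that already appears in the question (any reasonably well-distributed second column, or a random salt, would do; NO_NUM is only an example):

import org.apache.spark.sql.functions.col

// Spread a hot PSP value across partitions by adding a second repartition key.
// NO_NUM is used here purely as an illustration; pick whatever column breaks up the hot key.
val uh_to_be_sorted = uh_joined.repartition(4158, col("PSP"), col("NO_NUM"))
val uh_final = uh_to_be_sorted.sortWithinPartitions(col("NO_NUM"), col("UHDIN"), col("HOURMV"))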
A few links on data skew and how to fix it:
https://dzone.com/articles/optimize-spark-with-distribute-by-cluster-by
https://datarus.wordpress.com/2015/05/04/fighting-the-skew-in-spark/
Hope this helps
I have a folder with 14 files in it. I run spark-submit with 10 executors on a cluster that uses YARN as the resource manager.
I create my first RDD as this:
JavaPairRDD<String,String> files = sc.wholeTextFiles(folderPath.toString(), 10);
However, files.getNumPartitions() gives me 7 or 8, randomly. I do not use coalesce/repartition anywhere, so I finish my DAG with 7-8 partitions.
As far as I know, the argument we pass is the "minimum" number of partitions, so why does Spark divide my RDD into 7-8 partitions?
I also ran the same program with 20 partitions and it gave me 11.
I have seen a topic here, but it was about "more" partitions, which did not help me at all.
Note: In the program, I read another folder which has 10 files, and Spark creates 10 partitions successfully. I run the above problematic transformation after this successful job is finished.
File sizes:
1)25.07 KB
2)46.61 KB
3)126.34 KB
4)158.15 KB
5)169.21 KB
6)16.03 KB
7)67.41 KB
8)60.84 KB
9)70.83 KB
10)87.94 KB
11)99.29 KB
12)120.58 KB
13)170.43 KB
14)183.87 KB
Files are on the HDFS, block sizes are 128MB, replication factor 3.
It would have been clearer if we had the size of each file, but the code path is deterministic, so I am adding this answer based on the Spark code base.
First of all, maxSplitSize is calculated from the total directory size and the minimum number of partitions passed to wholeTextFiles:
def setMinPartitions(context: JobContext, minPartitions: Int) {
val files = listStatus(context).asScala
val totalLen = files.map(file => if (file.isDirectory) 0L else file.getLen).sum
val maxSplitSize = Math.ceil(totalLen * 1.0 /
(if (minPartitions == 0) 1 else minPartitions)).toLong
super.setMaxSplitSize(maxSplitSize)
}
// file: WholeTextFileInputFormat.scala
link
Based on maxSplitSize, splits (partitions in Spark) will be extracted from the source:
inputFormat.setMinPartitions(jobContext, minPartitions)
val rawSplits = inputFormat.getSplits(jobContext).toArray // Here the number of splits is decided
val result = new Array[Partition](rawSplits.size)
for (i <- 0 until rawSplits.size) {
  result(i) = new NewHadoopPartition(id, i, rawSplits(i).asInstanceOf[InputSplit with Writable])
}
// file: WholeTextFileRDD.scala
link
More information is available in CombineFileInputFormat#getSplits on how files are read and splits are prepared.
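A rough back-of-the-envelope check with the file sizes listed in the question (the exact packing also depends on locality grouping, which is why the result varies between runs):

// Approximate split-size calculation for minPartitions = 10, sizes in KB.
val sizesKb = Seq(25.07, 46.61, 126.34, 158.15, 169.21, 16.03, 67.41,
                  60.84, 70.83, 87.94, 99.29, 120.58, 170.43, 183.87)
val totalKb = sizesKb.sum                 // ≈ 1402.6 KB
val maxSplitKb = math.ceil(totalKb / 10)  // ≈ 141 KB per split for minPartitions = 10
// CombineFileInputFormat closes a split only once the accumulated (whole) files reach
// this threshold, so real splits come out somewhat larger than 141 KB on average,
// which lands you at roughly 7-8 splits instead of exactly 10.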
Note: I refer to Spark partitions as MapReduce splits here, since Spark borrowed its input and output formats from MapReduce.
I need some help.
I have a problem with Apache Spark: when I use a for loop to update a dataframe, its size keeps growing without limit even though its count does not grow.
Can you suggest how to fix this, or explain why the dataframe size keeps growing? (T^T)//
My program runs on local[6] using Spark 2.0.1.
# This is my code
def main(args: Array[String]): Unit = {
  var df1 = initial dataframe(read from db)
  while() {
    val word_count_df = processAndCountText() // query data from database and do word count
    val temp_df1 = update(df1, word_count_df)
    temp_df1.persist(StorageLevel.MEMORY_AND_DISK)
    df1.unpersist()
    df1 = temp_df1
    println(temp_df1.count())
    println(s"${SizeEstimator.estimate(temp_df1) / 1073741824.0} GB")
  }
}
//Edited
This is the update function, which updates rows whose key appears in word_count_df.
I tried splitting it into two dataframes, computing them separately, and returning their union, but that takes too much time because it requires enabling "spark.sql.crossJoin.enabled".
def update(u_stateful_df : DataFrame, word_count_df : DataFrame) : DataFrame = {
val run_time = current_end_time_m - start_time_ms / 60000
val calPenalty = udf { (last_update_duration: Long, run_time: Long) => calculatePenalty(last_update_duration, run_time) }
// calculatePenalty is a simple math function using a for loop and returns a Double
val calVold = udf { (v_old: Double, penalty_power: Double) => v_old * Math.exp(penalty_power) }
//(word_new,count_new)
val word_count_temp_df = word_count_df
.withColumnRenamed("word", "word_new")
.withColumnRenamed("count", "count_new")
//u_stateful_df (word,u,v,a,last_update,count)
val state_df = u_stateful_df
.join(word_count_temp_df, u_stateful_df("word") === word_count_temp_df("word_new"), "outer")
.na.fill(Map("last_update" -> start_time_ms / 60000))
.na.fill(0.0)
.withColumn("word", when(col("word").isNotNull, col("word")).otherwise(col("word_new")))
.withColumn("count", when(col("word_new").isNotNull, col("count_new")).otherwise(-1))
.drop("count_new")
.withColumn("current_end_time_m", lit(current_end_time_m))
.withColumn("last_update_duration", col("current_end_time_m") - col("last_update"))
.filter(col("last_update_duration") < ResourceUtility.one_hour_duration_ms / 60000)
.withColumn("run_time", when(col("word_new").isNotNull, lit(run_time)))
.withColumn("penalty_power", when(col("word_new").isNotNull, calPenalty(col("last_update_duration"), col("run_time"))))
.withColumn("v_old_penalty", when(col("word_new").isNotNull, calVold(col("v"), col("penalty_power"))))
.withColumn("v_new", when(col("word_new").isNotNull, col("count") / run_time))
.withColumn("v_sum", when(col("word_new").isNotNull, col("v_old_penalty") + col("v_new")))
.withColumn("a", when(col("word_new").isNotNull, (col("v_sum") - col("v")) / col("last_update_duration")).otherwise(col("a")))
.withColumn("last_update", when(col("word_new").isNotNull, lit(current_end_time_m)).otherwise(col("last_update")))
.withColumn("u", when(col("word_new").isNotNull, col("v")).otherwise(col("u")))
.withColumn("v", when(col("word_new").isNotNull, col("v_sum")).otherwise(col("v")))
state_df.select("word", "u", "v", "a", "last_update", "count")
}
# This is my log
u_stateful_df : 1408665
size of dataframe size : 0.8601360470056534 GB
u_stateful_df : 1408665
size of dataframe size : 1.3347024470567703 GB
u_stateful_df : 268498
size of dataframe size : 1.5012029185891151 GB
u_stateful_df : 147232
size of dataframe size : 3.287795402109623 GB
u_stateful_df : 111950
size of dataframe size : 4.761911824345589 GB
....
....
u_stateful_df : 72067
size of dataframe size : 14.510709017515182 GB
# This is the log when I write it to a file
I save df1 as CSV to the file system. Below are the size of the dataframe in the file system, its count, and its size (tracked by org.apache.spark.util.SizeEstimator).
csv size 84.2 MB
u_stateful_df : 1408665
size of dataframe size : 0.4460855945944786 GB
csv size 15.2 MB
u_stateful_df : 183315
size of dataframe size : 0.522 GB
csv size 9.96 MB
u_stateful_df : 123381
size of dataframe size : 0.630GB
csv size 4.63 MB
u_stateful_df : 56896
size of dataframe size : 0.999 GB
...
...
...
csv size 3.47 MB
u_stateful_df : 43104
size of dataframe size : 3.1956922858953476 GB
It looks like some leak inside Spark. Usually, when you call persist or cache on a Dataframe and then count it, Spark generates the results and stores them in distributed memory or on disk, but it also keeps the whole execution plan so it can rebuild the Dataframe if an executor is lost. It should not take that much space, though...
As far as I know there is no option to "collapse" a Dataframe (to tell Spark to forget the whole execution plan) other than simply writing it to storage and then reading it back from that storage.
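A minimal sketch of that write-and-read-back workaround, assuming an arbitrary path and Parquet as the intermediate format (both the path and the format are placeholders, not part of the original code):

import org.apache.spark.storage.StorageLevel

// Inside the loop, instead of persisting temp_df1 directly:
val checkpointPath = "/tmp/df1_checkpoint"               // hypothetical location
temp_df1.write.mode("overwrite").parquet(checkpointPath)
df1 = spark.read.parquet(checkpointPath)                 // lineage is now just "read parquet"
df1.persist(StorageLevel.MEMORY_AND_DISK)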