I need some help.
I have a problem with Apache Spark when I use a loop to update a dataframe: its size keeps growing without limit, although its count is not growing.
Can you suggest how to fix it, or explain why my dataframe size keeps growing all the time? (T^T)//
My program runs on local[6] using Spark 2.0.1.
# This is my code
def main(args: Array[String]): Unit = {
  var df1 = ... // initial DataFrame (read from the DB)
  while (true) {
    val word_count_df = processAndCountText() // query data from the database and do a word count
    val temp_df1 = update(df1, word_count_df)
    temp_df1.persist(StorageLevel.MEMORY_AND_DISK)
    df1.unpersist()
    df1 = temp_df1
    println(temp_df1.count())
    println(s"${SizeEstimator.estimate(temp_df1) / 1073741824.0} GB")
  }
}
// Edited
This is the update function; it updates the rows whose key appears in word_count_df.
I tried splitting it into two dataframes, computing each separately, and then returning the union of the two, but that takes too much time because it requires enabling "spark.sql.crossJoin.enabled".
def update(u_stateful_df: DataFrame, word_count_df: DataFrame): DataFrame = {
  val run_time = current_end_time_m - start_time_ms / 60000

  val calPenalty = udf { (last_update_duration: Long, run_time: Long) => calculatePenalty(last_update_duration, run_time) }
  // calculatePenalty is a simple math function using a for loop; it returns a Double
  val calVold = udf { (v_old: Double, penalty_power: Double) => v_old * Math.exp(penalty_power) }

  // (word_new, count_new)
  val word_count_temp_df = word_count_df
    .withColumnRenamed("word", "word_new")
    .withColumnRenamed("count", "count_new")

  // u_stateful_df: (word, u, v, a, last_update, count)
  val state_df = u_stateful_df
    .join(word_count_temp_df, u_stateful_df("word") === word_count_temp_df("word_new"), "outer")
    .na.fill(Map("last_update" -> start_time_ms / 60000))
    .na.fill(0.0)
    .withColumn("word", when(col("word").isNotNull, col("word")).otherwise(col("word_new")))
    .withColumn("count", when(col("word_new").isNotNull, col("count_new")).otherwise(-1))
    .drop("count_new")
    .withColumn("current_end_time_m", lit(current_end_time_m))
    .withColumn("last_update_duration", col("current_end_time_m") - col("last_update"))
    .filter(col("last_update_duration") < ResourceUtility.one_hour_duration_ms / 60000)
    .withColumn("run_time", when(col("word_new").isNotNull, lit(run_time)))
    .withColumn("penalty_power", when(col("word_new").isNotNull, calPenalty(col("last_update_duration"), col("run_time"))))
    .withColumn("v_old_penalty", when(col("word_new").isNotNull, calVold(col("v"), col("penalty_power"))))
    .withColumn("v_new", when(col("word_new").isNotNull, col("count") / run_time))
    .withColumn("v_sum", when(col("word_new").isNotNull, col("v_old_penalty") + col("v_new")))
    .withColumn("a", when(col("word_new").isNotNull, (col("v_sum") - col("v")) / col("last_update_duration")).otherwise(col("a")))
    .withColumn("last_update", when(col("word_new").isNotNull, lit(current_end_time_m)).otherwise(col("last_update")))
    .withColumn("u", when(col("word_new").isNotNull, col("v")).otherwise(col("u")))
    .withColumn("v", when(col("word_new").isNotNull, col("v_sum")).otherwise(col("v")))

  state_df.select("word", "u", "v", "a", "last_update", "count")
}
# This is my log
u_stateful_df : 1408665
size of dataframe size : 0.8601360470056534 GB
u_stateful_df : 1408665
size of dataframe size : 1.3347024470567703 GB
u_stateful_df : 268498
size of dataframe size : 1.5012029185891151 GB
u_stateful_df : 147232
size of dataframe size : 3.287795402109623 GB
u_stateful_df : 111950
size of dataframe size : 4.761911824345589 GB
....
....
u_stateful_df : 72067
size of dataframe size : 14.510709017515182 GB
# This is the log when I write it to a file
I save df1 as CSV to the file system. Below are the size of the dataframe in the file system (CSV size), the count, and the in-memory size (tracked by org.apache.spark.util.SizeEstimator).
csv size 84.2 MB
u_stateful_df : 1408665
size of dataframe size : 0.4460855945944786 GB
csv size 15.2 MB
u_stateful_df : 183315
size of dataframe size : 0.522 GB
csv size 9.96 MB
u_stateful_df : 123381
size of dataframe size : 0.630GB
csv size 4.63 MB
u_stateful_df : 56896
size of dataframe size : 0.999 GB
...
...
...
csv size 3.47 MB
u_stateful_df : 43104
size of dataframe size : 3.1956922858953476 GB
It looks like some leak inside Spark. Usually, when you call persist or cache on a DataFrame and then count it, Spark materializes the result and stores it in distributed memory or on disk, but it also keeps the whole execution plan so it can rebuild that DataFrame in case of a lost executor or similar failure. Still, it should not take so much space...
As far as I know, there is no option to "collapse" a DataFrame (to tell Spark to forget the whole execution plan) other than simply writing it to storage and then reading it back from that storage.
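A minimal sketch of that write-and-read-back workaround, assuming a hypothetical Parquet path (on Spark 2.1+, Dataset.checkpoint() together with sparkContext.setCheckpointDir gives a similar lineage truncation):

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.storage.StorageLevel

// Sketch: swap a DataFrame's ever-growing execution plan for a short one by
// materializing it to storage and reading it back. The path is illustrative only.
def collapsePlan(spark: SparkSession, df: DataFrame, path: String): DataFrame = {
  df.write.mode("overwrite").parquet(path) // materialize the current state
  val reloaded = spark.read.parquet(path)  // new plan: just a Parquet scan
  reloaded.persist(StorageLevel.MEMORY_AND_DISK)
}

// In the loop, e.g. every few iterations (alternate between two paths so you
// never overwrite data that a still-referenced DataFrame depends on):
// df1 = collapsePlan(spark, temp_df1, "/tmp/state_snapshot") // hypothetical path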
Related
We are processing a roughly 500 MB file of data in EMR.
I am performing the following operations on the file.
read csv :
val df = spark.read.format("csv").load(s3)
aggregating by key and creating the list :
val data = filteredDf.groupBy($"<key>")
.agg(collect_list(struct(cols.head, cols.tail: _*)) as "finalData")
.toJSON
Iterating through each partition, storing the per-key aggregation to S3, and sending the key to SQS:
data.foreachPartition(partition => {
  partition.foreach(json => ......)
})
The data is skewed, with one account having almost 10M records (~400 MB). I am experiencing an out-of-memory issue during foreachPartition for that account.
Configuration:
1 driver : m4.4xlarge CPU Cores : 16 and Memory : 64GB
1 executor : m4.2xlarge CPU Cores : 8 and Memory : 32GB
driver-memory: 20G
executor-memory: 10G
Partitions : default 200 [ most of them don't do anything ]
Any help is much appreciated! thanks a lot in advance :)
I'm looking for a detailed description of how partitions are created in Spark. I assume they are created based on the number of available cores in the cluster. But take, for example, a 512 MB file that needs to be processed; it is stored in my storage (which can be HDFS or an S3 bucket) with a block size of either 64 MB or 128 MB. For this case, assume my cluster has 8 cores. When this file is processed by the Spark program, how will it be partitioned? Will the 512 MB file be divided into 8 partitions and executed on 8 cores? Please provide suggestions on this.
I found something in the source code of FilePartition.scala.
It seems the number of partitions is related to the configuration parameters "maxSplitBytes" and "filesOpenCostInBytes":
def getFilePartitions(
    sparkSession: SparkSession,
    partitionedFiles: Seq[PartitionedFile],
    maxSplitBytes: Long): Seq[FilePartition] = {
  val partitions = new ArrayBuffer[FilePartition]
  val currentFiles = new ArrayBuffer[PartitionedFile]
  var currentSize = 0L

  /** Close the current partition and move to the next. */
  def closePartition(): Unit = {
    if (currentFiles.nonEmpty) {
      // Copy to a new Array.
      val newPartition = FilePartition(partitions.size, currentFiles.toArray)
      partitions += newPartition
    }
    currentFiles.clear()
    currentSize = 0
  }

  val openCostInBytes = sparkSession.sessionState.conf.filesOpenCostInBytes
  // Assign files to partitions using "Next Fit Decreasing"
  partitionedFiles.foreach { file =>
    if (currentSize + file.length > maxSplitBytes) {
      closePartition()
    }
    // Add the given file to the current partition.
    currentSize += file.length + openCostInBytes
    currentFiles += file
  }
  closePartition()
  partitions.toSeq
}
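For file-based DataFrame sources, the maxSplitBytes value passed into this method is derived from the session configs spark.sql.files.maxPartitionBytes (default 128 MB) and spark.sql.files.openCostInBytes (default 4 MB) together with the default parallelism, and files are then bin-packed as shown above. So a 512 MB file like the one in the question is split by size, not simply matched to the 8 cores. A small sketch of tuning these settings and checking the resulting partition count (the path and master are hypothetical):

import org.apache.spark.sql.SparkSession

object PartitionCountCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partition-count-check")
      .master("local[8]") // assume 8 cores, as in the question
      // The two settings that feed into maxSplitBytes (values shown are the defaults).
      .config("spark.sql.files.maxPartitionBytes", 128L * 1024 * 1024)
      .config("spark.sql.files.openCostInBytes", 4L * 1024 * 1024)
      .getOrCreate()

    // Hypothetical 512 MB CSV file; getNumPartitions shows how many file
    // partitions the bin-packing above produced.
    val df = spark.read.format("csv").load("/data/big_file.csv")
    println(s"file partitions = ${df.rdd.getNumPartitions}")

    spark.stop()
  }
}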
My Spark job currently runs in 59 minutes, and I want to optimize it so that it takes less time. I have noticed that the last step of the job takes most of that time (55 minutes) (see the screenshots of the job in the Spark UI below).
I need to join a big dataset with a smaller one and apply transformations to the joined dataset (creating a new column).
At the end, I should have a dataset repartitioned based on the column PSP (see snippet of the code below). I also perform a sort at the end (sort each partition based on 3 columns).
All the details (infrastructure, configuration, code) can be found below.
Snippet of my code :
spark.conf.set("spark.sql.shuffle.partitions", 4158)
val uh = uh_months
.withColumn("UHDIN", datediff(to_date(unix_timestamp(col("UHDIN_YYYYMMDD"), "yyyyMMdd").cast(TimestampType)),
to_date(unix_timestamp(col("january"), "yyyy-MM-dd").cast(TimestampType))))
"ddMMMyyyy")).cast(TimestampType)))
.withColumn("DVA_1", date_format(col("DVA"), "dd/MM/yyyy"))
.drop("UHDIN_YYYYMMDD")
.drop("january")
.drop("DVA")
.persist()
val uh_flag_comment = new TransactionType().transform(uh)
uh.unpersist()
val uh_joined = uh_flag_comment.join(broadcast(smallDF), "NO_NUM")
.select(
uh.col("*"),
smallDF.col("PSP"),
smallDF.col("minrel"),
smallDF.col("Label"),
smallDF.col("StartDate"))
.withColumnRenamed("DVA_1", "DVA")
smallDF.unpersist()
val uh_to_be_sorted = uh_joined.repartition(4158, col("PSP"))
val uh_final = uh_to_be_sorted.sortWithinPartitions(col("NO_NUM"), col("UHDIN"), col("HOURMV"))
uh_final
EDITED - Repartition logic
val sqlContext = spark.sqlContext
sqlContext.udf.register("randomUDF", (partitionCount: Int) => {
val r = new scala.util.Random
r.nextInt(partitionCount)
// Also tried with r.nextInt(partitionCount) + col("PSP")
})
val uh_to_be_sorted = uh_joined
  .withColumn("tmp", callUDF("randomUDF", lit(4158)))
  .repartition(4158, col("tmp"))
  .drop(col("tmp"))
val uh_final = uh_to_be_sorted.sortWithinPartitions(col("NO_NUM"), col("UHDIN"), col("HOURMV"))
uh_final
smallDF is a small dataset (535MB) that I broadcast.
TransactionType is a class in which I add a new column of string values to my uh dataframe, based on the values of 3 columns (MMED, DEBCRED, NMTGP), checking those values with regexes.
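For context, a minimal sketch of what such a transformer could look like; the output column name, regex patterns and values below are made up, not the actual TransactionType logic:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, when}

// Hypothetical stand-in for TransactionType: derives one new string column from
// MMED, DEBCRED and NMTGP using regex matches. Patterns are illustrative only.
class TransactionType {
  def transform(df: DataFrame): DataFrame = {
    df.withColumn(
      "FLAG_COMMENT",
      when(col("MMED").rlike("^CARD.*") && col("DEBCRED").rlike("^D$"), lit("card_debit"))
        .when(col("NMTGP").rlike("^(TRF|VIR).*"), lit("transfer"))
        .otherwise(lit("other")))
  }
}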
I previously faced a lot of issues (the job failing) because of shuffle blocks that were not found. I discovered that I was spilling to disk and had a lot of GC memory issues, so I increased "spark.sql.shuffle.partitions" to 4158.
WHY 4158 ?
Partition_count = (stage input data) / (target size of your partition)
so shuffle partition_count = (shuffle stage input data) / 200 MB = 860000 MB / 200 MB = 4300
I have 16*24 - 6 = 378 cores available. So if I want to run every task in one go, I should divide 4300 by 378, which is approximately 11. Then 11*378 = 4158.
Spark Version: 2.1
Cluster configuration:
24 compute nodes (workers)
16 vcores each
90 GB RAM per node
6 cores are already being used by other processes/jobs
Current Spark configuration:
-master: yarn
-executor-memory: 26G
-executor-cores: 5
-driver memory: 70G
-num-executors: 70
-spark.kryoserializer.buffer.max=512
-spark.driver.cores=5
-spark.driver.maxResultSize=500m
-spark.memory.storageFraction=0.4
-spark.memory.fraction=0.9
-spark.hadoop.fs.permissions.umask-mode=007
How is the job executed:
We build an artifact (jar) with IntelliJ and then send it to a server. Then a bash script is executed. This script:
exports some environment variables (SPARK_HOME, HADOOP_CONF_DIR, PATH and SPARK_LOCAL_DIRS)
launches the spark-submit command with all the parameters defined in the Spark configuration above
retrieves the YARN logs of the application
Spark UI screenshots
DAG
#Ali
From the Summary Metrics we can say that your data is skewed (max duration: 49 min and max Shuffle Read Size/Records: 2.5 GB / 23,947,440, whereas on average a task takes about 4-5 minutes and processes less than 200 MB / 1.2 MM rows).
Now that we know the problem is likely skew in a few partitions, I think we can fix it by changing the repartition logic val uh_to_be_sorted = uh_joined.repartition(4158, col("PSP")) and choosing something else (like another column, or adding another column to PSP); a sketch follows below.
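A minimal sketch of that salting idea (spreading each PSP value over several random buckets so one hot PSP no longer lands in a single partition); the salt column name and the bucket count are arbitrary:

import org.apache.spark.sql.functions.{col, floor, rand}

// Add a random bucket per row, repartition on (PSP, bucket), then drop the helper
// column. 20 buckets is an illustrative value; tune it to the observed skew.
val salted = uh_joined.withColumn("psp_salt", floor(rand() * 20))

val uh_to_be_sorted = salted
  .repartition(4158, col("PSP"), col("psp_salt"))
  .drop("psp_salt")

The sortWithinPartitions call on NO_NUM, UHDIN and HOURMV can stay exactly as before, since it only sorts inside each (now more evenly sized) partition.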
A few links to refer to on data skew and how to fix it:
https://dzone.com/articles/optimize-spark-with-distribute-by-cluster-by
https://datarus.wordpress.com/2015/05/04/fighting-the-skew-in-spark/
Hope this helps
I am new to Spark distributed development. I'm attempting to optimize my existing Spark job which takes up to 1 hour to complete.
Infrastructure:
EMR [10 instances of r4.8xlarge (32 cores, 244GB)]
Source Data: 1000 .gz files in S3 (~30MB each)
Spark Execution Parameters [Executors: 300, Executor Memory: 6gb, Cores: 1]
In general, the Spark job performs the following:
private def processLines(lines: RDD[String]): DataFrame = {
  val updatedLines = lines.mapPartitions(row => ...)
  spark.createDataFrame(updatedLines, schema)
}

// Read S3 files, then repartition() and cache()
val lines: RDD[String] = spark.sparkContext
  .textFile(pathToFiles, numFiles)
  .repartition(2 * numFiles) // double the parallelism
  .cache()
val numRawLines = lines.count()

// Custom-process each line and cache the table
val convertedLines: DataFrame = processLines(lines)
convertedLines.createOrReplaceTempView("temp_tbl")
spark.sqlContext.cacheTable("temp_tbl")
val numRows = spark.sql("select count(*) from temp_tbl").collect().head().getLong(0)

// Select a subset of the data
val myDataFrame = spark.sql("select a, b, c from temp_tbl where field = 'xxx' ")

// Define the number of parquet files to write using coalesce
val numParquetFiles = (numRows / 1000000).toInt
val lessParts = myDataFrame.rdd.coalesce(numParquetFiles)
val lessPartsDataFrame = spark.sqlContext.createDataFrame(lessParts, myDataFrame.schema)
lessPartsDataFrame.createOrReplaceTempView("my_view")

// Insert data from the view into the Hive parquet table
spark.sql("insert overwrite table destination_tbl select * from my_view")

lines.unpersist()
The app reads all the S3 files => repartitions to twice the number of files => caches the RDD => custom-processes each line => creates a temp view / caches the table => counts the number of rows => selects a subset of the data => decreases the number of partitions => creates a view of the subset => inserts into the Hive destination table using the view => unpersists the RDD.
I am not sure why it takes so long to execute. Are the Spark execution parameters set incorrectly, or is something being invoked incorrectly here?
Before looking at the metrics, I would try the following change to your code.
private def processLines(lines: DataFrame): DataFrame = {
lines.mapPartitions(row => ...)
}
val convertedLinesDf = spark.read.text(pathToFiles)
  .filter("field = 'xxx'")
  .cache()
val numLines = convertedLinesDf.count() // the dataset is materialized in memory here; this takes time

// Select a subset of the data; it will be fast if you have enough memory
// Just use the DataFrame API
val myDataFrame = convertedLinesDf.transform(processLines).select("a", "b", "c")

// Coalesce here without converting to RDD; experiment to find what works best
myDataFrame.coalesce(<desired_output_files_number>)
  .write.mode(SaveMode.Overwrite)
  .saveAsTable("destination_tbl")
Caching is useless if you don't count the number of rows, and it will take some memory and add GC pressure.
Caching the table may consume more memory and add more GC pressure.
Converting a Dataframe to an RDD is costly because it implies ser/deser operations.
Not sure what you are trying to do with val numParquetFiles = numRows / 1000000 and repartition(2 * numFiles). With your setup, 1000 files of 30 MB each will give you 1000 partitions. It could be fine like this. Calling repartition and coalesce may trigger a shuffle operation, which is costly (coalesce may not trigger a shuffle); see the sketch below.
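A tiny illustration of that last point (the path and the partition counts are arbitrary): repartition always shuffles the data to reach the requested count, while coalesce only merges existing partitions when reducing their number.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("coalesce-vs-repartition").getOrCreate()

// Hypothetical input; suppose it is read into ~1000 partitions.
val df = spark.read.text("/data/input")

val reshuffled = df.repartition(2000) // wide: full shuffle, data is redistributed
val merged     = df.coalesce(10)      // narrow: existing partitions are merged, no shuffle

println(reshuffled.rdd.getNumPartitions) // 2000
println(merged.rdd.getNumPartitions)     // 10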
Tell me if you get any improvements !
I have a folder which has 14 files in it. I run spark-submit with 10 executors on a cluster whose resource manager is YARN.
I create my first RDD as this:
JavaPairRDD<String,String> files = sc.wholeTextFiles(folderPath.toString(), 10);
However, files.getNumPartitions() gives me 7 or 8, randomly. I do not use coalesce/repartition anywhere, and I finish my DAG with 7-8 partitions.
As far as I know, the argument we pass is the "minimum" number of partitions, so why does Spark divide my RDD into 7-8 partitions?
I also ran the same program with 20 partitions, and it gave me 11 partitions.
I have seen a topic here, but it was about getting "more" partitions, which did not help me at all.
Note: In the program, I read another folder which has 10 files, and Spark creates 10 partitions successfully. I run the problematic transformation above after that successful job is finished.
File sizes:
1)25.07 KB
2)46.61 KB
3)126.34 KB
4)158.15 KB
5)169.21 KB
6)16.03 KB
7)67.41 KB
8)60.84 KB
9)70.83 KB
10)87.94 KB
11)99.29 KB
12)120.58 KB
13)170.43 KB
14)183.87 KB
Files are on the HDFS, block sizes are 128MB, replication factor 3.
It would have been clearer if we had the size of each file, but the reasoning based on the code will not be wrong. I am adding this answer based on the Spark code base.
First of all, maxSplitSize is calculated from the total size of the directory and the minimum number of partitions passed to wholeTextFiles:
def setMinPartitions(context: JobContext, minPartitions: Int) {
  val files = listStatus(context).asScala
  val totalLen = files.map(file => if (file.isDirectory) 0L else file.getLen).sum
  val maxSplitSize = Math.ceil(totalLen * 1.0 /
    (if (minPartitions == 0) 1 else minPartitions)).toLong
  super.setMaxSplitSize(maxSplitSize)
}
// file: WholeTextFileInputFormat.scala
link
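As a rough illustration with the file sizes from the question, here is the same arithmetic spelled out; this is only a sketch of the maxSplitSize computation, since the actual grouping of whole files into splits happens in CombineFileInputFormat#getSplits and also takes node and rack locality into account, which is why the final count is neither 14 nor exactly 10:

// Approximate file sizes from the question, in KB.
val fileSizesKb = Seq(25.07, 46.61, 126.34, 158.15, 169.21, 16.03, 67.41,
                      60.84, 70.83, 87.94, 99.29, 120.58, 170.43, 183.87)
val totalLen = (fileSizesKb.sum * 1024).toLong // ~1,436,262 bytes in total
val minPartitions = 10
val maxSplitSize = Math.ceil(totalLen * 1.0 / minPartitions).toLong
println(maxSplitSize) // ~143,627 bytes, i.e. roughly 140 KB per combined split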
Based on this maxSplitSize, splits (partitions in Spark) are then extracted from the source:
inputFormat.setMinPartitions(jobContext, minPartitions)
val rawSplits = inputFormat.getSplits(jobContext).toArray // Here the number of splits is decided
val result = new Array[Partition](rawSplits.size)
for (i <- 0 until rawSplits.size) {
  result(i) = new NewHadoopPartition(id, i, rawSplits(i).asInstanceOf[InputSplit with Writable])
}
// file: WholeTextFileRDD.scala
link
More information is available in the CombineFileInputFormat#getSplits method, which reads the files and prepares the splits.
Note:
I refer to Spark partitions as MapReduce splits here, since Spark borrowed its input and output formats from MapReduce.