Repartitioned data bottlenecks in a few tasks in Spark - apache-spark

I have a simple Spark job which does the following:
val dfIn = spark.read.parquet(PATH_IN)
val dfOut = dfIn.repartition(col1, col2, col3)
dfOut.write.mode(SaveMode.Append).partitionBy(col1, col2, col3).parquet(PATH_OUT)
I noticed a big performance deterioration in this job. Inspecting the Spark UI showed that the write was bottlenecked in a few tasks, which showed huge memory spill and a much bigger output size than the fast partitions.
So I suspected that the issue was caused by data skew and changed the way the data is repartitioned to
import org.apache.spark.sql.functions.rand
val dfOut = dfIn.withColumn("rand", rand()).repartitionByRange(col1, col2, col3, $"rand")
However, this did not resolve the performance issues.
In the Spark UI you can now see that the data is very evenly distributed across ALL partitions (based on output size), but a few tasks are still very long-running.
I have no idea what else could cause this and would be thankful for any ideas.

While this is not a final answer to your issue, this tip might help: you can easily inspect your actual data for possible skew with
for i, part in enumerate(dfIn.rdd.glom().collect()):
    print({i: len(part)})
and then salt as needed. Of course all the data might not fit on the driver, so limit as appropriate to get a proper sample :)
PS: example in Python but you get the idea
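For reference, here is a rough Scala sketch of the same idea that only brings per-partition row counts (and key counts) back to the driver; dfIn is the dataframe from the question, and col1/col2/col3 stand in for the real column names:
import org.apache.spark.sql.functions.desc

// Row count per partition, without collecting the rows themselves
dfIn.rdd
  .mapPartitionsWithIndex((i, rows) => Iterator((i, rows.size)))
  .collect()
  .foreach { case (i, n) => println(s"partition $i: $n rows") }

// Key-level skew: which (col1, col2, col3) combinations dominate
dfIn.groupBy("col1", "col2", "col3").count().orderBy(desc("count")).show(20)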

Related

Repartition followed by coalesce is not honored

I would like to spin up a lot of tasks when doing my calculation but coalesce into a smaller set of partitions when writing to the table.
A simple example for demonstration is given below, where the repartition is NOT honored during execution.
My expected output is that the map operation happens in 100 partitions and the final collect happens in only 10 partitions.
It seems Spark has optimized the execution by ignoring the repartition. It would be helpful if someone could explain how to achieve the expected behavior.
sc.parallelize(range(1,1000)).repartition(100).map(lambda x: x*x).coalesce(10).collect()
Instead of coalesce, using repartition helps to achieve the expected behavior.
sc.parallelize(range(1,1000)).repartition(100).map(lambda x: x*x).cache().repartition(10).collect()
This helps to solve my problem. But I would still appreciate an explanation of this behavior.
"Returns a new Dataset that has exactly numPartitions partitions, when (sic) the fewer partitions are requested. If a larger number of partitions is requested, it will stay at the current number of partitions. "
Source: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset#coalesce(numPartitions:Int):org.apache.spark.sql.Dataset[T]
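If you want to see this for yourself, here is a rough Scala (spark-shell) equivalent of the example; toDebugString shows that coalesce adds no shuffle boundary, while repartition does:
val mapped = sc.parallelize(1 until 1000).repartition(100).map(x => x * x)
// Narrow dependency: the map is folded into the final 10-partition stage
println(mapped.coalesce(10).toDebugString)
// An extra ShuffledRDD appears: the map keeps its 100 partitions, then data is shuffled into 10
println(mapped.repartition(10).toDebugString)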

Spark very slow performance with wide dataset

I have a small parquet file (7.67 MB) in HDFS, compressed with snappy. The file has 1300 rows and 10500 columns, all double values. When I create a data frame from the parquet file and perform a simple operation like count, it takes 18 seconds.
scala> val df = spark.read.format("parquet").load("/path/to/parquet/file")
df: org.apache.spark.sql.DataFrame = [column0_0: double, column1_1: double ... 10498 more fields]
scala> df.registerTempTable("table")
scala> spark.time(sql("select count(1) from table").show)
+--------+
|count(1)|
+--------+
| 1300|
+--------+
Time taken: 18402 ms
Can anything be done to improve performance of wide files?
Count and show are costly operations in Spark: they force the whole lineage to be evaluated and touch every record, so they will always take time on a wide dataset. If you need to persist the result, write it back to a file or a database; if you only want to inspect the structure, df.printSchema() is cheap because it does not materialize any data.
A simple way to check whether a dataframe has any rows is to do a Try(df.head): if it is a Success, there is at least one row in the dataframe; if it is a Failure, the dataframe is empty.
When operating on the data frame, consider selecting only the columns that are of interest to you (e.g. df.select(columns...)) before performing any aggregation; this may trim down the size of your set considerably. Also, if any filtering needs to be done, do that first as well.
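A minimal sketch of that advice, reusing the column names from the schema above (the filter threshold is made up):
import org.apache.spark.sql.functions.{col, count}

// Narrow the dataset before aggregating: project only the needed columns
// and push filters as early as possible.
val slim = df
  .select("column0_0", "column1_1")
  .filter(col("column0_0") > 0.0)
slim.agg(count("*")).show()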
I found this answer, which may be helpful to you:
Spark SQL is not well suited to processing very wide data (more than ~1K columns). If possible, you can use a vector or map column to work around this.
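If you try the vector-column route, here is a hedged sketch with Spark ML's VectorAssembler (the output path is hypothetical); it packs the ~10,500 double columns into a single vector column so the planner only sees one top-level field:
import org.apache.spark.ml.feature.VectorAssembler

val assembler = new VectorAssembler()
  .setInputCols(df.columns)   // all columns are doubles here
  .setOutputCol("features")

// Keep only the packed column and persist it for later reads
assembler.transform(df).select("features").write.mode("overwrite").parquet("/path/to/packed")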

What is the best way to collect the Spark job run statistics and save to database

My Spark program has several table joins (using Spark SQL) and I would like to collect the time taken to process each of those joins and save it to a statistics table. The purpose is to run it continuously over a period of time and gather performance data at a very granular level.
e.g.
val DF1 = spark.sql("select x, y from A, B")
val DF2 = spark.sql("select k, v from TABLE1, TABLE2")
Finally I join DF1 and DF2 and then initiate an action like saveAsTable.
What I am looking for is to figure out:
1. How much time it really took to compute DF1
2. How much time it took to compute DF2
3. How much time it took to persist those final joins to Hive / HDFS
and to put all of this info into a RUN-STATISTICS table / file.
Any help is appreciated and thanks in advance
Spark uses Lazy Evaluation, allowing the engine to optimize RDD transformations at a very granular level.
When you execute
val DF1 = spark.sql("select x, y from A, B")
nothing happens except that the transformation is added to the directed acyclic graph (DAG).
Only when you perform an action, such as DF1.count, is the driver forced to execute a physical plan. Execution is deferred as far down the chain of RDD transformations as possible.
Therefore it is not correct to ask
1. How much time it really took to compute DF1
2. How much time to compute DF2
at least based on the code examples you provided. Your code did not "compute" DF1 on its own. We cannot know how long processing just DF1 took unless you force Spark to materialize each dataframe separately.
A better way to structure the question might be: "How many stages (tasks) is my job divided into overall, and how long does it take to finish those stages (tasks)?"
This can easily be answered by looking at the log files / web UI timeline (which comes in different flavors depending on your setup).
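If you would rather capture these numbers programmatically than read them off the UI, one option (a sketch only; persisting to your statistics table is left out) is to register a SparkListener that records stage durations:
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

spark.sparkContext.addSparkListener(new SparkListener {
  override def onStageCompleted(stage: SparkListenerStageCompleted): Unit = {
    val info = stage.stageInfo
    // Both timestamps are Options; only report fully completed stages
    for (start <- info.submissionTime; end <- info.completionTime) {
      println(s"Stage ${info.stageId} (${info.name}): ${end - start} ms")
    }
  }
})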
3. How much time to persist those final joins to Hive / HDFS
Fair question. Check out Ganglia:
Cluster-wide monitoring tools, such as Ganglia, can provide insight into overall cluster utilization and resource bottlenecks. For instance, a Ganglia dashboard can quickly reveal whether a particular workload is disk bound, network bound, or CPU bound.
Another trick I like to use is defining every sequence of transformations that must end in an action inside a separate function, and then calling that function on the input RDD inside a "timer" block.
For instance, my "timer" is defined as
def time[R](block: => R): R = {
  val t0 = System.nanoTime()
  val result = block
  val t1 = System.nanoTime()
  println("Elapsed time: " + (t1 - t0) / 1e9 + "s")
  result
}
and can be used as
val df1 = Seq((1,"a"),(2,"b")).toDF("id","letter")
scala> time{df1.count}
Elapsed time: 1.306778691s
res1: Long = 2
However, don't call unnecessary actions just to break the DAG into more stages / wide dependencies; this may introduce extra shuffles and slow down your execution.
Resources:
https://spark.apache.org/docs/latest/monitoring.html
http://ganglia.sourceforge.net/
https://www.youtube.com/watch?v=49Hr5xZyTEA

Spark skewing data to few executors

I'm running Spark in standalone mode with 21 executors. When I load my first SQL table using my sqlContext, I partition it on a column of sequential integers so that the data is perfectly distributed among all blocks:
val brDF = sqlContext.load("jdbc", Map("url" -> srcurl, "dbtable" -> "basereading", "partitionColumn" -> "timeperiod", "lowerBound" -> "2", "upperBound" -> "35037", "numPartitions" -> "100"))
Additionally, the blocks are nicely distributed across the nodes, so each node has similar memory usage.
Unfortunately, when I join it with a much smaller table idoM like so:
val mrDF = idoM.as('idom).join(brS1DF.as('br), $"idom.idoid" === $"br.meter")
where idoM is a one-column table, and cache the result, the distribution of RDD blocks across the cluster changes:
[Screenshot: Spark UI executors sorted by number of RDD blocks]
Now there are suddenly many more RDD blocks on my fourth node, and it uses more memory. Checking each RDD, its blocks still seem to be nicely distributed, so my partitioning is fine; it is just that all the blocks seem to end up written on one node, defeating the purpose of having multiple nodes to begin with.
I suspect that my problem is similar to
this question on the Apache mailing list
but there is no answer there, so any ideas would be greatly appreciated.
Not knowing your data, I assume that the distribution of the key you are joining on is the cause of the data skew.
Running idoM.groupBy("idoid").count.orderBy(desc("count")).show or brS1DF.groupBy("meter").count.orderBy(desc("count")).show will probably show you that a few values have a lot of occurrences.
The issue was that idoM was loaded onto one machine and Spark tried to preserve data locality, doing the whole join on that one machine. In this case it was resolved by broadcasting the smaller table to the larger one. I made sure that the keys of idoM were perfectly distributed on the join column, but unfortunately repartitioning does not solve the issue: Spark still tries to keep locality, and the whole dataframe still ends up on one machine.
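For reference, here is a minimal sketch of that broadcast fix, reusing the names from the question (written against the DataFrame API; adjust imports if you are on the older sqlContext-era shell):
import org.apache.spark.sql.functions.{broadcast, col}

// Ship the small one-column table to every executor, so the join runs locally
// against the evenly distributed brS1DF instead of being pulled to one node.
val mrDF = brS1DF.as("br")
  .join(broadcast(idoM.as("idom")), col("idom.idoid") === col("br.meter"))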

What should be the optimal value for spark.sql.shuffle.partitions or how do we increase partitions when using Spark SQL?

I am using Spark SQL (actually hiveContext.sql()) with GROUP BY queries and I am running into OOM issues. So I am thinking of increasing the value of spark.sql.shuffle.partitions from the default 200 to 1000, but it is not helping.
I believe these partitions share the shuffle load, so the more partitions there are, the less data each one has to hold. I am new to Spark. I am using Spark 1.4.0 and I have around 1 TB of uncompressed data to process with hiveContext.sql() group-by queries.
If you're running out of memory on the shuffle, try setting spark.sql.shuffle.partitions to 2001.
Spark uses a different data structure for shuffle book-keeping when the number of partitions is greater than 2000:
private[spark] object MapStatus {
  def apply(loc: BlockManagerId, uncompressedSizes: Array[Long]): MapStatus = {
    if (uncompressedSizes.length > 2000) {
      HighlyCompressedMapStatus(loc, uncompressedSizes)
    } else {
      new CompressedMapStatus(loc, uncompressedSizes)
    }
  }
  ...
I really wish they would let you configure this independently.
By the way, I found this information in a Cloudera slide deck.
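Setting the value is a one-liner; here is a sketch for both API generations (the first form matches the hiveContext used in the question, the second applies to Spark 2.x and later or spark-submit --conf):
// Spark 1.x (hiveContext / sqlContext)
hiveContext.setConf("spark.sql.shuffle.partitions", "2001")

// Spark 2.x and later
spark.conf.set("spark.sql.shuffle.partitions", "2001")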
OK, so I think your issue is more general. It's not specific to Spark SQL; it's a general problem with Spark where it ignores the number of partitions you ask for when there are few input files. Spark seems to use the same number of partitions as the number of files on HDFS unless you call repartition. So calling repartition ought to work, but it has the caveat of causing a shuffle somewhat unnecessarily.
I raised this question a while ago and have still yet to get a good answer :(
Spark: increase number of partitions without causing a shuffle?
It actually depends on your data and your query; if Spark must load 1 TB, there is something wrong with your design.
Use the superb web UI to see the DAG, i.e. how Spark translates your SQL query into jobs/stages and tasks.
Useful metrics are "Input" and "Shuffle".
Partition your data (Hive / directory layout like /year=X/month=X); a minimal write sketch follows this list
Use Spark's CLUSTER BY feature to work per data partition
Use the ORC / Parquet file formats, because their push-down filters mean useless data is not loaded into Spark
Analyze the Spark History Server to see how Spark is reading your data
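As promised in the partitioning item above, here is a minimal sketch of the directory-layout idea (the dataframe, column names, and path are illustrative, not from the question):
// Writing partitioned by year/month lets queries that filter on those
// columns prune whole directories instead of scanning everything.
df.write
  .partitionBy("year", "month")
  .parquet("/warehouse/mytable")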
Also, could the OOM be happening on your driver?
That is a separate issue: at the end, the driver collects the data you asked for. If you ask for too much data, the driver will OOM; try limiting your query, or write the result to another table (Spark syntax CREATE TABLE ... AS).
I came across this post from Cloudera about Hive partitioning. Check out the "Pointers" section, which talks about the number of partitions and the number of files in each partition; too many can overload the NameNode, which might cause OOM.
