How to remove empty partitions from a DataFrame? - apache-spark

I need to remove the empty partitions from a DataFrame.
We have two DataFrames, both created using sqlContext, which are loaded and combined as below:
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Loading DataFrame 1
val csv1 = "s3n://xxxxx:xxxxxx@xxxx/xxx.csv"
val csv1DF = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load(csv1)

// Loading DataFrame 2
val csv2 = "s3n://xxxxx:xxxxxx@xxxx/xxx.csv"
val csv2DF = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load(csv2)

// Combining the DataFrames
val combinedDF = csv1DF.join(csv2DF, csv1DF("column_1") === csv2DF("column_2"))
The number of partitions for combinedDF is now 200.
From here it is found that the default number of partitions is 200 when joins are used.
In some cases the DataFrame/CSV is not big, so we end up with many empty partitions, which causes issues later in the code.
So how can I remove these empty partitions?

The repartition method can be used to create a DataFrame (or RDD) without any empty partitions.
This thread discusses the optimal number of partitions for a given cluster. Here is a good rule of thumb for estimating the optimal number of partitions:
number_of_partitions = number_of_cores * 4
If you have a cluster of 8 r3.xlarge AWS nodes, you should use 128 partitions (8 nodes * 4 CPUs per node * 4 partitions per CPU).
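As a minimal sketch (assuming the combinedDF from the question and the 8-node cluster above):
// repartition() performs a full shuffle and spreads rows evenly over exactly the
// requested number of partitions, so none stay empty as long as there are at least
// that many rows. coalesce(n) would merely merge the 200 post-join partitions
// without a full shuffle, at the cost of a less even spread.
val repartitionedDF = combinedDF.repartition(128)   // 8 nodes * 4 CPUs * 4 partitions per CPU

// Optional check: number of rows in each partition
repartitionedDF.rdd
  .mapPartitions(it => Iterator(it.size))
  .collect()
  .foreach(println)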

Related

In Spark, does caching a DataFrame influence the execution time of previous stages?

I am running a Spark (2.0.1) job with multiple stages. I noticed that when I insert a cache() in one of the later stages it changes the execution time of earlier stages. Why? I've never encountered such a case in the literature on caching.
Here is my DAG with cache():
And here is my DAG without cache(). All remaining code is the same.
I have a cache() after a sort merge join in Stage 10. If the cache() is used in Stage 10 then Stage 8 takes nearly twice as long (20 min vs 11 min) as it does without the cache() in Stage 10. Why?
Stage 8 contains two broadcast joins with small DataFrames and a shuffle on a large DataFrame in preparation for the merge join. Stages 8 and 9 are independent and operate on two different DataFrames.
Let me know if you need more details to answer this question.
UPDATE 8/2/2018
Here are the details of my Spark script:
I am running my job on a cluster via spark-submit. Here is my spark session.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("myJob")
  .config("spark.executor.cores", 5)
  .config("spark.driver.memory", "300g")
  .config("spark.executor.memory", "15g")
  .getOrCreate()
This creates a job with 21 executors, each with 5 CPU cores.
Load 4 DataFrames from parquet files:
val dfT = spark.read.format("parquet").load(filePath1) // 3 Tb in 3185 partitions
val dfO = spark.read.format("parquet").load(filePath2) // ~ 700 Mb
val dfF = spark.read.format("parquet").load(filePath3) // ~ 800 Mb
val dfP = spark.read.format("parquet").load(filePath4) // 38 Gb
Preprocessing on each of the DataFrames consists of column selection, dropDuplicates, and possibly a filter, like this:
val dfT1 = dfT.filter(...)
val dfO1 = dfO.select(columnsToSelect2).dropDuplicates(Array("someColumn2"))
val dfF1 = dfF.select(columnsToSelect3).dropDuplicates(Array("someColumn3"))
val dfP1 = dfP.select(columnsToSelect4).dropDuplicates(Array("someColumn4"))
Then I left-broadcast-join the first three DataFrames together:
val dfTO = dfT1.join(broadcast(dfO1), Seq("someColumn5"), "left_outer")
val dfTOF = dfTO.join(broadcast(dfF1), Seq("someColumn6"), "left_outer")
Since dfP1 is large it needs a merge join, which I can't afford to do yet. I need to limit the size of dfTOF first. To do that I add a new timestamp column via withColumn with a UDF that transforms a string into a timestamp:
val dfTOF1 = dfTOF.withColumn("TransactionTimestamp", myStringToTimestampUDF)
Next I filter on the new timestamp column:
val dfTrain = dfTOF1.filter(dfTOF1("TransactionTimestamp").between("2016-01-01 00:00:00+000", "2016-05-30 00:00:00+000"))
Now I am joining the last DataFrame:
val dfTrain2 = dfTrain.join(dfP1, Seq("someColumn7"), "left_outer")
And lastly, the column selection with the cache() that is puzzling me:
val dfTrain3 = dfTrain.select(columnsToSelect5).cache()
dfTrain3.agg(sum(col("someColumn7"))).show()
It looks like the cache() is useless here, but there will be some further processing and modelling of the DataFrame, so the cache() will be necessary.
Should I give more details? Would you like to see the execution plan for dfTrain3?
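For reference, the plan can be printed like this (a sketch, assuming the dfTrain3 defined above):
// Print the full query plan; once cache() has been called, the physical plan
// should show an InMemoryRelation / InMemoryTableScan node for dfTrain3.
dfTrain3.explain(true)   // parsed, analyzed, optimized and physical plans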

How to distribute dataset evenly to avoid a skewed join (and long-running tasks)?

I am writing an application using the Spark Dataset API in a Databricks notebook.
I have 2 tables. One has 1.5 billion rows and the second 2.5 million. Both tables contain telecommunication data, and the join is done using the country code and the first 5 digits of a number. The output has 55 billion rows. The problem is that I have skewed data (long-running tasks). No matter how I repartition the dataset I get long-running tasks because of the uneven distribution of hashed keys.
I tried using broadcast joins, tried persisting the big table's partitions in memory, etc.
What are my options here?
Spark will repartition the data based on the join key, so repartitioning before the join won't change the skew (it only adds an unnecessary shuffle).
If you know the key value that is causing the skew (usually it will be something like null or 0 or ""), split your data into 2 parts: one dataset with the skew key, and another with the rest,
then do the join on the sub-datasets separately and union the results.
For example:
// Hypothetical illustration: "key" stands in for your join key column, and the
// skew is assumed to come from a single hot value.
val df1 = ...
val df2 = ...
val skewKey = ""   // the hot value; if it is an actual null, use $"key".isNull / isNotNull instead of === / =!=
val df1Skew = df1.where($"key" === skewKey)
val df2Skew = df2.where($"key" === skewKey)
val df1NonSkew = df1.where($"key" =!= skewKey)
val df2NonSkew = df2.where($"key" =!= skewKey)
val dfSkew = df1Skew.join(df2Skew) // effectively a cross join: both sides contain only the skew key
val dfNonSkew = df1NonSkew.join(df2NonSkew, "key")
val res = dfSkew.union(dfNonSkew)
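If the hot key is not known up front, a quick way to find it is to count rows per key on the large table. A sketch, assuming df1 is the large table and "key" is the join column:
import org.apache.spark.sql.functions._

// The top rows reveal which key values dominate the distribution.
df1.groupBy(col("key"))
  .count()
  .orderBy(desc("count"))
  .show(10, false)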

How does HashPartitioner distribute data in Spark? [duplicate]

When I execute the command below:
scala> val rdd = sc.parallelize(List((1,2),(3,4),(3,6)),4).partitionBy(new HashPartitioner(10)).persist()
rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[10] at partitionBy at <console>:22
scala> rdd.partitions.size
res9: Int = 10
scala> rdd.partitioner.isDefined
res10: Boolean = true
scala> rdd.partitioner.get
res11: org.apache.spark.Partitioner = org.apache.spark.HashPartitioner@a
It says that there are 10 partitions and partitioning is done using HashPartitioner. But when I execute the command below:
scala> val rdd = sc.parallelize(List((1,2),(3,4),(3,6)),4)
...
scala> rdd.partitions.size
res6: Int = 4
scala> rdd.partitioner.isDefined
res8: Boolean = false
It says that there are 4 partitions and the partitioner is not defined. So, what is the default partitioning scheme in Spark? How is the data partitioned in the second case?
You have to distinguish between two different things:
partitioning as distributing data between partitions depending on the value of the key; this is limited to PairwiseRDDs (RDD[(T, U)]). It creates a relationship between a partition and the set of keys that can be found on that partition.
partitioning as splitting the input into multiple partitions, where data is simply divided into chunks of consecutive records to enable distributed computation. The exact logic depends on the specific source, but it is based on either the number of records or the size of a chunk.
In the case of parallelize, data is evenly distributed between partitions using indices. In the case of HadoopInputFormats (like textFile) it depends on properties like mapreduce.input.fileinputformat.split.minsize / mapreduce.input.fileinputformat.split.maxsize.
So the default partitioning scheme is simply none, because partitioning is not applicable to all RDDs. For operations which require partitioning on a PairwiseRDD (aggregateByKey, reduceByKey, etc.) the default method is hash partitioning.
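As an illustration (a sketch, not part of the original answer): a key-based operation on a pair RDD picks up a HashPartitioner by default, which assigns each key to the partition given by a non-negative modulo of its hashCode.
// reduceByKey triggers a shuffle and, absent an existing partitioner, uses a
// HashPartitioner sized from spark.default.parallelism (or the parent's partition
// count), so the result does have a partitioner defined.
val pairs = sc.parallelize(List((1, 2), (3, 4), (3, 6)), 4)
println(pairs.partitioner)        // None - plain parallelize defines no partitioner

val reduced = pairs.reduceByKey(_ + _)
println(reduced.partitioner)      // Some(org.apache.spark.HashPartitioner@...)

// The partition chosen for a key k with n partitions (same logic HashPartitioner uses):
def partitionFor(k: Any, n: Int): Int = {
  val mod = k.hashCode % n
  if (mod < 0) mod + n else mod   // non-negative modulo
}
println(partitionFor(3, reduced.partitions.length))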

How to effectively join large tables in SparkSql?

I am trying to improve performance on a join involving two large tables using SparkSql. From various sources, I figured that the RDDs need to be partitioned.
Source: https://deepsense.io/optimize-spark-with-distribute-by-and-cluster-by
However, when loading a DataFrame directly from a parquet file as shown below, I am not sure how it can be created as a paired RDD!
With Spark 2.0.1, using "cluster by" has no effect.
val rawDf1 = spark.read.parquet("file in hdfs")
rawDf1.createOrReplaceTempView("rawdf1")

val rawDf2 = spark.read.parquet("file in hdfs")
rawDf2.createOrReplaceTempView("rawdf2")

val rawDf3 = spark.read.parquet("file in hdfs")
rawDf3.createOrReplaceTempView("rawdf3")

val df1 = spark.sql("select * from rawdf1 cluster by key")
df1.createOrReplaceTempView("df1")

val df2 = spark.sql("select * from rawdf2 cluster by key")
df2.createOrReplaceTempView("df2")

val df3 = spark.sql("select * from rawdf3 cluster by key")
df3.createOrReplaceTempView("df3")

val resultDf = spark.sql("select * from df1 a inner join df2 b on a.key = b.key inner join df3 c on a.key = c.key")
Whether I use "cluster by" on the key or not, I still see the same query plan being generated by Spark. How can I create a paired RDD in Spark SQL so that the joins can use partitioned tables?
Without proper partitioning, a lot of shuffles happen, resulting in long delays.
Our configuration (5 worker nodes, each with 32 cores and 128 GB of RAM, running 1 executor with 5 cores):
spark.cores.max 25
spark.default.parallelism 75
spark.driver.extraJavaOptions -XX:+UseG1GC
spark.executor.memory 60G
spark.rdd.compress True
spark.driver.maxResultSize 4g
spark.driver.memory 8g
spark.executor.cores 5
spark.executor.extraJavaOptions -Djdk.nio.maxCachedBufferSize=262144
spark.memory.storageFraction 0.2
To add more info: I am joining more than one table in the same select using the same key across all tables, so it is not possible to create a DataFrame first and call repartition on it. I understand I can do this using the DataFrame API, but my question is how to accomplish this using plain Spark SQL.
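For reference, a sketch of the DataFrame-API route mentioned above (using the rawDf1/rawDf2/rawDf3 and key names from the question; an illustration only, not the plain-SQL answer being asked for):
import org.apache.spark.sql.functions.col

// Repartition each table by the join key before joining so matching keys are
// co-located. Whether the planner reuses these shuffles for the joins depends on
// the Spark version and on spark.sql.shuffle.partitions lining up.
val keyedDf1 = rawDf1.repartition(col("key"))
val keyedDf2 = rawDf2.repartition(col("key"))
val keyedDf3 = rawDf3.repartition(col("key"))

val resultDf = keyedDf1.join(keyedDf2, Seq("key")).join(keyedDf3, Seq("key"))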

