How to effectively join large tables in SparkSql? - apache-spark

I am trying to improve performance on a join involving two large tables using SparkSql. From various sources, I figured that the RDDs need to be partitioned.
Source: https://deepsense.io/optimize-spark-with-distribute-by-and-cluster-by
However, when a DataFrame is loaded directly from a Parquet file as shown below, I am not sure how it can be turned into a pair RDD.
With Spark 2.0.1, using “cluster by” has no effect.
val rawDf1 = spark.read.parquet("file in hdfs")
rawDf1.createOrReplaceTempView("rawdf1")
val rawDf2 = spark.read.parquet("file in hdfs")
rawDf2.createOrReplaceTempView("rawdf2")
val rawDf3 = spark.read.parquet("file in hdfs")
rawDf3.createOrReplaceTempView("rawdf3")
val df1 = spark.sql("select * from rawdf1 cluster by key")
df1.createOrReplaceTempView("df1")
val df2 = spark.sql("select * from rawdf2 cluster by key")
df2.createOrReplaceTempView("df2")
val df3 = spark.sql("select * from rawdf3 cluster by key")
df3.createOrReplaceTempView("df3")
val resultDf = spark.sql("select * from df1 a inner join df2 b on a.key = b.key inner join df3 c on a.key = c.key")
Whether I use "cluster by" key or not, I still see the same query plan being generated by Spark. How can I create a pair RDD in Spark SQL so that the joins can use partitioned tables?
Without proper partitioning, a lot of shuffles are happening, resulting in long delays.
Our configuration (5 worker nodes, each with 32 cores and 128 GB of RAM, running 1 executor with 5 cores per node):
spark.cores.max 25
spark.default.parallelism 75
spark.driver.extraJavaOptions -XX:+UseG1GC
spark.executor.memory 60G
spark.rdd.compress True
spark.driver.maxResultSize 4g
spark.driver.memory 8g
spark.executor.cores 5
spark.executor.extraJavaOptions -Djdk.nio.maxCachedBufferSize=262144
spark.memory.storageFraction 0.2
To add more info: I am joining more than one table in the same select, using the same key across all tables, so it is not possible to create a DataFrame first and call repartition on it. I understand I can do this using the DataFrame API, but my question is how to accomplish this using plain Spark SQL.
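One way to see whether "cluster by" changes anything is to print the physical plans of the same join with and without it. The following is a minimal sketch under stated assumptions, not a definitive answer: the object name, the sample rows and the local master are hypothetical, and broadcast joins are switched off only so that the tiny sample tables are planned like large ones. No expected plan is shown because the output depends on the Spark version.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ClusterByPlanCheck {
  // Registers two tiny stand-ins for the Parquet-backed views (sample data is an assumption).
  def registerSampleViews(spark: SparkSession): Unit = {
    import spark.implicits._
    Seq((1, "a1"), (2, "a2")).toDF("key", "v1").createOrReplaceTempView("rawdf1")
    Seq((1, "b1"), (2, "b2")).toDF("key", "v2").createOrReplaceTempView("rawdf2")
  }

  // Join written directly against the raw views.
  def plainJoin(spark: SparkSession): DataFrame =
    spark.sql("select a.key, a.v1, b.v2 from rawdf1 a inner join rawdf2 b on a.key = b.key")

  // Same join, with "cluster by key" applied in inline subqueries (plain SQL only).
  def clusteredJoin(spark: SparkSession): DataFrame =
    spark.sql(
      """select a.key, a.v1, b.v2
        |from (select * from rawdf1 cluster by key) a
        |inner join (select * from rawdf2 cluster by key) b
        |on a.key = b.key""".stripMargin)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[2]") // assumption: local run for illustration
      .appName("cluster-by-plan-check")
      .getOrCreate()
    // Disable broadcast joins so the small sample tables are not broadcast.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

    registerSampleViews(spark)
    plainJoin(spark).explain()     // physical plan without cluster by
    clusteredJoin(spark).explain() // physical plan with cluster by
    spark.stop()
  }
}
```

Comparing the Exchange operators in the two printed plans shows where shuffles are planned in each variant.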

Related

In Spark, caching a DataFrame influences execution time of previous stages?

I am running a Spark (2.0.1) job with multiple stages. I noticed that when I insert a cache() in one of later stages it changes the execution time of earlier stages. Why? I've never encountered such a case in literature when reading about caching().
Here is my DAG with cache():
And here is my DAG without cache(). All remaining code is the same.
I have a cache() after a sort merge join in Stage10. If the cache() is used in Stage10, then Stage8 takes nearly twice as long (20 min vs 11 min) as when there is no cache() in Stage10. Why?
My Stage8 contains two broadcast joins with small DataFrames and a shuffle on a large DataFrame in preparation for the merge join. Stages 8 and 9 are independent and operate on two different DataFrames.
Let me know if you need more details to answer this question.
UPDATE 8/2/2018
Here are the details of my Spark script:
I am running my job on a cluster via spark-submit. Here is my spark session.
val spark = SparkSession.builder
.appName("myJob")
.config("spark.executor.cores", 5)
.config("spark.driver.memory", "300g")
.config("spark.executor.memory", "15g")
.getOrCreate()
This creates a job with 21 executors with 5 cores each.
Load 4 DataFrames from parquet files:
val dfT = spark.read.format("parquet").load(filePath1) // 3 Tb in 3185 partitions
val dfO = spark.read.format("parquet").load(filePath2) // ~ 700 Mb
val dfF = spark.read.format("parquet").load(filePath3) // ~ 800 Mb
val dfP = spark.read.format("parquet").load(filePath4) // 38 Gb
Preprocessing of each DataFrame consists of column selection, dropDuplicates and possibly a filter, like this:
val dfT1 = dfT.filter(...)
val dfO1 = dfO.select(columnsToSelect2).dropDuplicates(Array("someColumn2"))
val dfF1 = dfF.select(columnsToSelect3).dropDuplicates(Array("someColumn3"))
val dfP1 = dfP.select(columnsToSelect4).dropDuplicates(Array("someColumn4"))
Then I left-outer join the first three DataFrames, broadcasting the two smaller ones:
val dfTO = dfT1.join(broadcast(dfO1), Seq("someColumn5"), "left_outer")
val dfTOF = dfTO.join(broadcast(dfF1), Seq("someColumn6"), "left_outer")
Since dfP1 is large, joining it requires a merge join, which I can't afford to do at this point. I need to limit the size of dfTOF first. To do that, I add a new timestamp column using withColumn with a UDF that transforms a string into a timestamp:
val dfTOF1 = dfTOF.withColumn("TransactionTimestamp", myStringToTimestampUDF)
Next I filter on the new timestamp column:
val dfTrain = dfTOF1.filter(dfTOF1("TransactionTimestamp").between("2016-01-01 00:00:00+000", "2016-05-30 00:00:00+000"))
Now I am joining the last DataFrame:
val dfTrain2 = dfTrain.join(dfP1, Seq("someColumn7"), "left_outer")
And lastly the column selection with a cache() that is puzzling me.
val dfTrain3 = dfTrain.select("columnsToSelect5").cache()
dfTrain3.agg(sum(col("someColumn7"))).show()
It looks like the cache() is useless here but there will be some further processing and modelling of the DataFrame and the cache() will be necessary.
Should I give more details? Would you like to see execution plan for dfTrain3?

How does HashPartitioner distribute data in Spark? [duplicate]

When I execute the command below:
scala> val rdd = sc.parallelize(List((1,2),(3,4),(3,6)),4).partitionBy(new HashPartitioner(10)).persist()
rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[10] at partitionBy at <console>:22
scala> rdd.partitions.size
res9: Int = 10
scala> rdd.partitioner.isDefined
res10: Boolean = true
scala> rdd.partitioner.get
res11: org.apache.spark.Partitioner = org.apache.spark.HashPartitioner#a
It says that there are 10 partitions and that partitioning is done using HashPartitioner. But when I execute the command below:
scala> val rdd = sc.parallelize(List((1,2),(3,4),(3,6)),4)
...
scala> rdd.partitions.size
res6: Int = 4
scala> rdd.partitioner.isDefined
res8: Boolean = false
It says that there are 4 partitions and that no partitioner is defined. So what is the default partitioning scheme in Spark? How is the data partitioned in the second case?
You have to distinguish between two different things:
Partitioning as distributing data between partitions depending on the value of the key, which is limited to PairwiseRDDs (RDD[(T, U)]). This creates a relationship between a partition and the set of keys that can be found on that partition.
Partitioning as splitting input into multiple partitions, where data is simply divided into chunks of consecutive records to enable distributed computation. The exact logic depends on the specific source, but it is based on either the number of records or the size of a chunk.
In the case of parallelize, data is evenly distributed between partitions using indices. In the case of HadoopInputFormats (like textFile), it depends on properties like mapreduce.input.fileinputformat.split.minsize / mapreduce.input.fileinputformat.split.maxsize.
So the default partitioning scheme is simply none, because partitioning is not applicable to all RDDs. For operations that require partitioning on a PairwiseRDD (aggregateByKey, reduceByKey, etc.), the default method is hash partitioning.
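The two meanings of partitioning can be illustrated with a small plain-Scala sketch that needs no Spark dependency. It is a simplified re-implementation written for illustration, not Spark's own code: hashPartition follows the rule a hash partitioner applies (the key's hashCode modulo the partition count, kept non-negative), and slicePositions mimics index-based slicing of a local collection. Both function names are hypothetical.

```scala
object PartitioningSketch {
  // Key-based partitioning: the target partition depends only on the key.
  def hashPartition(key: Any, numPartitions: Int): Int = {
    if (key == null) 0
    else {
      val rawMod = key.hashCode % numPartitions
      // Scala's % can return a negative value, so shift it into range.
      if (rawMod < 0) rawMod + numPartitions else rawMod
    }
  }

  // Index-based slicing: slice i covers the half-open index range [start, end).
  def slicePositions(length: Int, numSlices: Int): Seq[(Int, Int)] =
    (0 until numSlices).map { i =>
      val start = ((i.toLong * length) / numSlices).toInt
      val end = (((i + 1).toLong * length) / numSlices).toInt
      (start, end)
    }

  def main(args: Array[String]): Unit = {
    val data = List((1, 2), (3, 4), (3, 6))

    // With a partitioner, both records with key 3 land in the same partition.
    data.foreach { case (k, _) =>
      println(s"key $k -> partition ${hashPartition(k, 10)}")
    }

    // Without a partitioner, records are split by position only.
    slicePositions(data.length, 4).zipWithIndex.foreach { case ((start, end), i) =>
      println(s"slice $i holds ${data.slice(start, end)}")
    }
  }
}
```

In the first case the location of a key is predictable from the key itself, which is what makes a partitioner useful for joins and aggregations; in the second case the position in the input is the only thing that matters.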

Apache Spark or Spark-Cassandra-Connector doesn't look like it is reading multiple partitions in parallel?

Apache Spark or Spark-Cassandra-Connector doesn't look like it is reading multiple partitions in parallel.
Here is my code using spark-shell:
import org.apache.spark.sql._
import org.apache.spark.sql.types.StringType
spark.sql("""CREATE TEMPORARY VIEW hello USING org.apache.spark.sql.cassandra OPTIONS (table "hello", keyspace "db", cluster "Test Cluster", pushdown "true")""")
val df = spark.sql("SELECT test from hello")
val df2 = df.select(df("test").cast(StringType).as("test"))
val rdd = df2.rdd.map { case Row(j: String) => j }
val df4 = spark.read.json(rdd) // This line takes forever
I have about 700 million rows, each about 1 KB, and the line val df4 = spark.read.json(rdd) takes forever. I get the following output:
[Stage 1:==========> (4866 + 24) / 25256]
At this rate it will probably take roughly 3 hours.
I measured the network throughput rate of spark worker nodes using iftop and it is about 75MB/s (Megabytes per second) which is pretty good but I am not sure if it is reading partitions in parallel. Any ideas on how to make it faster?
Here is my DAG.

How to remove empty partition in a dataframe?

I need to remove the empty partitions from a DataFrame.
We have two DataFrames, both created using sqlContext. The DataFrames are constructed and combined as below:
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
// Loading DataFrame 1
val csv1 = "s3n://xxxxx:xxxxxx#xxxx/xxx.csv"
val csv1DF = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load(csv1)
// Loading DataFrame 2
val csv2 = "s3n://xxxxx:xxxxxx#xxxx/xxx.csv"
val csv2DF = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load(csv2)
// Combining the DataFrames with a join on column_1 = column_2
val combinedDF = csv1DF.join(csv2DF, csv1DF("column_1") === csv2DF("column_2"))
Now the number of partitions for combinedDF is 200.
This is the default number of partitions produced by the shuffle when a join is used (the spark.sql.shuffle.partitions setting).
In some cases the DataFrame/CSV is not big, so many of these partitions are empty, which causes issues in a later part of the code.
So how can I remove these empty partitions?
The repartition method can be used to create an RDD without any empty partitions.
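As a rough way to check this, the sketch below counts the rows in each partition before and after calling repartition. It is written under stated assumptions: the object and helper names are hypothetical, the data comes from spark.range rather than the CSV files, and the job runs with a local master. No expected output is shown, because with very few rows an even spread is not strictly guaranteed.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object EmptyPartitionCheck {
  // Row count of every partition, in partition order.
  def partitionSizes(df: DataFrame): Array[Int] =
    df.rdd.mapPartitions(rows => Iterator(rows.size)).collect()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[2]") // assumption: local run for illustration
      .appName("empty-partition-check")
      .getOrCreate()

    // 10 rows hash-partitioned into 200 partitions: at least 190 partitions are empty.
    val sparse = spark.range(10).toDF("id").repartition(200, col("id"))
    println(s"before: ${partitionSizes(sparse).count(_ == 0)} empty of ${sparse.rdd.getNumPartitions}")

    // Redistribute the same rows into 4 partitions.
    val compact = sparse.repartition(4)
    println(s"after: ${partitionSizes(compact).count(_ == 0)} empty of ${compact.rdd.getNumPartitions}")

    spark.stop()
  }
}
```

The same helper can be applied to combinedDF to see how many of its 200 partitions actually hold rows.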
The optimal number of partitions depends on the cluster. Here is a good rule of thumb for estimating it:
number_of_partitions = number_of_cores * 4
If you have a cluster of 8 r3.xlarge AWS nodes, you should use 128 partitions (8 nodes * 4 CPUs per node * 4 partitions per CPU).
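The rule of thumb can be written as a small helper. This is only a sketch of the arithmetic above; the object name, function name and parameter names are hypothetical, and the factor of 4 partitions per core is the value suggested in the answer rather than a fixed constant.

```scala
object PartitionEstimate {
  // number_of_partitions = number_of_cores * partitionsPerCore,
  // where number_of_cores = nodes * coresPerNode.
  def recommendedPartitions(nodes: Int, coresPerNode: Int, partitionsPerCore: Int = 4): Int =
    nodes * coresPerNode * partitionsPerCore

  def main(args: Array[String]): Unit = {
    // 8 nodes with 4 CPUs each, as in the example above.
    println(recommendedPartitions(nodes = 8, coresPerNode = 4)) // prints 128
  }
}
```

The resulting number would then be passed to repartition, for example combinedDF.repartition(128).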
