How to avoid RDD regeneration in Spark SQL? - apache-spark

When running TPCH Q1 in Spark 3.2.1 via the SQL syntax, Spark generates 2 jobs and 5 stages.
I am looking for explanations of the following:
why a single Spark SQL query is split into multiple jobs
why some RDDs (the blue rectangles in the DAGs) are regenerated; in this example, the first two RDDs in stage 4 are regenerated
whether we can avoid the RDD regeneration, and if so, how (a caching sketch follows the code below)
Thanks in advance!
Here is the code for running TPCH Q1.
val spark = SparkSession
  .builder()
  .enableHiveSupport()
  .getOrCreate()
spark.sql("use TPCH_100")
val queryContent = """
select
l_returnflag,
l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count(*) as count_order
from
lineitem
where
l_shipdate <= date '1998-12-01' - interval '68' day
group by
l_returnflag,
l_linestatus
order by
l_returnflag,
l_linestatus
"""
spark.sql(queryContent).collect()
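If the intention is to reuse the result of this query across several actions, one option (a minimal sketch; it does not change how Spark splits a single query into jobs) is to persist the DataFrame so that follow-up actions read the cached partitions instead of regenerating the upstream RDDs:
// Sketch: persist the query result so any further action reuses the
// cached partitions instead of re-running the scan and partial aggregation.
val q1 = spark.sql(queryContent)
q1.persist()            // materialized by the first action
val rows = q1.collect() // later actions read the cached blocks
q1.unpersist()          // release the cached blocks when finished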

Related

Hive hql to Spark sql conversion

I have a requirement to convert HQL to Spark SQL. I am using the approach below, but I am not seeing much change in performance. If anybody has a better suggestion, please let me know.
Hive:
create table temp1 as select * from Table1 T1 join (select id , min(activity_date) as dt from Table1 group by id) T2 on T1.id=T2.id and T1.activity_date=T2.dt ;
create table temp2 as select * from temp1 join diff_table
I have around 70 such intermediate Hive temp tables, and the source Table1 holds around 1.8 billion rows, with no partitioning and 200 HDFS files.
Spark code (running with 20 executors, 5 executor cores, 10G executor memory, yarn-client mode, 4G driver memory):
import org.apache.spark.sql.{Row,SaveMode,SparkSession}
val spark=SparkSession.builder().appName("test").config("spark.sql.warehouse.dir","/usr/hive/warehouse").enableHiveSupport().getOrCreate()
import spark.implicits._
import spark.sql
val id_df=sql("select id , min(activity_date) as dt from Table1 group by id")
val all_id_df=sql("select * from Table1")
id_df.createOrReplaceTempView("min_id_table")
all_id_df.createOrReplaceTempView("all_id_table")
val temp1_df=sql("select * from all_id_table T1 join min_id_table T2 on T1.id=T2.id and T1.activity_date=T2.dt")
temp1_df.createOrReplaceTempView("temp2")
sql("create or replace table temp as select * from temp2")

Spark job collapses into a single partition but I do not understand why

I am trying to tune a Spark job.
I am using Databricks to run it, and at some point I see this picture:
Notice that in stage 12 I have only one partition, meaning there is no parallelism. How can I deduce the cause of this? To be sure, I do not have any 'repartition(1)' in my code.
Adding the (slightly obfuscated) code:
spark.read(cid, location).createOrReplaceTempView("some_parquets")
parquets = spark.profile_paqrquet_df(cid)
parquets.where("year = 2018 and month = 5 and day = 18 and sm_device_source = 'js'"
.createOrReplaceTempView("parquets")
# join between two dataframes.
spark.sql(
"""
SELECT {fields}
FROM some_parquets
WHERE some_parquets.a = 'js'
AND some_parquets.b = 'normal'
AND date_f >= to_date('2018-05-01')
AND date_f < to_date('2018-05-05')
limit {limit}
""".format(limit=1000000, fields=",".join(fields))
).createOrReplaceTempView("some_parquets")
join_result = spark.sql(
"""
SELECT
parquets.some_field,
struct(some_parquets.*) as some_parquets
FROM some_parquets
LEFT ANTI JOIN some_ids ON some_parquets.sid = some_ids.sid
LEFT OUTER JOIN parquets ON some_parquets.uid = parquets.uid
""".format(some_ids=some_ids)
)
# turn items in each partition into vectors for machine learning
vectors = join_result \
.rdd \
.mapPartitions(extract)
# write vectors to file system. This evaluates the results
dump_vectors(vectors, output_folder)
Session construction:
spark = SparkSession \
.builder \
.appName("...") \
.config("spark.sql.shuffle.partitions", 1000) \
.getOrCreate()
If anybody is still interested in the answer: in short, it happens because of the limit clause. Strangely, the limit clause collapses the data into a single partition after the shuffle stage.
Just a sample run in my local spark-shell:
scala> spark.sql("Select * from temp limit 1").rdd.partitions.size
res28: Int = 1
scala> spark.sql("Select * from temp").rdd.partitions.size
res29: Int = 16
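If the single post-limit partition is hurting downstream parallelism, one workaround (a sketch, using the temp table and limit from the example above) is to repartition right after the limited query:
// Sketch: LIMIT leaves a single partition; repartition afterwards to
// spread the (now much smaller) result back across the cluster.
val limited = spark.sql("select * from temp limit 1000000")
val spread = limited.repartition(200)
spread.rdd.partitions.size // 200 instead of 1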

In Spark, does caching a DataFrame influence the execution time of previous stages?

I am running a Spark (2.0.1) job with multiple stages. I noticed that when I insert a cache() in one of the later stages, it changes the execution time of earlier stages. Why? I've never encountered such a case in the literature on caching.
Here is my DAG with cache():
And here is my DAG without cache(). All remaining code is the same.
I have a cache() after a sort merge join in Stage 10. If the cache() is used in Stage 10, then Stage 8 takes nearly twice as long (20 min vs 11 min) as when there is no cache() in Stage 10. Why?
My Stage 8 contains two broadcast joins with small DataFrames and a shuffle on a large DataFrame in preparation for the merge join. Stages 8 and 9 are independent and operate on two different DataFrames.
Let me know if you need more details to answer this question.
UPDATE 8/2/2018
Here are the details of my Spark script:
I am running my job on a cluster via spark-submit. Here is my spark session.
val spark = SparkSession.builder
.appName("myJob")
.config("spark.executor.cores", 5)
.config("spark.driver.memory", "300g")
.config("spark.executor.memory", "15g")
.getOrCreate()
This creates a job with 21 executors with 5 CPU cores each.
Load 4 DataFrames from parquet files:
val dfT = spark.read.format("parquet").load(filePath1) // 3 TB in 3185 partitions
val dfO = spark.read.format("parquet").load(filePath2) // ~ 700 MB
val dfF = spark.read.format("parquet").load(filePath3) // ~ 800 MB
val dfP = spark.read.format("parquet").load(filePath4) // 38 GB
Preprocessing on each DataFrame consists of column selection, dropDuplicates, and possibly a filter, like this:
val dfT1 = dfT.filter(...)
val dfO1 = dfO.select(columnsToSelect2).dropDuplicates(Array("someColumn2"))
val dfF1 = dfF.select(columnsToSelect3).dropDuplicates(Array("someColumn3"))
val dfP1 = dfP.select(columnsToSelect4).dropDuplicates(Array("someColumn4"))
Then I left-broadcast-join together first three DataFrames:
val dfTO = dfT1.join(broadcast(dfO1), Seq("someColumn5"), "left_outer")
val dfTOF = dfTO.join(broadcast(dfF1), Seq("someColumn6"), "left_outer")
Since dfP1 is large I need to do a merge join, but I can't afford to do it yet; I need to limit the size of dfTOF first. To do that, I add a new timestamp column via withColumn with a UDF that transforms a string into a timestamp:
val dfTOF1 = dfTOF.withColumn("TransactionTimestamp", myStringToTimestampUDF)
Next I filter on a new timestamp column:
val dfTrain = dfTOF1.filter(dfTOF1("TransactionTimestamp").between("2016-01-01 00:00:00+000", "2016-05-30 00:00:00+000"))
Now I am joining the last DataFrame:
val dfTrain2 = dfTrain.join(dfP1, Seq("someColumn7"), "left_outer")
And lastly, the column selection with the cache() that is puzzling me:
val dfTrain3 = dfTrain.select("columnsToSelect5").cache()
dfTrain3.agg(sum(col("someColumn7"))).show()
It looks like the cache() is useless here, but there will be further processing and modelling of the DataFrame, for which the cache() will be necessary.
Should I give more details? Would you like to see the execution plan for dfTrain3?
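One way to dig into this (a sketch reusing the DataFrames and column names defined above): compare the physical plans of the aggregation with and without the cache(); the InMemoryRelation introduced by caching can stop column pruning from being pushed past the cached boundary, so earlier stages may read and shuffle more data.
// Sketch: inspect both plans; with cache() look for InMemoryTableScan and
// compare how many columns the upstream scans and shuffles carry.
val cached = dfTrain.select("columnsToSelect5").cache()
val uncached = dfTrain.select("columnsToSelect5")
cached.agg(sum(col("someColumn7"))).explain(true)
uncached.agg(sum(col("someColumn7"))).explain(true)
cached.unpersist()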

PySpark query on Hive is extremely slow even though the final result is quite small

I am using Spark 2.0.0 to query a Hive table.
My SQL is:
select * from app.abtestmsg_v limit 10
Yes, I want to get the first 10 records from a view app.abtestmsg_v.
When I run this SQL in spark-shell, it is very fast, taking about 2 seconds.
But the problem comes when I try to implement this query in my Python code.
I am using Spark 2.0.0 and wrote a very simple PySpark program. Below is my PySpark code:
from pyspark.sql import HiveContext
from pyspark.sql.functions import *
import json
hc = HiveContext(sc)
hc.setConf("hive.exec.orc.split.strategy", "ETL")
hc.setConf("hive.security.authorization.enabled",false)
zj_sql = 'select * from app.abtestmsg_v limit 10'
zj_df = hc.sql(zj_sql)
zj_df.collect()
Below is my Scala code:
val hive = new org.apache.spark.sql.hive.HiveContext(sc)
hive.setConf("hive.exec.orc.split.strategy", "ETL")
val df = hive.sql("select * from silver_ep.zj_v limit 10")
df.rdd.collect()
From the info log, I find that although I use "limit 10" to tell Spark I just want the first 10 records, Spark still scans and reads all the files of the view (in my case, the source data of this view contains 100 files, each about 1 GB in size). So there are nearly 100 tasks, each task reads one file, and all the tasks are executed serially. It takes nearly 15 minutes to finish these 100 tasks, yet all I want is the first 10 records.
So I don't know what to do or what is wrong.
Could anybody give me some suggestions?
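One thing worth trying (a sketch, not a guaranteed fix): keep the limit on the DataFrame side and fetch the rows with take() instead of going through rdd.collect(); the collect-limit path scans partitions incrementally rather than materializing the whole view first.
// Sketch: take(10) fetches rows incrementally, starting from a single
// partition, instead of converting to an RDD and collecting everything.
val hive = new org.apache.spark.sql.hive.HiveContext(sc)
hive.setConf("hive.exec.orc.split.strategy", "ETL")
val firstTen = hive.sql("select * from app.abtestmsg_v").take(10)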

How to effectively join large tables in SparkSql?

I am trying to improve performance on a join involving two large tables using SparkSql. From various sources, I figured that the RDDs need to be partitioned.
Source: https://deepsense.io/optimize-spark-with-distribute-by-and-cluster-by
However, when loading data directly from a Parquet file as shown below, I am not sure how it can be created as a pair RDD!
With Spark 2.0.1, using "cluster by" has no effect.
val rawDf1 = spark.read.parquet("file in hdfs")
rawDf1.createOrReplaceTempView("rawdf1")
val rawDf2 = spark.read.parquet("file in hdfs")
rawDf2.createOrReplaceTempView("rawdf2")
val rawDf3 = spark.read.parquet("file in hdfs")
rawDf3.createOrReplaceTempView("rawdf3")
val df1 = spark.sql("select * from rawdf1 cluster by key")
df1.createOrReplaceTempView("df1")
val df2 = spark.sql("select * from rawdf2 cluster by key")
df2.createOrReplaceTempView("df2")
val df3 = spark.sql("select * from rawdf3 cluster by key")
df3.createOrReplaceTempView("df3")
val resultDf = spark.sql("select * from df1 a inner join df2 b on a.key = b.key inner join df3 c on a.key = c.key")
Whether I use "cluster by" on the key or not, I still see the same query plan being generated by Spark. How can I create a pair RDD in Spark SQL so that the joins can work on pre-partitioned tables?
Without proper partitioning, a lot of shuffling happens, resulting in long delays.
Our configuration (5 worker nodes, 1 executor per node with 5 cores per executor; each node has 32 cores and 128 GB of RAM):
spark.cores.max 25
spark.default.parallelism 75
spark.driver.extraJavaOptions -XX:+UseG1GC
spark.executor.memory 60G
spark.rdd.compress True
spark.driver.maxResultSize 4g
spark.driver.memory 8g
spark.executor.cores 5
spark.executor.extraJavaOptions -Djdk.nio.maxCachedBufferSize=262144
spark.memory.storageFraction 0.2
To add more info: I am joining more than one table in the same select, using the same key across all tables, so it is not possible to create a single DataFrame first and call repartition on it. I understand I can do this using the DataFrame API, but my question is how to accomplish it using plain Spark SQL.
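Since "cluster by" on a temp view does not persist any partitioning for the planner to reuse here, one workaround (a sketch; the bucket count and the *_bucketed table names are illustrative) is to write each input once as a bucketed table and keep the join itself in plain SQL, so Spark can pick up the bucketing on the join key instead of shuffling every time:
// Sketch: persist each input once as a bucketed table (writer API), then
// run the same plain-SQL join against the bucketed tables.
spark.read.parquet("file in hdfs")
  .write.bucketBy(200, "key").sortBy("key")
  .saveAsTable("rawdf1_bucketed")
// ...repeat for rawdf2_bucketed and rawdf3_bucketed...
val resultDf = spark.sql(
  "select * from rawdf1_bucketed a " +
  "inner join rawdf2_bucketed b on a.key = b.key " +
  "inner join rawdf3_bucketed c on a.key = c.key")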
