I am writing a DataFrame to a Delta table using the following code:
(df
  .write
  .format("delta")
  .mode("overwrite")
  .partitionBy("date")
  .saveAsTable("table"))
I have 32 distinct dates in the format yyyy-mm, and I am expecting to have 32 partitions, but if I run print(df.rdd.getNumPartitions()), I get only 15. What am I missing?
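Note that partitionBy controls the on-disk layout of the table (one directory per distinct date value), while df.rdd.getNumPartitions() reports the number of in-memory partitions of the DataFrame before the write, which depends on how the DataFrame was built and shuffled, not on partitionBy. A minimal sketch to check both numbers, reusing the names from the question:
# in-memory partitions of the DataFrame (this is what getNumPartitions() reports)
print(df.rdd.getNumPartitions())

# distinct dates = number of on-disk partition directories written by partitionBy("date")
print(df.select("date").distinct().count())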
Can anyone please explain how option "C" is the answer to Question 31 of the PracticeExam-DataEngineerAssociate?
https://files.training.databricks.com/assessments/practice-exams/PracticeExam-DataEngineerAssociate.pdf?_ga=2.185796329.1103386439.1663221490-957565140.1661854848
Question 31
Which of the following Structured Streaming queries is performing a hop from a Bronze table
to a Silver table?
A. (spark.table("sales")
.groupBy("store")
.agg(sum("sales"))
.writeStream
.option("checkpointLocation", checkpointPath)
.outputMode("complete")
.table("aggregatedSales")
)
B. (spark.table("sales")
.agg(sum("sales"),
sum("units"))
.writeStream
.option("checkpointLocation", checkpointPath)
.outputMode("complete")
.table("aggregatedSales")
)
C. (spark.table("sales")
.withColumn("avgPrice", col("sales") / col("units"))
.writeStream
.option("checkpointLocation", checkpointPath)
.outputMode("append")
.table("cleanedSales")
)
D. (spark.readStream.load(rawSalesLocation)
.writeStream
.option("checkpointLocation", checkpointPath)
.outputMode("append")
.table("uncleanedSales")
)
E. (spark.read.load(rawSalesLocation)
.writeStream
.option("checkpointLocation", checkpointPath)
.outputMode("append")
.table("uncleanedSales")
)
Since option "C" contains the average function it can't be the correct option as aggregation is from the Silver to the Gold table as per my understanding.
Options A and B are aggregations (notice the use of the .agg function).
As you state, Gold tables are generally aggregations.
Option C is actually not an aggregation:
.withColumn("avgPrice", col("sales") / col("units")) creates a new column with the average price per unit for that row; no rows are combined.
Since option C adds/refines the data and does not reduce it, it can be considered a Bronze-to-Silver transformation.
EDIT:
Option D loads raw data into a table but performs no refinement, so its output could be considered a raw (Bronze) table.
Does it make sense to drop columns that are not required before joining Spark DataFrames?
For example:
DF1 has 10 columns, DF2 has 15 columns, DF3 has 25 columns.
I want to join them, select the 10 columns I need, and save the result as .parquet.
Does it make sense to select only the needed columns from each DataFrame before the join, or will the Spark engine optimize the join by itself and avoid carrying all 50 columns through the join operation?
Yes, it makes perfect sense, because it reduces the amount of data shuffled between executors. It's also better to select only the necessary columns as early as possible: in most cases, if the file format allows it (Parquet, Delta Lake), Spark will read data only for the necessary columns, not for all of them. For example:
# select only the columns you need as early as possible;
# with columnar formats (Parquet, Delta) Spark reads just these columns from disk
df1 = spark.read.parquet("file1") \
    .select("col1", "col2", "col3")
df2 = spark.read.parquet("file2") \
    .select("col1", "col5", "col6")

# only the selected columns are shuffled for the join
joined = df1.join(df2, "col1")
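To confirm the pruning, you can inspect the physical plan: the ReadSchema of each Parquet scan should list only the selected columns (same hypothetical file names as above):
# the FileScan nodes should show a ReadSchema containing only
# col1, col2, col3 and col1, col5, col6 respectively
joined.explain()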
I have a use case where we need to stream an open-source Delta table into multiple queries, each filtered on one of the partition columns.
E.g., given a Delta table partitioned on the year column:
Streaming query 1
spark.readStream.format("delta").load("/tmp/delta-table/")
  .where("year = 2013")
Streaming query 2
spark.readStream.format("delta").load("/tmp/delta-table/")
  .where("year = 2014")
The physical plan shows the filter sitting on top of the streaming relation:
== Physical Plan ==
Filter (isnotnull(year#431) AND (year#431 = 2013))
+- StreamingRelation delta, []
My question is: does predicate pushdown work with streaming queries on Delta?
Can we stream only a specific partition of the Delta table?
If the table is partitioned on that column, only the required partitions will be scanned.
Let's create both a partitioned and a non-partitioned Delta table and run a structured streaming query against each.
Partitioned Delta table streaming:
val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._

// sample dataframe
val df = Seq((1,2020),(2,2021),(3,2020),(4,2020),(5,2020),
  (6,2020),(7,2019),(8,2019),(9,2018),(10,2020)).toDF("id","year")

// partitionBy the year column and save as a delta table
df.write.format("delta").partitionBy("year").save("delta-stream")

// stream the delta table
spark.readStream.format("delta").load("delta-stream")
  .where('year === 2020)
  .writeStream.format("console").start().awaitTermination()
The physical plan of the above streaming query shows the year predicate under partitionFilters.
Non-partitioned Delta table streaming:
df.write.format("delta").save("delta-stream")

spark.readStream.format("delta").load("delta-stream")
  .where('year === 2020)
  .writeStream.format("console").start().awaitTermination()
The physical plan of the above streaming query shows the year predicate only under pushedFilters.
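You can also print the plan yourself from the streaming query handle instead of reading it in the Spark UI. A minimal PySpark sketch of the partitioned case above (the checkpoint path is a made-up placeholder):
# stream only the 2020 partition of the Delta table written above
query = (spark.readStream.format("delta").load("delta-stream")
    .where("year = 2020")
    .writeStream.format("console")
    .option("checkpointLocation", "/tmp/ckpt-year2020")  # placeholder path
    .start())

query.explain()  # prints the physical plan of the most recent micro-batch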
I am trying a simple use case of inserting into a Hive partitioned table on S3. I am running my code in a Zeppelin notebook on EMR; below is my code, along with a screenshot of the command output. I checked the schemas of the Hive table and the DataFrame and there is no case difference in the column names, yet I am getting the exception mentioned below.
import org.apache.spark.sql.hive.HiveContext
import sqlContext.implicits._

System.setProperty("hive.metastore.uris", "thrift://datalake-hive-server2.com:9083")

val hiveContext = new HiveContext(sc)
hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

spark.sql("""CREATE EXTERNAL TABLE employee_table (Emp_Id STRING, First_Name STRING, Salary STRING)
  PARTITIONED BY (Month STRING)
  LOCATION 's3n://dev-emr-jupyter/anup/'
  TBLPROPERTIES ("skip.header.line.count"="1") """)

val csv_df = spark.read
  .format("csv")
  .option("header", "true")
  .load("s3n://dev-emr-jupyter/anup/test_data.csv")

import org.apache.spark.sql.SaveMode
csv_df.registerTempTable("csv")

spark.sql("""INSERT OVERWRITE TABLE employee_table PARTITION(Month) select Emp_Id, First_Name, Salary, Month from csv""")
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.Table.ValidationFailureSemanticException: Partition spec {month=, Month=May} contains non-partition columns;
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
You need to run a command before your insert statement in order to be able to populate the partitions dynamically at runtime. By default, the dynamic partition mode is set to strict.
spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")
Try adding that line and running again.
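A sketch of the full suggested sequence (written in PySpark syntax; the same spark.sql calls work from Scala), reusing the table and temp view names from the question:
# enable dynamic partitioning before the insert
spark.sql("set hive.exec.dynamic.partition=true")
spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")

# keep the partition column (Month) last in the select list
spark.sql("""INSERT OVERWRITE TABLE employee_table PARTITION(Month)
    select Emp_Id, First_Name, Salary, Month from csv""")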
Edit 1:
I saw in your attached image that when you do csv_df.show(), the salary column ends up last instead of the month column. Try to reference your columns explicitly in the insert statement, like: insert into table_name partition(month) (column1, column2, ...).
Florin
I have an Apache Spark (v2.4.2) DataFrame that I want to insert into a Hive table.
df = spark.sparkContext.parallelize([["c1",21, 3], ["c1",32,4], ["c2",4,40089], ["c2",439,6889]]).toDF(["c", "n", "v"])
df.createOrReplaceTempView("df")
And I created a Hive table:
spark.sql("""create table if not exists sample_bucket(n INT, v INT)
    partitioned by (c STRING) CLUSTERED BY(n) INTO 3 BUCKETS""")
And then I tried to insert data from the DataFrame df into the sample_bucket table:
spark.sql("INSERT OVERWRITE table SAMPLE_BUCKET PARTITION(c) select n, v, c from df")
Which gives me an error, saying:
Output Hive table `default`.`sample_bucket` is bucketed but Spark currently
does NOT populate bucketed output which is compatible with Hive.;
I tried a couple of ways which didn't work; one of them is:
spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("set hive.enforce.bucketing=true")
spark.sql("INSERT OVERWRITE table SAMPLE_BUCKET PARTITION(c) select n, v, c from df cluster by n")
But no luck. Can anyone help me?
Spark (the latest version at the time of writing, 2.4.5) does not fully support Hive bucketed tables.
You can read bucketed tables (without getting any benefit from the bucketing) and even insert into them (in which case the buckets will be ignored and further Hive reads can behave unpredictably).
See https://issues.apache.org/jira/browse/SPARK-19256
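If Hive compatibility of the buckets is not required, one possible workaround (a sketch, not part of the original answer; the table name sample_bucket_spark is made up) is to let Spark create its own bucketed table via the DataFrameWriter API instead of inserting into the Hive-defined table:
# Spark-managed bucketed table: Spark writes its own bucketing metadata,
# so Hive will not recognise these buckets.
(df.write
    .format("parquet")
    .mode("overwrite")
    .partitionBy("c")
    .bucketBy(3, "n")
    .sortBy("n")
    .saveAsTable("sample_bucket_spark"))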