Getting latest date in a partition by year / month / day using SparkSQL - apache-spark

I am trying to incrementally transform new partitions in a source table into a new table using Spark SQL. The data in both the source and target are partitioned as follows: /data/year=YYYY/month=MM/day=DD/. I was initially just going to select the MAX of year, month and day to get the newest partition, but that is clearly wrong. Is there a good way to do this?
If I construct a date and take the max like MAX( CONCAT(year,'-',month,'-',day)::date ), this would be quite inefficient, right? Because it would need to scan all the data just to pull the newest partition.

Try the following to get the latest partition without reading any data at all, only metadata:
spark.sql("show partitions <table>").agg(max('partition)).show

Using the result of show partitions is more efficient, since it only hits the metastore. However, you can't just apply a max to the partition string: the month and day values are not zero-padded, so string ordering would rank month=9 above month=11. You need to construct a date first and then take the max.
Here's a sample:
from pyspark.sql import functions as F
# List the partitions from the metastore only; no data files are read.
df = sqlContext.sql("show partitions <table>")
df.show(10, False)
# Strip the year=/month=/day= labels, turn the slashes into dashes,
# and cast to a date so the max is taken chronologically.
date = F.to_date(F.regexp_replace(F.regexp_replace("partition", "[a-z=]", ""), "/", "-"))
df.select(F.max(date).alias("max_date")).show()
Input Values:
+------------------------+
|partition |
+------------------------+
|year=2019/month=11/day=5|
|year=2019/month=9/day=5 |
+------------------------+
Result:
+----------+
| max_date|
+----------+
|2019-11-05|
+----------+
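If it helps, here is a minimal PySpark sketch of how the resulting max_date might then be used to read only the newest partition, assuming the same year/month/day layout (the table name is a placeholder):
from pyspark.sql import functions as F
# Latest partition date, computed from metadata as above.
max_date = (sqlContext.sql("show partitions <table>")
            .select(F.to_date(F.regexp_replace(
                F.regexp_replace("partition", "[a-z=]", ""), "/", "-")).alias("d"))
            .agg(F.max("d").alias("max_date"))
            .collect()[0]["max_date"])
# Filtering on the partition columns prunes everything except the newest directory.
latest = (sqlContext.table("<table>")
          .filter((F.col("year") == max_date.year)
                  & (F.col("month") == max_date.month)
                  & (F.col("day") == max_date.day)))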

Related

How to specify nested partitions in merge query while trying to merge incremental data with a base table?

I am trying to merge a dataframe that contains incremental data into my base table as per the Databricks documentation.
base_delta.alias('base') \
.merge(source=kafka_df.alias('inc'),
condition='base.key1=inc.key1 and base.key2=inc.key2') \
.whenMatchedUpdateAll() \
.whenNotMatchedInsertAll() \
.execute()
The above operation works fine, but it takes a lot of time, as expected, since a lot of unwanted partitions are being scanned.
I came across the Databricks documentation here, which shows a merge query with partitions specified in it.
Code from that link:
spark.sql(s"""
|MERGE INTO $targetTableName
|USING $updatesTableName
|ON $targetTableName.par IN (1,0) AND $targetTableName.id = $updatesTableName.id
|WHEN MATCHED THEN
| UPDATE SET $targetTableName.ts = $updatesTableName.ts
|WHEN NOT MATCHED THEN
| INSERT (id, par, ts) VALUES ($updatesTableName.id, $updatesTableName.par, $updatesTableName.ts)
""".stripMargin)
In that example the partitions are specified in the IN condition as 1, 2, 3, ... But in my case, the table is first partitioned on COUNTRY (values USA, UK, NL, FR, IND) and then every country is partitioned on YYYY-MM, e.g. 2020-01, 2020-02, 2020-03.
How can I specify the partition values if I have a nested structure like the one mentioned above?
Any help is massively appreciated.
Yes, you can do that, and it's really recommended, because otherwise Delta Lake needs to scan all the data matching the ON condition. If you're using the Python API, you just need to use the correct SQL expression as the condition, and you can put restrictions on the partition columns into it, something like this in your case (date here is the date column derived from the update data):
base.country = 'country1' and base.date = inc.date and
base.key1=inc.key1 and base.key2=inc.key2
If you have multiple countries, then you can use IN ('country1', 'country2'), but it would be easier to have the country inside your update dataframe and match using base.country = inc.country.
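For illustration, a minimal PySpark sketch of the merge with the partition restrictions folded into the condition; base_delta and kafka_df come from the question, while the country values and the month column name are assumptions:
# Restricting the partition columns (country, month) in the condition lets Delta
# prune partitions instead of scanning the whole base table during the merge.
merge_condition = """
    base.country IN ('USA', 'UK') AND base.month = inc.month
    AND base.key1 = inc.key1 AND base.key2 = inc.key2
"""
(base_delta.alias("base")
 .merge(source=kafka_df.alias("inc"), condition=merge_condition)
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())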

Cross Join for calculation in Spark SQL

I have a temporary view with only 1 record/value and I want to use that value to calculate the age of the customers present in another big table (with 100M rows). I used a CROSS JOIN clause, which is resulting in a performance issue.
Is there a better approach to implement this requirement that will perform better? Would a broadcast hint be suitable in this scenario? What is the recommended approach to tackle such scenarios?
Reference table: (contains only 1 value)
create temporary view ref
as
select to_date(refdt, 'dd-MM-yyyy') as refdt --returns only 1 value
from tableA
where logtype = 'A';
Cust table (10 M rows):
custid | birthdt
A1234 | 20-03-1980
B3456 | 09-05-1985
C2356 | 15-12-1990
Query (calculate age w.r.t birthdt):
select
a.custid,
a.birthdt,
cast((datediff(b.refdt, a.birthdt)/365.25) as int) as age
from cust a
cross join ref b;
My question is: is there a better approach to implement this requirement?
Thanks
Simply use withColumn!
df.withColumn("new_col", to_date(lit("10-05-2020"), "dd-MM-yyyy"))
Inside the view you are using a constant value, so you can simply put the same value in the query below without a cross join.
select
a.custid,
a.birthdt,
cast((datediff(to_date('10-05-2020', 'dd-MM-yyyy'), a.birthdt)/365.25) as int) as age
from cust a;
scala> spark.sql("select * from cust").show(false)
+------+----------+
|custid|birthdt |
+------+----------+
|A1234 |1980-03-20|
|B3456 |1985-05-09|
|C2356 |1990-12-15|
+------+----------+
scala> spark.sql("select a.custid, a.birthdt, cast((datediff(to_date('10-05-2020', 'dd-MM-yyyy'), a.birthdt)/365.25) as int) as age from cust a").show(false)
+------+----------+---+
|custid|birthdt |age|
+------+----------+---+
|A1234 |1980-03-20|40 |
|B3456 |1985-05-09|35 |
|C2356 |1990-12-15|29 |
+------+----------+---+
Hard to work out exactly your point, but if you cannot use Scala or PySpark and DataFrames with .cache etc., then instead of using a temporary view, just create a single-row table. My impression is you are using Spark %sql in a notebook on, say, Databricks.
This is my suspicion as it were.
That said, a broadcast join hint may well mean the optimizer only sends out 1 row. See https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-hint-framework.html#specifying-query-hints
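If you do keep the view and the cross join, a minimal PySpark sketch with an explicit broadcast hint might look like this (view, table and column names are taken from the question; the rest is an assumption):
from pyspark.sql import functions as F
ref = spark.table("ref")      # the single-row temporary view
cust = spark.table("cust")
# Broadcasting the one-row reference avoids shuffling the large cust table.
result = (cust.crossJoin(F.broadcast(ref))
          .select("custid", "birthdt",
                  (F.datediff(F.col("refdt"), F.col("birthdt")) / 365.25)
                  .cast("int").alias("age")))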

Spark SQL alternatives to groupby/pivot/agg/collect_list using foldLeft & withColumn so as to improve performance

I have a Spark DataFrame consisting of three columns:
id | col1 | col2
-----------------
x | p1 | a1
-----------------
x | p2 | b1
-----------------
y | p2 | b2
-----------------
y | p2 | b3
-----------------
y | p3 | c1
After applying df.groupBy("id").pivot("col1").agg(collect_list("col2")) I am getting the following dataframe (aggDF):
+---+----+--------+----+
| id| p1| p2| p3|
+---+----+--------+----+
| x|[a1]| [b1]| []|
| y| []|[b2, b3]|[c1]|
+---+----+--------+----+
Then I find the names of the columns other than the id column.
val cols = aggDF.columns.filter(x => x != "id")
After that I am using cols.foldLeft(aggDF)((df, x) => df.withColumn(x, when(size(col(x)) > 0, col(x)).otherwise(lit(null)))) to replace empty arrays with null. The performance of this code becomes poor when the number of columns increases. Additionally, I have the names of the string columns, val stringColumns = Array("p1","p3"). I want to get the following final dataframe:
+---+----+--------+----+
| id| p1| p2| p3|
+---+----+--------+----+
| x| a1 | [b1]|null|
| y|null|[b2, b3]| c1 |
+---+----+--------+----+
Is there any better solution to this problem in order to achieve the final dataframe?
Your current code pays 2 performance costs as structured:
As mentioned by Alexandros, you pay one Catalyst analysis per DataFrame transform, so if you loop over a few hundred or thousand columns, you'll notice some time spent on the driver before the job is actually submitted. If this is a critical issue for you, you can use a single select statement instead of your foldLeft on withColumn (see the PySpark sketch below), but this won't really change the execution time much because of the next point.
When you use an expression such as when().otherwise() on columns in what can be optimized as a single select statement, the code generator will produce a single large method processing all the columns. If you have more than a couple hundred columns, it's likely that the resulting method won't be JIT-compiled by default by the JVM, resulting in very slow execution performance (the maximum JIT-able method size is 8k of bytecode in HotSpot).
You can detect whether you hit the second issue by inspecting the executor logs and checking whether you see a WARNING about a method that is too large to be JIT-compiled.
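For illustration, a hedged PySpark sketch of that single-select rewrite (a Scala varargs version of the same idea appears further down; names are taken from the question):
from pyspark.sql import functions as F
# One select over all columns means one Catalyst analysis pass,
# instead of one pass per withColumn in the foldLeft loop.
value_cols = [c for c in aggDF.columns if c != "id"]
exprs = [F.when(F.size(F.col(c)) > 0, F.col(c)).otherwise(F.lit(None)).alias(c)
         for c in value_cols]
aggDF2 = aggDF.select(F.col("id"), *exprs)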
How to try and solve this?
1 - Changing the logic
You can filter the empty cells before the pivot by using a window transform
import org.apache.spark.sql.expressions.Window
val finalDf = df
.withColumn("count", count('col2) over Window.partitionBy('id,'col1))
.filter('count > 0)
.groupBy("id").pivot("col1").agg(collect_list("col2"))
This may or may not be faster depending on the actual dataset, as the pivot also generates a large select statement expression by itself, so it may hit the large-method threshold if you encounter more than approximately 500 distinct values for col1.
You may want to combine this with option 2 as well.
2 - Try and finesse the JVM
You can add an extraJavaOption on your executors to ask the JVM to try and JIT hot methods larger than 8k.
For example, add the option
--conf "spark.executor.extraJavaOptions=-XX:-DontCompileHugeMethods"
on your spark-submit and see how it impacts the pivot execution time.
It's difficult to guarantee a substantial speed increase without more details on your real dataset but it's definitely worth a shot.
If you look at https://medium.com/@manuzhang/the-hidden-cost-of-spark-withcolumn-8ffea517c015 you will see that withColumn with a foldLeft has known performance issues. A select is an alternative, as shown below, using varargs.
I'm not convinced collect_list is an issue. I kept the first set of logic as well. pivot kicks off a job to get the distinct values for pivoting. It is an accepted approach imo. Trying to roll your own seems pointless to me, but the other answers may prove me wrong, or Spark 2.4 may have improved things.
import spark.implicits._
import org.apache.spark.sql.functions._
// Your code & assuming id is the only col of interest as in THIS question. More elegant than 1st posting.
val df = Seq( ("x","p1","a1"), ("x","p2","b1"), ("y","p2","b2"), ("y","p2","b3"), ("y","p3","c1")).toDF("id", "col1", "col2")
val aggDF = df.groupBy("id").pivot("col1").agg(collect_list("col2"))
//aggDF.show(false)
val colsToSelect = aggDF.columns // All in this case, 1st col id handled by head & tail
val aggDF2 = aggDF.select((col(colsToSelect.head) +: colsToSelect.tail.map
(col => when(size(aggDF(col)) === 0,lit(null)).otherwise(aggDF(col)).as(s"$col"))):_*)
aggDF2.show(false)
returns:
+---+----+--------+----+
|id |p1 |p2 |p3 |
+---+----+--------+----+
|x |[a1]|[b1] |null|
|y |null|[b2, b3]|[c1]|
+---+----+--------+----+
Also a nice read BTW: https://lansalo.com/2018/05/13/spark-how-to-add-multiple-columns-in-dataframes-and-how-not-to/. The effects become more noticeable with a higher number of columns. At the end a reader makes a relevant point.
I think that performance is better with the select approach when a higher number of columns prevails.
UPD: Over the holidays I trialed both approaches with Spark 2.4.x with little observable difference up to 1000 columns. That has puzzled me.

(SPARK) What is the best way to partition data on which multiple filters are applied?

I am working in Spark (on Azure Databricks) with a 15 billion row file that looks like this:
+---------+---------------+----------------+-------------+--------+------+
|client_id|transaction_key|transaction_date| product_id|store_id|spend|
+---------+---------------+----------------+-------------+--------+------+
| 1| 7587_20121224| 2012-12-24| 38081275| 787| 4.54|
| 1| 10153_20121224| 2012-12-24| 4011| 1053| 2.97|
| 2| 6823_20121224| 2012-12-24| 561122924| 683| 2.94|
| 3| 11131_20121224| 2012-12-24| 80026282| 1131| 0.4|
| 3| 7587_20121224| 2012-12-24| 92532| 787| 5.49|
This data is used for all my queries, which consist mostly of groupby (product_id for example), sum and count distinct:
results = (trx.filter((col("transaction_date") > "2018-01-01")
                      & col("product_id").isin(["38081275", "4011"]))
              .groupby("product_id")
              .agg(sum("spend").alias("total_spend"),
                   countDistinct("transaction_key").alias("number_trx")))
I never need to use 100% of this data; I always start with a filter on:
transaction_date (1 000 distinct values)
product_id (1 000 000 distinct values)
store_id (1 000 distinct values)
==> What is the best way to partition this data in a parquet file?
I initially partitioned the data on transaction_date:
trx.write.format("parquet").mode("overwrite").partitionBy("transaction_date").save("dbfs:/linkToParquetFile")
This will create partitions that are approximately the same size.
However, most of the queries will require keeping at least 60% of the transaction_date values, whereas only a few product_id are usually selected in one query.
(70% of the store_id values are usually kept)
==> Is there a way to build a parquet file taking this into account?
It seems partitioning the data on product_id would create way too many partitions...
Thanks!
For example, you can use multiple columns for partitioning (it creates subfolders) and Spark can use partition filters.
Another good idea is bucketing, more information here (to avoid an extra shuffle).
Example with Hive:
trx.write.partitionBy("transaction_date", "store_id").bucketBy(1000, "product_id").saveAsTable("tableName")
To read it, use:
spark.table("tableName")
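For illustration, a sketch of how a typical query could then benefit from this layout, reusing the filters from the question (the filter values are just illustrative):
from pyspark.sql.functions import col, sum, countDistinct
trx = spark.table("tableName")
# Filters on transaction_date and store_id prune whole partition directories;
# bucketing on product_id can help the groupBy avoid a full shuffle.
results = (trx.filter((col("transaction_date") > "2018-01-01")
                      & col("store_id").isin([787, 1053]))
              .groupby("product_id")
              .agg(sum("spend").alias("total_spend"),
                   countDistinct("transaction_key").alias("number_trx")))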

Using Dataframe instead of spark sql for data analysis

Below is the sample Spark SQL I wrote to get the count of male and female enrolled in an agency. I used SQL to generate the output.
Is there a way to do a similar thing using only the DataFrame API, not SQL?
val districtWiseGenderCountDF = hiveContext.sql("""
| SELECT District,
| count(CASE WHEN Gender='M' THEN 1 END) as male_count,
| count(CASE WHEN Gender='F' THEN 1 END) as FEMALE_count
| FROM agency_enrollment
| GROUP BY District
| ORDER BY male_count DESC, FEMALE_count DESC
| LIMIT 10""".stripMargin)
Starting with Spark 1.6 you can use pivot + group by to achieve what you'd like
Without sample data (and without my own availability of Spark > 1.5), here's a solution that should work (not tested):
val df = hiveContext.table("agency_enrollment")
df.groupBy("District").pivot("Gender").count
see How to pivot DataFrame? for a generic example
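For comparison, a hedged PySpark sketch that reproduces the original SQL (conditional counts, ordering and limit) purely with the DataFrame API; column names are taken from the question, and a spark session is assumed (use hiveContext.table on Spark 1.6):
from pyspark.sql import functions as F
df = spark.table("agency_enrollment")
districtWiseGenderCountDF = (
    df.groupBy("District")
      .agg(F.count(F.when(F.col("Gender") == "M", 1)).alias("male_count"),
           F.count(F.when(F.col("Gender") == "F", 1)).alias("FEMALE_count"))
      .orderBy(F.col("male_count").desc(), F.col("FEMALE_count").desc())
      .limit(10)
)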
